Hadoop single node cluster setup
Setting up the environment:
In this tutorial you will learn, step by step, how to set up a Hadoop single-node cluster so that you can play around with the framework and learn more about it.
This tutorial uses the following software versions; you can download them by clicking the hyperlinks:
- Ubuntu Linux 12.04.3 LTS
- Hadoop 1.2.1, released August 2013
If you are using PuTTY to access your Linux box remotely, please install OpenSSH by running the command below; this also makes it easier to configure SSH access later in the installation:
sudo apt-get install openssh-server
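To confirm that the SSH server came up after installation, you can check its status (the exact output wording varies by Ubuntu release):
sudo service ssh status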
Prerequisites:
- Installing Java v1.7+.
- Adding dedicated Hadoop system user.
- Configuring SSH access.
- Disabling IPv6.
Before installing any applications or software, please make sure your list of packages from all repositories and PPAs is up to date; if it is not, update it by using this command:
sudo apt-get update
Installing Java v1.7+: Running Hadoop requires Java v1.6 or later, but to be on the safe side, install the latest version, Java v1.7+.
- Download the latest Oracle Java Linux version from the Oracle website by using this command:
wget https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz
If the download fails, try this command instead, which avoids having to pass a username and password:
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.tar.gz"
- Create the /usr/local/java directory (if it does not already exist) and copy the Oracle Java archive into it:
sudo mkdir -p /usr/local/java
sudo cp -r jdk-7u45-linux-x64.tar.gz /usr/local/java
- Change the directory to /usr/local/java by using this command
cd /usr/local/java
- Unpack the compressed Java binaries in the directory /usr/local/java:
sudo tar xvzf jdk-7u45-linux-x64.tar.gz
- Edit the system PATH file /etc/profile and add the following environment variables to your system path:
sudo nano /etc/profile or sudo gedit /etc/profile
- Scroll down to the end of the file using your arrow keys and add the following lines below to the end of your /etc/profile file:
JAVA_HOME=/usr/local/java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
- Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located, so the system knows the new Oracle Java version is available for use:
sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.7.0_45/bin/java" 1
sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.7.0_45/bin/javac" 1
- These commands notify the system that the Oracle Java JDK is available for use. Reload your system-wide PATH /etc/profile by typing the following command:
. /etc/profile
- Test to see if Oracle Java was installed correctly on your system.
java -version
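If Java was installed correctly, the output should look roughly like the following (the exact build numbers depend on the JDK release you downloaded):
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)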
Adding a dedicated Hadoop system user:
We will use a dedicated Hadoop user account for running Hadoop. While this is not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.
- Adding group:
sudo addgroup hadoop
- Creating a user and adding the user to a group:
sudo adduser --ingroup hadoop hduser
It will ask you to provide a new UNIX password and some user information.
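You can verify that the account was created and added to the hadoop group by running:
id hduser
The output should list hadoop among the user's groups.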
Configuring SSH access: SSH key-based authentication is required so that the master node can log in to the slave nodes (and the secondary node) to start and stop them, and to the local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
Before this step, make sure that SSH is up and running on your machine and is configured to allow SSH public key authentication.
Generating an SSH key for the hduser user.
a. Log in as hduser (for example, sudo su - hduser)
b. Run this Key generation command:
ssh-keygen -t rsa -P ""
It will ask you for the file name in which to save the key; just press Enter to accept the default, so that it generates the key under /home/hduser/.ssh/.
Enable SSH access to your local machine with this newly created key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine with the hduser user.
ssh hduser@localhost
This will add localhost permanently to the list of known hosts.
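On the very first connection, SSH will ask you to confirm the host's authenticity; the exchange looks roughly like this (the fingerprint shown is illustrative):
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is d7:87:25:47:...
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.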
Disabling IPv6: We need to disable IPv6 because Ubuntu uses the 0.0.0.0 IP for various Hadoop configurations, which on IPv6-enabled machines may cause Hadoop to bind to IPv6 addresses. You will need to run the following commands using a root account:
sudo gedit /etc/sysctl.conf
Add the following lines to the end of the file and reboot the machine to apply the configuration correctly.
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
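If you prefer not to reboot, you can load the new settings immediately and then verify that IPv6 is disabled; a value of 1 means IPv6 is off:
sudo sysctl -p
cat /proc/sys/net/ipv6/conf/all/disable_ipv6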
Hadoop Installation:
Go to the Apache Downloads page and download Hadoop version 1.2.1 (or any other stable release).
- Run the following command to download Hadoop version 1.2.1:
wget http://mirrors.gigenet.com/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
- Unpack the compressed hadoop file by using this command:
tar -xvzf hadoop-1.2.1.tar.gz
- Rename hadoop-1.2.1 to hadoop by using this command:
mv hadoop-1.2.1 hadoop
- Move the hadoop directory to a location of your choice; I picked /usr/local for my convenience:
sudo mv hadoop /usr/local/
- Make sure to change the owner of all the files to the hduser user and hadoop group by using this command:
sudo chown -R hduser:hadoop /usr/local/hadoop
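You can confirm that the ownership change took effect by listing the directory:
ls -ld /usr/local/hadoop
The entry should now show hduser and hadoop as the owner and group.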
Configuring Hadoop: The following files are required for configuring the single-node Hadoop cluster:
- hadoop-env.sh
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
These files are located in the Hadoop conf directory:
cd /usr/local/hadoop/conf
hadoop-env.sh: The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor and set the JAVA_HOME environment variable to the Oracle JDK 7 directory.
export JAVA_HOME=/usr/local/java/jdk1.7.0_45
core-site.xml: Open a duplicate session and log in as a non-hadoop user (one with sudo privileges). Create a directory named data at the filesystem root:
sudo mkdir /data
- Change the owner of the /data directory to hduser:
sudo chown hduser:hadoop /data
- Switch to the hduser user, change to the /usr/local/hadoop/conf directory, and edit the core-site.xml file:
su - hduser
cd /usr/local/hadoop/conf
vi core-site.xml
- Add the following entries between the <configuration> tags, then save and quit the file.
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The URI's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The URI's authority is used to determine the host, port, etc. for a filesystem.
</description>
</property>
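For reference, the property blocks go inside the <configuration> element, so after editing, core-site.xml should look roughly like this (descriptions omitted for brevity; the same structure applies to mapred-site.xml and hdfs-site.xml below):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
</configuration>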
mapred-site.xml: Edit the mapred-site.xml file, add the following entry between the <configuration> tags, and save and quit the file.
vi mapred-site.xml
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description>
</property>
hdfs-site.xml: Edit the hdfs-site.xml file, add the following entry between the <configuration> tags, and save and quit the file.
vi hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.
</description>
</property>
Update $HOME/.bashrc: Go back to hduser's home directory and edit the .bashrc file.
vi .bashrc
- Add the following lines to the end of the file.
export JAVA_HOME='/usr/local/java/jdk1.7.0_45'
export HADOOP_HOME='/usr/local/hadoop'
export PATH=$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
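Reload the file so that the new variables take effect in your current shell, and optionally confirm that the hadoop command is now on the PATH:
source ~/.bashrc
hadoop version
The second command should report Hadoop 1.2.1.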
Create a directory tmp in the data folder.
mkdir /data/tmp
Formatting and Starting/Stopping the HDFS filesystem via the NameNode: The first step to starting up your Hadoop installation is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster. You need to do this the first time you set up a Hadoop cluster. Do not format a running Hadoop filesystem, as you will lose all the data currently in the cluster (in HDFS). To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the following command:
hadoop namenode -format
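Toward the end of the command's output, you should see a line confirming that the format succeeded, roughly like this (the timestamp is illustrative):
13/10/31 10:00:00 INFO common.Storage: Storage directory /data/tmp/dfs/name has been successfully formatted.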
- Start Hadoop by running the following command:
start-all.sh
- Run the jps command to see all the services up and running (a sample of the expected output appears after these steps):
jps
- Run netstat -plten | grep java to see the list of ports in use.
- Stop Hadoop by running the following command:
stop-all.sh
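As referenced above, a successful start-all.sh leaves five Hadoop daemons running, so the jps output should look roughly like this (the process IDs will differ on your machine):
hduser@ubuntu:~$ jps
4912 NameNode
5068 DataNode
5231 SecondaryNameNode
5324 JobTracker
5489 TaskTracker
5612 Jps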
Hadoop Web Interfaces: Hadoop comes with several web interfaces which are by default available at these locations:
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
With this, we are done setting up a single-node Hadoop cluster. I hope this step-by-step guide helps you set up the same environment on your own machine.