Before starting the Hadoop setup, it is important to create a common user account and group on all computers. Create the group hadoop and the user cluster:
Step 1: Log in as root.
Step 2: Create the group hadoop using the groupadd utility, in this format:
groupadd -g n hadoop
where n is an unused group ID greater than 100.
Step 3: Create the user cluster using the useradd utility, specifying the group (hadoop) and user name (cluster) in this format:
useradd -u n -g hadoop cluster
where n is an unused user ID.
Step 4: Create a password for the user cluster. To do this, use the passwd utility and the following command:
passwd cluster
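For example, on systems where group ID and user ID 1000 are unused (1000 is only an illustration; pick IDs that are free on your machines), the three commands look like this:
groupadd -g 1000 hadoop
useradd -u 1000 -g hadoop -m -s /bin/bash cluster
passwd cluster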
Download the stable Apache Hadoop version from http://www.us.apache.org/dist/hadoop/common/stable/ and extract it to /usr/local/hadoop-1.0.3.
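For example, assuming the stable release is offered as hadoop-1.0.3.tar.gz (check the actual file name in the stable directory), the download and extraction could look like this:
cd /usr/local
wget http://www.us.apache.org/dist/hadoop/common/stable/hadoop-1.0.3.tar.gz
tar -xzf hadoop-1.0.3.tar.gz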
Throughout this guide, JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22 and HADOOP_HOME is /usr/local/hadoop-1.0.3.
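It is convenient to export both paths in the cluster user's shell profile, for example in ~/.bashrc (a minimal sketch, assuming the paths above):
export JAVA_HOME=/usr/lib/jvm/java/jdk1.6.0_22
export HADOOP_HOME=/usr/local/hadoop-1.0.3
export PATH=$PATH:$HADOOP_HOME/bin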
This is a practical guide, with ready-made configuration files, for setting up a Hadoop cluster from scratch on Linux and Mac OS X 10.6 (or higher) computers. The main configuration files can be downloaded from the following links:
/usr/local/hadoop-1.0.3/conf/hadoop-env.sh: http://distmap.googlecode.com/files/hadoop-env.sh
/usr/local/hadoop-1.0.3/conf/core-site.xml: http://distmap.googlecode.com/files/core-site.xml
/usr/local/hadoop-1.0.3/conf/hdfs-site.xml: http://distmap.googlecode.com/files/hdfs-site.xml
/usr/local/hadoop-1.0.3/conf/mapred-site.xml: http://distmap.googlecode.com/files/mapred-site.xml
/usr/local/hadoop-1.0.3/fair-scheduler.xml: http://distmap.googlecode.com/files/fair-scheduler.xml
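Assuming wget is available, the files can be fetched directly into place like this:
cd /usr/local/hadoop-1.0.3/conf
wget http://distmap.googlecode.com/files/hadoop-env.sh
wget http://distmap.googlecode.com/files/core-site.xml
wget http://distmap.googlecode.com/files/hdfs-site.xml
wget http://distmap.googlecode.com/files/mapred-site.xml
cd /usr/local/hadoop-1.0.3
wget http://distmap.googlecode.com/files/fair-scheduler.xml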
Suppose, for example, that you have the following nodes (remember to replace the host names according to your machines, e.g. ubuntu01-01, ubuntu01-02):
master:
ubuntu01-01
slaves:
ubuntu01-02
ubuntu01-03
ubuntu01-04
ubuntu01-05
# /etc/hosts (for master AND slave)
192.168.0.1 ubuntu01-01
192.168.0.2 ubuntu01-02
192.168.0.3 ubuntu01-03
192.168.0.4 ubuntu01-04
192.168.0.5 ubuntu01-05
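To verify that name resolution works, you can ping each host once from every node (a simple check, assuming the host names above):
for host in ubuntu01-01 ubuntu01-02 ubuntu01-03 ubuntu01-04 ubuntu01-05; do
ping -c 1 $host
done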
Edit conf/masters as follows:
ubuntu01-01
Edit conf/slaves as follows:
ubuntu01-01
ubuntu01-02
ubuntu01-03
ubuntu01-04
ubuntu01-05
cd /usr/local/hadoop-1.0.3
gedit conf/hadoop-env.sh
and change:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to:
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java/jdk1.6.0_22
Save & exit.
cd /usr/local/hadoop-1.0.3
mkdir tmp
mkdir hdfs
mkdir hdfs/name
mkdir hdfs/data
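If the directories were created as root, make sure the cluster user owns the whole installation (this guide assumes the user cluster and group hadoop created earlier):
chown -R cluster:hadoop /usr/local/hadoop-1.0.3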
Under conf/, edit the following files (note that /usr/local/hadoop-1.0.3 should be replaced with your own Hadoop installation path if it differs):
conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://ubuntu01-01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-1.0.3/tmp</value>
</property>
</configuration>
conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop-1.0.3/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop-1.0.3/hdfs/data</value>
</property>
</configuration>
conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>ubuntu01-01:9001</value>
</property>
</configuration>
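Every node needs the same configuration. One way to distribute it from the master is with scp (a sketch assuming Hadoop is installed at the same path on all slaves; you will be prompted for passwords until the SSH keys below are installed):
cd /usr/local/hadoop-1.0.3
for host in ubuntu01-02 ubuntu01-03 ubuntu01-04 ubuntu01-05; do
scp conf/hadoop-env.sh conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml conf/masters conf/slaves cluster@$host:/usr/local/hadoop-1.0.3/conf/
done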
ssh localhost
You will be asked for a password. Generate a passphrase-less RSA key pair and authorize it for the current user:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
exit
Configuration done. Try:
ssh localhost
You should now be able to log in without password.
The master must be able to log in to every slave without a password. Copy the master's public key to each slave:
cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-02
cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-03
cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-04
cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-05
You will need the corresponding slave's password to run the above commands. Try:
cluster@ubuntu01-01:~$ ssh ubuntu01-02
cluster@ubuntu01-01:~$ ssh ubuntu01-03
cluster@ubuntu01-01:~$ ssh ubuntu01-04
cluster@ubuntu01-01:~$ ssh ubuntu01-05
You should now be able to log in without password.
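A quick non-interactive check over all slaves (BatchMode makes ssh fail instead of asking for a password, so any remaining problem shows up immediately):
for host in ubuntu01-02 ubuntu01-03 ubuntu01-04 ubuntu01-05; do
ssh -o BatchMode=yes cluster@$host hostname
done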
Before the first start, format the HDFS (Hadoop Distributed File System); note that formatting erases any data already stored in HDFS.
Run the following command on the master:
/usr/local/hadoop-1.0.3/bin/hadoop namenode -format
Run the following command on the master:
/usr/local/hadoop-1.0.3/bin/start-dfs.sh
Use the following command on every node to check the status of daemons:
jps
Run jps on the master; you should see something like this:
8211 DataNode
7803 NameNode
8620 Jps
8354 SecondaryNameNode
Run jps on the slaves; you should see something like this:
8211 DataNode
8620 Jps
Run the following command on the master:
/usr/local/hadoop-1.0.3/bin/start-mapred.sh
Use the following command on every node to check the status of daemons:
jps
Run jps on the master; you should see something like this:
8211 DataNode
7803 NameNode
8547 TaskTracker
8620 Jps
8422 JobTracker
8354 SecondaryNameNode
Run jps on the slaves; you should see something like this:
8211 DataNode
8547 TaskTracker
8620 Jps
Several web interfaces report the live progress of running Hadoop jobs (replace localhost with the corresponding node's host name when browsing from another machine):
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
WordCount example. Create a directory named "input" in HDFS:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -mkdir input
Copy some text files into input:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -put conf/* input
Run WordCount (note that for Hadoop 1.0.3 the examples jar is hadoop-examples-1.0.3.jar):
/usr/local/hadoop-1.0.3/bin/hadoop jar /usr/local/hadoop-1.0.3/hadoop-examples-1.0.3.jar wordcount input output
Display output:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -cat output/*
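To copy the result from HDFS to the local file system instead (the local target path /tmp/wordcount-output is just an example):
/usr/local/hadoop-1.0.3/bin/hadoop dfs -get output /tmp/wordcount-output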
Stop MapReduce daemons
Run on the master:
/usr/local/hadoop-1.0.3/bin/stop-mapred.sh
Stop HDFS daemons
Run on the master:
/usr/local/hadoop-1.0.3/bin/stop-dfs.sh
Multiple pools can be created in a Hadoop cluster to run several jobs in parallel. An example Fair Scheduler XML configuration file is provided at http://distmap.googlecode.com/files/fair-scheduler.xml
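To enable the Fair Scheduler, two properties have to be added to conf/mapred-site.xml on the master and the JobTracker restarted (a sketch using the standard Hadoop 1.x Fair Scheduler settings; the allocation file path assumes the download location above):
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
<name>mapred.fairscheduler.allocation.file</name>
<value>/usr/local/hadoop-1.0.3/fair-scheduler.xml</value>
</property>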
10. DistMap
To run DistMap:
10.1 Start HDFS daemons
Run on the master:
/usr/local/hadoop-1.0.3/bin/start-dfs.sh
10.2 Start MapReduce daemons
Run on the master:
/usr/local/hadoop-1.0.3/bin/start-mapred.sh
10.3 Change HDFS file system permissions
Run on the master:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /
10.4 Set permissions for the tmp folder
Run on the master:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /usr/local/hadoop-1.0.3/tmp/
10.5 Go to the DistMap manual page: http://code.google.com/p/distmap/wiki/Manual