Before starting the Hadoop setup, create a common user account and group on all computers in the cluster. Create the group hadoop and the user cluster:
Open System Preferences and click Accounts.
To create group hadoop:
Click the + symbol and select Group from the New Account drop-down menu.
Enter the group name hadoop and click the Create Group button.
To create user account cluster:
Click the + symbol and select Administrator from the New Account drop-down menu.
Enter the full name and the account name cluster, set a password, and click the Create Account button.
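If you prefer the command line, the same group and user can also be created with the dscl tool that ships with Mac OS X. The sketch below only prints the commands for review: they must be run with sudo on each Mac, and the numeric IDs 600/601 are arbitrary examples, not values from this guide.

```shell
# Group/user creation via dscl, PRINTED for review rather than executed.
# The IDs 600 (gid) and 601 (uid) are placeholders; pick unused values.
cmds='sudo dscl . -create /Groups/hadoop
sudo dscl . -create /Groups/hadoop PrimaryGroupID 600
sudo dscl . -create /Users/cluster
sudo dscl . -create /Users/cluster UniqueID 601
sudo dscl . -create /Users/cluster PrimaryGroupID 600
sudo dscl . -append /Groups/hadoop GroupMembership cluster'
printf '%s\n' "$cmds"
```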
Download the stable Apache Hadoop version from http://www.us.apache.org/dist/hadoop/common/stable/ and unpack it into /usr/local/hadoop-1.0.3
The JAVA_HOME is /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
The HADOOP_HOME is /usr/local/hadoop-1.0.3
Here we provide a practical guide to setting up a Hadoop cluster from scratch on Linux and Mac OS X 10.6 or later. The main configuration files can be downloaded from the following links:
/usr/local/hadoop-1.0.3/conf/hadoop-env.sh: http://distmap.googlecode.com/files/hadoop-env.sh
/usr/local/hadoop-1.0.3/conf/core-site.xml: http://distmap.googlecode.com/files/core-site.xml
/usr/local/hadoop-1.0.3/conf/hdfs-site.xml: http://distmap.googlecode.com/files/hdfs-site.xml
/usr/local/hadoop-1.0.3/conf/mapred-site.xml: http://distmap.googlecode.com/files/mapred-site.xml
/usr/local/hadoop-1.0.3/fair-scheduler.xml: http://distmap.googlecode.com/files/fair-scheduler.xml
If you have, for example, the following nodes:
(remember to replace the hostnames according to your machines, e.g. macserver01-01, macserver01-02)
master:
macserver01-01
slaves:
macserver01-02
macserver01-03
macserver01-04
macserver01-05
# /etc/hosts (for master AND slave)
192.168.0.1 macserver01-01
192.168.0.2 macserver01-02
192.168.0.3 macserver01-03
192.168.0.4 macserver01-04
192.168.0.5 macserver01-05
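If the nodes follow a regular naming and addressing scheme like the example above, the hosts block can be generated rather than typed by hand. This sketch assumes the guide's example subnet 192.168.0.x; substitute your own hostnames and addresses.

```shell
# Print one /etc/hosts line per node, numbering addresses from .1 up.
i=1
for host in macserver01-01 macserver01-02 macserver01-03 macserver01-04 macserver01-05; do
  echo "192.168.0.$i $host"
  i=$((i + 1))
done
```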
edit conf/masters as follows:
macserver01-01
edit conf/slaves as follows:
macserver01-01
macserver01-02
macserver01-03
macserver01-04
macserver01-05
conf/hadoop-env.sh http://distmap.googlecode.com/files/hadoop-env.sh
cd /usr/local/hadoop-1.0.3
Open conf/hadoop-env.sh in a text editor (e.g. gedit on Linux, nano or TextEdit on Mac OS X)
and change:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to:
# The java implementation to use. Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
Save & exit.
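On Mac OS X, the /usr/libexec/java_home utility prints the path of the active JDK; it can be used to double-check the value before editing hadoop-env.sh. The sketch below falls back to the framework path used in this guide on systems where the utility does not exist.

```shell
# Query the active JDK path; fall back to this guide's OS X 10.6 path
# on machines without /usr/libexec/java_home (e.g. Linux).
JAVA_HOME=$(/usr/libexec/java_home 2>/dev/null || echo /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/)
echo "$JAVA_HOME"
```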
cd /usr/local/hadoop-1.0.3
mkdir tmp
mkdir hdfs
mkdir hdfs/name
mkdir hdfs/data
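The four mkdir commands above can be collapsed into one using -p, which creates any missing parent directories. In this sketch a temporary directory stands in for /usr/local/hadoop-1.0.3 so it can run anywhere without root; on a real node, use the actual Hadoop directory.

```shell
# One-liner equivalent of the four mkdir commands; -p creates parents.
# mktemp -d is used here purely so the sketch is self-contained.
HADOOP_DIR=$(mktemp -d)
mkdir -p "$HADOOP_DIR/tmp" "$HADOOP_DIR/hdfs/name" "$HADOOP_DIR/hdfs/data"
ls "$HADOOP_DIR/hdfs"
```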
Under conf/, edit the following files (replace /usr/local/hadoop-1.0.3 with your own Hadoop directory if it differs):
conf/core-site.xml http://distmap.googlecode.com/files/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://macserver01-01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-1.0.3/tmp</value>
</property>
</configuration>
conf/hdfs-site.xml http://distmap.googlecode.com/files/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop-1.0.3/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop-1.0.3/hdfs/data</value>
</property>
</configuration>
conf/mapred-site.xml http://distmap.googlecode.com/files/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>macserver01-01:9001</value>
</property>
</configuration>
ssh localhost
You will need a password to log in over ssh. To enable password-less login, generate a key pair and authorize it:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
exit
Configuration done. Try:
ssh localhost
You should now be able to log in without a password.
The master must be able to log in to every slave without a password. Copy the master's public key to each slave:
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-02 "cat - >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-03 "cat - >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-04 "cat - >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-05 "cat - >> ~/.ssh/authorized_keys"
You will need the corresponding slave's password to run the above commands. Try:
cluster@macserver01-01:~$ ssh macserver01-02
cluster@macserver01-01:~$ ssh macserver01-03
cluster@macserver01-01:~$ ssh macserver01-04
cluster@macserver01-01:~$ ssh macserver01-05
You should now be able to log in without a password.
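Since the four key-copy commands follow one pattern, they can be generated with a loop. This sketch only prints the commands for review, because running them requires the slaves to be reachable; it assumes the key generated earlier with ssh-keygen -t dsa, i.e. ~/.ssh/id_dsa.pub.

```shell
# Build the per-slave key-copy commands and print them for review.
cmds=$(for n in 02 03 04 05; do
  echo "cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-$n 'cat - >> ~/.ssh/authorized_keys'"
done)
printf '%s\n' "$cmds"
```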
Before first use you must format HDFS (the Hadoop Distributed File System). This is done only once; reformatting erases any existing HDFS data.
Run the following command on the master:
/usr/local/hadoop-1.0.3/bin/hadoop namenode -format
Run the following command on the master:
/usr/local/hadoop-1.0.3/bin/start-dfs.sh
Use the following command on every node to check the status of daemons:
jps
Run jps on the master; you should see something like this:
8211 DataNode
7803 NameNode
8620 Jps
8354 SecondaryNameNode
Run jps on the slaves; you should see something like this:
8211 DataNode
8620 Jps
Run the following command on master:
/usr/local/hadoop-1.0.3/bin/start-mapred.sh
Use the following command on every node to check the status of the daemons:
jps
Run jps on the master; you should see something like this:
8211 DataNode
7803 NameNode
8547 TaskTracker
8620 Jps
8422 JobTracker
8354 SecondaryNameNode
Run jps on the slaves; you should see something like this:
8211 DataNode
8547 TaskTracker
8620 Jps
There are web interfaces that report live progress for running Hadoop jobs (replace localhost with the relevant node's hostname when browsing from another machine):
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
(WordCount) Create a directory named "input" in HDFS:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -mkdir input
Copy some text files into input:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -put conf/* input
Run WordCount:
/usr/local/hadoop-1.0.3/bin/hadoop jar /usr/local/hadoop-1.0.3/hadoop-examples-1.0.3.jar wordcount input output
Display output:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -cat output/*
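To see what the WordCount job computes, here is a rough local equivalent using only standard Unix tools (not Hadoop): split the input on whitespace, then count the occurrences of each word.

```shell
# Local word count: one word per line, sorted, counted, most frequent first.
printf 'the quick fox\nthe lazy dog\n' |
  tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn
```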
Close MapReduce daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/stop-mapred.sh
Close HDFS daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/stop-dfs.sh
Multiple pools can be created in a Hadoop cluster to run several jobs in parallel. A fair-scheduler configuration file is available at http://distmap.googlecode.com/files/fair-scheduler.xml
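For illustration, a Hadoop 1.x fair-scheduler allocation file defining two pools might look like the following; the pool names and limits here are invented examples, not taken from the linked file.

```xml
<?xml version="1.0"?>
<!-- Example allocations: "production" gets guaranteed slots and double
     weight; "research" is capped at three concurrent jobs. -->
<allocations>
  <pool name="production">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
  <pool name="research">
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
</allocations>
```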
DistMap
To run DistMap:
10.1 Start HDFS daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/start-dfs.sh
10.2 Start MapReduce daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/start-mapred.sh
10.3 Change HDFS filesystem permissions
Run on master:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /
10.4 Set permissions for the tmp folder
Run on master:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /usr/local/hadoop-1.0.3/tmp/
10.5 Go to the DistMap manual page: http://code.google.com/p/distmap/wiki/Manual