SetupHadoopLinux

  • Create common group and user account
  • Download Apache Hadoop
  • Our configuration files
  • 1. Network
    • Edit /etc/hosts on every node
    • Then add the following lines in /etc/hosts on every node:
  • 2. Configure
    • For master:
    • For every node do the following:
    • 1) Configure JAVA_HOME
    • 2) Create some directories in the Hadoop home directory
    • 3) Configuration setup
    • 4) Configure passphraseless ssh
  • 3. SSH Access
  • 4. First run
  • 5. Start Cluster
    • 1) Start HDFS Daemons
    • 2) Start MapReduce Daemons
  • 6. Hadoop Web Interfaces
  • 7. Run a MapReduce Job (WordCount)
  • 8. Stop Cluster
  • 9. Configure Fair Scheduler
  • 10. Run DistMap

Create common group and user account

Before starting the Hadoop setup, it is important to create a common user account and group on all computers. Create the group hadoop and the user cluster:

Step 1: Log in as root.

Step 2: Create the group hadoop using the groupadd utility, in this format:

groupadd -g n hadoop

where n is an unused group ID greater than 100.

Step 3: Create the user cluster using the useradd utility, specifying the group (hadoop) and user name (cluster) in this format:

useradd -u n -g hadoop cluster

where n is an unused user ID.

Step 4: Create a password for the user cluster using the passwd utility:

passwd cluster
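
For example, assuming group ID 200 and user ID 200 are free on all machines (hypothetical values; any unused IDs greater than 100 will do), the whole sequence looks like this:

# run as root on every node
groupadd -g 200 hadoop                 # create the common group
useradd -u 200 -g hadoop -m cluster    # create the cluster user with a home directory
passwd cluster                         # set the cluster user's password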

Download Apache Hadoop

Download the Apache Hadoop stable version from http://www.us.apache.org/dist/hadoop/common/stable/ and extract it into /usr/local/hadoop-1.0.3.

The JAVA_HOME is /usr/lib/jvm/java/jdk1.6.0_22

The HADOOP_HOME is /usr/local/hadoop-1.0.3
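
A minimal sketch of the download and installation, assuming the stable tarball on that page is named hadoop-1.0.3.tar.gz (check the actual file name on the download page):

cd /tmp
wget http://www.us.apache.org/dist/hadoop/common/stable/hadoop-1.0.3.tar.gz
tar -xzf hadoop-1.0.3.tar.gz -C /usr/local    # creates /usr/local/hadoop-1.0.3
chown -R cluster:hadoop /usr/local/hadoop-1.0.3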

Our configuration files

Here we provide a practical guide and the following configuration files to set up a Hadoop cluster from scratch on Linux and Mac OS X 10.6 or higher. The main configuration files can be downloaded from the following links:

/usr/local/hadoop-1.0.3/conf/hadoop-env.sh: http://distmap.googlecode.com/files/hadoop-env.sh

/usr/local/hadoop-1.0.3/conf/core-site.xml: http://distmap.googlecode.com/files/core-site.xml

/usr/local/hadoop-1.0.3/conf/hdfs-site.xml: http://distmap.googlecode.com/files/hdfs-site.xml

/usr/local/hadoop-1.0.3/conf/mapred-site.xml: http://distmap.googlecode.com/files/mapred-site.xml

/usr/local/hadoop-1.0.3/fair-scheduler.xml: http://distmap.googlecode.com/files/fair-scheduler.xml

1. Network

Edit /etc/hosts on every node

If you have, for example, the following nodes:

(remember to replace the hostnames according to your machines; e.g. ubuntu01-01, ubuntu01-02)

master:

ubuntu01-01

slaves:

ubuntu01-02
ubuntu01-03
ubuntu01-04
ubuntu01-05

Then add the following lines in /etc/hosts on every node:

# /etc/hosts (for master AND slave)

192.168.0.1      ubuntu01-01  
192.168.0.2      ubuntu01-02
192.168.0.3      ubuntu01-03
192.168.0.4      ubuntu01-04
192.168.0.5      ubuntu01-05
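
After saving /etc/hosts, you can quickly verify that every name resolves and is reachable (a sketch; run from any node):

for h in ubuntu01-01 ubuntu01-02 ubuntu01-03 ubuntu01-04 ubuntu01-05; do
    ping -c 1 "$h" > /dev/null && echo "$h ok" || echo "$h FAILED"
done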

2. Configure

For master:

edit conf/masters as follows:

ubuntu01-01

edit conf/slaves as follows:

ubuntu01-01
ubuntu01-02
ubuntu01-03
ubuntu01-04
ubuntu01-05

(The master ubuntu01-01 is listed in conf/slaves as well, so it also runs a DataNode and a TaskTracker; this is why jps on the master later shows these daemons.)

For every node do the following:

1) Configure JAVA_HOME

cd /usr/local/hadoop-1.0.3
gedit conf/hadoop-env.sh

and change:

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun

to:

# The java implementation to use.  Required.
export JAVA_HOME=/usr/lib/jvm/java/jdk1.6.0_22

Save & exit.
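
To verify the setting, check that the JDK at that path actually runs:

/usr/lib/jvm/java/jdk1.6.0_22/bin/java -version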

2) Create some directories in the Hadoop home directory:

cd /usr/local/hadoop-1.0.3
mkdir tmp
mkdir hdfs
mkdir hdfs/name
mkdir hdfs/data

3) Configuration setup

Under conf/, edit the following files. (If your installation directory differs, replace "/usr/local/hadoop-1.0.3" in the values below accordingly.)

conf/core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ubuntu01-01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop-1.0.3/tmp</value>
  </property>
</configuration>

conf/hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/hadoop-1.0.3/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/hadoop-1.0.3/hdfs/data</value>
  </property>
</configuration>

conf/mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ubuntu01-01:9001</value>
  </property>
</configuration>
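
Since every node needs the same configuration, you can push the edited files from the master to all slaves instead of editing each node by hand (a sketch, assuming Hadoop is installed at the same path on every node):

cd /usr/local/hadoop-1.0.3
for h in ubuntu01-02 ubuntu01-03 ubuntu01-04 ubuntu01-05; do
    scp conf/hadoop-env.sh conf/*-site.xml cluster@"$h":/usr/local/hadoop-1.0.3/conf/
done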

4) Configure passphraseless ssh

ssh localhost

You will be asked for a password when logging in. Generate a passphraseless key and authorize it:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
exit

Configuration done. Try:

ssh localhost

You should now be able to log in without a password.

3. SSH Access

The master must be able to log in to all slaves without a password.

cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-02
cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-03
cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-04
cluster@ubuntu01-01:~$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub cluster@ubuntu01-05

You will need the corresponding slave's password to run the above commands. Try:

cluster@ubuntu01-01:~$ ssh ubuntu01-02
cluster@ubuntu01-01:~$ ssh ubuntu01-03
cluster@ubuntu01-01:~$ ssh ubuntu01-04
cluster@ubuntu01-01:~$ ssh ubuntu01-05

You should now be able to log in to every slave without a password.
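
A compact way to test all slaves at once (a sketch):

for h in ubuntu01-02 ubuntu01-03 ubuntu01-04 ubuntu01-05; do
    ssh "$h" hostname    # should print the slave's hostname without asking for a password
done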

4. First run

Before starting the cluster for the first time, format HDFS (the Hadoop Distributed File System). Formatting erases any data already stored in HDFS, so do this only once, when setting up a new cluster.

Run the following command on the master:

/usr/local/hadoop-1.0.3/bin/hadoop namenode -format

5. Start Cluster

1) Start HDFS Daemons

Run the following command on the master:

/usr/local/hadoop-1.0.3/bin/start-dfs.sh

Use the following command on every node to check the status of daemons:

jps

Run jps on the master; you should see something like this:

8211 DataNode
7803 NameNode
8620 Jps
8354 SecondaryNameNode

Run jps on the slaves; you should see something like this:

8211 DataNode
8620 Jps

2) Start MapReduce Daemons

Run the following command on the master:

/usr/local/hadoop-1.0.3/bin/start-mapred.sh

Use the following command on every node to check the status of daemons:

jps

Run jps on the master; you should see something like this:

8211 DataNode
7803 NameNode
8547 TaskTracker
8620 Jps
8422 JobTracker
8354 SecondaryNameNode

Run jps on the slaves; you should see something like this:

8211 DataNode
8547 TaskTracker
8620 Jps

6. Hadoop Web Interfaces

Hadoop provides several web interfaces that show the live progress of running jobs and the status of the cluster:

http://localhost:50030/  web UI for the MapReduce JobTracker

http://localhost:50060/  web UI for the TaskTracker

http://localhost:50070/  web UI for the HDFS NameNode

When browsing from another machine, replace localhost with the hostname of the master (e.g. ubuntu01-01); the TaskTracker UI runs on every node that has a TaskTracker.
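
You can also check from the command line that the interfaces are up (a sketch; run on the master, a response code of 200 means the UI is being served):

curl -s -o /dev/null -w "JobTracker: %{http_code}\n" http://localhost:50030/
curl -s -o /dev/null -w "NameNode:   %{http_code}\n" http://localhost:50070/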

7. Run a MapReduce Job (WordCount)

Create a directory named "input" in HDFS:

/usr/local/hadoop-1.0.3/bin/hadoop dfs -mkdir input

Copy some text files into input (run from /usr/local/hadoop-1.0.3 so that conf/* resolves):

/usr/local/hadoop-1.0.3/bin/hadoop dfs -put conf/* input

Run WordCount:

/usr/local/hadoop-1.0.3/bin/hadoop jar /usr/local/hadoop-1.0.3/hadoop-examples-1.0.3.jar wordcount input output

Display output:

/usr/local/hadoop-1.0.3/bin/hadoop dfs -cat output/*
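
To copy the result out of HDFS to the local filesystem instead of printing it (a sketch; /tmp/wordcount-output is an arbitrary local path):

/usr/local/hadoop-1.0.3/bin/hadoop dfs -get output /tmp/wordcount-output
head /tmp/wordcount-output/part-*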

8. Stop Cluster

Stop the MapReduce daemons

Run on master:

/usr/local/hadoop-1.0.3/bin/stop-mapred.sh

Stop the HDFS daemons

Run on master:

/usr/local/hadoop-1.0.3/bin/stop-dfs.sh

9. Configure Fair Scheduler

Multiple pools can be created in a Hadoop cluster so that several jobs run in parallel, each pool receiving a fair share of the cluster's task slots. An example fair-scheduler configuration file is available at http://distmap.googlecode.com/files/fair-scheduler.xml; place it at /usr/local/hadoop-1.0.3/fair-scheduler.xml as listed above.
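
The fair scheduler must also be enabled in conf/mapred-site.xml on the master. A minimal sketch, assuming the allocation file is saved as /usr/local/hadoop-1.0.3/fair-scheduler.xml (restart the MapReduce daemons afterwards):

<!-- add inside <configuration> in conf/mapred-site.xml -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/usr/local/hadoop-1.0.3/fair-scheduler.xml</value>
</property>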

10. Run DistMap

To run DistMap:

10.1 Start HDFS daemons

Run on master:

/usr/local/hadoop-1.0.3/bin/start-dfs.sh

10.2 Start MapReduce daemons

Run on master:

/usr/local/hadoop-1.0.3/bin/start-mapred.sh

10.3 Change HDFS filesystem permissions

Run on master:

/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /

10.4 Set permissions for the tmp folder

Run on master:

/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /usr/local/hadoop-1.0.3/tmp/

10.5 Go to the DistMap manual page: http://code.google.com/p/distmap/wiki/Manual

