Before starting the Hadoop setup, create a common user account and group on all computers in the cluster. Create the group hadoop and the user cluster:
Open System Preferences and click Accounts.
To create group hadoop:
Click the + symbol and select Group from the New Account drop-down menu.
Enter the group name hadoop and click the Create Group button.
To create user account cluster:
Click the + symbol and select Administrator from the New Account drop-down menu.
Enter the full name and the account name cluster, set a password, and click the Create Account button.
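If you prefer the command line, the same group and user can also be created with the dscl tool that ships with Mac OS X. The sketch below only prints the commands for review: they must be run with sudo on each Mac, and the numeric IDs 600/601 are arbitrary examples, not values from this guide.

```shell
# Group/user creation via dscl, PRINTED for review rather than executed.
# The IDs 600 (gid) and 601 (uid) are placeholders; pick unused values.
cmds='sudo dscl . -create /Groups/hadoop
sudo dscl . -create /Groups/hadoop PrimaryGroupID 600
sudo dscl . -create /Users/cluster
sudo dscl . -create /Users/cluster UniqueID 601
sudo dscl . -create /Users/cluster PrimaryGroupID 600
sudo dscl . -append /Groups/hadoop GroupMembership cluster'
printf '%s\n' "$cmds"
```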
Download the stable Apache Hadoop version from http://www.us.apache.org/dist/hadoop/common/stable/ and unpack it into /usr/local/hadoop-1.0.3
The JAVA_HOME is /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
The HADOOP_HOME is /usr/local/hadoop-1.0.3
Here we provide a practical guide to setting up a Hadoop cluster from scratch on Linux and Mac OS X 10.6 or later. The main configuration files can be downloaded from the following links:
/usr/local/hadoop-1.0.3/conf/hadoop-env.sh: http://distmap.googlecode.com/files/hadoop-env.sh
/usr/local/hadoop-1.0.3/conf/core-site.xml: http://distmap.googlecode.com/files/core-site.xml
/usr/local/hadoop-1.0.3/conf/hdfs-site.xml: http://distmap.googlecode.com/files/hdfs-site.xml
/usr/local/hadoop-1.0.3/conf/mapred-site.xml: http://distmap.googlecode.com/files/mapred-site.xml
/usr/local/hadoop-1.0.3/fair-scheduler.xml: http://distmap.googlecode.com/files/fair-scheduler.xml
If you have, for example, the following nodes:
(remember to replace the hostnames according to your machines, e.g. macserver01-01, macserver01-02)
master:
macserver01-01
slaves:
macserver01-02
macserver01-03
macserver01-04
macserver01-05
# /etc/hosts (for master AND slave)
192.168.0.1 macserver01-01
192.168.0.2 macserver01-02
192.168.0.3 macserver01-03
192.168.0.4 macserver01-04
192.168.0.5 macserver01-05
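If the nodes follow a regular naming and addressing scheme like the example above, the hosts block can be generated rather than typed by hand. This sketch assumes the guide's example subnet 192.168.0.x; substitute your own hostnames and addresses.

```shell
# Print one /etc/hosts line per node, numbering addresses from .1 up.
i=1
for host in macserver01-01 macserver01-02 macserver01-03 macserver01-04 macserver01-05; do
  echo "192.168.0.$i $host"
  i=$((i + 1))
done
```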
edit conf/masters as follows:
macserver01-01
edit conf/slaves as follows:
macserver01-01
macserver01-02
macserver01-03
macserver01-04
macserver01-05
conf/hadoop-env.sh http://distmap.googlecode.com/files/hadoop-env.sh
cd /usr/local/hadoop-1.0.3
Open conf/hadoop-env.sh in a text editor (e.g. gedit on Linux, nano or TextEdit on Mac OS X)
and change:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to:
# The java implementation to use. Required.
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/
Save & exit.
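On Mac OS X, the /usr/libexec/java_home utility prints the path of the active JDK; it can be used to double-check the value before editing hadoop-env.sh. The sketch below falls back to the framework path used in this guide on systems where the utility does not exist.

```shell
# Query the active JDK path; fall back to this guide's OS X 10.6 path
# on machines without /usr/libexec/java_home (e.g. Linux).
JAVA_HOME=$(/usr/libexec/java_home 2>/dev/null || echo /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/)
echo "$JAVA_HOME"
```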
cd /usr/local/hadoop-1.0.3
mkdir tmp
mkdir hdfs
mkdir hdfs/name
mkdir hdfs/data
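The four mkdir commands above can be collapsed into one using -p, which creates any missing parent directories. In this sketch a temporary directory stands in for /usr/local/hadoop-1.0.3 so it can run anywhere without root; on a real node, use the actual Hadoop directory.

```shell
# One-liner equivalent of the four mkdir commands; -p creates parents.
# mktemp -d is used here purely so the sketch is self-contained.
HADOOP_DIR=$(mktemp -d)
mkdir -p "$HADOOP_DIR/tmp" "$HADOOP_DIR/hdfs/name" "$HADOOP_DIR/hdfs/data"
ls "$HADOOP_DIR/hdfs"
```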
Under conf/, edit the following files (replace /usr/local/hadoop-1.0.3 with your own Hadoop directory if it differs):
conf/core-site.xml http://distmap.googlecode.com/files/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://macserver01-01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop-1.0.3/tmp</value>
</property>
</configuration>
conf/hdfs-site.xml http://distmap.googlecode.com/files/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop-1.0.3/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop-1.0.3/hdfs/data</value>
</property>
</configuration>
conf/mapred-site.xml http://distmap.googlecode.com/files/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>macserver01-01:9001</value>
</property>
</configuration>
ssh localhost
You will need a password to log in over ssh. To enable password-less login, generate a key pair and authorize it:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
exit
Configuration done. Try:
ssh localhost
You should now be able to log in without a password.
The master must be able to log in to every slave without a password. Copy the master's public key to each slave:
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-02 "cat - >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-03 "cat - >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-04 "cat - >> ~/.ssh/authorized_keys"
cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-05 "cat - >> ~/.ssh/authorized_keys"
You will need the corresponding slave's password to run the above commands. Try:
cluster@macserver01-01:~$ ssh macserver01-02
cluster@macserver01-01:~$ ssh macserver01-03
cluster@macserver01-01:~$ ssh macserver01-04
cluster@macserver01-01:~$ ssh macserver01-05
You should now be able to log in without a password.
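Since the four key-copy commands follow one pattern, they can be generated with a loop. This sketch only prints the commands for review, because running them requires the slaves to be reachable; it assumes the key generated earlier with ssh-keygen -t dsa, i.e. ~/.ssh/id_dsa.pub.

```shell
# Build the per-slave key-copy commands and print them for review.
cmds=$(for n in 02 03 04 05; do
  echo "cat ~/.ssh/id_dsa.pub | ssh cluster@macserver01-$n 'cat - >> ~/.ssh/authorized_keys'"
done)
printf '%s\n' "$cmds"
```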
Before first use you must format HDFS (the Hadoop Distributed File System). This is done only once; reformatting erases any existing HDFS data.
Run the following command on the master:
/usr/local/hadoop-1.0.3/bin/hadoop namenode -format
Run the following command on the master:
/usr/local/hadoop-1.0.3/bin/start-dfs.sh
Use the following command on every node to check the status of daemons:
jps
Run jps on the master; you should see something like this:
8211 DataNode
7803 NameNode
8620 Jps
8354 SecondaryNameNode
Run jps on the slaves; you should see something like this:
8211 DataNode
8620 Jps
Run the following command on master:
/usr/local/hadoop-1.0.3/bin/start-mapred.sh
Use the following command on every node to check the status of the daemons:
jps
Run jps on the master; you should see something like this:
8211 DataNode
7803 NameNode
8547 TaskTracker
8620 Jps
8422 JobTracker
8354 SecondaryNameNode
Run jps on the slaves; you should see something like this:
8211 DataNode
8547 TaskTracker
8620 Jps
There are web interfaces that report live progress for running Hadoop jobs (replace localhost with the relevant node's hostname when browsing from another machine):
http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
(WordCount) Create a directory named "input" in HDFS:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -mkdir input
Copy some text files into input:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -put conf/* input
Run WordCount:
/usr/local/hadoop-1.0.3/bin/hadoop jar /usr/local/hadoop-1.0.3/hadoop-examples-1.0.3.jar wordcount input output
Display output:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -cat output/*
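To see what the WordCount job computes, here is a rough local equivalent using only standard Unix tools (not Hadoop): split the input on whitespace, then count the occurrences of each word.

```shell
# Local word count: one word per line, sorted, counted, most frequent first.
printf 'the quick fox\nthe lazy dog\n' |
  tr -s '[:space:]' '\n' | sort | uniq -c | sort -rn
```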
Close MapReduce daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/stop-mapred.sh
Close HDFS daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/stop-dfs.sh
Multiple pools can be created in a Hadoop cluster to run several jobs in parallel. A fair-scheduler configuration file is available at http://distmap.googlecode.com/files/fair-scheduler.xml
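For illustration, a Hadoop 1.x fair-scheduler allocation file defining two pools might look like the following; the pool names and limits here are invented examples, not taken from the linked file.

```xml
<?xml version="1.0"?>
<!-- Example allocations: "production" gets guaranteed slots and double
     weight; "research" is capped at three concurrent jobs. -->
<allocations>
  <pool name="production">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <weight>2.0</weight>
  </pool>
  <pool name="research">
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
</allocations>
```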
DistMap
To run DistMap:
10.1 Start HDFS daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/start-dfs.sh
10.2 Start MapReduce daemons
Run on master:
/usr/local/hadoop-1.0.3/bin/start-mapred.sh
10.3 Change HDFS filesystem permissions
Run on master:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /
10.4 Set permissions for the tmp folder
Run on master:
/usr/local/hadoop-1.0.3/bin/hadoop dfs -chmod -R 777 /usr/local/hadoop-1.0.3/tmp/
10.5 Go to the DistMap manual page: http://code.google.com/p/distmap/wiki/Manual