From: <ven...@ya...> - 2012-07-20 09:57:25
Hi,

I tried to use myHadoop in an HPC environment and it works fine. Great work, and thanks for open-sourcing and sharing it. However, when I try to set up an N-node cluster, I face intermittent problems while the Hadoop jobs are executing. During batch job submission I request 2 nodes, which are allocated by PBS, so the job executes on a 2-node cluster. Below are the details of the setup and the errors.

My HDFS BASE_DIR is /apps/hadoop/hadoop-0.20.2/HDFS and it looks like:

    #:/apps/hadoop/hadoop-0.20.2> ls -ltr HDFS/
    drwx------ 2 xx itr 4096 2012-07-09 07:26 3
    drwx------ 4 xx itr 4096 2012-07-10 03:03 2
    drwx------ 4 xx itr 4096 2012-07-20 05:14 1

Note: the folder /apps/hadoop/hadoop-0.20.2 is on the NAS and visible to all compute nodes.

The HADOOP_DATA_DIR in setup.env is /apps/hadoop/hadoop-0.20.2/hdump. This directory does not physically exist beforehand and is created only during the job. In my pbs-configure script the symbolic links are created as:

    ln -s /apps/hadoop/hadoop-0.20.2/HDFS/1 /apps/hadoop/hadoop-0.20.2/hdump
    ln -s /apps/hadoop/hadoop-0.20.2/HDFS/2 /apps/hadoop/hadoop-0.20.2/hdump

Note: the folder /apps/hadoop/hadoop-0.20.2/hdump is formatted during job execution and is deleted after the job finishes.

Now, after the symbolic-link creation step is executed, the BASE_DIR looks like:

    #:/apps/hadoop/hadoop-0.20.2> ls -ltr HDFS/
    drwx------ 2 xx itr 4096 2012-07-09 07:26 3
    drwx------ 4 xx itr 4096 2012-07-10 03:03 2
    drwx------ 4 xx itr 4096 2012-07-20 05:14 1

Inside the 1 folder:

    #:/apps/hadoop/hadoop-0.20.2> ls -ltr HDFS/1
    drwx------ 5 xx itr 4096 2012-07-10 03:29 dfs
    lrwxrwxrwx 1 xx itr   33 2012-07-20 05:06 2 -> /apps/hadoop/hadoop-0.20.2/HDFS/2
    drwx------ 3 xx itr 4096 2012-07-20 05:06 mapred

And inside the 2 folder:

    #:/apps/hadoop/hadoop-0.20.2> ls -ltr HDFS/1/2
    lrwxrwxrwx 1 xx itr 33 2012-07-20 05:06 HDFS/1/2 -> /apps/hadoop/hadoop-0.20.2/HDFS/2

The job completes fine, but it throws many intermittent errors like:

    .....INFO mapred.JobClient: Task Id : attempt_201207200444_0005_m_000001_0, Status : FAILED
    java.lang.RuntimeException: java.lang.ClassNotFoundException:.....
    java.lang.RuntimeException: java.io.FileNotFoundException: /apps/hadoop/hadoop-0.20.2/hdump/mapred/local/taskTracker/jobcache/job_201207200444_0006/attempt_201207200444_0006_m_000000_0/job.xml (No such file or directory)
    java.lang.ClassCastException: org.apache.hadoop.mapreduce.lib.input.FileSplit incompatible with org.apache.hadoop.mapred.InputSplit

From the errors, I guess the cause may be that the HDFS folders for the 2 nodes are overwritten by one another; in other words, the folders HDFS/1 and HDFS/2 are not both utilised and only HDFS/1 is used by the job.

Let me know if I am missing something here; I would appreciate your help in resolving this issue. Thanks.

Regards,
R. Venkatesh
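
P.S. For context, below is a rough sketch of how I understand the per-node link creation runs from the job script. The loop and ssh invocation are simplified for illustration and are not copied from the actual pbs-configure.sh; only the ln -s targets are as shown above.

    #!/bin/bash
    # Illustrative sketch only -- the real pbs-configure.sh logic may differ.
    BASE_DIR=/apps/hadoop/hadoop-0.20.2/HDFS
    HADOOP_DATA_DIR=/apps/hadoop/hadoop-0.20.2/hdump

    i=1
    for node in $(sort -u "$PBS_NODEFILE"); do
        # Each allocated node is meant to get its own HDFS directory via the link.
        ssh "$node" "ln -s $BASE_DIR/$i $HADOOP_DATA_DIR"
        i=$((i+1))
    done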