From: John H. L. <jl...@ar...> - 2007-06-20 14:54:10
Hi Alexis.

NutchWAX 0.10.0 has lots of bug fixes and improvements over 0.8.0, so you may want to start by upgrading your installation.

Does your job complete any tasks before you see this error? Do you see any other errors in the logs? Specifically, do you see a BindException when you run start-all.sh?

The more ARCs you index in a single job, the larger the heap space you'll need, both during indexing and during deployment. This depends, of course, on how much text is contained in the documents within the ARCs. I've been able to index and deploy batches of 12,000 ARCs with heap spaces of around 3200m on 4GB machines.
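For example, you can raise the heap given to each child task in hadoop-site.xml. This is just a sketch with example values; check the hadoop-default.xml that ships with your release to confirm these property names and defaults before relying on them:

<!-- Example values only; tune -Xmx to what each node can spare per
     concurrently running task (see mapred.tasktracker.tasks.maximum). -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3200m</value>
  <description>Java options passed to each map/reduce child JVM.</description>
</property>

<!-- If the "timed out waiting for rpc response" errors persist, raising
     the IPC client timeout (default 60000 ms) may also be worth a try. -->
<property>
  <name>ipc.client.timeout</name>
  <value>120000</value>
  <description>Milliseconds to wait for an RPC response before timing out.</description>
</property>

The Hadoop daemons themselves take their heap from HADOOP_HEAPSIZE in conf/hadoop-env.sh, so you may need to raise that as well.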
Hope this helps.

-J

On Jun 20, 2007, at 4:19 AM, alexis artes wrote:

> Hi,
>
> We are having problems doing incremental indexing. We initially
> indexed 3000 ARC files, and we encountered the following error while
> trying to index 3000 more:
>
> 2007-06-19 02:49:25,135 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0001_r_000035_0: java.net.SocketTimeoutException: timed out waiting for rpc response
>   at org.apache.hadoop.ipc.Client.call(Client.java:312)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:161)
>   at org.apache.hadoop.dfs.$Proxy1.complete(Unknown Source)
>   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1126)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
>   at org.apache.hadoop.fs.FSDataOutputStream$Summer.close(FSDataOutputStream.java:97)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
>   at java.io.FilterOutputStream.close(FilterOutputStream.java:143)
>   at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:160)
>   at org.apache.hadoop.io.MapFile$Writer.close(MapFile.java:118)
>   at org.archive.access.nutch.ImportArcs$WaxFetcherOutputFormat$1.close(ImportArcs.java:687)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:281)
>   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1075)
>
> We are using 28 nodes. Our configuration in hadoop-site.xml is as follows:
>
> <property>
>   <name>fs.default.name</name>
>   <value>apple001:9000</value>
> </property>
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>apple001:9001</value>
> </property>
>
> <property>
>   <name>dfs.name.dir</name>
>   <value>/opt/hadoop-0.5.0/filesystem/name</value>
> </property>
>
> <property>
>   <name>dfs.data.dir</name>
>   <value>/opt/hadoop-0.5.0/filesystem/data</value>
> </property>
>
> <property>
>   <name>mapred.local.dir</name>
>   <value>/opt/hadoop-0.5.0/filesystem/mapreduce/local</value>
> </property>
>
> <property>
>   <name>mapred.system.dir</name>
>   <value>/opt/hadoop-0.5.0/temp/hadoop/mapred/system</value>
>   <description>The shared directory where MapReduce stores control files.</description>
> </property>
>
> <property>
>   <name>mapred.temp.dir</name>
>   <value>/opt/hadoop-0.5.0/temp/hadoop/mapred/temp</value>
>   <description>A shared directory for temporary files.</description>
> </property>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>89</value>
>   <description>Define mapred.map.tasks based on the number of slave hosts.</description>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>53</value>
>   <description>Define mapred.reduce.tasks based on the number of slave hosts.</description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.tasks.maximum</name>
>   <value>2</value>
>   <description>The maximum number of tasks that will be run simultaneously by a task tracker.</description>
> </property>
>
> <property>
>   <name>dfs.replication</name>
>   <value>1</value>
> </property>
>
> Moreover, what is the maximum number of ARC files that can be indexed
> in the same batch? We tried 6000 but encountered errors.
>
> Best Regards,
> Alex