From: Martin B. <xb...@fi...> - 2007-10-27 20:53:16
Hi,

I have managed to index documents using NutchWAX in distributed mode several times. But now there is a problem I cannot get past. This time not all the computers are under the same domain: the machine which hosts the namenode and the jobtracker is under webarchiv.cz, while all the datanodes are under fi.muni.cz (by the way, all the computers are in the same building).

When the 5th job starts (dedup 1: urls by time), 'INFO' messages are interleaved with 'WARN' ones in the logs, like these:

jobtracker:

  INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_0005_m_000003_3:
  java.lang.ArrayIndexOutOfBoundsException: -1
          at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
          at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:177)
          at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203)
          at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
          at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
          at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1388)

datanodes:

  WARN org.apache.hadoop.dfs.DataNode: DataXCeiver
  java.io.IOException: Block blk_-7402203219236206647 has already been started (though not completed), and thus cannot be created.
          at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:437)
          at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:721)
          at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:550)
          at java.lang.Thread.run(Thread.java:619)

  WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-7402203219236206647 to nymfe01/147.251.53.11:50010
  java.net.SocketException: Broken pipe
          at java.net.SocketOutputStream.socketWrite0(Native Method)
          at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
          at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
          at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
          at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
          at java.io.DataOutputStream.write(DataOutputStream.java:90)
          at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:974)
          at java.lang.Thread.run(Thread.java:619)

  WARN org.apache.hadoop.dfs.DataNode: Failed to transfer blk_-5576786832054029538 to nymfe05/147.251.53.15:50010
  java.net.SocketException: Connection reset
          at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
          at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
          at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
          at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
          at java.io.DataOutputStream.write(DataOutputStream.java:90)
          at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:974)
          at java.lang.Thread.run(Thread.java:619)

Then it crashes and my terminal says:

  07/10/27 22:20:40 INFO indexer.DeleteDuplicates: Dedup: adding indexes in: output/indexes
  07/10/27 22:20:43 INFO mapred.JobClient: Running job: job_0005
  07/10/27 22:20:44 INFO mapred.JobClient:  map 0% reduce 0%
  07/10/27 22:21:05 INFO mapred.JobClient:  map 100% reduce 100%
  Exception in thread "main" java.io.IOException: Job failed!
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
          at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:433)
          at org.archive.access.nutch.Nutchwax.doDedup(Nutchwax.java:257)
          at org.archive.access.nutch.Nutchwax.doAll(Nutchwax.java:156)
          at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:389)
          at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:585)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

Can anybody help?

Thanks,
Martin Bella
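
P.S. In case the cross-domain layout is relevant: every node addresses the master by its fully qualified hostname. Below is a rough sketch of the relevant part of my hadoop-site.xml; the hostname master.webarchiv.cz is a placeholder, not the real machine name, and my actual file has more properties:

  <?xml version="1.0"?>
  <configuration>
    <!-- namenode, in the webarchiv.cz domain -->
    <property>
      <name>fs.default.name</name>
      <value>master.webarchiv.cz:9000</value>
    </property>
    <!-- jobtracker, on the same machine -->
    <property>
      <name>mapred.job.tracker</name>
      <value>master.webarchiv.cz:9001</value>
    </property>
  </configuration>

The conf/slaves file lists the fi.muni.cz datanodes (nymfe01, nymfe05, and so on) by their fully qualified names.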