From: Jaap B. <lor...@ho...> - 2009-08-27 12:38:54
Hello,

I have an issue with the NutchWAX import. I can no longer import a whole directory of ARCs without getting this exception:

2009-08-27 13:36:41,197 INFO  nutchwax.Importer - Importing ARC: filedesc://hospicLiberec-20090629183754-00016-har.arc
2009-08-27 13:36:41,198 WARN  mapred.LocalJobRunner - job_local_0001
java.net.MalformedURLException: unknown protocol: filedesc
    at java.net.URL.<init>(URL.java:574)
    at java.net.URL.<init>(URL.java:464)
    at java.net.URL.<init>(URL.java:413)
    at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:99)
    at org.archive.io.ArchiveReaderFactory.getArchiveReader(ArchiveReaderFactory.java:93)
    at org.archive.io.ArchiveReaderFactory.get(ArchiveReaderFactory.java:88)
    at org.archive.nutchwax.Importer.map(Importer.java:194)
    at org.archive.nutchwax.Importer.map(Importer.java:96)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
2009-08-27 13:36:41,329 FATAL nutchwax.Importer - Importer: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
    at org.archive.nutchwax.Importer.run(Importer.java:663)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.archive.nutchwax.Importer.main(Importer.java:699)

This is strange, because I was able to import this very same directory before. I use the command:

    ./nutchwax import /tmp/arcdir /tmp/outputdir

/tmp/arcdir contains a file listing the paths to the ARC files. Now I only sometimes succeed in importing just a few of these ARC files (there are around 25 in total).

I read this on a NutchWAX issue tracker:

"Added boolean configuration option nutchwax.import.abortOnArchiveReadError with default value of 'false'. By default if reading an archive file causes the (W)ARCReader to throw an IOException, we catch it, and proceed with the next archive file in the manifest. Setting this config parameter to true causes the exception to be re-thrown, eventually causing the entire import job to abort."

I use version 0.12.7, but this IOException does not seem to be caught by default?

I hope you can help me out with this problem.

Kind regards,

Jaap Blom
Nederlands instituut voor Beeld en Geluid
Hilversum, The Netherlands
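
A note on the trace above: the exception is raised inside java.net.URL, which only accepts schemes that have a registered protocol handler. The following is a minimal Java sketch, not NutchWAX code (the class name FiledescUrlCheck and the helper urlError are hypothetical, made up for illustration). It reproduces the same message outside of Hadoop and shows that MalformedURLException is a subclass of IOException.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

public class FiledescUrlCheck {

    /**
     * Hypothetical helper: returns null when java.net.URL accepts the
     * string, otherwise the message of the MalformedURLException.
     */
    public static String urlError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The entry taken from the "Importing ARC:" log line.
        String entry = "filedesc://hospicLiberec-20090629183754-00016-har.arc";

        // prints "unknown protocol: filedesc"
        System.out.println(urlError(entry));

        // prints "true": MalformedURLException extends IOException
        System.out.println(
            IOException.class.isAssignableFrom(MalformedURLException.class));
    }
}
```

The sketch only confirms that the string logged as the ARC location is not something java.net.URL can parse; it says nothing about where NutchWAX 0.12.7 catches exceptions. One possibility worth checking, offered as a guess from the log rather than a confirmed cause: the first record of an ARC file begins with a filedesc:// line, so if ARC files themselves (and not only the manifest) sit in /tmp/arcdir, their contents could end up being read as manifest entries.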