From: Aimilia A. <emi...@au...> - 2013-01-25 11:26:23
Hello everybody!

I use Heritrix 3.1.0 to collect some .warc files; the format of these files is WARC 1.0. With Wayback I then index the URLs I have collected. In addition to that URL indexing, I also want full-text indexing, so I tried NutchWAX: first I installed Hadoop 0.9.2 and then NutchWAX 0.10.0 (following this documentation: http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html#toc). When I run the command

  ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax-0.10.0.jar all /tmp/inputs /tmp/outputs test

the indexing starts normally, but then I get the following errors:

.....
13/01/25 02:19:35 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar7698782854953199110/regex-urlfilter.txt
13/01/25 02:19:35 INFO nutch.ImportArcs: opening /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
13/01/25 02:19:36 INFO conf.Configuration: found resource wax-parse-plugins.xml at file:/tmp/hadoop-unjar7698782854953199110/wax-parse-plugins.xml
13/01/25 02:20:08 INFO nutch.ImportArcs: Error parsing /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
13/01/25 02:20:08 INFO mapred.LocalJobRunner: Error parsing /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
13/01/25 02:20:08 WARN nutch.ImportArcs: Error parsing /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
java.lang.RuntimeException: Retried but no next record (Offset 0)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:503)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:449)
        at org.archive.access.nutch.ImportArcs$IndexingThread.run(ImportArcs.java:356)
Caused by: java.io.IOException: Failed parse of Header Line: WARC/1.0
        at org.archive.io.warc.WARCRecord.parseHeaderLine(WARCRecord.java:248)
        at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:136)
        at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:112)
        at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:97)
        at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader$1.innerNext(WARCReaderFactory.java:280)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:532)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:491)
        ... 2 more
.... (the same error for all WARC files) ....
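As a sanity check, a small standalone reader like the one below could confirm whether the WARC files themselves parse with a newer org.archive.io library (only a sketch: it assumes a recent webarchive-commons jar on the classpath instead of the parser bundled with NutchWAX, and WarcCheck is just an example name):

  import java.io.File;
  import org.archive.io.ArchiveReader;
  import org.archive.io.ArchiveRecord;
  import org.archive.io.warc.WARCReaderFactory;

  // Iterate over every record in a (gzipped) WARC and print offset + URL;
  // a header-parse failure like the one above would surface here immediately.
  public class WarcCheck {
      public static void main(String[] args) throws Exception {
          ArchiveReader reader = WARCReaderFactory.get(new File(args[0]));
          for (ArchiveRecord record : reader) {
              System.out.println(record.getHeader().getOffset()
                      + "\t" + record.getHeader().getUrl());
          }
          reader.close();
      }
  }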
Later on, the dedup job fails as well:

13/01/25 02:21:04 INFO mapred.JobClient: Running job: job_tyetcg
13/01/25 02:21:04 INFO conf.Configuration: parsing file:/home/admin/archive/hadoop/hadoop-0.9.2/conf/hadoop-default.xml
13/01/25 02:21:04 INFO conf.Configuration: parsing file:/home/admin/archive/hadoop/hadoop-0.9.2/conf/mapred-default.xml
13/01/25 02:21:04 INFO conf.Configuration: parsing /tmp/hadoop-root/mapred/local/localRunner/job_tyetcg.xml
13/01/25 02:21:04 INFO conf.Configuration: parsing file:/home/admin/archive/hadoop/hadoop-0.9.2/conf/mapred-default.xml
13/01/25 02:21:04 INFO mapred.MapTask: opened part-0.out
13/01/25 02:21:04 WARN mapred.LocalJobRunner: job_tyetcg
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:177)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:433)
        at org.archive.access.nutch.Nutchwax.doDedup(Nutchwax.java:257)
        at org.archive.access.nutch.Nutchwax.doAll(Nutchwax.java:156)
        at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:389)
        at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

So the output directories (crawldb, indexes, linkdb, segments) are created, but they are empty. I believe these errors result from the version of the WARC files: judging from the "Failed parse of Header Line: WARC/1.0" above, NutchWAX's bundled reader does not seem to handle the WARC 1.0 format that Heritrix 3.1.0 writes, so nothing gets imported and the dedup step then fails on the empty indexes. Can you suggest anything (or the correct version of NutchWAX) that would correctly create/update the output files (crawldb, index, indexes, linkdb, segments)? If NutchWAX is no longer supported, is there another tool for full-text indexing that can be connected with Wayback?

Thanks in advance,
Emily