From: Aimilia A. <emi...@au...> - 2013-01-25 11:26:23
Hello everybody!

I use Heritrix 3.1.0 to collect some .warc files; the format of these files is WARC 1.0. With Wayback I then index the URLs I have collected. In addition to that URL indexing, I also want full-text indexing, so I tried NutchWAX: first I installed Hadoop 0.9.2 and then NutchWAX 0.10.0 (following this documentation: http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html#toc). When I run the command

  ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax-0.10.0.jar all /tmp/inputs /tmp/outputs test

the indexing starts normally, but then I get the following errors:

.....
13/01/25 02:19:35 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-unjar7698782854953199110/regex-urlfilter.txt
13/01/25 02:19:35 INFO nutch.ImportArcs: opening /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
13/01/25 02:19:36 INFO conf.Configuration: found resource wax-parse-plugins.xml at file:/tmp/hadoop-unjar7698782854953199110/wax-parse-plugins.xml
13/01/25 02:20:08 INFO nutch.ImportArcs: Error parsing /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
13/01/25 02:20:08 INFO mapred.LocalJobRunner: Error parsing /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
13/01/25 02:20:08 WARN nutch.ImportArcs: Error parsing /home/admin/archive/heritrix-3.1.0/jobs/aueb_v2/20120229120116/warcs/AUEB-20120229120127116-00000-6074~localhost.localdomain~8443.warc.gz
java.lang.RuntimeException: Retried but no next record (Offset 0)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:503)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:449)
        at org.archive.access.nutch.ImportArcs$IndexingThread.run(ImportArcs.java:356)
Caused by: java.io.IOException: Failed parse of Header Line: WARC/1.0
        at org.archive.io.warc.WARCRecord.parseHeaderLine(WARCRecord.java:248)
        at org.archive.io.warc.WARCRecord.parseHeaders(WARCRecord.java:136)
        at org.archive.io.warc.WARCRecord.<init>(WARCRecord.java:112)
        at org.archive.io.warc.WARCReader.createArchiveRecord(WARCReader.java:97)
        at org.archive.io.warc.WARCReaderFactory$CompressedWARCReader$1.innerNext(WARCReaderFactory.java:280)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.exceptionNext(ArchiveReader.java:532)
        at org.archive.io.ArchiveReader$ArchiveRecordIterator.next(ArchiveReader.java:491)
        ... 2 more
.... (the same error for all WARC files) ....
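As a sanity check, a small standalone reader like the one below could confirm whether the WARC files themselves parse with a newer org.archive.io library (only a sketch: it assumes a recent webarchive-commons jar on the classpath instead of the parser bundled with NutchWAX, and WarcCheck is just an example name):

  import java.io.File;
  import org.archive.io.ArchiveReader;
  import org.archive.io.ArchiveRecord;
  import org.archive.io.warc.WARCReaderFactory;

  // Iterate over every record in a (gzipped) WARC and print offset + URL;
  // a header-parse failure like the one above would surface here immediately.
  public class WarcCheck {
      public static void main(String[] args) throws Exception {
          ArchiveReader reader = WARCReaderFactory.get(new File(args[0]));
          for (ArchiveRecord record : reader) {
              System.out.println(record.getHeader().getOffset()
                      + "\t" + record.getHeader().getUrl());
          }
          reader.close();
      }
  }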
Later on, the dedup job fails as well:

13/01/25 02:21:04 INFO mapred.JobClient: Running job: job_tyetcg
13/01/25 02:21:04 INFO conf.Configuration: parsing file:/home/admin/archive/hadoop/hadoop-0.9.2/conf/hadoop-default.xml
13/01/25 02:21:04 INFO conf.Configuration: parsing file:/home/admin/archive/hadoop/hadoop-0.9.2/conf/mapred-default.xml
13/01/25 02:21:04 INFO conf.Configuration: parsing /tmp/hadoop-root/mapred/local/localRunner/job_tyetcg.xml
13/01/25 02:21:04 INFO conf.Configuration: parsing file:/home/admin/archive/hadoop/hadoop-0.9.2/conf/mapred-default.xml
13/01/25 02:21:04 INFO mapred.MapTask: opened part-0.out
13/01/25 02:21:04 WARN mapred.LocalJobRunner: job_tyetcg
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:177)
        at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:203)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:215)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:399)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:433)
        at org.archive.access.nutch.Nutchwax.doDedup(Nutchwax.java:257)
        at org.archive.access.nutch.Nutchwax.doAll(Nutchwax.java:156)
        at org.archive.access.nutch.Nutchwax.doJob(Nutchwax.java:389)
        at org.archive.access.nutch.Nutchwax.main(Nutchwax.java:674)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

So the output directories (crawldb, indexes, linkdb, segments) are created, but they are empty. I believe these errors result from the version of the WARC files: judging from the "Failed parse of Header Line: WARC/1.0" above, NutchWAX's bundled reader does not seem to handle the WARC 1.0 format that Heritrix 3.1.0 writes, so nothing gets imported and the dedup step then fails on the empty indexes. Can you suggest anything (or the correct version of NutchWAX) that would correctly create/update the output files (crawldb, index, indexes, linkdb, segments)? If NutchWAX is no longer supported, is there another tool for full-text indexing that can be connected with Wayback?

Thanks in advance,
Emily