From: Miguel C. <mig...@fc...> - 2008-03-03 16:05:35
Hi,
I found the bug in the ImportArcs class. It causes the import command to
build segments with the wrong ARC names.
The map method receives a "value" parameter containing an ARCRecord, which
carries the URL, the ARC filename and the offset. The method reads all of
these values from each record except the ARC filename, which is set only the
first time map is called. So, when a thread starts working on a new ARC file,
the index output still refers to the old ARC filename.
The bug occurs at line 301 ("checkArcName(rec);"). I commented out line 545 in
the checkArcName() method to fix it.
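To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the ImportArcs code: Rec, Mapper, CachingMapper, PerRecordMapper and run are hypothetical names made up for illustration, and the example.org URLs are placeholders. It only shows how a name stored on the first map call goes stale once records from a second ARC file arrive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ArcNameSketch {

    /** Hypothetical stand-in for ARCRecord: URL, ARC filename and offset. */
    static final class Rec {
        final String url;
        final String arcName;
        final long offset;

        Rec(String url, String arcName, long offset) {
            this.url = url;
            this.arcName = arcName;
            this.offset = offset;
        }
    }

    /** Hypothetical mapper interface, one output line per record. */
    interface Mapper {
        String map(Rec rec);
    }

    /** Buggy pattern: the ARC name is stored on the first call and never refreshed. */
    static final class CachingMapper implements Mapper {
        private String arcName;

        public String map(Rec rec) {
            if (arcName == null) {
                arcName = rec.arcName; // only ever set once
            }
            return rec.url + " " + arcName + " " + rec.offset;
        }
    }

    /** Fixed pattern: the ARC name is read from every record. */
    static final class PerRecordMapper implements Mapper {
        public String map(Rec rec) {
            return rec.url + " " + rec.arcName + " " + rec.offset;
        }
    }

    /** Feeds all records through one mapper instance, as a single thread would. */
    static List<String> run(Mapper mapper, List<Rec> recs) {
        List<String> out = new ArrayList<String>();
        for (Rec rec : recs) {
            out.add(mapper.map(rec));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> recs = Arrays.asList(
            new Rec("http://example.org/a", "IAH-20080218190013-00000-T4", 100L),
            new Rec("http://example.org/b", "IAH-20080218190013-00001-T4", 200L));
        System.out.println(run(new CachingMapper(), recs));   // second line keeps the 00000 name
        System.out.println(run(new PerRecordMapper(), recs)); // second line uses the 00001 name
    }
}
```

With the caching variant the offsets stay correct while the ARC name lags behind, which matches the symptom seen in the index.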
Regards,
-- Miguel Costa
Portuguese Web Archive
-----Original Message-----
From: Miguel Costa [mailto:mig...@fc...]
Sent: Thursday, 28 February 2008 11:44
To: 'Brad Tofel'
Cc: 'Daniel Gomes'
Subject: FW: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Hi,
I don't know if you found anything else about this problem, but I found its
cause.
The index has bad references to the ARC files. The offsets returned are
correct, but the ARC filenames are not; they are usually one ARC file behind:
e.g. it returns IAH-20080218190013-00000-T4.arc.gz instead of
IAH-20080218190013-00001-T4.arc.gz
You can see the ARC file and offset by debugging NutchResourceIndex (line
122: document = getHttpDocument(requestUrl)) or, much more simply, by
submitting the URL in the browser
e.g.
http://localhost:8080/nutchwax/opensearch?query=date%3A20010101000000-20080218190351+exacturl%3Ahttp%3A%2F%2Fwww.icat.fc.ul.pt%2Fstyles.css&hitsPerPage=1000&start=0&dedupField=site&hitsPerDup=1000&hitsPerSite=1000
The wayback machine uses the nutchwax index through the opensearch link.
Nutchwax sends back the XML information for the URL match. This shows an ARC
file and an offset, but if you run the ARC reader over all ARCs to find this
offset:
e.g. find offset 24042995
arcreader `ls *.arc.gz` | grep 24042995
20080218190054 194.117.42.131
http://www.icat.fc.ul.pt/images/background_voltar.gif image/gif - - 24042995
388 IAH-20080218190013-00002-T4
you get a different ARC file from the expected one. In the cases where the
NoGzipMagicException doesn't occur, the ARC file is the correct one.
This occurs with one or more reduce tasks in hadoop, so it doesn't seem to be
a problem with the merge command.
Do you have any idea how to solve this?
Regards,
-----Original Message-----
From: Miguel Costa [mailto:mig...@fc...]
Sent: Friday, 22 February 2008 15:22
To: 'Brad Tofel'
Subject: RE: [Archive-access-discuss] org.archive.io.NoGzipMagicException
More information on the problem.
I'm debugging the code using the org.archive.io.arc.ARCReader class from the
command line.
I can parse and dump all URLs from the arc.gz file.
When I use an offset returned by this dump, I can see that the file is OK.
e.g.
/home/nutchwax/heritrix-1.12.1/src/scripts/arcreader -o 2332619
/home/nutchwax/arcs/IAH-20080123100910-00023-thessalian.arc.gz
20080123110842 70.85.38.82
http://www.gastronomias.com/moirasencantadas/imagens/logo.jpg image/jpeg -
4AN5CYCD3OYOZH7ZMOMJEKR37NTTXLT6 2332629 7722
IAH-20080123100910-00023-thessalian
When I pass another offset, I get the same exception:
Exception in thread "main" java.io.IOException: Not in GZIP format
at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:137)
So, the problem seems to be in the creation of the index, because the offsets
are computed incorrectly.
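A quick way to test whether a given offset can be the start of a record is to look at the two bytes stored there. In an arc.gz file each record is its own gzip member, so a valid record offset should point at the gzip magic bytes 0x1f 0x8b. The sketch below is only an illustration of that check: GzipMagicCheck and startsWithGzipMagic are hypothetical names, not part of Heritrix or NutchWAX.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class GzipMagicCheck {

    /**
     * Returns true if the two bytes at the given offset are the gzip
     * magic number (0x1f 0x8b), false otherwise or if the offset is
     * outside the file.
     */
    static boolean startsWithGzipMagic(String path, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            if (offset < 0 || offset + 2 > raf.length()) {
                return false;
            }
            raf.seek(offset);
            int b0 = raf.read();
            int b1 = raf.read();
            return b0 == 0x1f && b1 == 0x8b;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GzipMagicCheck <file.arc.gz> <offset>
        String path = args[0];
        long offset = Long.parseLong(args[1]);
        System.out.println(startsWithGzipMagic(path, offset));
    }
}
```

A false result for an offset taken from the index, checked against the ARC file the index names, would be consistent with the "Not in GZIP format" exception above. A true result does not prove the offset is a record boundary, since the same two bytes can also appear inside compressed data.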
-----Original Message-----
From: Brad Tofel [mailto:br...@ar...]
Sent: Thursday, 21 February 2008 19:52
To: Miguel Costa
Subject: Re: [Archive-access-discuss] org.archive.io.NoGzipMagicException
Darn. The problem didn't surface given the small input you sent. I ran into
"Unexpected end of ZLIB input stream" before the problem you are seeing.
Is there someplace online where you can post the entire file so I can
download it and examine it?
I should be able to receive a 100MB (how large is the original?) attachment
as well, if sending the whole file via email is an option for you.
Thanks,
Brad
Miguel Costa wrote: