From: Michael S. <st...@ar...> - 2006-07-03 22:59:08
João Cláudio Luzio wrote:
> Oops.. forgot to say that the arcs where on
> /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/
> directory but with .arc.gz instead of .arc.
>

This should be fine.

> João Cláudio Luzio wrote:
>
>> Hi,
>> I've been trying to get the pair up and running for a while now but
>> had some problems..
>> Using nutchwax 0.4.3 and the wera (0.4.2RC1 & 0.5.0) I managed to
>> get it running but some of the related files (images)
>> aren't displayed. Those get:
>> <retrievermessage>
>>   <head>
>>     <errorcode>4</errorcode>
>>     <errormessage>Unable to parse Archive Identifier</errormessage>
>>   </head>
>> </retrievermessage>
>> Using wera debug I found that the "[archiveidentifier] =>
>> 2770/IAH-20060619172903-00000-webarchive1" for a specific search i made.
>> (Starting tomcat from the nutchwax indexed data)
>>

So, it generally works but some of the images don't show sometimes?

>> Using wayback I dont have the same problems(I dont use nutchwax with
>> wayback..).
>>
>> I've tried to get nutchwax 0.6.1 and wera running but the opensearch
>> servlet for the rss from nutchwax gives an exception..
>>

Do you still have the exception?

>> So i tried nutchwax 0.7.0 (with latest hadoop - standalone), but now
>> the arcretriever gives an exception when trying to get the document.
>> Using wera debug I found that the "[archiveidentifier] =>
>> 2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc" for the
>> same search i made.
>> (Starting tomcat from anywhere)
>>
>> The exception:
>> 7 Bad function argument Cause: java.io.FileNotFoundException:
>> /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/filedesc:/IAH-20060619172903-00000-webarchive1.arc
>> does not exist.
>> Stack trace:
>> org.archive.io.arc.ARCUtils.isReadable(ARCUtils.java:171)
>> org.archive.io.arc.ARCUtils.testCompressedARCFile(ARCUtils.java:94)
>> org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:200)
>> org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:194)
>> no.nb.nwa.retriever.ARCRetriever.getDocument(ARCRetriever.java:410)
>> no.nb.nwa.retriever.ARCRetriever.doGet(ARCRetriever.java:131)
>>

Looks like we shouldn't be putting the 'filedesc:' in front of the ARC
name. Does ARCRetriever work if you make a request with
IAH-20060619172903-00000-webarchive1.arc instead of
filedesc:/IAH-20060619172903-00000-webarchive1.arc?

St.Ack
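
To make the suggestion above concrete, here is a minimal sketch of stripping the 'filedesc:' prefix from an ARC name before it is resolved against the jobs directory. It is only an illustration, not the actual ARCRetriever or NutchWAX code: the class ArcNames and both of its methods are hypothetical, and the "offset/name" layout of the archive identifier is an assumption read off the wera debug output quoted in the message.

```java
// Hypothetical helper, not part of ARCRetriever or NutchWAX.
class ArcNames {

    /** Removes a leading "filedesc:" and any slashes that follow it. */
    static String stripFiledesc(String name) {
        final String prefix = "filedesc:";
        if (!name.startsWith(prefix)) {
            return name; // already a bare ARC name
        }
        int i = prefix.length();
        // Skip "/" or "//" after the scheme-like prefix.
        while (i < name.length() && name.charAt(i) == '/') {
            i++;
        }
        return name.substring(i);
    }

    /**
     * Splits an identifier assumed to look like "offset/name" at the first
     * slash and returns { offset, bare ARC name }.
     */
    static String[] splitIdentifier(String identifier) {
        int slash = identifier.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("No '/' in identifier: " + identifier);
        }
        String offset = identifier.substring(0, slash);
        String name = stripFiledesc(identifier.substring(slash + 1));
        return new String[] { offset, name };
    }

    public static void main(String[] args) {
        String[] parts = splitIdentifier(
            "2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc");
        System.out.println("offset=" + parts[0] + " arc=" + parts[1]);
    }
}
```

With the identifier from the 0.7.0 run this yields the offset 2234331 and the bare name IAH-20060619172903-00000-webarchive1.arc, which is the form the question above asks to try. An identifier without the prefix, like the one from the 0.4.3 run, passes through unchanged.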