|
From: <jl...@ex...> - 2006-06-28 17:29:18
|
Hi,
I've been trying to get the pair up and running for a while now but
have had some problems.
Using nutchwax 0.4.3 and wera (0.4.2RC1 & 0.5.0) I managed to
get it running, but some of the related files (images)
aren't displayed. Those get:
<retrievermessage>
<head>
<errorcode>4</errorcode>
<errormessage>Unable to parse Archive Identifier</errormessage>
</head>
</retrievermessage>
Using wera debug I found "[archiveidentifier] =>
2770/IAH-20060619172903-00000-webarchive1" for a specific search I made.
(Starting tomcat from the nutchwax indexed data)
Using wayback I don't have the same problems (I don't use nutchwax with
wayback).
I've tried to get nutchwax 0.6.1 and wera running, but the opensearch
servlet for the rss from nutchwax gives an exception.
So I tried nutchwax 0.7.0 (with latest hadoop - standalone), but now
the arcretriever gives an exception when trying to get the document.
Using wera debug I found "[archiveidentifier] =>
2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc" for the
same search I made.
(Starting tomcat from anywhere)
The exception:
7 Bad function argument Cause: java.io.FileNotFoundException:
/var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/filedesc:/IAH-20060619172903-00000-webarchive1.arc
does not exist. Stack trace:
org.archive.io.arc.ARCUtils.isReadable(ARCUtils.java:171)
org.archive.io.arc.ARCUtils.testCompressedARCFile(ARCUtils.java:94)
org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:200)
org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:194)
no.nb.nwa.retriever.ARCRetriever.getDocument(ARCRetriever.java:410)
no.nb.nwa.retriever.ARCRetriever.doGet(ARCRetriever.java:131)
....
Also using,
JDK 1.5.0_05
Tomcat 5.5.16
Heritrix 1.6.0
I have tried to figure it out but I'm not having any luck. I'm a
newbie with these tools, so I appreciate all the help I can get in
getting the latest nutchwax+wera setup going.
Thanks in advance,
João Luzio
|
|
From: <jl...@ex...> - 2006-06-29 10:51:11
|
Oops.. forgot to say that the arcs were in the
/var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/
directory, but with .arc.gz instead of .arc.

_______________________________________________
Archive-access-discuss mailing list
Arc...@li...
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
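One way to check which variant of an ARC file is actually on disk is to try the name as given and then its .gz (or un-gzipped) sibling. The sketch below is only an illustration: the class ArcFileLocator and its methods are hypothetical names, not part of ARCRetriever or Heritrix, and whether ARCRetriever performs any such fallback itself is not stated in this thread.

```java
// Hypothetical helper for illustration; not part of ARCRetriever or Heritrix.
final class ArcFileLocator {

    // Names worth trying, in order: the name as given, then its
    // compressed (.gz) or uncompressed sibling.
    static java.util.List<String> candidateNames(String arcName) {
        java.util.List<String> names = new java.util.ArrayList<String>();
        names.add(arcName);
        if (arcName.endsWith(".gz")) {
            names.add(arcName.substring(0, arcName.length() - ".gz".length()));
        } else {
            names.add(arcName + ".gz");
        }
        return names;
    }

    // Returns the first candidate that exists in arcDir, or null if none does.
    static java.io.File locate(java.io.File arcDir, String arcName) {
        for (String name : candidateNames(arcName)) {
            java.io.File f = new java.io.File(arcDir, name);
            if (f.isFile()) {
                return f;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Directory and file name taken from the messages above.
        java.io.File dir = new java.io.File(
            "/var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727");
        System.out.println(locate(dir, "IAH-20060619172903-00000-webarchive1.arc"));
    }
}
```

If locate returns null for both names, the problem is more likely in the name being requested than in the compression suffix.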
|
From: Michael S. <st...@ar...> - 2006-07-03 22:59:08
|
João Cláudio Luzio wrote:
> Oops.. forgot to say that the arcs where on
> /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/
> directory but with .arc.gz instead of .arc.

This should be fine.

> João Cláudio Luzio wrote:
>> Hi,
>> I've been trying to get the pair up and running for a while now but
>> had some problems..
>> Using nutchwax 0.4.3 and the wera (0.4.2RC1 & 0.5.0) I managed to
>> get it running but some of the related files (images)
>> aren't displayed. Those get:
>> <retrievermessage>
>> <head>
>> <errorcode>4</errorcode>
>> <errormessage>Unable to parse Archive Identifier</errormessage>
>> </head>
>> </retrievermessage>
>> Using wera debug I found that the "[archiveidentifier] =>
>> 2770/IAH-20060619172903-00000-webarchive1" for a specific search i made.
>> (Starting tomcat from the nutchwax indexed data)

So, it generally works but some of the images don't show sometimes?

>> Using wayback I dont have the same problems(I dont use nutchwax with
>> wayback..).
>>
>> I've tried to get nutchwax 0.6.1 and wera running but the opensearch
>> servlet for the rss from nutchwax gives an exception..

Do you still have the exception?

>> So i tried nutchwax 0.7.0 (with latest hadoop - standalone), but now
>> the arcretriever gives an exception when trying to get the document.
>> Using wera debug I found that the "[archiveidentifier] =>
>> 2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc" for the
>> same search i made.
>> (Starting tomcat from anywhere)
>>
>> The exception:
>> 7 Bad function argument Cause: java.io.FileNotFoundException:
>> /var/local/webarchive/heritrix/jobs/bn_18_test-20060619172505727/filedesc:/IAH-20060619172903-00000-webarchive1.arc
>> does not exist. Stack trace:
>> org.archive.io.arc.ARCUtils.isReadable(ARCUtils.java:171)
>> org.archive.io.arc.ARCUtils.testCompressedARCFile(ARCUtils.java:94)
>> org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:200)
>> org.archive.io.arc.ARCReaderFactory.get(ARCReaderFactory.java:194)
>> no.nb.nwa.retriever.ARCRetriever.getDocument(ARCRetriever.java:410)
>> no.nb.nwa.retriever.ARCRetriever.doGet(ARCRetriever.java:131)

Looks like we shouldn't be putting the 'filedesc:' on the front of the ARC
name? Does ARCRetriever work if you make a request with
IAH-20060619172903-00000-webarchive1.arc instead of
filedesc:/IAH-20060619172903-00000-webarchive1.arc?

St.Ack
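To make the suggestion concrete, here is a minimal sketch of stripping a leading 'filedesc:' from the ARC name before it is joined to the ARC directory. It assumes the archive identifier has the form "<offset>/<ARC name>", which is what the two values reported by wera debug look like; the thread does not confirm that format. ArchiveIdentifier, parse and stripFiledesc are hypothetical names for illustration, not existing WERA, NutchWAX or ARCRetriever code.

```java
// Hypothetical helper for illustration; not WERA, NutchWAX or ARCRetriever code.
final class ArchiveIdentifier {
    final long offset;     // assumed: record offset within the ARC file
    final String arcName;  // ARC name with any "filedesc:" prefix removed

    ArchiveIdentifier(long offset, String arcName) {
        this.offset = offset;
        this.arcName = arcName;
    }

    // Assumes "<offset>/<ARC name>" and splits at the first '/'.
    static ArchiveIdentifier parse(String id) {
        int slash = id.indexOf('/');
        if (slash <= 0 || slash == id.length() - 1) {
            throw new IllegalArgumentException(
                "Unable to parse archive identifier: " + id);
        }
        long offset = Long.parseLong(id.substring(0, slash));
        return new ArchiveIdentifier(offset, stripFiledesc(id.substring(slash + 1)));
    }

    // Removes a leading "filedesc:" and any slashes that follow it.
    static String stripFiledesc(String name) {
        String prefix = "filedesc:";
        if (!name.startsWith(prefix)) {
            return name;
        }
        int i = prefix.length();
        while (i < name.length() && name.charAt(i) == '/') {
            i++;
        }
        return name.substring(i);
    }

    public static void main(String[] args) {
        ArchiveIdentifier id = parse(
            "2234331/filedesc://IAH-20060619172903-00000-webarchive1.arc");
        System.out.println(id.offset + " " + id.arcName);
    }
}
```

The same parse call also accepts the older style identifier, 2770/IAH-20060619172903-00000-webarchive1, and leaves its name unchanged because it carries no prefix.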