Re: [Archive-access-discuss] problems running in standalone mode

Just FYI,
My problem was resolved by switching to the nightly build of NutchWAX 
(on St.Ack's advice) and to version 0.5 of hadoop (I think I was 
using 0.6.2).

I now can generate a search page properly.

A few problems remain, though.
1) All results link to a non-existent page: 
http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/

2) The "Other versions" link likewise directs me to example.com.

I have looked for a way to change that in the configuration, but 
couldn't find one.   The "Other versions" link has me curious, though: is 
this going to be an integration point for something like WERA?

Additional problem:
3) Inaccurate "hits" count: the page claims to display results 1-3 out 
of 20, but the "next page" displays nothing ("Results 4-3").   This bug 
seems to come from the count not taking into account the pages hidden 
behind the "more from cnn.com" link.   Because I currently have only a 
single domain crawled, it's especially obvious.
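
A minimal sketch of how such a mismatch could arise. This is an illustration only, not the actual Nutch or NutchWAX code: the class, the method names, the per-site limit of 3 and the page size of 10 are all assumptions chosen to reproduce the numbers above. The total comes from the raw hit list, while the displayed range comes from the list left after per-site folding.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupPaging {

    // Hypothetical stand-in for "more from <site>" folding:
    // keep at most maxPerSite hits for each host.
    static List<String> dedupBySite(List<String> hosts, int maxPerSite) {
        Map<String, Integer> seen = new HashMap<String, Integer>();
        List<String> shown = new ArrayList<String>();
        for (String host : hosts) {
            Integer count = seen.get(host);
            int n = (count == null) ? 0 : count;
            if (n < maxPerSite) {
                shown.add(host);
            }
            seen.put(host, n + 1);
        }
        return shown;
    }

    // Builds the "Results a-b of N" label. The range is clamped to the
    // number of displayable hits, but N is the raw (unfolded) total.
    static String label(int start, int pageSize, int shown, int rawTotal) {
        int end = Math.min(shown, start + pageSize);
        return "Results " + (start + 1) + "-" + end + " of " + rawTotal;
    }

    public static void main(String[] args) {
        // 20 raw hits, all from a single crawled domain.
        List<String> hosts = new ArrayList<String>();
        for (int i = 0; i < 20; i++) {
            hosts.add("cnn.com");
        }
        List<String> shown = dedupBySite(hosts, 3);

        // First page, then the "next page" starting where the first ended.
        System.out.println(label(0, 10, shown.size(), hosts.size()));
        System.out.println(label(3, 10, shown.size(), hosts.size()));
    }
}
```

In this sketch a "next page" link would be offered because 3 is less than the raw total of 20, even though no displayable hits remain, which yields an empty range on the second page.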

...

Also, I was wondering: would something like query expansion be 
implemented the same way as it is in nutch?   That is, would changing 
the nutch configuration file in the webapps directory to perform query 
expansion also work in NutchWAX?

Thanks,
James

James Grahn wrote:
> Greets,
> I have been attempting to follow the tutorial to get NutchWAX up and 
> running in standalone mode, but I've reached an error that confounds me.
> 
> The printlns seem to indicate that NutchWAX does successfully import the 
> ARC files.
> 
> I see this line:
> opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz
> 
> And after many individual pages being imported, I see this line:
> 
> 061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz
> 
> This followed by more individual pages.   So that seems fine.   But no 
> index is generated and the printlns end like this:
> 
> ...
> 061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
> 061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
> Exception in thread "main" java.io.IOException: Job failed!
>          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>          at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
>          at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
>          at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
>          at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
>          at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:585)
>          at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
> 
> 
> --------
> 
> Any suggestions for this error?   I am using a hadoop installation I 
> acquired with the current version of nutch, and am running the "all" 
> command as per the tutorial:
> 
> ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
> 
> 
> Thanks,
> James
> 