From: James G. <jg...@si...> - 2006-11-07 18:22:49
Just FYI, my problem was resolved by switching to the nightly build of NutchWAX (at St.Ack's advice) and switching to the 0.5 version of hadoop (I think I was using 0.6.2). I now can generate a search page properly.

A few problems remain, though.

1) All results link to a non-existent page:
http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/

2) The "Other versions" link likewise directs me to example.com.

I have looked for a way to change that in the configuration, but couldn't find it. The "Other versions" link has me curious, though: is this going to be an integration point for something like WERA?

Additional problem:

3) Inaccurate "hits" count: the page claims to display results 1-3 out of 20, but the "next page" displays nothing ("Results 4-3"). This bug seems to originate from the count not taking into account the pages hidden by the "more from cnn.com" link. Because I currently have just a single domain crawled, it's especially obvious.

...

Also, I was wondering: would implementing something like query expansion be accomplished in the same manner as it is in nutch? That is, would changing the nutch configuration file in the webapps directory to perform query expansion work in NutchWAX?

Thanks,
James

James Grahn wrote:
> Greets,
> I have been attempting to follow the tutorial to get NutchWAX up and
> running in standalone mode, but I've reached an error that confounds me.
>
> The printlns seem to indicate that NutchWAX does successfully import the
> ARC files.
>
> I see this line:
> opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz
>
> And after many individual pages being imported, I see this line:
>
> 061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz
>
> This followed by more individual pages. So that seems fine. But no
> index is generated and the printlns end like this:
>
> ...
> 061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
> 061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>     at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
>     at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
>     at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
>     at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
>     at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:585)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
>
> --------
>
> Any suggestions for this error? I am using a hadoop installation I
> acquired with the current version of nutch, and am running the "all"
> command as per the tutorial:
>
> ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
>
> Thanks,
> James
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
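
Regarding problem 3 in the reply at the top of this message: the "Results 4-3" symptom can be reproduced with a small model. This is only a sketch of the suspected cause, not NutchWAX's or Nutch's actual code. The function names (`dedup_by_site`, `page_label`), the per-site limit of 3, and the page size of 10 are all assumptions made for illustration. It shows how a total taken from the raw hit list (20) can disagree with the number of hits left after per-site collapsing (3), so that a "next page" starting at hit 4 has nothing to show.

```python
from urllib.parse import urlparse


def dedup_by_site(hits, max_per_site):
    """Keep at most max_per_site hits per host, preserving rank order.

    Hypothetical stand-in for the "more from <site>" collapsing.
    """
    seen = {}
    kept = []
    for url in hits:
        host = urlparse(url).netloc
        seen[host] = seen.get(host, 0) + 1
        if seen[host] <= max_per_site:
            kept.append(url)
    return kept


def page_label(start, per_page, shown_total):
    """Build a 'Results a-b' label, clamping b to the number of visible hits."""
    end = min(start + per_page - 1, shown_total)
    return f"Results {start}-{end}"


# Hypothetical data: 20 raw hits, all from a single crawled domain.
raw_hits = [f"http://www.cnn.com/page{i}" for i in range(1, 21)]
visible = dedup_by_site(raw_hits, max_per_site=3)

print(len(raw_hits), len(visible))        # 20 3
print(page_label(1, 10, len(visible)))    # Results 1-3
print(page_label(4, 10, len(visible)))    # Results 4-3
```

If this is indeed the cause, computing the total and the "next page" link from the collapsed list rather than the raw one would make the two numbers agree.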