From: Michael S. <st...@ar...> - 2006-11-07 18:46:42
James Grahn wrote:
> Just FYI,
> My problem was resolved by switching to the nightly build of NutchWAX
> (at St.Ack's advice) and switching to the .5 version of hadoop (I think
> I was using 0.6.2).
>
> I now can generate a search page properly.
>
> A few problems remain, though.
> 1) All results link to a non-existent page:
> http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/
>

Check out the 'Searching' section here:
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html
It doesn't mention the 'wax.host' property you'll need to change -- I'll
fix that -- but this file, available in the src version of nutchwax,
notes that property and others you might also want to change:
http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/nutch/conf/hadoop-site.xml.template?revision=1.7

> 2) The "Other versions" link likewise directs me to example.com
>
> I have looked for a way to change that in the configuration, but
> couldn't find it. The "Other versions" has me curious though; is this
> going to be an integration point for something like WERA?
>

By default, we only show the most recent version of a page. If there are
multiple versions in an index, "Other versions" shows them all (it sets
hitsPerDup to '0', which says show all -- usually hitsPerDup is 1).

> Additional problem:
> 3) Inaccurate "hits" count: the page claims to display results 1-3 out
> of 20, but the "next page" displays nothing ("Results 4-3"). This bug
> seems to originate from it not taking into account the pages hidden by
> the "more from cnn.com". Because I currently just have a single-domain
> crawled, it's especially obvious.
>

Yeah. Known issue. Need to fix. Also in play is the fact that we only
show one hit per site by default (add hitsPerSite=0 to your query string
to confirm that this, rather than 'more', is the issue).

> ...
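For anyone wanting to try the hitsPerSite=0 check mentioned above, here is a minimal sketch of appending that parameter to a search URL. The host, port, webapp path and the 'query' parameter below are hypothetical placeholders, not taken from this thread; only the hitsPerSite name and its value 0 come from the reply. Substitute whatever URL your own search page produces.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_query_param(url, name, value):
    """Return url with query parameter `name` set to `value`.

    Any existing occurrence of `name` is dropped first, so the
    parameter ends up in the query string exactly once.
    """
    parts = urlsplit(url)
    params = [(k, v)
              for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != name]
    params.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical search URL; replace with the one your deployment generates.
search_url = "http://localhost:8080/nutchwax/search.jsp?query=cnn"

# hitsPerSite=0 is the setting suggested in the reply above.
ungrouped_url = with_query_param(search_url, "hitsPerSite", 0)
print(ungrouped_url)
```

If the result count and the paging agree once the parameter is added, the mismatch most likely comes from the one-hit-per-site grouping rather than from the 'more' link.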
>
> Also, I was wondering; would implementing something like query expansion
> be accomplished in the same manner as it is in nutch? That is, would
> changing the nutch configuration file in the webapps directory to
> perform query expansion work in NutchWAX?
>

Nutchwax includes nearly all of nutch. The only reason it wouldn't work
would be because we've not built in a plugin or some conf file, or our
jsp page diverges slightly from default nutch. If you let me know what's
missing, I'll change the build scripts to include it.

Yours,
St.Ack

> Thanks,
> James
>
> James Grahn wrote:
>
>> Greets,
>> I have been attempting to follow the tutorial to get NutchWAX up and
>> running in standalone mode, but I've reached an error that confounds me.
>>
>> The printlns seem to indicate that NutchWAX does successfully import the
>> ARC files.
>>
>> I see this line:
>> opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz
>>
>> And after many individual pages being imported, I see this line:
>>
>> 061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz
>>
>> This followed by more individual pages. So that seems fine. But no
>> index is generated and the printlns end like this:
>>
>> ...
>> 061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
>> 061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
>> Exception in thread "main" java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>>     at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
>>     at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
>>     at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
>>     at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
>>     at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:585)
>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
>>
>>
>> --------
>>
>> Any suggestions for this error? I am using a hadoop installation I
>> acquired with the current version of nutch, and am running the "all"
>> command as per the tutorial:
>>
>> ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
>>
>>
>> Thanks,
>> James
>>
>> -------------------------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services, security?
>> Get stuff done quickly with pre-integrated technology to make your job easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>> _______________________________________________
>> Archive-access-discuss mailing list
>> Arc...@li...
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss