|
From: James G. <jg...@si...> - 2006-11-02 17:46:04
|
Greets,

I have been attempting to follow the tutorial to get NutchWAX up and running in standalone mode, but I've reached an error that confounds me.

The printlns seem to indicate that NutchWAX does successfully import the ARC files.

I see this line:

    opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz

And after many individual pages being imported, I see this line:

    061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz

This followed by more individual pages. So that seems fine. But no index is generated and the printlns end like this:

    ...
    061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
    061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
    Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
        at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
        at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
        at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
        at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:130)

--------

Any suggestions for this error? I am using a hadoop installation I acquired with the current version of nutch, and am running the "all" command as per the tutorial:

    ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test

Thanks,
James
|
From: Kaisa K. <kau...@cc...> - 2006-11-03 07:56:24
|
I had something similar and was given the advice to use the very latest version of nutchwax with hadoop-0.5.0 (and not hadoop-0.7.2, for example).
|
From: James G. <jg...@si...> - 2006-11-07 18:22:49
|
Just FYI,

My problem was resolved by switching to the nightly build of NutchWAX (at St.Ack's advice) and switching to the .5 version of hadoop (I think I was using 0.6.2).

I can now generate a search page properly.

A few problems remain, though.

1) All results link to a non-existent page:
   http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/

2) The "Other versions" link likewise directs me to example.com.

I have looked for a way to change that in the configuration, but couldn't find it. The "Other versions" link has me curious, though; is this going to be an integration point for something like WERA?

Additional problem:

3) Inaccurate "hits" count: the page claims to display results 1-3 out of 20, but the "next page" displays nothing ("Results 4-3"). This bug seems to originate from it not taking into account the pages hidden by the "more from cnn.com" link. Because I currently have just a single domain crawled, it's especially obvious.

...

Also, I was wondering: would implementing something like query expansion be accomplished in the same manner as it is in nutch? That is, would changing the nutch configuration file in the webapps directory to perform query expansion work in NutchWAX?

Thanks,
James
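The "Results 4-3" label in problem 3 can be reproduced with simple arithmetic. The sketch below is a hypothetical model, not NutchWAX's actual paging code: it assumes the label is built from a 0-based start offset plus the number of hits actually displayed, while the total (20) counts raw hits before same-site hits are collapsed. The names `results_label`, `page`, `raw_total`, and `visible` are invented for illustration.

```python
def results_label(start: int, shown: int) -> str:
    """Build a 'Results a-b' label from a 0-based offset and the
    number of hits actually displayed on the page."""
    return f"Results {start + 1}-{start + shown}"


def page(hits: list, start: int, per_page: int) -> list:
    """Return the slice of hits that belongs on one page."""
    return hits[start:start + per_page]


# Hypothetical numbers taken from the report: 20 raw hits, but only 3
# remain visible once hits from the same site are collapsed.
raw_total = 20
visible = ["hit-a", "hit-b", "hit-c"]
per_page = 10

first = page(visible, 0, per_page)
print(results_label(0, len(first)), "out of", raw_total)

# The "next page" starts right after the last displayed hit, but there
# is nothing left to show, so the range end falls below its start.
next_start = len(first)
second = page(visible, next_start, per_page)
print(results_label(next_start, len(second)))
```

Under these assumptions the first page reads "Results 1-3 out of 20" and the next page reads "Results 4-3", which matches the behaviour described: the total counts hits that the page never displays.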
|
From: Michael S. <st...@ar...> - 2006-11-07 18:46:42
|
James Grahn wrote:
> Just FYI,
> My problem was resolved by switching to the nightly build of NutchWAX
> (at St.Ack's advice) and switching to the .5 version of hadoop (I think
> I was using 0.6.2).
>
> I now can generate a search page properly.
>
> A few problems remain, though.
> 1) All results link to a non-existent page:
> http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/
>

Check out the 'Searching' section here: http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html. It doesn't mention the 'wax.host' property you'll need to change -- I'll fix that -- but this file, available in the src version of nutchwax, notes the property to change and others you might also want to change: http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/nutch/conf/hadoop-site.xml.template?revision=1.7

> 2) The "Other versions" link likewise directs me example.com
>
> I have looked for a way to change that in the configuration, but
> couldn't find it. The "Other versions" has me curious though; is this
> going to be an integration point for something like WERA?
>

By default, we only show the most recent version of a page. If there are multiple versions in an index, 'Other versions' will show them all (it sets hitsPerDup to '0', which says show all -- usually hitsPerDup is 1).

> Additional problem:
> 3) Inaccurate "hits" count: the page claims to display results 1-3 out
> of 20, but the "next page" displays nothing ("Results 4-3"). This bug
> seems to originate from it not taking into account the pages hidden by
> the "more from cnn.com". Because I currently just have a single-domain
> crawled, it's especially obvious.
>

Yeah. Known issue. Need to fix. Also in play is the fact that we only show one hit per site by default (add hitsPerSite=0 to your query string to confirm that this, rather than 'more', is the issue).

> ...
>
> Also, I was wondering; would implementing something like query expansion
> be accomplished in the same manner as it is in nutch? That is, would
> changing the nutch configuration file in the webapps directory to
> perform query expansion work in NutchWAX?
>

Nutchwax includes nearly all of nutch. The only reason it wouldn't work would be that we've not built in a plugin or some conf file, or that our jsp page diverges slightly from default nutch. If you let me know what's missing, I'll change the build scripts to include it.

Yours,
St.Ack
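For the 'wax.host' change discussed in this reply, the edit amounts to setting one property in a Hadoop-style site file. The following is a minimal sketch that assumes the usual `<configuration>`/`<property>`/`<name>`/`<value>` layout of Hadoop site files. The template text and the replacement value `archive.example.org:8080` are placeholders invented here, not the contents of the real hadoop-site.xml.template, and `set_property` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml: str, name: str, value: str) -> str:
    """Set (or add) a <property> in a Hadoop-style configuration
    document and return the updated XML as a string."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No property with that name yet: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Stand-in for the site file; the real template may differ (assumption).
template = """<configuration>
  <property>
    <name>wax.host</name>
    <value>example.com</value>
  </property>
</configuration>"""

# Placeholder value: substitute whatever host should serve the pages.
updated = set_property(template, "wax.host", "archive.example.org:8080")
print(updated)
```

The exact form the value should take (bare host, host and port, or a path) is best checked against the description in the template file linked above; the sketch only shows the mechanics of the edit.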