Thread: [Archive-access-discuss] problems running in standalone mode

[Archive-access-discuss] problems running in standalone mode

From: James G. <jg...@si...> - 2006-11-02 17:46:04

Greets,
I have been attempting to follow the tutorial to get NutchWAX up and 
running in standalone mode, but I've reached an error that confounds me.

The printlns seem to indicate that NutchWAX does successfully import the 
ARC files.

I see this line:
opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz

And after many individual pages being imported, I see this line:

061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz

This is followed by more individual pages.   So that seems fine.   But no 
index is generated and the printlns end like this:

...
061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
Exception in thread "main" java.io.IOException: Job failed!
         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
         at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
         at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
         at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
         at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
         at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:585)
         at org.apache.hadoop.util.RunJar.main(RunJar.java:130)


--------

Any suggestions for this error?   I am using a hadoop installation I 
acquired with the current version of nutch, and am running the "all" 
command as per the tutorial:

${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test


Thanks,
James

Re: [Archive-access-discuss] problems running in standalone mode

From: Kaisa K. <kau...@cc...> - 2006-11-03 07:56:24

I had something similar and was advised to use
the very latest version of nutchwax with hadoop-0.5.0
(and not hadoop-0.7.2, for example).



Re: [Archive-access-discuss] problems running in standalone mode

From: James G. <jg...@si...> - 2006-11-07 18:22:49

Just FYI,
My problem was resolved by switching to the nightly build of NutchWAX 
(at St.Ack's advice) and switching to version 0.5 of hadoop (I think 
I was using 0.6.2).

I now can generate a search page properly.

A few problems remain, though.
1) All results link to a non-existent page: 
http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/

2) The "Other versions" link likewise directs me to example.com.

I have looked for a way to change that in the configuration, but 
couldn't find it.   The "Other versions" has me curious though; is this 
going to be an integration point for something like WERA?

Additional problem:
3) Inaccurate "hits" count: the page claims to display results 1-3 out 
of 20, but the "next page" displays nothing ("Results 4-3").   This bug 
seems to come from the count not taking into account the pages hidden 
behind the "more from cnn.com" link.   Because I currently have just a 
single domain crawled, it's especially obvious.

...

Also, I was wondering; would implementing something like query expansion 
be accomplished in the same manner as it is in nutch?   That is, would 
changing the nutch configuration file in the webapps directory to 
perform query expansion work in NutchWAX?

Thanks,
James


Re: [Archive-access-discuss] problems running in standalone mode

From: Michael S. <st...@ar...> - 2006-11-07 18:46:42

James Grahn wrote:
> Just FYI,
> My problem was resolved by switching to the nightly build of NutchWAX 
> (at St.Ack's advice) and switching to the .5 version of hadoop (I think 
> I was using 0.6.2).
>
> I now can generate a search page properly.
>
> A few problems remain, though.
> 1) All results link to a non-existent page: 
> http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/
>   
Check out the 'Searching' section here: 
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html. 
It doesn't mention the 'wax.host' property you'll need to change 
-- I'll fix that -- but this file, available in the src 
version of nutchwax, notes the property to change and others you 
might also want to change: 
http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/nutch/conf/hadoop-site.xml.template?revision=1.7
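
As a rough sketch only, an override of that property in hadoop-site.xml 
might look like the fragment below. It uses the standard Hadoop 
property layout; the value shown is a placeholder assumption (a 
host[:port] of your own server in place of the example.com default), 
so check the template linked above for the exact form it expects.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder value (assumption): replace with the host that
       should appear in result links instead of example.com -->
  <property>
    <name>wax.host</name>
    <value>localhost:8080</value>
  </property>
</configuration>
```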


> 2) The "Other versions" link likewise directs me example.com
>
> I have looked for a way to change that in the configuration, but 
> couldn't find it.   The "Other versions" has me curious though; is this 
> going to be an integration point for something like WERA?
>
>   
By default, we only show the most recent version of a page. If there 
are multiple versions in an index, "Other versions" shows all of them 
(it sets hitsPerDup to '0', which says show all -- usually hitsPerDup 
is 1).

> Additional problem:
> 3) Inaccurate "hits" count: the page claims to display results 1-3 out 
> of 20, but the "next page" displays nothing ("Results 4-3").   This bug 
> seems to originate from it not taking into account the pages hidden by 
> the "more from cnn.com".   Because I currently just have a single-domain 
> crawled, it's especially obvious.
>   
Yeah. Known issue. Need to fix. Also in play is the fact that we only 
show one hit per site by default (add hitsPerSite=0 to your query string 
to confirm that this, rather than 'more', is the issue).
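
A minimal sketch of that check, assuming the search page lives at a 
local URL and takes the search terms in a 'query' parameter. The host, 
port, webapp path and search term below are all hypothetical; only 
hitsPerSite=0 comes from the advice above.

```shell
# Hypothetical location of the NutchWAX search page (assumption).
base_url="http://localhost:8080/nutchwax/search.jsp"
# Hypothetical search term (assumption).
query="cnn"

# hitsPerSite=0 asks for all hits per site instead of the default of one.
search_url="${base_url}?query=${query}&hitsPerSite=0"

# Print the URL so it can be pasted into a browser.
echo "${search_url}"
```

If the hit count and the paging agree once hitsPerSite=0 is set, the 
per-site collapsing is the likely cause of the mismatch.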

> ...
>
> Also, I was wondering; would implementing something like query expansion 
> be accomplished in the same manner as it is in nutch?   That is, would 
> changing the nutch configuration file in the webapps directory to 
> perform query expansion work in NutchWAX?
>   
NutchWAX includes nearly all of nutch. The only reason it wouldn't work 
would be that we've not built in a plugin or some conf file, or that our 
jsp page diverges slightly from default nutch. If you let me know what's 
missing, I'll change the build scripts to include it.

Yours,
St.Ack

