Re: [Archive-access-discuss] problems running in standalone mode


James Grahn wrote:
> Just FYI,
> My problem was resolved by switching to the nightly build of NutchWAX 
> (at St.Ack's advice) and switching to the 0.5 version of Hadoop (I think 
> I was using 0.6.2).
>
> I now can generate a search page properly.
>
> A few problems remain, though.
> 1) All results link to a non-existent page: 
> http://example.com/test/--dateOfCrawl--/http://--actualwebsite--.com/
>   
Check out the 'Searching' section here: 
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html. 
It doesn't mention the 'wax.host' property you'll need to change 
-- I'll fix that -- but this file, available in the src 
version of nutchwax, notes that property and others you 
might also want to change: 
http://archive-access.cvs.sourceforge.net/*checkout*/archive-access/archive-access/projects/nutch/conf/hadoop-site.xml.template?revision=1.7
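
As a rough illustration of the change, here is a minimal Python sketch that sets a property in Hadoop-style configuration XML (the usual <configuration>/<property>/<name>/<value> layout). The helper name set_property, the sample template text, and the value "archive.example.org:8080" are all assumptions made for the example; check the real hadoop-site.xml.template for the actual contents and the expected form of the wax.host value.

```python
import xml.etree.ElementTree as ET


def set_property(conf_xml: str, name: str, value: str) -> str:
    """Set (or add) a <property> entry in Hadoop-style configuration XML."""
    root = ET.fromstring(conf_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No existing entry with that name: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed stand-in for the template; the real file has more properties.
template = """<configuration>
  <property>
    <name>wax.host</name>
    <value>example.com</value>
  </property>
</configuration>"""

# Hypothetical host value; substitute whatever your setup requires.
updated = set_property(template, "wax.host", "archive.example.org:8080")
print(updated)
```

Editing the file by hand works just as well; the sketch is only meant to show which element to change.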


> 2) The "Other versions" link likewise directs me to example.com.
>
> I have looked for a way to change that in the configuration, but 
> couldn't find it.   The "Other versions" has me curious though; is this 
> going to be an integration point for something like WERA?
>
>   
By default, we'll only show the most recent version of a page. If there 
are multiple versions in an index, the "Other versions" link shows them 
all (it sets hitsPerDup to '0', which says show all -- usually hitsPerDup 
is 1).

> Additional problem:
> 3) Inaccurate "hits" count: the page claims to display results 1-3 out 
> of 20, but the "next page" displays nothing ("Results 4-3").   This bug 
> seems to originate from it not taking into account the pages hidden by 
> the "more from cnn.com".   Because I currently just have a single-domain 
> crawled, it's especially obvious.
>   
Yeah. Known issue. Need to fix. Also in play is the fact that we only 
show one hit per site by default (add hitsPerSite=0 to your query string 
to confirm that this, rather than 'more', is the issue).
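
To make the hitsPerSite check concrete, here is a small Python sketch that appends the parameter to a search URL. The base URL (host, port, webapp path) and the "query" parameter name are assumptions for illustration, as is the helper name with_params; use whatever URL your search page actually produces.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_params(url: str, **params) -> str:
    """Return url with the given query parameters set, replacing any existing ones."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({key: str(val) for key, val in params.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical search URL; adjust host, port and path to your deployment.
search_url = "http://localhost:8080/nutchwax/search.jsp?query=cnn"

# hitsPerSite=0 turns off the one-hit-per-site grouping.
print(with_params(search_url, hitsPerSite=0))
# prints http://localhost:8080/nutchwax/search.jsp?query=cnn&hitsPerSite=0
```

If the total and the listed results agree once grouping is off, the per-site grouping is what throws off the count.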

> ...
>
> Also, I was wondering; would implementing something like query expansion 
> be accomplished in the same manner as it is in nutch?   That is, would 
> changing the nutch configuration file in the webapps directory to 
> perform query expansion work in NutchWAX?
>   
Nutchwax includes nearly all of nutch. The only reason it wouldn't work 
would be that we've not built in a plugin or some conf file, or that our 
jsp page diverges slightly from default nutch. If you let me know what's 
missing, I'll change the build scripts to include it.

Yours,
St.Ack


> Thanks,
> James
>
> James Grahn wrote:
>   
>> Greets,
>> I have been attempting to follow the tutorial to get NutchWAX up and 
>> running in standalone mode, but I've reached an error that confounds me.
>>
>> The printlns seem to indicate that NutchWAX does successfully import the 
>> ARC files.
>>
>> I see this line:
>> opening /tmp/mirror/heretrix/IAH-20061026194403-00000.arc.gz
>>
>> And after many individual pages being imported, I see this line:
>>
>> 061102 115327 opening /tmp/mirror/heretrix/IAH-20061026194522-00001.arc.gz
>>
>> This followed by more individual pages.   So that seems fine.   But no 
>> index is generated and the printlns end like this:
>>
>> ...
>> 061102 115345 adding http://www.cnn.com/CNN/Programs/student.news/ 24869 text/html
>> 061102 115345 adding http://www.cnn.com/CNN/Programs/people/ 367 text/html
>> Exception in thread "main" java.io.IOException: Job failed!
>>          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
>>          at org.archive.access.nutch.ImportArcs.importArcs(ImportArcs.java:519)
>>          at org.archive.access.nutch.IndexArcs.doImport(IndexArcs.java:154)
>>          at org.archive.access.nutch.IndexArcs.doAll(IndexArcs.java:139)
>>          at org.archive.access.nutch.IndexArcs.doJob(IndexArcs.java:246)
>>          at org.archive.access.nutch.IndexArcs.main(IndexArcs.java:439)
>>          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>          at java.lang.reflect.Method.invoke(Method.java:585)
>>          at org.apache.hadoop.util.RunJar.main(RunJar.java:130)
>>
>>
>> --------
>>
>> Any suggestions for this error?   I am using a hadoop installation I 
>> acquired with the current version of nutch, and am running the "all" 
>> command as per the tutorial:
>>
>> ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar all /tmp/inputs /tmp/outputs test
>>
>>
>> Thanks,
>> James
>>
>> -------------------------------------------------------------------------
>> Using Tomcat but need to do more? Need to support web services, security?
>> Get stuff done quickly with pre-integrated technology to make your job easier
>> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>> _______________________________________________
>> Archive-access-discuss mailing list
>> Arc...@li...
>> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
>>
>>
>>     