Re: [Archive-access-discuss] Search multiple versions of one URL


I'm currently having the same problem that Natalia initially had...

I'm using the nightly build (from a few days back) of nutchwax and am 
trying to build an index that will be used by wera.

It seems to me that if you are going to store the crawls under different 
collection names, then you have to do multiple imports (with differing 
collection names), before proceeding through update, invert, index, 
dedup, and merge.

I have been attempting to do this with multiple collections, using the 
optional "segments" arguments to keep the tools aware of the multiple 
collections.

I've gone through several permutations of the command line arguments but 
have not had any luck yet; what's the proper sequence of commands to get 
this running?

Thanks,
James

Michael Stack wrote:
> Natalia Torres wrote:
>> Hello
>>
>> I'm trying nutchwax+wera with multiple crawls of some web pages. After 
>> indexing them I can't see them all in wera. The Overview page only shows 
>> one crawl date. For us that's an important issue.
>>
>> I found it reported as a bug from July in the NutchWAX bug list 
>> (1518431 - Search multiple versions of one URL broken).
>>
>> Is a new version coming soon? How can I solve it?
>>
> Did you give each crawl a different collection name or are they indexed 
> all with the same collection name?
> 
> In nutch, the URL for a page is used as the key in mapreduce processing 
> (keys are used to identify records and must be unique).  That means you 
> can only have one copy of any given URL in a nutch index.  While a URL 
> as primary key is far from optimal, it's convenient having the key be a 
> URL: it makes the URL easily available at various points during 
> indexing.
> 
> In nutchwax, we've made it so that the key is collection-name + URL, so 
> you can have multiple copies of a URL as long as they are in different 
> collections.  This is a climb-down from how it used to work in nutchwax 
> -- pre-mapreduce -- where you could have multiple copies of a URL 
> distinguished by date alone.
> 
> I'm wondering if a key of collection-name+URL is sufficient?  It means 
> that when indexing, collection names must be carefully chosen.  
> Otherwise, we need to make the key uglier still: 
> collection-name+URL+date.
> 
> Yours,
> St.Ack
> P.S.  Yes a new release is imminent.
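
For anyone following the keying discussion quoted above, here is a toy 
sketch in Python of why the choice of key decides how many versions of a 
URL survive in an index. It is not NutchWAX code: the Capture record, the 
index_by helper, the collection names and the example URL are all made up 
for illustration, and a plain dict stands in for the index.

```python
from collections import namedtuple

# Hypothetical record for one archived capture of a page.
Capture = namedtuple("Capture", ["collection", "url", "date"])


def index_by(captures, key_fn):
    """Build a dict keyed by key_fn(capture).

    Keys must be unique, so a later capture with the same key
    overwrites the earlier one.
    """
    index = {}
    for capture in captures:
        index[key_fn(capture)] = capture
    return index


# Three captures of the same URL: two collections, three dates.
captures = [
    Capture("crawl-a", "http://example.org/", "20060701"),
    Capture("crawl-b", "http://example.org/", "20060801"),
    Capture("crawl-b", "http://example.org/", "20060815"),
]

# Key is the URL alone: only one version survives.
by_url = index_by(captures, lambda c: c.url)

# Key is collection-name + URL: one version per collection.
by_collection_url = index_by(captures, lambda c: (c.collection, c.url))

# Key is collection-name + URL + date: every capture survives.
by_collection_url_date = index_by(
    captures, lambda c: (c.collection, c.url, c.date)
)

print(len(by_url), len(by_collection_url), len(by_collection_url_date))
# prints: 1 2 3
```

In this model, two crawls imported under the same collection name collapse 
to a single entry per URL, which would be consistent with the single crawl 
date reported on the Overview page, assuming that is how the crawls were 
imported.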