From: Michael S. <st...@ar...> - 2006-11-08 17:24:04
Natalia Torres wrote:
> Hello
>
> I'm trying nutchwax+wera with multiple crawls of some web pages. After
> indexing them I can't see them in wera. The Overview page only shows one crawl
> date. For us that's an important issue.
>
> I found it as a bug from July in the Nutchwax bug list (1518431 - Search
> multiple versions of one URL broken).
>
> Is there a new version coming soon? How can I solve it?

Did you give each crawl a different collection name, or are they all indexed with the same collection name?

In nutch, the URL for a page is used as the key in mapreduce processing (keys are used to identify records and must be unique). That means you can only have one record per URL in a nutch index. While a URL as primary key is far from optimal, it's convenient having the key be a URL: it keeps the URL easily available at various points during indexing.

In nutchwax, we've made the key collection-name + URL, so you can have multiple copies of a URL as long as they are in different collections. This is a climb-down from how it used to work in nutchwax (pre-mapreduce), where you could have multiple copies of a URL distinguished by date alone.

I'm wondering if a key of collection-name+URL is sufficient? It means that when indexing, collection names must be carefully chosen. Otherwise, we need to make the key uglier still: collection-name+URL+date.

Yours,
St.Ack

P.S. Yes, a new release is imminent.
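
To illustrate the key choices discussed above, here is a rough sketch in plain Python. It is not nutch or nutchwax code; the `Capture` record, the `index_by` helper, the collection names, and the example URL are all made up for illustration. It only shows how many captures of one URL survive in a store that requires unique keys, depending on what goes into the key.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only
# and do not reflect the actual nutchwax index schema.
Capture = namedtuple("Capture", ["collection", "url", "date"])


def index_by(captures, key_fn):
    """Keep the last capture seen for each key, as a unique-key store would."""
    index = {}
    for capture in captures:
        index[key_fn(capture)] = capture
    return index


# Three captures of the same URL: one in an October collection,
# two (on different dates) in a November collection.
captures = [
    Capture("crawl-2006-10", "http://example.org/", "20061001"),
    Capture("crawl-2006-11", "http://example.org/", "20061101"),
    Capture("crawl-2006-11", "http://example.org/", "20061108"),
]

# Key = URL: only one capture per URL survives.
by_url = index_by(captures, lambda c: c.url)

# Key = collection-name + URL: one capture per collection survives.
by_collection_url = index_by(captures, lambda c: (c.collection, c.url))

# Key = collection-name + URL + date: every capture survives.
by_collection_url_date = index_by(
    captures, lambda c: (c.collection, c.url, c.date)
)

print(len(by_url), len(by_collection_url), len(by_collection_url_date))
```

Under these assumptions, two crawls of the same page indexed under one collection name collapse to a single entry, which would be consistent with seeing only one crawl date on the Overview page.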