From: Andrea G. <and...@ha...> - 2006-04-21 18:54:17
Hello,

I have been reading the documentation for NutchWAX, Nutch and Lucene, trying to figure out whether there is a way to do what I need: to let curators "tag" particular archived web sites as belonging to a collection, both to restrict searches to that collection and to generate web pages related to it. The trick is that these collections can be defined at any time, post-harvest (Heritrix), post-index (NutchWAX), and a web site can belong to multiple collections. Sometimes the collections are hierarchical, sometimes they are not.

This is my thinking so far. I'm hoping that someone will step in with a better or more elegant way to do this.

If I restrict the collections to be defined by a set of seed URIs (rather than all archived URIs), I think it's more manageable. Picture a database (call it myDB) that manages a list of seed URIs, each associated with a unique seedId. I can perform a separate Heritrix crawl per seed URI. When I index the set of ARC files associated with a single seed URI, I can set the command-line "collection" field to the seedId. Then all URIs harvested from that seed URI will have the same indexed "collection" value.

This posting might be hard to read because the word "collection" is overused: the NutchWAX collection field would be used in this case to group together sites that came from the same seed. Then, in that separate database (myDB), I can manage associations, defined at any time, between curator-defined collections and the seedIds.

I wouldn't want to add these collection "tags" to the index, because to do this I'd probably have to add the new collection values to the content in the ARC files, then write a parse, index and query filter to handle the new field, right? Or is there a way to just add a field directly to the index for a set of seedIds?

So the idea is to translate a user's search query into indexed fields using the associations in myDB.
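
To make the myDB idea concrete, here is a minimal sketch of how the seed list and the curator associations might be stored. It assumes SQLite purely for illustration; the table names, column names and helper functions are all hypothetical, and hierarchical collections are not modeled here.

```python
import sqlite3


def create_mydb(conn):
    """Create the (hypothetical) myDB tables on an open SQLite connection."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        -- One row per seed URI; seed_id is the value indexed as "collection".
        CREATE TABLE seed (
            seed_id  TEXT PRIMARY KEY,
            seed_uri TEXT NOT NULL UNIQUE
        );
        -- Curator-defined collections, which can be created at any time.
        CREATE TABLE curated_collection (
            name TEXT PRIMARY KEY
        );
        -- Many-to-many link: a seed may belong to several collections.
        CREATE TABLE collection_seed (
            collection_name TEXT NOT NULL REFERENCES curated_collection(name),
            seed_id         TEXT NOT NULL REFERENCES seed(seed_id),
            PRIMARY KEY (collection_name, seed_id)
        );
        """
    )


def add_seed(conn, seed_id, seed_uri):
    """Register a seed URI under its unique seedId."""
    conn.execute("INSERT INTO seed (seed_id, seed_uri) VALUES (?, ?)",
                 (seed_id, seed_uri))


def tag_seed(conn, collection_name, seed_id):
    """Associate a seed with a curator collection, creating the collection if needed."""
    conn.execute("INSERT OR IGNORE INTO curated_collection (name) VALUES (?)",
                 (collection_name,))
    conn.execute(
        "INSERT OR IGNORE INTO collection_seed (collection_name, seed_id) VALUES (?, ?)",
        (collection_name, seed_id),
    )


def seed_ids_for(conn, collection_name):
    """Return the seedIds currently mapped to a curator collection, sorted."""
    rows = conn.execute(
        "SELECT seed_id FROM collection_seed WHERE collection_name = ? ORDER BY seed_id",
        (collection_name,),
    )
    return [row[0] for row in rows]
```

Because the associations live outside the index, tagging a seed after the crawl and the indexing run only touches myDB; nothing in the ARC files or the index has to change.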
Say the user searches with the following, where myCollection is the field name for the curator's collection (it isn't a Lucene field):

    myCollection:Asia cats

Then this could be translated behind the scenes to a Lucene query:

    collection:seed1 OR collection:seed7 OR collection:seed8 cats

(assuming the crawls associated with seeds 1, 7 and 8 were mapped to the Asia collection).

I don't think NutchWAX can support OR queries yet, is that right?

Has anyone else figured out a different way to do this, or does anyone have a different idea?

thanks,
Andrea
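
The query translation described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `translate_query` and the plain dictionary standing in for a myDB lookup are hypothetical, and the sketch only rewrites the query string; it says nothing about whether the search front end accepts OR clauses.

```python
import re

# Matches a curator-collection term such as "myCollection:Asia".
# "myCollection" is the pseudo-field from the post, not an indexed field.
_CURATOR_TERM = re.compile(r"\bmyCollection:(\S+)")


def translate_query(query, seeds_by_collection):
    """Rewrite myCollection:<name> terms into OR-ed collection:<seedId> terms.

    seeds_by_collection is a stand-in for a myDB lookup: a dict mapping a
    curator collection name to the list of seedIds associated with it.
    """
    def expand(match):
        name = match.group(1)
        seed_ids = seeds_by_collection.get(name)
        if not seed_ids:
            raise KeyError(f"no seeds mapped to curator collection {name!r}")
        return " OR ".join(f"collection:{seed_id}" for seed_id in seed_ids)

    return _CURATOR_TERM.sub(expand, query)


if __name__ == "__main__":
    mapping = {"Asia": ["seed1", "seed7", "seed8"]}
    print(translate_query("myCollection:Asia cats", mapping))
```

With Lucene's classic query parser, wrapping the OR-ed clauses in parentheses would make the grouping against the remaining terms explicit; the sketch leaves them out so that the output matches the form written above.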