Thread: [Archive-access-discuss] post-index collections


Hello,

I have been reading the documentation for NutchWAX, Nutch and Lucene, 
trying to figure out whether there is a way to do what I need: basically, 
to allow curators to "tag" particular archived web sites as belonging to a 
collection, for the purposes of restricting searches to that collection and 
generating web pages related to it.

The trick is that these collections can be defined at any time, 
post-harvest (Heritrix) and post-index (NutchWAX). Web sites can also 
belong to multiple collections. Sometimes the collections are hierarchical, 
sometimes they are not.

This is my thinking on it so far. I'm hoping that someone will step in with 
a better or more elegant way to do this.

If I restrict their collections to be defined by a set of seed URIs (rather 
than all archived URIs), I think it's more manageable. Picture a database 
(call it myDB) that manages a list of seed URIs, each associated with a 
unique seedId. I can perform a separate Heritrix crawl per seed URI. When I 
index the set of ARC files associated with a single seed URI, I can set 
the command-line "collection" field to the seedId. Then all URIs 
associated with a seed URI will have the same indexed "collection" value. 
(The word "collection" is overloaded in this post: here the NutchWAX 
collection field would only group together the sites that came from the 
same seed; it is not a curator-defined collection.)

Then in that separate database (myDB) I can manage associations, defined at 
any time, between curator-defined collections and the seedIds. I wouldn't 
want to add these collection "tags" to the index, because to do this I'd 
probably have to add the new collection values to the content in the ARC 
files and then write parse, index and query filters to handle the new 
field, right? Or is there a way to just add a field directly to the index 
for a set of seedIds?
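
For illustration, the myDB side of this could look something like the 
sketch below. It uses SQLite only as a stand-in, and every table, column 
and function name in it is hypothetical; nothing here comes from NutchWAX. 
It covers the three requirements described above: a seed can belong to 
several collections, collections may optionally be hierarchical, and 
associations can be added at any time without touching the index.

```python
import sqlite3

# Hypothetical schema for "myDB"; all table and column names are illustrative.
SCHEMA = """
CREATE TABLE seed (
    seed_id TEXT PRIMARY KEY,   -- value used as the "collection" field at index time
    uri     TEXT NOT NULL UNIQUE
);
CREATE TABLE curated_collection (
    name   TEXT PRIMARY KEY,
    parent TEXT REFERENCES curated_collection(name)  -- NULL if not nested
);
CREATE TABLE collection_seed (                       -- many-to-many association
    collection_name TEXT NOT NULL REFERENCES curated_collection(name),
    seed_id         TEXT NOT NULL REFERENCES seed(seed_id),
    PRIMARY KEY (collection_name, seed_id)
);
"""


def open_db(path=":memory:"):
    """Create the sketch schema in a new SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def seed_ids_for(conn, collection_name):
    """Return the seed IDs tagged with a collection or any of its sub-collections."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(name) AS (
            SELECT name FROM curated_collection WHERE name = ?
            UNION
            SELECT c.name
            FROM curated_collection c JOIN subtree s ON c.parent = s.name
        )
        SELECT DISTINCT cs.seed_id
        FROM collection_seed cs JOIN subtree s ON cs.collection_name = s.name
        ORDER BY cs.seed_id
        """,
        (collection_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Tagging a site post-index is then just an insert into collection_seed, and 
seed_ids_for(conn, "Asia") returns the seedIds that a search restricted to 
that collection would need to match.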

So the idea is to translate a user's search query into indexed fields using 
the associations in myDB. Say the user searches with the following, where 
myCollection is the field name for the curator's collection and is not a 
Lucene field:
myCollection:Asia cats
then this could be translated behind the scenes to a Lucene query:
collection:seed1 OR collection:seed7 OR collection:seed8 cats
(assuming the crawls associated with seeds 1, 7 and 8 were mapped to the 
Asia collection.)
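
A rough sketch of that translation step, as plain string rewriting done 
before the query is handed to the search engine. The dictionary stands in 
for a lookup in myDB, and the function and variable names are made up for 
this example. The parentheses around the OR clause are my own assumption, 
added so the seed alternatives are grouped separately from the other 
search terms; whether the query parser accepts grouping and OR at all is a 
separate question.

```python
import re

# Hypothetical mapping, standing in for a lookup in myDB.
COLLECTION_TO_SEEDS = {"Asia": ["seed1", "seed7", "seed8"]}

# Matches the curator-facing pseudo-field, e.g. "myCollection:Asia".
_MYCOLLECTION = re.compile(r"\bmyCollection:(\S+)")


def translate(query, mapping=COLLECTION_TO_SEEDS):
    """Rewrite myCollection:<name> terms into clauses on the indexed collection field."""

    def expand(match):
        name = match.group(1)
        seeds = mapping.get(name)
        if not seeds:
            raise KeyError(f"unknown curator collection: {name}")
        clause = " OR ".join(f"collection:{seed}" for seed in seeds)
        # Group multi-seed clauses so the OR does not absorb the other terms.
        return f"({clause})" if len(seeds) > 1 else clause

    return _MYCOLLECTION.sub(expand, query)


print(translate("myCollection:Asia cats"))
```

Terms that do not use the myCollection prefix pass through unchanged, so 
an ordinary search is not affected by the rewrite.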

I don't think NutchWAX supports OR queries yet, is that right?

Has anyone else figured out a different way to do this, or does anyone have 
a different idea?

thanks,
Andrea

