Re: [Archive-access-discuss] post-index collections


Hi Michael,

>>I have been reading documentation for nutchwax, nutch and lucene trying 
>>to figure out if there's a way to do what I need to do: basically to 
>>allow curators to "tag" particular archived web sites as belonging to a 
>>collection for the purposes of restricting searches to that collection 
>>and for generating web pages related to that collection.
>>
>>The trick is that these collections can be defined at any time, 
>>post-harvest (heritrix), post-index (nutchwax). And web sites can belong 
>>to multiple collections. Sometimes the collections are hierarchical, 
>>sometimes they are not.
>>
>>This is my thinking on it so far. I'm hoping that someone will step in 
>>with a better or more elegant way to do this.
>>
>>If I restrict their collections to be defined by a set of seed URIs 
>>(rather than all archived URIs) I think it's more manageable. Picture a 
>>database (call it myDB) that manages a list of seed URIs, each associated 
>>with a unique seedId. I can perform separate heritrix crawls per seed 
>>URI. When I index that set of ARC files (associated with a single seed 
>>URI) I can set the command-line "collection" field to the seed ID. Then 
>>all URIs associated with a seed URI will have the same indexed 
>>"collection" value. This posting might be hard to read because the word 
>>collection is overused - the nutchwax collection field would be used in 
>>this case to group together sites that came from the same seed.
>>
>>Then in that separate database (myDB) I can manage associations defined 
>>at any time between curator-defined collections and the seedIds. I 
>>wouldn't want to add these collection "tags" to the index because to do 
>>this I'd probably have to add these new collection values to the content 
>>in the ARC files, then write a parse, index and query filter to handle 
>>the new field, right? Or is there a way to just add a field directly to 
>>the index for a set of seedIds?
>
>Not at the moment but I've been playing and we could add a new step that 
>did nothing but read from a data source and add metadata from the data 
>source to the Nutch(WAX) segment (In particular, rewrite the parse_data 
>file in the segment, the file that holds the 'metadata' such as fetch 
>time, etc.).
>
>So, you wouldn't have to touch the ARCs, just the product of the 
>Nutch(WAX) parse.
>
>You could tag a page as being of multiple collections: E.g. of collection 
>1, 7 and 8.
>
>After adding the metadata, you'd have to reindex.
>
>Would that work for you?

That would be great! There is a need in general (at least for us - probably 
for others?) to be able to add metadata to already-harvested content. We 
have another situation like this where the curators would want to add 
subjects to particular URLs. So we could come up with a generic solution to 
this - add any field (e.g. collection, subject), tell it which ? to apply 
this to. Would the new fields be associated at the ARC-level or URI level?

>>So the idea is to translate a user's search query into indexed fields 
>>using the associations in myDB. Say the user searches with (and 
>>myCollection is the field name for the curator's collection, which isn't 
>>a lucene field):
>>myCollection:Asia cats
>>then this could be translated behind-the-scenes to a lucene query:
>>collection:seed1 OR collection:seed7 OR collection:seed8 cats
>>(assuming the crawls associated with seeds 1, 7 and 8 were mapped to the 
>>Asia collection.)
>>
>>I don't think nutchWAX can support the OR queries yet, is that right?
>That's right.  No OR yet.
>
>But, we're sort of having a similar problem to you here at the archive 
>(archiveit.org in particular).
>
>They have done something similar to your idea in that they have tried to 
>add in a little indirection, naming collections by ID instead of explicitly.
>Querying one collection works now, as does querying all collections, but 
>querying a couple of collections is awkward.  One thought is to amend the 
>collection query-time plugin so it can take a list of collections: E.g. 
>'collection:asia,europe,australia cats'.  This would find instances of 
>cats in all three listed collections.  Would break if the list of 
>collections was in the hundreds I'd imagine.  And it's not what you want.

I think that could work. To reduce the query length problem you could 
hide the actual query syntax from the user. Like the archiveit.org way, you 
could use the collection IDs in the query to keep the query shorter:
'collection:1,7,8 cats'
by either translating the user's selection of asia, europe and australia 
from a drop-down list, or translating the user's typed-in 
collection:asia,europe,australia to collection:1,7,8 before the query is 
executed.
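
A minimal sketch of that translation step, in Python for brevity. The 
name-to-ID mapping is hypothetical (in practice it would be looked up in 
myDB), and the comma-separated 'collection:' syntax is the proposed one, 
not something NutchWAX supports today:

```python
import re

# Hypothetical mapping from curator-facing collection names to IDs.
# In practice this would come from myDB; the values are illustrative only.
COLLECTION_IDS = {"asia": 1, "europe": 7, "australia": 8}

# Matches a clause such as "collection:asia,europe,australia".
_CLAUSE = re.compile(r"\bcollection:(\S+)")


def translate_query(query, ids=COLLECTION_IDS):
    """Rewrite collection names in a query to their IDs before execution."""
    def replace(match):
        names = match.group(1).split(",")
        try:
            mapped = [str(ids[name.lower()]) for name in names]
        except KeyError as err:
            raise ValueError("unknown collection: %s" % err.args[0])
        return "collection:" + ",".join(mapped)

    return _CLAUSE.sub(replace, query)


print(translate_query("collection:asia,europe,australia cats"))
# prints: collection:1,7,8 cats
```

The rest of the query is passed through untouched, so only the 
collection clause would need to be understood by the front end.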

>I suppose you could do 3 separate queries aggregating the results?
>Would that be onerous?

I'll probably try the single query with a list first, to avoid having to 
deal with ordering the results.
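
For comparison, a rough sketch of what the aggregation would involve if we 
did run separate queries. It assumes each query returns (url, score) pairs, 
which is a made-up shape for illustration, and it assumes the scores from 
different queries can be compared directly, which may well not hold:

```python
def merge_results(result_lists, limit=10):
    """Merge per-collection hit lists into one list ordered by score.

    Each hit is a (url, score) pair. A URL returned by more than one
    query is kept once, with its highest score.
    """
    best = {}
    for hits in result_lists:
        for url, score in hits:
            if url not in best or score > best[url]:
                best[url] = score
    # Sort by descending score, then by URL so ties come out in a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Illustrative data only, not real search results.
asia = [("http://a.example/cats", 0.9), ("http://b.example/", 0.4)]
europe = [("http://c.example/", 0.7), ("http://a.example/cats", 0.5)]
for url, score in merge_results([asia, europe]):
    print(score, url)
```

The merge itself is simple; the hard part is whether the scores mean the 
same thing across queries, which is why the single list query looks easier.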

thanks,
Andrea

>St.Ack
>