Re: [Archive-access-discuss] post-index collections

Andrea Goethals wrote:
.....
>>
>> Not at the moment but I've been playing and we could add a new step 
>> that did nothing but read from a data source and add metadata from 
>> the data source to the Nutch(WAX) segment (In particular, rewrite the 
>> parse_data file in the segment, the file that holds the 'metadata' 
>> such as fetch time, etc.).
>>
>> So, you wouldn't have to touch the ARCs, just the product of the 
>> Nutch(WAX) parse.
>>
>> You could tag a page as being of multiple collections: E.g. of 
>> collection 1, 7 and 8.
>>
>> After adding the metadata, you'd have to reindex.
>>
>> Would that work for you?
>
> That would be great! There is a need in general (at least for us - 
> probably for others?) to be able to add metadata to already-harvested 
> content. We have another situation like this where the curators would 
> want to add subjects to particular URLs. So we could come up with a 
> generic solution to this - add any field (e.g. collection, subject), 
> tell it which ? to apply this to. Would the new fields be associated 
> at the ARC-level or URI level?

At URI level.
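
To make "at URI level" concrete, here is a rough sketch in plain Python of a
per-URI table of multi-valued fields. It is not NutchWAX code and does not
touch a segment or its parse_data file; the class name UriMetadata, its
methods and the example URL are all made up for illustration.

```python
from collections import defaultdict


class UriMetadata:
    """Hypothetical per-URI store of multi-valued fields (not a NutchWAX API)."""

    def __init__(self):
        # uri -> field name -> set of values, so a page can be in many collections
        self._fields = defaultdict(lambda: defaultdict(set))

    def add(self, uri, field, value):
        """Attach one value of a field (e.g. collection, subject) to a URI."""
        self._fields[uri][field].add(str(value))

    def fields_for(self, uri):
        """Return the fields recorded for a URI, values sorted as strings."""
        recorded = self._fields.get(uri)  # .get() avoids creating empty entries
        if recorded is None:
            return {}
        return {name: sorted(values) for name, values in recorded.items()}


if __name__ == "__main__":
    meta = UriMetadata()
    for collection_id in (1, 7, 8):
        meta.add("http://example.org/cats.html", "collection", collection_id)
    meta.add("http://example.org/cats.html", "subject", "cats")
    print(meta.fields_for("http://example.org/cats.html"))
```

Because the key is the URI, the same table can hold any field name
(collection, subject, ...), and nothing is recorded per ARC.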

I've added an RFE:
http://sourceforge.net/tracker/index.php?func=detail&aid=1475667&group_id=118427&atid=681140.
I'll start in on it after the 0.6.0 release of nutchwax (should be any
week now, though I've been saying that for a while...).

...
>
> I think that that could work. To reduce the query length problem you 
> could hide the actual query syntax from the user. Like the 
> archiveit.org way, you could use the collection IDs in the query to
> keep the query shorter:
> 'collection:1,7,8 cats'
> by either translating the user's selection of asia, europe and
> australia from a drop-down list, or translating the user's typed-in
> collection:asia,europe,australia to collection:1,7,8 before the query
> is executed.
>
Yes.  That sounds right.
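
A minimal sketch of that translation step, in plain Python. The name-to-ID
table is an assumption (asia=1, europe=7, australia=8 just mirror the example
above), and the function names are hypothetical, not part of NutchWAX or the
archiveit.org code.

```python
import re

# Hypothetical name-to-ID table; the IDs only mirror the example in this thread.
COLLECTION_IDS = {"asia": 1, "europe": 7, "australia": 8}

# Matches a 'collection:' term and captures its comma-separated value list.
_COLLECTION_TERM = re.compile(r"\bcollection:(\S+)")


def translate_query(query, ids=COLLECTION_IDS):
    """Rewrite collection names in a typed-in query to their numeric IDs."""

    def _replace(match):
        parts = []
        for name in match.group(1).split(","):
            key = name.strip().lower()
            if key.isdigit():
                parts.append(key)  # already an ID, pass it through
            elif key in ids:
                parts.append(str(ids[key]))
            else:
                raise KeyError("unknown collection: %r" % name)
        return "collection:" + ",".join(parts)

    return _COLLECTION_TERM.sub(_replace, query)


def query_from_selection(selected, terms, ids=COLLECTION_IDS):
    """Build the short query from drop-down selections plus the search terms."""
    id_list = ",".join(str(ids[name.lower()]) for name in selected)
    return "collection:%s %s" % (id_list, terms)


if __name__ == "__main__":
    print(translate_query("collection:asia,europe,australia cats"))
    print(query_from_selection(["Asia", "Europe", "Australia"], "cats"))
```

Both paths end in the same short form, so the user never has to see or type
the numeric IDs.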

St.Ack