From: Michael S. <st...@ar...> - 2006-04-24 17:13:32
Andrea Goethals wrote:
.....
>>
>> Not at the moment but I've been playing and we could add a new step
>> that did nothing but read from a data source and add metadata from
>> the data source to the Nutch(WAX) segment (In particular, rewrite the
>> parse_data file in the segment, the file that holds the 'metadata'
>> such as fetch time, etc.).
>>
>> So, you wouldn't have to touch the ARCs, just the product of the
>> Nutch(WAX) parse.
>>
>> You could tag a page as being of multiple collections: E.g. of
>> collection 1, 7 and 8.
>>
>> After adding the metadata, you'd have to reindex.
>>
>> Would that work for you?
>
> That would be great! There is a need in general (at least for us -
> probably for others?) to be able to add metadata to already-harvested
> content. We have another situation like this where the curators would
> want to add subjects to particular URLs. So we could come up with a
> generic solution to this - add any field (e.g. collection, subject),
> tell it which ? to apply this to. Would the new fields be associated
> at the ARC-level or URI level?

At URI level. I've added an RFE:
http://sourceforge.net/tracker/index.php?func=detail&aid=1475667&group_id=118427&atid=681140.
I'll start in on it after the 0.6.0 release of nutchwax (should be any
week now -- but I've been saying that for a while...).

...
>
> I think that that could work. To reduce the query length problem you
> could hide the actual query syntax from the user. Like the
> archiveit.org way, you could use the collection IDs in the query to
> keep the query shorter:
> 'collection:1,7,8 cats'
> by either translating the user's selection of asia, europe and
> australia from a drop-down list, or translating the user's typed-in
> collection:asia,europe,australia to collection:1,7,8 before the query
> is executed.
>

Yes. That sounds right.

St.Ack
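
The name-to-ID translation suggested in the quoted text (collection:asia,europe,australia becoming collection:1,7,8 before the query runs) could look roughly like the sketch below. It is only an illustration in Python and not NutchWAX code: the IDs 1, 7 and 8 come from the example in the thread, while COLLECTION_IDS, rewrite_collection_query and query_from_selection are hypothetical names made up for this sketch.

```python
import re

# Hypothetical name-to-ID table; the IDs mirror the example in the thread.
COLLECTION_IDS = {"asia": 1, "europe": 7, "australia": 8}

# Matches a 'collection:' clause followed by a comma-separated list of
# names and/or numeric IDs (no spaces inside the list).
_COLLECTION_CLAUSE = re.compile(r"\bcollection:([A-Za-z0-9_,]+)")


def rewrite_collection_query(query, ids=COLLECTION_IDS):
    """Replace collection names in a typed-in 'collection:a,b,c' clause with IDs."""

    def _translate(match):
        translated = []
        for name in match.group(1).split(","):
            if not name:
                continue  # tolerate a stray comma
            key = name.lower()
            if key in ids:
                translated.append(str(ids[key]))
            elif name.isdigit():
                translated.append(name)  # already an ID, pass it through
            else:
                raise ValueError(f"unknown collection: {name!r}")
        return "collection:" + ",".join(translated)

    return _COLLECTION_CLAUSE.sub(_translate, query)


def query_from_selection(selected_names, terms, ids=COLLECTION_IDS):
    """Build the query from a drop-down selection plus the user's search terms."""
    clause = ",".join(str(ids[name.lower()]) for name in selected_names)
    return f"collection:{clause} {terms}"


if __name__ == "__main__":
    print(rewrite_collection_query("collection:asia,europe,australia cats"))
    print(query_from_selection(["asia", "europe", "australia"], "cats"))
```

Both paths produce the same short query, so the search front end could offer either a drop-down or typed-in names without exposing the numeric IDs to the user.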