|
From: Andrea G. <and...@ha...> - 2006-04-21 18:54:17
|
Hello, I have been reading documentation for nutchwax, nutch and lucene trying to figure out if there's a way to do what I need to do: basically to allow curators to "tag" particular archived web sites as belonging to a collection for the purposes of restricting searches to that collection and for generating web pages related to that collection. The trick is that these collections can be defined at any time, post-harvest (heritrix), post-index (nutchwax). And web sites can belong to multiple collections. Sometimes the collections are hierarchical, sometimes they are not. This is my thinking on it so far. I'm hoping that someone will step in with a better or more elegant way to do this. If I restrict their collections to be defined by a set of seed URIs (rather than all archived URIs) I think it's more manageable. Picture a database (call it myDB) that manages a list of seed URIs, each associated with a unique seedId. I can perform separate heritrix crawls per seed URI. When I index that set of ARC files (associated with a single seed URI) I can set the command-line "collection" field to the seed ID. Then all URIs associated with a seed URI will have the same indexed "collection" value. This posting might be hard to read because the word collection is overused - the nutchwax collection field would be used in this case to group together sites that came from the same seed. Then in that separate database (myDB) I can manage associations defined at any time between curator-defined collections and the seedIds. I wouldn't want to add these collection "tags" to the index because to do this I'd probably have to add these new collection values to the content in the ARC files, then write a parse, index and query filter to handle the new field, right? Or is there a way to just add a field directly to the index for a set of seedIds? So the idea is to translate a user's search query into indexed fields using the associations in myDB. 
Say the user searches with (and myCollection is the field name for the curator's collection, which isn't a lucene field): myCollection:Asia cats then this could be translated behind-the-scenes to a lucene query: collection:seed1 OR collection:seed7 OR collection:seed8 cats (assuming the crawls associated with seeds 1, 7 and 8 were mapped to the Asia collection.) I don't think nutchWAX can support the OR queries yet, is that right? Has anyone else figured out a different way to do this or have a different idea? thanks, Andrea |
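The query translation described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary stands in for the myDB association table, the name `translate_query` is made up for this example, and the output is a Lucene-style query string (parentheses are added so the OR group is explicit). Since the post itself doubts that NutchWAX supports OR, this shows the intended rewrite rather than something NutchWAX is known to accept.

```python
import re

# Hypothetical stand-in for the myDB association table:
# curator-defined collection name -> seed IDs indexed in the "collection" field.
COLLECTION_TO_SEEDS = {
    "Asia": ["seed1", "seed7", "seed8"],
}

# Matches the curator-facing pseudo-field, e.g. "myCollection:Asia".
_MY_COLLECTION = re.compile(r"\bmyCollection:(\S+)")


def translate_query(user_query, mapping=COLLECTION_TO_SEEDS):
    """Rewrite myCollection:<name> into an OR group over the indexed field."""

    def expand(match):
        name = match.group(1)
        seeds = mapping.get(name)
        if not seeds:
            raise KeyError(f"unknown curator collection: {name}")
        clause = " OR ".join(f"collection:{seed}" for seed in seeds)
        # Parenthesize multi-seed groups so the OR does not swallow other terms.
        return f"({clause})" if len(seeds) > 1 else clause

    return _MY_COLLECTION.sub(expand, user_query)


print(translate_query("myCollection:Asia cats"))
```

Because the associations live outside the index, redefining a curator collection only changes the mapping; nothing has to be reindexed.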
|
From: Michael S. <st...@ar...> - 2006-04-24 16:38:33
|
Andrea Goethals wrote: > Hello, Hello Andrea. > > I have been reading documentation for nutchwax, nutch and lucene > trying to figure out if there's a way to do what I need to do: > basically to allow curators to "tag" particular archived web sites as > belonging to a collection for the purposes of restricting searches to > that collection and for generating web pages related to that collection. > > The trick is that these collections can be defined at any time, > post-harvest (heritrix), post-index (nutchwax). And web sites can > belong to multiple collections. Sometimes the collections are > hierarchical, sometimes they are not. > > This is my thinking on it so far. I'm hoping that someone will step in > with a better or more elegant way to do this. > > If I restrict their collections to be defined by a set of seed URIs > (rather than all archived URIs) I think it's more manageable. Picture > a database (call it myDB) that manages a list of seed URIs, each > associated with a unique seedId. I can perform separate heritrix > crawls per seed URI. When I index that set of ARC files (associated > with a single seed URI) I can set the command-line "collection" field > to the seed ID. Then all URIs associated with a seed URI will have the > same indexed "collection" value. This posting might be hard to read > because the word collection is overused - the nutchwax collection > field would be used in this case to group together sites that came > from the same seed. > > Then in that separate database (myDB) I can manage associations > defined at any time between curator-defined collections and the > seedIds. I wouldn't want to add these collection "tags" to the index > because to do this I'd probably have to add these new collection > values to the content in the ARC files, then write a parse, index and > query filter to handle the new field, right? Or is there a way to just > add a field directly to the index for a set of seedIds? 
Not at the moment, but I've been playing and we could add a new step that does nothing but read from a data source and add metadata from it to the Nutch(WAX) segment (in particular, rewrite the parse_data file in the segment, the file that holds the 'metadata' such as fetch time, etc.). So you wouldn't have to touch the ARCs, just the product of the Nutch(WAX) parse. You could tag a page as belonging to multiple collections, e.g. collections 1, 7 and 8. After adding the metadata, you'd have to reindex. Would that work for you? > > So the idea is to translate a user's search query into indexed fields > using the associations in myDB. Say the user searches with (and > myCollection is the field name for the curator's collection, which > isn't a lucene field): > myCollection:Asia cats > then this could be translated behind-the-scenes to a lucene query: > collection:seed1 OR collection:seed7 OR collection:seed8 cats > (assuming the crawls associated with seeds 1, 7 and 8 were mapped to > the Asia collection.) > > I don't think nutchWAX can support the OR queries yet, is that right? That's right. No OR yet. But we're having a similar problem here at the archive (archiveit.org in particular). They have done something similar to your idea: they add a little indirection by naming collections by ID instead of explicitly. Querying one collection works now, as does querying all collections, but querying a couple of collections is awkward. One thought is to amend the collection query-time plugin so it can take a list of collections, e.g. 'collection:asia,europe,australia cats'. This would find instances of cats in all three listed collections. It would break if the list of collections was in the hundreds, I'd imagine. And it's not what you want. I suppose you could do 3 separate queries and aggregate the results? Would that be onerous? St.Ack |
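The "separate queries, aggregated" fallback suggested above might look roughly like the Python sketch below. Everything here is an assumption for illustration: `run_query` is a hypothetical callable that takes a query string and returns `(url, score)` pairs, and merging by the best score per URL is just one possible policy. Scores from separate queries may not be directly comparable, so the merged ordering should be treated as approximate.

```python
def search_collections(run_query, collections, terms, limit=10):
    """Run one query per collection and merge the hits.

    run_query is a hypothetical callable: query string -> iterable of
    (url, score) pairs. It is not a NutchWAX API.
    """
    best = {}
    for coll in collections:
        for url, score in run_query(f"collection:{coll} {terms}"):
            # Keep the highest score seen for each URL across all queries.
            if url not in best or score > best[url]:
                best[url] = score
    # Sort by descending score, breaking ties by URL for a stable order.
    ranked = sorted(best.items(), key=lambda hit: (-hit[1], hit[0]))
    return ranked[:limit]
```

The cost grows with the number of collections, one query each, which is the same scaling concern raised for the comma-separated list.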
|
From: Lukas M. <mat...@ce...> - 2006-04-24 20:27:41
|
On Mon, 24 April 2006 18:38, Michael Stack wrote: > Andrea Goethals wrote: > > Hello, > > Hello Andrea. > > > I have been reading documentation for nutchwax, nutch and lucene > > trying to figure out if there's a way to do what I need to do: > > basically to allow curators to "tag" particular archived web sites as > > belonging to a collection for the purposes of restricting searches to > > that collection and for generating web pages related to that collection. > > > > The trick is that these collections can be defined at any time, > > post-harvest (heritrix), post-index (nutchwax). And web sites can > > belong to multiple collections. Sometimes the collections are > > hierarchical, sometimes they are not. > > > > This is my thinking on it so far. I'm hoping that someone will step in > > with a better or more elegant way to do this. > > > > If I restrict their collections to be defined by a set of seed URIs > > (rather than all archived URIs) I think it's more manageable. Picture > > a database (call it myDB) that manages a list of seed URIs, each > > associated with a unique seedId. I can perform separate heritrix > > crawls per seed URI. When I index that set of ARC files (associated > > with a single seed URI) I can set the command-line "collection" field > > to the seed ID. Then all URIs associated with a seed URI will have the > > same indexed "collection" value. This posting might be hard to read > > because the word collection is overused - the nutchwax collection > > field would be used in this case to group together sites that came > > from the same seed. We had to solve a similar situation. In a separate database we defined special metadata (e.g. collection, contract with author) for each URI, and then we'd like to feed nutchwax with a tagged set of records from the ARCs (ARC name + offset). The extra metadata in the database is also used for accessing documents through the wayback machine.
> > > > Then in that separate database (myDB) I can manage associations > > defined at any time between curator-defined collections and the > > seedIds. I wouldn't want to add these collection "tags" to the index > > because to do this I'd probably have to add these new collection > > values to the content in the ARC files, then write a parse, index and > > query filter to handle the new field, right? Or is there a way to just > > add a field directly to the index for a set of seedIds? > > Not at the moment but I've been playing and we could add a new step that > did nothing but read from a data source and add metadata from the data > source to the Nutch(WAX) segment (In particular, rewrite the parse_data > file in the segment, the file that holds the 'metadata' such as fetch > time, etc.). > > So, you wouldn't have to touch the ARCs, just the product of the > Nutch(WAX) parse. > > You could tag a page as being of multiple collections: E.g. of > collection 1, 7 and 8. > > After adding the metadata, you'd have to reindex. > > Would that work for you? > > > So the idea is to translate a user's search query into indexed fields > > using the associations in myDB. Say the user searches with (and > > myCollection is the field name for the curator's collection, which > > isn't a lucene field): > > myCollection:Asia cats > > then this could be translated behind-the-scenes to a lucene query: > > collection:seed1 OR collection:seed7 OR collection:seed8 cats > > (assuming the crawls associated with seeds 1, 7 and 8 were mapped to > > the Asia collection.) > > > > I don't think nutchWAX can support the OR queries yet, is that right? > > Thats right. No OR yet. > > But, we're sort of having a similar problem to you here at the archive > (archiveit.org in particular). > > They have done similar to your idea in that they have tried to add in a > little indirection naming collections by ID instead of explicitly. 
> Querying one collection works now or querying all collections but > awkward is querying a couple of collections. One thought is to amend > the collection query-time plugin so it can take a list of collections: > E.g. 'collection:asia,europe,australia cats'. This would find instances > of cats in all three listed collections. Would break if the list of > collections was in the hundreds I'd imagine. And its not what you want. > > I suppose you could do 3 separate queries aggregating the results? > Would that be onerous? > > St.Ack |
|
From: Oskar G. <osk...@kb...> - 2006-04-27 11:35:27
|
Hi everybody! (insert voice of Dr. Nick Riviera here) WAXToolbar is a Firefox extension to aid browsing a web archive; it works tightly together with the new open source Wayback Machine. The first official release -- 0.2.0 -- is now available at: http://archive-access.sourceforge.net/projects/waxtoolbar/ Some basic information on how to install and use it is also there. A few minor changes have to be made to the configuration of the Wayback as well, but those are also covered in the documentation. **Please note that for the toolbar to work you have to get the latest Wayback Machine from CVS HEAD, since changes have been added since the 0.4.1 release.** Regards, Oskar Grenholm National Library of Sweden |
|
From: Andrea G. <and...@ha...> - 2006-04-24 17:00:26
|
Hi Michael, >>I have been reading documentation for nutchwax, nutch and lucene trying >>to figure out if there's a way to do what I need to do: basically to >>allow curators to "tag" particular archived web sites as belonging to a >>collection for the purposes of restricting searches to that collection >>and for generating web pages related to that collection. >> >>The trick is that these collections can be defined at any time, >>post-harvest (heritrix), post-index (nutchwax). And web sites can belong >>to multiple collections. Sometimes the collections are hierarchical, >>sometimes they are not. >> >>This is my thinking on it so far. I'm hoping that someone will step in >>with a better or more elegant way to do this. >> >>If I restrict their collections to be defined by a set of seed URIs >>(rather than all archived URIs) I think it's more manageable. Picture a >>database (call it myDB) that manages a list of seed URIs, each associated >>with a unique seedId. I can perform separate heritrix crawls per seed >>URI. When I index that set of ARC files (associated with a single seed >>URI) I can set the command-line "collection" field to the seed ID. Then >>all URIs associated with a seed URI will have the same indexed >>"collection" value. This posting might be hard to read because the word >>collection is overused - the nutchwax collection field would be used in >>this case to group together sites that came from the same seed. >> >>Then in that separate database (myDB) I can manage associations defined >>at any time between curator-defined collections and the seedIds. I >>wouldn't want to add these collection "tags" to the index because to do >>this I'd probably have to add these new collection values to the content >>in the ARC files, then write a parse, index and query filter to handle >>the new field, right? Or is there a way to just add a field directly to >>the index for a set of seedIds? 
> >Not at the moment but I've been playing and we could add a new step that >did nothing but read from a data source and add metadata from the data >source to the Nutch(WAX) segment (In particular, rewrite the parse_data >file in the segment, the file that holds the 'metadata' such as fetch >time, etc.). > >So, you wouldn't have to touch the ARCs, just the product of the >Nutch(WAX) parse. > >You could tag a page as being of multiple collections: E.g. of collection >1, 7 and 8. > >After adding the metadata, you'd have to reindex. > >Would that work for you? That would be great! There is a need in general (at least for us - probably for others?) to be able to add metadata to already-harvested content. We have another situation like this where the curators would want to add subjects to particular URLs. So we could come up with a generic solution to this - add any field (e.g. collection, subject), tell it which ? to apply this to. Would the new fields be associated at the ARC-level or URI level? >>So the idea is to translate a user's search query into indexed fields >>using the associations in myDB. Say the user searches with (and >>myCollection is the field name for the curator's collection, which isn't >>a lucene field): >>myCollection:Asia cats >>then this could be translated behind-the-scenes to a lucene query: >>collection:seed1 OR collection:seed7 OR collection:seed8 cats >>(assuming the crawls associated with seeds 1, 7 and 8 were mapped to the >>Asia collection.) >> >>I don't think nutchWAX can support the OR queries yet, is that right? >Thats right. No OR yet. > >But, we're sort of having a similar problem to you here at the archive >(archiveit.org in particular). > >They have done similar to your idea in that they have tried to add in a >little indirection naming collections by ID instead of explicitly. >Querying one collection works now or querying all collections but awkward >is querying a couple of collections. 
One thought is to amend the >collection query-time plugin so it can take a list of collections: E.g. >'collection:asia,europe,australia cats'. This would find instances of >cats in all three listed collections. Would break if the list of >collections was in the hundreds I'd imagine. And its not what you want. I think that could work. To reduce the query length problem you could hide the actual query syntax from the user. As archiveit.org does, you could use the collection IDs in the query to keep the query shorter: 'collection:1,7,8 cats' by either translating the user's selection of asia, europe and australia from a drop-down list, or translating the user's typed-in collection:asia,europe,australia to collection:1,7,8 before the query is executed. >I suppose you could do 3 separate queries aggregating the results? >Would that be onerous? I'll probably try the single query with a list first, so that I don't have to deal with ordering the results. thanks, Andrea >St.Ack > |
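The name-to-ID translation proposed in this message could be done before the query is executed, along the lines of the Python sketch below. It rests on assumptions: the comma-separated `collection:` list is the syntax proposed in this thread, not an existing NutchWAX feature, and the `COLLECTION_IDS` table and the `names_to_ids` function are hypothetical names used only for this example.

```python
import re

# Hypothetical lookup table: curator-facing collection name -> short ID.
COLLECTION_IDS = {"asia": 1, "europe": 7, "australia": 8}

# Matches the proposed list syntax, e.g. "collection:asia,europe,australia".
_COLLECTION_LIST = re.compile(r"\bcollection:(\S+)")


def names_to_ids(query, ids=COLLECTION_IDS):
    """Replace collection names in a comma-separated list with their IDs."""

    def replace(match):
        names = [name for name in match.group(1).split(",") if name]
        # An unknown name raises KeyError instead of silently widening the search.
        return "collection:" + ",".join(str(ids[name.lower()]) for name in names)

    return _COLLECTION_LIST.sub(replace, query)


print(names_to_ids("collection:asia,europe,australia cats"))
```

The same function serves both entry points mentioned here: a drop-down selection can be joined into the list form first, and a typed-in query can be passed through unchanged.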
|
From: Michael S. <st...@ar...> - 2006-04-24 17:13:32
|
Andrea Goethals wrote: ..... >> >> Not at the moment but I've been playing and we could add a new step >> that did nothing but read from a data source and add metadata from >> the data source to the Nutch(WAX) segment (In particular, rewrite the >> parse_data file in the segment, the file that holds the 'metadata' >> such as fetch time, etc.). >> >> So, you wouldn't have to touch the ARCs, just the product of the >> Nutch(WAX) parse. >> >> You could tag a page as being of multiple collections: E.g. of >> collection 1, 7 and 8. >> >> After adding the metadata, you'd have to reindex. >> >> Would that work for you? > > That would be great! There is a need in general (at least for us - > probably for others?) to be able to add metadata to already-harvested > content. We have another situation like this where the curators would > want to add subjects to particular URLs. So we could come up with a > generic solution to this - add any field (e.g. collection, subject), > tell it which ? to apply this to. Would the new fields be associated > at the ARC-level or URI level? At the URI level. I've added an RFE: http://sourceforge.net/tracker/index.php?func=detail&aid=1475667&group_id=118427&atid=681140. I'll start in on it after the 0.6.0 release of nutchwax (should be any week now -- but I've been saying that for a while now...). ... > > I think that that could work. To reduce the query length problem you > could hide the actual query syntax from the user. Like the > archiveit.org way, you could use the collection IDs in the query to > keep it the query shorter: > 'collection:1,7,8 cats' > by either translating the user's selection of asia, europe and > austrailia from a drop-down list, or translating the user's typed in > collection:asia,europe,australia to collection:1,7,8 before the query > is executed. > Yes. That sounds right. St.Ack |
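As a rough model of the URI-level metadata step discussed in this thread, the Python sketch below merges externally managed tags into per-URI metadata. It does not read or write real Nutch segment files: the dictionaries, the `(uri, field, value)` tag tuples and the `add_metadata` function are all assumptions made for illustration, and reindexing would still be needed afterwards, as noted earlier in the thread.

```python
def add_metadata(parse_metadata, tags):
    """Return a copy of per-URI metadata with extra multi-valued fields added.

    parse_metadata: dict of URI -> dict of field name -> list of values
                    (a stand-in for what parse_data holds, not its format).
    tags:           iterable of (uri, field, value) rows from an external
                    data source such as a curator database.
    """
    # Copy so the caller's original metadata is left untouched.
    merged = {
        uri: {field: list(values) for field, values in fields.items()}
        for uri, fields in parse_metadata.items()
    }
    for uri, field, value in tags:
        if uri not in merged:
            continue  # tag refers to a URI that is not in this segment
        values = merged[uri].setdefault(field, [])
        if value not in values:
            values.append(value)  # a page may carry several collections
    return merged
```

Keeping the fields multi-valued is what lets one page belong to collections 1, 7 and 8 at once, and the same mechanism covers other curator fields such as subject.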