From: Leo S. <leo...@df...> - 2007-01-08 09:19:11
Ok, I have updated the wiki page with pro/con votes. Please continue the
discussion both via e-mail and on the wiki, so that we can still read it
months later. Here is my feedback:

Solutions:

CompoundObjectProcessor
<https://gnowsis.opendfki.de/wiki/ApertureArchives#CompoundObjectProcessor>

Leo: how about naming it "SubCrawler" or "MicroCrawler
<https://gnowsis.opendfki.de/wiki/MicroCrawler>"? That is, a crawler
that runs inside a bigger crawl process to crawl sub-resources.

* apply a Crawler on a DataSource
  <https://gnowsis.opendfki.de/wiki/DataSource>, producing a queue of
  DataObjects <https://gnowsis.opendfki.de/wiki/DataObjects>.
* for every DataObject in this set:
  - determine the MIME type of the stream
  - see if there is a CompoundObjectProcessor
    <https://gnowsis.opendfki.de/wiki/CompoundObjectProcessor> impl for
    this MIME type.
    if yes:
    + apply the CompoundObjectProcessor on this DataObject and put
      all resulting DataObjects in the queue
    if no:
    + see if there is an Extractor impl for this MIME type and
      if so, apply it on the DataObject

The CompoundObjectProcessor could be given an AccessData
<https://gnowsis.opendfki.de/wiki/AccessData> instance, just like
Crawler, to make incremental crawling of such objects possible. Giving
the CompoundObjectProcessor a DataObject rather than, say, an
InputStream allows it to add container-specific metadata for the
archive itself (#entries, uncompressed size, etc.) and to retrieve
metadata it may require (e.g. the name of the archive file).

Pro:
* Leo: could handle most problems
Con:
* Leo: When you have the file extension ".xml", there are a billion
  choices for how to extract the info from it.
Vote:
* Leo: +

Merge Crawler and Extractor
<https://gnowsis.opendfki.de/wiki/ApertureArchives#MergeCrawlerandExctractor>

Alternative: find a way to generalize the Crawler and Extractor APIs
into one XYZ API: you put a source description in and it produces
DataObjects that get processed recursively and exhaustively. Feels a
bit tricky and like over-generalization to me, but I wanted to mention
it; perhaps someone has good ideas in this direction.

Pro:
Con:
* Leo: that would make it so generic that it is useless.
Vote:
* Leo: -

Let Extractor do more
<https://gnowsis.opendfki.de/wiki/ApertureArchives#LetExtractordomore>

The Extractor interface was designed to return more than one resource
anyway. It can do that by wrapping them inside the RDFContainer; we
have done that with addresses in e-mails already, using anonymous nodes
or URI nodes in between (for sender/cc).

Extractor can return a bigger RDF graph inside one RDFContainer (which
works already), but the RDFContainer could be extended with a list of
the resources contained within. The list can be done either using RDF
metadata (x aperture:isContainedIn y) or with a Java list.

Pro:
* Leo: works today
Con:
* Leo: hard to optimize the Lucene index afterwards

Christiaan Fluit wrote on 05.01.2007 14:30:
> Gunnar Aastrand Grimnes wrote:
>> Has anyone got any good ideas about this?
:)
>
> Some rough ideas (partially repeats stuff I wrote in
> http://gnowsis.opendfki.de/wiki/ApertureArchives):
>
> * I think this calls for another major API, next to Crawler and
> Extractor, as it seems to be something altogether different. I call it
> CompoundObjectProcessor for now, still looking for a better name.
> Typical processing in AutoFocus or Gnowsis would then be:
>
> - apply a Crawler on a DataSource, producing a queue of DataObjects.
> - for every DataObject in this set:
>   - determine the MIME type of the stream
>   - see if there is a CompoundObjectProcessor impl for this MIME type.
>     if yes:
>     - apply the CompoundObjectProcessor on this DataObject and put
>       all resulting DataObjects in the queue
>     if no:
>     - see if there is an Extractor impl for this MIME type and
>       if so, apply it on the DataObject
>
> The CompoundObjectProcessor could be given an AccessData instance, just
> like Crawler, to make incremental crawling of such objects possible. We
> have seen cases where zip files were *adapted* periodically, e.g. backup
> archives or dumps of a document management system. Also, IMAP supports
> editing of existing messages (removing attachments, for example), MSN
> Messenger puts all the logs for all sessions in a single file per
> contact, etc. This means that when incrementally crawling a zip file to
> which a single file was added, the latter file would be reported as new,
> the zip file itself as changed, and all the other files in the zip file
> as unchanged.
>
> Giving the CompoundObjectProcessor a DataObject rather than, say, an
> InputStream allows it to add container-specific metadata for the archive
> itself (#entries, uncompressed size, etc.) and to retrieve metadata it
> may require (e.g. the name of the archive file).
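To make sure we mean the same thing, here is a minimal, self-contained Java sketch of the dispatch loop you describe. DataObject, Extractor and CompoundObjectProcessor are simplified stand-ins for the real Aperture interfaces, and the MIME-type registries and all names in it are assumptions for illustration only:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class CrawlSketch {

    // Hypothetical, simplified stand-ins for Aperture's types: a DataObject
    // carries a URI, a MIME type and, for container formats, child objects.
    record DataObject(String uri, String mimeType, List<DataObject> children) {}

    interface Extractor {
        void extract(DataObject object, List<String> metadata);
    }

    interface CompoundObjectProcessor {
        List<DataObject> process(DataObject object);
    }

    static List<String> crawl() {
        // Registries keyed by MIME type; the real lookup mechanism is assumed.
        Map<String, CompoundObjectProcessor> compoundImpls = Map.of(
                "application/zip", DataObject::children);   // "unpack" the children
        Map<String, Extractor> extractors = Map.of(
                "text/plain", (o, md) -> md.add("extracted " + o.uri()));

        // A zip containing two text files, as a Crawler might report it.
        DataObject zip = new DataObject("file:/a.zip", "application/zip",
                List.of(new DataObject("zip:/a.zip!/x.txt", "text/plain", List.of()),
                        new DataObject("zip:/a.zip!/y.txt", "text/plain", List.of())));

        List<String> metadata = new ArrayList<>();
        Deque<DataObject> queue = new ArrayDeque<>(List.of(zip));
        while (!queue.isEmpty()) {
            DataObject obj = queue.poll();
            CompoundObjectProcessor cop = compoundImpls.get(obj.mimeType());
            if (cop != null) {
                // compound object: all resulting sub-objects re-enter the queue
                queue.addAll(cop.process(obj));
            } else {
                Extractor extractor = extractors.get(obj.mimeType());
                if (extractor != null) {
                    extractor.extract(obj, metadata);
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(crawl());
    }
}
```

Note that because sub-objects go back into the same queue, arbitrarily nested containers (a zip inside a zip) are handled by the loop itself with no extra code, which I think is the main attraction of this design.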
>
> * Alternative: find a way to generalize the Crawler and Extractor APIs
> into one XYZ API: you put a source description in and it produces
> DataObjects that get processed recursively and exhaustively. Feels a bit
> tricky and over-generalization to me but I wanted to mention it, perhaps
> someone has good ideas in this direction.
>
> * Arjohn recently referred me to the Commons VFS project
> (http://jakarta.apache.org/commons/vfs/). From the 1.0 release notes:
>
> "Commons VFS provides a single API for accessing various different file
> systems. It presents a uniform view of the files from various different
> sources, such as the files on local disk, on an HTTP server, or inside a
> Zip archive. For example, you can use filenames like
> "tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt"
> to access a compressed tar file located on a web server."
>
> Could be useful, they seem to handle multiple schemes, multiple archive
> formats and infinite nesting. I didn't look at it in detail thus far.
> It's not clear to me right now how Aperture and Commons VFS would be
> integrated.
>
>
> Chris
>
> _______________________________________________
> Aperture-devel mailing list
> Ape...@li...
> https://lists.sourceforge.net/lists/listinfo/aperture-devel

-- 
____________________________________________________

DI Leo Sauermann        http://www.dfki.de/~sauermann
DFKI GmbH
P.O. Box 2080           Fon:  +49 631 205-3503
67608 Kaiserslautern    Fax:  +49 631 205-3472
Germany                 Mail: leo...@df...
____________________________________________________