From: Gunnar A. G. <gun...@df...> - 2007-01-05 12:54:27
Hello everyone,

We recently came across a problem with crawling some special XML files. These files contain several "things" that we would like to have mapped to several data-objects. This is analogous to the problem of archive files, which I have previously discussed with Chris: when the FileSystemCrawler finds a zip-file, we might want to crawl the contents of the file, not just the file itself.

Currently the architecture makes this kinda tricky. The FileSystemCrawler will find one file, and only in crawlerHandler.reportNew do we find the mime-type of the file, and only then do we invoke the appropriate extractor to get the contents. The main problem with the current setup is that the extractor method "extract" gets a single RDFContainer and a single ID, so there is no way for it to report other data-objects.

One solution is of course to hard-code zip/tar/rar/whatever support into the FileSystemCrawler, but that's kinda crappy, because you would like to be able to crawl zip files as email attachments as well.

This problem also arises for things like Thunderbird address books (a single file on disk), iCal files, etc., but it's not as natural to crawl these as part of a file-system crawl, and these are solved by having dedicated crawlers for them.

Leo just pointed me to this link, where this was discussed previously:

http://gnowsis.opendfki.de/wiki/ApertureArchives

Has anyone got any good ideas about this? :)

-- 
Gunnar Aastrand Grimnes
gunnar.grimnes [AT] dfki.de
DFKI GmbH
Knowledge Management
Erwin-Schroedinger-Strasse
D-67663 Kaiserslautern
Germany
Office: +49 631 205 3438
Mobile: +49 177 277 4397
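
To make the limitation described in this message concrete, here is a rough sketch of one possible direction: giving the extractor a callback through which it can report additional data-objects. This is not the Aperture API. `SubObjectSink`, `ContainerExtractor`, `ZipContainerExtractor`, `ContainerSketch` and the `zip:<parent>!/<entry>` ID scheme are all hypothetical names made up for illustration, and the RDFContainer is left out entirely. Only the `java.util.zip` calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical callback (NOT part of Aperture): lets an extractor report
// further data-objects found inside the stream it is processing.
interface SubObjectSink {
    void reportSubObject(String id, byte[] content);
}

// Hypothetical extractor variant that receives a sink in addition to the parent ID.
interface ContainerExtractor {
    void extract(String parentId, InputStream in, SubObjectSink sink) throws IOException;
}

// Walks a zip stream and reports every regular entry as its own data-object.
class ZipContainerExtractor implements ContainerExtractor {
    @Override
    public void extract(String parentId, InputStream in, SubObjectSink sink) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content of their own
                }
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                // Assumed ID scheme: parent ID plus the entry path.
                sink.reportSubObject("zip:" + parentId + "!/" + entry.getName(), buf.toByteArray());
            }
        }
    }
}

class ContainerSketch {
    // Builds a small in-memory zip so the sketch needs no files on disk.
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("docs/b.txt"));
            zip.write("world".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Runs the extractor over the sample zip and collects the reported IDs.
    static List<String> extractIds(String parentId) {
        List<String> ids = new ArrayList<>();
        try {
            new ZipContainerExtractor().extract(
                    parentId,
                    new ByteArrayInputStream(sampleZip()),
                    (id, content) -> ids.add(id));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        for (String id : extractIds("file:/tmp/archive.zip")) {
            System.out.println(id);
        }
    }
}
```

Because the sink is handed to the extractor rather than owned by a particular crawler, the same extractor could in principle be invoked for a zip found on disk or one found as an email attachment. Whether the reported sub-objects should then be fed back through mime-type detection and further extractors is left open here.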