[Aperture-devel] crawling files containing dataobjects

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello everyone,

We recently came across a problem with crawling some special XML files. These
files contain several "things" that we would like to have mapped to several
data-objects.

This is analogous to the problem of archive files, which I have previously
discussed with Chris: when the FileSystemCrawler finds a zip file, we might
want to crawl the contents of the file, not just the file itself. The current
architecture makes this kinda tricky. The FileSystemCrawler finds one file,
and only in crawlerHandler.reportNew do we find the mime-type of the file, and
only then do we invoke the appropriate extractor to get the contents.

The main problem with the current setup is that the extractor method "extract"
gets a single RDFContainer and a single ID, so there is no way for it to report
other data-objects.
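
To make the limitation concrete, here is a rough sketch of what a
multi-object variant could look like. None of these names exist in Aperture:
DataObjectSink, MultiObjectExtractor, LineSplittingExtractor and
CollectingSink are all hypothetical, and plain strings stand in for
RDFContainers and stream contents. The only point is that the extractor is
handed a callback, so it can report as many data-objects as it finds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback; not part of Aperture. Invoked once per "thing"
// discovered inside a single input.
interface DataObjectSink {
    void report(String id, String content);
}

// Hypothetical counterpart to a single-container extractor: instead of
// filling one container, it reports any number of child objects.
interface MultiObjectExtractor {
    void extract(String parentId, String input, DataObjectSink sink);
}

// Toy implementation: treats every non-empty line as one data-object and
// derives each child ID from the parent ID plus a running index.
class LineSplittingExtractor implements MultiObjectExtractor {
    @Override
    public void extract(String parentId, String input, DataObjectSink sink) {
        int index = 0;
        for (String line : input.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines do not become data-objects
            }
            sink.report(parentId + "#" + index, trimmed);
            index++;
        }
    }
}

// Sink that simply remembers the reported IDs in order.
class CollectingSink implements DataObjectSink {
    final List<String> ids = new ArrayList<>();

    @Override
    public void report(String id, String content) {
        ids.add(id);
    }
}
```

A crawler handler could pass such a sink to the extractor and treat every
reported object as if the crawler had found it directly.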

One solution is of course to hard-code zip/tar/rar/whatever support into the
FileSystemCrawler, but that's kinda crappy, because you would also like to be
able to crawl zip files that arrive as email attachments.
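
As a sketch of the alternative, the archive handling can be written against a
plain InputStream, so it does not care whether the bytes come from a file on
disk or from an email attachment. This uses only the standard java.util.zip
classes. The ZipSubCrawler class, the "!/" separator in the child IDs and the
sampleZip helper are assumptions made for illustration, not anything Aperture
provides.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; not part of Aperture.
class ZipSubCrawler {

    // Returns one child ID per non-directory entry in the archive. The ID
    // scheme (parent ID + "!/" + entry name) is an assumption.
    static List<String> listEntries(String parentId, InputStream in) throws IOException {
        List<String> ids = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    ids.add(parentId + "!/" + entry.getName());
                }
            }
        }
        return ids;
    }

    // Builds a small in-memory archive of plain files, for demonstration
    // only. Each entry's content is just its own name.
    static byte[] sampleZip(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(name.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }
}
```

Because listEntries only sees a stream and a parent ID, the same code could
be called from a file-system crawl or from a mail crawl that hands over an
attachment stream.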

This problem also arises for things like Thunderbird address books (a single
file on disk), iCal files, etc., but it's not as natural to crawl these as part
of a file-system crawl, so they are handled by dedicated crawlers.

Leo just pointed me to this link, where this was discussed previously:
http://gnowsis.opendfki.de/wiki/ApertureArchives

Has anyone got any good ideas about this? :)
- --
Gunnar Aastrand Grimnes
gunnar.grimnes [AT] dfki.de

DFKI GmbH
Knowledge Management
Erwin-Schroedinger-Strasse
D-67663 Kaiserslautern
Germany

Office: +49 631 205 3438
Mobile: +49 177 277 4397

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFnktafD15aMgAOfcRAoVYAJ94EvqWGrOt15kUr8oXjyhmhhHKzwCfbVTs
cbnM9Ilc2fY893+9/dAndcY=
=9lk+
-----END PGP SIGNATURE-----