From: Leo S. <leo...@df...> - 2007-01-08 09:19:11
Ok, I have updated the wiki page with pro/con votes. Please continue the
discussion both via e-mail and on the wiki, so that we can still read it
months later. Here is my feedback:

Solutions:

CompoundObjectProcessor
<https://gnowsis.opendfki.de/wiki/ApertureArchives#CompoundObjectProcessor>

Leo: how about naming it "SubCrawler" or "MicroCrawler
<https://gnowsis.opendfki.de/wiki/MicroCrawler>"? That is, a crawler
that runs inside a bigger crawl process to crawl sub-resources.

* apply a Crawler on a DataSource
  <https://gnowsis.opendfki.de/wiki/DataSource>, producing a queue of
  DataObjects <https://gnowsis.opendfki.de/wiki/DataObjects>.
* for every DataObject in this set:
  - determine the MIME type of the stream
  - see if there is a CompoundObjectProcessor
    <https://gnowsis.opendfki.de/wiki/CompoundObjectProcessor> impl for
    this MIME type.
    if yes:
    + apply the CompoundObjectProcessor on this DataObject and put
      all resulting DataObjects in the queue
    if no:
    + see if there is an Extractor impl for this MIME type and
      if so, apply it on the DataObject

The CompoundObjectProcessor could be given an AccessData
<https://gnowsis.opendfki.de/wiki/AccessData> instance, just like
Crawler, to make incremental crawling of such objects possible. Giving
the CompoundObjectProcessor a DataObject rather than, say, an
InputStream allows it to add container-specific metadata for the
archive itself (#entries, uncompressed size, etc.) and to retrieve
metadata it may require (e.g. the name of the archive file).

Pro:
* Leo: could handle most problems
Con:
* Leo: When you have the file extension ".xml", there are a billion
  choices for how to extract the info from it.
Vote:
* Leo: +

Merge Crawler and Extractor
<https://gnowsis.opendfki.de/wiki/ApertureArchives#MergeCrawlerandExctractor>

Alternative: find a way to generalize the Crawler and Extractor APIs
into one XYZ API: you put a source description in and it produces
DataObjects that get processed recursively and exhaustively. Feels a
bit tricky and like over-generalization to me, but I wanted to mention
it; perhaps someone has good ideas in this direction.

Pro:
Con:
* Leo: that would make it so generic that it is useless.
Vote:
* Leo: -

Let Extractor do more
<https://gnowsis.opendfki.de/wiki/ApertureArchives#LetExtractordomore>

The Extractor interface was designed to return more than one resource
anyway. It can do that by wrapping them inside the RDFContainer; we
have done that with addresses in e-mails already, using anonymous nodes
or URI nodes in between (for sender/cc).

Extractor can return a bigger RDF graph inside one RDFContainer (which
works already), but the RDFContainer could be extended with a list of
the resources contained within. The list can be done either using RDF
metadata (x aperture:isContainedIn y) or with a Java list.

Pro:
* Leo: works today
Con:
* Leo: hard to optimize the Lucene index afterwards

Christiaan Fluit wrote on 05.01.2007 14:30:
> Gunnar Aastrand Grimnes wrote:
>> Has anyone got any good ideas about this?
:)
>
> Some rough ideas (partially repeats stuff I wrote in
> http://gnowsis.opendfki.de/wiki/ApertureArchives):
>
> * I think this calls for another major API, next to Crawler and
> Extractor, as it seems to be something altogether different. I call it
> CompoundObjectProcessor for now, still looking for a better name.
> Typical processing in AutoFocus or Gnowsis would then be:
>
> - apply a Crawler on a DataSource, producing a queue of DataObjects.
> - for every DataObject in this set:
>   - determine the MIME type of the stream
>   - see if there is a CompoundObjectProcessor impl for this MIME type.
>     if yes:
>     - apply the CompoundObjectProcessor on this DataObject and put
>       all resulting DataObjects in the queue
>     if no:
>     - see if there is an Extractor impl for this MIME type and
>       if so, apply it on the DataObject
>
> The CompoundObjectProcessor could be given an AccessData instance, just
> like Crawler, to make incremental crawling of such objects possible. We
> have seen cases where zip files were *adapted* periodically, e.g. backup
> archives or dumps of a document management system. Also, IMAP supports
> editing of existing messages (removing attachments, for example), MSN
> Messenger puts all the logs for all sessions in a single file per
> contact, etc. This means that when incrementally crawling a zip file to
> which a single file was added, the latter file would be reported as new,
> the zip file itself as changed, and all the other files in the zip file
> as unchanged.
>
> Giving the CompoundObjectProcessor a DataObject rather than, say, an
> InputStream allows it to add container-specific metadata for the archive
> itself (#entries, uncompressed size, etc.) and to retrieve metadata it
> may require (e.g. the name of the archive file).
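To make sure we mean the same thing, here is a minimal, self-contained Java sketch of the dispatch loop you describe. DataObject, Extractor and CompoundObjectProcessor are simplified stand-ins for the real Aperture interfaces, and the MIME-type registries and all names in it are assumptions for illustration only:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class CrawlSketch {

    // Hypothetical, simplified stand-ins for Aperture's types: a DataObject
    // carries a URI, a MIME type and, for container formats, child objects.
    record DataObject(String uri, String mimeType, List<DataObject> children) {}

    interface Extractor {
        void extract(DataObject object, List<String> metadata);
    }

    interface CompoundObjectProcessor {
        List<DataObject> process(DataObject object);
    }

    static List<String> crawl() {
        // Registries keyed by MIME type; the real lookup mechanism is assumed.
        Map<String, CompoundObjectProcessor> compoundImpls = Map.of(
                "application/zip", DataObject::children);   // "unpack" the children
        Map<String, Extractor> extractors = Map.of(
                "text/plain", (o, md) -> md.add("extracted " + o.uri()));

        // A zip containing two text files, as a Crawler might report it.
        DataObject zip = new DataObject("file:/a.zip", "application/zip",
                List.of(new DataObject("zip:/a.zip!/x.txt", "text/plain", List.of()),
                        new DataObject("zip:/a.zip!/y.txt", "text/plain", List.of())));

        List<String> metadata = new ArrayList<>();
        Deque<DataObject> queue = new ArrayDeque<>(List.of(zip));
        while (!queue.isEmpty()) {
            DataObject obj = queue.poll();
            CompoundObjectProcessor cop = compoundImpls.get(obj.mimeType());
            if (cop != null) {
                // compound object: all resulting sub-objects re-enter the queue
                queue.addAll(cop.process(obj));
            } else {
                Extractor extractor = extractors.get(obj.mimeType());
                if (extractor != null) {
                    extractor.extract(obj, metadata);
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(crawl());
    }
}
```

Note that because sub-objects go back into the same queue, arbitrarily nested containers (a zip inside a zip) are handled by the loop itself with no extra code, which I think is the main attraction of this design.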
>
> * Alternative: find a way to generalize the Crawler and Extractor APIs
> into one XYZ API: you put a source description in and it produces
> DataObjects that get processed recursively and exhaustively. Feels a bit
> tricky and over-generalization to me but I wanted to mention it, perhaps
> someone has good ideas in this direction.
>
> * Arjohn recently referred me to the Commons VFS project
> (http://jakarta.apache.org/commons/vfs/). From the 1.0 release notes:
>
> "Commons VFS provides a single API for accessing various different file
> systems. It presents a uniform view of the files from various different
> sources, such as the files on local disk, on an HTTP server, or inside a
> Zip archive. For example, you can use filenames like
> "tar:gz:http://anyhost/dir/mytar.tar.gz!/mytar.tar!/path/in/tar/README.txt"
> to access a compressed tar file located on a web server."
>
> Could be useful, they seem to handle multiple schemes, multiple archive
> formats and infinite nesting. I didn't look at it in detail thus far.
> It's not clear to me right now how Aperture and Commons VFS would be
> integrated.
>
>
> Chris
>
> _______________________________________________
> Aperture-devel mailing list
> Ape...@li...
> https://lists.sourceforge.net/lists/listinfo/aperture-devel

-- 
____________________________________________________

DI Leo Sauermann        http://www.dfki.de/~sauermann
DFKI GmbH
P.O. Box 2080           Fon:  +49 631 205-3503
67608 Kaiserslautern    Fax:  +49 631 205-3472
Germany                 Mail: leo...@df...
____________________________________________________