Hi Antoni, hi Aperture list,

Antoni Mylka wrote on 19.11.2008 15:10:
2008/11/14 Leo Sauermann <leo.sauermann@dfki.de>:
Hi Aperture,

most of our discussions on how to process binaries using Extractors,
and on the delicate communication between CrawlerHandler, SubCrawler,
and Extractor, are lost somewhere in e-mails.

CrawlerHandlerBase has (in the ...example... folder) a "processBinary"
method that behaves correctly; this is our "standard" way to do it
(although it misses subcrawlers).

I have now moved this method to CrawlerHandlerBase in the main Java source
(committed in 1485), because it is needed as a reference implementation of
"this is how it works".

I would propose we keep it like this, and I'd value feedback from you.



It's ok, though I have some reservations.
- SubCrawlerRegistry is missing
this is exactly why we should have the base impl :-)

I added SubCrawlerRegistry, and I had to extend the method signature of
processBinary so that the crawler is passed in as well.

I also removed the CrawlerHandlerBase impl from the examples code;
it was used by org.semanticdesktop.aperture.examples.tutorials....
(it is now obsolete).

I won't replace the SimpleCrawlerHandler though; it has many more features and is also correctly implemented.

- It should work with either the extractor or the subcrawler registry set to
null; if it is to be THE default crawler handler base class, it
shouldn't force users to use extractors and subcrawlers.
ok, I changed that, good point.
- The class creates a new in-memory model for each data object, which
is then discarded; as it is, the class acts as a /dev/null sink for data.
I'd rather force the client to supply a ModelSet, or make the
getRDFContainerFactory method abstract, to make it plain that:

If you want the data:
- set up the registries, the datasource and the crawler
- set up your data store (ModelSet) and pass it to the handler
- crawl
Now you have your data in your data store.

This would allow many people to use Aperture as a black box, without
having to implement the crawler handler, or even extend the default one.
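That black-box recipe could be sketched in compilable form like this. Note that every class below (Item, Handler, the tiny Crawler) is a simplified stand-in invented for this example, not Aperture's real interfaces; it only illustrates the flow "set up crawler and handler, supply your own data store, crawl, find the data in the store":

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlackBoxDemo {

    // Stand-in for a crawled resource (Aperture's real DataObject is richer).
    record Item(String uri, String content) {}

    // Stand-in for the CrawlerHandler callback interface.
    interface Handler { void objectNew(Item item); }

    // Stand-in crawler: reports every "found" object to the handler.
    static class Crawler {
        private final List<Item> source;
        private Handler handler;
        Crawler(List<Item> source) { this.source = source; }
        void setCrawlerHandler(Handler h) { this.handler = h; }
        void crawl() { source.forEach(handler::objectNew); }
    }

    public static void main(String[] args) {
        // 1. set up the data source and the crawler
        Crawler crawler = new Crawler(List.of(
                new Item("file:///a.txt", "hello"),
                new Item("file:///b.txt", "world")));
        // 2. set up your data store (stands in for a ModelSet)
        //    and pass it to the handler
        Map<String, String> store = new HashMap<>();
        crawler.setCrawlerHandler(item -> store.put(item.uri(), item.content()));
        // 3. crawl - now the data is in your data store
        crawler.crawl();
        System.out.println("stored " + store.size() + " objects");
    }
}
```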

The default behavior of creating an in-memory RDF container for each data object is fine;
it encapsulates the data in a separate space, which is pretty safe.
I only fear that we will sometimes forget to call close().

Passing the data store to the BaseCrawlerHandler is too much:
* we have no clue how to store the data we crawl! In separate named graphs? With which graph URI? Any other metadata? How do we update stored metadata?
* this is where the glue code should happen.

But I see a need to document WHAT exactly has to be programmed in the crawler handler to connect it to a database.

What we could do is
implement the objectNew, objectChanged, objectNotModified, and objectRemoved methods in a simple way,
including the processBinary calls and maybe some crawler reporting ("x objects new", etc.; we often need that).
Inside these methods we would then call abstract methods to indicate that something now has to be done,

such as:
    public void objectNew(Crawler crawler, DataObject object) {
        processBinary(crawler, object);
        objectNewDatabase(crawler, object); // this abstract method needs to be implemented
    }

    // the concrete handler then implements:
    protected abstract void objectNewDatabase(Crawler crawler, DataObject object);

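Fleshed out as a compilable sketch (again with stand-in types invented for this example, not Aperture's real interfaces), that template-method idea would look roughly like this:

```java
import java.util.ArrayList;
import java.util.List;

public class TemplateHandlerDemo {

    // Stand-in for Aperture's DataObject.
    record Item(String uri) {}

    // The proposed base class: concrete callbacks do the common work
    // (processBinary, reporting) and delegate storage to abstract hooks.
    static abstract class CrawlerHandlerBase {
        int newCount = 0; // crawler reporting: "x objects new"

        void processBinary(Item item) {
            // extraction / subcrawling would run here
        }

        public final void objectNew(Item item) {
            newCount++;
            processBinary(item);
            objectNewDatabase(item); // abstract: the only part users must write
        }

        protected abstract void objectNewDatabase(Item item);
    }

    public static void main(String[] args) {
        List<String> database = new ArrayList<>();
        CrawlerHandlerBase handler = new CrawlerHandlerBase() {
            @Override protected void objectNewDatabase(Item item) {
                database.add(item.uri()); // the user's glue code
            }
        };
        handler.objectNew(new Item("file:///a.txt"));
        handler.objectNew(new Item("file:///b.txt"));
        System.out.println(handler.newCount + " objects new");
    }
}
```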
But alas, it also makes things more complicated.
I would maybe leave it as is. We have been quite successful with the abstraction layers so far; the idea was always that the CrawlerHandler is the glue code you must write.


kind regards

DI Leo Sauermann       http://www.dfki.de/~sauermann 

Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:  leo.sauermann@dfki.de

Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Chairman)
Dr. Walter Olthoff
Chairman of the Supervisory Board:
Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313