On 16.07.2007 16:03, Christiaan Fluit wrote:
> Gunnar Aastrand Grimnes wrote:
>> * Day 1: Setup datasource for c:\documents - crawl everything
>> * Day 2: do not recrawl all of c:\documents, but only c:\documents\pdf
>> Is it possible with the current Aperture to do incremental crawling of just
>> this sub-directory? Using the AccessData for the first crawl *might* do it, but
>> then the access data is broken when a full crawl is required later.
>> In general, the logic behind AccessData objects isn't very well documented;
>> the javadoc in particular says:
>> [...]AccessData proposes a number of keys to use when storing values, combined
>> with a proposed value encoding. This is to ensure that several DataAccessors and
>> possibly other components can share the same AccessData instance without
>> resulting in conflicts.
>> This seems to hint that sharing access-data objects is possible, but it isn't :)
> My guess is that when I wrote this line, I had a setup in mind with a
> single Crawler and one or more DataAccessors. Multiple Crawlers or
> multiple DataSources is a different matter that didn't come to my mind.
> The only problem I can imagine is indeed with AccessData.clear(), as
> CrawlerBase.clear() assumes that it is the component governing the
> entire AccessData.
The problem is that the crawler selects resources for deletion based
on the data in AccessData. In the setup c:\documents | c:\documents\pdf
this means that after the second crawl, only c:\documents\pdf remains in
the AccessData, and everything else is reported as deleted.
Given the current architecture, changing this behavior is more than a
trivial fix.
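To see why sharing one AccessData between a full and a partial crawl breaks deletion reporting, here is a minimal self-contained sketch. This is plain Java modeling the bookkeeping, not the actual Aperture classes; the names `known` and `crawl` are mine:

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the AccessData bookkeeping: a crawl touches every
// resource it visits; afterwards, everything that is known from earlier
// crawls but was not touched this time is reported as deleted.
public class CrawlSimulation {

    // Resources known from previous crawls (the shared "AccessData").
    static Set<String> known = new HashSet<>();

    // Crawl a set of resources and return the ones reported as deleted.
    static Set<String> crawl(Set<String> found) {
        Set<String> deleted = new HashSet<>(known);
        deleted.removeAll(found);       // known but not touched => "deleted"
        known.removeAll(deleted);
        known.addAll(found);
        return deleted;
    }

    public static void main(String[] args) {
        // Day 1: full crawl of c:\documents
        crawl(Set.of("c:\\documents\\a.txt", "c:\\documents\\pdf\\b.pdf"));

        // Day 2: crawl only c:\documents\pdf against the SAME AccessData:
        // a.txt was not touched, so it is wrongly reported as deleted.
        Set<String> deleted = crawl(Set.of("c:\\documents\\pdf\\b.pdf"));
        System.out.println(deleted);    // [c:\documents\a.txt]
    }
}
```

The false deletion on day 2 is exactly the corruption described above: the partial crawl cannot distinguish "outside my scope" from "gone".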
> I'm not sure yet what the best way to solve this would be. For example,
> should this really be a different DataSource (currently the only way to
> do it, except for the AccessData problem)
It should be two data sources:
one for c:\documents, configured with DomainBoundaries to exclude
c:\documents\pdf, and a second one for c:\documents\pdf.
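For illustration, that split can be modeled with two plain path predicates. This is a sketch of the intended boundaries only, not the actual DomainBoundaries API:

```java
// Models the two-source setup: source 1 covers c:\documents but excludes
// the pdf subtree; source 2 covers only c:\documents\pdf. Every file falls
// into exactly one source, so each source can keep its own AccessData and
// recrawling one cannot corrupt the other's deletion reporting.
public class BoundarySketch {

    static boolean inDocumentsSource(String path) {
        return path.startsWith("c:\\documents\\")
                && !path.startsWith("c:\\documents\\pdf\\");
    }

    static boolean inPdfSource(String path) {
        return path.startsWith("c:\\documents\\pdf\\");
    }

    public static void main(String[] args) {
        System.out.println(inDocumentsSource("c:\\documents\\a.txt"));      // true
        System.out.println(inPdfSource("c:\\documents\\pdf\\b.pdf"));       // true
        System.out.println(inDocumentsSource("c:\\documents\\pdf\\b.pdf")); // false
    }
}
```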
> or should we have the ability
> to instruct the Crawler to only recrawl a specific part of a domain?
That would be possible, but it also assumes that the crawler knows
about the container/hasPart structure, which is not a requirement yet.
It is an architectural change I would like to avoid for now (as only one
user needs it at the moment).
> Aperture-devel mailing list
DI Leo Sauermann          http://www.dfki.de/~sauermann
Deutsches Forschungszentrum fuer Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080
D-67663 Kaiserslautern, Germany
Fon:  +49 631 20575-116
Fax:  +49 631 20575-102
Mail: leo.sauermann@...

Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313