Before the rest:
* 16 MB RAM is OK, as long as it doesn't grow bigger.

Another idea just hit me:
What if we move deprecatedUrls to the CrawlerHandler, let it count which URLs were "new, unchanged, modified" (it gets informed anyway), and then let it mass-delete the deleted resources itself?
Hm, this would break the idea that crawlers have optimized ways of detecting changes... maybe dumb.

(This would, though, require us to release Aperture 2.0, since it's an incompatible change; I would rather keep the current architecture and write this suggestion down on the 2.0 wishlist.)

I have found another solution, see bottom

On 14.02.2008 14:47, Antoni Myłka wrote:
Christiaan Fluit wrote:
Antoni Myłka wrote:
Leo Sauermann wrote:
As said in my original mail, the main question is:
does the SubCrawler change the protected field "deprecatedUrls" of CrawlerBase?


No it doesn't, because it doesn't have access to it.
I've been thinking about a change in how deprecated URLs are handled 
(purely for performance reasons), which may also solve this issue.

One of the problems is that the Crawler currently maintains the set of
all deprecated URLs in main memory. In other words, when you start to
incrementally crawl a source that after the last crawl had 100,000
files, you start your crawl by creating a set of 100,000 Strings in
main memory.

Clearly not a very scalable solution, considering how expensive Strings
are. For 100,000 Strings of approx. 60 chars each (still very modest
numbers in our use cases; we'd like to go up an order of magnitude or
more), this already means about 16 MB of String data (2 * #chars + 40
bytes per String).

I've been thinking about a different approach that can also solve the
SubCrawler vs. deprecated URLs problem. It boils down to giving each
crawled item, whether it is new, changed or unchanged, a timestamp in
the access data, reflecting the time when the crawl was started.
Detection of deprecated items then simply means looking up all items in
the AccessData with a timestamp different from the current crawl's.
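As a rough sketch, the bookkeeping could look like this (a minimal in-memory model; `TimestampAccessData`, `startCrawl` and `touch` are illustrative names, not Aperture's actual AccessData API):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal sketch of timestamp-based deprecation detection.
// All names are illustrative, not Aperture's actual AccessData API.
class TimestampAccessData {
    private final Map<String, Long> lastSeen = new HashMap<>();
    private long currentTimestamp;

    // CrawlerBase sets this once, when the crawl starts.
    void startCrawl(long timestamp) {
        this.currentTimestamp = timestamp;
    }

    // Applied to every inspected item: new, changed or unchanged.
    void touch(String id) {
        lastSeen.put(id, currentTimestamp);
    }

    // Deprecated items are exactly those not seen in the current crawl,
    // i.e. whose stored timestamp differs from the current one.
    Set<String> deprecatedIds() {
        Set<String> result = new HashSet<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (entry.getValue() != currentTimestamp) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Note that this replaces an eager up-front set of all old URLs with a lookup at the end of the crawl, which is where the memory saving would come from when backed by persistent storage.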

Such an approach can solve the SubCrawler problem, provided that 
SubCrawlers apply the same timestamp to their IDs. This we can solve 
transparently, by giving the AccessData a "current timestamp" property 
maintained by CrawlerBase that is applied to each new or modified ID 
automatically by the AccessData. Also, CrawlerBase still has to know 
which IDs were generated by a SubCrawler for which parent IDs, so that 
child IDs can get their timestamp adjusted when the parent ID is 
reported as an unchanged object. Given the solution outlined in the 
previous mail (maintaining a set of links between parent IDs and child 
IDs in the AccessData), it has all the necessary information. In other 
words: we can solve it once and for all in CrawlerBase, no need to worry 
about it in concrete Crawler or CrawlerHandler implementations.
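The parent/child link set mentioned above could be sketched like this (hypothetical names again; in Aperture the links would live in the AccessData and be persisted between crawls):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: parent/child ID links let an unchanged parent
// refresh the timestamps of the IDs a SubCrawler derived from it.
class LinkedAccessData {
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Map<String, Set<String>> children = new HashMap<>();
    private long currentTimestamp;

    void startCrawl(long timestamp) { this.currentTimestamp = timestamp; }

    void touch(String id) { lastSeen.put(id, currentTimestamp); }

    // Recorded when a SubCrawler reports a child ID for a parent ID.
    void link(String parentId, String childId) {
        children.computeIfAbsent(parentId, k -> new HashSet<>()).add(childId);
    }

    // When the parent is reported unchanged, adjust the timestamps of all
    // (transitive) children so they are not mistaken for deleted resources.
    void reportUnchanged(String parentId) {
        Deque<String> queue = new ArrayDeque<>();
        queue.push(parentId);
        while (!queue.isEmpty()) {
            String id = queue.pop();
            touch(id);
            queue.addAll(children.getOrDefault(id, Set.of()));
        }
    }

    Set<String> deprecatedIds() {
        Set<String> result = new HashSet<>();
        lastSeen.forEach((id, ts) -> {
            if (ts != currentTimestamp) result.add(id);
        });
        return result;
    }
}
```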

Pros:
- solves SubCrawler issue of "undeprecating" URLs
- lowers memory consumption (we're running into OOMEs currently!)

Cons:
- slows down crawling: AccessData is modified for every inspected
item, especially noticeable when AccessData is backed by a Repository
- some changes in the API and/or Crawler-AccessData interaction, which
may still have consequences for Aperture users

con: too much architecture change



This basically means expanding the functionality of AccessData.

- timestamps.
This can be done without incompatible API changes, I hope. We could use
the initialize() method (which is already there) to initialize the
timestamp, then use this timestamp to mark all entries that have been
accessed or modified. This timestamp could then be used to find all
deleted resources.

- referenced IDs
This functionality is already there too, but some review of the crawlers
would be necessary to check whether they actually use it (I would bet
they don't). Theoretically this might lead to drastic improvements.

Other options:
- expose the deprecatedUrls to subcrawlers (I don't like it),
- place the SubCrawlerHandler functionality in the Crawler. Much more 
reasonable IMHO. Now that I think of it, it's even better.

Have Crawler extend SubCrawlerHandler and provide the 
DefaultSubCrawlerHandler functionality in the CrawlerBase.

Pros:
- no need for anyone to learn a new interface; 99% of users wouldn't
need to create their own implementations of SubCrawlerHandler
- no need for the DefaultSubCrawlerHandler class
- the applications will work with their own crawler handlers
- the default implementation in CrawlerBase will take care of
updating deprecated URLs, regardless of whether we implement the
timestamps or not

Cons:
- none IMHO
- we intermingle Crawler, SubCrawler and CrawlerHandler
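A rough shape of that idea, with hypothetical interfaces standing in for the real Aperture types (which have more methods and different signatures):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical shape of the proposal: the Crawler itself implements the
// SubCrawlerHandler callbacks, so SubCrawler results go through the same
// deprecatedUrls bookkeeping as the crawler's own results.
interface SubCrawlerHandlerSketch {
    void objectNew(String id);
    void objectChanged(String id);
    void objectNotModified(String id);
}

abstract class CrawlerBaseSketch implements SubCrawlerHandlerSketch {
    // Seeded from AccessData at crawl start; whatever remains at the
    // end of the crawl is reported as deleted.
    protected final Set<String> deprecatedUrls = new HashSet<>();

    // Default implementation: any reported ID is no longer deprecated,
    // no matter whether the Crawler or a SubCrawler reported it.
    public void objectNew(String id)         { deprecatedUrls.remove(id); }
    public void objectChanged(String id)     { deprecatedUrls.remove(id); }
    public void objectNotModified(String id) { deprecatedUrls.remove(id); }
}
```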

The CrawlerHandler has to decide, based on MIME type, whether to run a SubCrawler or not.
Then the SubCrawler has to be passed a SubCrawlerHandler which is connected to the currently running Crawler.

So we need a method "Crawler.getSubCrawlerHandler()" returning a SubCrawlerHandler that does the deprecatedUri thing,
which I like more.

BTW I wonder how it will actually work: we spot a zip file, apply a
ZipSubCrawler to it, get a stream to a file, identify it as a vCard,
pass it to a vCard SubCrawler, then process it, return an attached photo
and pass the photo to an ExifExtractor. And then the ZipSubCrawler would
get to another zip entry and hope that the zip stream will actually take
it there.

It may get really tricky in the general case: one stream, and a tree of
SubCrawlers/Extractors to process it. Some will create an in-memory
model of the content of the stream (like ical4j does), some will not.
This is going to be fun...
From the side of AccessData, it works with the deprecatedUrls if the SubCrawlers are always initialized by the main CrawlerHandler,
and the CrawlerHandler passes them to the main Crawler, which manages the deprecatedUrls.

In CrawlerHandlerBase, processBinary, we would have (after/before "// apply an Extractor if available")

// subcrawler needs to be a field, so that we can also "stop" it
subcrawler = /* pick the first from */ subCrawlerRegistry.getSubCrawlerFactoriesFor(mimeType);

subcrawler.subCrawl(dataObject.getID(), bufferedStream, crawler.getAccessData(),
        crawler.getSubCrawlerHandler(), null /* charset */, mimeType,
        dataObject.getMetadata());

At this point, these variables are missing:
* crawler (not passed to processBinary)
* crawler.getSubCrawlerHandler() (does not exist yet)
* charset is always null here

We see that crawler and subcrawler are intermingled here.

A PROBLEM is also that the Crawler.stop() method cannot reach the
SubCrawler, because the Crawler is not aware of it.

Looking at these problems, I would suppose that the following may be a better solution

add a method
"Crawler.runSubCrawler(SubCrawler subcrawler, DataObject object, InputStream stream, Charset charset, String mimeType)"

The Crawler can then react to stop() and invoke the stop of the SubCrawler,
and the Crawler can provide a hidden implementation (private class) of the SubCrawlerHandler,
* keeping track of the AccessData on objectNew|NotModified|Changed
* and reporting the DataObjects back to the original CrawlerHandler

This does not expose the recursive nature of the crawling process back to the crawlerhandler,
but at least connects the internal structures of the crawler and subcrawler quite well.
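Sketched as code (all names mirror the proposal in this mail, not an existing Aperture release; the lambda plays the role of the hidden private-class handler):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the proposed Crawler.runSubCrawler(...): the crawler keeps a
// reference to the running SubCrawler (so stop() can reach it) and hands
// it a hidden handler that updates deprecatedUrls and forwards objects
// to the original CrawlerHandler.
class CrawlerSketch {
    interface SubCrawlerSketch {
        void subCrawl(String parentId, ReportSketch handler);
        void stop();
    }

    // Functional stand-in for the hidden SubCrawlerHandler.
    interface ReportSketch {
        void objectNew(String id);
    }

    final Set<String> deprecatedUrls = new HashSet<>();
    final List<String> handedToCrawlerHandler = new ArrayList<>();
    private SubCrawlerSketch activeSubCrawler;

    void runSubCrawler(SubCrawlerSketch subCrawler, String parentId) {
        activeSubCrawler = subCrawler;  // reachable from stop()
        try {
            subCrawler.subCrawl(parentId, id -> {
                deprecatedUrls.remove(id);       // keep bookkeeping in sync
                handedToCrawlerHandler.add(id);  // report to original handler
            });
        } finally {
            activeSubCrawler = null;
        }
    }

    void stop() {
        if (activeSubCrawler != null) {
            activeSubCrawler.stop();
        }
    }
}
```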


All kinds of comments welcome

Antoni Mylka

Aperture-devel mailing list


DI Leo Sauermann 

Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH