Before the rest:
* 16 MB RAM is OK, if it doesn't grow bigger.
Another idea just hit me:
What if we move deprecatedUrls to the CrawlerHandler and let it count
which URLs were "new, unchanged, modified" (it gets informed anyway)
and then let it mass-delete the deleted resources itself?
Hm, this would break the idea that crawlers have optimized ways of
detecting changes... maybe dumb.
(This would, though, require us to release Aperture 2.0, as it's an
incompatibility break, and I would rather keep the current architecture
and write down this suggestion in the 2.0 wishlist.)
I have found another solution, see bottom.
Antoni Myłka wrote on 14.02.2008 14:47:
Christiaan Fluit writes:
Antoni Myłka wrote:
Leo Sauermann writes:
as said in my original mail, the main question is:
does subcrawler change the protected field "deprecatedUrls" of CrawlerBase?
No it doesn't, because it doesn't have access to it.
I've been thinking about a change in how deprecated URLs are handled
(purely for performance reasons), which may also solve this issue.
One of the problems is that the Crawler currently maintains the set of
all deprecated URLs in main memory. In other words, when you start to
incrementally crawl a source that after the last crawl had 100,000
files, you start your crawl by creating a set of 100,000 Strings in
main memory. Clearly not a very scalable solution, considering how
expensive Strings are. For 100,000 Strings of approx. 60 chars length
(still very modest numbers in our use cases - we like to go up an order
of magnitude or more), this already means about 16 MB of String data
(2 * #chars + 40 bytes per String).
I've been thinking about a different approach that can also solve the
SubCrawler vs. deprecated URLs problem. It boils down to giving each
crawled item, whether it is new, changed or unchanged, a timestamp in
the access data, reflecting the time when the crawl was started.
Detection of unchanged items then simply means looking up all items in
the AccessData with a different timestamp.
Such an approach can solve the SubCrawler problem, provided that
SubCrawlers apply the same timestamp to their IDs. This we can solve
transparently, by giving the AccessData a "current timestamp" property
maintained by CrawlerBase that is applied to each new or modified ID
automatically by the AccessData. Also, CrawlerBase still has to know
which IDs were generated by a SubCrawler for which parent IDs, so that
child IDs can get their timestamp adjusted when the parent ID is
reported as an unchanged object. Given the solution outlined in the
previous mail (maintaining a set of links between parent IDs and child
IDs in the AccessData), it has all the necessary information. In other
words: we can solve it once and for all in CrawlerBase, no need to worry
about it in concrete Crawler or CrawlerHandler implementations.
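To make the timestamp idea concrete, here is a minimal, self-contained sketch. TimestampAccessData and its method names are simplified stand-ins invented for this illustration, not the real AccessData interface:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy stand-in for Aperture's AccessData: maps each crawled ID to the
// timestamp of the crawl that last saw it.
class TimestampAccessData {
    private final Map<String, Long> touched = new HashMap<>();
    private long currentTimestamp;

    // Called once per crawl (e.g. from initialize()) with the crawl start time.
    public void initialize(long crawlStartTime) {
        this.currentTimestamp = crawlStartTime;
    }

    // Applied automatically to every new, modified *or unchanged* ID.
    public void touch(String id) {
        touched.put(id, currentTimestamp);
    }

    // Deprecated = everything whose timestamp differs from the current crawl's.
    public Set<String> findDeprecated() {
        Set<String> deprecated = new HashSet<>();
        for (Map.Entry<String, Long> e : touched.entrySet()) {
            if (e.getValue() != currentTimestamp) {
                deprecated.add(e.getKey());
            }
        }
        return deprecated;
    }

    public static void main(String[] args) {
        TimestampAccessData ad = new TimestampAccessData();
        ad.initialize(1L);                        // first crawl
        ad.touch("file:/a");
        ad.touch("file:/b");
        ad.initialize(2L);                        // second crawl: only /a remains
        ad.touch("file:/a");
        System.out.println(ad.findDeprecated()); // prints [file:/b]
    }
}
```

No in-memory set of all old URLs is ever built; detecting deprecated items becomes a query over the AccessData, which is what makes the approach scale.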
Pros:
- solves SubCrawler issue of "undeprecating" URLs
- lowers memory consumption (we're running into OOMEs currently!)

Cons:
- slows down crawling: means AccessData is modified for every inspected
item, especially noticeable when AccessData is backed by a Repository
- some changes in API and/or Crawler-AccessData interaction, which still
may have consequences for Aperture users
- too much architecture change
This basically means expanding the functionality of AccessData.
This can be done without incompatible API changes, I hope: we could use
the initialize() method (which is already there) to initialize the
timestamp and then use this timestamp to mark all entries that have been
accessed or modified. Then this timestamp could be used to find all
deprecated entries.
- referenced IDs
This functionality is already there too, but some review of the crawlers
would be necessary to check if they actually use it (I would bet they
don't). Theoretically this might lead to drastic improvements.
- expose the deprecatedUrls to subcrawlers (I don't like it),
- place the SubCrawlerHandler functionality in the Crawler. Much more
reasonable IMHO. Now that I think of it, it's even better.
Have Crawler extend SubCrawlerHandler and provide the
DefaultSubCrawlerHandler functionality in the CrawlerBase.
Pros:
- no need for anyone to learn a new interface, 99% of users wouldn't
need to create their own implementations of SubCrawlerHandler
- no need for the DefaultSubCrawlerHandler class
- the applications will work with their own crawler handlers
- the default implementation in the CrawlerBase will take care of
updating deprecated URLs, regardless of whether we implement the
timestamps or not

Cons:
- none IMHO
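Antoni's "Crawler extends SubCrawlerHandler" proposal could be sketched like this. The interfaces here are hypothetical, trimmed-down stand-ins; the real SubCrawlerHandler has more methods and different signatures:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, trimmed-down SubCrawler callback interface.
interface SubCrawlerHandler {
    void subObjectNew(String id);
    void subObjectNotModified(String id);
}

// The crawler base class itself implements SubCrawlerHandler, so SubCrawlers
// report straight back to the running crawler and deprecatedUrls can stay an
// internal detail instead of being exposed to SubCrawlers.
class MiniCrawlerBase implements SubCrawlerHandler {
    final Set<String> deprecatedUrls = new HashSet<>();
    final List<String> reported = new ArrayList<>();

    MiniCrawlerBase(Set<String> urlsFromLastCrawl) {
        deprecatedUrls.addAll(urlsFromLastCrawl);
    }

    // Default behavior: "undeprecate" every URL a SubCrawler touches and
    // forward the event to the application's CrawlerHandler (recorded here).
    @Override public void subObjectNew(String id) {
        deprecatedUrls.remove(id);
        reported.add("new: " + id);
    }

    @Override public void subObjectNotModified(String id) {
        deprecatedUrls.remove(id);
        reported.add("unchanged: " + id);
    }
}
```

The application's own CrawlerHandler never sees this plumbing; it just receives the usual callbacks.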
- We intermingle crawler, subcrawler and crawlerhandler.
CrawlerHandler has to decide, based on MIME type, whether to run a
sub-crawler or not.
Then, the sub-crawler has to be passed a SubCrawlerHandler which is
connected to the Crawler currently running.
So we need a method "Crawler.getSubCrawlerHandler()" returning a
SubCrawlerHandler that does the deprecatedUri thing,
which I like more.
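The "Crawler.getSubCrawlerHandler()" variant might look roughly like this. All names here are illustrative, sketching the suggestion rather than any existing Aperture API:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical shape of the suggestion: the Crawler exposes a
// SubCrawlerHandler that "undeprecates" every URI a SubCrawler reports.
class GetSubCrawlerHandlerSketch {
    interface SubCrawlerHandler {
        void objectAccessed(String uri);
    }

    final Set<String> deprecatedUrls = new HashSet<>();

    SubCrawlerHandler getSubCrawlerHandler() {
        // the returned handler does "the deprecatedUri thing":
        // any URI a SubCrawler reports is no longer deprecated
        return deprecatedUrls::remove;
    }
}
```

The CrawlerHandler would fetch this handler from the running Crawler and pass it to the SubCrawler it decides to start.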
From the side of AccessData, it works with the "deprecatedUrls" if the
SubCrawlers always get initialized by the main CrawlerHandler.
BTW, I wonder how it will actually work: we spot a zip file, apply a
ZipSubCrawler to it, get a stream to a file, identify it as a vCard,
pass it to a vCard subcrawler, then process it, return an attached photo
and pass the photo to an ExifExtractor. And then the ZipSubCrawler would
get to another zip entry and hope that the zip stream will actually take
it. It may get really tricky in a general case - one stream, a tree of
subcrawlers/extractors to process it. Some will create an in-memory
model of the content of the stream (like ical4j does) some will not.
This is going to be fun...
and the crawlerhandler passes them to the main crawler, which manages
the AccessData.
In CrawlerHandlerBase.processBinary, we would have (after/before "//
apply an Extractor if available"):

(subcrawler needs to be a field, to be able to "stop" it also)
subcrawler = (pick the first from ...)

At this point, these variables are missing:
* crawler (not passed to processBinary)
* crawler.getSubCrawlerHandler() (not existing)
* charset is always null here
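The "pick the first" step above could be sketched as a simple MIME-type lookup; SubCrawlerPicker is a made-up stand-in for whatever registry processBinary would actually consult:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical MIME-type -> SubCrawler lookup; "first registration wins"
// mirrors the "pick the first from ..." pseudocode above.
class SubCrawlerPicker {
    interface SubCrawler { }   // marker interface for the sketch

    private final Map<String, SubCrawler> byMimeType = new LinkedHashMap<>();

    void register(String mimeType, SubCrawler sc) {
        byMimeType.putIfAbsent(mimeType, sc); // keep the first one registered
    }

    // Returns null when no SubCrawler applies, so processBinary can fall
    // through to the plain "apply an Extractor if available" path.
    SubCrawler pickFirst(String mimeType) {
        return byMimeType.get(mimeType);
    }
}
```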
We see that crawler and subcrawler are intermingled here.
A problem is also that the "Crawler.stop()" method cannot reach the
subcrawler, because the crawler is not aware of it.
Looking at these problems, I would suppose that the following may be a
good solution: add a method
"Crawler.runSubCrawler(SubCrawler subcrawler, DataObject object,
InputStream stream, Charset charset, String mimeType)"
The crawler can then react to "stop()" and invoke the stop of the
subcrawler, and the crawler can provide a hidden implementation
(private class) of SubCrawlerHandler that takes care of:
* keeping track of the AccessData on objectNew|NotModified|changed
* and reporting the dataobjects back to the original crawlerhandler
This does not expose the recursive nature of the crawling process back
to the crawlerhandler,
but at least connects the internal structures of the crawler and
subcrawler quite well.
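A rough sketch of that runSubCrawler wiring, with every name illustrative rather than an existing Aperture API: the crawler keeps a field referencing the running SubCrawler so that Crawler.stop() can reach it, and would hand the SubCrawler a private handler that updates AccessData and reports back to the original CrawlerHandler:

```java
import java.util.ArrayList;
import java.util.List;

class RunSubCrawlerSketch {
    interface SubCrawler { void stop(); }

    private SubCrawler runningSubCrawler;        // field, so stop() can reach it
    final List<String> reportedToHandler = new ArrayList<>();

    void runSubCrawler(SubCrawler subCrawler, String objectId) {
        runningSubCrawler = subCrawler;
        try {
            // a real implementation would pump the stream through the
            // SubCrawler here; this sketch only records that the object
            // was handled and reported back
            reportedToHandler.add(objectId);
        } finally {
            runningSubCrawler = null;            // sub-crawl finished
        }
    }

    void stop() {
        SubCrawler sc = runningSubCrawler;
        if (sc != null) sc.stop();               // propagate stop downwards
    }
}
```

This keeps the recursion invisible to the CrawlerHandler while still letting stop() propagate to whatever sub-crawl is in flight.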
All kinds of comments welcome
DI Leo Sauermann http://www.dfki.de/~sauermann