Christiaan Fluit wrote on 20.09.2007 12:11:
My description reflects what I thought the implementations could be,
but you are right about how they actually are.
Leo Sauermann wrote:
"""I am implementing a CrawlerHandler to store crawled data in my RDF
store, and I wonder about multithreading: how do the crawlers report?
Once a crawler is started, the crawler handler's methods will get
called asynchronously by the crawler, and the crawler can take
arbitrarily long to finish. The methods of the crawler (start/stop) can
be called at any time and must be implemented thread-safe. The methods
of the CrawlerHandler (add/remove/changed resource) will be called
within the thread(s) of the crawler; a CrawlerHandler should synchronize
operations on its datastore so that multiple calls don't interfere.
When you implement a crawler, you can use multiple threads for crawling,
but be aware that calling the methods of the CrawlerHandler may block
your thread briefly."""
I am not sure whether your description is incorrect or whether I am
just misinterpreting it, so here is my feedback, to check whether we
agree on things, along with some suggested refinements.
You are right, we implemented the crawlers single-threaded.
But I assumed that the methods of the Crawler (start, stop) indicate
that it can be used asynchronously, and that the callback architecture
(added, deleted, etc.) opens the door to multi-threading.
We use multi-threaded crawling in Nepomuk, where (I think) we start
multiple crawlers in parallel and they all report back to the same
CrawlerHandler.
Antoni, what do you think about the multi-threading issues? Any
conclusion should be documented in the wiki and in the Javadoc.
The Crawlers I am familiar with (FileSystemCrawler, WebCrawler and
ImapCrawler) are all single-threaded, i.e. their crawl methods run
entirely within the Thread that invokes them, no new Threads are
started. This means that for these Crawlers the CrawlerHandler methods
are also invoked from this same Thread.
I would call this way of invoking callback methods synchronous; it
would be asynchronous if the CrawlerHandler methods were invoked from a
different Thread, correct?
I am not sure whether your use of the term "asynchronous" refers to the
current Crawler implementations (in which case I disagree, at least for
the Crawlers that I mentioned) or whether this is what you think you
should be prepared for in your CrawlerHandler implementation. The latter
may be good practice but may also be overkill when no Crawler exists
that actually creates new Threads. Is there any such Crawler at the moment?
If we do want to encourage people to implement their CrawlerHandler so
that it can be used with a multi-threaded Crawler, we should offer some
documentation on how that can/should be achieved. For example, is making
all CrawlerHandler method implementations synchronized enough already or
could that potentially lead to deadlocks when used with certain Crawlers?
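To make the "synchronized" option concrete, here is a minimal sketch of what such a handler could look like. All names here (SynchronizedHandler, objectNew) are hypothetical stand-ins, not Aperture's real API; the point is only that marking the callback methods synchronized serializes concurrent reports from multiple crawler threads on one shared instance:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Aperture's real API: a CrawlerHandler-style
// class whose callback methods are synchronized, so crawlers running in
// several threads can report to one shared instance without corrupting
// its internal state.
class SynchronizedHandler {
    private final List<String> store = new ArrayList<>();

    // synchronized serializes concurrent callbacks on this instance
    public synchronized void objectNew(String uri) {
        store.add(uri);
    }

    public synchronized int size() {
        return store.size();
    }
}
```

Note that this only protects the handler's own state; it cannot by itself rule out deadlocks if a Crawler holds its own locks while invoking the callback, which is exactly the open question here.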
Deadlocks and the like are really my biggest fear with this issue. I
don't like having to synchronize code that is invoked through a
callback API when it is not clear in which Thread the callbacks are
invoked. That is why I have always encouraged keeping Crawler
implementations single-threaded, so that the integrator who
creates/chooses the CrawlerHandler implementation and invokes the
Crawler.crawl method can control all threading.
Those Crawlers that really need multi-threading for e.g. performance
reasons can also do so by letting the Crawler.crawl method start a new
Thread that does the actual crawling, while the former Thread waits (as
in Object.wait) on some buffer through which crawl results are passed
internally. It can then loop over all results that the crawling
thread(s) create and pass them to the CrawlerHandler. This way
the CrawlerHandler still runs in the same Thread as the Crawler.crawl
method and you risk no deadlocks.
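That pattern can be sketched roughly as follows. The class and method names (BufferedCrawler, objectNew) are made up for illustration and differ from Aperture's real interfaces; the sketch also uses a BlockingQueue instead of raw Object.wait, which gives the same hand-off with less ceremony:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch, not Aperture's real API: the thread that calls
// crawl() drains a buffer filled by an internal worker thread, so every
// handler callback runs in the caller's thread.
interface CrawlerHandler {
    void objectNew(String uri); // simplified callback
}

class BufferedCrawler {
    private static final String END = "\u0000END"; // end-of-crawl marker

    public void crawl(CrawlerHandler handler) {
        BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        Thread worker = new Thread(() -> {
            // The actual crawling happens here; it may itself fan out to
            // more threads, as long as all results go through the buffer.
            for (String uri : new String[] {"file:///a.txt", "file:///b.txt"}) {
                buffer.add(uri);
            }
            buffer.add(END); // signal completion
        });
        worker.start();
        try {
            String uri;
            // Drain results in the caller's thread and invoke the handler
            // there: no callback ever runs in the worker thread.
            while (!(uri = buffer.take()).equals(END)) {
                handler.objectNew(uri);
            }
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

From the CrawlerHandler's point of view this is indistinguishable from a single-threaded Crawler, which is exactly why the deadlock risk disappears.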
I also noticed that the Javadoc does not mention anything on this topic yet.
Your suggestion seems clever: only the thread that invoked crawl() is
allowed to call the callback methods, while internally the crawler can
run multiple threads. That hides the synchronization problem inside the
crawlers. (Whoever wants to do it can have all the fun inside the
crawler.)
Also, I would still allow two crawlers to report back to the same
CrawlerHandler in parallel; this would mean making the CrawlerHandler's
callback methods synchronized, then we are on the safe side.
Aperture-devel mailing list
DI Leo Sauermann http://www.dfki.de/~sauermann