It was Christiaan Fluit who said at the right time 20.09.2007 12:11 the following words:
Leo Sauermann wrote:
"""I am implementing a CrawlerHandler to store crawled data in my RDF 
store, and I wonder about multithreading, how do the crawlers report 
crawled data?
Once a crawler is started, the crawler handler's methods will get 
called asynchronously by the crawler, and the crawler can take endlessly 
to finish. The methods of the crawler (start/stop) can be called anytime 
and must be implemented thread safe. The methods of the CrawlerHandler 
(add/remove/changed resource) will be called within the thread(s) of the 
crawler, a Crawler Handler should synchronize operations on its 
datastore so that multiple calls don't interfere.
When you implement a crawler, you can use multiple threads for crawling, 
but be aware that calling the methods of the CrawlerHandler can stop 
your thread a little."""

I am not sure whether your description is incorrect or whether I just 
interpret it wrongly, so here's my feedback, to check if we agree on 
things, and some suggestions on refinements.
My description says what I thought the implementations could be; you are right about how they actually are.
We did implement the crawlers single-threaded.

But I assumed that the methods of the Crawler (start, stop) indicate that it can be used asynchronously,
and that the callback architecture (added, deleted, etc.) lends itself to multi-threading.

We use multi-threaded crawling in Nepomuk: as far as I know, we start multiple crawlers in parallel and they all report back to the same CrawlerHandler.

Antoni, what do you think about the multi-threading issues?

More below...
The Crawlers I am familiar with (FileSystemCrawler, WebCrawler and 
ImapCrawler) are all single-threaded, i.e. their crawl methods run 
entirely within the Thread that invokes them, no new Threads are 
started. This means that for these Crawlers the CrawlerHandler methods 
are also invoked from this same Thread.

I would call this way of invoking callback methods synchronous, it would 
be asynchronous if invocation of the CrawlerHandler methods would take 
place from a different Thread, correct?
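
To make the distinction concrete, here is a minimal sketch of a synchronous callback. SimpleHandler and crawl() are hypothetical stand-ins invented for this mail, not the real Aperture interfaces. Because crawl() starts no Thread of its own, every callback runs in the Thread of whoever called crawl().

```java
import java.util.Arrays;
import java.util.List;

class SynchronousCallbackSketch {

    /** Hypothetical stand-in for a CrawlerHandler; not the Aperture interface. */
    interface SimpleHandler {
        void resourceAdded(String uri);
    }

    /** Single-threaded crawl: no Thread is created, so callbacks run in the caller's Thread. */
    static void crawl(List<String> uris, SimpleHandler handler) {
        for (String uri : uris) {
            handler.resourceAdded(uri);
        }
    }

    /** Returns true if every callback ran in the Thread that invoked crawl(). */
    static boolean callbacksRunInCallerThread() {
        Thread caller = Thread.currentThread();
        boolean[] sameThread = {true};
        crawl(Arrays.asList("file:/a", "file:/b"), uri -> {
            if (Thread.currentThread() != caller) {
                sameThread[0] = false;
            }
        });
        return sameThread[0];
    }

    public static void main(String[] args) {
        System.out.println(callbacksRunInCallerThread());
    }
}
```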

I am not sure whether your use of the term "asynchronous" refers to the 
current Crawler implementations (in which case I disagree, at least for 
the Crawlers that I mentioned) or whether this is what you think you 
should be prepared for in your CrawlerHandler implementation. The latter 
may be good practice but may also be overkill when no Crawler exists 
that actually creates new Threads. Is there any such Crawler at the moment?

If we do want to encourage people to implement their CrawlerHandler so 
that it can be used with a multi-threaded Crawler, we should offer some 
documentation on how that can/should be achieved. For example, is making 
all CrawlerHandler method implementations synchronized enough already or 
could that potentially lead to deadlocks when used with certain Crawlers?

Deadlocks and the like are really my biggest fear with this issue. I don't 
like having to synchronize stuff in code that is invoked through a 
callback API when it's not clear in which Thread the callbacks are 
invoked. That is why I have always encouraged keeping Crawler implementations 
single-threaded, so that the integrator that creates/chooses the 
CrawlerHandler implementation and invokes the Crawler.crawl method can 
control all threading.

Those Crawlers that really need multi-threading for e.g. performance 
reasons can also do so by letting the Crawler.crawl method start a new 
Thread that does the actual crawling and letting the former Thread wait 
(as in Object.wait) on some buffer through which crawl results are 
passed internally. It can then loop over all results that the 
crawling thread(s) create and pass them to the CrawlerHandler. This way 
the CrawlerHandler still runs in the same Thread as the Crawler.crawl 
method and you risk no deadlocks.

I also noticed that the Javadoc does not mention anything on this topic yet.
Any conclusion should be documented in the wiki and in the Javadoc.

Your suggestion seems clever: only the thread that invoked crawl() is allowed to call the callback methods. Internally the crawler can run multiple threads, which hides the synchronization problem inside the crawler.
(Whoever wants to can have all the fun inside the crawler.)
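
A rough sketch of how I read this suggestion, under some assumptions: SimpleHandler and crawl() are hypothetical names, not the Aperture API, and I use a java.util.concurrent.BlockingQueue as the buffer instead of a hand-written Object.wait loop. Worker threads only fill the buffer; the thread that called crawl() is the only one that ever touches the handler.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BufferedCrawlSketch {

    /** Hypothetical stand-in for a CrawlerHandler; not the Aperture interface. */
    interface SimpleHandler {
        void resourceAdded(String uri);
    }

    /**
     * Crawls each source in its own worker thread, but invokes the handler
     * only from the thread that called crawl().
     */
    static void crawl(List<List<String>> sources, SimpleHandler handler) {
        BlockingQueue<Optional<String>> buffer = new LinkedBlockingQueue<>();
        for (List<String> source : sources) {
            new Thread(() -> {
                try {
                    for (String uri : source) {
                        buffer.add(Optional.of(uri)); // workers only fill the buffer
                    }
                } finally {
                    buffer.add(Optional.empty());     // empty value: this worker is done
                }
            }).start();
        }
        int running = sources.size();
        while (running > 0) {
            Optional<String> item;
            try {
                item = buffer.take();                 // the calling thread blocks here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // keep the interrupt flag and stop reporting
                return;
            }
            if (item.isPresent()) {
                handler.resourceAdded(item.get());    // always the caller's thread
            } else {
                running--;
            }
        }
    }

    public static void main(String[] args) {
        crawl(Arrays.asList(Arrays.asList("file:/a", "file:/b"), Arrays.asList("imap:/c")),
              uri -> System.out.println(Thread.currentThread().getName() + " got " + uri));
    }
}
```

The order in which results arrive depends on thread scheduling, but the handler itself needs no synchronization in this setup.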

Also, I would still allow two crawlers to report back to the same CrawlerHandler in parallel. That would mean making the CrawlerHandler's callback methods synchronized; then we are on the safe side.
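
What I have in mind for the shared handler, as a sketch only: StoringHandler and crawlInParallel() are made-up names, and the ArrayList stands in for a real datastore. Two single-threaded "crawlers" run in parallel and report to one handler whose callbacks are synchronized on the handler instance.

```java
import java.util.ArrayList;
import java.util.List;

class SharedHandlerSketch {

    /** Hypothetical handler; every callback holds the handler's own lock. */
    static class StoringHandler {
        private final List<String> store = new ArrayList<>(); // not thread-safe by itself

        synchronized void resourceAdded(String uri) {
            store.add(uri);
        }

        synchronized int size() {
            return store.size();
        }
    }

    /** Runs two single-threaded crawlers in parallel against one handler. */
    static int crawlInParallel(int resourcesPerCrawler) {
        StoringHandler handler = new StoringHandler();
        Runnable crawler = () -> {
            for (int i = 0; i < resourcesPerCrawler; i++) {
                handler.resourceAdded(Thread.currentThread().getName() + "/" + i);
            }
        };
        Thread first = new Thread(crawler, "crawler-1");
        Thread second = new Thread(crawler, "crawler-2");
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return handler.size();
    }

    public static void main(String[] args) {
        System.out.println(crawlInParallel(10000));
    }
}
```

In this sketch the callbacks never call back into a crawler and never take a second lock, so there is no lock ordering that could deadlock. A handler that does call back into the crawler while holding its lock would need a closer look.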



Kind regards,



DI Leo Sauermann 

Deutsches Forschungszentrum fuer 
Kuenstliche Intelligenz DFKI GmbH
Trippstadter Strasse 122
P.O. Box 2080           Fon:   +49 631 20575-116
D-67663 Kaiserslautern  Fax:   +49 631 20575-102
Germany                 Mail:
