#6 Application does not exit when multiple Threads are used

Milestone: weblech-0.0.3
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2004-06-11
Created: 2004-06-11
Creator: Anonymous
Private: No

When indexing a site with 1000 links, we decided to use
10 threads. All files were downloaded at the speed of
light, but the DOS window never closed. We waited
20 minutes and it stayed open, paused on the last URL.
Re-running the test with Threads=1 was no problem
(albeit slower).

BTW, we added a Thread.sleep in Spider.downloadURL()
after every HTTP GET to avoid killing the server
(important on some servers):

obj = urlGetter.getURL(url);
try {
    Thread.sleep(config.getDelay());
} catch (InterruptedException iex) {
    // Ignore the interruption and carry on with the next URL
}

Discussion

  • Marc

    Marc - 2004-06-12

    Logged In: YES
    user_id=1061819

    Have found the cause. For example, if a page has the following
    link:
    <a href="" onclick="xxx">link</a>
    the resulting URL is:
    onclick="xxx
    This causes an IOException in URLGetter when it tries to open
    this invalid URL.
    Solution in HTMLParser:
    Old code:
    int closeQuotePos = input.indexOf("\"", attrPos + attrStr.length() + 1);
    New code:
    int closeQuotePos = input.indexOf("\"", attrPos + attrStr.length());

    However, this is just a fix for this specific case. The general
    problem is that the IOException is not handled correctly, even
    though it is caught. I think we end up with a null URL in the
    queue somehow.
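
    To make the off-by-one concrete, here is a standalone sketch
    (not the actual weblech HTMLParser source; the input string and
    the href=" marker are assumptions for illustration). With an
    empty href value, the old "+1" starts the search one character
    past the value's closing quote, so the scan runs into the
    onclick attribute instead:

```java
public class QuoteScanDemo {
    public static void main(String[] args) {
        String input = "<a href=\"\" onclick=\"xxx\">link</a>";
        String attrStr = "href=\"";                  // marker the parser scans for
        int attrPos = input.indexOf(attrStr);        // position of href="
        int valueStart = attrPos + attrStr.length(); // first char of the value

        // Old code: searching from valueStart + 1 skips the closing
        // quote of the empty value and finds the onclick quote instead.
        int oldClose = input.indexOf("\"", valueStart + 1);
        // New code: searching from valueStart finds the correct quote.
        int newClose = input.indexOf("\"", valueStart);

        System.out.println("old: [" + input.substring(valueStart, oldClose) + "]");
        // prints: old: [" onclick=]
        System.out.println("new: [" + input.substring(valueStart, newClose) + "]");
        // prints: new: []
    }
}
```

    With the fix, the empty href yields an empty string, which the
    caller can then discard instead of queueing garbage.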

     
  • Marc

    Marc - 2004-06-17

    Logged In: YES
    user_id=1061819

    I discovered the reason why the application never exits.
    There is a variable called "downloadsInProgress"
    which is supposed to keep track of the number of downloads
    currently in progress. If it gets out of sync with the
    actual number of downloads, the system thinks there are
    still downloads occurring when there aren't.
    3 suggestions:
    1) The system should never wait for downloads forever. A
    simple timeout should occur; see Listing 1 below.
    2) The bug that causes the system to lose track of how
    many downloads are in progress is commented in Listing 2
    below.
    3) Keeping a separate variable just to count something is not
    a good idea. The variable is superfluous; one should rely on
    urlsDownloading.size() to determine the number of
    downloads in progress.

    Listing 1 (Spider.run())
    -------------------------------------
    if (queueSize() == 0 && downloadsInProgress > 0) {
        timeoutTimer++;
        _logClass.info("Queue is empty. Waiting for " +
            downloadsInProgress + " downloadsInProgress, timer=" +
            timeoutTimer);
        // Wait for a download to finish before seeing if
        // this thread should stop
        try {
            Thread.sleep(QUEUE_CHECK_INTERVAL);
        } catch (InterruptedException ignored) {
        }
        // Check the timeout
        if (timeoutTimer > 20) {
            // Break this loop due to timeout
            _logClass.error("Timed out waiting for last " +
                downloadsInProgress + " (" + urlsDownloading.size() +
                ") downloads. Aborting");
            for (Iterator i = urlsDownloading.iterator(); i.hasNext(); ) {
                URL u = (URL) i.next();
                _logClass.error(" - " + u.toExternalForm());
            }
            synchronized (urlsDownloading) {
                urlsDownloading.clear();
                break;
            }
        }
        // Have another go at the loop
        continue;
    }
    -------------------------------------
    Listing 2 (Spider.run())
    -------------------------------------
    downloadsInProgress++;
    synchronized (urlsDownloading) {
        urlsDownloading.add(nextURL);
    }
    int newDepth = nextURL.getDepth() + 1;
    int maxDepth = config.getMaxDepth();
    // The following line is necessary and was missing
    downloadsInProgress--;
    synchronized (urlsDownloading) {
        urlsDownloading.remove(nextURL);
    }
    -------------------------------------
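
    Suggestion 3 could be sketched like this (a hypothetical helper
    class, not weblech code; the names follow the listings above).
    The in-progress count is derived from the set itself, so there
    is no separate counter that can drift out of sync:

```java
import java.util.HashSet;
import java.util.Set;

public class DownloadTracker {
    private final Set<String> urlsDownloading = new HashSet<>();

    public void started(String url) {
        synchronized (urlsDownloading) {
            urlsDownloading.add(url);
        }
    }

    public void finished(String url) {
        synchronized (urlsDownloading) {
            urlsDownloading.remove(url);
        }
    }

    // Replaces the downloadsInProgress counter: the size of the set
    // is always exactly the number of downloads in flight.
    public int downloadsInProgress() {
        synchronized (urlsDownloading) {
            return urlsDownloading.size();
        }
    }
}
```

    Since adding to the set and incrementing the counter can never
    be forgotten separately, the "stuck waiting forever" failure
    mode disappears by construction.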

     
  • scruf

    scruf - 2005-07-13

    Logged In: YES
    user_id=1311727

    This is a real bummer. I kept waiting and waiting. A
    simple Control-C stopped the process and allowed a clean exit.

     
