When indexing a site with 1000 links we decided to use
10 threads. All files were downloaded at the speed of
light, but the DOS window never closed. We waited
20 minutes and it stayed open, paused on the last URL.
Re-running the test with Threads=1 worked fine (albeit
slower).
By the way, we added a Thread.sleep in Spider.downloadURL()
after every HTTP GET to avoid overloading the server
(important on some servers).
obj = urlGetter.getURL(url);
// Throttle: pause between requests so the target server
// is not hammered by back-to-back GETs.
try {
    Thread.sleep(config.getDelay());
} catch (InterruptedException iex) {
    // Ignore - waking up early is harmless here
}
Logged In: YES
user_id=1061819
Have found the cause. For example, if a page has the following
link:
<a href="" onclick="xxx">link</a>
The resulting URL is:
onclick="xxx
This causes an IOException in URLGetter when it tries to open
this invalid URL.
Solution in HTMLParser:
Old code:
int closeQuotePos = input.indexOf("\"", attrPos + attrStr.length() + 1);
New code:
int closeQuotePos = input.indexOf("\"", attrPos + attrStr.length());
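To see why the "+1" matters, here is a minimal standalone sketch
(not the actual HTMLParser source; input and attrStr are assumed
values). With href="" the old start position jumps over the value's
closing quote, so indexOf matches the quote belonging to onclick
instead:

// Standalone demo of the off-by-one; not taken from HTMLParser.
public class QuoteScanDemo {
    public static void main(String[] args) {
        String input = "<a href=\"\" onclick=\"xxx\">link</a>";
        String attrStr = "href=\"";           // assumed attribute prefix
        int attrPos = input.indexOf(attrStr);
        int valueStart = attrPos + attrStr.length();
        // Old code: starts one char past the opening quote and
        // therefore skips the closing quote of an empty value.
        int oldClose = input.indexOf("\"", valueStart + 1);
        // New code: starts directly after the opening quote.
        int newClose = input.indexOf("\"", valueStart);
        System.out.println("old: [" + input.substring(valueStart, oldClose) + "]");
        System.out.println("new: [" + input.substring(valueStart, newClose) + "]");
        // Prints: old: [" onclick=]   new: []
    }
}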
However, this only fixes the specific case. The general problem is
that the IOException is not handled correctly - even though it is
caught, I think we somehow end up with a null URL in the queue.
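A more general safeguard would be to validate every extracted link
before it goes into the queue, so a parser glitch can never enqueue
a null or garbage URL. This is only a sketch of the idea; the helper
name isQueueable is hypothetical and not part of the current Spider
source:

import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical guard: reject links that cannot possibly be fetched.
private boolean isQueueable(String link) {
    if (link == null || link.trim().length() == 0) {
        return false; // empty href, or the parser returned nothing
    }
    try {
        URL u = new URL(link);
        String proto = u.getProtocol();
        return proto.equals("http") || proto.equals("https");
    } catch (MalformedURLException mex) {
        return false; // e.g. the stray onclick="xxx fragment
    }
}

Rejected links would simply be logged and skipped, so they never
become downloads in progress.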
Logged In: YES
user_id=1061819
I discovered the reason why the application never exits. There is
a variable called "downloadsInProgress" which is supposed to track
the number of downloads currently running. If it gets out of sync
with the actual number of downloads, the system thinks downloads
are still occurring when there aren't any.
3 Suggestions:
1) The system should never wait forever for downloads. A
simple timeout should occur, see Listing 1 below.
2) The bug which causes the system to lose track of how
many downloads are in progress is commented in Listing 2
below.
3) The idea of a separate variable to count something is not
good. This variable is superfluous; one should rely on
urlsDownloading.size() to determine the number of downloads
in progress (see the sketch after Listing 2).
Listing 1 (Spider.run())
-------------------------------------
if (queueSize() == 0 && downloadsInProgress > 0) {
    timeoutTimer++;
    _logClass.info("Queue is empty. Waiting for " +
        downloadsInProgress + " downloadsInProgress, timer=" +
        timeoutTimer);
    // Wait for a download to finish before seeing if
    // this thread should stop
    try {
        Thread.sleep(QUEUE_CHECK_INTERVAL);
    } catch (InterruptedException ignored) {
    }
    // Check the timeout
    if (timeoutTimer > 20) {
        // Break this loop due to timeout
        _logClass.error("Timed out waiting for last " +
            downloadsInProgress + " (" + urlsDownloading.size() +
            ") downloads. Aborting");
        // Log and clear under one lock so the set cannot change
        // while we iterate over it.
        synchronized (urlsDownloading) {
            for (Iterator i = urlsDownloading.iterator(); i.hasNext(); ) {
                URL u = (URL) i.next();
                _logClass.error(" - " + u.toExternalForm());
            }
            urlsDownloading.clear();
        }
        break;
    }
    // Have another go at the loop
    continue;
}
-------------------------------------
Listing 2 (Spider.run())
-------------------------------------
downloadsInProgress++;
synchronized (urlsDownloading) {
    urlsDownloading.add(nextURL);
}
int newDepth = nextURL.getDepth() + 1;
int maxDepth = config.getMaxDepth();
// The following line is necessary and was missing
downloadsInProgress--;
synchronized (urlsDownloading) {
    urlsDownloading.remove(nextURL);
}
-------------------------------------
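Regarding suggestion 3, one way to make the count impossible to
de-synchronize is to derive it from urlsDownloading itself. A
minimal sketch, assuming urlsDownloading is a plain java.util
collection; the helper name downloadsInProgressCount is
hypothetical:

// Hypothetical replacement for the separate counter: the collection
// that is already maintained is the single source of truth.
private int downloadsInProgressCount() {
    synchronized (urlsDownloading) {
        return urlsDownloading.size();
    }
}

Spider.run() would then check queueSize() == 0 &&
downloadsInProgressCount() > 0 instead of reading the separate
field, and the bug commented in Listing 2 could not occur.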
Logged In: YES
user_id=1311727
This is a real bummer. I kept waiting and waiting. A
simple Control-C stopped the process and allowed a clean exit.