When indexing a site with 1000 links we decided to use
10 threads. All files were downloaded at the speed of
light, but the DOS window never closed. We waited
20 minutes and it stayed open, paused on the last URL.
Re-running the test with Threads=1 worked fine (albeit
slower).
By the way, we added a Thread.sleep in Spider.downloadURL()
after every HTTP GET to avoid overloading the server
(important on some servers).
obj = urlGetter.getURL(url);
// Throttle: pause between requests so the target server
// is not hammered by back-to-back GETs.
try {
    Thread.sleep(config.getDelay());
} catch (InterruptedException iex) {
    // Ignore - waking up early is harmless here
}
Logged In: YES
user_id=1061819
Have found the cause. For example, if a page has the following
link:
<a href="" onclick="xxx">link</a>
The resulting URL is:
onclick="xxx
This causes an IOException in URLGetter when it tries to open
this invalid URL.
Solution in HTMLParser:
Old code:
int closeQuotePos = input.indexOf("\"", attrPos + attrStr.length() + 1);
New code:
int closeQuotePos = input.indexOf("\"", attrPos + attrStr.length());
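To see why the "+1" matters, here is a minimal standalone sketch
(not the actual HTMLParser source; input and attrStr are assumed
values). With href="" the old start position jumps over the value's
closing quote, so indexOf matches the quote belonging to onclick
instead:

// Standalone demo of the off-by-one; not taken from HTMLParser.
public class QuoteScanDemo {
    public static void main(String[] args) {
        String input = "<a href=\"\" onclick=\"xxx\">link</a>";
        String attrStr = "href=\"";           // assumed attribute prefix
        int attrPos = input.indexOf(attrStr);
        int valueStart = attrPos + attrStr.length();
        // Old code: starts one char past the opening quote and
        // therefore skips the closing quote of an empty value.
        int oldClose = input.indexOf("\"", valueStart + 1);
        // New code: starts directly after the opening quote.
        int newClose = input.indexOf("\"", valueStart);
        System.out.println("old: [" + input.substring(valueStart, oldClose) + "]");
        System.out.println("new: [" + input.substring(valueStart, newClose) + "]");
        // Prints: old: [" onclick=]   new: []
    }
}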
However, this only fixes the specific case. The general problem is
that the IOException is not handled correctly - even though it is
caught, I think we somehow end up with a null URL in the queue.
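A more general safeguard would be to validate every extracted link
before it goes into the queue, so a parser glitch can never enqueue
a null or garbage URL. This is only a sketch of the idea; the helper
name isQueueable is hypothetical and not part of the current Spider
source:

import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical guard: reject links that cannot possibly be fetched.
private boolean isQueueable(String link) {
    if (link == null || link.trim().length() == 0) {
        return false; // empty href, or the parser returned nothing
    }
    try {
        URL u = new URL(link);
        String proto = u.getProtocol();
        return proto.equals("http") || proto.equals("https");
    } catch (MalformedURLException mex) {
        return false; // e.g. the stray onclick="xxx fragment
    }
}

Rejected links would simply be logged and skipped, so they never
become downloads in progress.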
Logged In: YES
user_id=1061819
I discovered the reason why the application never exits. There is
a variable called "downloadsInProgress" which is supposed to track
the number of downloads currently running. If it gets out of sync
with the actual number of downloads, the system thinks downloads
are still occurring when there aren't any.
3 Suggestions:
1) The system should never wait forever for downloads. A
simple timeout should occur, see Listing 1 below.
2) The bug which causes the system to lose track of how
many downloads are in progress is commented in Listing 2
below.
3) The idea of a separate variable to count something is not
good. This variable is superfluous; one should rely on
urlsDownloading.size() to determine the number of downloads
in progress (see the sketch after Listing 2).
Listing 1 (Spider.run())
-------------------------------------
if (queueSize() == 0 && downloadsInProgress > 0) {
    timeoutTimer++;
    _logClass.info("Queue is empty. Waiting for " +
        downloadsInProgress + " downloadsInProgress, timer=" +
        timeoutTimer);
    // Wait for a download to finish before seeing if
    // this thread should stop
    try {
        Thread.sleep(QUEUE_CHECK_INTERVAL);
    } catch (InterruptedException ignored) {
    }
    // Check the timeout
    if (timeoutTimer > 20) {
        // Break this loop due to timeout
        _logClass.error("Timed out waiting for last " +
            downloadsInProgress + " (" + urlsDownloading.size() +
            ") downloads. Aborting");
        // Log and clear under one lock so the set cannot change
        // while we iterate over it.
        synchronized (urlsDownloading) {
            for (Iterator i = urlsDownloading.iterator(); i.hasNext(); ) {
                URL u = (URL) i.next();
                _logClass.error(" - " + u.toExternalForm());
            }
            urlsDownloading.clear();
        }
        break;
    }
    // Have another go at the loop
    continue;
}
-------------------------------------
Listing 2 (Spider.run())
-------------------------------------
downloadsInProgress++;
synchronized (urlsDownloading) {
    urlsDownloading.add(nextURL);
}
int newDepth = nextURL.getDepth() + 1;
int maxDepth = config.getMaxDepth();
// The following line is necessary and was missing
downloadsInProgress--;
synchronized (urlsDownloading) {
    urlsDownloading.remove(nextURL);
}
-------------------------------------
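Regarding suggestion 3, one way to make the count impossible to
de-synchronize is to derive it from urlsDownloading itself. A
minimal sketch, assuming urlsDownloading is a plain java.util
collection; the helper name downloadsInProgressCount is
hypothetical:

// Hypothetical replacement for the separate counter: the collection
// that is already maintained is the single source of truth.
private int downloadsInProgressCount() {
    synchronized (urlsDownloading) {
        return urlsDownloading.size();
    }
}

Spider.run() would then check queueSize() == 0 &&
downloadsInProgressCount() > 0 instead of reading the separate
field, and the bug commented in Listing 2 could not occur.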
Logged In: YES
user_id=1311727
This is a real bummer. I kept waiting and waiting. A
simple Control-C stopped the process and allowed a clean exit.