Share

Heritrix: Internet Archive Web Crawler

Tracker: Bugs

9 NoSuchElementException in URI queues halts crawling - ID: 973294
Last Update: Comment added ( karl-ia )

A crawl that had been progressing very well suddenly
stopped making progress about 75 minutes into the
crawl. No alerts and the runtime-errors and
local-errors logs did not contain any relevant errors.

Looking through the heritrix_out.log the a flurry of
the following Exceptions could be found at the time the
crawl stopped making progress

----
java.util.NoSuchElementException
at
org.archive.util.DiskQueue.dequeue()Ljava.lang.Object;(DiskQueue.java:181)
at
org.archive.util.DiskBackedDeque.backingDequeue()Ljava.lang.Object;(DiskBac
kedDeque.java:114)
at
org.archive.util.DiskBackedQueue.dequeue()Ljava.lang.Object;(DiskBackedQueu
e.java:104)
at
org.archive.crawler.frontier.KeyedQueue.dequeue()Lorg.archive.crawler.datam
odel.CrawlURI;(Optimized
Method)
at
org.archive.crawler.frontier.Frontier.dequeueFromReady()Lorg.archive.crawle
r.datamodel.CrawlURI;(Optimized
Method)
at
org.archive.crawler.frontier.Frontier.next()Lorg.archive.crawler.datamodel.
CrawlURI;(Optimized
Method)
at
org.archive.crawler.frontier.Frontier.next(I)Lorg.archive.crawler.datamodel
.CrawlURI;(Frontier.java:573)
at
org.archive.crawler.framework.ToeThread.run()V(ToeThread.java:135)
at
java.lang.Thread.startThreadFromVM(Ljava.lang.Thread;)V(Unknown
Source)
-----

The crawl was a 'wide' crawl of 1000 .is domains using
a custom scope based off the Domain scope (I can not
see any relation between it and this error). It had
covered approx. 230,000 documents (with 310,000
waiting) when this occured. The Frontier report did not
contain any hint that anything was amiss.


Kristinn Sigurdsson ( kristinn_sig ) - 2004-06-15 14:59

9

Closed

Fixed

Gordon Mohr

Disk I/O

None

Public


Comments ( 13 )

Date: 2007-03-14 00:13
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-175 -- please add further
comments at that location.


Date: 2004-10-04 18:55
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

I ran a crawl with HEAD over the weekend using both Ansi and
Tom Emerson's problematic URLs and crawl is still running
today monday midday. Closing as fixed.


Date: 2004-09-30 23:37
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Added a fix for the case of www.changeup.com.%3C.

Here is commit:

* src/java/org/archive/crawler/datamodel/UURIFactory.java
We were removing the first slash in a path if the first
thing in
path was a '/..' even if was '/...' (Assumption in the
regex was
that we'd find '/../' or some such).
* src/java/org/archive/crawler/datamodel/UURIFactoryTest.java
Test for case of '/...'.




Date: 2004-09-30 00:46
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Here is how we got to the bad URL (From Tom):

2004-09-27-20-44-03 - http://www.joins.com/
2004-09-27-20-44-44 L http://plus.joins.com/
2004-09-27-20-44-29 P http://plus.joins.com/robots.txt
2004-09-27-20-44-35 R http://www.joins.com/joinsplus
2004-09-27-20-45-00 R http://www.joins.com/joinsplus/
2004-09-27-20-47-37 L http://license.joins.com/
2004-09-27-21-01-14 L
http://license.joins.com/main.asp?gate=02&g_Page=changeup&lic_id=70
2004-09-27-23-02-17 L
http://license.joins.com/board/etc_board_list.asp?board_name=new_main&gate=02&g_Page=changeup&g_sPage=notice&site=changeup&b_Flag=List&lic_id=70&b_type=&nPage=1&keyfield=all&keyword=&category=G&g_alt=%EA%B3%B5%EC%A7%80%EC%82%AC%ED%95%AD
2004-09-27-23-07-09 E http://www.changeup.com.%3C/a


Date: 2004-09-29 20:45
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Here's a new permutation from Tom Emerson:

09/27/2004 19:03:34 -0400 WARNING org.archive.util.DevUtils
warnHandle Failed to get class key: Illegal domain label:
www.changeup.com.%3c CrawlURI(dns:www.changeup.com.%3C)
org.apache.commons.httpclient.URIException: Illegal domain
label: www.changeup.com.%3c at
org.archive.crawler.datamodel.UURIFactory.checkDomainlabel(UURIFactory.java:550)
at
org.archive.crawler.datamodel.UURIFactory.fixup(UURIFactory.java:446)
at
org.archive.crawler.datamodel.UURIFactory.create(UURIFactory.java:282)
at
org.archive.crawler.datamodel.UURIFactory.create(UURIFactory.java:272)
at
org.archive.crawler.datamodel.UURIFactory.getInstance(UURIFactory.java:241)
at
org.archive.crawler.datamodel.CrawlURI.calculateClassKey(CrawlURI.java:397)
at
org.archive.crawler.datamodel.CrawlURI.getClassKey(CrawlURI.java:380)
at
org.archive.crawler.frontier.HostQueuesFrontier.keyedQueueFor(HostQueuesFrontier.java:889)
at
org.archive.crawler.frontier.HostQueuesFrontier.enqueueToKeyed(HostQueuesFrontier.java:961)
at
org.archive.crawler.frontier.HostQueuesFrontier.receive(HostQueuesFrontier.java:465)
at
org.archive.crawler.util.FPUriUniqFilter.addForce(FPUriUniqFilter.java:79)
at
org.archive.crawler.frontier.HostQueuesFrontier.innerSchedule(HostQueuesFrontier.java:419)
at
org.archive.crawler.frontier.HostQueuesFrontier.innerBatchFlush(HostQueuesFrontier.java:397)
at
org.archive.crawler.frontier.HostQueuesFrontier.innerFinished(HostQueuesFrontier.java:646)
at
org.archive.crawler.frontier.HostQueuesFrontier.finished(HostQueuesFrontier.java:629)


Looks like we need to fail domains before we give them to
the URI parent class inside in the fixup code.


Date: 2004-09-28 18:34
Sender: stack-sfProject Admin

Logged In: YES
user_id=924942

Reopening because not fully squashed yet. I have a checkin
to make that I think will take care of the last seen
instance found in a broad crawl -- the check for > 2083 was
applied to one constructor of UURI only, need to apply to
other constructor -- and hopefully that'll fix it but
keeping issue open while working on it.


Date: 2004-06-30 19:24
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

When a keyedqueue's last item required retrying, the old
queue had been discarded. A new queue was created, which in
site-first mode would be INACTIVE (and on the
inactive-queues list). However, the code for scheduling a
delayed retry assumed (with an assertion), that the queue
would always be READY or EMPTY.

Without assertions enabled, the queue would be snoozed,
resulting in it having a single item, and being on both the
snooze-list and the inactive-list. Thus both when it wakes,
and when it was activated, it would be expected to have
items -- which it might not, if the other event happened first.

This is believed to be the cause of the NoSuchElementException.

Fixed by making the snooze-code in Frontier aware of what to
do if a queue is INACTIVE (activate it and remove it from
the inactive list) -- so that the snooze takes precedence.


Date: 2004-06-16 14:45
Sender: kristinn_sigProject Admin

Logged In: YES
user_id=892643

Looking through the local errors log I found that the
robots.txt fetch for each of the hosts in question had
gotten a ConnectionException (leading to a -2 status code)
Connection refused. Each seemed to have exactly 6 of these.

Example:
----
20040616125748803 -2 . #3
http://www.aikman.is/robots.txt . . .
P http://www.aikman.is/
java.net.ConnectException: connect: Connection refused
20040616131251902 -2 . #95
http://www.aikman.is/robots.txt . . .
P http://www.aikman.is/
java.net.ConnectException: connect: Connection refused
20040616132759000 -2 . #97
http://www.aikman.is/robots.txt . . .
P http://www.aikman.is/
java.net.ConnectException: connect: Connection refused
20040616134304057 -2 . #162
http://www.aikman.is/robots.txt . . .
P http://www.aikman.is/
java.net.ConnectException: connect: Connection refused
20040616135810252 -2 . #108
http://www.aikman.is/robots.txt . . .
P http://www.aikman.is/
java.net.ConnectException: connect: Connection refused
20040616141317665 -2 . #129
http://www.aikman.is/robots.txt . . .
P http://www.aikman.is/
java.net.ConnectException: connect: Connection refused
----

Since retry delay is 15 min then that means that this is
clearly related since each time this problem has occured it
has started 75 minutes into the crawl, or when these
problematic URIs have been tried 6 times. The funny thing is
that max retries is set to 30 so I can't understand why
after exactly 6 tries this happens???

Of course this means that there should be a URI in those
queues (the seed) so the question seems to be what happens
to the seed and why isn't the failed robots.txt tried more
then 6 times?


Date: 2004-06-16 14:17
Sender: kristinn_sigProject Admin

Logged In: YES
user_id=892643

Ok, I can now report that my submitted workaround is indeed
... err working. These queues no longer halt the crawl. That
still leaves the underlying problem.

75 mins into the crawl I got 7 of these errors. Alerts follow:

----


A NoSuchElementException occured while thread ToeThread #152
tried to dequeue an item from www.13.is
Queue report follows:
KeyedQueue www.13.is
Length: 0
Status: EMPTY
Last enqueued: http://www.13.is/robots.txt
Last dequeued: http://www.13.is/robots.txt


Associated Throwable: java.util.NoSuchElementException

Stacktrace:
java.util.NoSuchElementException
at
org.archive.util.DiskQueue.dequeue()Ljava.lang.Object;(DiskQueue.java:181)
at
org.archive.util.DiskBackedDeque.backingDequeue()Ljava.lang.Object;(DiskBackedDeque.java:114)
at
org.archive.util.DiskBackedQueue.dequeue()Ljava.lang.Object;(DiskBackedQueue.java:104)
at
org.archive.crawler.frontier.KeyedQueue.dequeue()Lorg.archive.crawler.datamodel.CrawlURI;(KeyedQueue.java:397)
at
org.archive.crawler.frontier.Frontier.next()Lorg.archive.crawler.datamodel.CrawlURI;(Optimized
Method)
at
org.archive.crawler.frontier.Frontier.next(I)Lorg.archive.crawler.datamodel.CrawlURI;(Frontier.java:581)
at
org.archive.crawler.framework.ToeThread.run()V(ToeThread.java:135)
at
java.lang.Thread.startThreadFromVM(Ljava.lang.Thread;)V(Unknown
Source)
----

It seems clear to me that somehow empty queues are being
tagged as ready. Could this be a problem with bringing
queues out of INACTIVE state???


Date: 2004-06-16 13:16
Sender: kristinn_sigProject Admin

Logged In: YES
user_id=892643

No no other IO errors anywhere.

It looks like somehow an empty queue is getting marked as
READY.
Will continue to run test crawls.


Date: 2004-06-15 17:32
Sender: gojomoProject Admin

Logged In: YES
user_id=144912

Some other IO error may have triggered this; was any evident
in any logs or output?


Date: 2004-06-15 17:14
Sender: kristinn_sigProject Admin

Logged In: YES
user_id=892643

Have added code that should better catch this bug and
(hopefully) allow a crawl to continue despite it. The
underlying cause is still there but with improved reporting
it will hopefully be easier to track down. A repeat of my
earlier crawl is underway with the updated code.


Date: 2004-06-15 15:03
Sender: kristinn_sigProject Admin

Logged In: YES
user_id=892643

I should have noted that I was using the 0.10.0 release.


Attached File

No Files Currently Attached

Changes ( 8 )

Field Old Value Date By
close_date 2004-06-30 19:24 2004-10-04 18:55 stack-sf
status_id Open 2004-10-04 18:55 stack-sf
resolution_id Accepted 2004-10-04 18:55 stack-sf
resolution_id Fixed 2004-09-28 18:34 stack-sf
status_id Closed 2004-09-28 18:34 stack-sf
status_id Open 2004-06-30 19:24 gojomo
close_date - 2004-06-30 19:24 gojomo
resolution_id None 2004-06-30 19:24 gojomo