Heritrix: Internet Archive Web Crawler

Tracker: Bugs

8 KeyedQueue server<->key mismatch noted: pfbuser<->mprsrv.agr - ID: 1031607
Last Update: Comment added ( karl-ia )

The below has been reported by Ansi and Tom Emerson.
Seems easy enough to reproduce. Making it high
priority because two list members reported it.

KeyedQueue server<->key mismatch noted: pfbuser<->mprsrv.agri.gov.cn
java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java(Inlined Compiled Code))
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)
java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java(Inlined Compiled Code))
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)


Here is Tom's report:



I've had a crawl running for several days. Late Friday it stopped
fetching anything. Any attempt to view the Reports for the job
(http://.../admin/reports.jsp) leads to an exception being returned
to the browser:

An error occured


java.util.NoSuchElementException

java.util.NoSuchElementException
    at java.util.TreeMap.key(TreeMap.java:433)
    at java.util.TreeMap.firstKey(TreeMap.java:287)
    at java.util.TreeSet.first(TreeSet.java:407)
    at org.archive.crawler.framework.ToePool.oneLineReport(ToePool.java:106)
    at org.archive.crawler.framework.CrawlController.oneLineReportThreads(CrawlController.java:1054)
    at org.archive.crawler.admin.CrawlJobHandler.getThreadOneLine(CrawlJobHandler.java:944)
    at org.archive.crawler.jspc.admin.reports_jsp._jspService(Unknown Source)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:358)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:294)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1807)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:525)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1757)
    at org.mortbay.http.HttpServer.service(HttpServer.java:879)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:790)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:961)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:807)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:197)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:276)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:511)

There are no Alerts for the crawl.

I can view the logs from the WUI:

crawl.log's last entry is a successful fetch on Friday
evening.

local-errors.log's last entry indicates various socket
timeouts on the
fetches, ending Friday evening.

progress-statistics.log is still being updated,
indicating nothing has
been downloaded since Friday evening.

runtime-errors.log is empty.

uri-errors.log shows nothing unusual --- just the
seemingly standard
set of bogus URLs that show up on the web.

heritrix_out.log shows some more interesting data, though I'm not sure
how to interpret it. Things seem to be plugging along just fine, then
things go south:

#29 2155ms finished(http://ricerca.gazzetta.it/scalcio/3.0.764977133.shtml) via http://ricerca.gazzetta.it/
#43 1847ms finished(http://directory.alguer.it/index.php?browse=/Regional/Europe/United_Kingdom/Northern_Ireland/Society_and_Culture/) via http://directory.alguer.it/index.php?browse=/Regional/Europe/United_Kingdom/Society_and_Culture/
KeyedQueue server<->key mismatch noted: mailto<->lastampa.it
Exception in thread "ToeThread #10" java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java:139)
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java:103)
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java:342)
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java:826)
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java:524)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)
Exception in thread "ToeThread #21" java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java:139)
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java:103)
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java:342)
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java:826)
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java:524)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)

and so on and so on...

Pretty much after all of the toe threads die like this, the only
activity in that log is various connections to the server to see
what's happening.

Could this be a memory problem? The machine this is running on had 4
Heritrix instances running, each with the default heap size. It has 1G
physical memory, 2G swap. The WUI console indicates:

Used memory: 127666 KB
Heap size: 204264 KB
Max heap size: 260160 KB

The crawl was 37% complete, 442152 / 1177003
downloaded/queued, after
almost 112 hours.

This is running CVS HEAD from the morning of 8
September, modified
with the ContentTypeRegexpFilter code (my version, not
Stack's
integration). The same order file (modulo the expected
changes, e.g.,
pathnames) is being and has been used successfully on
other similarly
sized crawls.

I've written off this crawl: I have enough data for my
immediate
needs. But I wanted to ping you all and see what you
thought the
problem could be.

-tree


Submitted: Michael Stack ( stack-sf ) - 2004-09-21 01:22
Priority: 8
Status: Closed
Resolution: Fixed
Assigned: Michael Stack
Category: None
Group: None
Visibility: Public


Comments ( 2 )

Date: 2007-03-14 00:16
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-248 -- please add further
comments at that location.


Date: 2004-09-22 18:25
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Since the recent Frontier refactoring, bogus queue names cause
crawler crashes. I changed how queue names are derived so they are
now less likely to be garbage.

+ I added to UURI a getter for the authority minus userinfo. (We were
having trouble with userinfo: everything before the ':' was taken to
be the name of a server, i.e. the ':' was assumed to be the delimiter
between host and port.)
+ I made it so only http, https, ftp and dns schemes make it out of
UURI, so it is less likely we'll get bogus queue names.
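
To illustrate the two changes listed above, here is a minimal sketch built on java.net.URI. It is not the actual UURI code: the class name QueueKeySketch, its method names and the sample URLs and credentials are all hypothetical, since the report does not include the URLs that triggered the mismatch. It shows how splitting the authority on the first ':' can yield a key such as "pfbuser" when userinfo is present, and how stripping userinfo and whitelisting schemes avoids that.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not the Heritrix UURI implementation.
final class QueueKeySketch {
    // Schemes named in the comment above as the only ones allowed through.
    private static final Set<String> ALLOWED_SCHEMES =
            new HashSet<>(Arrays.asList("http", "https", "ftp", "dns"));

    /** Naive key: everything before the first ':' is assumed to be the host. */
    static String naiveKey(String authority) {
        int colon = authority.indexOf(':');
        return colon < 0 ? authority : authority.substring(0, colon);
    }

    /** Drops a leading "userinfo@" from an authority, if present. */
    static String authorityMinusUserinfo(String authority) {
        int at = authority.lastIndexOf('@');
        return at < 0 ? authority : authority.substring(at + 1);
    }

    /**
     * Returns a host-based queue key, or null when the URI cannot be parsed,
     * has a scheme outside the whitelist, or has no usable host part.
     * IPv6 literals such as [::1] are not handled in this sketch.
     */
    static String queueKey(String uri) {
        URI parsed;
        try {
            parsed = new URI(uri);
        } catch (URISyntaxException e) {
            return null;
        }
        String scheme = parsed.getScheme();
        if (scheme == null
                || !ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))) {
            return null;
        }
        String authority = parsed.getRawAuthority();
        if (authority == null) {
            // Opaque URIs such as dns:example.org carry the name after the scheme.
            return parsed.isOpaque() ? parsed.getSchemeSpecificPart() : null;
        }
        return naiveKey(authorityMinusUserinfo(authority));
    }

    public static void main(String[] args) {
        // The naive split mistakes the user name for the server.
        System.out.println(naiveKey("pfbuser:secret@mprsrv.agri.gov.cn"));
        // Stripping userinfo first gives the host.
        System.out.println(queueKey("http://pfbuser:secret@mprsrv.agri.gov.cn/index.html"));
        // A mailto URI is rejected rather than producing a queue named "mailto".
        System.out.println(queueKey("mailto:someone@lastampa.it"));
    }
}
```

With these hypothetical inputs, the naive split produces "pfbuser" while the stripped form produces "mprsrv.agri.gov.cn", which matches the shape of the "pfbuser<->mprsrv.agri.gov.cn" mismatch in the log. Whether the real crash came from a URL of exactly this form is an assumption.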

Closing.


Attached File

No Files Currently Attached

Changes ( 4 )

Field          Old Value  Date              By
assigned_to    nobody     2004-09-22 18:25  stack-sf
close_date     -          2004-09-22 18:25  stack-sf
status_id      Open       2004-09-22 18:25  stack-sf
resolution_id  None       2004-09-22 18:25  stack-sf