The problem below has been reported by Ansi and Tom Emerson.
It seems easy enough to reproduce. Making it high
priority because two list members reported it.
KeyedQueue server<->key mismatch noted: pfbuser<->mprsrv.agri.gov.cn
java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java(Inlined Compiled Code))
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)
java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java(Inlined Compiled Code))
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java(Inlined Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java(Compiled Code))
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)
Here is Tom's report:
I've had a crawl running for several days. Late Friday it stopped
fetching anything. Any attempt to view the Reports for the job
(http://.../admin/reports.jsp) leads to an exception being returned to
the browser:
An error occured
java.util.NoSuchElementException
java.util.NoSuchElementException
    at java.util.TreeMap.key(TreeMap.java:433)
    at java.util.TreeMap.firstKey(TreeMap.java:287)
    at java.util.TreeSet.first(TreeSet.java:407)
    at org.archive.crawler.framework.ToePool.oneLineReport(ToePool.java:106)
    at org.archive.crawler.framework.CrawlController.oneLineReportThreads(CrawlController.java:1054)
    at org.archive.crawler.admin.CrawlJobHandler.getThreadOneLine(CrawlJobHandler.java:944)
    at org.archive.crawler.jspc.admin.reports_jsp._jspService(Unknown Source)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:137)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:853)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:358)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:294)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1807)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:525)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1757)
    at org.mortbay.http.HttpServer.service(HttpServer.java:879)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:790)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:961)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:807)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:197)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:276)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:511)
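The top frames of this trace end in java.util.TreeSet.first(), which throws NoSuchElementException when the set is empty. The sketch below reproduces that behaviour in isolation and shows one possible guard. The class and method names (EmptySetGuard, oneLineSummary) are hypothetical stand-ins, not the actual ToePool code, and an empty set is only an assumption about what the report method saw.

```java
import java.util.NoSuchElementException;
import java.util.SortedSet;
import java.util.TreeSet;

public class EmptySetGuard {

    // Hypothetical stand-in for a one-line report method; NOT the real
    // ToePool.oneLineReport. Checks for emptiness before calling first().
    static String oneLineSummary(SortedSet<String> states) {
        if (states.isEmpty()) {
            return "no threads";
        }
        return states.size() + " state(s), first: " + states.first();
    }

    // Demonstrates the documented JDK behaviour: first() on an empty
    // TreeSet throws NoSuchElementException.
    static boolean firstThrowsOnEmpty() {
        try {
            new TreeSet<String>().first();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("first() throws on empty set: " + firstThrowsOnEmpty());
        System.out.println(oneLineSummary(new TreeSet<String>()));
    }
}
```

A guard like this would only keep the report page from failing; it would not explain why the set was empty in the first place.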
There are no Alerts for the crawl.
I can view the logs from the WUI:
crawl.log's last entry is a successful fetch on Friday evening.
local-errors.log's last entry indicates various socket timeouts on the
fetches, ending Friday evening.
progress-statistics.log is still being updated, indicating nothing has
been downloaded since Friday evening.
runtime-errors.log is empty.
uri-errors.log shows nothing unusual --- just the seemingly standard
set of bogus URLs that show up on the web.
heritrix_out.log shows some more interesting data, though I'm not sure
how to interpret it. Things seem to be plugging along just fine, then
things go south:
#29 2155ms finished(http://ricerca.gazzetta.it/scalcio/3.0.764977133.shtml) via http://ricerca.gazzetta.it/
#43 1847ms finished(http://directory.alguer.it/index.php?browse=/Regional/Europe/United_Kingdom/Northern_Ireland/Society_and_Culture/) via http://directory.alguer.it/index.php?browse=/Regional/Europe/United_Kingdom/Society_and_Culture/
KeyedQueue server<->key mismatch noted: mailto<->lastampa.it
Exception in thread "ToeThread #10" java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java:139)
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java:103)
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java:342)
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java:826)
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java:524)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)
Exception in thread "ToeThread #21" java.util.NoSuchElementException
    at org.archive.queue.TieredQueue.peek(TieredQueue.java:139)
    at org.archive.queue.TieredQueue.dequeue(TieredQueue.java:103)
    at org.archive.crawler.frontier.KeyedQueue.dequeue(KeyedQueue.java:342)
    at org.archive.crawler.frontier.HostQueuesFrontier.dequeueFromReady(HostQueuesFrontier.java:826)
    at org.archive.crawler.frontier.HostQueuesFrontier.next(HostQueuesFrontier.java:524)
    at org.archive.crawler.framework.ToeThread.run(ToeThread.java:115)
and so on and so on...
Pretty much after all of the toe threads die like this, the only
activity in that log is various connections to the server to see
what's happening.
Could this be a memory problem? The machine this is running on had 4
Heritrix instances running, each with the default heap size. It has 1G
physical memory, 2G swap. The WUI console indicates:
Used memory: 127666 KB
Heap size: 204264 KB
Max heap size: 260160 KB
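For anyone wanting to compare these numbers against a running JVM, figures of this kind can be read from java.lang.Runtime. The sketch below is only an illustration: the class name HeapFigures is hypothetical, and the mapping of "Used memory", "Heap size" and "Max heap size" onto these calls is an assumption, not something confirmed from the WUI source.

```java
public class HeapFigures {

    // Returns {used, heap, max} in KB, using the standard Runtime calls.
    // Assumed mapping: used = totalMemory - freeMemory, heap = totalMemory,
    // max = maxMemory.
    static long[] snapshotKb() {
        Runtime rt = Runtime.getRuntime();
        long total = rt.totalMemory();
        long free = rt.freeMemory();
        long max = rt.maxMemory();
        return new long[] { (total - free) / 1024, total / 1024, max / 1024 };
    }

    public static void main(String[] args) {
        long[] kb = snapshotKb();
        System.out.println("Used memory: " + kb[0] + " KB");
        System.out.println("Heap size: " + kb[1] + " KB");
        System.out.println("Max heap size: " + kb[2] + " KB");
    }
}
```

If used memory stays well below the maximum heap, as in the figures above, heap exhaustion looks less likely as the direct cause, though a snapshot like this says nothing about swap pressure on the machine as a whole.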
The crawl was 37% complete, 442152 / 1177003 downloaded/queued, after
almost 112 hours.
This is running CVS HEAD from the morning of 8 September, modified
with the ContentTypeRegexpFilter code (my version, not Stack's
integration). The same order file (modulo the expected changes, e.g.,
pathnames) is being and has been used successfully on other similarly
sized crawls.
I've written off this crawl: I have enough data for my immediate
needs. But I wanted to ping you all and see what you thought the
problem could be.
-tree
Michael Stack
Date: 2007-03-14 00:16
Date: 2004-09-22 18:25 Logged In: YES
| Field | Old Value | Date | By |
|---|---|---|---|
| assigned_to | nobody | 2004-09-22 18:25 | stack-sf |
| close_date | - | 2004-09-22 18:25 | stack-sf |
| status_id | Open | 2004-09-22 18:25 | stack-sf |
| resolution_id | None | 2004-09-22 18:25 | stack-sf |