Heritrix: Internet Archive Web Crawler

Tracker: Bugs

7 glibc 2.3.2 NPTL hang (Was bdbfrontier stall in...) - ID: 1086554
Last Update: Comment added ( karl-ia )

In the test bench, crawling at ~2 million docs an hour
against the infiniturl application, it's possible
to hang the crawler: all queues are snoozed and the
ClockDaemon is stuck in a non-interruptible wait.

Here is the culprit:

"Thread-8" prio=1 tid=0x085e3358 nid=0x59f6 in Object.wait() [0x9e37f000..0x9e37f640]
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:474)
    at EDU.oswego.cs.dl.util.concurrent.ClockDaemon.nextTask(ClockDaemon.java:320)
    - locked <0xab4ed100> (a EDU.oswego.cs.dl.util.concurrent.ClockDaemon)
    at EDU.oswego.cs.dl.util.concurrent.ClockDaemon$RunLoop.run(ClockDaemon.java:361)
    at java.lang.Thread.run(Thread.java:595)



Michael Stack ( stack-sf ) - 2004-12-16 16:34

7

Closed

Wont Fix

Michael Stack

None

None

Public


Comments ( 7 )

Date: 2007-03-14 00:19
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-313 -- please add further
comments at that location.


Date: 2004-12-29 20:03
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Marking as 'won't fix' (the fix is to upgrade glibc). Added as a
known limitation to the release notes.


Date: 2004-12-24 19:08
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Five .FR crawls ran overnight without hanging, so the workaround
seems to get us over the worst of the hanging problem when
using NPTL. Lowering the priority for now.

Leaving the issue open because the hang can still happen -- if
not in ClockDaemon, then elsewhere, in the deflate for
instance (see '[ 1068403 ] ARCWriter gzip deflate hang') --
and it looks likely that the hang is fixed in the 2.3.3 version
of glibc (Gordon's research seems to indicate this, as do tests
on the Nostromo machine, a Fedora Core 2 box which has 2.3.3).

On NPTL itself: we want to use it because crawls on the
testbench show it to be about 10% faster than linuxthreads, and
it also exhibits a smaller virtual memory size than a
linuxthreads-based crawler. See
http://crawler.archive.org/cgi-bin/wiki.pl?EffectOfXss.


Date: 2004-12-21 23:20
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed a workaround patch that removes ClockDaemon.
Crawler seems to last in overnight tests on crawling11 with
NPTL enabled.


Date: 2004-12-21 15:35
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

A permutation of this issue has been showing up in the .FR
crawl. One version had two threads both waiting to lock the
ClockDaemon#Heap of tasks but thread dumps showed no one
holding the lock.

The problem seems like a glibc/NPTL problem because when we
disable NPTL, the crawl -- while slow as molasses -- runs
without hiccup.

The problem we're seeing resembles this one from the
blackdown lists:
http://www.blackdown.org/java-linux/java-linux@java.blackdown.org/java-linux-msg00089.html

Of note, removing ClockDaemon seems to work around this
problem (crawling11 lasted overnight).

Upping the priority since we need to get a fix or workaround
to make the .FR crawl work, the current highest priority.
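
The "threads waiting on a lock that nobody holds" symptom described above was read off thread dumps. As a rough illustration only, the same check could be sketched in-process with the standard java.lang.management API (available from JDK 1.5). The class and method names below are hypothetical and are not part of Heritrix. The snapshot is racy, so a single hit may be a false positive, and the check only helps while the JVM is still responsive enough to answer.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Heritrix code: lists threads that are BLOCKED
// on a monitor for which the JVM reports no owning thread.
class OrphanedLockCheck {

    static List<String> blockedWithoutOwner() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> suspects = new ArrayList<String>();
        ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds());
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            // getLockOwnerId() returns -1 when no thread owns the lock.
            if (info.getThreadState() == Thread.State.BLOCKED
                    && info.getLockOwnerId() == -1) {
                suspects.add(info.getThreadName()
                        + " blocked on " + info.getLockName());
            }
        }
        return suspects;
    }
}
```

Calling blockedWithoutOwner() a few times, some seconds apart, and looking for thread names that appear in every sample would reduce the chance of reporting a thread that was only momentarily between lock release and acquisition.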


Date: 2004-12-16 23:30
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Tested without the damping feature on a sun jdk1.4.2 and it
hung in the same place inside a ClockDaemon wait.

IBMJDK142 hangs almost immediately. The ClockDaemon is in
wait but so are a bunch of threads in bdb:

Kernel is 2.6.9dualp4.






Date: 2004-12-16 16:38
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Applied a mitigating patch.

@@ -681,28 +681,61 @@
             readyQueue(wq);
         }
     }
+
     /**
-     * Wake any queues sitting in the snoozed queue whose time has come
+     * Wake any queues sitting in the snoozed queue whose time has come.
      */
     void wakeQueues() {
         long now = System.currentTimeMillis();
-//        logger.info("wakeReadyQueues() at "+now);
+        int tasksCompleted = 0;
+        long interval = -1;
         synchronized (snoozedClassQueues) {
             while (true) {
                 if (snoozedClassQueues.isEmpty()) {
-                    return;
+                    break;
                 }
                 BdbWorkQueue peek = (BdbWorkQueue)
                     snoozedClassQueues.first();
-                if (peek.getWakeTime() <= now) {
+                interval = peek.getWakeTime() - now;
+                if (interval <= 0) {
                     snoozedClassQueues.remove(peek);
                     peek.setWakeTime(0);
                     reenqueueQueue(peek);
+                    tasksCompleted++;
                 } else {
-//                    logger.info("declining to wake "+peek.getClassKey()+"("+peek.getWakeTime()+") at "+now);
-                    return;
+                    break;
                 }
             }
         }
+        if (tasksCompleted <= 0 && (interval > 0)){
+            final long maxSleepTime = 100;
+            // We've done no work. Go to sleep to stop wait/notify trashing.
+            // We trash because without the below sleep, we leave here and
+            // go into a wait. In times of high concurrency we're
+            // continually notified out of the ClockDaemon#restart method
+            // everytime a new task is added to the queue. I've logged
+            // this happening on occasion at over 100 times a second. Adding
+            // in this damping effect because its possible to hang this thread
+            // inside ClockDaemon#wait such that it is not even
+            // interruptable on jdk1.5.0. This addition does not eliminate the
+            // hang. It does postpone it (Hang happens after 8 hours of
+            // highspeed crawling rather than after 30mins-2hrs).
+            //
+            // Don't sleep more than 100 milliseconds in case something
+            // gets scheduled ahead of current head of queue while
+            // we're asleep (Items are scheduled at now + politeness or
+            // now + retry interval; the latter could get scheduled first.
+            // Could make for some takeup lag if politeness is off or
+            // retries are on a dime.
+            if (logger.isLoggable(Level.INFO)) {
+                logger.info("Sleeping for " + Math.min(interval,
+                    maxSleepTime));
+            }
+            try {
+                Thread.sleep(Math.min(interval, maxSleepTime));
+            } catch (InterruptedException e) {
+                e.printStackTrace();
+            }
+        }
     }
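
For readers who want to try the damping idea outside the frontier, here is a minimal standalone sketch of the same logic. It is an illustration under assumptions, not the committed code: DampedWaker and SnoozedQueue are hypothetical stand-ins for the frontier and BdbWorkQueue, the woken list replaces reenqueueQueue(), and the method returns the sleep time instead of calling Thread.sleep() so the decision can be checked without waiting.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the frontier's snoozed-queue handling.
class DampedWaker {

    // Hypothetical stand-in for BdbWorkQueue: a key plus a wake time.
    static final class SnoozedQueue {
        final String key;
        final long wakeTime;
        SnoozedQueue(String key, long wakeTime) {
            this.key = key;
            this.wakeTime = wakeTime;
        }
    }

    // Same cap as maxSleepTime in the patch.
    static final long MAX_SLEEP_MS = 100;

    // Ordered by wake time, ties broken by key so entries stay distinct.
    private final TreeSet<SnoozedQueue> snoozed = new TreeSet<SnoozedQueue>(
        new Comparator<SnoozedQueue>() {
            public int compare(SnoozedQueue a, SnoozedQueue b) {
                if (a.wakeTime != b.wakeTime) {
                    return Long.compare(a.wakeTime, b.wakeTime);
                }
                return a.key.compareTo(b.key);
            }
        });

    void snooze(String key, long wakeTime) {
        synchronized (snoozed) {
            snoozed.add(new SnoozedQueue(key, wakeTime));
        }
    }

    // Moves every queue whose wake time has passed into 'woken'.
    // Returns how long the caller should sleep: 0 if any queue was
    // woken or nothing is snoozed, otherwise the time until the head
    // of the set is due, capped at MAX_SLEEP_MS.
    long wakeQueues(long now, List<String> woken) {
        int tasksCompleted = 0;
        long interval = -1;
        synchronized (snoozed) {
            while (!snoozed.isEmpty()) {
                SnoozedQueue head = snoozed.first();
                interval = head.wakeTime - now;
                if (interval > 0) {
                    break; // head is not due yet, so nothing behind it is
                }
                snoozed.pollFirst();
                woken.add(head.key);
                tasksCompleted++;
            }
        }
        if (tasksCompleted <= 0 && interval > 0) {
            return Math.min(interval, MAX_SLEEP_MS);
        }
        return 0;
    }
}
```

A caller would loop on wakeQueues(System.currentTimeMillis(), woken) and pass any non-zero result to Thread.sleep(). As in the patch, the cap means a queue scheduled ahead of the current head while the thread sleeps is picked up at most 100 ms late.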



Changes ( 8 )

Field          Old Value                                                  Date              By
status_id      Open                                                       2004-12-29 20:03  stack-sf
resolution_id  None                                                       2004-12-29 20:03  stack-sf
close_date     -                                                          2004-12-29 20:03  stack-sf
summary        glibc 2.3.2 hang (Was bdbfrontier stall in wakeQueues...)  2004-12-24 19:08  stack-sf
priority       9                                                          2004-12-24 19:08  stack-sf
priority       5                                                          2004-12-21 15:35  stack-sf
summary        bdbfrontier stall in wakeQueues (ClockDaemon#wait)         2004-12-21 15:35  stack-sf
assigned_to    nobody                                                     2004-12-16 16:38  stack-sf