From: James N. <jn...@nk...> - 2002-12-02 22:53:00
Okay, I've run into a bit of a problem. It's reproducible, but I'm finding it difficult to debug.

I noticed that my UML image (a secondary nameserver) died pretty consistently if it was forced to named-xfer all of its zones (about 4600 of them). The error on the console is:

    sleeping process 21881 got unexpected signal : 29

(The process number changes, obviously, but the signal remains the same. 29 is SIGIO, isn't it?)

Backtrace:

    #0  0xa0114e57 in pause () at af_packet.c:1884
    #1  0xa0098ef5 in tracer_panic (
        format=0xa0151540 "sleeping process %d got unexpected signal : %d\n")
        at tracer.c:86
    #2  0xa00990bf in sleeping_process_signal (pid=21881, sig=29)
        at tracer.c:163
    #3  0xa00994d0 in tracer (init_proc=0xa0098104 <start_kernel_proc>,
        sp=0xa017fffc) at tracer.c:304
    #4  0xa009822e in start_uml_tt () at process_kern.c:471
    #5  0xa0096b71 in linux_main (argc=13, argv=0xbffff834) at um_arch.c:353
    #6  0xa000a31b in main (argc=13, argv=0xbffff834, envp=0xbffff86c)
        at arch/um/main.c:125

(I'm not experienced with gdb. Is there other information I could provide that would make this more useful?)

I've run through several experiments trying to reproduce the problem in my UML test lab, and was able to reproduce it with the following setup:

    UML Hosting Box: "glot"
    UML Image: "umlneal09"
    Server on the same network as glot: "smurg"

umlneal09 is connected to glot via a TUN interface. glot is the only hop between umlneal09 and smurg.

On smurg, I set up a netcat listener on port 4200. It does nothing but accept a connection. On umlneal09, I ran the following script:

    while true; do nc -zvn smurg 4200 & done

This spawns netcat (a simple utility for sending data across network connections) in a continuous connection loop, making and dropping connections just as fast as it can. The aforementioned crash can occur as early as 10 connections into the run, or as late as several seconds in.
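For anyone trying to repeat the experiment, a bounded variant of that loop may be easier to rerun, since it stops after a fixed number of attempts instead of forking forever. This is only a sketch: the `churn` function name and the example count of 1000 are made up for illustration and are not part of the original test, and it assumes the local `nc` accepts the same `-zvn` flags used above.

```sh
#!/bin/sh
# Hypothetical helper (not from the original post): spawn COUNT
# backgrounded netcat connection attempts, then wait for all of them.
# Arguments: host, port, count.
# Note: -n disables DNS lookups, so most netcat builds expect the host
# to be given as a numeric address.
churn() {
    host=$1
    port=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Same invocation as the original loop, with output discarded.
        nc -zvn "$host" "$port" >/dev/null 2>&1 &
        i=$((i + 1))
    done
    # Block until every backgrounded nc has exited.
    wait
    echo "spawned $i connection attempts to $host:$port"
}
```

After sourcing the file on the UML image, something like `churn <address-of-smurg> 4200 1000` would start the run. A fixed count makes it easier to compare how many attempts were spawned before the crash between runs, at the cost of possibly ending before the crash triggers.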
What makes this a particularly nasty problem to debug is that load on the UML Image or the UML Hosting Box (such as, say, a gdb/ddd attached to the process) can keep the problem from occurring.

Performing the same experiment against "localhost" doesn't produce the crash. Massive forking (while /bin/true; do /bin/true; done) does not seem to cause a crash.

Any ideas?

-James