Thread: [Sbcl-devel] really strange bug, might be alien related

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

dear list,

i have an iolib based web server which pretty much stays alive through
stress tests (last time: 5 hours of 50 parallel threads sending
requests, 100% ok reponse), but in 3 days and 45 minutes it goes
unresponsive in various ways, even without any load. i have 4 samples
of it now.

sometimes i get an exception about an unhandled memory fault,
sometimes i can see some ldb related c functions on the c call stack
with gdb, and sometimes i get the excerpt that you can find at the end
of the mail. this last one looks like a gc related deadlock, but it
also happened after the magic 3 days and 45 minutes, so it's most
probably a random sideeffect of a heap corruption.

the worker threads are calling this iolib funcion in a loop that
timeouts once each second: (IO.MULTIPLEX:WAIT-UNTIL-FD-READY 7 :INPUT
1 NIL):

(iter (with socket = (socket-of server))
      (until (shutdown-initiated-p server))
      (for (values readable writable) = (wait-until-fd-ready (fd-of
socket) :input 1))
      (server.dribble "wait-until-fd-ready returned with readable ~S,
writable ~S in thread ~A" readable writable (current-thread))
      (until (shutdown-initiated-p server))
      (unless readable
        (next-iteration))
      (for stream-socket = (accept-connection socket))
      (when stream-socket
        (worker-loop/serve-one-request threaded? server worker stream-socket)))

this all strongly suggests that it's a timer related bug causing a
heap corruption, and w-u-fd-r indeed calls get-monotonic-time in
osicat, which in turn calls the (nix:clock-gettime
nix:clock-monotonic) alien. i've looked around for lisp side allocated
buffers updated by alien code, but unfortunately i couldn't spot
anything strange.

next time i will shut down the web server and fool around in the repl
after the magic 3 days and 45 seconds have passed. and if the timer
stuff works as expected then start up the web server workers and see
what happens.

i would be grateful if someone could suggest something to look at,
because i'm kinda lost now and it takes 3 days to try out an idea...

thanks for your time,

-- 
 attila

(gdb) info threads
  10 Thread 0xb7e496b0 (LWP 15965)  0xb7fe3410 in __kernel_vsyscall ()
  9 Thread 0xb74ebb90 (LWP 15968)  0xb7fe3410 in __kernel_vsyscall ()
  8 Thread 0xb70a6b90 (LWP 16032)  0xb7fe3410 in __kernel_vsyscall ()
  7 Thread 0xb6c61b90 (LWP 16033)  0xb7fe3410 in __kernel_vsyscall ()
  6 Thread 0xb681cb90 (LWP 16035)  0xb7fe3410 in __kernel_vsyscall ()
  5 Thread 0xb63d7b90 (LWP 16036)  0xb7fe3410 in __kernel_vsyscall ()
  4 Thread 0xb5f92b90 (LWP 16037)  0xb7fe3410 in __kernel_vsyscall ()
  3 Thread 0xb5b4db90 (LWP 16038)  0xb7fe3410 in __kernel_vsyscall ()
  2 Thread 0xb5707b90 (LWP 16039)  0xb7fe3410 in __kernel_vsyscall ()
  1 Thread 0xb51bab90 (LWP 31105)  0xb7fe3410 in __kernel_vsyscall ()
(gdb) thread 1
[Switching to thread 1 (Thread 0xb51bab90 (LWP 31105))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb51b9c0c,
void_context=0xb51b9c8c) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe3410 in __kernel_vsyscall ()
#6  0xb7fcc196 in nanosleep () from /lib/tls/i686/cmov/libpthread.so.0
#7  0x0952b017 in ?? ()
#8  0x00000000 in ?? ()
(gdb) thread 2
[Switching to thread 2 (Thread 0xb5707b90 (LWP 16039))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7fcb589 in __lll_lock_wait () from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb7fc6ba6 in _L_lock_95 () from /lib/tls/i686/cmov/libpthread.so.0
#3  0xb7fc658a in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
#4  0x0805bd0e in gc_stop_the_world () at thread.c:607
#5  0x090e3323 in ?? ()
#6  0x00000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 0xb5b4db90 (LWP 16038))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb5b4c40c,
void_context=0xb5b4c48c) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe340e in __kernel_vsyscall ()
#6  0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6
#7  0x0805d834 in futex_wait (lock_word=0xe036a88, oldval=8, sec=-1,
usec=0) at linux-os.c:83
#8  0x09507045 in ?? ()
#9  0x00000014 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 4
[Switching to thread 4 (Thread 0xb5f92b90 (LWP 16037))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb5f9166c,
void_context=0xb5f916ec) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe340e in __kernel_vsyscall ()
#6  0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6
#7  0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=245100785,
sec=-1, usec=0) at linux-os.c:83
#8  0x0b71652b in ?? ()
#9  0x01100027 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 5
[Switching to thread 5 (Thread 0xb63d7b90 (LWP 16036))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb63d666c,
void_context=0xb63d66ec) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe340e in __kernel_vsyscall ()
#6  0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6
#7  0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=244987721,
sec=-1, usec=0) at linux-os.c:83
#8  0x0b71652b in ?? ()
#9  0x01100027 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 6
[Switching to thread 6 (Thread 0xb681cb90 (LWP 16035))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb681b66c,
void_context=0xb681b6ec) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe340e in __kernel_vsyscall ()
#6  0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6
#7  0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=261960449,
sec=-1, usec=0) at linux-os.c:83
#8  0x0b71652b in ?? ()
#9  0x01100027 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 7
[Switching to thread 7 (Thread 0xb6c61b90 (LWP 16033))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb6c6066c,
void_context=0xb6c606ec) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe340e in __kernel_vsyscall ()
#6  0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6
#7  0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=248278601,
sec=-1, usec=0) at linux-os.c:83
#8  0x0b71652b in ?? ()
#9  0x01100027 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 8
[Switching to thread 8 (Thread 0xb70a6b90 (LWP 16032))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb70a566c,
void_context=0xb70a56ec) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe340e in __kernel_vsyscall ()
#6  0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6
#7  0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=248225153,
sec=-1, usec=0) at linux-os.c:83
#8  0x0b71652b in ?? ()
#9  0x01100027 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 9
[Switching to thread 9 (Thread 0xb74ebb90 (LWP 15968))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb74ea80c,
void_context=0xb74ea88c) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe340e in __kernel_vsyscall ()
#6  0xb7fcbbb8 in accept () from /lib/tls/i686/cmov/libpthread.so.0
#7  0x0b13c611 in ?? ()
#8  0xb7fc4adb in pthread_create@@GLIBC_2.1 () from
/lib/tls/i686/cmov/libpthread.so.0
#9  0xb7fe2b08 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 10
[Switching to thread 10 (Thread 0xb7e496b0 (LWP 15965))]#0  0xb7fe3410
in __kernel_vsyscall ()
(gdb) back
#0  0xb7fe3410 in __kernel_vsyscall ()
#1  0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6
#2  0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6
#3  0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb7a0e99c,
void_context=0xb7a0ea1c) at interrupt.c:787
#4  <signal handler called>
#5  0xb7fe3410 in __kernel_vsyscall ()
#6  0xb7fcc196 in nanosleep () from /lib/tls/i686/cmov/libpthread.so.0
#7  0x0952b017 in ?? ()
#8  0x0f9ab99f in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

Thread: [Sbcl-devel] really strange bug, might be alien related

Common Lisp compiler and runtime

sbcl-devel