From: Attila L. <att...@gm...> - 2008-07-24 17:13:36
|
dear list, i have an iolib based web server which pretty much stays alive through stress tests (last time: 5 hours of 50 parallel threads sending requests, 100% ok reponse), but in 3 days and 45 minutes it goes unresponsive in various ways, even without any load. i have 4 samples of it now. sometimes i get an exception about an unhandled memory fault, sometimes i can see some ldb related c functions on the c call stack with gdb, and sometimes i get the excerpt that you can find at the end of the mail. this last one looks like a gc related deadlock, but it also happened after the magic 3 days and 45 minutes, so it's most probably a random sideeffect of a heap corruption. the worker threads are calling this iolib funcion in a loop that timeouts once each second: (IO.MULTIPLEX:WAIT-UNTIL-FD-READY 7 :INPUT 1 NIL): (iter (with socket = (socket-of server)) (until (shutdown-initiated-p server)) (for (values readable writable) = (wait-until-fd-ready (fd-of socket) :input 1)) (server.dribble "wait-until-fd-ready returned with readable ~S, writable ~S in thread ~A" readable writable (current-thread)) (until (shutdown-initiated-p server)) (unless readable (next-iteration)) (for stream-socket = (accept-connection socket)) (when stream-socket (worker-loop/serve-one-request threaded? server worker stream-socket))) this all strongly suggests that it's a timer related bug causing a heap corruption, and w-u-fd-r indeed calls get-monotonic-time in osicat, which in turn calls the (nix:clock-gettime nix:clock-monotonic) alien. i've looked around for lisp side allocated buffers updated by alien code, but unfortunately i couldn't spot anything strange. next time i will shut down the web server and fool around in the repl after the magic 3 days and 45 seconds have passed. and if the timer stuff works as expected then start up the web server workers and see what happens. i would be grateful if someone could suggest something to look at, because i'm kinda lost now and it takes 3 days to try out an idea... thanks for your time, -- attila (gdb) info threads 10 Thread 0xb7e496b0 (LWP 15965) 0xb7fe3410 in __kernel_vsyscall () 9 Thread 0xb74ebb90 (LWP 15968) 0xb7fe3410 in __kernel_vsyscall () 8 Thread 0xb70a6b90 (LWP 16032) 0xb7fe3410 in __kernel_vsyscall () 7 Thread 0xb6c61b90 (LWP 16033) 0xb7fe3410 in __kernel_vsyscall () 6 Thread 0xb681cb90 (LWP 16035) 0xb7fe3410 in __kernel_vsyscall () 5 Thread 0xb63d7b90 (LWP 16036) 0xb7fe3410 in __kernel_vsyscall () 4 Thread 0xb5f92b90 (LWP 16037) 0xb7fe3410 in __kernel_vsyscall () 3 Thread 0xb5b4db90 (LWP 16038) 0xb7fe3410 in __kernel_vsyscall () 2 Thread 0xb5707b90 (LWP 16039) 0xb7fe3410 in __kernel_vsyscall () 1 Thread 0xb51bab90 (LWP 31105) 0xb7fe3410 in __kernel_vsyscall () (gdb) thread 1 [Switching to thread 1 (Thread 0xb51bab90 (LWP 31105))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb51b9c0c, void_context=0xb51b9c8c) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe3410 in __kernel_vsyscall () #6 0xb7fcc196 in nanosleep () from /lib/tls/i686/cmov/libpthread.so.0 #7 0x0952b017 in ?? () #8 0x00000000 in ?? () (gdb) thread 2 [Switching to thread 2 (Thread 0xb5707b90 (LWP 16039))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7fcb589 in __lll_lock_wait () from /lib/tls/i686/cmov/libpthread.so.0 #2 0xb7fc6ba6 in _L_lock_95 () from /lib/tls/i686/cmov/libpthread.so.0 #3 0xb7fc658a in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0 #4 0x0805bd0e in gc_stop_the_world () at thread.c:607 #5 0x090e3323 in ?? () #6 0x00000000 in ?? () (gdb) thread 3 [Switching to thread 3 (Thread 0xb5b4db90 (LWP 16038))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb5b4c40c, void_context=0xb5b4c48c) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe340e in __kernel_vsyscall () #6 0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6 #7 0x0805d834 in futex_wait (lock_word=0xe036a88, oldval=8, sec=-1, usec=0) at linux-os.c:83 #8 0x09507045 in ?? () #9 0x00000014 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) thread 4 [Switching to thread 4 (Thread 0xb5f92b90 (LWP 16037))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb5f9166c, void_context=0xb5f916ec) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe340e in __kernel_vsyscall () #6 0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6 #7 0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=245100785, sec=-1, usec=0) at linux-os.c:83 #8 0x0b71652b in ?? () #9 0x01100027 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) thread 5 [Switching to thread 5 (Thread 0xb63d7b90 (LWP 16036))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb63d666c, void_context=0xb63d66ec) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe340e in __kernel_vsyscall () #6 0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6 #7 0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=244987721, sec=-1, usec=0) at linux-os.c:83 #8 0x0b71652b in ?? () #9 0x01100027 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) thread 6 [Switching to thread 6 (Thread 0xb681cb90 (LWP 16035))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb681b66c, void_context=0xb681b6ec) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe340e in __kernel_vsyscall () #6 0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6 #7 0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=261960449, sec=-1, usec=0) at linux-os.c:83 #8 0x0b71652b in ?? () #9 0x01100027 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) thread 7 [Switching to thread 7 (Thread 0xb6c61b90 (LWP 16033))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb6c6066c, void_context=0xb6c606ec) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe340e in __kernel_vsyscall () #6 0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6 #7 0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=248278601, sec=-1, usec=0) at linux-os.c:83 #8 0x0b71652b in ?? () #9 0x01100027 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) thread 8 [Switching to thread 8 (Thread 0xb70a6b90 (LWP 16032))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb70a566c, void_context=0xb70a56ec) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe340e in __kernel_vsyscall () #6 0xb7f1d807 in syscall () from /lib/tls/i686/cmov/libc.so.6 #7 0x0805d834 in futex_wait (lock_word=0xe76bf5c, oldval=248225153, sec=-1, usec=0) at linux-os.c:83 #8 0x0b71652b in ?? () #9 0x01100027 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) thread 9 [Switching to thread 9 (Thread 0xb74ebb90 (LWP 15968))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb74ea80c, void_context=0xb74ea88c) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe340e in __kernel_vsyscall () #6 0xb7fcbbb8 in accept () from /lib/tls/i686/cmov/libpthread.so.0 #7 0x0b13c611 in ?? () #8 0xb7fc4adb in pthread_create@@GLIBC_2.1 () from /lib/tls/i686/cmov/libpthread.so.0 #9 0xb7fe2b08 in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) thread 10 [Switching to thread 10 (Thread 0xb7e496b0 (LWP 15965))]#0 0xb7fe3410 in __kernel_vsyscall () (gdb) back #0 0xb7fe3410 in __kernel_vsyscall () #1 0xb7e76346 in ?? () from /lib/tls/i686/cmov/libc.so.6 #2 0xb7e763ff in sigwaitinfo () from /lib/tls/i686/cmov/libc.so.6 #3 0x08056576 in sig_stop_for_gc_handler (signal=35, info=0xb7a0e99c, void_context=0xb7a0ea1c) at interrupt.c:787 #4 <signal handler called> #5 0xb7fe3410 in __kernel_vsyscall () #6 0xb7fcc196 in nanosleep () from /lib/tls/i686/cmov/libpthread.so.0 #7 0x0952b017 in ?? () #8 0x0f9ab99f in ?? () Backtrace stopped: previous frame inner to this frame (corrupt stack?) (gdb) |
From: Christophe R. <cs...@ca...> - 2008-07-24 17:36:51
|
"Attila Lendvai" <att...@gm...> writes: > i have an iolib based web server which pretty much stays alive through > stress tests (last time: 5 hours of 50 parallel threads sending > requests, 100% ok reponse), but in 3 days and 45 minutes it goes > unresponsive in various ways, even without any load. i have 4 samples > of it now. 3 days and 45 minutes is 261900 seconds, which is sufficiently close to 262144 seconds to trigger my "hm, that's interesting" reaction. 2^18 isn't a particularly interesting number, but it probably isn't too hard to construct some other subsystem which would turn that 2^18 into 2^29 or 2^32... (Sorry, I don't have enough time to investigate further right now) Best, Christophe |
From: Andy H. <ah...@gm...> - 2008-07-24 23:21:24
|
I believe this may be the same alien stack leakage surprise which I reported several weeks ago. This following loop should exhibit the same failure in a matter of seconds: (progn (loop repeat (- (expt 2 18) 150) do (IO.MULTIPLEX:WAIT-UNTIL-FD-READY 2 :INPUT 0)) (loop (format *trace-output* "~&Any moment now.. ~X~%" sb-vm::*alien-stack*) (IO.MULTIPLEX:WAIT-UNTIL-FD-READY 2 :INPUT 0))) This occurs as expected on my linux/x86-64 machine, printing a series of stack addresses until it crashes, although you might have to tweak the magic number of iterations to see the diagnostic message. I think that the issue is again that using return/return-from can bypass the decrement of the alien stack pointer. Iolib's wait-until-fd-ready is written in exactly this way, and I think it can be worked around as I described before, by wrapping the with-alien (or with-foreign-objects, in this case) catch and throw, which have the magical effect of saving/restoring the alien stack pointer around the block. I've tried this out on a version wait-until-fd-ready: (defun wait-until-fd-ready (file-descriptor event-type &optional timeout errorp) "Poll file descriptor `FILE-DESCRIPTOR' for I/O readiness. `EVENT-TYPE' must be either :INPUT, :OUTPUT or :IO. `TIMEOUT' must be either a non-negative integer measured in seconds, or `NIL' meaning no timeout at all. If `ERRORP' is not NIL and a timeout occurs, then a condition of type `POLL-TIMEOUT' is signaled. Returns two boolean values indicating readability and writeability of `FILE-DESCRIPTOR'." (flet ((poll-error (unix-err) (error 'poll-error :fd file-descriptor :identifier (osicat-sys:system-error-identifier unix-err)))) (values-list (catch 'kludge (with-foreign-object (pollfd 'pollfd) (bzero pollfd size-of-pollfd) (with-foreign-slots ((fd events revents) pollfd pollfd) (setf fd file-descriptor events (compute-poll-flags event-type)) (handler-case (let ((ret (nix:repeat-upon-condition-decreasing-timeout ((nix:eintr) remaining-time timeout) (poll pollfd 1 (timeout->milisec remaining-time))))) (when (zerop ret) (if errorp (error 'poll-timeout :fd file-descriptor :event-type event-type) #+NIL (return-from wait-until-fd-ready (values nil nil)) (throw 'kludge (list nil nil))))) (nix:posix-error (err) (poll-error err))) (throw 'kludge (multiple-value-list (process-poll-revents revents file-descriptor))) )) )))) Compiling that into iolib, my former test case seems to run indefinitely. Interesting to see someone else encounter this problem so soon.. |
From: Nikodemus S. <nik...@ra...> - 2008-07-25 08:49:16
Attachments:
alien-leak.patch
alien-leak-hotpatch.lisp
|
On Fri, Jul 25, 2008 at 2:21 AM, Andy Hefner <ah...@gm...> wrote: > I believe this may be the same alien stack leakage surprise which I > reported several weeks ago. This following loop should exhibit the > same failure in a matter of seconds: Thank you -- both for figuring this out, and the original report! The attached patch should make WITH-ALIEN not leak in the presence of non-local exits. (You can also hotpatch SBCL using the attached lisp file -- but it doesn't fix already compiled WITH-ALIENs either, naturally.) Cheers, -- Nikodemus |
From: Nikodemus S. <nik...@ra...> - 2008-07-30 13:56:41
|
On Fri, Jul 25, 2008 at 11:49 AM, Nikodemus Siivola <nik...@ra...> wrote: > On Fri, Jul 25, 2008 at 2:21 AM, Andy Hefner <ah...@gm...> wrote: >> I believe this may be the same alien stack leakage surprise which I >> reported several weeks ago. This following loop should exhibit the >> same failure in a matter of seconds: > > Thank you -- both for figuring this out, and the original report! > > The attached patch should make WITH-ALIEN not leak in the presence of > non-local exits. (You can also hotpatch SBCL using the attached lisp > file -- but it doesn't fix already compiled WITH-ALIENs either, > naturally.) Fixed in 1.0.19.2. Cheers, -- Nikodemus |
From: Attila L. <att...@gm...> - 2008-07-26 15:02:46
|
> The attached patch should make WITH-ALIEN not leak in the presence of > non-local exits. (You can also hotpatch SBCL using the attached lisp > file -- but it doesn't fix already compiled WITH-ALIENs either, > naturally.) i'm very happy for the fix, but before we move on... isn't there a way to guard this situation with at least an aver that avoids heap corruption with a useful message? -- attila |
From: Nikodemus S. <nik...@ra...> - 2008-07-27 08:13:21
|
On Sat, Jul 26, 2008 at 6:02 PM, Attila Lendvai <att...@gm...> wrote: >> The attached patch should make WITH-ALIEN not leak in the presence of >> non-local exits. (You can also hotpatch SBCL using the attached lisp >> file -- but it doesn't fix already compiled WITH-ALIENs either, >> naturally.) > > i'm very happy for the fix, but before we move on... isn't there a way > to guard this situation with at least an aver that avoids heap > corruption with a useful message? Yes: we could (should) protect the end of alien stack with a pair of guard-pages, just like with control stack. Cheers, -- Nikodemus |