From: Thiemo S. <th...@ne...> - 2005-05-20 09:49:40
|
Hello All, I think I found a race condition for gencgc. Scenario: An ALLOCATION inside a pa section causes alloc() at gencgc.c:4151 to try this: if(!data->pending_handler) maybe_defer_handler(interrupt_maybe_gc_int,data,0,0,0); but signals aren't necessarily blocked at that time, and may use the deferral mechanism at the same time. This means it can check pending_handler to be NULL, and has then race window. If a signal is deferred after the two lines above, only the garbage collection signalling gets lost. If a deferred signal happens before, then it will get lost, the data overwritten by the gc. Thiemo |
From: Robert M. <bob...@bo...> - 2005-05-24 22:43:16
|
Hi I was wondering if I could get an expert opinion; could this problem below be the cause of the segmentation faults and other errors experienced when running the alloc / interrupt / pseudo-atomic test in tests/threads.impure.lisp of the SBCL source? If so, I'm a bit confused about it - doesn't the fact that the code must already be pseudo-atomic* in order to get to this part of the code already guarantee that the interrupts are being blocked? (* My assertion that the code is already pseudo-atomic is based on the following code preceding what's below: #ifdef LISP_FEATURE_SB_THREAD if(!SymbolValue(PSEUDO_ATOMIC_ATOMIC,th)) { register u32 fs; fprintf(stderr, "fatal error in thread 0x%x, pid=%d\n", th,getpid()); __asm__("movl %fs,%0" : "=r" (fs) : ); fprintf(stderr, "fs is %x, th->tls_cookie=%x \n", debug_get_fs(),th->tls_cookie); lose("If you see this message before 2004.01.31, mail details to sbcl-devel\n"); } #else gc_assert(SymbolValue(PSEUDO_ATOMIC_ATOMIC,th)); #endif ) On Fri, 2005-05-20 at 11:49 +0200, Thiemo Seufer wrote: > Hello All, > > I think I found a race condition for gencgc. Scenario: > > An ALLOCATION inside a pa section causes alloc() at gencgc.c:4151 > to try this: > > if(!data->pending_handler) > maybe_defer_handler(interrupt_maybe_gc_int,data,0,0,0); > > but signals aren't necessarily blocked at that time, and may use the > deferral mechanism at the same time. This means it can check > pending_handler to be NULL, and has then race window. > > If a signal is deferred after the two lines above, only the garbage > collection signalling gets lost. If a deferred signal happens before, > then it will get lost, the data overwritten by the gc. > > > Thiemo > > > ------------------------------------------------------- > This SF.Net email is sponsored by Oracle Space Sweepstakes > Want to be the first software developer in space? > Enter now for the Oracle Space Sweepstakes! > http://ads.osdn.com/?ad_id=7412&alloc_id=16344&op=click > _______________________________________________ > Sbcl-devel mailing list > Sbc...@li... > https://lists.sourceforge.net/lists/listinfo/sbcl-devel > -- Robert Marlow <bob...@bo...> |
From: Thiemo S. <th...@ne...> - 2005-05-27 21:22:03
|
Robert Marlow wrote: > Hi > > I was wondering if I could get an expert opinion; could this problem > below be the cause of the segmentation faults and other errors > experienced when running the alloc / interrupt / pseudo-atomic test in > tests/threads.impure.lisp of the SBCL source? Probably, I haven't tested it. > If so, I'm a bit confused about it - doesn't the fact that the code must > already be pseudo-atomic* in order to get to this part of the code > already guarantee that the interrupts are being blocked? pseudo-atomic is set in pa_alloc(), but this doesn't mean signals are blocked. A signal interrupt is only detected. > (* My assertion that the code is already pseudo-atomic is based on the > following code preceding what's below: > > #ifdef LISP_FEATURE_SB_THREAD > if(!SymbolValue(PSEUDO_ATOMIC_ATOMIC,th)) { > register u32 fs; > fprintf(stderr, "fatal error in thread 0x%x, pid=%d\n", > th,getpid()); > __asm__("movl %fs,%0" : "=r" (fs) : ); > fprintf(stderr, "fs is %x, th->tls_cookie=%x \n", > debug_get_fs(),th->tls_cookie); > lose("If you see this message before 2004.01.31, mail > details to sbcl-devel\n"); > } > #else > gc_assert(SymbolValue(PSEUDO_ATOMIC_ATOMIC,th)); > #endif > ) This code is #if 0, btw. Thiemo |
From: <me...@ho...> - 2005-05-29 17:50:56
Attachments:
gencgc.patch
|
On Friday 20 May 2005 11:49, Thiemo Seufer wrote: > Hello All, > > I think I found a race condition for gencgc. Scenario: > > An ALLOCATION inside a pa section causes alloc() at gencgc.c:4151 > to try this: > > if(!data->pending_handler) > maybe_defer_handler(interrupt_maybe_gc_int,data,0,0,0); > > but signals aren't necessarily blocked at that time, and may use the > deferral mechanism at the same time. This means it can check > pending_handler to be NULL, and has then race window. > > If a signal is deferred after the two lines above, only the garbage > collection signalling gets lost. If a deferred signal happens before, > then it will get lost, the data overwritten by the gc. > > > Thiemo Here is a possible fix for review. It's not entirely clean, but then it wasn't very clean before, either. At least the ugliness is on the turf of the offender now (gencgc). Slightly off: is NSIG guaranteed to be a multiple of 8? Gabor |
From: Thiemo S. <th...@ne...> - 2005-05-29 18:04:56
|
Gábor Melis wrote: [snip] > Here is a possible fix for review. It's not entirely clean, but then it wasn't > very clean before, either. At least the ugliness is on the turf of the > offender now (gencgc). > > Slightly off: is NSIG guaranteed to be a multiple of 8? AFAIK it usually is 2^N, but there's no guarantee. A more robust solution would up-align to the next word boundary. Thiemo |
From: <me...@ho...> - 2005-05-31 14:34:41
Attachments:
signals.patch
|
I find it very hard to reason about signal handling, so please review this patch with great vengeance. Maybe even test it. It does 3 things: - fix Thiemo's gencgc/maybe_defer_handler race (this is unchanged) - interrupt_handle_pending: check for the pending handler being null. I think this is caused by the race in with-interrupts and pa_alloc (see comment in patch) - run_deferred_handler: set the pending handler to null, before calling it to guard against the handler enabling interrupts ... It was assembled in a bit of hurry for Robert's sake, so it is not fully tested. Interrupt-thread now seems to work except for the occasional 'The value (2) is not of type CONS.' from copy-list (!). I'll have a look at that later. I wonder if it helps with the kludge_sigset_for_gc problem. Gabor |
From: Christophe R. <cs...@ca...> - 2005-06-01 10:24:22
|
G=E1bor Melis <me...@ho...> writes: > I wonder if it helps with the kludge_sigset_for_gc problem. I think I worked out why it would help with the kludge_sigset_for_gc problem, and why x86-64/x86 are stabler with sprof than the non-x86 platforms. As usual, the key point is the garbage collector, or possibly in this case the allocator support for the garbage collector. The key point is that the gencgc allocator never triggers a signal itself during the pseudo-atomic section: so on those platforms, getting _two_ signals within an PA section (which is generally fairly short, though not as short as with the cheneygc allocator) is fairly rare: I would imagine that spending 10ms (the default SIGALRM rate) inside one PA section is vanishingly unlikely. It can almost certainly happen, particularly in the presence of threads, where STOP_FOR_GC and INTERRUPT_THREAD signals can be thrown around, but it's less likely. On the other hand, cheneygc ports will get a signal in normal operation in a PA section on a frequent basis when consing is occuring: a signal (usually the MEMORY_FAULT signal) occurs when hitting the write barrier at the end of the active semispace _during_ the allocation itself. Given that, it's not too surprising that two signals can happen during a PA section -- all that is needed is that the ALRM hit during the one which triggers a GC. Under normal operation, the gencgc ports only get signals when they store pointers in higher-generation objects, which does not happen inside a PA section. So, given this difference in the distribution of signals, it would be interesting indeed for someone on a cheneygc port to back out the kludge_sigset_for_gc thing, try Gabor's patch, and see if sprof still causes problems. (The rest of Gabor's patch looked good to my untrained eye). Cheers, Christophe |
From: <me...@ho...> - 2005-06-06 09:52:52
Attachments:
signals.patch
|
On Tuesday 31 May 2005 16:33, G=C3=A1bor Melis wrote: > I find it very hard to reason about signal handling, so please review > this patch with great vengeance. Maybe even test it. rc1 =2D more defensiveness: enforce invariants: checks for signal masks, interr= upts =2D refactoring: undoably_install_low_level_interrupt_handler wraps blockable handlers in low_level_maybe_now_maybe_later =2D interrupt_maybe_gc (on cheneygc platforms): check for already pending=20 handler before calling maybe_defer_handler =2D don't unblock signals unconditionally in interrupt_maybe_gc_int just re= store=20 the sigmask from the interrupted context (kludge removed) I would very much like to hear from non x86 platforms. How did they break,= =20 where, ... Oh yeah, are os_restore_fp_control and DARWIN_FIX_CONTEXT needed in every=20 signal handler? Interrupt-thread still fails, but I'm starting to suspect it has to do with= =20 arrange_return_to_lisp_function. The test case below doesn't invoke gc at=20 all, does not defer signals, but still breaks. I need put some effort into = my=20 assembly-fu so please comment away. (in-package "SB-THREAD") (defparameter *i* 0) (declaim (notinline xxx)) (defun xxx (i) (declare (optimize (debug 1) (speed 1))) (unless (typep i 'fixnum) (error "!!!!!!!!!!!"))) (defun check-stuff () (loop (xxx *i*))) (let ((c (make-thread (lambda () (handler-bind ((error #'(lambda (cond) (princ cond) (sb-debug:backtrace most-positive-fixnum)))) (check-stuff)))))) (let ((func (lambda () (princ ".") (force-output)))) (sb-sys:with-pinned-objects (func) (dotimes (i 100) (sleep (random 1d0)) (interrupt-thread c func)) (sleep 1) (terminate-thread c)))) (format t "~&interrupt test done~%") ;; give the other thread time to die before we leave, otherwise the ;; overall exit status is 0, not 104 (sleep 2)=20 (sb-ext:quit :unix-status 104) |
From: Christophe R. <cs...@ca...> - 2005-06-06 12:27:04
|
G=E1bor Melis <me...@ho...> writes: > I would very much like to hear from non x86 platforms. How did they > break, where, ... On SPARC/SunOS, while building PCL: ; SYS:SRC;PCL;FIXUP.FASL.NEWEST written ; compilation finished in 0:00:01 fatal error encountered in SBCL pid 28545: blockable signal 1 not blocked Let me know what I can do to debug this. Cheers, Christophe |
From: <me...@ho...> - 2005-06-06 12:39:38
|
On Monday 06 June 2005 14:25, Christophe Rhodes wrote: > G=E1bor Melis <me...@ho...> writes: > > I would very much like to hear from non x86 platforms. How did they > > break, where, ... > > On SPARC/SunOS, while building PCL: > > ; SYS:SRC;PCL;FIXUP.FASL.NEWEST written > ; compilation finished in 0:00:01 > fatal error encountered in SBCL pid 28545: > blockable signal 1 not blocked > > Let me know what I can do to debug this. Which signal handler fails this check? A generous tail of the output with=20 QSHOW and QSHOW_SIGNALS would help. |
From: Christophe R. <cs...@ca...> - 2005-06-06 13:23:48
|
G=E1bor Melis <me...@ho...> writes: > On Monday 06 June 2005 14:25, Christophe Rhodes wrote: >> G=E1bor Melis <me...@ho...> writes: >> > I would very much like to hear from non x86 platforms. How did they >> > break, where, ... >> >> On SPARC/SunOS, while building PCL: >> >> ; SYS:SRC;PCL;FIXUP.FASL.NEWEST written >> ; compilation finished in 0:00:01 >> fatal error encountered in SBCL pid 28545: >> blockable signal 1 not blocked >> >> Let me know what I can do to debug this. > > Which signal handler fails this check? A generous tail of the output with= =20 > QSHOW and QSHOW_SIGNALS would help. There's not very much that's terribly exciting, I think: below find a tail with QSHOW/QSHOW_SIGNALS, and after that a gdb backtrace. (Another observation at the bottom). ; compiling (DEFMETHOD COMPUTE-DISCRIMINATING-FUNCTION ...)/maybe_defer_han= dler(19840,11),thread=3D29170: deferred(PA) ; compiling (DEFMETHOD UPDATE-GF-DFUN ...) ; compiling (DEFMETHOD (SETF CLASS-NAME) ...) ; compiling (DEFMETHOD FUNCTION-KEYWORDS ...) ; compiling (DEFUN METHOD-LL->GENERIC-FUNCTION-LL ...) ; compiling (DEFMETHOD GENERIC-FUNCTION-PRETTY-ARGLIST ...) ; compiling (DEFMETHOD METHOD-PRETTY-ARGLIST ...); ; compilation unit finished ; caught 4 STYLE-WARNING conditions ; SYS:SRC;PCL;METHODS.FASL.NEWEST written ; compilation finished in 0:00:15 ; compiling file "/home/crhodes/sbcl/src/pcl/fixup.lisp" (written 31 MAY 20= 04 02:56:27 PM): ; compiling (IN-PACKAGE "SB-PCL") ; compiling (!FIX-EARLY-GENERIC-FUNCTIONS) ; compiling (COMPUTE-STANDARD-SLOT-LOCATIONS) ; compiling (DOLIST (S #) ...) ; file: /home/crhodes/sbcl/src/pcl/fixup.lisp ; in: DOLIST (S '(CONDITION STRUCTURE-OBJECT)) ; (SB-INT:DOHASH ; (SB-PCL::K SB-PCL::V ; (SB-KERNEL:CLASSOID-SUBCLASSES ; (SB-KERNEL:FIND-CLASSOID SB-PCL::S))) ; (FIND-CLASS (SB-KERNEL:CLASSOID-NAME SB-PCL::K))) ; --> WITH-HASH-TABLE-ITERATOR LET MACROLET LOOP BLOCK TAGBODY PROGN ; --> MULTIPLE-VALUE-BIND MULTIPLE-VALUE-CALL ; =3D=3D> ; #'(LAMBDA (&OPTIONAL (#:G9) (SB-PCL::K) (SB-PCL::V) &REST #:G14) ; (DECLARE (IGNORE #:G14)) ; (UNLESS #:G9 (RETURN NIL)) ; (FIND-CLASS (SB-KERNEL:CLASSOID-NAME SB-PCL::K))) ; ; caught STYLE-WARNING: ; #S(SIMPLE-STYLE-WARNING ; :ACTUAL-INITARGS (FORMAT-CONTROL The variable ~S is defined but neve= r used. ; FORMAT-ARGUMENTS ; (V)) ; :ASSIGNED-SLOTS NIL) ; compiling (SETQ *BOOT-STATE* ...) ; compiling (DEFUN PRINT-STD-INSTANCE ...); ; compilation unit finished ; caught 1 STYLE-WARNING condition ; SYS:SRC;PCL;FIXUP.FASL.NEWEST written ; compilation finished in 0:00:01 /maybe_defer_handler(19840,11),thread=3D29170: deferred(PA) /maybe_defer_handler(19840,11),thread=3D29170: deferred(PA) fatal error encountered in SBCL pid 29170: blockable signal 1 not blocked LDB monitor ldb> back Backtrace: <Frame 0x28009457, CODE: ???, <no LRA>, PC: ???> Bogus callee value (0x28009457). (gdb) back #0 0xff21c1e0 in _read () from /usr/lib/libc.so.1 #1 0xff20c4e8 in _filbuf () from /usr/lib/libc.so.1 #2 0xff20eb90 in fgets () from /usr/lib/libc.so.1 #3 0x0001a5c0 in sub_monitor () at monitor.c:438 #4 0x0001a2b8 in ldb_monitor () at monitor.c:489 #5 0x00017df4 in lose (fmt=3D0x24328 "blockable signal %d not blocked") at interr.c:68 #6 0x00018e18 in interrupt_internal_error (signal=3D4, info=3D0xffbff5b8, context=3D0xffbff300, continuable=3D0) at interrupt.c:114 #7 0x00022234 in sigill_handler (signal=3D4, siginfo=3D0xffbff5b8, void_context=3D0xffbff300) at sparc-arch.c:234 #8 <signal handler called> #9 0x10d523f8 in ?? () #10 0x10072508 in ?? () #11 0x00020204 in main (argc=3D228352, argv=3D0xa, envp=3D0x37c00) at runti= me.c:345 Signal 4 on this platform is SIGILL. This isn't directly related to the above, but I also think that low_level_maybe_now_maybe_later() needs to use arch_os_get_context() to make it work on Sparc/Linux. Let me know what next :-) Cheers, Christophe |
From: Christophe R. <cs...@ca...> - 2005-06-06 14:27:56
|
Christophe Rhodes <cs...@ca...> writes: > There's not very much that's terribly exciting, I think: below find a > tail with QSHOW/QSHOW_SIGNALS, and after that a gdb backtrace. Removing the sigprocmask() in sigill_handler, at Gabor's suggestion, makes the build run to completion. I shall now remove all the debugging stuff, and stress-test sb-sprof or something. (Suggestions welcome) Cheers, Christophe |
From: <me...@ho...> - 2005-06-06 14:57:11
Attachments:
signals.patch
|
On Monday 06 June 2005 16:26, Christophe Rhodes wrote: > Christophe Rhodes <cs...@ca...> writes: > > There's not very much that's terribly exciting, I think: below find a > > tail with QSHOW/QSHOW_SIGNALS, and after that a gdb backtrace. > > Removing the sigprocmask() in sigill_handler, at Gabor's suggestion, > makes the build run to completion. I shall now remove all the > debugging stuff, and stress-test sb-sprof or something. (Suggestions > welcome) > > Cheers, > > Christophe rc2 - removed misguided sigprocmask calls from mips, hppa and sparc sig{trap,ill} handlers - removed non-x86 version of handle_breakpoint (interrupts are enabled in now common handle_breakpoint) - fixed arrange_return_to_lisp_function/post_signal_tramp to save and restore eflags (interrupt-threads seems to work) - commented on the need to restore fp state arrange_r_t_l_f Gabor PS: as far as I can see low_level_maybe_now_maybe_later() does use arch_os_get_context(). You meant low_level_interrupt_handle_now? |
From: Gabor M. <gab...@es...> - 2005-06-07 14:08:14
|
Is everyone confortable with it? Then I'll commit it soonish. |
From: Christophe R. <cs...@ca...> - 2005-06-07 14:47:03
|
Gabor Melis <gab...@es...> writes: > Is everyone confortable with it? Then I'll commit it soonish. Oh, I forgot to tell the list that it caused the mean samples before failure on Sparc/SunOS/sb-sprof to become much, much larger. (I have managed to confuse it ("-1397857 is not an (OR STREAM T NULL)") by a suitably-placed C-c, but at least it no longer hits an uninterruptible CPU-chewing state. (So this is a vote for committage from me) On the other hand, I have not audited the patch for correctness wrt calling arch_os_get_context() exactly once on contexts from signal handlers. I presume it's right, in that none of the "important signals" (ILL, SEGV) on the sparc are in the blockable set, but I thought it would be worth mentioning it in case it wasn't on the radar. Cheers, Christophe |