From: Ross L. <rlo...@po...> - 2009-02-14 03:42:04
|
Referencing the mail from Ben Lambert forwarded by Nikodemus Siivola on Thu, Aug 14, 2008. http://sourceforge.net/mailarchive/forum.php?thread_name=633d72b0808132326t65e04c6es863e497b6d668415%40mail.gmail.com&forum_name=sbcl-devel I also see this hang in timer.impure.lisp with SBCL 1.0.25 built from a SBCL 1.0.25 without sb-thread (itself built from 1.0.20 without sb-thread). Host is x86_64, 2x quad-core Xeon@2.50GHz, 32GB RAM, RHEL AS4 Kernel 2.6.9-68.9, libc 2.3.4. Commenting out the test suggested in the original mail (quoted below), does not allow the rest of the tests to complete. > #+sb-thread > (with-test (:name (:with-timeout :many-at-the-same-time)) > ...elided... > (assert ok)))) I threw in some output to track where it hangs next which points to it hanging in: (with-test (:name (:hash-cache :interrupt)) ... (loop (assert (eq wanted (subtypep type1 type2)))))))) Tried to take a backtrace with gdb: (gdb) bt #0 0x0000001000fd0274 in ?? () #1 0x0000000000000010 in ?? () #2 0x3d4ccccd0000001a in ?? () #3 0x0000001003423387 in ?? () #4 0x0000001003423357 in ?? () #5 0x000000100341a8a8 in ?? () #6 0x00000010034233d1 in ?? () #7 0x000000100341a8a8 in ?? () #8 0x0000002a97e422e0 in ?? () #9 0x0000002a97e421a7 in ?? () #10 0x0000002a97e422e7 in ?? () #11 0x0000002a97e421b7 in ?? () #12 0x0000000020100017 in ?? () #13 0x00000010000813bf in ?? () #14 0x0000002a97e421c9 in ?? () #15 0x0000000000000236 in ?? () #16 0x000000100341aa18 in ?? () #17 0x00000010034232af in ?? () #18 0x0000002a97e431e0 in ?? () #19 0x0000002a97e42dd0 in ?? () #20 0x0000002a98042ff8 in ?? () #21 0x0000001000219a31 in ?? () #22 0x00000010034232af in ?? () #23 0x0000002a97e421e0 in ?? () #24 0x0000002a97e431d0 in ?? () #25 0x0000002a97e42dd0 in ?? () #26 0x0000002a98042ff8 in ?? () #27 0x000000100342352f in ?? () #28 0x0000002a97e42190 in ?? () #29 0x0000002a97e431e0 in ?? () #30 0x0000002a97e42dd0 in ?? () #31 0x0000002a98042ff8 in ?? () #32 0x0000000000000000 in ?? () skipping that test lets it get through the next few and hang up in the last test: #+sb-thread (with-test (:name (:timer :threaded-stress)) ... (mapcar #'sb-thread:join-thread threads))))) tried for a backtrace: (gdb) bt #0 0x0000002a95aceb39 in syscall () from /lib64/tls/libc.so.6 #1 0x0000000000417348 in futex_wait (lock_word=0x100342df20, oldval=16, sec=-1, usec=0) at linux-os.c:83 #2 0x00000010004d2d01 in ?? () #3 0x0000002a97e41f07 in ?? () #4 0x0000000020100017 in ?? () #5 0x0000002a97e42050 in ?? () #6 0x000000100343435f in ?? () #7 0x0000000000000008 in ?? () #8 0x0000002a97e41e30 in ?? () #9 0x000000100342df01 in ?? () #10 0x000000100277b331 in ?? () #11 0x000000002010004f in ?? () #12 0x0000002a97e41f10 in ?? () #13 0x000000100014b768 in ?? () #14 0x0000002a97e42050 in ?? () #15 0x0000002a99e50000 in ?? () #16 0x0000002a99e50000 in ?? () #17 0x0000002a98043000 in ?? () #18 0x0000000000000010 in ?? () #19 0x0000002a97e41f50 in ?? () #20 0x0000000000000100 in ?? () #21 0x0000002a97e41fa0 in ?? () #22 0x00000000004157f4 in create_thread (initial_function=Variable "initial_function" is not available. ) at thread.c:538 Previous frame inner to this frame (corrupt stack?) Poking around, it looks like the threads never join. Let me know if there's any other info I can provide. Thanks. - Ross |
From: Gábor M. <me...@re...> - 2009-02-19 16:23:33
|
On Miércoles 18 Febrero 2009, Ross Lonstein wrote: > On Wed, Feb 18, 2009 at 12:29:35PM -0600, Ross Lonstein wrote: > > On Wed, Feb 18, 2009 at 06:36:55PM +0100, Gbor Melis wrote: > > [snip] > > > > > Switching to private, this may be long. What does this do in a > > > fresh sbcl? > > > > > > (trace sb-impl::run-expired-timers) > > > (sb-unix:unix-setitimer :real 0 0 0 100) > > > > Gzipped output attached. > > [snip] > > Actually, looks like I didn't catch the output from the eval. > Doing it by hand this time... > > - Ross (reCCing sbcl-devel) It seems that this is a known bug in linux 2.6.9: https://bugzilla.redhat.com/show_bug.cgi?id=118044 You may want to compile the short test program in one of the comment on that page with gcc -lpthread. The program exits in 1s normally and hangs if you have this bug. Cheers, Gabor |
From: Gábor M. <me...@re...> - 2009-02-14 19:45:48
|
On Sábado 14 Febrero 2009, Ross Lonstein wrote: > Referencing the mail from Ben Lambert forwarded by Nikodemus Siivola > on Thu, Aug 14, 2008. > > http://sourceforge.net/mailarchive/forum.php?thread_name=633d72b08081 >32326t65e04c6es863e497b6d668415%40mail.gmail.com&forum_name=sbcl-devel > > I also see this hang in timer.impure.lisp with SBCL 1.0.25 built from > a SBCL 1.0.25 without sb-thread (itself built from 1.0.20 without > sb-thread). > > Host is x86_64, 2x quad-core Xeon@2.50GHz, 32GB RAM, RHEL AS4 Kernel > 2.6.9-68.9, libc 2.3.4. There is a whole lot of interrupt, timer and thread related fixes to be committed in time for 1.0.26. There is a fair chance that this problem will also be cured. If not, please rereport it here. Gabor Melis |
From: Ross L. <rlo...@po...> - 2009-02-18 12:48:03
|
On Sat, Feb 14, 2009 at 08:45:35PM +0100, Gbor Melis wrote: [snip] > There is a whole lot of interrupt, timer and thread related fixes to be > committed in time for 1.0.26. There is a fair chance that this problem > will also be cured. If not, please rereport it here. [snip] Noting your commit, I pulled 1.0.25.54 from git and built with sb-thread from 1.0.25. I see the nine "not ok" messages followed by: ::: UNEXPECTED-FAILURE (:WITH-TIMEOUT :MANY-AT-THE-SAME-TIME) ::: Running (:WITH-TIMEOUT :DEAD-THREAD) ::: Success (:WITH-TIMEOUT :DEAD-THREAD) ::: Running (:HASH-CACHE :INTERRUPT) which hangs indefinitely (which is to say, more than 30 minutes chewing cpu time before I killed it). Let me know if you want a trace, etc. - Ross |
From: Gábor M. <me...@re...> - 2009-02-18 13:12:28
|
On Miércoles 18 Febrero 2009, Ross Lonstein wrote: > On Sat, Feb 14, 2009 at 08:45:35PM +0100, Gbor Melis wrote: > [snip] > > > There is a whole lot of interrupt, timer and thread related fixes > > to be committed in time for 1.0.26. There is a fair chance that > > this problem will also be cured. If not, please rereport it here. > > [snip] > > Noting your commit, I pulled 1.0.25.54 from git and built with > > sb-thread from 1.0.25. I see the nine "not ok" messages followed by: > ::: UNEXPECTED-FAILURE (:WITH-TIMEOUT :MANY-AT-THE-SAME-TIME) > ::: Running (:WITH-TIMEOUT :DEAD-THREAD) > ::: Success (:WITH-TIMEOUT :DEAD-THREAD) > ::: Running (:HASH-CACHE :INTERRUPT) > > which hangs indefinitely (which is to say, more than 30 minutes > chewing cpu time before I killed it). > > Let me know if you want a trace, etc. > > - Ross Let's try to isolate this. What does this do in itself? (sb-thread:make-thread (lambda () (handler-case (sb-ext:with-timeout 0.5 (sleep 5) (format t "~%not ok~%")) (timeout () )))) If it prints "not ok", could you please compile sbcl with QSHOW noise like this: diff --git a/src/runtime/runtime.h b/src/runtime/runtime.h index 01f4031..836ebfa 100644 --- a/src/runtime/runtime.h +++ b/src/runtime/runtime.h @@ -29,8 +29,8 @@ #define thread_mutex_unlock(l) 0 #endif -/* #define QSHOW */ /* Enable low-level debugging output? */ -/* #define QSHOW_SAFE */ /* Enable blocking interrupts for each SHOW. */ +#define QSHOW /* Enable low-level debugging output? */ +#define QSHOW_SAFE /* Enable blocking interrupts for each SHOW. */ #ifdef QSHOW @@ -80,7 +80,7 @@ extern sigset_t blockable_sigset; * necessarily reentrant. But it can still be very convenient for * figuring out what's going on when you have a signal handling * problem.. */ -#define QSHOW_SIGNALS 0 +#define QSHOW_SIGNALS 1 #if QSHOW_SIGNALS #define FSHOW_SIGNAL FSHOW and repeat the experiment? Even if the small test case doesn't fail the QSHOW output of "sh run-test.sh timer.impure.lisp" may be informative. Cheers, Gabor |
From: Ross L. <rlo...@po...> - 2009-02-18 13:34:25
|
On Wed, Feb 18, 2009 at 02:12:12PM +0100, Gbor Melis wrote: [snip] > Let's try to isolate this. What does this do in itself? > > (sb-thread:make-thread > (lambda () > (handler-case > (sb-ext:with-timeout 0.5 > (sleep 5) > (format t "~%not ok~%")) > (timeout () > )))) [snip] This succeeds and the thread ends quietly, when I check with (sb-thread:list-all-threads) it only lists the initial thread: * (sb-thread:list-all-threads) (#<SB-THREAD:THREAD "initial thread" RUNNING {A8B97A1}>) * (sb-thread:make-thread (lambda () (handler-case (sb-ext:with-timeout 0.5 (sleep 5) (format t "~%not ok~%")) (timeout () )))) #<SB-THREAD:THREAD RUNNING {ACF7BA1}> * (sb-thread:list-all-threads) (#<SB-THREAD:THREAD "initial thread" RUNNING {A8B97A1}>) * > and repeat the experiment? Even if the small test case doesn't fail > the QSHOW output of "sh run-test.sh timer.impure.lisp" may be > informative. [snip] I'll give this a try later this evening and report. Thanks. - Ross |
From: Ross L. <rlo...@po...> - 2009-02-18 15:37:35
Attachments:
sbcl-1.0.25.54-qshow_20090218.out.gz
|
On Wed, Feb 18, 2009 at 07:34:19AM -0600, Ross Lonstein wrote: > On Wed, Feb 18, 2009 at 02:12:12PM +0100, Gbor Melis wrote: [snip] > > the QSHOW output of "sh run-test.sh timer.impure.lisp" may be > > informative. > [snip] > > I'll give this a try later this evening and report. [snip] Done. Killed it after a some time (at least 20 minutes). See attached gzipped output. - Ross |
From: Gábor M. <me...@re...> - 2009-02-18 16:01:50
|
On Miércoles 18 Febrero 2009, Ross Lonstein wrote: > On Wed, Feb 18, 2009 at 07:34:19AM -0600, Ross Lonstein wrote: > > On Wed, Feb 18, 2009 at 02:12:12PM +0100, Gbor Melis wrote: > > [snip] > > > > the QSHOW output of "sh run-test.sh timer.impure.lisp" may be > > > informative. > > > > [snip] > > > > I'll give this a try later this evening and report. > > [snip] > > Done. Killed it after a some time (at least 20 minutes). See attached > gzipped output. > > - Ross Interesting. In the many-at-the-same-time test only one sigarlm is triggered. Can you repeat the test with SB-IMPL::RUN-EXPIRED-TIMERS SB-IMPL::PEEK-SCHEDULE GET-INTERNAL-REAL-TIME SB-IMPL::SET-SYSTEM-TIMER SB-UNIX:UNIX-SETITIMER all traced? |
From: Ross L. <rlo...@po...> - 2009-02-18 16:15:35
Attachments:
sbcl-1.0.25.54-qshow_20090218-traced.out.gz
|
On Wed, Feb 18, 2009 at 05:01:35PM +0100, Gbor Melis wrote: [snip] > Interesting. In the many-at-the-same-time test only one sigarlm is > triggered. Can you repeat the test with > SB-IMPL::RUN-EXPIRED-TIMERS > SB-IMPL::PEEK-SCHEDULE > GET-INTERNAL-REAL-TIME > SB-IMPL::SET-SYSTEM-TIMER > SB-UNIX:UNIX-SETITIMER > all traced? [snip] Attached, gzipped output. - Ross |
From: Gábor M. <me...@re...> - 2009-02-20 14:50:15
|
On Jueves 19 Febrero 2009, Ross Lonstein wrote: > At Thu, 19 Feb 2009 11:25:40 +0100, Gábor Melis wrote: > [snip] > > > It seems that this is a known bug in linux 2.6.9: > > > > https://bugzilla.redhat.com/show_bug.cgi?id=118044 > > > > You may want to compile the short test program in one of the > > comment on that page with gcc -lpthread. The program exits in 1s > > normally and hangs if you have this bug. > > [snip] > > The test program always hangs. Looks like you found the root cause. > > Since these are work machines and RHEL5 is off in the distance[1], > there's nothing I can do to upgrade the kernel. > > Thanks. > > - Ross > > Footnotes: > [1] At least a few dozen of the several thousand deployed. > Individual machines will upgrade but I'll have to deal with older > releases for the foreseeable future. It should be possible work around this bug in linux by introducing a dedicated timer thread that calls setitimer. That said, we know of two other bugs in linux < 2.6.14: the nohup problem [1] and the SA_NODEFER bug in sigaction that's documented nowadays in the manpage of sigaction. The latter has a workaround already as evidenced by the "SA_NODEFER doesn't work" messages when starting up. I'm not sure adding yet another workaround is worth the complexity, sb-sprof won't work as advertised in any case. [1] http://common-lisp.net/pipermail/slime-devel/2006-March/004649.html |