From: Ivan S. <iv...@gm...> - 2008-10-30 20:33:19
|
Hello. I've run into some deadlocks in PCL code. I wrote about it on #lisp, and dlowe noted that this has to deal with cffi foreign string class being changed. There are 11 threads that are locked up waiting on an unnamed spinlock in PCL (I guess it's *pcl-lock*). A backtrace from such thread looks like 0: ((LAMBDA (SWANK-BACKEND::DEBUGGER-LOOP-FN)) #<FUNCTION (LAMBDA #) {1002BD5429}>) 1: (SWANK::DEBUG-IN-EMACS #<SIMPLE-ERROR {10056EB6A1}>) 2: (SWANK:INVOKE-SLIME-DEBUGGER #<SIMPLE-ERROR {10056EB6A1}>) 3: (SWANK:SIMPLE-BREAK "Interrupt from Emacs") 4: ((LAMBDA (SWANK-BACKEND::HOOK SWANK-BACKEND::FUN)) #<FUNCTION SWANK:SWANK-DEBUGGER-HOOK> #<FUNCTION (LAMBDA #) {100999E919}>) 5: (SWANK::CALL-WITH-REDIRECTED-IO #<SWANK::CONNECTION {1003831E21}> #<CLOSURE (LAMBDA #) {10056EB589}>) 6: (SWANK::CALL-WITH-CONNECTION #<SWANK::CONNECTION {1003831E21}> #<FUNCTION (LAMBDA #) {100999E919}>) 7: (SWANK::INVOKE-OR-QUEUE-INTERRUPT #<CLOSURE (LAMBDA #) {10056EA969}>) 8: ((FLET SB-UNIX::INTERRUPTION)) 9: ((FLET #:WITHOUT-INTERRUPTS-BODY-[INVOKE-INTERRUPTION]10)) 10: (SB-SYS:INVOKE-INTERRUPTION #<FUNCTION (FLET SB-UNIX::INTERRUPTION) {1000070B99}>) 11: ("foreign function: #x41EF72") 12: ("foreign function: #x41F030") 13: (SB-PCL::INVALIDATE-WRAPPER #<SB-PCL::WRAPPER #<STANDARD-CLASS CFFI::FOREIGN-STRING-TYPE> {10054899A1}> :FLUSH #<SB-PCL::WRAPPER #<STANDARD-CLASS CFFI::FOREIGN-STRING-TYPE> {10055C4931}>) 14: ((FLET #:WITHOUT-INTERRUPTS-BODY-[CALL-WITH-SPINLOCK]448)) 15: (SB-THREAD::CALL-WITH-SPINLOCK #<CLOSURE (FLET SB-THREAD::WITH-SPINLOCK-THUNK) {7FBC8AAC45A9}> #S(SB-THREAD::SPINLOCK :NAME NIL :VALUE #<SB-THREAD:THREAD RUNNING {1002730081}>)) 16: (SB-PCL::FORCE-CACHE-FLUSHES #<STANDARD-CLASS CFFI::FOREIGN-STRING-TYPE>) 17: (SB-PCL::CHECK-WRAPPER-VALIDITY #<error printing object>) 18: (SB-PCL::CACHE-MISS-VALUES #<error printing object>) 19: (SB-PCL::CACHING-MISS #<error printing object>) 20: (MODBUS-INT::MODBUS-INIT-TCP #.(SB-SYS:INT-SAP #X00A20FA0) "127.0.0.1" 2029) 21: (MODBUS::MODBUS-SLAVE-ALLOC #<MODBUS::MODBUS-SLAVE {1002881FE1}>) 22: (MODBUS::MODBUS-SLAVE-LOOP #<MODBUS::MODBUS-SLAVE {1002881FE1}>) 23: ((LAMBDA ())) 24: ((FLET SB-THREAD::WITH-MUTEX-THUNK)) 25: ((FLET #:WITHOUT-INTERRUPTS-BODY-[CALL-WITH-MUTEX]477)) 26: (SB-THREAD::CALL-WITH-MUTEX ..) 27: ((LAMBDA ())) 28: ("foreign function: #x41EF72") 29: ("foreign function: #x41654A") MODBUS-INT::MODBUS-INIT-TCP function definition looks like (defcfun "modbus_init_tcp" :void (mb-param :pointer) (ip-address :string) (port :int)) which expands into (PROGN NIL (DEFUN MODBUS-INIT-TCP (MB-PARAM IP-ADDRESS PORT) (LET ((#:G947 MB-PARAM)) (MULTIPLE-VALUE-BIND (#:G948 #:PARAM974) (TRANSLATE-TO-FOREIGN IP-ADDRESS #<CFFI::FOREIGN-STRING-TYPE :UTF-8>) (UNWIND-PROTECT (PROGN (LET ((#:G949 PORT)) (CFFI-SYS:%FOREIGN-FUNCALL "modbus_init_tcp" (:POINTER #:G947 :POINTER #:G948 :INT #:G949 :VOID) :CALLING-CONVENTION :CDECL :LIBRARY :DEFAULT))) (FREE-TRANSLATED-OBJECT #:G948 #<CFFI::FOREIGN-STRING-TYPE :UTF-8> #:PARAM974)))))) My guess is that this may be somehow connected with the fact that CFFI-FOREIGN-STRING-TYPE class isn't used at all before these threads are started, and they all are started simultaneously. I never encountered this problem before I switched to a dual-core computer, and it's not easily reproducible (it occurs once in 10-20 runs). Ivan |
From: Attila L. <att...@gm...> - 2008-10-31 10:02:41
|
i think this is probably the same issue we wrote about in "PCL infinite recursion". it happens for us on our live servers that are running 1.0.12.2. (we'll update them in a month or two. my answer to Nikodemus about whether it happens with sbcl head is still waiting among my drafts mails...) Ivan, what are the CPU's doing when it happens? are they busy-looping for you, too? in our case, probably due to some CPU level parallelism, some wrapper remains invalid forever which puts some code in PCL to loop for ever. iirc, since 1.0.12 Nikodemus fixed numerous threading issues, so in head it may be a deadlock now, but i'm just guessing here. the similarities: - 4 cores with numerous threads - it happens rarely, but when it happens it's usually early in the day - SB-PCL::INVALIDATE-WRAPPER is on the stack > There are 11 threads that are locked up waiting on an unnamed > spinlock in PCL (I guess it's *pcl-lock*). A backtrace from such > thread looks like are you sure all of them are doing the same? *iirc* the situation is this: one thread grabs the spinlock and starts to get rid of the invalidness, fails to do so therefore loops forever holding the lock, and the other threads do their busy waiting on the spinlock it holds. hth, -- attila |
From: Nikodemus S. <nik...@ra...> - 2008-10-31 17:46:51
|
On Thu, Oct 30, 2008 at 10:33 PM, Ivan Shvedunov <iv...@gm...> wrote: > Hello. > I've run into some deadlocks in PCL code. I wrote about it on #lisp, > and dlowe noted that this has to deal with cffi foreign string class > being changed. > There are 11 threads that are locked up waiting on an unnamed > spinlock in PCL (I guess it's *pcl-lock*). A backtrace from such > thread looks like Thanks for the report! A few questions: 1. What is the SBCL version, and which platform is this? How many processors are there? 2. Can you get a backtrace from the thread holding the lock? 3. Can you produce a (semi) repeatable test case you could send? (Even if it's 1000 lines and requires several external packages, that's fine.) Cheers, -- Nikodemus |
From: Ivan S. <iv...@gm...> - 2008-11-02 13:04:38
|
On Fri, Oct 31, 2008 at 8:46 PM, Nikodemus Siivola <nik...@ra...> wrote: > > Thanks for the report! > > A few questions: > > 1. What is the SBCL version, and which platform is this? How many > processors are there? SBCL 1.0.21 Ubuntu 8.10 kernel: Linux i-shvedunov 2.6.27-7-generic #1 SMP Thu Oct 30 04:12:22 UTC 2008 x86_64 GNU/Linux /proc/cpuinfo: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Pentium(R) Dual CPU E2200 @ 2.20GHz stepping : 13 cpu MHz : 1200.000 cache size : 1024 KB physical id : 0 siblings : 2 core id : 0 cpu cores : 2 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4399.75 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 15 model name : Intel(R) Pentium(R) Dual CPU E2200 @ 2.20GHz stepping : 13 cpu MHz : 1200.000 cache size : 1024 KB physical id : 0 siblings : 2 core id : 1 cpu cores : 2 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl pni monitor ds_cpl est tm2 ssse3 cx16 xtpr lahf_lm bogomips : 4399.78 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: > 2. Can you get a backtrace from the thread holding the lock? Well, I tried to reproduce the problem today via ssh to my machine at work, but to no avail :( For some mystical reason (difference in CPU load?) I can only reproduce the condition when working locally. Will try again next week when I can. > 3. Can you produce a (semi) repeatable test case you could send? (Even > if it's 1000 lines and requires several external packages, that's > fine.) I'll try to extract necessary code (also next week), though it isn't that easy. The whole thing is, well, considerably more than 1000 lines, has dependincies on a patched C library (libmodbus) and patched CLG version. Concerning earlier Attila's message: yes, all threads are busy-waiting on spinlocks. Investigating what's going on is complicated because when I try to debug a locked up thread from SLIME, SLIME debugger itself often locks up in the same CACHE-MISS-VALUES etc., and actual interruption only happens after several attempts, which in these cases also shows up in the stack trace. I'll try to learn what's going on without SLIME when I can. Ivan |
From: Nikodemus S. <nik...@ra...> - 2008-11-02 20:17:56
|
On Sun, Nov 2, 2008 at 3:04 PM, Ivan Shvedunov <iv...@gm...> wrote: > Concerning earlier Attila's message: yes, all threads are busy-waiting > on spinlocks. Investigating what's going on is complicated because > when I try to debug a locked up thread from SLIME, SLIME debugger > itself often locks up in the same CACHE-MISS-VALUES etc., and actual > interruption only happens after several attempts, which in these cases > also shows up in the stack trace. I'll try to learn what's going on > without SLIME when I can. GDB may be helpful. Minitutorial: gdb -p <sbcl pid> info threads # shows threads thread <id> # switches to a thread call backtrace_from_fp($ebp, 100) # prints backtrace for thread to the SBCL stdout # $rbp on x86-64 Cheers, -- Nikodemus |