From: Daniel F. <drf...@gm...> - 2007-12-08 02:52:00
|
Hello list, I have recently determined that SBCL seems to be having an issue with garbage collection grabbing a recursive lock. I have found this bug affects every version I have tested from the latest to sbcl 1.0.0. The hardware is a core 2 duo (/proc/cpuinfo for one of these machines appended) running GNU/Linux. I have determined that on my two to three hour test set that this bug may occur one in two times. Unfortunately it's a total blocker at that point, so it's debilitating. Mysteriously, the stack trace/condition raised by the recursive lock error sometimes reports that the :owner is nil and the :state is 0 for the mutex (at least by the time it gets to printing the condition). Unfortunately I am not able to supply the code that can reproduce the bug. I can, however, say that it is drakma talking to hunchentoot in the same SBCL process (some testing/pre-caching/computation procedure). I also don't think it'll appear at the same rate for everyone with different hardware. I'm poking around in the gc, interrupt, signal, and thread sections of SBCL, but it takes a long time to confirm or deny if anything I'm doing has an effect. Send any patches that you think will affect the bug or test cases that you want think may be able to reproduce the bug (I may get around to writing such a test case eventually). Some more obvious causes for failure: • without-interrupts could be not working entirely properly • without-interrupts may not be called in the right places to guard everything that needs to be guarded from interrupts Thanks in advance for your assistance, fdr Without further ado, here are three separate condition dumps: > > debugger invoked on a SIMPLE-ERROR in thread #<THREAD "initial thread" {A6DA829}>: > Recursive lock attempt #S(SB-THREAD:MUTEX > :NAME "GC lock" > :%OWNER NIL > :STATE 0). > > Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL. > > restarts (invokable by number or by possibly-abbreviated name): > 0: [ABORT] Exit debugger, returning to top level. > > (SB-THREAD:GET-MUTEX > #<unavailable argument> > #<unavailable argument> > #<unavailable argument>) > 0] > WARNING: Starting a select without a timeout while interrupts are disabled. > > > > WARNING: recursive lock attempt #S(SB-THREAD:MUTEX :NAME "GC lock" :VALUE NIL) > > Thread: #<THREAD "initial thread" {A6D7809}> > 0: (BACKTRACE 536870911 #<SYNONYM-STREAM :SYMBOL *TERMINAL-IO* {90F6951}>) > 1: (SB-THREAD:GET-MUTEX > #<unavailable argument> > #<unavailable argument> > #<unavailable argument>) > 2: ((FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) NIL) > 3: (SB-THREAD::CALL-WITH-MUTEX > #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK) {B79DCBDD}> > #S(SB-THREAD:MUTEX :NAME "GC lock" :VALUE NIL) > #<SB-THREAD:THREAD "initial thread" {A6D7809}> > T) > 4: ((FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) #<unavailable argument>) > 5: ((FLET SB-UNIX::RUN-WITHOUT-INTERRUPTS)) > 6: (SB-UNIX::CALL-WITHOUT-INTERRUPTS > #<CLOSURE (FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) {B79DCCCD}>) > 7: (SB-KERNEL:SUB-GC) > 8: ("foreign function: call_into_lisp") > 9: ("foreign function: funcall0") > 10: ("foreign function: maybe_gc") > 11: ("foreign function: interrupt_handle_pending") > 12: ("foreign function: handle_trap") > 13: (MAKE-ARRAY 8212) > 14: (SB-IMPL::DATA-VECTOR-FROM-INITS > (8212) > 8212 > (UNSIGNED-BYTE 8) > NIL > NIL > NIL > NIL) > 15: (MAKE-ARRAY 8212) > 16: ((SB-PCL::FAST-METHOD TRIVIAL-GRAY-STREAMS:STREAM-WRITE-SEQUENCE > (FLEXI-STREAMS:FLEXI-OUTPUT-STREAM "#<...>" . "#<...>")) > #<unavailable argument> > #<unavailable argument> > #<unavailable argument> > #<unavailable argument> > #<unavailable argument> > #<unavailable argument>) > > > debugger invoked on a SIMPLE-ERROR in thread #<THREAD "initial thread" {A6FC719}>: > Recursive lock attempt #S(SB-THREAD:MUTEX > :NAME "GC lock" > :%OWNER NIL > :STATE 0). > > Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL. > > (no restarts: If you didn't do this on purpose, please report it as a bug.) > (SB-THREAD:GET-MUTEX > #<unavailable argument> > #<unavailable argument> > #<unavailable argument>) > 0] > WARNING: Starting a select without a timeout while interrupts are disabled. Here is cpu info. It has occurred on two machines with non-identical (but similar) chips. > processor : 0 > vendor_id : GenuineIntel > cpu family : 6 > model : 15 > model name : Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz > stepping : 6 > cpu MHz : 1600.000 > cache size : 4096 KB > physical id : 0 > siblings : 2 > core id : 0 > cpu cores : 2 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 10 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm > bogomips : 4259.03 > clflush size : 64 > > processor : 1 > vendor_id : GenuineIntel > cpu family : 6 > model : 15 > model name : Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz > stepping : 6 > cpu MHz : 2133.000 > cache size : 4096 KB > physical id : 0 > siblings : 2 > core id : 1 > cpu cores : 2 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 10 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr lahf_lm > bogomips : 4256.00 > clflush size : 64 |
From: Nikodemus S. <nik...@ra...> - 2007-12-08 10:24:09
|
T24gRGVjIDgsIDIwMDcgMjo1MSBBTSwgRGFuaWVsIEZhcmluYSA8ZHJmYXJpbmFAZ21haWwuY29t PiB3cm90ZToKCj4gSSBoYXZlIHJlY2VudGx5IGRldGVybWluZWQgdGhhdCBTQkNMIHNlZW1zIHRv IGJlIGhhdmluZyBhbiBpc3N1ZSB3aXRoCj4gZ2FyYmFnZSBjb2xsZWN0aW9uIGdyYWJiaW5nIGEg cmVjdXJzaXZlIGxvY2suIEkgaGF2ZSBmb3VuZCB0aGlzIGJ1Zwo+IGFmZmVjdHMgZXZlcnkgdmVy c2lvbiBJIGhhdmUgdGVzdGVkIGZyb20gdGhlIGxhdGVzdCB0byBzYmNsIDEuMC4wLgo+Cj4gVGhl IGhhcmR3YXJlIGlzIGEgY29yZSAyIGR1byAoL3Byb2MvY3B1aW5mbyBmb3Igb25lIG9mIHRoZXNl IG1hY2hpbmVzCj4gYXBwZW5kZWQpIHJ1bm5pbmcgR05VL0xpbnV4LiBJIGhhdmUgZGV0ZXJtaW5l ZCB0aGF0IG9uIG15IHR3byB0byB0aHJlZQo+IGhvdXIgdGVzdCBzZXQgdGhhdCB0aGlzIGJ1ZyBt YXkgb2NjdXIgb25lIGluIHR3byB0aW1lcy4gVW5mb3J0dW5hdGVseQo+IGl0J3MgYSB0b3RhbCBi bG9ja2VyIGF0IHRoYXQgcG9pbnQsIHNvIGl0J3MgZGViaWxpdGF0aW5nLgoKSnVzdCBhIHNhbml0 eSBjaGVjazogbm90aGluZydzIGJpbmRpbmcgKkNVUlJFTlQtVEhSRUFEKiBieSBhbnkgY2hhbmNl LAp1c2luZyBHRVQtTVVURVggd2l0aCBhbiBleHBsaWNpdCBORVctT1dORVIsIG9yIGZyb2JiaW5n IE1VVEVYLU9XTkVSL1ZBTFVFPwoKRG9lcyB0aGUgaXNzdWUgbWFuaWZlc3QgaWYgeW91IHVzZSAo cmVzdHJpY3QtY29tcGlsZXItcG9saWN5ICdzYWZldHkgMyk/CgpUaGVyZSBpcyBhbHNvIGEga25v d24gKGFuZCBmaXhlZCkgaXNzdWUgaW4gU0JDTHMgaW4gdGhlIHJhbmdlIDEuMC43Lj8KLSAxLjAu OS4zOCB0aGF0IGNvdWxkCm1hbmlmZXN0IGluIHRoaXMgbWFubmVyIC0tIHN0YXkgYXdheSBmcm9t IHRob3NlLiBJIHN1Z2dlc3QgeW91IHJ1biB3aXRoIDEuMC4xMi4KCj4gSSdtIHBva2luZyBhcm91 bmQgaW4gdGhlIGdjLCBpbnRlcnJ1cHQsIHNpZ25hbCwgYW5kIHRocmVhZCBzZWN0aW9ucyBvZgo+ IFNCQ0wsIGJ1dCBpdCB0YWtlcyBhIGxvbmcgdGltZSB0byBjb25maXJtIG9yIGRlbnkgaWYgYW55 dGhpbmcgSSdtIGRvaW5nCgpBcmUgeW91IHNheWluZyB5b3UncmUgbG9va2luZyBmb3IgdGhlIHBy b2JsZW0sIG9yIGFyZSB5b3Ugc2F5aW5nIHlvdSdyZSBydW5uaW5nCmEgcGF0Y2hlZCBTQkNMPwoK QXQgYW55IHJhdGUsIHdoYXQncyBpbiAqRkVBVFVSRVMqPyBUbyB2ZXJpZnkgdGhhdCB5b3UgaGF2 ZSBhIHNhbmUgYnVpbGQsIHBsZWFzZQpzZW5kIHRoZSByZXN1bHRzIG9mIChESVNBU1NFTUJMRSAn U0ItVEhSRUFEOkdFVC1NVVRFWCkuCgo+IGhhcyBhbiBlZmZlY3QuIFNlbmQgYW55IHBhdGNoZXMg dGhhdCB5b3UgdGhpbmsgd2lsbCBhZmZlY3QgdGhlIGJ1ZyBvcgo+IHRlc3QgY2FzZXMgdGhhdCB5 b3Ugd2FudCB0aGluayBtYXkgYmUgYWJsZSB0byByZXByb2R1Y2UgdGhlIGJ1ZyAoSSBtYXkKPiBn ZXQgYXJvdW5kIHRvIHdyaXRpbmcgc3VjaCBhIHRlc3QgY2FzZSBldmVudHVhbGx5KS4KPgo+IFNv bWUgbW9yZSBvYnZpb3VzIGNhdXNlcyBmb3IgZmFpbHVyZToKPgo+ICAgICAgICAg4oCiIHdpdGhv dXQtaW50ZXJydXB0cyBjb3VsZCBiZSBub3Qgd29ya2luZyBlbnRpcmVseSBwcm9wZXJseQo+ICAg ICAgICAg4oCiIHdpdGhvdXQtaW50ZXJydXB0cyBtYXkgbm90IGJlIGNhbGxlZCBpbiB0aGUgcmln aHQgcGxhY2VzIHRvCj4gICAgICAgICBndWFyZCBldmVyeXRoaW5nIHRoYXQgbmVlZHMgdG8gYmUg Z3VhcmRlZCBmcm9tIGludGVycnVwdHMKCkkgZG9uJ3Qgc2VlIGhvdyB0aGlzIHdvdWxkIGJlIHRo ZSBjYXVzZS4gKFdoaWNoIG9mIGNvdXJzZSBkb2Vzbid0IG1lYW4gdGhhdAppdCBjYW4ndCBiZSAt LSBJIGp1c3QgZG9uJ3QgZmluZCBpdCBsaWtlbHkuKSBTVUItR0MgYW5kIEdFVC1NVVRFWCBhcmUK dGhlIG9idmlvdXMKY2FuZGlkYXRlcy4KCj4gV2l0aG91dCBmdXJ0aGVyIGFkbywgaGVyZSBhcmUg dGhyZWUgc2VwYXJhdGUgY29uZGl0aW9uIGR1bXBzOgoKV2hpY2ggdmVyc2lvbnMgb2YgU0JDTCBh cmUgdGhlc2UgZnJvbT8KCkNoZWVycywKCiAtLSBOaWtvZGVtdXMgU2lpdm9sYQoKPiA+Cj4gPiBk ZWJ1Z2dlciBpbnZva2VkIG9uIGEgU0lNUExFLUVSUk9SIGluIHRocmVhZCAjPFRIUkVBRCAiaW5p dGlhbCB0aHJlYWQiIHtBNkRBODI5fT46Cj4gPiAgIFJlY3Vyc2l2ZSBsb2NrIGF0dGVtcHQgI1Mo U0ItVEhSRUFEOk1VVEVYCj4gPiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgOk5BTUUgIkdD IGxvY2siCj4gPiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgOiVPV05FUiBOSUwKPiA+ICAg ICAgICAgICAgICAgICAgICAgICAgICAgICA6U1RBVEUgMCkuCj4gPgo+ID4gVHlwZSBIRUxQIGZv ciBkZWJ1Z2dlciBoZWxwLCBvciAoU0ItRVhUOlFVSVQpIHRvIGV4aXQgZnJvbSBTQkNMLgo+ID4K PiA+IHJlc3RhcnRzIChpbnZva2FibGUgYnkgbnVtYmVyIG9yIGJ5IHBvc3NpYmx5LWFiYnJldmlh dGVkIG5hbWUpOgo+ID4gICAwOiBbQUJPUlRdIEV4aXQgZGVidWdnZXIsIHJldHVybmluZyB0byB0 b3AgbGV2ZWwuCj4gPgo+ID4gKFNCLVRIUkVBRDpHRVQtTVVURVgKPiA+ICAjPHVuYXZhaWxhYmxl IGFyZ3VtZW50Pgo+ID4gICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+Cj4gPiAgIzx1bmF2YWlsYWJs ZSBhcmd1bWVudD4pCj4gPiAwXQo+ID4gV0FSTklORzogU3RhcnRpbmcgYSBzZWxlY3Qgd2l0aG91 dCBhIHRpbWVvdXQgd2hpbGUgaW50ZXJydXB0cyBhcmUgZGlzYWJsZWQuCj4gPgo+ID4KPgo+Cj4K PiA+Cj4gPiBXQVJOSU5HOiByZWN1cnNpdmUgbG9jayBhdHRlbXB0ICNTKFNCLVRIUkVBRDpNVVRF WCA6TkFNRSAiR0MgbG9jayIgOlZBTFVFIE5JTCkKPiA+Cj4gPiBUaHJlYWQ6ICM8VEhSRUFEICJp bml0aWFsIHRocmVhZCIge0E2RDc4MDl9Pgo+ID4gMDogKEJBQ0tUUkFDRSA1MzY4NzA5MTEgIzxT WU5PTllNLVNUUkVBTSA6U1lNQk9MICpURVJNSU5BTC1JTyogezkwRjY5NTF9PikKPiA+IDE6IChT Qi1USFJFQUQ6R0VULU1VVEVYCj4gPiAgICAgIzx1bmF2YWlsYWJsZSBhcmd1bWVudD4KPiA+ICAg ICAjPHVuYXZhaWxhYmxlIGFyZ3VtZW50Pgo+ID4gICAgICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+ KQo+ID4gMjogKChGTEVUIFNCLVVOSVg6OldJVEhPVVQtSU5URVJSVVBUUy1USFVOSykgTklMKQo+ ID4gMzogKFNCLVRIUkVBRDo6Q0FMTC1XSVRILU1VVEVYCj4gPiAgICAgIzxDTE9TVVJFIChGTEVU IFNCLVRIUkVBRDo6V0lUSC1NVVRFWC1USFVOSykge0I3OURDQkREfT4KPiA+ICAgICAjUyhTQi1U SFJFQUQ6TVVURVggOk5BTUUgIkdDIGxvY2siIDpWQUxVRSBOSUwpCj4gPiAgICAgIzxTQi1USFJF QUQ6VEhSRUFEICJpbml0aWFsIHRocmVhZCIge0E2RDc4MDl9Pgo+ID4gICAgIFQpCj4gPiA0OiAo KEZMRVQgU0ItVU5JWDo6V0lUSE9VVC1JTlRFUlJVUFRTLVRIVU5LKSAjPHVuYXZhaWxhYmxlIGFy Z3VtZW50PikKPiA+IDU6ICgoRkxFVCBTQi1VTklYOjpSVU4tV0lUSE9VVC1JTlRFUlJVUFRTKSkK PiA+IDY6IChTQi1VTklYOjpDQUxMLVdJVEhPVVQtSU5URVJSVVBUUwo+ID4gICAgICM8Q0xPU1VS RSAoRkxFVCBTQi1VTklYOjpXSVRIT1VULUlOVEVSUlVQVFMtVEhVTkspIHtCNzlEQ0NDRH0+KQo+ ID4gNzogKFNCLUtFUk5FTDpTVUItR0MpCj4gPiA4OiAoImZvcmVpZ24gZnVuY3Rpb246IGNhbGxf aW50b19saXNwIikKPiA+IDk6ICgiZm9yZWlnbiBmdW5jdGlvbjogZnVuY2FsbDAiKQo+ID4gMTA6 ICgiZm9yZWlnbiBmdW5jdGlvbjogbWF5YmVfZ2MiKQo+ID4gMTE6ICgiZm9yZWlnbiBmdW5jdGlv bjogaW50ZXJydXB0X2hhbmRsZV9wZW5kaW5nIikKPiA+IDEyOiAoImZvcmVpZ24gZnVuY3Rpb246 IGhhbmRsZV90cmFwIikKPiA+IDEzOiAoTUFLRS1BUlJBWSA4MjEyKQo+ID4gMTQ6IChTQi1JTVBM OjpEQVRBLVZFQ1RPUi1GUk9NLUlOSVRTCj4gPiAgICAgICg4MjEyKQo+ID4gICAgICA4MjEyCj4g PiAgICAgIChVTlNJR05FRC1CWVRFIDgpCj4gPiAgICAgIE5JTAo+ID4gICAgICBOSUwKPiA+ICAg ICAgTklMCj4gPiAgICAgIE5JTCkKPiA+IDE1OiAoTUFLRS1BUlJBWSA4MjEyKQo+ID4gMTY6ICgo U0ItUENMOjpGQVNULU1FVEhPRCBUUklWSUFMLUdSQVktU1RSRUFNUzpTVFJFQU0tV1JJVEUtU0VR VUVOQ0UKPiA+ICAgICAgIChGTEVYSS1TVFJFQU1TOkZMRVhJLU9VVFBVVC1TVFJFQU0gIiM8Li4u PiIgLiAiIzwuLi4+IikpCj4gPiAgICAgICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+Cj4gPiAgICAg ICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+Cj4gPiAgICAgICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+ Cj4gPiAgICAgICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+Cj4gPiAgICAgICM8dW5hdmFpbGFibGUg YXJndW1lbnQ+Cj4gPiAgICAgICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+KQo+ID4KPiA+Cj4gPiBk ZWJ1Z2dlciBpbnZva2VkIG9uIGEgU0lNUExFLUVSUk9SIGluIHRocmVhZCAjPFRIUkVBRCAiaW5p dGlhbCB0aHJlYWQiIHtBNkZDNzE5fT46Cj4gPiAgIFJlY3Vyc2l2ZSBsb2NrIGF0dGVtcHQgI1Mo U0ItVEhSRUFEOk1VVEVYCj4gPiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgOk5BTUUgIkdD IGxvY2siCj4gPiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgOiVPV05FUiBOSUwKPiA+ICAg ICAgICAgICAgICAgICAgICAgICAgICAgICA6U1RBVEUgMCkuCj4gPgo+ID4gVHlwZSBIRUxQIGZv ciBkZWJ1Z2dlciBoZWxwLCBvciAoU0ItRVhUOlFVSVQpIHRvIGV4aXQgZnJvbSBTQkNMLgo+ID4K PiA+IChubyByZXN0YXJ0czogSWYgeW91IGRpZG4ndCBkbyB0aGlzIG9uIHB1cnBvc2UsIHBsZWFz ZSByZXBvcnQgaXQgYXMgYSBidWcuKQo+Cj4KPgo+ID4gKFNCLVRIUkVBRDpHRVQtTVVURVgKPiA+ ICAjPHVuYXZhaWxhYmxlIGFyZ3VtZW50Pgo+ID4gICM8dW5hdmFpbGFibGUgYXJndW1lbnQ+Cj4g PiAgIzx1bmF2YWlsYWJsZSBhcmd1bWVudD4pCj4gPiAwXQo+ID4gV0FSTklORzogU3RhcnRpbmcg YSBzZWxlY3Qgd2l0aG91dCBhIHRpbWVvdXQgd2hpbGUgaW50ZXJydXB0cyBhcmUgZGlzYWJsZWQu Cj4KPgo+Cj4gSGVyZSBpcyBjcHUgaW5mby4gSXQgaGFzIG9jY3VycmVkIG9uIHR3byBtYWNoaW5l cyB3aXRoIG5vbi1pZGVudGljYWwKPiAoYnV0IHNpbWlsYXIpIGNoaXBzLgo+Cj4gPiBwcm9jZXNz b3IgICAgICAgOiAwCj4gPiB2ZW5kb3JfaWQgICAgICAgOiBHZW51aW5lSW50ZWwKPiA+IGNwdSBm YW1pbHkgICAgICA6IDYKPiA+IG1vZGVsICAgICAgICAgICA6IDE1Cj4gPiBtb2RlbCBuYW1lICAg ICAgOiBJbnRlbChSKSBDb3JlKFRNKTIgQ1BVICAgICAgICAgIDY0MjAgIEAgMi4xM0dIego+ID4g c3RlcHBpbmcgICAgICAgIDogNgo+ID4gY3B1IE1IeiAgICAgICAgIDogMTYwMC4wMDAKPiA+IGNh Y2hlIHNpemUgICAgICA6IDQwOTYgS0IKPiA+IHBoeXNpY2FsIGlkICAgICA6IDAKPiA+IHNpYmxp bmdzICAgICAgICA6IDIKPiA+IGNvcmUgaWQgICAgICAgICA6IDAKPiA+IGNwdSBjb3JlcyAgICAg ICA6IDIKPiA+IGZkaXZfYnVnICAgICAgICA6IG5vCj4gPiBobHRfYnVnICAgICAgICAgOiBubwo+ ID4gZjAwZl9idWcgICAgICAgIDogbm8KPiA+IGNvbWFfYnVnICAgICAgICA6IG5vCj4gPiBmcHUg ICAgICAgICAgICAgOiB5ZXMKPiA+IGZwdV9leGNlcHRpb24gICA6IHllcwo+ID4gY3B1aWQgbGV2 ZWwgICAgIDogMTAKPiA+IHdwICAgICAgICAgICAgICA6IHllcwo+ID4gZmxhZ3MgICAgICAgICAg IDogZnB1IHZtZSBkZSBwc2UgdHNjIG1zciBwYWUgbWNlIGN4OCBhcGljIHNlcCBtdHJyIHBnZSBt Y2EgY21vdiBwYXQgcHNlMzYgY2xmbHVzaCBkdHMgYWNwaSBtbXggZnhzciBzc2Ugc3NlMiBzcyBo dCB0bSBwYmUgbnggbG0gY29uc3RhbnRfdHNjIHBuaSBtb25pdG9yIGRzX2NwbCB2bXggZXN0IHRt MiBzc3NlMyBjeDE2IHh0cHIgbGFoZl9sbQo+ID4gYm9nb21pcHMgICAgICAgIDogNDI1OS4wMwo+ ID4gY2xmbHVzaCBzaXplICAgIDogNjQKPiA+Cj4gPiBwcm9jZXNzb3IgICAgICAgOiAxCj4gPiB2 ZW5kb3JfaWQgICAgICAgOiBHZW51aW5lSW50ZWwKPiA+IGNwdSBmYW1pbHkgICAgICA6IDYKPiA+ IG1vZGVsICAgICAgICAgICA6IDE1Cj4gPiBtb2RlbCBuYW1lICAgICAgOiBJbnRlbChSKSBDb3Jl KFRNKTIgQ1BVICAgICAgICAgIDY0MjAgIEAgMi4xM0dIego+ID4gc3RlcHBpbmcgICAgICAgIDog Ngo+ID4gY3B1IE1IeiAgICAgICAgIDogMjEzMy4wMDAKPiA+IGNhY2hlIHNpemUgICAgICA6IDQw OTYgS0IKPiA+IHBoeXNpY2FsIGlkICAgICA6IDAKPiA+IHNpYmxpbmdzICAgICAgICA6IDIKPiA+ IGNvcmUgaWQgICAgICAgICA6IDEKPiA+IGNwdSBjb3JlcyAgICAgICA6IDIKPiA+IGZkaXZfYnVn ICAgICAgICA6IG5vCj4gPiBobHRfYnVnICAgICAgICAgOiBubwo+ID4gZjAwZl9idWcgICAgICAg IDogbm8KPiA+IGNvbWFfYnVnICAgICAgICA6IG5vCj4gPiBmcHUgICAgICAgICAgICAgOiB5ZXMK PiA+IGZwdV9leGNlcHRpb24gICA6IHllcwo+ID4gY3B1aWQgbGV2ZWwgICAgIDogMTAKPiA+IHdw ICAgICAgICAgICAgICA6IHllcwo+ID4gZmxhZ3MgICAgICAgICAgIDogZnB1IHZtZSBkZSBwc2Ug dHNjIG1zciBwYWUgbWNlIGN4OCBhcGljIHNlcCBtdHJyIHBnZSBtY2EgY21vdiBwYXQgcHNlMzYg Y2xmbHVzaCBkdHMgYWNwaSBtbXggZnhzciBzc2Ugc3NlMiBzcyBodCB0bSBwYmUgbnggbG0gY29u c3RhbnRfdHNjIHBuaSBtb25pdG9yIGRzX2NwbCB2bXggZXN0IHRtMiBzc3NlMyBjeDE2IHh0cHIg bGFoZl9sbQo+ID4gYm9nb21pcHMgICAgICAgIDogNDI1Ni4wMAo+ID4gY2xmbHVzaCBzaXplICAg IDogNjQKPgo+Cj4gLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQo+IFNGLk5ldCBlbWFpbCBpcyBzcG9uc29yZWQg Ynk6Cj4gQ2hlY2sgb3V0IHRoZSBuZXcgU291cmNlRm9yZ2UubmV0IE1hcmtldHBsYWNlLgo+IEl0 J3MgdGhlIGJlc3QgcGxhY2UgdG8gYnV5IG9yIHNlbGwgc2VydmljZXMgZm9yCj4ganVzdCBhYm91 dCBhbnl0aGluZyBPcGVuIFNvdXJjZS4KPiBodHRwOi8vc291cmNlZm9yZ2UubmV0L3NlcnZpY2Vz L2J1eS9pbmRleC5waHAKPiBfX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19fX19f X19fX19fXwo+IFNiY2wtZGV2ZWwgbWFpbGluZyBsaXN0Cj4gU2JjbC1kZXZlbEBsaXN0cy5zb3Vy Y2Vmb3JnZS5uZXQKPiBodHRwczovL2xpc3RzLnNvdXJjZWZvcmdlLm5ldC9saXN0cy9saXN0aW5m by9zYmNsLWRldmVsCj4K |
From: Daniel F. <drf...@gm...> - 2007-12-08 21:38:16
|
On Dec 8, 2007 2:23 AM, Nikodemus Siivola <nik...@ra...> wrot= e: > On Dec 8, 2007 2:51 AM, Daniel Farina <drf...@gm...> wrote: > > > I have recently determined that SBCL seems to be having an issue with > > garbage collection grabbing a recursive lock. I have found this bug > > affects every version I have tested from the latest to sbcl 1.0.0. > > > > The hardware is a core 2 duo (/proc/cpuinfo for one of these machines > > appended) running GNU/Linux. I have determined that on my two to three > > hour test set that this bug may occur one in two times. Unfortunately > > it's a total blocker at that point, so it's debilitating. > > Just a sanity check: nothing's binding *CURRENT-THREAD* by any chance, > using GET-MUTEX with an explicit NEW-OWNER, or frobbing MUTEX-OWNER/VALUE= ? > I don't write any odd threaded code here that would do such a thing. I suppose it's possible that one of the libraries I use (most obviously relevant: drakma, hunchentoot, bordeaux-threads for a portable WITH-MUTEX) might. > Does the issue manifest if you use (restrict-compiler-policy 'safety 3)? > I'll give this a try, but any sort of verification is going to take about a day to run at least a few trials. I'll get started on this and let you know. > There is also a known (and fixed) issue in SBCLs in the range 1.0.7.? > - 1.0.9.38 that could > manifest in this manner -- stay away from those. I suggest you run with 1= .0.12. > > > I'm poking around in the gc, interrupt, signal, and thread sections of > > SBCL, but it takes a long time to confirm or deny if anything I'm doing > > Are you saying you're looking for the problem, or are you saying you're r= unning > a patched SBCL? Patching SBCL: only in the last day or so (around the time I wrote this email). Prior to that I had been trying to see if I could find a revision of SBCL that worked to try and identify a regression-causing changeset. > At any rate, what's in *FEATURES*? To verify that you have a sane build, = please > send the results of (DISASSEMBLE 'SB-THREAD:GET-MUTEX). > *FEATURES*: (:OS-PROVIDES-DLADDR :SB-THREAD :ANSI-CL :COMMON-LISP :SBCL :SB-DOC :SB-TEST :SB-LDB :SB-PACKAGE-LOCKS :SB-UNICODE :SB-EVAL :SB-SOURCE-LOCATIONS :IEEE-FLOATING-POINT :X86 :UNIX :ELF :LINUX :LARGEFILE :GENCGC :STACK-GROWS-DOWNWARD-NOT-UPWARD :C-STACK-IS-CONTROL-STACK :COMPARE-AND-SWAP-VOPS :UNWIND-TO-FRAME-AND-CALL-VOP :STACK-ALLOCATABLE-CLOSURES :ALIEN-CALLBACKS :LINKAGE-TABLE :OS-PROVIDES-DLOPEN :OS-PROVIDES-PUTWC :OS-PROVIDES-SUSECONDS-T) (DISASSEMBLE 'SB-THREAD:GET-MUTEX): This is long, so I have attached it. WARNING: It comes from a patched version of SBCL that has some minor changes to SUB-GC. When I get back home tonight I will have time to build an unaltered SBCL if this disassembly seems bad. > > has an effect. Send any patches that you think will affect the bug or > > test cases that you want think may be able to reproduce the bug (I may > > get around to writing such a test case eventually). > > > > Some more obvious causes for failure: > > > > =95 without-interrupts could be not working entirely properly > > =95 without-interrupts may not be called in the right places to > > guard everything that needs to be guarded from interrupts > > I don't see how this would be the cause. (Which of course doesn't mean th= at > it can't be -- I just don't find it likely.) SUB-GC and GET-MUTEX are > the obvious > candidates. Okay, I had been staring at the SUB-GC procedure for a while and had been wondering if everything is being guarded properly even though I could not identify a principled reason for failure there after close inspection. > > Without further ado, here are three separate condition dumps: > > Which versions of SBCL are these from? These vary, unfortunately. The error tended to be similar, so I just tried to get a nice orthogonal set errors. I can generate you new condition dumps from whatever version you wish. fdr |
From: Daniel F. <drf...@gm...> - 2007-12-08 21:40:13
Attachments:
disassemble-sb-mutex.txt
|
Argh, forgot to attach. |
From: Nikodemus S. <nik...@ra...> - 2007-12-08 22:08:06
|
On Dec 8, 2007 9:35 PM, Daniel Farina <drf...@gm...> wrote: > > There is also a known (and fixed) issue in SBCLs in the range 1.0.7.? > > - 1.0.9.38 that could > > manifest in this manner -- stay away from those. I suggest you run with 1.0.12. > > > > > I'm poking around in the gc, interrupt, signal, and thread sections of > > > SBCL, but it takes a long time to confirm or deny if anything I'm doing > > > > Are you saying you're looking for the problem, or are you saying you're running > > a patched SBCL? > Patching SBCL: only in the last day or so (around the time I wrote > this email). Prior to that I had been trying to see if I could find a > revision of SBCL that worked to try and identify a regression-causing > changeset. What is the latest clean (unpatched) SBCL you have seen this in? Can you verify for sure that you've seen this in an unpatched 1.0.9.39 or later? What patches are you running with? Unless there are patches you *need*, it would be better if you could reproduce this in a clean build from eg. 1.0.12. > > At any rate, what's in *FEATURES*? To verify that you have a sane build, please > > send the results of (DISASSEMBLE 'SB-THREAD:GET-MUTEX). At a first glance that seems about right (in the sense that it has the correct full calls and CMPXCHG has the LOCK prefix). > WARNING: It comes from a patched version of SBCL that has some minor > changes to SUB-GC. When I get back home tonight I will have time to > build an unaltered SBCL if this disassembly seems bad. SUB-GC changes should not matter for GET-MUTEX disassembly, but SUB-GC changes could certainly be responsible for the error you are seeing... > These vary, unfortunately. The error tended to be similar, so I just > tried to get a nice orthogonal set errors. I can generate you new > condition dumps from whatever version you wish. Current HEAD or 1.0.12, unpatched. Cheers, -- Nikodemus |
From: Daniel F. <drf...@gm...> - 2007-12-08 22:25:20
|
On Dec 8, 2007 2:08 PM, Nikodemus Siivola <nik...@ra...> wrote: > On Dec 8, 2007 9:35 PM, Daniel Farina <drf...@gm...> wrote: > > > > There is also a known (and fixed) issue in SBCLs in the range 1.0.7.? > > > - 1.0.9.38 that could > > > manifest in this manner -- stay away from those. I suggest you run with 1.0.12. > > > > > > > I'm poking around in the gc, interrupt, signal, and thread sections of > > > > SBCL, but it takes a long time to confirm or deny if anything I'm doing > > > > > > Are you saying you're looking for the problem, or are you saying you're running > > > a patched SBCL? > > > Patching SBCL: only in the last day or so (around the time I wrote > > this email). Prior to that I had been trying to see if I could find a > > revision of SBCL that worked to try and identify a regression-causing > > changeset. > > What is the latest clean (unpatched) SBCL you have seen this in? Can you > verify for sure that you've seen this in an unpatched 1.0.9.39 or later? > Absolutely. I have tested un-altered versions in quite a large range. > What patches are you running with? Unless there are patches you *need*, > it would be better if you could reproduce this in a clean build from eg. > 1.0.12. Just some minor playing around in SUB-GC, but as mentioned before I spent a week or so verifying the existence of the bug in a number of versions between 1.0.0 through 1.0.12.X with no modifications. I actually first started noticing this problem from a binary upstream version. > > > At any rate, what's in *FEATURES*? To verify that you have a sane build, please > > > send the results of (DISASSEMBLE 'SB-THREAD:GET-MUTEX). > > At a first glance that seems about right (in the sense that it has the correct > full calls and CMPXCHG has the LOCK prefix). > > > WARNING: It comes from a patched version of SBCL that has some minor > > changes to SUB-GC. When I get back home tonight I will have time to > > build an unaltered SBCL if this disassembly seems bad. > > SUB-GC changes should not matter for GET-MUTEX disassembly, but > SUB-GC changes could certainly be responsible for the error you are > seeing... > I thought not, but I just figured I should come clean and document any and all changes for this particular dump. > > These vary, unfortunately. The error tended to be similar, so I just > > tried to get a nice orthogonal set errors. I can generate you new > > condition dumps from whatever version you wish. > > Current HEAD or 1.0.12, unpatched. > Will do. fdr |
From: Daniel F. <drf...@gm...> - 2007-12-09 20:00:08
|
On Sat, 2007-12-08 at 22:08 +0000, Nikodemus Siivola wrote: > On Dec 8, 2007 9:35 PM, Daniel Farina <drf...@gm...> wrote: > > These vary, unfortunately. The error tended to be similar, so I just > > tried to get a nice orthogonal set errors. I can generate you new > > condition dumps from whatever version you wish. > > Current HEAD or 1.0.12, unpatched. Here is one example from 1.0.12.17: debugger invoked on a SIMPLE-ERROR in thread #<THREAD "initial thread" {A6FD7B9}>: Recursive lock attempt #S(SB-THREAD:MUTEX :NAME "GC lock" :%OWNER NIL :STATE 0). Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL. restarts (invokable by number or by possibly-abbreviated name): 0: [ABORT] Exit debugger, returning to top level. (SB-THREAD:GET-MUTEX #<unavailable argument> #<unavailable argument> #<unavailable argument>) 0] WARNING: Starting a select without a timeout while interrupts are disabled fdr |
From: Daniel F. <drf...@gm...> - 2007-12-10 02:26:52
|
On Sat, 2007-12-08 at 22:08 +0000, Nikodemus Siivola wrote: > Current HEAD or 1.0.12, unpatched. > One more, just because. debugger invoked on a SIMPLE-ERROR in thread #<THREAD "initial thread" {A6FD7B9}>: Recursive lock attempt #S(SB-THREAD:MUTEX :NAME "GC lock" :%OWNER NIL :STATE 0). Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL. restarts (invokable by number or by possibly-abbreviated name): 0: [ABORT] Exit debugger, returning to top level. (SB-THREAD:GET-MUTEX #<unavailable argument> #<unavailable argument> #<unavailable argument>) 0] WARNING: Starting a select without a timeout while interrupts are disabled. fdr |
From: Nikodemus S. <nik...@ra...> - 2007-12-10 14:08:45
|
On Dec 10, 2007 2:26 AM, Daniel Farina <drf...@gm...> wrote: > > On Sat, 2007-12-08 at 22:08 +0000, Nikodemus Siivola wrote: > > Current HEAD or 1.0.12, unpatched. > > > > One more, just because. Mmm... full backtrace please? Just for clarity, please also report the LISP-IMPLEMENTATION-VERSION and *FEATURES* from that image, and output of "uname -a". Cheers, -- Nikodemus Siivola |
From: Daniel F. <drf...@gm...> - 2007-12-10 16:59:26
|
On Mon, 2007-12-10 at 14:08 +0000, Nikodemus Siivola wrote: > On Dec 10, 2007 2:26 AM, Daniel Farina <drf...@gm...> wrote: > > > > On Sat, 2007-12-08 at 22:08 +0000, Nikodemus Siivola wrote: > > > Current HEAD or 1.0.12, unpatched. > > > > > > > One more, just because. > > Mmm... full backtrace please? Just for clarity, please also report the > LISP-IMPLEMENTATION-VERSION and *FEATURES* from that image, > and output of "uname -a". > Unfortunately, that is everything. The more full backtrace seem to have occurred in older versions when -- for what is likely no principled reason whatsoever -- the GC recursive lock error occurred in other threads other than the main one. The error actually began to happen in the initial thread somewhere between 1.0.10 and 1.0.12 which is why we began to notice the error. Previous failings in other threads were non-fatal. I have recently (for no reason whatsoever) compiled from upstream and (re)confirmed the existence of the error on 1.0.12.21. (LISP-IMPLEMENTATION-VERSION) "1.0.12.21" *FEATURES* (:SB-BSD-SOCKETS-ADDRINFO :ASDF :OS-PROVIDES-DLADDR :SB-THREAD :ANSI-CL :COMMON-LISP :SBCL :SB-DOC :SB-TEST :SB-LDB :SB-PACKAGE-LOCKS :SB-UNICODE :SB-EVAL :SB-SOURCE-LOCATIONS :IEEE-FLOATING-POINT :X86 :UNIX :ELF :LINUX :LARGEFILE :GENCGC :STACK-GROWS-DOWNWARD-NOT-UPWARD :C-STACK-IS-CONTROL-STACK :COMPARE-AND-SWAP-VOPS :UNWIND-TO-FRAME-AND-CALL-VOP :STACK-ALLOCATABLE-CLOSURES :ALIEN-CALLBACKS :LINKAGE-TABLE :OS-PROVIDES-DLOPEN :OS-PROVIDES-PUTWC :OS-PROVIDES-SUSECONDS-T) `uname-a` Linux Tenacity 2.6.22-14-generic #1 SMP Sun Oct 14 23:05:12 GMT 2007 i686 GNU/Linux A rather samey 1.0.12.21 trace debugger invoked on a SIMPLE-ERROR in thread #<THREAD "initial thread" {A70B719}>: Recursive lock attempt #S(SB-THREAD:MUTEX :NAME "GC lock" :%OWNER NIL :STATE 0). Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL. restarts (invokable by number or by possibly-abbreviated name): 0: [ABORT] Exit debugger, returning to top level. (SB-THREAD:GET-MUTEX #<unavailable argument> #<unavailable argument> #<unavailable argument>) 0] WARNING: Starting a select without a timeout while interrupts are disabled. fdr |
From: Daniel F. <drf...@gm...> - 2007-12-10 19:29:58
|
On Dec 8, 2007 2:23 AM, Nikodemus Siivola <nik...@ra...> wrote: > Does the issue manifest if you use (restrict-compiler-policy 'safety 3)? > I almost dropped this request, sorry: I cleaned all my FASLs, used RESTRICT-COMPILER-POLICY, and then ran the tests again on SBCL 1.0.12.21. No change: the error still occurs frequently and without usable backtrace. Perhaps I should try 'debug 3? Hopefully timing issues will not alias out the error. fdr |
From: Daniel F. <drf...@gm...> - 2007-12-11 10:02:36
|
On Mon, 2007-12-10 at 11:29 -0800, Daniel Farina wrote: > On Dec 8, 2007 2:23 AM, Nikodemus Siivola <nik...@ra...> wrote: > > Does the issue manifest if you use (restrict-compiler-policy 'safety 3)? > > > > I almost dropped this request, sorry: > > I cleaned all my FASLs, used RESTRICT-COMPILER-POLICY, and then ran > the tests again on SBCL 1.0.12.21. No change: the error still occurs > frequently and without usable backtrace. Perhaps I should try 'debug > 3? Hopefully timing issues will not alias out the error. > Unfortunately, RESTRICT-COMPILER-POLICY with SAFETY 3 and DEBUG 3 did not yield any sort of an informative condition trace or change in the error. I'm going to start trying to write a test case that's duplicates the problem much more quickly and with less code. I had added some logging to my program and think this is possible given some patters I see. fdr |
From: Daniel F. <drf...@gm...> - 2007-12-11 21:04:05
|
On Dec 10, 2007 6:08 AM, Nikodemus Siivola <nik...@ra...> wrote: > On Dec 10, 2007 2:26 AM, Daniel Farina <drf...@gm...> wrote: > > > > On Sat, 2007-12-08 at 22:08 +0000, Nikodemus Siivola wrote: > > > Current HEAD or 1.0.12, unpatched. > > > > > > > One more, just because. > > Mmm... full backtrace please? Just for clarity, please also report the > LISP-IMPLEMENTATION-VERSION and *FEATURES* from that image, > and output of "uname -a". > > Cheers, > > -- Nikodemus Siivola > I hadn't realized that SBCL won't dump at least a good number of frames by default. (Whereas when the bug occurred in the not-initial-thread it DID seem to dump the trace). Sorry about the misunderstanding. I was thankfully corrected, here's a complete stack trace. The failure upon calling (make-array) seems pretty consistent. I have also determined that a sleep between invocations of drama's http request will mask the bug, so I'm going to try and write some code that does some rapid allocations and see if I can write a better test case. Again, sorry for the silly error. fdr Long trace follows. debugger invoked on a SIMPLE-ERROR in thread #<THREAD "initial thread" {A70B719}>: Recursive lock attempt #S(SB-THREAD:MUTEX :NAME "GC lock" :%OWNER NIL :STATE 0). Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL. restarts (invokable by number or by possibly-abbreviated name): 0: [ABORT] Exit debugger, returning to top level. (SB-THREAD:GET-MUTEX #<unavailable argument> #<unavailable argument> #<unavailable argument>) 0] WARNING: Starting a select without a timeout while interrupts are disabled. backtrace 0: (SB-THREAD:GET-MUTEX #<unavailable argument> #<unavailable argument> #<unavailable argument>) 1: ((FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) NIL) 2: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK) {B79DA57D}> #S(SB-THREAD:MUTEX :NAME "GC lock" :%OWNER NIL :STATE 0) #<SB-THREAD:THREAD "initial thread" {A70B719}> T) 3: ((FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) #<unavailable argument>) 4: ((FLET SB-UNIX::RUN-WITHOUT-INTERRUPTS)) 5: (SB-UNIX::CALL-WITHOUT-INTERRUPTS #<CLOSURE (FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) {B79DA66D}>) 6: (SB-KERNEL:SUB-GC) 7: ("foreign function: call_into_lisp") 8: ("foreign function: funcall0") 9: ("foreign function: maybe_gc") 10: ("foreign function: interrupt_handle_pending") 11: ("foreign function: handle_trap") 12: (MAKE-ARRAY 8212) 13: (SB-IMPL::DATA-VECTOR-FROM-INITS (8212) 8212 (UNSIGNED-BYTE 8) NIL NIL NIL NIL) 14: (MAKE-ARRAY 8212) 15: ((SB-PCL::FAST-METHOD TRIVIAL-GRAY-STREAMS:STREAM-WRITE-SEQUENCE (FLEXI-STREAMS:FLEXI-OUTPUT-STREAM "#<...>" . "#<...>")) #<unused argument> #<unused argument> #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> #<unavailable argument> 0 10) 16: ((LAMBDA (SB-PCL::.PV. SB-PCL::.NEXT-METHOD-CALL. "#<...>" . "#<...>")) #<unused argument> #<unused argument> #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> "User-Agent" 0 10 -303600718 0) 17: ((LAMBDA (SB-PCL::.ARG0. SB-PCL::.ARG1. "#<...>" . "#<...>")) #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument>) 18: ((SB-PCL::FAST-METHOD STREAM-WRITE-STRING (FLEXI-STREAMS:FLEXI-OUTPUT-STREAM "#<...>" . "#<...>")) #<unused argument> #<unused argument> #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> "User-Agent" 0 10) 19: (WRITE-STRING "User-Agent" #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> :START 0 :END 10) 20: (SB-PRETTY:OUTPUT-PRETTY-OBJECT "User-Agent" #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}>) 21: (PRINC "User-Agent" #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}>) 22: (SB-FORMAT::A-FORMAT-DIRECTIVE-INTERPRETER #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> #<~A> (": " #<~?>) #<unavailable argument> #<unavailable argument>) 23: (SB-FORMAT::INTERPRET-DIRECTIVE-LIST #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> (#<~A> ": " #<~?>) ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL)) 24: (SB-FORMAT::%FORMAT #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> "~A: ~?" ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL)) 25: (SB-FORMAT::?-FORMAT-DIRECTIVE-INTERPRETER #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> #<~?> (#<~C> #<~C>) ("~A: ~?" ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) #\Return #\Newline) #<unavailable argument>) 26: (SB-FORMAT::INTERPRET-DIRECTIVE-LIST #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> (#<~?> #<~C> #<~C>) ("~A: ~?" ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) #\Return #\Newline) ("~A: ~?" ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) #\Return #\Newline)) 27: (SB-FORMAT::%FORMAT #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> "~?~C~C" ("~A: ~?" ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) #\Return #\Newline) ("~A: ~?" ("User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) #\Return #\Newline)) 28: (FORMAT #<FLEXI-STREAMS::FLEXI-LATIN-1-IO-STREAM {DA01719}> "~?~C~C") 29: ((LABELS DRAKMA::WRITE-HTTP-LINE) "~A: ~?" "User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)" NIL) 30: ((LABELS DRAKMA::WRITE-HEADER) "User-Agent" "Drakma/0.11.1 (SBCL 1.0.12.21; Linux; 2.6.22-14-generic; http://weitz.de/drakma/)") 31: (DRAKMA:HTTP-REQUEST "http://localhost:4242/techradar/tech/Mobile-WiMax/timeline-most-active/Ntt") 32: (BULK-LOAD) 33: (SB-INT:SIMPLE-EVAL-IN-LEXENV (BULK-LOAD) #<NULL-LEXENV>) 34: (INTERACTIVE-EVAL (BULK-LOAD)) 35: (SB-IMPL::REPL-FUN NIL) 36: (SB-IMPL::REPL-FUN NIL) 37: ((LAMBDA ())) 38: ((LAMBDA ())) 39: (SB-IMPL::%WITH-REBOUND-IO-SYNTAX #<CLOSURE (LAMBDA #) {A70E5AD}>) 40: (SB-IMPL::TOPLEVEL-REPL NIL) 41: (SB-IMPL::TOPLEVEL-INIT) 42: ((LABELS SB-IMPL::RESTART-LISP)) |
From: Nikodemus S. <nik...@ra...> - 2007-12-12 11:45:59
|
On Dec 11, 2007 9:04 PM, Daniel Farina <drf...@gm...> wrote: Here's a small patch for you to run with. I don't expect it to make a difference, but it may yield a bit more information, specifically, whether it's *CURRENT-THREAD* of MUTEX-OWNER that's getting clobbered. All the changes are around the recursion check. (I pretty much refuse to believe that a thread doesn't see it's own write, so something else pretty much has to be writing where it should not.) You can just load this into a running SBCL. No, I didn't actually test the changes -- so you're best of 1st giving it a whirl by deliberately grabbing a lock twice and seeing that it does what it's supposed to... Cheers, -- Nikodemus (in-package :sb-thread) (defun get-mutex (mutex &optional new-owner (waitp t)) #+sb-doc "Acquire MUTEX for NEW-OWNER, which must be a thread or NIL. If NEW-OWNER is NIL, it defaults to the current thread. If WAITP is non-NIL and the mutex is in use, sleep until it is available. Note: using GET-MUTEX to assign a MUTEX to another thread then the current one is not recommended, and liable to be deprecated. GET-MUTEX is not interrupt safe. The correct way to call it is: (WITHOUT-INTERRUPTS ... (ALLOW-WITH-INTERRUPTS (GET-MUTEX ...)) ...) WITHOUT-INTERRUPTS is necessary to avoid an interrupt unwinding the call while the mutex is in an inconsistent state while, ALLOW-WITH-INTERRUPTS allows the call to be interrupted from sleep. It is recommended that you use WITH-MUTEX instead of calling GET-MUTEX directly." (declare (type mutex mutex) (optimize (speed 3)) #-sb-thread (ignore waitp)) (let ((self *current-thread*)) (unless new-owner (setq new-owner self)) (let ((old-owner (mutex-%owner mutex))) (when (eq new-owner old-owner) (error "Recursive lock attempt on ~S, mutex already owned by ~S.~%~ Current thread: ~S" mutex old-owner self)))) #+sb-thread (progn ;; FIXME: Lutexes do not currently support deadlines, as at least ;; on Darwin pthread_foo_timedbar functions are not supported: ;; this means that we probably need to use the Carbon multiprocessing ;; functions on Darwin. ;; ;; FIXME: This is definitely not interrupt safe: what happens if ;; we get hit (1) during the lutex calls (ok, they may be safe, ;; but has that been checked?) (2) after the lutex call, but ;; before setting the mutex owner. #+sb-lutex (when (zerop (with-lutex-address (lutex (mutex-lutex mutex)) (if waitp (with-interrupts (%lutex-lock lutex)) (%lutex-trylock lutex)))) (setf (mutex-%owner mutex) new-owner) t) #-sb-lutex (let ((old (sb-ext:compare-and-swap (mutex-state mutex) +lock-free+ +lock-taken+))) (unless (or (eql +lock-free+ old) (not waitp)) (tagbody :retry (when (or (eql +lock-contested+ old) (not (eql +lock-free+ (sb-ext:compare-and-swap (mutex-state mutex) +lock-taken+ +lock-contested+)))) ;; Wait on the contested lock. (multiple-value-bind (to-sec to-usec) (decode-timeout nil) (when (= 1 (with-pinned-objects (mutex) (futex-wait (mutex-state-address mutex) (get-lisp-obj-address +lock-contested+) (or to-sec -1) (or to-usec 0)))) (signal-deadline)))) (setf old (sb-ext:compare-and-swap (mutex-state mutex) +lock-free+ +lock-contested+)) ;; Did we get it? (unless (eql +lock-free+ old) (go :retry)))) (cond ((eql +lock-free+ old) (let ((prev (sb-ext:compare-and-swap (mutex-%owner mutex) nil new-owner))) (when prev (bug "Old owner in free mutex: ~S" prev)) t)) (waitp (bug "Failed to acquire lock with WAITP.")))))) |
From: Daniel F. <drf...@gm...> - 2007-12-13 18:49:10
|
On Dec 12, 2007 3:45 AM, Nikodemus Siivola <nik...@ra...> wrote: > On Dec 11, 2007 9:04 PM, Daniel Farina <drf...@gm...> wrote: > > Here's a small patch for you to run with. I don't expect it to make a > difference, > but it may yield a bit more information, specifically, whether it's > *CURRENT-THREAD* > of MUTEX-OWNER that's getting clobbered. All the changes are around the > recursion check. > > (I pretty much refuse to believe that a thread doesn't see it's own write, so > something else pretty much has to be writing where it should not.) > > You can just load this into a running SBCL. No, I didn't actually test the > changes -- so you're best of 1st giving it a whirl by deliberately grabbing > a lock twice and seeing that it does what it's supposed to... > > Cheers, > > -- Nikodemus > > (in-package :sb-thread) > > (defun get-mutex (mutex &optional new-owner (waitp t)) > #+sb-doc > "Acquire MUTEX for NEW-OWNER, which must be a thread or NIL. If > NEW-OWNER is NIL, it defaults to the current thread. If WAITP is > non-NIL and the mutex is in use, sleep until it is available. > > Note: using GET-MUTEX to assign a MUTEX to another thread then the > current one is not recommended, and liable to be deprecated. > > GET-MUTEX is not interrupt safe. The correct way to call it is: > > (WITHOUT-INTERRUPTS > ... > (ALLOW-WITH-INTERRUPTS (GET-MUTEX ...)) > ...) > > WITHOUT-INTERRUPTS is necessary to avoid an interrupt unwinding the > call while the mutex is in an inconsistent state while, > ALLOW-WITH-INTERRUPTS allows the call to be interrupted from sleep. > > It is recommended that you use WITH-MUTEX instead of calling GET-MUTEX > directly." > (declare (type mutex mutex) (optimize (speed 3)) > #-sb-thread (ignore waitp)) > (let ((self *current-thread*)) > (unless new-owner > (setq new-owner self)) > (let ((old-owner (mutex-%owner mutex))) > (when (eq new-owner old-owner) > (error "Recursive lock attempt on ~S, mutex already owned by ~S.~%~ > Current thread: ~S" mutex old-owner self)))) > #+sb-thread > (progn > ;; FIXME: Lutexes do not currently support deadlines, as at least > ;; on Darwin pthread_foo_timedbar functions are not supported: > ;; this means that we probably need to use the Carbon multiprocessing > ;; functions on Darwin. > ;; > ;; FIXME: This is definitely not interrupt safe: what happens if > ;; we get hit (1) during the lutex calls (ok, they may be safe, > ;; but has that been checked?) (2) after the lutex call, but > ;; before setting the mutex owner. > #+sb-lutex > (when (zerop (with-lutex-address (lutex (mutex-lutex mutex)) > (if waitp > (with-interrupts (%lutex-lock lutex)) > (%lutex-trylock lutex)))) > (setf (mutex-%owner mutex) new-owner) > t) > #-sb-lutex > (let ((old (sb-ext:compare-and-swap (mutex-state mutex) > +lock-free+ > +lock-taken+))) > (unless (or (eql +lock-free+ old) (not waitp)) > (tagbody > :retry > (when (or (eql +lock-contested+ old) > (not (eql +lock-free+ > (sb-ext:compare-and-swap (mutex-state mutex) > +lock-taken+ > +lock-contested+)))) > ;; Wait on the contested lock. > (multiple-value-bind (to-sec to-usec) (decode-timeout nil) > (when (= 1 (with-pinned-objects (mutex) > (futex-wait (mutex-state-address mutex) > (get-lisp-obj-address +lock-contested+) > (or to-sec -1) > (or to-usec 0)))) > (signal-deadline)))) > (setf old (sb-ext:compare-and-swap (mutex-state mutex) > +lock-free+ > +lock-contested+)) > ;; Did we get it? > (unless (eql +lock-free+ old) > (go :retry)))) > (cond ((eql +lock-free+ old) > (let ((prev (sb-ext:compare-and-swap (mutex-%owner mutex) > nil new-owner))) > (when prev > (bug "Old owner in free mutex: ~S" prev)) > t)) > (waitp > (bug "Failed to acquire lock with WAITP.")))))) > Here is a backtrace when running with the new GET-MUTEX. Unfortunately, it seems everything as one might expect with a naive recursive lock. fdr debugger invoked on a SIMPLE-ERROR in thread #<THREAD "initial thread" {A70D719}>: Recursive lock attempt on #S(SB-THREAD:MUTEX :NAME "GC lock" :%OWNER NIL :STATE 0), mutex already owned by #<SB-THREAD:THREAD "initial thread" {A70D719}>. Current thread: #<SB-THREAD:THREAD "initial thread" {A70D719}> Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL. restarts (invokable by number or by possibly-abbreviated name): 0: [ABORT] Exit debugger, returning to top level. (SB-THREAD:GET-MUTEX #<unavailable argument> #<unavailable argument> #<unavailable argument>) 0] WARNING: Starting a select without a timeout while interrupts are disabled. backtrace 0: (SB-THREAD:GET-MUTEX #<unavailable argument> #<unavailable argument> #<unavailable argument>) 1: ((FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) NIL) 2: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK) {B79DAD0D}> #S(SB-THREAD:MUTEX :NAME "GC lock" :%OWNER NIL :STATE 0) #<SB-THREAD:THREAD "initial thread" {A70D719}> T) 3: ((FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) #<unavailable argument>) 4: ((FLET SB-UNIX::RUN-WITHOUT-INTERRUPTS)) 5: (SB-UNIX::CALL-WITHOUT-INTERRUPTS #<CLOSURE (FLET SB-UNIX::WITHOUT-INTERRUPTS-THUNK) {B79DADFD}>) 6: (SB-KERNEL:SUB-GC) 7: ("foreign function: call_into_lisp") 8: ("foreign function: funcall0") 9: ("foreign function: maybe_gc") 10: ("foreign function: interrupt_handle_pending") 11: ("foreign function: handle_trap") 12: (MAKE-ARRAY 8212) 13: (SB-IMPL::DATA-VECTOR-FROM-INITS (8212) 8212 (UNSIGNED-BYTE 8) NIL NIL NIL NIL) 14: (MAKE-ARRAY 8212) 15: ((SB-PCL::FAST-METHOD TRIVIAL-GRAY-STREAMS:STREAM-WRITE-SEQUENCE (FLEXI-STREAMS:FLEXI-OUTPUT-STREAM "#<...>" . "#<...>")) #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument> #<unavailable argument>) ...truncated for brevity |
From: Nikodemus S. <nik...@ra...> - 2007-12-13 20:35:21
|
On Dec 13, 2007 6:49 PM, Daniel Farina <drf...@gm...> wrote: > Here is a backtrace when running with the new GET-MUTEX. > > Unfortunately, it seems everything as one might expect with a naive > recursive lock. This is just too bizarre. I can see three possibilities (aside from another thread frobbing the mutex back and forth): * SBCL has a signal handling bug in the runtime that allows the thread to be interrupted after it has grabbed mutex. But this doesn't explain how it is subsequently released. * Your kernel has a signal handling bug causing the above. Same problem -- doesn't explain what you see as far as I can tell. * You have bad memory / processor cache / other magic hardware bit that causes the thread to see its own write with a small latency. Quite unlikely, really, but this one actually explains what is observed... Or maybe I'm just dense. Cheers, -- Nikodemus |
From: Daniel F. <drf...@gm...> - 2007-12-13 21:25:37
|
On Dec 13, 2007 12:35 PM, Nikodemus Siivola <nik...@ra...> wrote: > On Dec 13, 2007 6:49 PM, Daniel Farina <drf...@gm...> wrote: > > Here is a backtrace when running with the new GET-MUTEX. > > > > Unfortunately, it seems everything as one might expect with a naive > > recursive lock. > > This is just too bizarre. I can see three possibilities (aside from > another thread frobbing the mutex back and forth): > > * SBCL has a signal handling bug in the runtime that allows the > thread to be interrupted after it has grabbed mutex. > > But this doesn't explain how it is subsequently released. > > * Your kernel has a signal handling bug causing the above. Same > problem -- doesn't explain what you see as far as I can tell. > > * You have bad memory / processor cache / other magic hardware > bit that causes the thread to see its own write with a small > latency. Quite unlikely, really, but this one actually explains > what is observed... > I think you are not dense, and it is just as bizarre as you think it is. However, pay in mind that it happens systematically on two machines of completely different make but similar architecture. I'm going to try and wrap up the writing of a test case that I can distribute and doesn't have quite so many dependencies so that more data can be gathered. Another alternative is just to throw NOPs after load/stores and see what happens. fdr |
From: Nikodemus S. <nik...@ra...> - 2007-12-14 13:09:19
|
On Dec 13, 2007 9:25 PM, Daniel Farina <drf...@gm...> wrote: > However, pay in mind that it happens systematically on two machines of > completely different make but similar architecture. Right. That does rule out memory issues fairly well. Is the kernel version the same? Cheers, -- Nikodemus |
From: Daniel F. <drf...@gm...> - 2007-12-14 16:44:06
|
On Dec 14, 2007 5:09 AM, Nikodemus Siivola <nik...@ra...> wrote: > On Dec 13, 2007 9:25 PM, Daniel Farina <drf...@gm...> wrote: > > > However, pay in mind that it happens systematically on two machines of > > completely different make but similar architecture. > > Right. That does rule out memory issues fairly well. Is the kernel > version the same? Yes, they are running an identical kernel and OS (Ubuntu 7.10). Any suggestions on what particular linux version may work better? (Or should I just try another one at random-ish?) fdr |
From: Nikodemus S. <nik...@ra...> - 2007-12-14 17:59:48
|
On Dec 14, 2007 4:44 PM, Daniel Farina <drf...@gm...> wrote: > Yes, they are running an identical kernel and OS (Ubuntu 7.10). Any > suggestions on what particular linux version may work better? (Or > should I just try another one at random-ish?) Someone on the list may have an idea, but I don't. I would suggest downgrading instead of upgrading, though -- since quick googling seems to suggest that .22 was a rather exciting release. Cheers, -- Nikodemus |
From: Daniel F. <drf...@gm...> - 2007-12-14 18:48:04
|
On Dec 14, 2007 9:59 AM, Nikodemus Siivola <nik...@ra...> wrote: > On Dec 14, 2007 4:44 PM, Daniel Farina <drf...@gm...> wrote: > > > Yes, they are running an identical kernel and OS (Ubuntu 7.10). Any > > suggestions on what particular linux version may work better? (Or > > should I just try another one at random-ish?) > > Someone on the list may have an idea, but I don't. I would suggest downgrading > instead of upgrading, though -- since quick googling seems to suggest that > .22 was a rather exciting release. > Well, in interest in being able to test a diversity of kernels to try and confirm or deny this is an issue I think I'll install Debian etch. I *could* switch kernels around with Ubuntu 7.10, but rather not. fdr |
From: Daniel F. <drf...@gm...> - 2007-12-15 23:38:07
|
On Fri, 2007-12-14 at 10:48 -0800, Daniel Farina wrote: > On Dec 14, 2007 9:59 AM, Nikodemus Siivola <nik...@ra...> wrote: > > On Dec 14, 2007 4:44 PM, Daniel Farina <drf...@gm...> wrote: > > > > > Yes, they are running an identical kernel and OS (Ubuntu 7.10). Any > > > suggestions on what particular linux version may work better? (Or > > > should I just try another one at random-ish?) > > > > Someone on the list may have an idea, but I don't. I would suggest downgrading > > instead of upgrading, though -- since quick googling seems to suggest that > > .22 was a rather exciting release. > > > > Well, in interest in being able to test a diversity of kernels to try > and confirm or deny this is an issue I think I'll install Debian etch. > I *could* switch kernels around with Ubuntu 7.10, but rather not. I've also confirmed a small test case seems to cause this failure. Right now it requires hunchentoot and drakma (same as in the app), although I aim to attempt to minimize the test further. It can still take minutes to cause the recursive lock failure. I just wanted to get this out there... Here it is: (require :asdf) (require :drakma) (require :hunchentoot) (defun doit () (declare (optimize (speed 3))) (hunchentoot:start-server :port 4242) (dotimes (i 100000000) (drakma:http-request "http://localhost:4242"))) I'm going to try to remove the dependence on external libraries over time... fdr |
From: Juho S. <js...@ik...> - 2007-12-16 05:56:52
|
Daniel Farina <drf...@gm...> writes: > I've also confirmed a small test case seems to cause this failure. Right > now it requires hunchentoot and drakma (same as in the app), although I > aim to attempt to minimize the test further. It can still take minutes > to cause the recursive lock failure. I just wanted to get this out > there... Thanks. I think what's going on is that the Hunchentoot reader threads are somehow managing to enter a GC after the dynamic binding of *CURRENT-THREAD* has been unwound. So there are multiple threads entering SUB-GC at the same time, all thinking that they're the initial thread. Something like this: Real initial thread: Fake initial thread: Enter GET-MUTEX Recursion check (ok) Acquire mutex Enter GET-MUTEX Recursion check (fail) Release mutex Print error But I don't really have a good theory on why a moribund thread would be initiating a GC. -- Juho Snellman |
From: Nikodemus S. <nik...@ra...> - 2007-12-17 19:00:49
|
On 16 Dec 2007 07:57:13 +0200, Juho Snellman <js...@ik...> wrote: > Real initial thread: Fake initial thread: > Enter GET-MUTEX > Recursion check (ok) > Acquire mutex > Enter GET-MUTEX > Recursion check (fail) > Release mutex > Print error > > But I don't really have a good theory on why a moribund thread would > be initiating a GC. "Now that I think of it, it's actually simple." *GC-PENDING* is global (shared between all threads. If the dying thread receives an interrupt in the window between unwinding the lisp stack (and losing *CURRENT-THREAD*) and blocking all threads, the interrupt handler will notice the pending GC and go for it. I'll need to check the details (most signals should be blocked before we stop running lisp), but if I'm on the right track, then the answer is to block all blockable signals before unwinding the lisp stack. I'll test this later today. Cheers, -- Nikodemus |
From: Juho S. <js...@ik...> - 2007-12-18 09:53:13
|
"Nikodemus Siivola" <nik...@ra...> writes: > "Now that I think of it, it's actually simple." > > *GC-PENDING* is global (shared between all threads. If the dying thread > receives an interrupt in the window between unwinding the lisp stack (and > losing *CURRENT-THREAD*) and blocking all threads, the interrupt handler > will notice the pending GC and go for it. > > I'll need to check the details (most signals should be blocked > before we stop running lisp), but if I'm on the right track, then > the answer is to block all blockable signals before unwinding the > lisp stack. I'll test this later today. We already block all deferrable signals, so the only extra signal that would be blocked would be SIG_STOP_FOR_GC. I haven't looked at the details either, but if stopping the world would cause one of the stopped threads to trigger a separate GC that seems like a bug. So blocking all blockables instead of just deferrables doesn't seem like the right solution either. -- Juho Snellman |