|
From: Martin C. <cra...@co...> - 2014-11-10 22:43:29
|
https://bugs.launchpad.net/sbcl/+bug/1389866 Sorry about that, working on it. Mark W Johnson wrote on Mon, Nov 10, 2014 at 11:53:35AM -0800: > We ran into this shortly after we started using a gc-logfile in an sbcl > application involving multiple threads and foreign-function calls. > > A simple test case that reproduces the issue was tested with sbcl 1.1.4, > 1.1.15, 1.2.0, 1.2.1, 1.2.5* is included at the end of the message. Those > are simply the list of versions I tried. > Other details: > $ uname -a > Linux ... 3.2.0-44-generic #69-Ubuntu SMP Thu May 16 17:35:01 UTC 2013 > x86_64 x86_64 x86_64 GNU/Linux. > $ ldd --version: > ldd (Ubuntu EGLIBC 2.15-0ubuntu10.7) 2.15 > * the 1.2.5 test was with > Linux ... 3.2.0-53-generic #81-Ubuntu SMP Thu Aug 22 21:01:03 UTC 2013 > x86_64 x86_64 x86_64 GNU/Linux > ldd (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15 > > The test case is contrived, but in a way I think captures what happens in > our application as it relates to this problem. The foreign allocations in > the test case are stand-ins for foreign function calls that will include, > within them, calls to malloc() and free() at some point in their lives. > > The symptom is that after some time, all threads stop/hang: sbcl is > unresponsive. Attaching gdb to sbcl, it is fairly clear that it happens at > the beginning of a GC cycle, just when we go to print out the entry header > in the logfile, and fopen("my-gc-logfile","a") calls malloc, and malloc > gets stuck, as suggested by this gdb backtrace: > > (gdb) backtrace > #0 __lll_lock_wait_private () at > ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93 > #1 0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249 > #2 0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at > malloc.c:2921 > #3 0x00007ffff736cebb in __fopen_internal (filename=0x63e4e0 > "my-gc-logfile", mode=0x42b962 "a", is32=1) > at ../sysdeps/wordsize-64/../../libio/iofopen.c:76 > #4 0x0000000000421ac1 in log_generation_stats (header=0x42b9ad "=== GC > Start ===", logfile=0x63e4e0 "my-gc-logfile") at gencgc.c:597 > #5 collect_garbage (last_gen=0 '\000') at gencgc.c:3826 > #6 0x000000100014a436 in ?? () > #7 0x00000010000221db in ?? () > #8 0x0000000000000000 in ?? () > > Most threads are in sem_wait(): > (gdb) backtrace > #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86 > #1 0x00000000004160d4 in os_sem_wait (sem=0x7fffd4af0040, what=0x42b036 > "wait_for_thread_state_change") at os-common.c:106 > #2 0x000000000041b439 in wait_for_thread_state_change > (thread=0x7fffd4ae8000, state=4) at thread.c:195 > #3 0x0000000000413e8b in sig_stop_for_gc_handler (signal=<optimized out>, > info=<optimized out>, context=0x7fffd48e6340) > at interrupt.c:1364 > #4 0x0000000000411f9b in low_level_handle_now_handler (signal=12, > info=0x7fffd48e6470, void_context=0x7fffd48e6340) at interrupt.c:1929 > #5 <signal handler called> > ... > > But a few (e.g. 10 in one case) are in malloc(), though called from thread.c > (gdb) backtrace > #0 __lll_lock_wait_private () at > ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93 > #1 0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249 > #2 0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at > malloc.c:2921 > #3 0x000000000041b1ef in plan_thread_post_mortem (corpse=0x7fffe3a20000) > at thread.c:276 > #4 schedule_thread_post_mortem (corpse=0x7fffe3a20000) at thread.c:311 > #5 undo_init_new_thread (th=0x7fffe3a20000, scribble=<optimized out>) at > thread.c:468 > #6 new_thread_trampoline (th=0x7fffe3a20000) at thread.c:492 > #7 0x00007ffff79c0e9a in start_thread (arg=0x7fffe381f700) at > pthread_create.c:308 > #8 0x00007ffff73f231d in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 > #9 0x0000000000000000 in ?? () > > The understanding we have is that some of the foreign memory allocations > are getting locked out, or perhaps are locking out allocations in > plan_thread_post_mortem(), because there are more threads vying for memory > than their are malloc arenas. This would by itself resolve, but then a GC > happens. The GC'ing thread, in the same arena as some locked out call to > malloc, is itself locked out when the gc-logfile is fopen'ed, and must > wait. But GC stopped everyone else, so the thread it is waiting for will > never finish. Does this make sense? Should we consider another > interpretation? > > It does seem like we can (1) increase the number of arena's > (MALLOC_ARENA_MAX >= max # of threads anticipated seems to solve the test > case) and (2) use (without-gcing) around the foreign allocations. Those > solve the problem for the test case. In the application, there are a very > large number of threads, and the foreign calls are less trivial, and we are > concerned that this solution will not scale well. > > Another thought was to see if there were a safer way to write out the > logfile in gencgc.c. A colleague has suggested we might, for example, > pre-allocate a buffer for a line of logfile, write to it with sprintf, and > then use open(2) and write(2) to get the info out to the file. We haven't > tried this yet. > > A deeper concern, however, arises in sbcl 1.2.5 in that there are now > (brought to my attention by the same colleague) explicit calls to malloc() > in set_page_consi_bit(), and free() in do_the_wipe() in gencgc.c. Indeed > the test code I provided hangs in 1.2.5 (system details described above) > even if (gc-logfile) is nil - is not used. All we need here to get into > trouble are foreign function calls that need to allocate memory, and we > wind up in a deadlock on gc'ing at some point. > > Finally coming to the point, would it be feasible to avoid using things > that may wind up in malloc() and free() in gencgc.c? What do you think? > > Thanks for your help, > > Mark > > ;;---- file gc-test-w-aliens.lisp ---- > > (defun allocation-w-alien-thunk (vector-size) > (lambda () > ;; give us time to start all the threads... once we start allocating > memory, > (sleep 0.2d0) ;; thread creation slows, and we end up with fewer max > total threads > (let ((alien (sb-alien:make-alien double vector-size)) > ;; the only purpose of the native array is to force gc to happen > with some > ;; frequency. We don't really have to do anything with it > (native (make-array vector-size :element-type 'double-float > :initial-element 0d0))) > (dotimes (j vector-size) > (setf (sb-alien:deref alien j) (random 1d0))) > (sb-alien:free-alien alien) > native))) > > (defun threaded-gc-w-alien-test (name chunk-size cycle-length cycles ) > (setf (gc-logfile) name) > (let ((threads (make-array cycle-length)) > (result nil)) > (dotimes (j cycles) > (dotimes (j cycle-length) > (setf (aref threads j) > (sb-thread:make-thread > (funcall #'allocation-w-alien-thunk chunk-size)))) > (print (list j cycles (length (sb-thread:list-all-threads)))) > (map nil (lambda (th) > (setf result > (sb-thread:join-thread th))) > threads)))) > ;; > ;; run me: > (threaded-gc-w-alien-test "my-gc-logfile" (ash 1 15) 500 100) > > ;;---- end gc-test-w-aliens.lisp --- > > ------- > Mark W Johnson > Senior Scientist > D-Wave Systems Inc, > www.dwavesys.com > Burnaby BC, Canada > ------------------------------------------------------------------------------ > Comprehensive Server Monitoring with Site24x7. > Monitor 10 servers for $9/Month. > Get alerted through email, SMS, voice calls or mobile push notifications. > Take corrective actions from your mobile device. > http://pubads.g.doubleclick.net/gampad/clk?id=154624111&iu=/4140/ostg.clktrk > _______________________________________________ > Sbcl-help mailing list > Sbc...@li... > https://lists.sourceforge.net/lists/listinfo/sbcl-help -- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Martin Cracauer <cra...@co...> http://www.cons.org/cracauer/ |