|
From: Martin C. <cra...@co...> - 2014-12-10 16:55:26
|
I plan to commit the attached patch. Also tracked in https://bugs.launchpad.net/sbcl/+bug/1389866/+addcomment Sorry it took me so long, I wanted to see in a direct comparison that between those three: - fix off - fix on and unfixed - fix on and this diff #2 would hang and #1 and #3 would not. The attached patch does that. I would appreciate if users of other platforms gave it a spin since it calls mmap and madvise in ways that I guess won't be platform friendly. Martin Martin Cracauer wrote on Mon, Nov 10, 2014 at 05:20:22PM -0500: > https://bugs.launchpad.net/sbcl/+bug/1389866 > > Sorry about that, working on it. > > Mark W Johnson wrote on Mon, Nov 10, 2014 at 11:53:35AM -0800: > > We ran into this shortly after we started using a gc-logfile in an sbcl > > application involving multiple threads and foreign-function calls. > > > > A simple test case that reproduces the issue was tested with sbcl 1.1.4, > > 1.1.15, 1.2.0, 1.2.1, 1.2.5* is included at the end of the message. Those > > are simply the list of versions I tried. > > Other details: > > $ uname -a > > Linux ... 3.2.0-44-generic #69-Ubuntu SMP Thu May 16 17:35:01 UTC 2013 > > x86_64 x86_64 x86_64 GNU/Linux. > > $ ldd --version: > > ldd (Ubuntu EGLIBC 2.15-0ubuntu10.7) 2.15 > > * the 1.2.5 test was with > > Linux ... 3.2.0-53-generic #81-Ubuntu SMP Thu Aug 22 21:01:03 UTC 2013 > > x86_64 x86_64 x86_64 GNU/Linux > > ldd (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15 > > > > The test case is contrived, but in a way I think captures what happens in > > our application as it relates to this problem. The foreign allocations in > > the test case are stand-ins for foreign function calls that will include, > > within them, calls to malloc() and free() at some point in their lives. > > > > The symptom is that after some time, all threads stop/hang: sbcl is > > unresponsive. Attaching gdb to sbcl, it is fairly clear that it happens at > > the beginning of a GC cycle, just when we go to print out the entry header > > in the logfile, and fopen("my-gc-logfile","a") calls malloc, and malloc > > gets stuck, as suggested by this gdb backtrace: > > > > (gdb) backtrace > > #0 __lll_lock_wait_private () at > > ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93 > > #1 0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249 > > #2 0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at > > malloc.c:2921 > > #3 0x00007ffff736cebb in __fopen_internal (filename=0x63e4e0 > > "my-gc-logfile", mode=0x42b962 "a", is32=1) > > at ../sysdeps/wordsize-64/../../libio/iofopen.c:76 > > #4 0x0000000000421ac1 in log_generation_stats (header=0x42b9ad "=== GC > > Start ===", logfile=0x63e4e0 "my-gc-logfile") at gencgc.c:597 > > #5 collect_garbage (last_gen=0 '\000') at gencgc.c:3826 > > #6 0x000000100014a436 in ?? () > > #7 0x00000010000221db in ?? () > > #8 0x0000000000000000 in ?? () > > > > Most threads are in sem_wait(): > > (gdb) backtrace > > #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86 > > #1 0x00000000004160d4 in os_sem_wait (sem=0x7fffd4af0040, what=0x42b036 > > "wait_for_thread_state_change") at os-common.c:106 > > #2 0x000000000041b439 in wait_for_thread_state_change > > (thread=0x7fffd4ae8000, state=4) at thread.c:195 > > #3 0x0000000000413e8b in sig_stop_for_gc_handler (signal=<optimized out>, > > info=<optimized out>, context=0x7fffd48e6340) > > at interrupt.c:1364 > > #4 0x0000000000411f9b in low_level_handle_now_handler (signal=12, > > info=0x7fffd48e6470, void_context=0x7fffd48e6340) at interrupt.c:1929 > > #5 <signal handler called> > > ... > > > > But a few (e.g. 10 in one case) are in malloc(), though called from thread.c > > (gdb) backtrace > > #0 __lll_lock_wait_private () at > > ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93 > > #1 0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249 > > #2 0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at > > malloc.c:2921 > > #3 0x000000000041b1ef in plan_thread_post_mortem (corpse=0x7fffe3a20000) > > at thread.c:276 > > #4 schedule_thread_post_mortem (corpse=0x7fffe3a20000) at thread.c:311 > > #5 undo_init_new_thread (th=0x7fffe3a20000, scribble=<optimized out>) at > > thread.c:468 > > #6 new_thread_trampoline (th=0x7fffe3a20000) at thread.c:492 > > #7 0x00007ffff79c0e9a in start_thread (arg=0x7fffe381f700) at > > pthread_create.c:308 > > #8 0x00007ffff73f231d in clone () at > > ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 > > #9 0x0000000000000000 in ?? () > > > > The understanding we have is that some of the foreign memory allocations > > are getting locked out, or perhaps are locking out allocations in > > plan_thread_post_mortem(), because there are more threads vying for memory > > than their are malloc arenas. This would by itself resolve, but then a GC > > happens. The GC'ing thread, in the same arena as some locked out call to > > malloc, is itself locked out when the gc-logfile is fopen'ed, and must > > wait. But GC stopped everyone else, so the thread it is waiting for will > > never finish. Does this make sense? Should we consider another > > interpretation? > > > > It does seem like we can (1) increase the number of arena's > > (MALLOC_ARENA_MAX >= max # of threads anticipated seems to solve the test > > case) and (2) use (without-gcing) around the foreign allocations. Those > > solve the problem for the test case. In the application, there are a very > > large number of threads, and the foreign calls are less trivial, and we are > > concerned that this solution will not scale well. > > > > Another thought was to see if there were a safer way to write out the > > logfile in gencgc.c. A colleague has suggested we might, for example, > > pre-allocate a buffer for a line of logfile, write to it with sprintf, and > > then use open(2) and write(2) to get the info out to the file. We haven't > > tried this yet. > > > > A deeper concern, however, arises in sbcl 1.2.5 in that there are now > > (brought to my attention by the same colleague) explicit calls to malloc() > > in set_page_consi_bit(), and free() in do_the_wipe() in gencgc.c. Indeed > > the test code I provided hangs in 1.2.5 (system details described above) > > even if (gc-logfile) is nil - is not used. All we need here to get into > > trouble are foreign function calls that need to allocate memory, and we > > wind up in a deadlock on gc'ing at some point. > > > > Finally coming to the point, would it be feasible to avoid using things > > that may wind up in malloc() and free() in gencgc.c? What do you think? > > > > Thanks for your help, > > > > Mark > > > > ;;---- file gc-test-w-aliens.lisp ---- > > > > (defun allocation-w-alien-thunk (vector-size) > > (lambda () > > ;; give us time to start all the threads... once we start allocating > > memory, > > (sleep 0.2d0) ;; thread creation slows, and we end up with fewer max > > total threads > > (let ((alien (sb-alien:make-alien double vector-size)) > > ;; the only purpose of the native array is to force gc to happen > > with some > > ;; frequency. We don't really have to do anything with it > > (native (make-array vector-size :element-type 'double-float > > :initial-element 0d0))) > > (dotimes (j vector-size) > > (setf (sb-alien:deref alien j) (random 1d0))) > > (sb-alien:free-alien alien) > > native))) > > > > (defun threaded-gc-w-alien-test (name chunk-size cycle-length cycles ) > > (setf (gc-logfile) name) > > (let ((threads (make-array cycle-length)) > > (result nil)) > > (dotimes (j cycles) > > (dotimes (j cycle-length) > > (setf (aref threads j) > > (sb-thread:make-thread > > (funcall #'allocation-w-alien-thunk chunk-size)))) > > (print (list j cycles (length (sb-thread:list-all-threads)))) > > (map nil (lambda (th) > > (setf result > > (sb-thread:join-thread th))) > > threads)))) > > ;; > > ;; run me: > > (threaded-gc-w-alien-test "my-gc-logfile" (ash 1 15) 500 100) > > > > ;;---- end gc-test-w-aliens.lisp --- > > > > ------- > > Mark W Johnson > > Senior Scientist > > D-Wave Systems Inc, > > www.dwavesys.com > > Burnaby BC, Canada > > > ------------------------------------------------------------------------------ > > Comprehensive Server Monitoring with Site24x7. > > Monitor 10 servers for $9/Month. > > Get alerted through email, SMS, voice calls or mobile push notifications. > > Take corrective actions from your mobile device. > > http://pubads.g.doubleclick.net/gampad/clk?id=154624111&iu=/4140/ostg.clktrk > > > _______________________________________________ > > Sbcl-help mailing list > > Sbc...@li... > > https://lists.sourceforge.net/lists/listinfo/sbcl-help > > > -- > %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% > Martin Cracauer <cra...@co...> http://www.cons.org/cracauer/ > > ------------------------------------------------------------------------------ > Comprehensive Server Monitoring with Site24x7. > Monitor 10 servers for $9/Month. > Get alerted through email, SMS, voice calls or mobile push notifications. > Take corrective actions from your mobile device. > http://pubads.g.doubleclick.net/gampad/clk?id=154624111&iu=/4140/ostg.clktrk > _______________________________________________ > Sbcl-help mailing list > Sbc...@li... > https://lists.sourceforge.net/lists/listinfo/sbcl-help -- %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Martin Cracauer <cra...@co...> http://www.cons.org/cracauer/ |