Re: [Sbcl-help] thread deadlock in malloc during gc when using (gc-logfile) & foreign functions

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

I plan to commit the attached patch.
Also tracked in
https://bugs.launchpad.net/sbcl/+bug/1389866/+addcomment

Sorry it took me so long, I wanted to see in a direct comparison that
between those three:
- fix off
- fix on and unfixed
- fix on and this diff

#2 would hang and #1 and #3 would not.  The attached patch does that.

I would appreciate if users of other platforms gave it a spin since it
calls mmap and madvise in ways that I guess won't be platform friendly.

Martin

Martin Cracauer wrote on Mon, Nov 10, 2014 at 05:20:22PM -0500: 
> https://bugs.launchpad.net/sbcl/+bug/1389866
> 
> Sorry about that, working on it.
> 
> Mark W Johnson wrote on Mon, Nov 10, 2014 at 11:53:35AM -0800: 
> > We ran into this shortly after we started using a gc-logfile in an sbcl
> > application involving multiple threads and foreign-function calls.
> > 
> > A simple test case that reproduces the issue was tested with sbcl 1.1.4,
> > 1.1.15, 1.2.0, 1.2.1, 1.2.5* is included at the end of the message. Those
> > are simply the list of versions I tried.
> > Other details:
> > $ uname -a
> > Linux ... 3.2.0-44-generic #69-Ubuntu SMP Thu May 16 17:35:01 UTC 2013
> > x86_64 x86_64 x86_64 GNU/Linux.
> > $ ldd --version:
> > ldd (Ubuntu EGLIBC 2.15-0ubuntu10.7) 2.15
> > * the 1.2.5 test was with
> > Linux ... 3.2.0-53-generic #81-Ubuntu SMP Thu Aug 22 21:01:03 UTC 2013
> > x86_64 x86_64 x86_64 GNU/Linux
> > ldd (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15
> > 
> > The test case is contrived, but in a way I think captures what happens in
> > our application as it relates to this problem.  The foreign allocations in
> > the test case are stand-ins for foreign function calls that will include,
> > within them, calls to malloc() and free() at some point in their lives.
> > 
> > The symptom is that after some time, all threads stop/hang: sbcl is
> > unresponsive.  Attaching gdb to sbcl, it is fairly clear that it happens at
> > the beginning of a GC cycle, just when we go to print out the entry header
> > in the logfile, and fopen("my-gc-logfile","a") calls malloc, and malloc
> > gets stuck, as suggested by this gdb backtrace:
> > 
> > (gdb) backtrace
> > #0  __lll_lock_wait_private () at
> > ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
> > #1  0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249
> > #2  0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at
> > malloc.c:2921
> > #3  0x00007ffff736cebb in __fopen_internal (filename=0x63e4e0
> > "my-gc-logfile", mode=0x42b962 "a", is32=1)
> >     at ../sysdeps/wordsize-64/../../libio/iofopen.c:76
> > #4  0x0000000000421ac1 in log_generation_stats (header=0x42b9ad "=== GC
> > Start ===", logfile=0x63e4e0 "my-gc-logfile") at gencgc.c:597
> > #5  collect_garbage (last_gen=0 '\000') at gencgc.c:3826
> > #6  0x000000100014a436 in ?? ()
> > #7  0x00000010000221db in ?? ()
> > #8  0x0000000000000000 in ?? ()
> > 
> > Most threads are in sem_wait():
> > (gdb) backtrace
> > #0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
> > #1  0x00000000004160d4 in os_sem_wait (sem=0x7fffd4af0040, what=0x42b036
> > "wait_for_thread_state_change") at os-common.c:106
> > #2  0x000000000041b439 in wait_for_thread_state_change
> > (thread=0x7fffd4ae8000, state=4) at thread.c:195
> > #3  0x0000000000413e8b in sig_stop_for_gc_handler (signal=<optimized out>,
> > info=<optimized out>, context=0x7fffd48e6340)
> >     at interrupt.c:1364
> > #4  0x0000000000411f9b in low_level_handle_now_handler (signal=12,
> > info=0x7fffd48e6470, void_context=0x7fffd48e6340) at interrupt.c:1929
> > #5  <signal handler called>
> > ...
> > 
> > But a few (e.g. 10 in one case) are in malloc(), though called from thread.c
> > (gdb) backtrace
> > #0  __lll_lock_wait_private () at
> > ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
> > #1  0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249
> > #2  0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at
> > malloc.c:2921
> > #3  0x000000000041b1ef in plan_thread_post_mortem (corpse=0x7fffe3a20000)
> > at thread.c:276
> > #4  schedule_thread_post_mortem (corpse=0x7fffe3a20000) at thread.c:311
> > #5  undo_init_new_thread (th=0x7fffe3a20000, scribble=<optimized out>) at
> > thread.c:468
> > #6  new_thread_trampoline (th=0x7fffe3a20000) at thread.c:492
> > #7  0x00007ffff79c0e9a in start_thread (arg=0x7fffe381f700) at
> > pthread_create.c:308
> > #8  0x00007ffff73f231d in clone () at
> > ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> > #9  0x0000000000000000 in ?? ()
> > 
> > The understanding we have is that some of the foreign memory allocations
> > are getting locked out, or perhaps are locking out allocations in
> > plan_thread_post_mortem(), because there are more threads vying for memory
> > than their are malloc arenas.  This would by itself resolve, but then a GC
> > happens.  The GC'ing thread, in the same arena as some locked out call to
> > malloc, is itself locked out when the gc-logfile is fopen'ed, and must
> > wait.  But GC stopped everyone else, so the thread it is waiting for will
> > never finish.  Does this make sense?  Should we consider another
> > interpretation?
> > 
> > It does seem like we can (1) increase the number of arena's
> > (MALLOC_ARENA_MAX >= max # of threads anticipated seems to solve the test
> > case) and (2) use (without-gcing) around the foreign allocations.  Those
> > solve the problem for the test case. In the application, there are a very
> > large number of threads, and the foreign calls are less trivial, and we are
> > concerned that this solution will not scale well.
> > 
> > Another thought was to see if there were a safer way to write out the
> > logfile in gencgc.c.  A colleague has suggested we might, for example,
> > pre-allocate a buffer for a line of logfile, write to it with sprintf, and
> > then use open(2) and write(2) to get the info out to the file.  We haven't
> > tried this yet.
> > 
> > A deeper concern, however, arises in sbcl 1.2.5 in that there are now
> > (brought to my attention by the same colleague) explicit calls to malloc()
> > in set_page_consi_bit(), and free() in do_the_wipe() in gencgc.c. Indeed
> > the test code I provided hangs in 1.2.5 (system details described above)
> > even if (gc-logfile) is nil - is not used.  All we need here to get into
> > trouble are foreign function calls that need to allocate memory,  and we
> > wind up in a deadlock on gc'ing at some point.
> > 
> > Finally coming to the point, would it be feasible to avoid using things
> > that may wind up in malloc() and free() in gencgc.c?  What do you think?
> > 
> > Thanks for your help,
> > 
> > Mark
> > 
> > ;;---- file gc-test-w-aliens.lisp ----
> > 
> > (defun allocation-w-alien-thunk (vector-size)
> >   (lambda ()
> >     ;; give us time to start all the threads... once we start allocating
> > memory,
> >     (sleep 0.2d0)  ;;  thread creation slows, and we end up with fewer max
> > total threads
> >     (let ((alien (sb-alien:make-alien double vector-size))
> >           ;; the only purpose of the native array is to force gc to happen
> > with some
> >           ;; frequency.  We don't really have to do anything with it
> >           (native (make-array vector-size :element-type 'double-float
> > :initial-element 0d0)))
> >        (dotimes (j vector-size)
> >          (setf (sb-alien:deref alien j) (random 1d0)))
> >       (sb-alien:free-alien alien)
> >       native)))
> > 
> > (defun threaded-gc-w-alien-test (name chunk-size cycle-length cycles )
> >   (setf (gc-logfile) name)
> >   (let ((threads (make-array cycle-length))
> >         (result nil))
> >   (dotimes (j cycles)
> >     (dotimes (j cycle-length)
> >       (setf (aref threads j)
> >             (sb-thread:make-thread
> >              (funcall #'allocation-w-alien-thunk chunk-size))))
> >     (print (list j cycles  (length (sb-thread:list-all-threads))))
> >     (map nil (lambda (th)
> >                (setf result
> >                      (sb-thread:join-thread th)))
> >          threads))))
> > ;;
> > ;; run me:
> > (threaded-gc-w-alien-test "my-gc-logfile" (ash 1 15) 500 100)
> > 
> > ;;---- end gc-test-w-aliens.lisp ---
> > 
> > -------
> > Mark W Johnson
> > Senior Scientist
> > D-Wave Systems Inc,
> > www.dwavesys.com
> > Burnaby BC, Canada
> 
> > ------------------------------------------------------------------------------
> > Comprehensive Server Monitoring with Site24x7.
> > Monitor 10 servers for $9/Month.
> > Get alerted through email, SMS, voice calls or mobile push notifications.
> > Take corrective actions from your mobile device.
> > http://pubads.g.doubleclick.net/gampad/clk?id=154624111&iu=/4140/ostg.clktrk
> 
> > _______________________________________________
> > Sbcl-help mailing list
> > Sbc...@li...
> > https://lists.sourceforge.net/lists/listinfo/sbcl-help
> 
> 
> -- 
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Martin Cracauer <cra...@co...>   http://www.cons.org/cracauer/
> 
> ------------------------------------------------------------------------------
> Comprehensive Server Monitoring with Site24x7.
> Monitor 10 servers for $9/Month.
> Get alerted through email, SMS, voice calls or mobile push notifications.
> Take corrective actions from your mobile device.
> http://pubads.g.doubleclick.net/gampad/clk?id=154624111&iu=/4140/ostg.clktrk
> _______________________________________________
> Sbcl-help mailing list
> Sbc...@li...
> https://lists.sourceforge.net/lists/listinfo/sbcl-help

-- 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Martin Cracauer <cra...@co...>   http://www.cons.org/cracauer/

Re: [Sbcl-help] thread deadlock in malloc during gc when using (gc-logfile) & foreign functions

Common Lisp compiler and runtime

Re: [Sbcl-help] thread deadlock in malloc during gc when using (gc-logfile) & foreign functions