|
From: Mark W J. <mwj...@gm...> - 2014-11-10 19:54:29
|
We ran into this shortly after we started using a gc-logfile in an sbcl
application involving multiple threads and foreign-function calls.
A simple test case that reproduces the issue was tested with sbcl 1.1.4,
1.1.15, 1.2.0, 1.2.1, 1.2.5* is included at the end of the message. Those
are simply the list of versions I tried.
Other details:
$ uname -a
Linux ... 3.2.0-44-generic #69-Ubuntu SMP Thu May 16 17:35:01 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux.
$ ldd --version:
ldd (Ubuntu EGLIBC 2.15-0ubuntu10.7) 2.15
* the 1.2.5 test was with
Linux ... 3.2.0-53-generic #81-Ubuntu SMP Thu Aug 22 21:01:03 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux
ldd (Ubuntu EGLIBC 2.15-0ubuntu10.4) 2.15
The test case is contrived, but in a way I think captures what happens in
our application as it relates to this problem. The foreign allocations in
the test case are stand-ins for foreign function calls that will include,
within them, calls to malloc() and free() at some point in their lives.
The symptom is that after some time, all threads stop/hang: sbcl is
unresponsive. Attaching gdb to sbcl, it is fairly clear that it happens at
the beginning of a GC cycle, just when we go to print out the entry header
in the logfile, and fopen("my-gc-logfile","a") calls malloc, and malloc
gets stuck, as suggested by this gdb backtrace:
(gdb) backtrace
#0 __lll_lock_wait_private () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
#1 0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249
#2 0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at
malloc.c:2921
#3 0x00007ffff736cebb in __fopen_internal (filename=0x63e4e0
"my-gc-logfile", mode=0x42b962 "a", is32=1)
at ../sysdeps/wordsize-64/../../libio/iofopen.c:76
#4 0x0000000000421ac1 in log_generation_stats (header=0x42b9ad "=== GC
Start ===", logfile=0x63e4e0 "my-gc-logfile") at gencgc.c:597
#5 collect_garbage (last_gen=0 '\000') at gencgc.c:3826
#6 0x000000100014a436 in ?? ()
#7 0x00000010000221db in ?? ()
#8 0x0000000000000000 in ?? ()
Most threads are in sem_wait():
(gdb) backtrace
#0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
#1 0x00000000004160d4 in os_sem_wait (sem=0x7fffd4af0040, what=0x42b036
"wait_for_thread_state_change") at os-common.c:106
#2 0x000000000041b439 in wait_for_thread_state_change
(thread=0x7fffd4ae8000, state=4) at thread.c:195
#3 0x0000000000413e8b in sig_stop_for_gc_handler (signal=<optimized out>,
info=<optimized out>, context=0x7fffd48e6340)
at interrupt.c:1364
#4 0x0000000000411f9b in low_level_handle_now_handler (signal=12,
info=0x7fffd48e6470, void_context=0x7fffd48e6340) at interrupt.c:1929
#5 <signal handler called>
...
But a few (e.g. 10 in one case) are in malloc(), though called from thread.c
(gdb) backtrace
#0 __lll_lock_wait_private () at
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
#1 0x00007ffff7382f61 in _L_lock_10611 () at malloc.c:5249
#2 0x00007ffff7380c87 in __GI___libc_malloc (bytes=140728965529632) at
malloc.c:2921
#3 0x000000000041b1ef in plan_thread_post_mortem (corpse=0x7fffe3a20000)
at thread.c:276
#4 schedule_thread_post_mortem (corpse=0x7fffe3a20000) at thread.c:311
#5 undo_init_new_thread (th=0x7fffe3a20000, scribble=<optimized out>) at
thread.c:468
#6 new_thread_trampoline (th=0x7fffe3a20000) at thread.c:492
#7 0x00007ffff79c0e9a in start_thread (arg=0x7fffe381f700) at
pthread_create.c:308
#8 0x00007ffff73f231d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9 0x0000000000000000 in ?? ()
The understanding we have is that some of the foreign memory allocations
are getting locked out, or perhaps are locking out allocations in
plan_thread_post_mortem(), because there are more threads vying for memory
than their are malloc arenas. This would by itself resolve, but then a GC
happens. The GC'ing thread, in the same arena as some locked out call to
malloc, is itself locked out when the gc-logfile is fopen'ed, and must
wait. But GC stopped everyone else, so the thread it is waiting for will
never finish. Does this make sense? Should we consider another
interpretation?
It does seem like we can (1) increase the number of arena's
(MALLOC_ARENA_MAX >= max # of threads anticipated seems to solve the test
case) and (2) use (without-gcing) around the foreign allocations. Those
solve the problem for the test case. In the application, there are a very
large number of threads, and the foreign calls are less trivial, and we are
concerned that this solution will not scale well.
Another thought was to see if there were a safer way to write out the
logfile in gencgc.c. A colleague has suggested we might, for example,
pre-allocate a buffer for a line of logfile, write to it with sprintf, and
then use open(2) and write(2) to get the info out to the file. We haven't
tried this yet.
A deeper concern, however, arises in sbcl 1.2.5 in that there are now
(brought to my attention by the same colleague) explicit calls to malloc()
in set_page_consi_bit(), and free() in do_the_wipe() in gencgc.c. Indeed
the test code I provided hangs in 1.2.5 (system details described above)
even if (gc-logfile) is nil - is not used. All we need here to get into
trouble are foreign function calls that need to allocate memory, and we
wind up in a deadlock on gc'ing at some point.
Finally coming to the point, would it be feasible to avoid using things
that may wind up in malloc() and free() in gencgc.c? What do you think?
Thanks for your help,
Mark
;;---- file gc-test-w-aliens.lisp ----
(defun allocation-w-alien-thunk (vector-size)
(lambda ()
;; give us time to start all the threads... once we start allocating
memory,
(sleep 0.2d0) ;; thread creation slows, and we end up with fewer max
total threads
(let ((alien (sb-alien:make-alien double vector-size))
;; the only purpose of the native array is to force gc to happen
with some
;; frequency. We don't really have to do anything with it
(native (make-array vector-size :element-type 'double-float
:initial-element 0d0)))
(dotimes (j vector-size)
(setf (sb-alien:deref alien j) (random 1d0)))
(sb-alien:free-alien alien)
native)))
(defun threaded-gc-w-alien-test (name chunk-size cycle-length cycles )
(setf (gc-logfile) name)
(let ((threads (make-array cycle-length))
(result nil))
(dotimes (j cycles)
(dotimes (j cycle-length)
(setf (aref threads j)
(sb-thread:make-thread
(funcall #'allocation-w-alien-thunk chunk-size))))
(print (list j cycles (length (sb-thread:list-all-threads))))
(map nil (lambda (th)
(setf result
(sb-thread:join-thread th)))
threads))))
;;
;; run me:
(threaded-gc-w-alien-test "my-gc-logfile" (ash 1 15) 500 100)
;;---- end gc-test-w-aliens.lisp ---
-------
Mark W Johnson
Senior Scientist
D-Wave Systems Inc,
www.dwavesys.com
Burnaby BC, Canada
|