From: <me...@ho...> - 2005-04-07 16:26:04
Attachments: thread.patch
Threaded SBCL 0.8.21 locks up on:

    (progn
      (defun waste (&optional (n 100000))
        (loop repeat n do (make-string 16384)))
      (loop for i below 10 do
        (format t "LOOP:~A~%" i)
        (force-output)
        (sb-thread:make-thread #'(lambda () (waste)))
        (waste)
        (gc)))

It is mostly because the delivery of SIG_STOP_FOR_GC is unreliable. There are a few problems:

1) a thread struct with pid 0 may be on all_threads, and a signal delivered to pid 0 is sent to every process in the process group of the current process (not sure it was triggered)

2) gc_start_the_world does not wait for the signaled thread to process SIG_STOP_FOR_GC => the signal may queue up and thread->state gets out of sync with reality => gc confusion (this is not likely to happen either)

3) threads start up in STATE_STOPPED. If gc_stop_the_world sees this state, it does not send a signal to stop it, which is bad enough, but later gc_start_the_world does, which is even worse => total confusion (perhaps this is triggered by the above test)

4) if a gc hits after a thread is cloned, but before it is arch_os_thread_init'ed, we get a nice memory fault or two (why?). If all else is fixed, this can be triggered by:

    (progn
      (defun waste (&optional (n 100000))
        (loop repeat n do (make-string 16384)))
      (defparameter *aaa* nil)
      (loop for i below 10 do
        (format t "LOOP:~A~%" i)
        (force-output)
        (sb-thread:make-thread
         #'(lambda () (let ((*aaa* (waste))) (waste))))
        (let ((*aaa* (waste))) (waste))
        (gc)))

Although my understanding of sbcl, signals, gc and threading is patchy at best, the attached patch attempts to fix these problems by:

- threads start in STATE_STARTING
- create_thread holds a thread_start_lock until the started thread enters STATE_RUNNING
- gc_stop_the_world acquires and gc_start_the_world releases thread_start_lock
- gc_start_the_world waits until all threads leave STATE_STOPPED
- test for pid 0 before sending a signal

The test forms do not fail anymore, my paserve app runs better and the threaded tests of cl-ppcre fail later :-(.

Cheers, Gabor
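The checks listed in the message above can be illustrated with a small C sketch. This is not the code from thread.patch: only the state names STATE_STARTING, STATE_RUNNING and STATE_STOPPED come from the message, while `struct thread_sketch` and the three helper functions are hypothetical names made up for illustration. Skipping the collecting thread itself is likewise an assumption. The pid 0 guard relies on the POSIX rule that kill() with pid 0 signals every process in the caller's process group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* State names follow the message; everything else here is hypothetical. */
enum thread_state { STATE_STARTING, STATE_RUNNING, STATE_STOPPED };

struct thread_sketch {
    pid_t pid;                 /* 0 means the pid has not been filled in yet */
    enum thread_state state;
};

/* Should stop-the-world send its stop signal to this thread? */
static bool should_signal_stop(const struct thread_sketch *th, pid_t self)
{
    if (th->pid == 0)
        return false;          /* kill(0, sig) would hit the whole process group */
    if (th->pid == self)
        return false;          /* assumption: the collecting thread skips itself */
    return th->state == STATE_RUNNING;
}

/* Number of threads that stop-the-world would signal. */
static size_t count_stop_targets(const struct thread_sketch *ts, size_t n,
                                 pid_t self)
{
    assert(ts != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (should_signal_stop(&ts[i], self))
            count++;
    return count;
}

/* Start-the-world may return only once no thread is still stopped. */
static bool world_restarted(const struct thread_sketch *ts, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ts[i].state == STATE_STOPPED)
            return false;
    return true;
}
```

In this sketch a thread in STATE_STARTING is never signalled. That only makes sense together with the locking described in the message: while create_thread holds thread_start_lock, gc_stop_the_world cannot proceed, so it should never observe a half-initialized thread.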
From: <me...@ho...> - 2005-04-09 09:18:08
Attachments: thread.patch
On Thursday 07 April 2005 18:26, Gábor Melis wrote:
> The test forms do not fail anymore, my paserve app runs better and the
> threaded tests of cl-ppcre fail later :-(.

I have cleaned up the patch a little: if no threads can be started, then gc_{stop,start}_the_world might as well forget about defending against new threads being linked onto all_threads. It was tested against 0.8.21.23 (with Nikodemus's finalization fixes). It now finishes the threaded cl-ppcre tests and runs my paserve app with 60 simultaneous clients without lockups and at a reasonably stable pace (6-7s/1000 requests), which is a great improvement over the previous value of 6-150s plus lockups.
From: <me...@ho...> - 2005-05-27 08:33:34
This would be a great moment to comment on the patch as, if all goes well, I plan to commit it this weekend after some more tests.

Gabor