#192 Hang with condition variable

Milestone: HEAD
Status: accepted
Owner: nobody
Priority: 5
Updated: 2014-09-23
Created: 2012-06-30
Creator: Anonymous
Private: No
(defun test ()
  (let ((lock (mp:make-lock))
        (cvar (mp:make-condition-variable))
        (flag nil))
    (mp:process-run-function
     "test" (lambda ()
              (mp:with-lock (lock)
                (setf flag t)
                (mp:condition-variable-signal cvar))))
    (mp:with-lock (lock)
      (loop until flag do (mp:condition-variable-wait cvar lock)))))

(defun run ()
  (loop
     (test)
     (format t ".")
     (finish-output)))

RUN eventually hangs with latest git 1d3355d, but not with 12.2.1.

Linux xi 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686 i386 GNU/Linux

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Discussion

  • The previous implementation was based on wrong assumptions. I have uploaded a new one. Please verify that it works for you; unless I hear otherwise, I will close the bug.

     
  • Please forget the previous comment: the fix was wrong, because it woke not only the threads that were already waiting but also those just arriving. I will work further on this.

     

  • Anonymous
    2012-07-11

    In case this might be helpful, the following RUN hangs with and without the latest changes.

    (defstruct sema
      (count 0)
      (lock (mp:make-lock))
      (cvar (mp:make-condition-variable)))
    
    (defun inc-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (incf (sema-count sema))
        (mp:condition-variable-signal (sema-cvar sema))))
    
    (defun dec-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (loop (cond ((plusp (sema-count sema))
                     (decf (sema-count sema))
                     (return))
                    (t
                     (mp:condition-variable-wait
                      (sema-cvar sema) (sema-lock sema)))))))
    
    (defun test (thread-count)
      (let ((from-threads (make-sema))
            (to-threads   (make-sema)))
        (loop repeat thread-count do
             (mp:process-run-function
              "test" (lambda ()
                       (dec-sema to-threads)
                       (inc-sema from-threads))))
        (loop repeat thread-count do (inc-sema to-threads))
        (loop repeat thread-count do (dec-sema from-threads))))
    
    (defun run ()
      (loop
         (test 16)
         (format t ".")
         (finish-output)))
    
     
    Last edit: Juan Jose Garcia Ripoll 2012-12-05
  • Thanks a lot for the test cases. I believe I have found the problem now: the condition variable signaling operation must not only wake the process, but also immediately remove it from the wait queue.

    Both your programs run without hanging, but I will wait for your feedback to close the bug report.
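
    A minimal sketch of the rule stated above (signal wakes a waiter and unlinks it in the same critical section), written in C with POSIX threads. It is not ECL's code: `struct waiter`, `struct wait_queue`, `wq_enqueue` and `wq_signal` are hypothetical names, a pthread mutex stands in for whatever lock protects the real queue, and only the queue bookkeeping is modelled (no thread actually blocks here).

    ```c
    #include <assert.h>
    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical waiter record; not ECL's actual data structure. */
    struct waiter {
        struct waiter *next;
        int woken;                  /* set by the signaller */
    };

    struct wait_queue {
        pthread_mutex_t lock;       /* protects head and tail */
        struct waiter *head;
        struct waiter *tail;
    };

    static void wq_init(struct wait_queue *q)
    {
        int rc = pthread_mutex_init(&q->lock, NULL);
        assert(rc == 0);
        (void)rc;
        q->head = q->tail = NULL;
    }

    /* Append a waiter; called by a thread that is about to block. */
    static void wq_enqueue(struct wait_queue *q, struct waiter *w)
    {
        pthread_mutex_lock(&q->lock);
        w->next = NULL;
        w->woken = 0;
        if (q->tail != NULL)
            q->tail->next = w;
        else
            q->head = w;
        q->tail = w;
        pthread_mutex_unlock(&q->lock);
    }

    /* Wake the oldest waiter AND unlink it before releasing the lock.
       Returns the waiter that was woken, or NULL if the queue was empty. */
    static struct waiter *wq_signal(struct wait_queue *q)
    {
        struct waiter *w;

        pthread_mutex_lock(&q->lock);
        w = q->head;
        if (w != NULL) {
            q->head = w->next;
            if (q->head == NULL)
                q->tail = NULL;
            w->next = NULL;
            w->woken = 1;
        }
        pthread_mutex_unlock(&q->lock);
        return w;
    }
    ```

    Because the waiter leaves the queue at signal time, a second signal that arrives before the first waiter has run again reaches the next waiter instead of being spent on one that is already awake.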

     

  • Anonymous
    2012-07-14

    The second test case still hangs for me eventually, though it now takes longer to do so -- within 3000 iterations. Changing thread-count to 128 causes it to hang within 100 iterations.

    If it's any consolation, LispWorks 6.0 failed the same test with similar symptoms.

     
    • I have uploaded another fix, this time to the wait queue, because some signaling events were being lost between the point at which interrupts are disabled and the point at which the thread enters the wait queue. I see no hangs on either Linux or OS X (Snow Leopard & Lion). Could you check?

       

      • Anonymous
        2012-07-28

        I regret to report that the latest in git (7638ca) still hangs for me. I did two sets of three trials.

        Iterations until hang for 16 threads: 1626, 6228, 1505

        Iterations until hang for 128 threads: 185, 561, 129

        While running inside a terminal (as opposed to slime), when I interrupt after hanging and then hit Ctrl-D to exit, there are segfault messages printed as it's exiting. This isn't new; I just tried a recent earlier build (e1fdb8) and saw the same.

        One time I had a segfault instead of a hang, and in fact I still have this session running, so feel free to ask me to print values or whatever. The backtrace is empty.

        I made the following comment on a CLISP bug report for the same test case, which may or may not be relevant:

        On SBCL I once had a condition variable problem which was either wholly present or wholly absent depending upon some dice roll at launch time. (The SBCL bundled with Ubuntu sometimes decided to produce spurious wakeups, which I had not handled properly. This caused some confusion because a vanilla SBCL compiled locally did not generate these wakeups.) Perhaps not seeing the hang after a minute means a relaunch is needed.
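
        For reference, a sketch of the standard way to stay correct in the presence of spurious wakeups: recheck the predicate in a loop while holding the mutex. This is the POSIX C counterpart of the `loop until flag` form in the first test case, not code from SBCL or ECL; `struct handoff`, `setter` and `run_handoff` are names made up for the example.

        ```c
        #include <assert.h>
        #include <pthread.h>

        struct handoff {
            pthread_mutex_t lock;
            pthread_cond_t  cvar;
            int             flag;   /* the predicate, protected by lock */
        };

        /* Worker: set the flag and signal, both under the lock. */
        static void *setter(void *arg)
        {
            struct handoff *h = arg;

            pthread_mutex_lock(&h->lock);
            h->flag = 1;
            pthread_cond_signal(&h->cvar);
            pthread_mutex_unlock(&h->lock);
            return NULL;
        }

        /* One round of the handoff; returns the flag value seen after waiting. */
        static int run_handoff(void)
        {
            struct handoff h;
            pthread_t t;
            int rc, seen;

            rc = pthread_mutex_init(&h.lock, NULL);
            assert(rc == 0);
            rc = pthread_cond_init(&h.cvar, NULL);
            assert(rc == 0);
            h.flag = 0;

            rc = pthread_create(&t, NULL, setter, &h);
            assert(rc == 0);

            pthread_mutex_lock(&h.lock);
            /* A wakeup alone proves nothing, so test the flag each time.
               If the setter already ran, the loop body is never entered. */
            while (!h.flag)
                pthread_cond_wait(&h.cvar, &h.lock);
            seen = h.flag;
            pthread_mutex_unlock(&h.lock);

            rc = pthread_join(t, NULL);
            assert(rc == 0);
            pthread_cond_destroy(&h.cvar);
            pthread_mutex_destroy(&h.lock);
            return seen;
        }
        ```

        Both test cases in this report already wait in such a loop, so spurious wakeups by themselves should not make them hang.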

         
        • Thanks again for your patience. I have uploaded another set of changes to git, one of which might be responsible for the lost wakeup signals (ecl_wakeup_process had a test outside a spinlock instead of inside it). More intensive testing, with processes running for half an hour, revealed no hangs on OS X Lion, Leopard, or Linux.
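
          To illustrate the general class of bug (a test made outside the lock that guards the state it tests), here is a small C sketch. It is an assumption-laden stand-in, not the body of ecl_wakeup_process: `struct proc_state`, `proc_mark_waiting` and `proc_wakeup` are hypothetical, and a pthread mutex takes the place of the spinlock.

          ```c
          #include <assert.h>
          #include <pthread.h>

          /* Hypothetical per-process state; names are illustrative only. */
          struct proc_state {
              pthread_mutex_t lock;   /* stands in for the spinlock */
              int waiting;            /* 1 while the process is parked */
              int wakeups;            /* number of wakeups delivered */
          };

          static void proc_init(struct proc_state *p)
          {
              int rc = pthread_mutex_init(&p->lock, NULL);
              assert(rc == 0);
              (void)rc;
              p->waiting = 0;
              p->wakeups = 0;
          }

          /* The process announces, under the lock, that it is about to park. */
          static void proc_mark_waiting(struct proc_state *p)
          {
              pthread_mutex_lock(&p->lock);
              p->waiting = 1;
              pthread_mutex_unlock(&p->lock);
          }

          /* Test and clear the waiting flag inside the critical section.
             Returns 1 if a wakeup was delivered, 0 if nobody was waiting. */
          static int proc_wakeup(struct proc_state *p)
          {
              int delivered = 0;

              pthread_mutex_lock(&p->lock);
              if (p->waiting) {       /* the test happens with the lock held */
                  p->waiting = 0;
                  p->wakeups++;
                  delivered = 1;
              }
              pthread_mutex_unlock(&p->lock);
              return delivered;
          }
          ```

          If the `if (p->waiting)` test were moved above `pthread_mutex_lock`, the flag could change between the test and the update, so a waker could act on a stale value: either skipping a process that is just about to park, or delivering the same wakeup twice.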

           

          • Anonymous
            2012-07-30

            Sorry, it still hangs for me with latest (0bc0dc). Iteration count stats:

            16 threads: 4744, 651, 9127

            128 threads: 844, 486, 3889

            200 threads: 25, 421, 603

            This is a Core i7 running 32-bit Linux (details in the OP).

            As I mentioned above, my only guess about the difference between our systems is spurious wakeups -- my example being the difference between Ubuntu's SBCL (sometimes spurious wakeups) and a hand-compiled SBCL (no spurious wakeups).

            ldd ecl
            linux-gate.so.1 => (0xb77a5000)
            libecl.so.12.7 => /home/jlawrence/usr/stow/ecl-dev-test/lib/libecl.so.12.7 (0xb7513000)
            libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb734e000)
            libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb7332000)
            libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xb732d000)
            libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7301000)
            libgmp.so.10 => /usr/lib/i386-linux-gnu/libgmp.so.10 (0xb7282000)
            libffi.so.6 => /usr/lib/i386-linux-gnu/libffi.so.6 (0xb727b000)
            libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb725c000)
            /lib/ld-linux.so.2 (0xb77a6000)

             

  • Anonymous
    2012-07-22

    For the second test I noticed that after it hangs, interrupting and selecting the continue restart will get it going again, until the next hang.

     
    • Yes, every time your tests failed, it was because one process was waiting for a notification that a lock had been released. A number of very subtle issues made the wait queue fail when a large number of threads competed for locks. Most of them seem to have been solved, but I would like to be sure, and then try to improve on the current design.

       
  • Just for the record, I attach the equivalent C program using the POSIX libraries. It does not hang on my computer, but it progressively slows down until it becomes unusable and hogs the system (OS X Leopard).

     

    • Anonymous
      2012-08-05

      I did not notice this comment until now because of the new page splitting for comments.

      The purpose of this C program isn't clear to me -- is it to see what happens with zombie threads? On my machine it quickly fails on the pthread_create assertion. pthread_join isn't being called, whose manpage says:

         "Failure to join with a thread that is joinable (i.e., one
         that is not detached), produces a "zombie thread". Avoid
         doing this, since each zombie thread consumes some system
         resources, and when enough zombie threads have accumulated,
         it will no longer be possible to create new threads (or
         processes)."
      

      When I change the program to call pthread_join, it runs forever on my machine.
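
      The attached program is not reproduced in this thread, so the following is only a sketch of the kind of change meant here: every thread that is created is also joined, which releases its resources before the next iteration. `worker` and `spawn_and_join` are names invented for the example.

      ```c
      #include <assert.h>
      #include <pthread.h>

      /* Trivial thread body: bump a counter owned by the spawning thread.
         This is safe without a lock only because each thread is joined
         before the next one is created. */
      static void *worker(void *arg)
      {
          int *counter = arg;

          ++*counter;
          return NULL;
      }

      /* Create and join `iterations` threads one after another.
         Returns how many thread bodies ran to completion. */
      static int spawn_and_join(int iterations)
      {
          int completed = 0;

          for (int i = 0; i < iterations; i++) {
              pthread_t t;
              int rc = pthread_create(&t, NULL, worker, &completed);
              assert(rc == 0);

              /* Without this join, each finished thread would linger as a
                 zombie and pthread_create would eventually start failing. */
              rc = pthread_join(t, NULL);
              assert(rc == 0);
          }
          return completed;
      }
      ```

      Calling `pthread_detach` on each thread would be the alternative when the result of the thread is not needed.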

       

  • Anonymous
    2012-08-06

    I obtained a hang with the "sema" test on an old Mac 10.5, 64-bit.

    Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009

    i686-apple-darwin9-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5577)

     

  • Anonymous
    2012-09-21

    Note that I'm not the bug report originator.

    I noticed that the intermittent lock contention issue I had been experiencing,
    where CPUs would spin busily waiting for locks, is less of a problem than it was.
    This could be related to recent changes to the queue code in relation to this
    bug.

    Occasionally, although rarely, I experienced a lockup where all threads/CPUs
    would remain idle, possibly a kind of deadlock, where nothing is
    responsive anymore (including the REPL) and ECL must be killed.
    However, this happened rarely enough that I have not been able to diagnose it yet.

    Thanks,
    Matthew Mondor

     
    • Description has changed:

    • milestone: --> Stable_release
     
  • It seems that the fixes to the interrupt system (https://sourceforge.net/p/ecls/bugs/216/) fix this problem. I have been running the tests for one hour on one computer, with 128 and 200 threads, and they show no hangs. Could you confirm?

     
  • lmj
    2012-12-15

    The original test no longer hangs for me, but the subsequent test with sema still hangs eventually. The probability of hang per iteration seems proportional to the number of threads. I was testing with 128 threads.

    Here is a variant which fails sooner:

    (defstruct sema
      (count 0)
      (lock (mp:make-lock :recursive nil))
      (cvar (mp:make-condition-variable)))
    
    (defun inc-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (incf (sema-count sema))
        (mp:condition-variable-signal (sema-cvar sema))))
    
    (defun dec-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (loop (cond ((plusp (sema-count sema))
                     (decf (sema-count sema))
                     (return))
                    (t
                     (mp:condition-variable-wait
                      (sema-cvar sema) (sema-lock sema)))))))
    
    (defun test (message-count thread-count)
      (let ((to-workers (make-sema))
            (from-workers (make-sema)))
        (loop :repeat thread-count :do
           (mp:process-run-function
            "test"
            (lambda ()
              (loop
                 (dec-sema to-workers)
                 (inc-sema from-workers)))))
        (loop
           (loop :repeat message-count :do
              (inc-sema to-workers))
           (loop :repeat message-count :do
              (dec-sema from-workers))
           (assert (zerop (sema-count to-workers)))
           (assert (zerop (sema-count from-workers)))
           (format t ".")
           (finish-output))))
    
    (defun run ()
      (test 10000 64))
    

    RUN eventually fails with

    Attempted to recursively lock #<lock (nonrecursive) 0a1734b0> which is already owned by #<process "test">
    

    Backtrace is empty. Sometimes this is displayed while exiting ECL after the error:

    ;;; Detected access to protected memory, also kwown as 'bus or segmentation fault'.
    ;;; Jumping to the outermost toplevel
    

    This is with latest from git, ecl-12.12.1-13459a98-linux-x86

     
  • lmj
    2012-12-15

    This hangs rather quickly for me. I haven't seen it produce a recursive lock error as with the homemade semaphore.

    (defun test (message-count thread-count)
      (let ((to-workers (mp:make-semaphore))
            (from-workers (mp:make-semaphore)))
        (loop :repeat thread-count :do
           (mp:process-run-function
            "test"
            (lambda ()
              (loop
                 (mp:wait-on-semaphore to-workers)
                 (mp:signal-semaphore from-workers)))))
        (loop
           (loop :repeat message-count :do
              (mp:signal-semaphore to-workers))
           (loop :repeat message-count :do
              (mp:wait-on-semaphore from-workers))
           (assert (zerop (mp:semaphore-count to-workers)))
           (assert (zerop (mp:semaphore-count from-workers)))
           (format t ".")
           (finish-output))))
    
    (defun run ()
      (test 10000 64))
    
     
    Last edit: lmj 2012-12-15

