#192 Hang with condition variable

HEAD
accepted
nobody
5
2014-09-23
2012-06-30
Anonymous
No
(defun test ()
  (let ((lock (mp:make-lock))
        (cvar (mp:make-condition-variable))
        (flag nil))
    (mp:process-run-function
     "test" (lambda ()
              (mp:with-lock (lock)
                (setf flag t)
                (mp:condition-variable-signal cvar))))
    (mp:with-lock (lock)
      (loop until flag do (mp:condition-variable-wait cvar lock)))))

(defun run ()
  (loop
     (test)
     (format t ".")
     (finish-output)))

RUN eventually hangs with latest git 1d3355d, but not with 12.2.1.

Linux xi 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686 i386 GNU/Linux

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Discussion

(Page 2 of 2)
  • Yes, every failure in your tests happened because one process was waiting for a notification that a lock had been released. Several very subtle issues made the wait queue fail when a large number of threads competed for locks. Most of them appear to be solved, but I would like to be sure, and then try to improve on the current design.

     
  • Just for the record, I attach the equivalent C program using the POSIX libraries. It does not hang on my computer, but it progressively slows down until it becomes unusable and hogs the system (OS X Leopard).

     

    • Anonymous
      2012-08-05

      I did not notice this comment until now because of the new page splitting for comments.

      The purpose of this C program isn't clear to me. Is it to see what happens with zombie threads? On my machine it quickly fails on the pthread_create assertion. The program never calls pthread_join, whose manpage says:

         "Failure  to  join with a thread that is joinable (i.e., one
         that is not detached), produces a "zombie  thread".   Avoid
         doing  this,  since each zombie thread consumes some system
         resources, and when enough zombie threads have accumulated,
         it  will  no  longer  be possible to create new threads (or
         processes)."
      

      When I change the program to call pthread_join, it runs forever on my machine.
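
A minimal sketch of the pthread_join variant described above, assuming the attached program mirrors TEST and RUN from the report. The attachment itself is not reproduced here, so `struct handshake`, `signal_flag`, `test_once`, and `run_handshakes` are illustrative names rather than the original code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Shared state for one handshake (hypothetical names, not from the attachment). */
struct handshake {
    pthread_mutex_t lock;
    pthread_cond_t cvar;
    bool flag;
};

/* Worker: set the flag under the lock, then signal the waiter. */
static void *signal_flag(void *arg)
{
    struct handshake *h = arg;
    pthread_mutex_lock(&h->lock);
    h->flag = true;
    pthread_cond_signal(&h->cvar);
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

/* One iteration, mirroring TEST: spawn a worker and wait for the flag. */
static int test_once(void)
{
    struct handshake h;
    pthread_t worker;
    int rc;

    pthread_mutex_init(&h.lock, NULL);
    pthread_cond_init(&h.cvar, NULL);
    h.flag = false;

    rc = pthread_create(&worker, NULL, signal_flag, &h);
    assert(rc == 0);

    pthread_mutex_lock(&h.lock);
    while (!h.flag)                      /* loop guards against spurious wakeups */
        pthread_cond_wait(&h.cvar, &h.lock);
    pthread_mutex_unlock(&h.lock);

    /* Joining reclaims the thread, so no zombie threads accumulate; it also
       guarantees the worker is done with h before h is destroyed. */
    rc = pthread_join(worker, NULL);
    assert(rc == 0);
    (void)rc;

    pthread_cond_destroy(&h.cvar);
    pthread_mutex_destroy(&h.lock);
    return 0;
}

/* Mirrors RUN, but bounded: returns how many iterations completed. */
int run_handshakes(int iterations)
{
    int completed = 0;
    for (int i = 0; i < iterations; i++)
        if (test_once() == 0)
            completed++;
    return completed;
}
```

Calling `run_handshakes` with a large iteration count from `main` (built with `-pthread`) approximates the endless loop in RUN while keeping at most one worker thread alive at a time.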

       

  • Anonymous
    2012-08-06

    I obtained a hang with the "sema" test on an old Mac 10.5, 64-bit.

    Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009

    i686-apple-darwin9-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5577)

     

  • Anonymous
    2012-09-21

    Note that I'm not the bug report originator.

    I noticed that the intermittent lock contention issue I had been experiencing,
    where CPUs would spin busily waiting for locks, is less of a problem than it was.
    This could be related to the recent changes to the queue code made for this
    bug.

    Occasionally, although rarely, I experienced a lockup in which all threads/CPUs
    remained idle, possibly a kind of deadlock, with nothing responsive
    anymore (including the REPL), so that ECL had to be killed.
    However, this happened rarely enough that I have not been able to diagnose it yet.

    Thanks,
    Matthew Mondor

     
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,20 +1,20 @@
    -(defun test ()
    -  (let ((lock (mp:make-lock))
    -        (cvar (mp:make-condition-variable))
    -        (flag nil))
    -    (mp:process-run-function
    -     "test" (lambda ()
    -              (mp:with-lock (lock)
    -                (setf flag t)
    -                (mp:condition-variable-signal cvar))))
    -    (mp:with-lock (lock)
    -      (loop until flag do (mp:condition-variable-wait cvar lock)))))
    -
    -(defun run ()
    -  (loop
    -     (test)
    -     (format t ".")
    -     (finish-output)))
    +    (defun test ()
    +      (let ((lock (mp:make-lock))
    +            (cvar (mp:make-condition-variable))
    +            (flag nil))
    +        (mp:process-run-function
    +         "test" (lambda ()
    +                  (mp:with-lock (lock)
    +                    (setf flag t)
    +                    (mp:condition-variable-signal cvar))))
    +        (mp:with-lock (lock)
    +          (loop until flag do (mp:condition-variable-wait cvar lock)))))
    +    
    +    (defun run ()
    +      (loop
    +         (test)
    +         (format t ".")
    +         (finish-output)))
    
     RUN eventually hangs with latest git 1d3355d, but not with 12.2.1.
    
    • milestone: --> Stable_release
     
  • It seems that the fixes to the interrupt system (https://sourceforge.net/p/ecls/bugs/216/) resolve this problem. I have been running the tests for one hour on one computer (with 128 and 200 threads) and they pass. Could you confirm?

     
    lmj
    2012-12-15

    The original test no longer hangs for me, but the subsequent test with sema still hangs eventually. The probability of a hang per iteration seems proportional to the number of threads. I was testing with 128 threads.

    Here is a variant which fails sooner:

    (defstruct sema
      (count 0)
      (lock (mp:make-lock :recursive nil))
      (cvar (mp:make-condition-variable)))
    
    (defun inc-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (incf (sema-count sema))
        (mp:condition-variable-signal (sema-cvar sema))))
    
    (defun dec-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (loop (cond ((plusp (sema-count sema))
                     (decf (sema-count sema))
                     (return))
                    (t
                     (mp:condition-variable-wait
                      (sema-cvar sema) (sema-lock sema)))))))
    
    (defun test (message-count thread-count)
      (let ((to-workers (make-sema))
            (from-workers (make-sema)))
        (loop :repeat thread-count :do
           (mp:process-run-function
            "test"
            (lambda ()
              (loop
                 (dec-sema to-workers)
                 (inc-sema from-workers)))))
        (loop
           (loop :repeat message-count :do
              (inc-sema to-workers))
           (loop :repeat message-count :do
              (dec-sema from-workers))
           (assert (zerop (sema-count to-workers)))
           (assert (zerop (sema-count from-workers)))
           (format t ".")
           (finish-output))))
    
    (defun run ()
      (test 10000 64))
    

    RUN eventually fails with

    Attempted to recursively lock #<lock (nonrecursive) 0a1734b0> which is already owned by #<process "test">
    

    The backtrace is empty. Sometimes the following is displayed while exiting ECL after the error:

    ;;; Detected access to protected memory, also kwown as 'bus or segmentation fault'.
    ;;; Jumping to the outermost toplevel
    

    This is with latest from git, ecl-12.12.1-13459a98-linux-x86
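
For comparison with the POSIX primitives discussed earlier in the thread, the homemade semaphore test above might be sketched in C as follows. This is an illustration under stated assumptions, not code from the ticket: the `stop` flag, `sema_count`, `MAX_THREADS`, and the bounded `rounds` parameter are additions so the sketch can terminate, whereas the Lisp version loops forever.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_THREADS = 256 };   /* arbitrary cap for this sketch */

/* Counting semaphore built from a mutex and a condition variable. */
struct sema {
    int count;
    pthread_mutex_t lock;
    pthread_cond_t cvar;
};

static void sema_init(struct sema *s)
{
    s->count = 0;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->cvar, NULL);
}

static void inc_sema(struct sema *s)
{
    pthread_mutex_lock(&s->lock);
    s->count++;
    pthread_cond_signal(&s->cvar);
    pthread_mutex_unlock(&s->lock);
}

static void dec_sema(struct sema *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->count <= 0)               /* re-check after every wakeup */
        pthread_cond_wait(&s->cvar, &s->lock);
    s->count--;
    pthread_mutex_unlock(&s->lock);
}

static int sema_count(struct sema *s)
{
    pthread_mutex_lock(&s->lock);
    int n = s->count;
    pthread_mutex_unlock(&s->lock);
    return n;
}

struct channels {
    struct sema to_workers;
    struct sema from_workers;
    atomic_bool stop;                   /* shutdown flag, not in the Lisp test */
};

/* Worker: take one message, then acknowledge it, until told to stop. */
static void *worker(void *arg)
{
    struct channels *c = arg;
    for (;;) {
        dec_sema(&c->to_workers);
        if (atomic_load(&c->stop))
            return NULL;
        inc_sema(&c->from_workers);
    }
}

/* Returns the number of rounds after which both counts were zero. */
int run_sema_rounds(int rounds, int message_count, int thread_count)
{
    struct channels c;
    pthread_t threads[MAX_THREADS];
    int clean_rounds = 0;

    assert(thread_count > 0 && thread_count <= MAX_THREADS);
    sema_init(&c.to_workers);
    sema_init(&c.from_workers);
    atomic_init(&c.stop, false);

    for (int i = 0; i < thread_count; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, &c);
        assert(rc == 0);
        (void)rc;
    }
    for (int r = 0; r < rounds; r++) {
        for (int m = 0; m < message_count; m++)
            inc_sema(&c.to_workers);
        for (int m = 0; m < message_count; m++)
            dec_sema(&c.from_workers);
        if (sema_count(&c.to_workers) == 0 && sema_count(&c.from_workers) == 0)
            clean_rounds++;
    }
    /* Wake every worker exactly once so each sees the stop flag and exits. */
    atomic_store(&c.stop, true);
    for (int i = 0; i < thread_count; i++)
        inc_sema(&c.to_workers);
    for (int i = 0; i < thread_count; i++)
        pthread_join(threads[i], NULL);
    return clean_rounds;
}
```

A call such as `run_sema_rounds(100, 10000, 64)` corresponds roughly to `(test 10000 64)` limited to 100 rounds. A hang or a short count here would point at the platform's threading primitives rather than at ECL's wait queue.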

     
    lmj
    2012-12-15

    The following version, which uses the mp: semaphore functions instead, hangs rather quickly for me. Unlike the homemade semaphore, I haven't seen it produce a recursive lock error.

    (defun test (message-count thread-count)
      (let ((to-workers (mp:make-semaphore))
            (from-workers (mp:make-semaphore)))
        (loop :repeat thread-count :do
           (mp:process-run-function
            "test"
            (lambda ()
              (loop
                 (mp:wait-on-semaphore to-workers)
                 (mp:signal-semaphore from-workers)))))
        (loop
           (loop :repeat message-count :do
              (mp:signal-semaphore to-workers))
           (loop :repeat message-count :do
              (mp:wait-on-semaphore from-workers))
           (assert (zerop (mp:semaphore-count to-workers)))
           (assert (zerop (mp:semaphore-count from-workers)))
           (format t ".")
           (finish-output))))
    
    (defun run ()
      (test 10000 64))
    
     
    Last edit: lmj 2012-12-15

