#192 Hang with condition variable

Milestone: Stable_release
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-08-19
Created: 2012-06-30
Creator: Anonymous
Private: No

(defun test ()
  (let ((lock (mp:make-lock))
        (cvar (mp:make-condition-variable))
        (flag nil))
    (mp:process-run-function
     "test" (lambda ()
              (mp:with-lock (lock)
                (setf flag t)
                (mp:condition-variable-signal cvar))))
    (mp:with-lock (lock)
      (loop until flag do (mp:condition-variable-wait cvar lock)))))

(defun run ()
  (loop
     (test)
     (format t ".")
     (finish-output)))

RUN eventually hangs with latest git 1d3355d, but not with 12.2.1.

Linux xi 3.2.0-24-generic-pae #39-Ubuntu SMP Mon May 21 18:54:21 UTC 2012 i686 i686 i386 GNU/Linux

gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)

Discussion

(Page 1 of 2)
  • The previous implementation was based on wrong assumptions. I have uploaded a new one. Please verify that it works for you; unless I hear otherwise, I will close the bug.

     
  • Please disregard the previous comment: the fix was wrong, because it woke not only the threads that were already waiting but also those that were just arriving. I will keep working on this.

     

  • Anonymous
    2012-07-11

    In case this might be helpful, the following RUN hangs with and without the latest changes.

    (defstruct sema
      (count 0)
      (lock (mp:make-lock))
      (cvar (mp:make-condition-variable)))
    
    (defun inc-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (incf (sema-count sema))
        (mp:condition-variable-signal (sema-cvar sema))))
    
    (defun dec-sema (sema)
      (mp:with-lock ((sema-lock sema))
        (loop (cond ((plusp (sema-count sema))
                     (decf (sema-count sema))
                     (return))
                    (t
                     (mp:condition-variable-wait
                      (sema-cvar sema) (sema-lock sema)))))))
    
    (defun test (thread-count)
      (let ((from-threads (make-sema))
            (to-threads   (make-sema)))
        (loop repeat thread-count do
             (mp:process-run-function
              "test" (lambda ()
                       (dec-sema to-threads)
                       (inc-sema from-threads))))
        (loop repeat thread-count do (inc-sema to-threads))
        (loop repeat thread-count do (dec-sema from-threads))))
    
    (defun run ()
      (loop
         (test 16)
         (format t ".")
         (finish-output)))
    
     
    Last edit: Juan Jose Garcia Ripoll 2012-12-05
  • Thanks a lot for the test cases. I believe I have found the problem now: the condition variable signaling operation must not only wake the process but also immediately remove it from the wait queue.

    Both your programs run without hanging, but I will wait for your feedback to close the bug report.
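
To make the "wake and dequeue in one step" idea concrete, here is a minimal sketch in Python rather than in ECL's own code. The class `QueueCondition` and its fields are hypothetical names invented for illustration; this is not how ECL implements `mp:condition-variable-signal`. The point it tries to show is that `signal` removes the waiter from the queue in the same critical section in which it picks it, so a second signal cannot select a thread that has already been woken.

```python
import threading
from collections import deque


class QueueCondition:
    """Toy condition variable with an explicit wait queue (illustration only)."""

    def __init__(self):
        self._queue_lock = threading.Lock()  # protects the wait queue
        self._waiters = deque()              # one Event per blocked thread

    def wait(self, user_lock):
        """Block until signalled. The caller must hold user_lock."""
        ticket = threading.Event()
        with self._queue_lock:
            # Enqueue while user_lock is still held, so a signaller that
            # takes user_lock afterwards is guaranteed to see this waiter.
            self._waiters.append(ticket)
        user_lock.release()
        try:
            ticket.wait()
        finally:
            user_lock.acquire()

    def signal(self):
        """Wake one waiter, if any. Returns True when a waiter was woken."""
        with self._queue_lock:
            if not self._waiters:
                return False
            # Dequeue in the same critical section as the selection, so the
            # next signal picks a different waiter.
            ticket = self._waiters.popleft()
        ticket.set()
        return True
```

As with any condition variable, callers are still expected to re-check their predicate in a loop after `wait` returns, as both test cases in this report do.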

     

  • Anonymous
    2012-07-14

    The second test case still hangs for me eventually, though it now takes longer to do so -- within 3000 iterations. Changing thread-count to 128 causes it to hang within 100 iterations.

    If it's any consolation, LispWorks 6.0 failed the same test with similar symptoms.

     
    • I have uploaded another fix, this time to the wait queue: some signaling events were being missed between the point at which interrupts are disabled and the point at which the thread enters the wait queue. I see no hangs on either Linux or OS X (Snow Leopard and Lion). Could you check?

       

      • Anonymous
        2012-07-28

        I regret to report that the latest in git (7638ca) still hangs for me. I did two sets of three trials.

        Iterations until hang for 16 threads: 1626, 6228, 1505

        Iterations until hang for 128 threads: 185, 561, 129

        While running inside a terminal (as opposed to SLIME), when I interrupt after a hang and then hit Ctrl-D to exit, segfault messages are printed as it exits. This isn't new; I just tried a recent earlier build (e1fdb8) and saw the same.

        One time I had a segfault instead of a hang, and in fact I still have this session running, so feel free to ask me to print values or whatever. The backtrace is empty.

        I made the following comment on a CLISP bug report for the same test case, which may or may not be relevant:

        On SBCL I once had a condition variable problem which was either wholly present or wholly absent depending upon some dice roll at launch time. (The SBCL bundled with Ubuntu sometimes decided to produce spurious wakeups, which I had not handled properly. This caused some confusion because a vanilla SBCL compiled locally did not generate these wakeups.) Perhaps not seeing the hang after a minute means a relaunch is needed.
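
For cross-checking on another runtime, the semaphore test case can be ported roughly as follows. This is a sketch in Python using the standard `threading.Condition`, not ECL code; the names `Sema` and `run_round` are made up for the port. The `while` loop in `dec` re-checks the count after every wakeup, which is what makes spurious wakeups harmless in this pattern.

```python
import threading


class Sema:
    """Counting semaphore built from a lock plus a condition variable."""

    def __init__(self):
        self.count = 0
        self.cond = threading.Condition()  # bundles the lock and the cvar

    def inc(self):
        with self.cond:
            self.count += 1
            self.cond.notify()

    def dec(self):
        with self.cond:
            # Re-test the predicate after every wakeup, spurious or not.
            while self.count <= 0:
                self.cond.wait()
            self.count -= 1


def run_round(thread_count):
    """One iteration of the test: hand a token to each worker and back."""
    from_threads, to_threads = Sema(), Sema()

    def worker():
        to_threads.dec()
        from_threads.inc()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for _ in range(thread_count):
        to_threads.inc()
    for _ in range(thread_count):
        from_threads.dec()
    for t in threads:
        t.join()
    # Both counters should be back at zero once every worker has finished.
    return from_threads.count, to_threads.count
```

Calling `run_round(16)` in a loop mirrors `RUN` above; a round that never returns would correspond to the hang described in this report.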

         
        • Thanks again for your patience. I have uploaded another set of changes to git, one of which may address the cause of the lost wakeup signals (ecl_wakeup_process had a test outside a spinlock instead of inside it). More intensive testing, with the processes running for half an hour, revealed no hangs on OS X Lion, Leopard, or Linux.
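
The general pattern behind "test inside the spinlock" can be sketched as follows. This is Python, not ECL's C code, and `Sleeper`, `park`, and `wakeup` are hypothetical names; nothing here is taken from the actual body of `ecl_wakeup_process`. The sketch only illustrates why the state test and the action that depends on it belong in one critical section: if the test ran before the guard was taken, the state could change between the test and the wakeup.

```python
import threading


class Sleeper:
    """Hypothetical stand-in for a process record (not ECL's structure)."""

    def __init__(self):
        self.guard = threading.Lock()   # plays the role of the spinlock
        self.waiting = False            # True while the owner is parked
        self.wake_event = threading.Event()

    def park(self, timeout=None):
        """Mark this sleeper as waiting, then block until it is woken."""
        with self.guard:
            self.waiting = True
            self.wake_event.clear()
        # If a wakeup lands between releasing the guard and this call,
        # the event is already set and wait() returns immediately.
        return self.wake_event.wait(timeout)


def wakeup(sleeper):
    """Wake the sleeper only if it is currently marked as waiting."""
    with sleeper.guard:
        # The test runs while the guard is held, so the state cannot
        # change between the test and the action below.
        if not sleeper.waiting:
            return False
        sleeper.waiting = False
        sleeper.wake_event.set()
        return True
```

A racy variant would read `sleeper.waiting` before entering the `with` block; that variant is deliberately not shown as working code.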

           

          • Anonymous
            2012-07-30

            Sorry, it still hangs for me with latest (0bc0dc). Iteration count stats:

            16 threads: 4744, 651, 9127

            128 threads: 844, 486, 3889

            200 threads: 25, 421, 603

            This is a Core i7 running 32-bit Linux (details in the OP).

            As I mentioned above, my only guess about the difference between our systems is spurious wakeups -- my example being the difference between Ubuntu's SBCL (sometimes spurious wakeups) and a hand-compiled SBCL (no spurious wakeups).

            ldd ecl
            linux-gate.so.1 => (0xb77a5000)
            libecl.so.12.7 => /home/jlawrence/usr/stow/ecl-dev-test/lib/libecl.so.12.7 (0xb7513000)
            libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb734e000)
            libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb7332000)
            libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xb732d000)
            libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7301000)
            libgmp.so.10 => /usr/lib/i386-linux-gnu/libgmp.so.10 (0xb7282000)
            libffi.so.6 => /usr/lib/i386-linux-gnu/libffi.so.6 (0xb727b000)
            libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb725c000)
            /lib/ld-linux.so.2 (0xb77a6000)

             

  • Anonymous
    2012-07-22

    For the second test I noticed that after it hangs, interrupting and selecting the continue restart will get it going again, until the next hang.

     