From: Boris P. <bpr...@ho...> - 2011-02-02 04:13:41
|
So, again, not sure is this is the right thread for discussing the code, apologies in advance. Anyway, a correction is in order: the previously mentioned fix does not address the underlying problem. The underlying problem seems to be as follows. When 'fusermount -u' is performed, the fuse_chan_recv() in a worker thread invokes fuse_exist_session() and returns zero. The worker thread that gets 0 out of fuse_chan_recv() in fuse_do_work(), promptly sets its cancellation status to ENABLE and posts the semaphore. The master thread gets unblocked from sem_wait(), checks the session exit status (it is "exited"), and moves on to posting cancellation requests to the worker threads. Whether the workers have already (re)enabled cancellation or not. That's how the master thread hangs in pthread_join() while the worker thread that was supposed to be cancelled sits in pause() (at the time the cancellation request is sent, the thread has cancellation disabled). So, rather than figuring out how to fix the race (the master would have to know when _all_ the workers have cancellation enabled and are waiting in pause()), it seems much simpler to not hang in pause() and return from the worker thread instead. Is there any specific reason why the workers should wait to be cancelled in pause( ) ? Could they simply terminate (return) thereby enabling join by the master ? The cancellation requests could still be sent just in case. Best regards, Boris. P.S. I tried the above approach, seems to work every time (unlike the earlier fix :)) |
From: Miklos S. <mi...@sz...> - 2011-02-02 10:45:43
|
On Tue, 1 Feb 2011, Boris Protopopov wrote: > So, again, not sure is this is the right thread for discussing the > code, apologies in advance. Sure, this is exactly the right place. Thanks for reporting the issue! > Anyway, a correction is in order: the previously mentioned fix does > not address the underlying problem. > > The underlying problem seems to be as follows. When 'fusermount -u' > is performed, the fuse_chan_recv() in a worker thread invokes > fuse_exist_session() and returns zero. The worker thread that gets 0 > out of fuse_chan_recv() in fuse_do_work(), promptly sets its > cancellation status to ENABLE and posts the semaphore. The master > thread gets unblocked from sem_wait(), checks the session exit > status (it is "exited"), and moves on to posting cancellation > requests to the worker threads. Whether the workers have already > (re)enabled cancellation or not. That's how the master thread hangs > in pthread_join() while the worker thread that was supposed to be > cancelled sits in pause() (at the time the cancellation request is > sent, the thread has cancellation disabled). > > So, rather than figuring out how to fix the race (the master would > have to know when _all_ the workers have cancellation enabled and > are waiting in pause()), it seems much simpler to not hang in > pause() and return from the worker thread instead. Is there any > specific reason why the workers should wait to be cancelled in > pause( ) ? Could they simply terminate (return) thereby enabling > join by the master ? The cancellation requests could still be sent > just in case. The history of that pause() goes back to this ChangeLog entry in 2005: * lib: don't exit threads, so cancelation doesn't cause segfault But segfaulting on cancelation looks like a bug in the pthread implementation, and the above change was probably a workaround for this pthread bug. pthread_cancel() should work whether the target has exited or not. So, as you suggest, I think it's safe to remove that pause() call, which would solve the races. Thanks, Miklos |
From: Miklos S. <mi...@sz...> - 2011-02-02 11:27:05
|
On Wed, 02 Feb 2011, Miklos Szeredi wrote: > The history of that pause() goes back to this ChangeLog entry in 2005: > > * lib: don't exit threads, so cancelation doesn't cause segfault > > But segfaulting on cancelation looks like a bug in the pthread > implementation, and the above change was probably a workaround for > this pthread bug. pthread_cancel() should work whether the target has > exited or not. Actually, it was not a pthread bug. Back then the worker threads were detached, which means after the thread exited its thread ID is no longer valid and pthread_cancel() may return ESRCH or its behavior may be undefined. The later being the case for that particular pthread implementation. > So, as you suggest, I think it's safe to remove that pause() call, > which would solve the races. The current implementation doesn't detach the threads (so it can join them) so this is not a problem. Fix committed and pushed out. Thanks, Miklos |