|
From: David F. <fa...@kd...> - 2012-11-05 17:57:15
|
The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp
(from https://bugs.kde.org/show_bug.cgi?id=243232)
shows that an optimization inside Qt leads to a helgrind warning about wrong
lock ordering, making the use of that feature impossible for detecting actual
problems elsewhere (i.e. I have to use --track-lockorders=no all the time).

Technically, if we ignore the distinction between lock and tryLock, helgrind is
right: we did lock the mutexes in reverse order. But because it's a tryLock,
it couldn't possibly deadlock.

Should helgrind simply ignore all pthread_mutex_trylock calls, for the
lockorder detection, even if they succeed? I think so, actually (by definition
they couldn't deadlock, which is what track-lockorders is all about).
But if you think it should only be ignored in this particular case, are there
magic macros in helgrind.h that could be used to annotate this tryLock() call?

PS: see my previous email about VALGRIND_HG_ENABLE_CHECKING not working. Is
this known? Should I report a bug?

--
David Faure, fa...@kd..., http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5
|
|
From: Philippe W. <phi...@sk...> - 2012-11-05 22:18:57
|
On Mon, 2012-11-05 at 18:59 +0100, David Faure wrote:
> The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp
> (from https://bugs.kde.org/show_bug.cgi?id=243232)
> shows that an optimization inside Qt leads to a helgrind warning about wrong
> lock ordering, making the use of that feature impossible for detecting actual
> problems elsewhere (i.e. I have to use --track-lockorders=no all the time).
>
> Technically if we ignore the distinction between lock and tryLock, helgrind is
> right, we did lock the mutexes in reverse order. But because it's a tryLock,
> it couldn't possibly deadlock.
>
> Should helgrind simply ignore all pthread_mutex_trylock calls, for the
> lockorder detection, even if they succeed? I think so, actually (by definition
> they couldn't deadlock, which is what track-lockorders is all about).

A deadlock can appear if there is a mixture of trylock and lock on the same
lock. So, trylock cannot just be ignored. E.g.

Thread 1: trylock mutex1
          lock mutex2

Thread 2: lock mutex2
          lock mutex1

might deadlock.

Even without a mixture, isn't the below slightly bizarre/dangerous?

Thread 1: trylock mutex1
          trylock mutex2

Thread 2: trylock mutex2
          trylock mutex1

If the 2nd trylock fails, what is the "plan B"? It seems that a task must
unlock all locks and restart from "scratch" in the above case.

I guess we might need an option such as:
  --trylock-logic={normal-lock|local-retry|full-retry}
    normal-lock = current behaviour
    local-retry means the task would re-trylock
    full-retry means that the plan B is to unlock all locks
      and retry everything.

Or maybe we would need the same info but with an annotation of the lock
operation and/or of the lock itself? I am quite sure the simple logic
"trylock can be ignored" is not ok for all cases.

> PS: see my previous email about VALGRIND_HG_ENABLE_CHECKING not working. Is
> this known? Should I report a bug?

Always good to report a bug if you have found a bug :).
Mail will be forgotten; bug entries are less likely to be lost.

Philippe
|
|
From: David F. <fa...@kd...> - 2012-11-06 12:41:26
|
On Monday 05 November 2012 23:19:42 Philippe Waroquiers wrote:
> On Mon, 2012-11-05 at 18:59 +0100, David Faure wrote:
> > The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp
> > (from https://bugs.kde.org/show_bug.cgi?id=243232)
> > shows that an optimization inside Qt leads to a helgrind warning about
> > wrong lock ordering, making the use of that feature impossible for
> > detecting actual problems elsewhere (i.e. I have to use
> > --track-lockorders=no all the time).
> >
> > Technically if we ignore the distinction between lock and tryLock,
> > helgrind is right, we did lock the mutexes in reverse order. But because
> > it's a tryLock, it couldn't possibly deadlock.
> >
> > Should helgrind simply ignore all pthread_mutex_trylock calls, for the
> > lockorder detection, even if they succeed? I think so, actually (by
> > definition they couldn't deadlock, which is what track-lockorders is all
> > about).
>
> A deadlock can appear if there is a mixture of trylock and lock on
> the same lock. So, trylock cannot just be ignored.
> E.g.
> Thread 1: trylock mutex1
>           lock mutex2
>
> Thread 2: lock mutex2
>           lock mutex1
> might deadlock.

True. This means that only "trylock in second place" should be ignored.
More on this below.

> Even without mixture, isn't the below slightly bizarre/dangerous ?
> Thread 1: trylock mutex1
>           trylock mutex2
>
> Thread 2: trylock mutex2
>           trylock mutex1

No deadlock can ever happen here.

> If the 2nd trylock fails, what is the "plan B" ?

If the program then accesses shared data, a race condition will happen and
will be detected by helgrind anyway. So ignoring the ordering of these
trylocks is OK, I would think. Of course helgrind must record "we got the
lock", for the race-condition detection feature, but it shouldn't warn about
the wrong order of the locking, since it can't possibly deadlock.

Not that the above would be good programming practice, of course, but helgrind
can't say anything about it if all the locks were acquired. It will warn in
another run, where some trylock fails and a race ensues.

> It seems that a task must unlock all locks and restart
> from "scratch" in the above case.

Yes, this is exactly what QOrderedMutexLocker does.

Thread 1: lock mutex1
          lock mutex2

Thread 2: lock mutex2
          trylock mutex1
          if that failed,
              unlock mutex2
              lock mutex1
              lock mutex2

If the trylock fails (because thread 1 was first), then it unlocks and
restarts from scratch. I can't see a deadlock risk with that, so ideally
helgrind shouldn't warn.

> I guess we might need an option such as:
>   --trylock-logic={normal-lock|local-retry|full-retry}
>     normal-lock = current behaviour
>     local-retry means the task would re-trylock
>     full-retry means that the plan B is to unlock all locks
>       and retry everything.

I don't see how this can be a global option. Some piece of code (like
QOrderedMutexLocker) might have the "full retry" logic above, but other pieces
of code might do something different - e.g. something wrong. It doesn't make
sense to me to tell helgrind "this is what all the code paths are going to do
about tryLock"; that's impossible to predict in a complex program.

Let me present another case:

Thread 1: lock mutex1
          lock mutex2

Thread 2: lock mutex2
          trylock mutex1
          if that fails, unlock mutex2 and give up

This could happen for a non-critical operation that can be canceled if it
can't be done immediately. Again, no deadlock possible, so helgrind shouldn't
warn about a successful trylock being "out of order". And yet this isn't a
"full retry", so I don't think --trylock-logic=full-retry is the solution.

Deadlock can only happen if both threads use normal locking as their second
operation. A trylock as the second operation doesn't deadlock.

> Or maybe we would need the same info but with an annotation
> of the lock operation and/or of the lock itself ?

Sounds like an annotation of the trylock operation, unless we agree on my next
statement:

> I am quite sure the simple logic "trylock can be ignored"
> is not ok for all cases.

Right. Let me refine that: "trylock can be ignored as the second operation,
i.e. helgrind shouldn't issue the out-of-order-locking warning at the precise
moment it's seeing a successful out-of-order trylock." If we can agree on
that, then there's no need for an annotation in the code.

> > PS: see my previous email about VALGRIND_HG_ENABLE_CHECKING not working.
> > Is this known? Should I report a bug?
>
> Always good to report a bug if you have found a bug :).
> Mail will be forgotten, bug entries are less likely to be lost.

OK, will do. On the other hand I'm getting more reaction about the tryLock
issue here than on bug 243232 :-). Anyway, thanks for your input, much
appreciated.

--
David Faure, fa...@kd..., http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5
|
|
From: Philippe W. <phi...@sk...> - 2012-11-06 21:56:46
|
On Tue, 2012-11-06 at 13:43 +0100, David Faure wrote:
> On Monday 05 November 2012 23:19:42 Philippe Waroquiers wrote:
> > On Mon, 2012-11-05 at 18:59 +0100, David Faure wrote:
> > > The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp
> > > (from https://bugs.kde.org/show_bug.cgi?id=243232)
> > > shows that an optimization inside Qt leads to a helgrind warning about
> > > wrong lock ordering, making the use of that feature impossible for
> > > detecting actual problems elsewhere (i.e. I have to use
> > > --track-lockorders=no all the time).
> > >
> > > Technically if we ignore the distinction between lock and tryLock,
> > > helgrind is right, we did lock the mutexes in reverse order. But because
> > > it's a tryLock, it couldn't possibly deadlock.
> > >
> > > Should helgrind simply ignore all pthread_mutex_trylock calls, for the
> > > lockorder detection, even if they succeed? I think so, actually (by
> > > definition they couldn't deadlock, which is what track-lockorders is all
> > > about).
> >
> > A deadlock can appear if there is a mixture of trylock and lock on
> > the same lock. So, trylock cannot just be ignored.
> > E.g.
> > Thread 1: trylock mutex1
> >           lock mutex2
> >
> > Thread 2: lock mutex2
> >           lock mutex1
> > might deadlock.
>
> True.
> This means that only "trylock in second place" should be ignored. More on
> this below.

More generally, I guess that you mean "trylock in last place" should be
ignored (rather than the special case of 2nd place).

This might be difficult to implement, as each time a lock is taken, helgrind
checks for an order violation. I suspect a later lock operation might then
transform a "trylock in last place" into a "trylock which is now no longer in
last place". But of course, when the trylock operation has just been done,
this trylock is "last place", and so if we were to ignore it, that would be
similar to always ignoring the trylock, which is not OK.

Currently, helgrind maintains a graph of lock order. I suspect we might need
different graph node types and/or edge types to cope with trylock. For sure,
more investigation is needed, looking in depth at the current algorithm.

> > Even without mixture, isn't the below slightly bizarre/dangerous ?
> > Thread 1: trylock mutex1
> >           trylock mutex2
> >
> > Thread 2: trylock mutex2
> >           trylock mutex1
>
> No deadlock can ever happen here.

Yes, no deadlock can occur. However, this is really a doubtful construction.
The question is: should helgrind report a "lock order" warning for such
constructs?

> > If the 2nd trylock fails, what is the "plan B" ?
>
> If the program then accesses shared data, a race condition will happen and
> will be detected by helgrind anyway. So ignoring the ordering of these
> trylocks is ok, I would think. Of course helgrind must record "we got the
> lock", for the race condition detection feature, but it shouldn't warn about
> the wrong order of the locking, since it can't possibly deadlock.

The idea of helgrind is that it detects lock-order problems and/or
race-condition problems *even* if no deadlock happens and/or if no race
condition really happened. Maybe it is very unlikely that the trylock fails;
it would still be nice to discover the bug. And if the trylock does not fail,
then the race condition will not be detected by helgrind.

> Not that the above would be good programming practice, of course, but
> helgrind can't say anything about it if all the locks were acquired. It will
> warn in another run, where some trylock fails, and a race ensues.
>
> > It seems that a task must unlock all locks and restart
> > from "scratch" in the above case.
>
> Yes this is exactly that QOrderedMutexLocker does.
>
> Thread 1: lock mutex1
>           lock mutex2
>
> Thread 2: lock mutex2
>           trylock mutex1
>           if that failed,
>               unlock mutex2
>               lock mutex1
>               lock mutex2

QOrderedMutexLocker really retries locking in another order than the
operations done by Thread 2? Or is that a typo?

> If the trylock fails (because thread1 was first), then it unlocks and
> restarts from scratch. I can't see a deadlock risk with that, so ideally
> helgrind shouldn't warn.
>
> > I guess we might need an option such as:
> >   --trylock-logic={normal-lock|local-retry|full-retry}
> >     normal-lock = current behaviour
> >     local-retry means the task would re-trylock
> >     full-retry means that the plan B is to unlock all locks
> >       and retry everything.
>
> I don't see how this can be a global option. Some piece of code (like
> QOrderedMutexLocker) might have the "full retry" logic above, but other
> pieces of code might do something different - e.g. something wrong. It
> doesn't make sense to me to tell helgrind "this is what all the code paths
> are going to do about tryLock", that's impossible to predict in a complex
> program.

For sure, generally, an application can exhibit a wide variety of behaviours.
I suspect, however, that an application might (this is sane at least) try to
obey a certain "design pattern" such as: in all cases, if a trylock fails,
unlock all and retry all. The command-line option would then cover such cases
without annotations.

> Let me present another case:
>
> Thread 1: lock mutex1
>           lock mutex2
>
> Thread 2: lock mutex2
>           trylock mutex1
>           if that fails, unlock mutex2 and give up
>
> This could happen for a non-critical operation that can be canceled if it
> can't be done immediately. Again, no deadlock possible, so helgrind
> shouldn't warn about a successful trylock being "out of order". And yet this
> isn't a "full retry", so I don't think --trylock-logic=full-retry is the
> solution.
>
> Deadlock can only happen if both threads use normal locking as their
> second operation. A trylock as the second operation doesn't deadlock.

Replace "second" by "last". But how do we know that an operation is the last
one? I think this is a non-trivial change to the current algorithm.

Philippe
|
|
From: David F. <fa...@kd...> - 2012-11-07 09:49:24
|
On Tuesday 06 November 2012 22:56:32 Philippe Waroquiers wrote:
> On Tue, 2012-11-06 at 13:43 +0100, David Faure wrote:
> > On Monday 05 November 2012 23:19:42 Philippe Waroquiers wrote:
> > > On Mon, 2012-11-05 at 18:59 +0100, David Faure wrote:
> > > > The testcase http://www.davidfaure.fr/2012/qmutex_trylock.cpp
> > > > (from https://bugs.kde.org/show_bug.cgi?id=243232)
> > > > shows that an optimization inside Qt leads to a helgrind warning
> > > > about wrong lock ordering, making the use of that feature impossible
> > > > for detecting actual problems elsewhere (i.e. I have to use
> > > > --track-lockorders=no all the time).
> > > >
> > > > Technically if we ignore the distinction between lock and tryLock,
> > > > helgrind is right, we did lock the mutexes in reverse order. But
> > > > because it's a tryLock, it couldn't possibly deadlock.
> > > >
> > > > Should helgrind simply ignore all pthread_mutex_trylock calls, for
> > > > the lockorder detection, even if they succeed? I think so, actually
> > > > (by definition they couldn't deadlock, which is what track-lockorders
> > > > is all about).
> > >
> > > A deadlock can appear if there is a mixture of trylock and lock on
> > > the same lock. So, trylock cannot just be ignored.
> > > E.g.
> > > Thread 1: trylock mutex1
> > >           lock mutex2
> > >
> > > Thread 2: lock mutex2
> > >           lock mutex1
> > > might deadlock.
> >
> > True.
> > This means that only "trylock in second place" should be ignored. More on
> > this below.
>
> More generally, I guess that you mean "trylock in last place" should
> be ignored (rather than the special case of 2nd place).

Right.

> This might be difficult to implement as each time a lock is taken,
> helgrind checks for order violation. I suspect a later lock operation
> might then transform a "trylock in last place" to a "trylock which is
> now not anymore in last place".

Right. So let's record it, but let's not warn at the precise time the
successful trylock happens. If another lock happens later, out of order due
to the successful trylock, then yes, let's warn.

> But of course, when the trylock operation has just been done,
> this trylock is "last place" and so if we would ignore it, then
> this would be similar to always ignore the trylock, which is not ok.

Right, not "ignore it as if it didn't happen", but "ignore it as in don't
warn right now", and still record it.

> Currently, helgrind maintains a graph of lock order.
> I suspect we might need different graph node types and/or edge types
> to cope with trylock. For sure, more investigations needed looking
> in depth at the current algorithm.

I think it would be good enough to add the successful trylock to the graph,
just without a warning at that point.

> > > Even without mixture, isn't the below slightly bizarre/dangerous ?
> > > Thread 1: trylock mutex1
> > >           trylock mutex2
> > >
> > > Thread 2: trylock mutex2
> > >           trylock mutex1
> >
> > No deadlock can ever happen here.
>
> Yes, no deadlock can occur. However, this is really a doubtful
> construction. The question is: should helgrind report a "lock order"
> warning for such constructs ?

I don't think so, due to the valid use cases I'm showing here.

> The idea of helgrind is that it detects lock order problems and/or
> race condition problems *even* if no deadlock happens and/or if no
> race condition really happened.
> Maybe it is very unlikely that the trylock fails. Still would be nice
> to discover the bug. And if the trylock does not fail, then the race
> condition will then not be detected by helgrind.

The simplest construct that can lead to this problem is:
* trylock mutex1
* access shared data

If the trylock fails, then we have a race condition. If it succeeds, then
there is no problem. I don't see how you can detect the potential race
condition with helgrind when the trylock succeeds, unless you go as far as
saying "all trylocks are bad" (which was actually my thinking until reading
the code of QOrderedMutexLocker, which has a valid use case for trylock --
maybe there are more, of course).

But this seems impossible to detect. I mean, the code could say

    if (trylock succeeds)
        // do something
    else
        // do something else

You're working on a runtime analysis tool, not a static analysis tool, so you
can't know anything about the branch that is not taken in a given run of
helgrind. Therefore I see no way of saying "there could have been a race if
this trylock had failed, but it actually succeeded in this run". So I don't
disagree on "it would be nice to discover the bug", but if it's impossible
then let's drop the idea and come back to what is actually possible :)

> > Yes this is exactly that QOrderedMutexLocker does.
> >
> > Thread 1: lock mutex1
> >           lock mutex2
> >
> > Thread 2: lock mutex2
> >           trylock mutex1
> >           if that failed,
> >               unlock mutex2
> >               lock mutex1
> >               lock mutex2
>
> QOrderedMutexLocker really retries locking in an other order
> than the operations done by Thread 2 ?
> Or is that a typo ?

Did you read it wrong? It ends up locking mutex1 and then mutex2, in the
"retry" block. This is the same order as thread 1. QOrderedMutexLocker starts
at "trylock mutex1", i.e. we ask that class to ensure that mutex1 and mutex2
are locked, but it could happen that we're calling that class with mutex2
already locked. So it will try to lock mutex1, and if that fails (already
locked), then unlock + lock1 + lock2. If it succeeds, then no other thread
had locked mutex1, so no deadlock. Yes, even though we ended up locking out
of order (2 then 1).

> For sure, generally, an application can do a big variety of behaviours.
> I suspect however that an application might (this is sane at least) try
> to obey a certain "design pattern" such as:
> in all cases, if a trylock fails, unlock all and retry all.

This is too monolithic thinking :) An application that uses a given framework
(e.g. Qt) could very well be doing things differently than the framework does
internally. As a developer debugging a large application written by other
people, using a framework written by other people, how can I guarantee to
helgrind that all trylock uses follow a single design pattern?

> > Deadlock can only happen if both threads use normal locking as their
> > second operation. A trylock as the second operation doesn't deadlock.
>
> Replace second by last. But how do we know that an operation is the last
> one ? I think this is a non trivial change in the current algorithm.

At the time of the trylock, it is the last one -> no warning at that precise
moment. This sounds like a simple enough change in the current algorithm?
Basically adding one if() ... if only I knew where ;)

--
David Faure, fa...@kd..., http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5
|
|
From: Philippe W. <phi...@sk...> - 2012-11-07 22:00:56
|
On Wed, 2012-11-07 at 10:51 +0100, David Faure wrote:
> > The idea of helgrind is that it detects lock order problems and/or
> > race condition problems *even* if no deadlock happens and/or if no
> > race condition really happened.
> > Maybe it is very unlikely that the trylock fails. Still would be nice
> > to discover the bug. And if the trylock does not fail, then the race
> > condition will then not be detected by helgrind.
>
> The simpler construct that can lead to this problem is
> * trylock mutex1
> * access shared data
"discover the bug" is related to the doubful construct, not
to a race condition (as said above, if the trylock does not fail,
no way to detect the race condition that would happen if
the application does not properly handle the trylock failure).
Note that currently, laog is producing messages which should
be considered as "lock order warning", not as
"for sure there is a deadlock order problem".
The trylock is one case of "warning, not an error" which could/should
be improved.
But there are others e.g. laog does not understand the concept of
"guard locks" which is (IIUC): each thread can acquire a single lock
in a set of locks. If a thread wants to acquire more than one lock
(in any order then), it first has to acquire the "guard lock",
and then can lock in any order any nr of locks in the lock set.
With this guard lock, not possible to have a deadlock,
but for sure this is not understood by the current helgrind
laog algorithm.
Properly handling guard locks implies a more sophisticated algorithm
than the current laog (search for "good lock algorithm").
> This is too monolithic thinking :)
> An application that uses a given framework (e.g. Qt) could very well be doing
> things differently than the framework does internally.
> As a developer debugging a large application written by other people, using a
> framework written by other people, how can I guarantee helgrind that all
> trylock uses follow a single design pattern?
Of course, this cannot be guaranteed. But the idea is not that
the command line option(s) are covering the full range of possible
"design patterns". The idea is that they should cover 'out of the box'
a reasonable range of such patterns. This is true e.g. for memcheck
(out of the box, it supports malloc compatible libs, but otherwise
you need annotations).
Helgrind should also support some 'out of the box' locking logics.
The difficult question is what should be covered by the command line
options 'out of the box' (the nirvana being an algorithm that would
work for everything without annotations).
> At the time of the trylock, it is the last one -> no warning at that precise
> moment. This sounds like a simple enough change in the current algorithm?
> Basically adding one if() ... if only I knew where ;)
Avoiding a "lock order warning" at the moment the trylock is done is
easy, I believe:
in hg_main.c:3697, put a condition
if (!is_a_trylock)
before:
other = laog__do_dfs_from_to(lk, thr->locksetA);
if (other) {
....
(where is_a_trylock has to be given by the caller).
I think it is almost mechanical work to add arguments to the *POST event
handlers and corresponding requests, to transfer the "is a trylock"
information from the helgrind interception down to line 3697.
But I suspect that the insertion of a trylock in the graph
might later on cause a 'wrong' cycle to be detected.
E.g. (L = lock, T = trylock, L and T followed by lock nr)
threadA L1 T2
threadB L2 L3
threadC L3 L1
cannot deadlock (I think :) if threadA releases lock 1
when T2 fails.
But when "L3 L1" is inserted, a cycle will be found
via T2 (if the graph has not remembered that this is a trylock).
So, I am still (somewhat intuitively) thinking that we need
to have nodes and/or edges marked with "this is a trylock"
and have the graph exploration taking these marking into account
to not generate such "false warnings".
Far from a mathematical proof, I know :).
Philippe
|
|
From: David F. <fa...@kd...> - 2012-11-07 23:16:00
|
On Wednesday 07 November 2012 23:00:51 Philippe Waroquiers wrote:
> On Wed, 2012-11-07 at 10:51 +0100, David Faure wrote:
> > > The idea of helgrind is that it detects lock order problems and/or
> > > race condition problems *even* if no deadlock happens and/or if no
> > > race condition really happened.
> > > Maybe it is very unlikely that the trylock fails. Still would be nice
> > > to discover the bug. And if the trylock does not fail, then the race
> > > condition will then not be detected by helgrind.
> >
> > The simpler construct that can lead to this problem is
> > * trylock mutex1
> > * access shared data
>
> "discover the bug" is related to the doubful construct, not
> to a race condition
If there's no race condition and no deadlock, I'm not sure what bug you want
to detect :-)
> Note that currently, laog is producing messages which should
> be considered as "lock order warning", not as
> "for sure there is a deadlock order problem".
Yes, but why does it warn about lock order? Because it could cause deadlocks.
I agree, this is about potential deadlocks, not actual deadlocks. But
trylock 1 + trylock 2 vs trylock2 + trylock1 (the case we're talking about in
this part of the mail) is not even a potential deadlock. It can't ever
deadlock. So there's nothing to warn about.
> The trylock is one case of "warning, not an error" which could/should
> be improved.
> But there are others e.g. laog does not understand the concept of
> "guard locks" which is (IIUC): each thread can acquire a single lock
> in a set of locks. If a thread wants to acquire more than one lock
> (in any order then), it first has to acquire the "guard lock",
> and then can lock in any order any nr of locks in the lock set.
> With this guard lock, not possible to have a deadlock,
> but for sure this is not understood by the current helgrind
> laog algorithm.
Right. That one definitely needs annotations in the source code, I would think.
There's no way for the tool to detect that these mutexes all go together.
> > At the time of the trylock, it is the last one -> no warning at that
> > precise moment. This sounds like a simple enough change in the current
> > algorithm? Basically adding one if() ... if only I knew where ;)
>
> To avoid doing a "lock order warning" when doing the trylock is easy
> I believe:
> in hg_main.c:3697, put a condition
> if (!is_a_trylock)
> before:
> other = laog__do_dfs_from_to(lk, thr->locksetA);
> if (other) {
> ....
> (where is_a_trylock has to be given by the caller).
I gave it a try, but I'm hitting a problem with exactly that, passing
isTryLock to that code. isTryLock is set in HG_PTHREAD_MUTEX_LOCK_PRE and
similar, while the above code is called from HG_PTHREAD_MUTEX_LOCK_POST and
similar. If I understand correctly, adding an argument to the _POST variant
would break source compatibility for the existing userland macros?
> I think it is almost mechanical work to add arguments to *POST event
> handlers and corresponding requests to transfer the "is a try lock"
> from the helgrind interception to the line 3697).
Ah OK, so you don't seem to see a problem with adding an argument :)
No source compatibility for the user request macros is OK?
I'll finish the patch then, but only if you agree with the approach, otherwise
this would be dead code, i.e. a wasted effort. At this point you don't seem
fully convinced :)
> But I suspect that the insertion of a trylock in the graph
> might later on cause a 'wrong' cycle to be detected.
> E.g. (L = lock, T = trylock, L and T followed by lock nr)
> threadA L1 T2
> threadB L2 L3
> threadC L3 L1
> cannot deadlock (I think :) if threadA releases lock 1
> when T2 fails.
Well, if T2 fails then we have no cycle, and if it succeeds we have a real
potential deadlock. The question is whether we want to remember the T2 attempt
(and warn later) even when T2 fails. I would say, if it failed, it's like it
didn't happen.
Do you actually store failed trylocks currently?
> But when L3 L1 will be inserted, a cycle will be found
> via T2 (if the graph has not remembered this is a trylock).
> So, I am still (somewhat intuitively) thinking that we need
> to have nodes and/or edges marked with "this is a trylock"
> and have the graph exploration taking these marking into account
> to not generate such "false warnings".
My idea is rather that a successful trylock is just like a lock (except that
it shouldn't warn at that precise moment, but if any later real-lock happens
out of order with it, then we should warn, this is why successful trylocks
should be recorded), and a failed trylock should NOT be recorded.
So at this point I don't see a need for remembering "this was a trylock". I
don't see how this could matter later on.
--
David Faure, fa...@kd..., http://www.davidfaure.fr
Working on KDE, in particular KDE Frameworks 5
|
|
From: Philippe W. <phi...@sk...> - 2012-11-08 22:44:19
|
On Thu, 2012-11-08 at 00:18 +0100, David Faure wrote:
> On Wednesday 07 November 2012 23:00:51 Philippe Waroquiers wrote:
> > "discover the bug" is related to the doubful construct, not
> > to a race condition
>
> If there's no race condition and no deadlock, I'm not sure what bug you want
> to detect :-)
The "doubtful construct". Such kind of construct is in my opinion better
removed, and better detect that via a tool.
>
> > Note that currently, laog is producing messages which should
> > be considered as "lock order warning", not as
> > "for sure there is a deadlock order problem".
>
> Yes, but why does it warn about lock order? Because it could cause deadlocks.
Not necessarily. A doubtful construct like the above could cause e.g.
livelocks if the "inverted" (failing) trylocks are not properly handled.
So, even if a wrong order does not necessarily cause a deadlock, it
can be a good idea to detect these doubtful constructions and remove
them.
>
> I agree, this is about potential deadlocks, not actual deadlocks. But
> trylock 1 + trylock 2 vs trylock2 + trylock1 (the case we're talking about in
> this part of the mail) is not even a potential deadlock. It can't ever
> deadlock. So there's nothing to warn about.
A compiler produces warnings and errors. Warnings are useful to detect
potential problems. Similarly, a wrong lock order (even if not
deadlocking) might be something you do not want. And so, having e.g. an
option to continue warning for such a case (i.e. keeping the current
helgrind behaviour) would be useful for many users.
>
> > The trylock is one case of "warning, not an error" which could/should
> > be improved.
> > But there are others e.g. laog does not understand the concept of
> > "guard locks" which is (IIUC): each thread can acquire a single lock
> > in a set of locks. If a thread wants to acquire more than one lock
> > (in any order then), it first has to acquire the "guard lock",
> > and then can lock in any order any nr of locks in the lock set.
> > With this guard lock, not possible to have a deadlock,
> > but for sure this is not understood by the current helgrind
> > laog algorithm.
>
> Right. That one definitely needs annotations in the source code, I would think.
> There's no way for the tool to detect that these mutexes all go together.
Search the web for the "GoodLock" algorithm, which solves the problem
of guard-lock (gate-lock) detection in deadlock lock-graph analysis.
>
> I gave it a try, but I'm hitting a problem with exactly that, passing
> isTryLock to that code. isTryLock is set in HG_PTHREAD_MUTEX_LOCK_PRE and
> similar, while the above code is called from HG_PTHREAD_MUTEX_LOCK_POST and
> similar. If I understand correctly, adding an argument to the _POST variant
> would break source compatibility for the existing userland macros?
I do not think it breaks compatibility, for two reasons:
1. I believe these are "system client requests", executed by
helgrind's interception code, not by client code.
2. It is possible to "add an argument", because the requests all
have a fixed number of arguments, with "unused" arguments passed as 0.
As long as actually using such an "unused" argument keeps the same
semantics for the value 0, it is ok to add such arguments.
(E.g. in 3.7.0 I "added" a new argument to the memcheck leak-search
client request to do a delta leak search, without breaking
compatibility:
0 means full leak search,
1 means delta leak search.)
> I'll finish the patch then, but only if you agree with the approach, otherwise
> this would be dead code, i.e. a wasted effort. At this point you don't seem
> fully convinced :)
Because I believe the "3 threads, 3 locks" example below is a
counter-example showing that simply adding the trylock to the graph
does not work.
Note that convincing me is neither a sufficient nor a necessary
condition for having a complex helgrind patch accepted :).
>
> > But I suspect that the insertion of a trylock in the graph
> > might later on cause a 'wrong' cycle to be detected.
> > E.g. (L = lock, T = trylock, L and T followed by lock nr)
> > threadA L1 T2
> > threadB L2 L3
> > threadC L3 L1
> > cannot deadlock (I think :) if threadA releases lock 1
> > when T2 fails.
>
> Well, if T2 fails then we have no cycle, and if it succeeds we have a real
> potential deadlock. The question is whether we want to remember the T2 attempt
> (and warn later) even when T2 fails. I would say, if it failed, it's like it
> didn't happen.
If T2 succeeds, there is of course no deadlock: in this case, threadA
can do the required work and then release the locks.
So, with T2, the above cannot deadlock (if we assume threadA's
heuristic is to release L1 when T2 fails). If threadA's heuristic is
instead to just retry T2, then we have a deadlock.
So, in summary:
if T2 succeeds, for sure no deadlock (at least for this run :);
if T2 fails, then depending on threadA's heuristic we might
have a deadlock (probably better called a livelock,
as threadA will keep burning CPU re-trylocking 2 :).
Now, with the above, let's imagine we apply the rule:
when inserting a trylock, do not compute cycles, but just insert
it as a normal lock operation.
If the 3 threads execute in sequence (fully), what will be
inserted is
L1 L2
L2 L3
L3 L1
At this point, the graph has "forgotten" that the L2 done by threadA
was in fact a T2.
And so, the last insertion of L1 will not be able to detect
that this is not a deadlock: a cycle will be found, while
according to the assumed threadA heuristic it will never deadlock.
So, I am still not convinced that handling T2 as
"just add L2, but do not check for a cycle"
will work.
> Do you actually store failed trylocks currently?
>
> > But when L3 L1 is inserted, a cycle will be found
> > via T2 (if the graph has not remembered that this is a trylock).
> > So, I am still (somewhat intuitively) thinking that we need
> > nodes and/or edges marked "this is a trylock",
> > and that the graph exploration must take these markings into
> > account so as not to generate such "false warnings".
>
> My idea is rather that a successful trylock is just like a lock (except that
> it shouldn't warn at that precise moment, but if any later real-lock happens
> out of order with it, then we should warn, this is why successful trylocks
> should be recorded), and a failed trylock should NOT be recorded.
> So at this point I don't see a need for remembering "this was a trylock". I
> don't see how this could matter later on.
See above.
It looks like we need to clarify what is the expected "application
behaviour" that the helgrind "lock detection order" is supposed to
"simulate" and so detect warnings and/or errors.
The current laog algorithm is based on the following application model:
* when a trylock fails, the application just retries this trylock,
it keeps all the other locks.
For such a model, the current laog algorithm works, because a trylock
can be handled exactly like a normal lock.
If we do not want to handle a trylock like a normal lock, we have to
define what is the heuristic that the application is supposed to apply
for a failing trylock. As shown above, depending on this, the algorithm
will have to differ.
The example
threadA T1 T2
threadB T2 T1
cannot "deadlock", but depending on the heuristic it can still
"livelock" (not very different from a deadlock, except that it
consumes more CPU).
So, do we want helgrind to warn for that case or not?
(cf. the initial discussion about what helgrind is supposed to warn
about.)
Philippe
|