From: David F. <fa...@kd...> - 2012-08-25 09:01:18
|
How do I tell helgrind that some atomic operations on integers are OK? A suppression like this works, but it would make helgrind more useful to all Qt users if either atomic operations were handled automatically, or if this was added to valgrind's own suppressions. { qt5_basic_atomic_integer Helgrind:Race fun:_ZNK19QBasicAtomicIntegerIiE4loadEv } Also, this only works for integers, not for pointers... ==8489== Possible data race during read of size 8 at 0x8B9C968 by thread #3 ==8489== at 0x57549C0: QBasicAtomicPointer<QTextCodec>::loadAcquire() const (qgenericatomic.h:110) ==8489== by 0x5753A26: QTextCodec::codecForLocale() (qtextcodec.cpp:683) ==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959) ==8489== ==8489== This conflicts with a previous write of size 8 by thread #2 ==8489== at 0x57549A2: QBasicAtomicPointer<QTextCodec>::storeRelease(QTextCodec*) (qgenericatomic.h:119) ==8489== by 0x5757D5D: QIcuCodec::defaultCodecUnlocked() (qicucodec.cpp:441) ==8489== by 0x5753A43: QTextCodec::codecForLocale() (qtextcodec.cpp:687) ==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959) Can't make suppressions for these, given that the template class could use any type (here QTextCodec). The implementation of loadAcquire/storeRelease is compiler and platform dependent. With C++11 it uses std::atomic, with gcc (in non-c++11 mode) it uses some __sync_synchronize stuff, on ARM there's a bit of assembly, etc. I presume all this is too low-level for helgrind? I.e. can it actually detect atomic operations and therefore validate Qt's implementation, or should we rather just "trust Qt" and aim for silencing helgrind whenever it sees QBasicAtomic{Integer,Pointer}? -- David Faure, fa...@kd..., http://www.davidfaure.fr Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5 |
From: Bart V. A. <bva...@ac...> - 2012-08-25 10:30:56
|
On 08/25/12 08:45, David Faure wrote: > How do I tell helgrind that some atomic operations on integers are OK? > > A suppression like this works, but it would make helgrind more useful to all Qt users if either atomic operations were handled automatically, or if this was added to valgrind's own suppressions. > > { > qt5_basic_atomic_integer > Helgrind:Race > fun:_ZNK19QBasicAtomicIntegerIiE4loadEv > } > > Also, this only works for integers, not for pointers... > > ==8489== Possible data race during read of size 8 at 0x8B9C968 by thread #3 > ==8489== at 0x57549C0: QBasicAtomicPointer<QTextCodec>::loadAcquire() const (qgenericatomic.h:110) > ==8489== by 0x5753A26: QTextCodec::codecForLocale() (qtextcodec.cpp:683) > ==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959) > ==8489== > ==8489== This conflicts with a previous write of size 8 by thread #2 > ==8489== at 0x57549A2: QBasicAtomicPointer<QTextCodec>::storeRelease(QTextCodec*) (qgenericatomic.h:119) > ==8489== by 0x5757D5D: QIcuCodec::defaultCodecUnlocked() (qicucodec.cpp:441) > ==8489== by 0x5753A43: QTextCodec::codecForLocale() (qtextcodec.cpp:687) > ==8489== by 0x558BD08: QString::toLocal8Bit() const (qstring.cpp:3959) > > Can't make suppressions for these, given that the template class could use any type > (here QTextCodec). > > The implementation of loadAcquire/storeRelease is compiler and platform dependent. > With C++11 it uses std::atomic, with gcc (in non-c++11 mode) it uses some __sync_synchronize stuff, > on ARM there's a bit of assembly, etc. > > I presume all this is too low-level for helgrind? I.e. can it actually detect atomic operations > and therefore validate Qt's implementation, or should we rather just "trust Qt" and aim for > silencing helgrind whenever it sees QBasicAtomic{Integer,Pointer}? The Qt authors will have to modify the Qt library somewhat before it will become convenient to analyze Qt programs with Helgrind or DRD. See also this bug report (created three years ago): http://bugreports.qt-project.org/browse/QTBUG-5655. Bart. |
From: Julian S. <js...@ac...> - 2012-08-25 20:44:11
|
On Saturday, August 25, 2012, David Faure wrote: > How do I tell helgrind that some atomic operations on integers are OK? > [...] Without looking in more detail at this, it is difficult to tell if that is really the right question to ask. Helgrind treats atomic operations (atomic read-modify-write) as if they were simply reads, both for x86/x86_64, and similarly for LL/SC pairs on ARM, PPC, MIPS, etc. So it should not report races on code that uses atomic operations. Now, that said, a reason it might be reporting races like this is that Qt is using the atomic ops to create inter-thread synchronisation that Helgrind cannot see, hence it reports a race where there is none. Or maybe Qt really is racey -- perhaps harmlessly. In which case H is correct to complain. Without looking further, it's not really possible to say what's happening. I would be happy to do that if you make it easy (viz, send commands starting with "wget qt5.tarball" etc) to build a simple test case. One way to hide the complaints is use VALGRIND_{DISABLE,ENABLE}_ERROR_REPORTING around the affected areas. Read the docs in valgrind.h about that. Or you can use VALGRIND_HG_{DISABLE,ENABLE}_CHECKING on the address range (in helgrind.h) to tell H not to check the particular address range. That's quite dangerous though because you need be sure to re-enable checking on the range when the memory is freed. Both of those approaches are basically bad though, because they might hide real bugs, both in Qt and in apps linked against it. Better to investigate what is going on here, fix any bugs found, and mark up Qt to notify Helgrind/DRD of any inter-thread synchronisation that the normal pthread intercepts cannot see. Once that is done, it should be possible to use Helgrind/DRD on Qt without false errors. That could be complex though, because it may involve describing the synchronisation induced by thread-safe reference counting. That's doable -- I marked up Firefox's XPCOM implementation so it can be Helgrinded, so I know it is possible. LMK if I can help more. J |
From: David F. <fa...@kd...> - 2012-08-25 23:28:58
|
On Saturday 25 August 2012 22:43:58 Julian Seward wrote: > Or maybe Qt really is racey Bingo :) I tried extracting a testcase of the problem, and it turns out that Qt5's feature detection (which implementation to use) goes wrong, and a no-op code path ends up being used for this particular operation. I reported this to Thiago, I'll be back with a real problem next time, not a no-op implementation :-) Thanks for the input. -- David Faure, fa...@kd..., http://www.davidfaure.fr Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5 |
From: David F. <fa...@kd...> - 2012-08-26 09:04:53
Attachments:
testcase_atomic_ops_helgrind.cpp
|
On Sunday 26 August 2012 01:30:43 David Faure wrote: > On Saturday 25 August 2012 22:43:58 Julian Seward wrote: > > Or maybe Qt really is racey > > Bingo :) OK, maybe not. Turns out the code runs as intended by the Qt developers. Marc Mutz said " The standard says it's racy, but the implementation of std::atomic::load(memory_order_acquire) won't look different. Simple reads and writes on x86 are already sequentially consistent. Think MESI cache coherency. Before a CPU writes to a memory location it needs to acquire exclusive ownership (E) of the cache line, the well-known "hardware mutex" on a cache line that produces False Sharing, too. This seems to hold for all architectures, cf. threads re: "Brief example ..." at http://www.decadent.org.uk/pipermail/cpp-threads/2008-December/thread.html " I attached a pure C++ testcase of the issue. Compile it with g++ testcase_atomic_ops_helgrind.cpp -o testcase_atomic_ops_helgrind -lpthread Thiago expects that helgrind can't autodetect this case and that helgrind- macros markup is needed in Qt, I'm fine with adding that if you guys agree -- after you show me how by modifying the attached example :-) Thanks. -- David Faure, fa...@kd..., http://www.davidfaure.fr Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5 |
From: Julian S. <js...@ac...> - 2012-08-26 09:28:29
|
On Sunday, August 26, 2012, David Faure wrote: > Thiago expects that helgrind can't autodetect this case and that helgrind- > macros markup is needed in Qt, I'm fine with adding that if you guys agree > -- after you show me how by modifying the attached example :-) IIUC then, QBasicAtomicInt::{loadAcquire,storeRelease} just does normal loads and stores of an int. Right? What atomicity properties are you expecting onDemandNumber() to have? Viz, how is onDemandNumber supposed to behave when you have multiple threads doing it at the same time on the same location? J |
From: Patrick J. L. <lop...@gm...> - 2012-08-27 22:04:02
|
On Sun, Aug 26, 2012 at 2:06 AM, David Faure <fa...@kd...> wrote: > > Marc Mutz said " > The standard says it's racy, but the implementation of > std::atomic::load(memory_order_acquire) won't look different. Simple reads > and writes on x86 are already sequentially consistent. I do not think this is true. Loads on x86 do have acquire semantics, and stores do have release semantics, but that is not the same as sequential consistency. If loads and stores on x86 were "already sequentially consistent", there would be no need for the mfence instruction... And as Julian points out, the compiler has plenty of license to reorder things even if the CPU does not. Dealing with all of this is why C++11 has a memory model. Out of curiosity, is Helgrind expected to generate errors for correctly-implemented lock-free algorithms? (That is, algorithms with correct memory barriers and no mutexes?) - Pat |
From: Julian S. <js...@ac...> - 2012-08-28 08:29:30
|
On Tuesday, August 28, 2012, Patrick J. LoPresti wrote: > On Sun, Aug 26, 2012 at 2:06 AM, David Faure <fa...@kd...> wrote: > > Marc Mutz said " > > The standard says it's racy, but the implementation of > > std::atomic::load(memory_order_acquire) won't look different. Simple > > reads and writes on x86 are already sequentially consistent. > > I do not think this is true. Loads on x86 do have acquire semantics, > and stores do have release semantics, but that is not the same as > sequential consistency. If loads and stores on x86 were "already > sequentially consistent", there would be no need for the mfence > instruction... On re-reading this, I am wondering now if Marc meant "x86 TSO (total store order)" when he said "sequential consistency". IIRC from Hennessy and Patterson, no real processor provides sequential consistency because to do so precludes the use of per-processor store queues, and without those a high performance memory system is more or less impossible. x86 does provide TSO, a guarantee that stores by one processor appear to other processors in the same order. But that's a weaker guarantee than sequential consistency. > Out of curiosity, is Helgrind expected to generate errors for > correctly-implemented lock-free algorithms? (That is, algorithms with > correct memory barriers and no mutexes?) It will generate false errors in cases where it can't "see" all of the happens-before relationships between threads. By default it ships with intercepts that correctly detect those relationships for POSIX pthreads (by intercepting pthread_mutex_lock, unlock, etc). However, if you create your own locking primitives or do synchronisation some other way, there's a flexible set of client requests for marking up code, so it can indeed be checked. I did this just the other day for home-rolled spinlocks. J |
From: David F. <fa...@kd...> - 2012-08-26 09:39:07
|
On Sunday 26 August 2012 11:28:06 Julian Seward wrote: > On Sunday, August 26, 2012, David Faure wrote: > > Thiago expects that helgrind can't autodetect this case and that helgrind- > > macros markup is needed in Qt, I'm fine with adding that if you guys agree > > -- after you show me how by modifying the attached example :-) > > IIUC then, QBasicAtomicInt::{loadAcquire,storeRelease} just does normal > loads and stores of an int. Right? Yep (on x86). This whole API exists so that other architectures can get another implementation. > What atomicity properties are you expecting onDemandNumber() to have? > Viz, how is onDemandNumber supposed to behave when you have multiple > threads doing it at the same time on the same location? static int onDemandNumber() { static QBasicAtomicInt sharedInt = { 0 }; if (!sharedInt.loadAcquire()) { // some calculation goes here sharedInt.storeRelease(42); } return sharedInt.loadAcquire(); } I think the point is that the first call to loadAcquire() should either return 0 or 42, but never some intermediate value due to a write in progress. Hmm, I see the point though. This *is* racy by design, since you don't know if you'll get 0 or 42, if another thread is running at the same time. The only thing is, we know it's fine to get either one, since we'll simply do the calculation twice if two threads get 0 at the same time. Overall this is more performant than using a mutex every single time this is called. I'm not sure what can be done then, to avoid a helgrind warning. I presume the only solution is to annotate onDemandNumber() itself, not QBasicAtomicInt. I.e. annotate all uses of QBasicAtomicInt where the overall logic makes it safe, in order to still be able to detect unsafe uses... (like load, increment, store). -- David Faure, fa...@kd..., http://www.davidfaure.fr Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5 |
From: Julian S. <js...@ac...> - 2012-08-26 10:49:53
|
> I'm not sure what can be done then, to avoid a helgrind warning. If you can guarantee that "// some calculation goes here" touches only thread-local state, then there is only a race on sharedInt itself. In which case, use VALGRIND_HG_{DISABLE,ENABLE}_CHECKING to disable reporting on the relevant specific instances of the sharedInt. If "// some calculation goes here" touches shared state then you will have to use VALGRIND_{DISABLE,ENABLE}_ERROR_REPORTING instead. This is a heavyweight and undesirable solution, though. ---------- My understanding of this is that it is in violation of the C++11 memory model, which says that "the implementation" may deliver stores from one core to another in any order, in the absence of any C++11 mandated inter- thread synchronisation. You can argue that for x86 the hardware's TSO guarantees make this harmless ... > Marc Mutz said " > The standard says it's racy, but the implementation of ... but AIUI, "the implementation" also includes the compiler, and I believe it has been observed that GCC will indeed break your code in unexpected ways, in some cases. In short you need to prove that not only the hardware won't reorder stores between cores -- which for x86 happens to be true -- but also the compiler won't. J |
From: David F. <fa...@kd...> - 2012-08-27 11:43:22
|
On Sunday 26 August 2012 12:53:41 Marc Mutz wrote: > On Sunday August 26 2012, David Faure wrote: > > On Sunday 26 August 2012 11:28:06 Julian Seward wrote: > > > On Sunday, August 26, 2012, David Faure wrote: > > > > Thiago expects that helgrind can't autodetect this case and that > > > > helgrind- macros markup is needed in Qt, I'm fine with adding that if > > > > you guys agree -- after you show me how by modifying the attached > > > > example :-) > > > > > > IIUC then, QBasicAtomicInt::{loadAcquire,storeRelease} just does normal > > > loads and stores of an int. Right? > > > > Yep (on x86). This whole API exists so that other architectures can get > > another implementation. > > > > > What atomicity properties are you expecting onDemandNumber() to have? > > > Viz, how is onDemandNumber supposed to behave when you have multiple > > > threads doing it at the same time on the same location? > > > > static int onDemandNumber() { > > > > static QBasicAtomicInt sharedInt = { 0 }; > > if (!sharedInt.loadAcquire()) { > > > > // some calculation goes here > > sharedInt.storeRelease(42); > > > > } > > return sharedInt.loadAcquire(); > > > > } > > If sharedInt is a std::atomic<int>, then the above is not a C++11 data race, No, it's a bare int, just like Qt5's QBasicAtomicInt does by default on x86. You missed the initial email with the testcase, here it is (testcase_atomic_ops_helgrind.cpp). > so Helgrind shouldn't warn. Helgrind warns, hence my mail here. > If 'some calculation goes here' doesn't write > to memory global memory, you could even drop the memory ordering. Atomics > don't participate in data races, but they might also not establish the > happens-before that you need elsewhere. This is obviously a reduced testcase ;) The calculation, in the example I took, is to register the metatype, which is global stuff, but which is somewhat safe if done twice (it will return the same number). Inside QMetaType::registerNormalizedType there's a mutex. The whole point, though, as I see it, is to avoid locking in the very common case where the metatype has been registered already. > > I think the point is that the first call to loadAcquire() should either > > return 0 or 42, but never some intermediate value due to a write in > > progress. > > Correct. What's more: for any thread, any read of an atomic variable must > return a value previously read, or a value written later by another thread. > IOW: once a thread sees 42, it can't magically see 0 the next time. > According to Hans Boehm, this may happen on IA-64 because the architecture > allows to reorder reads from the same memory location. Which is why the atomic stuff expands to different code on IA-64, so no problem. > If it writes to a shared memory location, the code contains a C++11 data > race, and you need more complex synchronisation Yes, which is there, but that's not the point. From what you say, the code is fine, so helgrind shouldn't warn about sharedInt itself. However, Julian says the code is not fine, the compiler could generate code that doesn't work correctly. I don't know who to trust at this point, I'm caught in the middle :-). If this is safe, helgrind should be fixed. If this is not safe, Qt should be fixed. But it sounds to me like different people have different interpretations on what is safe, in such code :-) I just realized one thing though: I was wrong when I said "it is racy because the value could be 0 or 42", that is not a race. Writing the same code with a mutex around the int leads to exactly this behavior (0 or 42), which is fine. See testcase_int_mutex_helgrind.cpp. No helgrind warning, and I think everyone agrees that this code is ok. So if ints are read and written atomically, then the helgrind warning in the initial testcase is wrong? (ok I guess it's more complicated, if the compiler can reorder stuff in the first testcase and not the second....). -- David Faure, fa...@kd..., http://www.davidfaure.fr Sponsored by Nokia to work on KDE, incl. KDE Frameworks 5 |
From: David F. <fa...@kd...> - 2013-01-23 14:05:18
|
On Monday 27 August 2012 15:25:14 Marc Mutz wrote: > If atomic loads and stores on x86 are implemented with a volatile cast, > then the compiler can't reorder stuff around them, either. Not more than > with a std::atomic, at least. QAtomic does that. For load-relaxed, Thiago > thinks that a normal read (non-volatile) is correct and I couldn't prove > him wrong. I was talking to Julian about this again today, and he pointed me to this writeup: http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming We're looking at how to silence valgrind about Qt atomic ops, but before that, it would actually be good to be sure that what Qt does it correct, on x86.... Where does the claim about "volatile cast means the compiler can't reorder stuff around them" come from? In the Qt source code I see qatomic_gcc.h (which is unused, unfortunately) calling __sync_synchronize() in loadAcquire(). Shouldn't Qt's code call that, when compiled with gcc? This does lead to different assembly, too, on x86. So the claim that x86 doesn't need memory barriers seems wrong? --- testcase_atomic_ops_helgrind.s.orig 2013-01-23 15:04:20.889417624 +0100 +++ testcase_atomic_ops_helgrind.s 2013-01-23 15:07:06.938422071 +0100 @@ -380,6 +380,7 @@ _ZN10QAtomicOpsIiE11loadAcquireERKi: movq -24(%rbp), %rax movl (%rax), %eax movl %eax, -4(%rbp) + mfence movl -4(%rbp), %eax popq %rbp .cfi_def_cfa 7, 8 @@ -403,6 +404,7 @@ _ZN10QAtomicOpsIiE12storeReleaseERii: movq -8(%rbp), %rax movl -12(%rbp), %edx movl %edx, (%rax) + mfence popq %rbp .cfi_def_cfa 7, 8 ret -- David Faure, fa...@kd..., http://www.davidfaure.fr Working on KDE, in particular KDE Frameworks 5 |
From: Julian S. <js...@ac...> - 2013-01-23 14:58:27
|
On 01/23/2013 03:08 PM, David Faure wrote: > I was talking to Julian about this again today, and he pointed me to this writeup: > > http://software.intel.com/en-us/blogs/2007/11/30/volatile-almost-useless-for-multi-threaded-programming > > We're looking at how to silence valgrind about Qt atomic ops, but before that, > it would actually be good to be sure that what Qt does it correct, on x86.... > > Where does the claim about "volatile cast means the compiler can't reorder stuff around them" > come from? I think that claim is incorrect. It may be that the compiler is not allowed to reorder volatile stores, but for sure it can reorder volatile stores w.r.t. non-volatile stores, which makes the volatile stores completely useless from a multithreading perspective. That writeup contains a nice example: volatile int Ready; int Message[100]; void foo( int i ) { Message[i/10] = 42; Ready = 1; } Imagine that 'message' is some chunk of info that we intend to give to another thread, and 'Ready' is a flag that we intend to use to tell the other thread when the message is ready to be read. gcc-4.7.2 -m32 -O2 produces this assembly (omitting the CFI junk): foo: movl 4(%esp), %ecx movl $1717986919, %edx movl $1, Ready <---------- Ready = 1; movl %ecx, %eax imull %edx sarl $31, %ecx sarl $2, %edx subl %ecx, %edx movl $42, Message(,%edx,4) <--------- Message[i / 10] = 42; ret so the volatile clearly didn't have the claimed effect. Bottom line is, as I understand it, that C++11 allows implementations (compiler + hardware) to assume code is race free, and optimise accordingly. On x86, we have the added guarantee that the hardware will deliver stores in order, but the compiler is still free to reorder ordinary stores. J |
From: Dave G. <go...@mc...> - 2013-01-23 15:22:59
|
On Jan 23, 2013, at 8:08 AM CST, David Faure wrote: > On Monday 27 August 2012 15:25:14 Marc Mutz wrote: >> If atomic loads and stores on x86 are implemented with a volatile cast, >> then the compiler can't reorder stuff around them, either. Not more than >> with a std::atomic, at least. QAtomic does that. For load-relaxed, Thiago >> thinks that a normal read (non-volatile) is correct and I couldn't prove >> him wrong. […] > Where does the claim about "volatile cast means the compiler can't reorder stuff around them" > come from? Julian already addressed this in a separate post, and I agree with his assessment re: access to volatile-qualified variables. > In the Qt source code I see qatomic_gcc.h (which is unused, unfortunately) calling > __sync_synchronize() in loadAcquire(). Shouldn't Qt's code call that, when compiled with gcc? If you don't want to write inline assembly, this might be your best bet. But on TSO systems like x86, you only need a "compiler barrier". In x86 inline assembly syntax, this looks like: __asm__ __volatile__ ( "" ::: "memory" ) This prevents GCC (and compilers that correctly support this syntax) from reordering accesses across this statement or making assumptions about the state of memory across this point. The only case you need to worry about on x86 (assuming you are not fiddling with "special" memory) is that earlier stores could be reordered after subsequent loads. That should be the only time you need the real "mfence" instructions that you have below. "load-acquire" and "store-release" don't fall into that category. > This does lead to different assembly, too, on x86. > So the claim that x86 doesn't need memory barriers seems wrong? > > --- testcase_atomic_ops_helgrind.s.orig 2013-01-23 15:04:20.889417624 +0100 > +++ testcase_atomic_ops_helgrind.s 2013-01-23 15:07:06.938422071 +0100 > @@ -380,6 +380,7 @@ _ZN10QAtomicOpsIiE11loadAcquireERKi: > movq -24(%rbp), %rax > movl (%rax), %eax > movl %eax, -4(%rbp) > + mfence > movl -4(%rbp), %eax > popq %rbp > .cfi_def_cfa 7, 8 > @@ -403,6 +404,7 @@ _ZN10QAtomicOpsIiE12storeReleaseERii: > movq -8(%rbp), %rax > movl -12(%rbp), %edx > movl %edx, (%rax) > + mfence > popq %rbp > .cfi_def_cfa 7, 8 > ret > -Dave |
From: Julian S. <js...@ac...> - 2013-01-23 15:48:24
|
On 01/23/2013 04:22 PM, Dave Goodell wrote: > If you don't want to write inline assembly, this might be your best bet. > But on TSO systems like x86, you only need a "compiler barrier". > In x86 inline assembly syntax, this looks like: > > __asm__ __volatile__ ( "" ::: "memory" ) > > This prevents GCC (and compilers that correctly support this syntax) > from reordering accesses across this statement or making assumptions > about the state of memory across this point. Yeah, you're right. Thanks for bringing clarity to my (concurrently-) fried brain. To make a stab at an executive summary, then: * in intentionally racey code, we need to worry about both hardware and compiler reordering stores. That requires, at a minimum, inserting suitable load and store fences. * on x86/x86_64, we only need to worry about the compiler reordering stores, since the hardware promises not to. That means all we need is a compiler barrier, as you show -- probably on both sides of both the load and store. * regardless of the above 2 points, since C++11 allows compilers to assume code is race free, we can't rule out the possibility that the compiler does some other legal transformation which makes the code not work as the authors intended. Because of this last point, overall I'm still with Pat LoPresti, who summarised the situation admirably thus: "There is no such thing as a benign data race. Ever." (v-users, 4 Dec 2012) IMO, the idea that some races are harmless is a dangerous myth that we should try to stamp out. J |
From: David F. <fa...@kd...> - 2012-10-24 06:18:58
|
Hi, I'm finally coming back to this issue. On Sunday 26 August 2012 12:49:27 Julian Seward wrote: > > I'm not sure what can be done then, to avoid a helgrind warning. > > If you can guarantee that "// some calculation goes here" touches only > thread-local state, then there is only a race on sharedInt itself. In > which case, use VALGRIND_HG_{DISABLE,ENABLE}_CHECKING to disable reporting > on the relevant specific instances of the sharedInt. This seems to be what I need. VALGRIND_HG_DISABLE_CHECKING(&_q_value, sizeof(_q_value)); in loadAcquire silences the warning. Surprisingly, VALGRIND_HG_ENABLE_CHECKING doesn't appear to work, though. All races are suppressed, even obvious races that warn if disable+enable was never used before. Testcase attached, see the call to oops(). In my tests, ENABLE_CHECKING basically behaves like DISABLE_CHECKING (for instance if you simply put a ENABLE_CHECKING at the beginning of loadAcquire and nothing else, then there's no warning at all anymore). > ---------- > > My understanding of this is that it is in violation of the C++11 memory > model, which says that "the implementation" may deliver stores from one > core to another in any order, in the absence of any C++11 mandated inter- > thread synchronisation. > > You can argue that for x86 the hardware's TSO guarantees make this > harmless ... > > > Marc Mutz said " > > The standard says it's racy, but the implementation of > > ... but AIUI, "the implementation" also includes the compiler, and I > believe it has been observed that GCC will indeed break your code in > unexpected ways, in some cases. In short you need to prove that not > only the hardware won't reorder stores between cores -- which for x86 > happens to be true -- but also the compiler won't. Yes but AFAIU that's what the "volatile" does -- prevent the compiler from reordering. However valgrind can't possibly find out that "volatile" was used, if all that does is disable compiler optimizations, so I agree that this cannot all work out of the box, valgrind-specific markup is definitely needed in the Qt atomics class. Current version of my patch attached -- no re-ENABLE for now, since it doesn't work anyway ;) -- David Faure, fa...@kd..., http://www.davidfaure.fr Working on KDE, in particular KDE Frameworks 5 |
From: David F. <fa...@kd...> - 2013-01-19 09:57:23
|
For a different approach to this issue: if helgrind can't reliably detect atomic operations, as discussed before, and if VALGRIND_HG_ENABLE_CHECKING is broken (see my previous email in this thread), then a simple solution is to add suppressions. { QBasicAtomicPointer_load Helgrind:Race fun:_ZNK19QBasicAtomicPointer*loadAcquireEv } has cut down the noise considerably in my tests. Plus a few similar ones, like QBasicAtomicInt. Historically we had Qt-related suppressions in a suppression file shipped with KDE, but there are many more Qt users than the KDE developers. Would it be OK to ship a qt5.supp file within valgrind, and load it unconditionnally, like xfree-4.supp is currently handled? -- David Faure, fa...@kd..., http://www.davidfaure.fr Working on KDE, in particular KDE Frameworks 5 |
From: Dave G. <go...@mc...> - 2013-01-23 16:33:16
|
On Jan 23, 2013, at 9:55 AM CST, Thiago Macieira wrote: > On quarta-feira, 23 de janeiro de 2013 09.22.49, Dave Goodell wrote: >> >> If you don't want to write inline assembly, this might be your best bet. >> But on TSO systems like x86, you only need a "compiler barrier". In x86 >> inline assembly syntax, this looks like: >> >> __asm__ __volatile__ ( "" ::: "memory" ) >> >> This prevents GCC (and compilers that correctly support this syntax) from >> reordering accesses across this statement or making assumptions about the >> state of memory across this point. > > Indeed, but it also prevents proper merging of stores and loads. For example, > if x is an atomic variable, the compiler can and should merge these two > stores: > > x = 1; > x = 2; > > It's a trade-off until we have the proper implementation with std::atomic. Agreed. My comments refer to the current common state of the world, not a C11/C++11 world. That's all too new for me to be able to rely on its availability yet. I am eagerly awaiting the day that inline assembly and atomic libraries can be ditched altogether. >> The only case you need to worry about on x86 (assuming you are not fiddling >> with "special" memory) is that earlier stores could be reordered after >> subsequent loads. That should be the only time you need the real "mfence" >> instructions that you have below. "load-acquire" and "store-release" don't >> fall into that category. > > Which is why the qatomic_gcc.h implementation is a last resort for Qt. Since > it does a full memory barrier including mfence, it's way too expensive. > > Note that x86 does have sfence and lfence instructions too. I should go ask > some colleagues at Intel about when they should be used, because I haven't yet > found a case where they are needed. It does, but AIUI they are only needed for write-combining (WC) memory regions, such as memory mapped video card frame buffers. > Here's the listing of the current implementations of atomics in Qt and their > drawbacks, in the order that they are preferred: > > 0) MSVC: applies only to MSVC, using intrinsics > > 1) Arch-specific implementation: uses assembly for the fetchAndAdd, testAndSet > and fetchAndStore operations, but uses direct volatile loads/stores for > loadAcquire and storeRelease, and non-volatile loads/stores for load/store, > which are subject to reordering by the compiler. This is the default on x86 > with GCC, Clang and ICC. The above will probably only work on Itanium machines where compilers seem to emit acquire/release suffixes on volatile load/store operations. You should probably update this implementation to include compiler barriers. > 2) std::atomic implementation, if std::atomic is supported. This appeared on > GCC in 4.6, but was implemented using the old intrinsics with full barrier. In > GCC 4.7, it uses the new intrinsics that support more relaxed barriers. This > works fine for x86, but the implementation is incredibly sub-optimal on other > architectures (uses locks on ARM). > > 3) GCC old intrinsics implementation, the one with full barriers. > > From GCC's plans, the implementation in 4.8 will solve the implementation > issues on other architectures, so we may start using that. That means Qt 5.2. Makes sense. > In any of the implementations, all the code is inlined, so there's nothing for > valgrind to react to. The best we could do is insert some of valgrind's no-op > hints, on debug or special builds. Can you recommend what hints to add? Julian/Bart/etc. may have more to add here. I remember having trouble with annotating load-acquire/store-release in the past. Here's the (only partially helpful) thread on the topic: > May I also suggest an out-of-line marker strategy, similar to what the Linux > kernel does for catching processor faults, and the ABI does for exceptions? If > we had this, we'd leave the markers enabled even in release builds. I'm not sure I'm familiar with this technique. Do you have a link to any reading (even code is fine) that I could do on the subject? -Dave |
From: Dave G. <go...@mc...> - 2013-01-23 16:35:15
|
On Jan 23, 2013, at 10:33 AM CST, Dave Goodell wrote: > Julian/Bart/etc. may have more to add here. I remember having trouble with annotating load-acquire/store-release in the past. Here's the (only partially helpful) thread on the topic: Sorry for the extra email. I accidentally whacked the "send email" key combination instead of the "paste" key combination. The link that I intended to send was: http://thread.gmane.org/gmane.comp.debugging.valgrind.devel/11471/focus=11480 -Dave |