|
From: Miguel O. <mig...@gm...> - 2008-12-07 22:07:10
|
Hello!
I have used Valgrind several times and it is an amazing program, thank
you for developing it! I am studying basic thread stuff and I just
wrote a producer-consumer program using a circular buffer. I was
pretty sure that my program was giving the right answers; however, I
ran it through Helgrind and found "possible races" sometimes.
The basic producer-consumer scheme has 2 semaphores (data & free)
and 1 mutex for accessing the buffer. Instead of that, I use 2 mutexes,
one for writing and another one for reading. Example code:
int buffer[size];
int in = 0;
int out = 0;

void producer()
{
    sem_wait(&free);
    pthread_mutex_lock(&mutex_producer);
    buffer[in] = ...;
    in = (in + 1) % size;
    pthread_mutex_unlock(&mutex_producer);
    sem_post(&data);
}

void consumer()
{
    sem_wait(&data);
    pthread_mutex_lock(&mutex_consumer);
    ... = buffer[out];
    out = (out + 1) % size;
    pthread_mutex_unlock(&mutex_consumer);
    sem_post(&free);
}
I wrote a simple test suite and ran it for a couple of hours, and it
did not fail even a single time. Still, I investigated a bit about the
producer-consumer scheme and tried to find out whether this double-mutex
approach is correct for circular buffers. I did not see any place where
independent mutexes were used, so I tried to make my program crash. I
could not, so I tried Helgrind. After many tests, I found that:
1. If I just use one mutex, Helgrind never detects anything.
2. If I use two mutexes, Helgrind sometimes detects races. Still, the
program's output is correct. I have reduced the tests to 4 threads (2
producers, 2 consumers) using a 2-position circular buffer and only
sending two pieces of data (one per producer). Example trace of that
case:
P0              P1              C0              C1
wait(f)         wait(f)         wait(d)         wait(d)
lock(p)         lock(p)         BLK             BLK
write(0)        BLK             BLK             BLK
unlock(p) -> P1 BLK             BLK             BLK
post(d) -> C0   write(1)        BLK             BLK
---             unlock(p)       lock(c)         BLK
                post(d) -> C1   read(0)         BLK
                ---             ...             lock(c)
                                unlock(c) -> C1 BLK
                                post(f)         read(1)
                                ---             unlock(c)
                                                post(f)
                                                ---
d/f = data/free semaphores, p/c = producer/consumer mutexes, 0/1 =
buffer's memory positions
As far as I understand, each producer writes to a different buffer
position, and each consumer, alone, reads a different position of
(already) written memory.
Using just one mutex, the difference is that C0 could not do read(0)
while P1 is doing write(1), but those are different memory positions, so
I do not understand where the problem is.
I do not know if I am really mistaken or Helgrind is spotting something else.
Thank you!
Miguel Ojeda
|
|
From: Bart V. A. <bar...@gm...> - 2008-12-08 09:39:46
|
On Sun, Dec 7, 2008 at 11:07 PM, Miguel Ojeda <mig...@gm...> wrote:
> I do not know if I am really mistaken or Helgrind is spotting something else.

As far as I can see Helgrind was behaving as it should. IMHO your e-mail is
a general question about multithreading, and it might be a good idea to
repost your e-mail on a mailing list intended for multithreaded programming
questions.

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-12-08 10:34:59
|
On Monday 08 December 2008, Bart Van Assche wrote:
> On Sun, Dec 7, 2008 at 11:07 PM, Miguel Ojeda <mig...@gm...> wrote:
> > I do not know if I am really mistaken or Helgrind is spotting something
> > else.
>
> As far as I can see Helgrind was behaving as it should.

Miguel, can you send a complete runnable test program that shows
the problem?

Helgrind in the 3.3.x series sometimes does report races that do not exist
(false positives), and that might be the case here. Helgrind in the
development sources does not have those problems.

I suggest you check out and build the svn trunk as described at

  http://valgrind.org/downloads/repository.html

and use either Helgrind or Drd (--tool=helgrind or --tool=drd). Both
of these use algorithms which don't give false positives.

J
|
|
From: Miguel O. <mig...@gm...> - 2008-12-08 15:28:41
|
On Mon, Dec 8, 2008 at 11:21 AM, Julian Seward <js...@ac...> wrote:
> On Monday 08 December 2008, Bart Van Assche wrote:
>> On Sun, Dec 7, 2008 at 11:07 PM, Miguel Ojeda <mig...@gm...> wrote:
>> > I do not know if I am really mistaken or Helgrind is spotting something
>> > else.
>>
>> As far as I can see Helgrind was behaving as it should.
>
> Miguel, can you send a complete runnable test program that shows
> the problem?

Sure, below there is a program I coded that triggers it (I tried to make it
trigger often, still...). By the way, I hope GMail will not linewrap it too
much.

> Helgrind in the 3.3.x sometimes does report races that do not exist
> (false positives), and that might be the case here. Helgrind in the
> development sources does not have those problems.
>
> I suggest you check out and build the svn trunk as described at

Thank you! I will try it ASAP.

> http://valgrind.org/downloads/repository.html
>
> and use either Helgrind or Drd (--tool=helgrind or --tool=drd). Both
> of these use algorithms which don't give false positives.
>
> J

gcc -ansi -pedantic -Wall -lpthread test.c
valgrind --tool=helgrind ./a.out
---
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <pthread.h>
#include <semaphore.h>

#define BUFFER_MAX (2)

typedef int data_t;

typedef struct {
    sem_t data, free;
    size_t in, out;
    pthread_mutex_t mutex_in, mutex_out;
    /* Some example data */
    data_t buffer[BUFFER_MAX];
} buffer_t;

void buffer_init(buffer_t * b)
{
    sem_init(&b->data, 0, 0);
    sem_init(&b->free, 0, BUFFER_MAX);
    pthread_mutex_init(&b->mutex_in, NULL);
    pthread_mutex_init(&b->mutex_out, NULL);
    b->in = 0;
    b->out = 0;
}

void buffer_recv(buffer_t * b, data_t * d)
{
    sem_wait(&b->data);
    pthread_mutex_lock(&b->mutex_out);
    memcpy(d, b->buffer + b->out, sizeof(data_t));
    b->out = (b->out + 1) % BUFFER_MAX;
    pthread_mutex_unlock(&b->mutex_out);
    sem_post(&b->free);
}

void buffer_send(buffer_t * b, data_t * d)
{
    sem_wait(&b->free);
    pthread_mutex_lock(&b->mutex_in);
    memcpy(b->buffer + b->in, d, sizeof(data_t));
    b->in = (b->in + 1) % BUFFER_MAX;
    pthread_mutex_unlock(&b->mutex_in);
    sem_post(&b->data);
}

void buffer_destroy(buffer_t * b)
{
    sem_destroy(&b->data);
    sem_destroy(&b->free);
    pthread_mutex_destroy(&b->mutex_in);
    pthread_mutex_destroy(&b->mutex_out);
}

buffer_t b;

void producer(int id)
{
    buffer_send(&b, &id);
    pthread_exit(NULL);
}

#define MAXSLEEP (100 * 1000)

void consumer(int id)
{
    int d;
    srand(time(NULL));
    usleep(rand() % MAXSLEEP);
    buffer_recv(&b, &d);
    printf("%i: %i\n", id, d);
    pthread_exit(NULL);
}

#define THREADS (10)

int main()
{
    pthread_t producers[THREADS];
    pthread_t consumers[THREADS];
    int i;

    buffer_init(&b);

    for (i = 0; i < THREADS; ++i)
        pthread_create(producers + i, NULL,
                       (void * (*)(void *)) producer, (void *) i);
    for (i = 0; i < THREADS; ++i)
        pthread_create(consumers + i, NULL,
                       (void * (*)(void *)) consumer, (void *) i);

    for (i = 0; i < THREADS; ++i) {
        pthread_join(producers[i], NULL);
        pthread_join(consumers[i], NULL);
    }

    buffer_destroy(&b);

    return 0;
}
|
|
From: Bart V. A. <bar...@gm...> - 2008-12-08 10:46:31
|
On Mon, Dec 8, 2008 at 11:21 AM, Julian Seward <js...@ac...> wrote:
> Helgrind in the 3.3.x sometimes does report races that do not exist
> (false positives), and that might be the case here. Helgrind in the
> development sources does not have those problems.

I know. The problem in the source code Miguel posted on this list is that
buffer[] is not protected by the same mutex in both the producer and the
consumer thread -- in other words, inconsistent locking. Any data race
detection tool will complain about such constructs. There are two possible
solutions: either make the locking consistent or make the buffer[] accesses
atomic.

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-12-09 09:20:04
|
On Monday 08 December 2008, Bart Van Assche wrote:
> I know. The problem in the source code Miguel posted on this list is
> that buffer[] is not protected by the same mutex in both the producer
> and consumer thread, or, in other words, inconsistent locking.

Certainly there are two mutexes, but there are also two semaphores in use.
I'm not convinced that Helgrind-3.3.1 is correct in reporting an error.

Using Helgrind-trunk and Drd-trunk, there's a lot of noise resulting from
the use of pthread_exit, which causes some apparently un-threadsafe stack
unwinding to happen (should be suppressed, I suppose), and a lot of noise
from rand(), since it's not threadsafe:

  The function rand() is not reentrant or thread-safe, since it uses
  hidden state that is modified on each call. This might just be the
  seed value to be used by the next call, or it might be something more
  elaborate. In order to get reproducible behavior in a threaded
  application, this state must be made explicit. The function rand_r()
  is supplied with a pointer to an unsigned int, to be used as state.

But neither Drd nor Helgrind-trunk report any races (that I can see) in
the basic producer/consumer code.

In this sort of situation, one way to proceed is for the code's author
(Miguel) to provide a semi-formal argument about why the code is correct.
Presumably he has some intuition about why there should be no races in it.

J
|
|
From: Bart V. A. <bar...@gm...> - 2008-12-09 11:23:04
|
On Tue, Dec 9, 2008 at 10:06 AM, Julian Seward <js...@ac...> wrote:
> On Monday 08 December 2008, Bart Van Assche wrote:
>
>> I know. The problem in the source code Miguel posted on this list is
>> that buffer[] is not protected by the same mutex in both the producer
>> and consumer thread, or, in other words, inconsistent locking.
>
> Certainly there are two mutexes, but there are also two semaphores
> used too. I'm not convinced that Helgrind-3.3.1 is correct in
> reporting an error.
>
> Using Helgrind-trunk and Drd-trunk, there's a lot of noise resulting
> from the use of pthread_exit, which causes some apparently un-threadsafe
> stack unwinding to happen (should be suppressed, I suppose) and a lot of
> noise from rand() since it's not threadsafe.
>
> But neither Drd nor Helgrind-trunk report any races (that I can see)
> in the basic producer/consumer code.
>
> In this sort of situation, one way to proceed is for the code's author
> (Miguel) to provide a semi-formal argument about why the code is correct.
> Presumably he has some intuition about why there should be no races in it.

Regarding rand(): I have just added a suppression pattern for random_r().
While using rand() in a multithreaded program may lead to nondeterministic
behavior, this may be valid use of the function rand().

Can you please tell me on which Linux distribution drd-trunk printed noise
for the call to pthread_exit()? At least on Fedora 6 I have not been able
to reproduce this behavior.

drd-trunk definitely does report races on the basic producer-consumer code.
The first race reported by drd-trunk is about the producer-consumer
mechanism (after having moved the srand(time(NULL)) call to the start of
main()):

$ ./vg-in-place --tool=drd ./test-miguel-ojeda
...
==11549== Thread 11:
==11549== Conflicting load by thread 11/13 at 0x00601344 size 4
==11549==    at 0x400A41: buffer_recv (in /home/bart/software/valgrind/test-miguel-ojeda)
==11549==    by 0x400BAB: consumer (in /home/bart/software/valgrind/test-miguel-ojeda)
==11549==    by 0x4A0A538: vg_thread_wrapper (drd_pthread_intercepts.c:193)
==11549==    by 0x3B002062F6: start_thread (in /lib64/libpthread-2.5.so)
==11549==    by 0x3AFF6CE86C: clone (in /lib64/libc-2.5.so)
==11549== Allocation context: BSS section of /home/bart/software/valgrind/test-miguel-ojeda
...

My comments on the source code posted by Miguel are as follows:
* If you want random numbers, srand() should be called once, and it should
  be called from the main thread, not from the consumer thread.
* A pattern I recognize is that the semaphores buffer_t::data and
  buffer_t::free are used as thread-safe counters with thread wake-up
  semantics. Although semaphores can be used to ensure mutual exclusion,
  this is not the case for buffer_t::data and buffer_t::free.
* While buffer_t::buffer, buffer_t::in and buffer_t::out are accessed by
  both the producer and the consumer threads, this data is not protected
  consistently by locking. Either mutex_out and mutex_in should be replaced
  by a single mutex, or all accesses to the aforementioned three variables
  should be replaced by atomic accesses.

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-12-13 00:36:49
|
On Tuesday 09 December 2008, Bart Van Assche wrote:
> Can you please tell me on which Linux distribution drd-trunk printed
> noise for the call to pthread_exit()? At least on Fedora 6 I have not
> been able to reproduce this behavior.

Sorry for the slow response. I got tied up chasing and fixing a bug in the
generic suppression matching code.

Anyway, my mistake -- I can't reproduce this now. On OS11.0 (amd64) drd
shows no noise regarding pthread_exit.

J
|
|
From: Bart V. A. <bar...@gm...> - 2008-12-09 12:33:43
|
On Tue, Dec 9, 2008 at 10:06 AM, Julian Seward <js...@ac...> wrote:
> Using Helgrind-trunk and Drd-trunk, there's a lot of noise resulting
> from the use of pthread_exit, which causes some apparently un-threadsafe
> stack unwinding to happen (should be suppressed, I suppose) [ ... ]

Hello Julian,

In my opinion the following Helgrind-trunk behavior is incorrect:
* Helgrind-trunk does not complain on any access of buffer_t::in,
  buffer_t::out or buffer_t::buffer.
* Helgrind-trunk complains on pthread_exit().

If you want to know why drd-trunk doesn't complain on pthread_exit(), see
also http://article.gmane.org/gmane.comp.debugging.valgrind.devel/3326

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-12-13 01:03:40
|
On Tuesday 09 December 2008, Bart Van Assche wrote:
> In my opinion the following Helgrind-trunk behavior is incorrect:
> * Helgrind-trunk does not complain on any access of buffer_t::in,
>   buffer_t::out or buffer_t::buffer.

As far as I can see, buffer_t::in is only accessed when buffer_t::mutex_in
is held. So that looks safe to me. Similarly, buffer_t::out is only
accessed when buffer_t::mutex_out is held.

It's harder to make an argument for why there is no race on
buffer_t::buffer. That's really something that Miguel should be able to
tell us. My guess is as follows:

Firstly, can the writer (buffer_send) wrap around and overtake the reader
(buffer_recv)? I don't think so, because buffer_t::free, which is
initialised to BUFFER_MAX, will block the writer after BUFFER_MAX writes
without a read. Dually, buffer_t::data, which is initialised to zero,
stops the reader overtaking the writer.

So with these two semaphores, we are guaranteed that each element in
buffer is strictly: written, read, written, read, etc.

After each write the writer posts to buffer_t::data; only then can the
reader read it. ==> no race there.

After each read, the reader posts to buffer_t::free; only then can the
writer put a new value in that slot. ==> no race there.

So (assuming the above is correct!) I think there really is no race, and
Helgrind-trunk is correct not to complain.

> * Helgrind-trunk complains on pthread_exit(). If you want to know why
>   drd-trunk doesn't complain on pthread_exit(), see also
>   http://article.gmane.org/gmane.comp.debugging.valgrind.devel/3326

Ah yes, I had forgotten about that page-at-the-top-of-the-stack stuff.

J
|
|
From: Miguel O. <mig...@gm...> - 2008-12-13 12:52:34
|
On Tue, Dec 9, 2008 at 12:22 PM, Bart Van Assche <bar...@gm...> wrote:
> Regarding rand(): I have just added a suppression pattern for
> random_r(). While using rand() in a multithreaded program may lead to
> nondeterministic behavior, this may be valid use of the function
> rand().

rand(), srand() and other stuff in the program are irrelevant, aren't
they? Just delete them if you want. I added them in order to trigger
Helgrind easily by adding some unpredictable timing between threads.

On Sat, Dec 13, 2008 at 1:50 AM, Julian Seward <js...@ac...> wrote:
> On Tuesday 09 December 2008, Bart Van Assche wrote:
>> In my opinion the following Helgrind-trunk behavior is incorrect:
>> * Helgrind-trunk does not complain on any access of buffer_t::in,
>>   buffer_t::out or buffer_t::buffer.
>
> As far as I can see, buffer_t::in is only accessed when
> buffer_t::mutex_in is held. So that looks safe to me.
> Similarly, buffer_t::out is only accessed when buffer_t::mutex_out
> is held.
>
> It's harder to make an argument of why there is no race on
> buffer_t::buffer. That's really something that Miguel should be able
> to tell us. My guess is as follows:

Sorry for the delay. We are facing the classic producer-consumer problem,
just with two mutexes, one for reading and one for writing, instead of
just one for both.

> Firstly, can the writer (buffer_send) wrap around and overtake the
> reader (buffer_recv)? I don't think so, because buffer_t::free,
> which is initialised to BUFFER_MAX, will block the writer after
> BUFFER_MAX writes without a read. Dually buffer_t::data, which
> is initialised to zero, stops the reader overtaking the writer.
>
> So with these two semaphores, we are guaranteed that each element
> in buffer is strictly: written, read, written, read, etc.
>
> After each write the writer posts to buffer_t::data; only then can
> the reader read it. ==> no race there.
>
> After each read, the reader posts to buffer_t::free; only then can
> the writer put a new value in that slot. ==> no race there.
>
> So (assuming the above is correct!) I think there really is no race,
> and Helgrind-trunk is correct not to complain.

Indeed, I think the same: I can not see any situation of race in the
program (and testing seems to agree with me); still, valgrind-3.3.1
complains. Maybe there was a bug there, already fixed on trunk.

Let's try writing a "proof". If I am not mistaken, there can be no races
iff

  (in = out) => ((d = 0 and r = 0) xor (f = 0 and w = 0))

where:
  in  = b->in
  out = b->out
  d   = b->data
  f   = b->free
  r   = "readers between their sem_wait() and pthread_mutex_lock()"
  w   = "writers between their sem_wait() and pthread_mutex_lock()"

The initial condition is correct, as in = out, d = 0 and r = 0. Moreover,
every (in = out) case falls within one of these two cases:

1. The readers catch the writers: "out" has been incremented by one,
   equaling "in". d = 0 because the readers caught the writers, so there
   is no more data to read ([1] in order to increment d we have to
   decrement f first, so writers work before any reader does). [2] r = 0
   for the same reason: if r > 0, then someone would have decremented d,
   which can only happen if d > 0. However, [1].
2. The writers catch the readers: same rationale.

End of "proof". Well, this is a wannabe-proof: I am not sure of the
reasons I wrote after the [2]. I tried using reductio ad absurdum and it
seems even harder to explain clearly. Some ideas?

>> * Helgrind-trunk complains on pthread_exit(). If you want to know why
>> drd-trunk doesn't complain on pthread_exit(), see also
>> http://article.gmane.org/gmane.comp.debugging.valgrind.devel/3326
>
> Ah yes, I had forgotten about that page-at-the-top-of-the-stack stuff.
>
> J
|
|
From: Bart V. A. <bar...@gm...> - 2008-12-14 09:30:32
|
On Sat, Dec 13, 2008 at 1:52 PM, Miguel Ojeda <mig...@gm...> wrote:
> On Tue, Dec 9, 2008 at 12:22 PM, Bart Van Assche <bar...@gm...> wrote:
>>
>> Regarding rand(): I have just added a suppression pattern for
>> random_r(). While using rand() in a multithreaded program may lead to
>> nondeterministic behavior, this may be valid use of the function
>> rand().
>
> rand(), srand() and other stuff in the program are irrelevant, aren't
> they? Just delete them if you want. I added them in order to trigger
> helgrind easily by adding some unpredictable timing between threads.

Yes, rand() and srand() are irrelevant in the discussion of the
producer/consumer algorithm.

> Indeed, I think the same: I can not see any situation of race in the
> program (and testing seems to agree with me); still, valgrind-3.3.1
> complains.

I had a closer look at your algorithm, and it looks safe to me. I was
relying too much on the error message reported by DRD. So there must be a
subtle bug in the way DRD updates the vector clocks inside its sem_wait()
and/or sem_post() wrappers -- I'm currently examining this.

Since this issue was not yet discovered by any other DRD regression test,
would you mind if I add your test program to the set of DRD regression
tests? I need your permission before your test program can be
redistributed under the GPLv2 license.

Bart.
|
|
From: Miguel O. <mig...@gm...> - 2008-12-14 16:27:14
|
On Sun, Dec 14, 2008 at 10:30 AM, Bart Van Assche
<bar...@gm...> wrote:
> On Sat, Dec 13, 2008 at 1:52 PM, Miguel Ojeda
> <mig...@gm...> wrote:
>>
>> On Tue, Dec 9, 2008 at 12:22 PM, Bart Van Assche
>> <bar...@gm...> wrote:
>> >
>> > Regarding rand(): I have just added a suppression pattern for
>> > random_r(). While using rand() in a multithreaded program may lead to
>> > nondeterministic behavior, this may be valid use of the function
>> > rand().
>>
>> rand(), srand() and other stuff in the program are irrelevant, aren't
>> they?. Just delete them if you want. I added them in order to trigger
>> helgrind easily by adding some unpredictable timing between threads.
>
> Yes, rand() and srand() are irrelevant in the discussion of the
> producer/consumer algorithm.
>
>> Indeed, I think the same: I can not see any situation of race in the
>> program (and testing seems to agree with me); still, valgrind-3.3.1
>> complains.
>
> I had a closer look at your algorithm, and it looks safe to me. I was
> relying too much on the error message reported by DRD. So there must
> be a subtle bug in the way DRD updates the vector clocks inside its
> sem_wait() and/or sem_post() wrappers -- I'm currently examining this.
> Since this issue was not yet discovered by any other DRD regression
> test, would you mind if I add your test program to the set of DRD
> regression tests ? I need your permission before your test program can
> be redistributed under the GPLv2 license.
Sure, here you have the
Signed-off-by: Miguel Ojeda-Sandonis <mig...@gm...>
if it is needed. By the way, I modified the program to delete the
thread-unsafe calls, please check it.
I tested the program below against helgrind-3.3.1, helgrind-trunk and
drd-trunk. I got the following:
helgrind-3.3.1: Possible data races on the producer/consumer's memcpy()
on almost every run of the test, like the following:
==19632== Possible data race during read of size 4 at 0x8049EBC
==19632== at 0x40CCCAC: memcpy (in /lib/i686/cmov/libc-2.7.so)
==19632== by 0x8048A8D: consumer (test.c:79)
==19632== by 0x402641B: mythread_wrapper (hg_intercepts.c:193)
==19632== by 0x40424BF: start_thread (in /lib/i686/cmov/libpthread-2.7.so)
==19632== by 0x413561D: clone (in /lib/i686/cmov/libc-2.7.so)
==19632== Old state: shared-modified by threads #6, #24
==19632== New state: shared-modified by threads #6, #24, #25
==19632== Reason: this thread, #25, holds no consistent locks
==19632== Last consistently used lock for 0x8049EBC was first observed
==19632== at 0x4026242: pthread_mutex_init (hg_intercepts.c:346)
==19632== by 0x80488A8: buffer_init (test.c:28)
==19632== by 0x8048B0A: main (test.c:91)
helgrind-trunk: No data races; however, sometimes I get the following errors:
==19277== Possible data race during read of size 1 at 0x4dc1d10 by thread #3
==19277== at 0x4DBCA8C: (within /lib/libgcc_s.so.1)
==19277== by 0x404A94F: (within /lib/i686/cmov/libpthread-2.7.so)
==19277== by 0xFFFFFFFF: ???
==19277== by 0xFFFFFFFF: ???
==19277== This conflicts with a previous write of size 1 by thread #2
==19277== at 0x4DBAB6C: (within /lib/libgcc_s.so.1)
==19277== by 0x404A94F: (within /lib/i686/cmov/libpthread-2.7.so)
==19277== by 0xFFFFFFFF: ???
==19277== by 0xFFFFFFFF: ???
drd-trunk: I get the races on the producer/consumer's memcpy() again:
==19532== Thread 7:
==19532== Conflicting load by thread 7/37 at 0x08049ebc size 4
==19532== at 0x40CECAC: memcpy (in /lib/i686/cmov/libc-2.7.so)
==19532== by 0x8048A8D: consumer (test.c:80)
==19532== by 0x40273DD: vg_thread_wrapper (drd_pthread_intercepts.c:193)
==19532== by 0x40444BF: start_thread (in /lib/i686/cmov/libpthread-2.7.so)
==19532== by 0x413761D: clone (in /lib/i686/cmov/libc-2.7.so)
==19532== Allocation context: BSS section of /home/max/valgrind/bin/bin/a.out
==19532== Other segment start (thread 0/4)
==19532== (thread finished, call stack no longer available)
==19532== Other segment end (thread 0/4)
==19532== (thread finished, call stack no longer available)
==19532== Other segment start (thread 0/7)
I compiled using: gcc -Wall -lpthread -g test.c
---
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>

#define THREADS (20)
#define BUFFER_SIZE (2)
#define MAX_SLEEP_TIME (100 * 1000) /* 100 ms */

typedef unsigned int data_t;

typedef struct {
    sem_t data, free;
    size_t in, out;
    pthread_mutex_t mutex_in, mutex_out;
    /* Some example data */
    data_t buffer[BUFFER_SIZE];
} buffer_t;

void buffer_init(buffer_t * b)
{
    sem_init(&b->data, 0, 0);
    sem_init(&b->free, 0, BUFFER_SIZE);
    pthread_mutex_init(&b->mutex_in, NULL);
    pthread_mutex_init(&b->mutex_out, NULL);
    b->in = 0;
    b->out = 0;
}

void buffer_recv(buffer_t * b, data_t * d)
{
    sem_wait(&b->data);
    pthread_mutex_lock(&b->mutex_out);
    memcpy(d, b->buffer + b->out, sizeof(data_t));
    b->out = (b->out + 1) % BUFFER_SIZE;
    pthread_mutex_unlock(&b->mutex_out);
    sem_post(&b->free);
}

void buffer_send(buffer_t * b, data_t * d)
{
    sem_wait(&b->free);
    pthread_mutex_lock(&b->mutex_in);
    memcpy(b->buffer + b->in, d, sizeof(data_t));
    b->in = (b->in + 1) % BUFFER_SIZE;
    pthread_mutex_unlock(&b->mutex_in);
    sem_post(&b->data);
}

void buffer_destroy(buffer_t * b)
{
    sem_destroy(&b->data);
    sem_destroy(&b->free);
    pthread_mutex_destroy(&b->mutex_in);
    pthread_mutex_destroy(&b->mutex_out);
}

buffer_t b;

void producer(unsigned int id)
{
    buffer_send(&b, &id);
    pthread_exit(NULL);
}

void consumer(unsigned int id)
{
    char str[100];
    unsigned int d = id;

    usleep(rand_r(&d) % MAX_SLEEP_TIME);
    buffer_recv(&b, &d);
    sprintf(str, "%u: %u\n", id, d);
    write(fileno(stdout), str, strlen(str));
    pthread_exit(NULL);
}

int main()
{
    volatile unsigned int i;
    pthread_t producers[THREADS];
    pthread_t consumers[THREADS];

    buffer_init(&b);

    for (i = 0; i < THREADS; ++i)
        pthread_create(producers + i, NULL,
                       (void * (*)(void *)) producer, (void *) i);
    for (i = 0; i < THREADS; ++i)
        pthread_create(consumers + i, NULL,
                       (void * (*)(void *)) consumer, (void *) i);

    for (i = 0; i < THREADS; ++i) {
        pthread_join(producers[i], NULL);
        pthread_join(consumers[i], NULL);
    }

    buffer_destroy(&b);

    return 0;
}
|
|
From: Bart V. A. <bar...@gm...> - 2008-12-27 12:41:27
|
On Sun, Dec 14, 2008 at 10:30 AM, Bart Van Assche <bar...@gm...> wrote:
> [ ... ] So there must
> be a subtle bug in the way DRD updates the vector clocks inside its
> sem_wait() and/or sem_post() wrappers -- I'm currently examining this.
[ ... ]

Semaphore handling in DRD should now be correct. The fix has been
committed to the trunk and will be included in the 3.4.0 release.

Bart.
|
|
From: Julian S. <js...@ac...> - 2008-12-14 10:13:31
|
> Indeed, I think the same: I can not see any situation of race in the
> program (and testing seems to agree with me); still, valgrind-3.3.1
> complains. Maybe there was a bug there, already fixed on trunk.

Helgrind in 3.3.1 uses an algorithm which sometimes incorrectly reports
races, which I think is what happened here. (These limitations are
documented in the Helgrind 3.3.1 manual.) Helgrind in the trunk uses a
less aggressive algorithm, which (correctly) reports no races, but the
tradeoff is that the less aggressive algorithm can sometimes miss real
races. It's a difficult problem.

> Let's try writing a "proof":

Note that there are 2 different questions here:

* is your code correct (race-free)?
* is Helgrind-trunk correct to report zero races in your code?

I think the answer to both is "yes", but I was mainly concerned with the
second question :-)

To clarify what exactly Helgrind-trunk is checking: Helgrind-trunk reports
a race on a location when it sees accesses from two different threads to
the location, at least one of which is a write, but it does not see any
synchronisation operation between the two threads. Since no
synchronisation operation is observed, it could be that the accesses
happen in a different order in a different run of the program (based on
relative speeds of the threads). Hence it reports a race.

buffer_t::in is guarded by buffer_t::mutex_in. Accesses by different
threads to buffer_t::in are all separated by a synchronisation operation,
which is: unlock of mutex_in followed by lock of mutex_in. Hence no
reported race. The same logic applies for buffer_t::out.

For the elements in buffer_t::buffer itself, the synchronisation operation
(inter-thread dependency) is created by the post->wait pair on ::data (for
the reader to be able to proceed) and the post->wait pair on ::free (for
the writer to be able to proceed).

Anyway, that doesn't prove or disprove anything about your code. But it
might give you a bit better insight into what Helgrind-trunk and also
Drd-trunk are really checking.

J
|
|
From: Miguel O. <mig...@gm...> - 2008-12-14 16:38:50
|
On Sun, Dec 14, 2008 at 10:59 AM, Julian Seward <js...@ac...> wrote:
> Helgrind in 3.3.1 uses an algorithm which sometimes incorrectly reports
> races, which I think is what happened here. [ ... ]
>
> To clarify what exactly Helgrind-trunk is checking: [ ... ]
>
> For the elements in buffer_t::buffer itself, the synchronisation operation
> (inter-thread dependency) is created by the post->wait pair on ::data
> (for the reader to be able to proceed) and the post->wait pair on ::free
> (for the writer to be able to proceed).

So Helgrind-trunk correctly does not report a data race here; however,
what kind of data races could Helgrind-trunk miss by using the new
algorithm (in order to be aware of them)?

> Anyway, that doesn't prove or disprove anything about your code. But
> it might give you a bit better insight into what Helgrind-trunk and
> also Drd-trunk are really checking.

Thank you for the explanation!

> J

Miguel Ojeda
|
|
From: Miguel O. <mig...@gm...> - 2008-12-14 16:56:47
|
On Sun, Dec 14, 2008 at 10:59 AM, Julian Seward <js...@ac...> wrote:
>
>> Indeed, I think the same: I cannot see any situation of race in the
>> program (and testing seems to agree with me); still, valgrind-3.3.1
>> complains. Maybe there was a bug there, already fixed on trunk.
>
> Helgrind in 3.3.1 uses an algorithm which sometimes incorrectly reports
> races, which I think is what happened here. (These limitations are
> documented in the Helgrind 3.3.1 manual.) Helgrind in the trunk uses
> a less aggressive algorithm, which (correctly) reports no races, but the
> tradeoff is that the less aggressive algorithm can sometimes miss real
> races. It's a difficult problem.

I see... What is the plan for Helgrind-trunk then? I will try a guess:
you are improving the trunk's algorithm step by step so that it can
report more and more races without ever giving false positives.

I think some people would prefer to have the old algorithm, so that if
Helgrind does not report any data race, they can be pretty sure their
program is correct; and if Helgrind reports some data race, they can
add more restrictive locking to their application without losing too
much performance in normal applications. However, other people may
prefer Helgrind to behave as the trunk does, so they can use it like a
"test" (not a proof of correctness, but just another test in the suite).

Indeed, it is a hard problem.

Miguel Ojeda |
|
From: Bart V. A. <bar...@gm...> - 2008-12-14 20:09:37
|
On Sun, Dec 14, 2008 at 5:56 PM, Miguel Ojeda <mig...@gm...> wrote:
> I think some people would prefer to have the old algorithm, so that if
> Helgrind does not report any data race, they can be pretty sure their
> program is correct; and if Helgrind reports some data race, they can
> add more restrictive locking to their application without losing too
> much performance in normal applications.

The algorithm implemented in Helgrind 3.3.1 is also called the Eraser
algorithm, and it checks a locking discipline: it verifies whether shared
memory locations are consistently protected by locks. A disadvantage of
the Eraser algorithm is that it reports false positives on some
algorithms that use synchronization primitives correctly, e.g. the
algorithm you published earlier in this thread.

The algorithm implemented in DRD-trunk is the so-called happens-before
algorithm: conflicting accesses are only reported as a data race if they
are not ordered through synchronization (mutex, semaphore, barrier, ...).
As far as I know, Helgrind-trunk also uses this algorithm. A disadvantage
of the happens-before algorithm is that the errors reported can depend on
how the program being analyzed is scheduled.

Both algorithms are valuable. Unfortunately there is not yet any known
way to combine these two algorithms (Eraser and happens-before) without
also combining the disadvantages of both.

Bart. |
|
From: Julian S. <js...@ac...> - 2008-12-16 09:36:17
|
On Sunday 14 December 2008, Bart Van Assche wrote:
> Both algorithms are valuable. Unfortunately there is not yet any known
> way to combine those two algorithms (Eraser and happens-before)
> without also combining the disadvantages of both algorithms.

It could be argued that, because of this problem, race detection is
really not something very suitable for a dynamic tool. For Miguel's
example it would in fact be better to construct a formal proof (somehow!)
which shows it is race-free regardless of the scheduling.

In other words, maybe static analysis of the program is better. But
obviously that is far beyond the scope of Valgrind, and I would guess
far beyond the state of the art for anything but the smallest programs.

J |
|
From: Bart V. A. <bar...@gm...> - 2008-12-16 10:43:12
|
On Tue, Dec 16, 2008 at 10:22 AM, Julian Seward <js...@ac...> wrote:
> It could be argued that, because of this problem, race detection is
> really not something very suitable for a dynamic tool. For Miguel's
> example it would be better in fact to construct a formal proof
> (somehow!) which shows it is race-free regardless of the scheduling.
>
> In other words, maybe static analysis of the program is better. But
> obviously that is far beyond the scope of Valgrind, and I would guess
> far beyond the scope of the art for anything but the smallest programs.

IMHO this view is too negative. The proper approach for ensuring that a
program does not contain data races is to choose a strategy at design
time that guarantees the program is data-race free (e.g. protecting all
shared data through locking and documenting which synchronization object
protects which data item). Data-race detection tools are only an aid for
verifying that such a strategy has been implemented correctly, not a
replacement for proper software design.

This is similar to e.g. ensuring that a program does not contain memory
leaks. When a program performs explicit dynamic memory allocation, for
each memory allocation call it must be clear who owns the allocated
memory. An effective strategy is to deallocate dynamic memory in the same
context where it was allocated, and, when this is not feasible, to
document how ownership is transferred from a function to its caller.
Although tools like Memcheck make detecting memory leaks easy, choosing a
proper strategy for avoiding memory leaks remains essential. Other
popular strategies for avoiding memory leaks, at least in C++ programs,
are using STL containers and/or smart pointers.

Regarding proving the correctness of the threading behavior of software:
software for which it is hard to prove formally that it is correct is
almost always hard to read too. Source-code level annotations could be a
great help for both static and dynamic analysis tools. E.g. C++ software
is easier to verify if it has been specified per class whether or not
member functions may be called from another thread than the thread in
which the object was constructed. For C++ classes that have been designed
to be accessed from a single thread only, a very effective debugging help
is to store the thread ID of the thread the constructor was called from
(pthread_self()) in a data member, and to verify that thread ID in the
other member functions, e.g. as follows:
assert(pthread_self() == m_constructor_thread_id). Maybe it's possible
to let a static analysis tool perform such an analysis for a properly
annotated program.

Bart. |
|
From: Dallman, J. <joh...@si...> - 2008-12-17 15:28:40
|
> IMHO this view is too negative. The proper approach for ensuring that
> a program does not contain data races is to choose a strategy at
> design time that guarantees that a program is data-race free (e.g.
> protecting all shared data through locking and documenting which
> synchronization object protects which data item).

It would be lovely were it possible to work this way. Sadly, humans make
mistakes, and are often quite bad at realising that they have made them.

> Data-race detection tools are only an aid for verifying that such
> a strategy has been implemented correctly, and not a replacement
> for proper software design.

For software with real-world levels of complexity, verifying every test
case with a data-race detection tool is the work of years. In practice,
one has to use them as aids to debugging.

--
John Dallman
Parasolid Porting Engineer
Siemens PLM Software
46 Regent Street, Cambridge, CB2 1DP, United Kingdom
Tel: +44-1223-371554
joh...@si...
www.siemens.com/plm |
|
From: Julian S. <js...@ac...> - 2008-12-16 10:05:49
|
On Sunday 14 December 2008, Miguel Ojeda wrote:
> > Helgrind in 3.3.1 uses an algorithm which sometimes incorrectly reports
> > races, which I think is what happened here. (These limitations are
> > documented in the Helgrind 3.3.1 manual.) Helgrind in the trunk uses
> > a less aggressive algorithm, which (correctly) reports no races, but the
> > tradeoff is that the less aggressive algorithm can sometimes miss real
> > races. It's a difficult problem.
>
> I see... What is the plan for Helgrind-trunk then? I will try a guess:
> you are improving the trunk's algorithm step by step so that it can
> report more and more races without ever giving false positives.

As per Bart's comments in a previous message, that's not really possible;
or at least I don't know how.

A lot of effort recently went into Helgrind-trunk in order to make it
able to show full stack traces for both accesses when a race is detected.
I believe that's an important facility from a programmer-productivity
perspective. From testing of the previous algorithms on non-trivial
programs (and even on very small programs, in fact), it is obvious that
it is often very difficult to take a race report and find the root cause
of the problem if you don't know where the other conflicting access(es)
is/are.

In other words, the real challenge is not just to detect the races, but
also to collect enough information that the programmer can make sense of
the result. Doing this without a massive space/time overhead seems to be,
let's say, not at all easy.

> I think some people would prefer to have the old algorithm, so that if
> Helgrind does not report any data race, they can be pretty sure their
> program is correct; and if Helgrind reports some data race, they can
> add more restrictive locking to their application without losing too
> much performance in normal applications.

Yes. This is definitely true. Perhaps it would be possible to integrate
both algorithms in Helgrind and have a command line flag to select one at
startup. The problem is to do this without introducing more
maintainability and performance problems in a code base which is already
too complex for my liking.

J |
|
From: Bart V. A. <bar...@gm...> - 2008-12-16 17:11:33
|
On Tue, Dec 16, 2008 at 10:52 AM, Julian Seward <js...@ac...> wrote:
> Yes. This is definitely true. Perhaps it would be possible to integrate
> both algorithms in Helgrind and have a command line flag to select one
> at startup. The problem is to do this without introducing more
> maintainability and performance problems in a code base which is
> already too complex for my liking.

An alternative to integrating both algorithms in one tool is to have two
tools, one with the Eraser algorithm and one with the happens-before
algorithm (when counting DRD, that would make three data-race detection
tools). Do you think this would be a good idea?

Bart. |