|
From: Konstantin S. <kon...@gm...> - 2010-01-13 10:21:54
|
Hi,

Memcheck hangs at the very end of some of our programs with ~5% probability.
I tried running with --trace-signals=yes --trace-syscalls=yes --time-stamp=yes
and here is what I got.

When the program finishes successfully, I see:
00:00:00:23.549 -1351-- sigvgkill for lwp 31360 tid 5
00:00:00:23.549 31351-- sigvgkill for lwp 31363 tid 7
00:00:00:23.549 -1351-- sigvgkill for lwp 31360 tid 5
00:00:00:23.549 -1351-- sigvgkill for lwp 31360 tid 5
--> [pre-success] Success(0x0:0x0)
--00:00:00:23.551 31351-- Caught __NR_exit; running __libc_freeres()
<and the program terminates>

When the program hangs, I see:
--00:00:00:23.987 4983-- sigvgkill for lwp 5193 tid 2
--00:00:00:23.987 4983-- get_thread_out_of_syscall zaps tid 3 lwp 5194
--00:00:00:23.987 4983-- sigvgkill for lwp 5194 tid 3
--00:00:00:23.987 4983-- get_thread_out_of_syscall zaps tid 5 lwp 5196
--00:00:00:23.987 4983-- sigvgkill for lwp 5196 tid 5
--> [pre-success] Success(0x0:0x0)
<nothing else is happening, valgrind hangs until it is killed externally by timeout>

What is worse, I cannot reproduce this on any machine on which I can attach with gdb...
Any idea how to attack the problem?
Any workaround?

Thanks,
--kcc
|
|
From: Julian S. <js...@ac...> - 2010-01-13 21:03:18
|
> Any idea how to attack the problem?
> Any workaround?

Don't know, sorry. Obviously if you can reduce it to a test case
that is small enough to be debuggable, that would be a big help.

J
|
|
From: Konstantin S. <kon...@gm...> - 2010-01-18 09:16:25
|
On Wed, Jan 13, 2010 at 4:54 PM, Julian Seward <js...@ac...> wrote:
> > Any idea how to attack the problem?
> > Any workaround?
>
> Don't know, sorry. Obviously if you can reduce it to a test case
> that is small enough to be debuggable, that would be a big help.

trying...

When strace-ing an already hung process, I see this line:
read(32761, <unfinished ...>

Does it tell you anything?
Does 32761 look like a legitimate fd?

--kcc
|
|
From: Julian S. <js...@ac...> - 2010-01-18 09:41:32
|
On Monday 18 January 2010, Konstantin Serebryany wrote:
> When strace-ing an already hung process, I see this line:
> read(32761, <unfinished ...>
>
> Does it tell you anything?
> Does 32761 look like a legitimate fd?

Hmm, I don't know. I don't think it's legit but am not sure.

J
|
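One way to check whether a descriptor number seen in strace output is actually open in a hung process is to look at its entry under /proc. The sketch below is illustrative only: `fd_target` is a hypothetical helper name, it assumes a Linux system with /proc mounted and permission to inspect the target process, and it says nothing about what Valgrind itself uses descriptor 32761 for.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <limits.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: returns true and fills `target` if descriptor `fd`
// is open in process `pid`, by reading the symlink /proc/<pid>/fd/<fd>.
// Returns false if the link cannot be read (fd closed, no such process,
// or insufficient permission).
bool fd_target(pid_t pid, int fd, std::string& target) {
    const std::string path =
        "/proc/" + std::to_string(pid) + "/fd/" + std::to_string(fd);
    char buf[PATH_MAX];
    const ssize_t n = readlink(path.c_str(), buf, sizeof buf);
    if (n < 0) {
        return false;
    }
    // readlink() does not NUL-terminate, so use the returned length.
    target.assign(buf, static_cast<std::size_t>(n));
    return true;
}
```

For a pipe the link target has the form `pipe:[inode]`, for a regular file it is the path; a `false` result for the hung process's pid would suggest the number in the `read(...)` call is not a valid descriptor there.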
|
From: Konstantin S. <kon...@gm...> - 2010-01-27 09:37:43
|
I've minimized the problem to a small test (below).
It spawns many threads and doesn't join them before exiting.
It will hang (or loop forever) one out of 40-100 runs:
% g++ -g -lpthread hang.cc
% for((i=10;i<=99;i++)); do date; time ~/valgrind/trunk/inst/bin/valgrind \
    --tool=none --trace-syscalls=yes --trace-signals=yes -q ./a.out 2> $i.log ; done
Even nulgrind is affected.
Any suggestion?
Thanks,
--kcc
--------------------------------------------------------------------------------------------------------
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

// Each worker loops forever, so it is still running when main() returns.
void *run(void *p) {
  for (int i = 0; ; i++) {
    usleep(100);
    fprintf(stderr, "T=%d i=%d\n", (int)pthread_self(), i);
  }
  return NULL;
}

int main(int argc, char** argv) {
  for (int i = 0; i < 200; i++) {
    pthread_t t;
    pthread_create(&t, NULL, run, NULL);
  }
  // Deliberately no pthread_join(): main exits while the threads are alive.
  fprintf(stderr, "exiting main\n");
  return 0;
}
--------------------------------------------------------------------------------------------------------
|
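Since the hang shows up only once in 40-100 runs and the hung process has to be killed externally, a small harness that runs a command under a time limit can help count failures. The following is a minimal sketch and not part of the original report: `run_with_timeout` and `Outcome` are hypothetical names, the polling interval is arbitrary, and redirection of the child's stderr is left out.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <thread>
#include <vector>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum class Outcome { Exited, Signaled, TimedOut, SpawnFailed };

// Run `args` (args[0] is looked up in PATH) and wait at most `limit`.
// If the child is still alive at the deadline it is killed with SIGKILL
// and the run is reported as TimedOut, i.e. a suspected hang.
Outcome run_with_timeout(const std::vector<std::string>& args,
                         std::chrono::milliseconds limit) {
    assert(!args.empty());
    std::vector<char*> argv;
    for (const std::string& s : args) {
        argv.push_back(const_cast<char*>(s.c_str()));
    }
    argv.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) {
        return Outcome::SpawnFailed;
    }
    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);  // exec failed; the parent sees a normal exit with 127
    }

    const auto deadline = std::chrono::steady_clock::now() + limit;
    for (;;) {
        int status = 0;
        const pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid) {
            return WIFEXITED(status) ? Outcome::Exited : Outcome::Signaled;
        }
        if (r < 0) {
            return Outcome::SpawnFailed;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);  // reap the killed child
            return Outcome::TimedOut;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

Calling this in a loop with the valgrind command line from the message above, and counting `Outcome::TimedOut` results, would give a rough failure rate. Note that a run that "loops forever" is indistinguishable from a hang in this scheme.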
|
From: Julian S. <js...@ac...> - 2010-01-27 09:44:51
|
On Wednesday 27 January 2010, Konstantin Serebryany wrote:
> I've minimized the problem to a small test (below).

Good that there's a small test case now.

So .. this is on which platform? something-linux, or something-darwin?

J
|
|
From: Konstantin S. <kon...@gm...> - 2010-01-27 09:47:29
|
On Wed, Jan 27, 2010 at 12:59 PM, Julian Seward <js...@ac...> wrote:
> Good that there's a small test case now.
>
> So .. this is on which platform? something-linux, or something-darwin?

I am seeing it on x86_64 Linux (Ubuntu 8.04).
Without the --trace-... flags it hangs more often:
for((i=10;i<=99;i++)); do date; time ~/valgrind/trunk/inst/bin/valgrind --tool=none -q ./a.out 2> $i.log ; done
|
|
From: Julian S. <js...@ac...> - 2010-01-27 17:14:41
|
On Wednesday 27 January 2010, Konstantin Serebryany wrote:
> I've minimized the problem to a small test (below).
> It spawns many threads and doesn't join them before exiting.
> It will hang (or loop forever) one out of 40-100 runs:
> % g++ -g -lpthread hang.cc
> % for((i=10;i<=99;i++)); do date; time ~/valgrind/trunk/inst/bin/valgrind
> --tool=none --trace-syscalls=yes --trace-signals=yes -q ./a.out 2> $i.log
> ; done

Ok; managed to reproduce it. 2 threads were still stuck in some syscall
(don't know which yet). Investigating.

J
|
|
From: Julian S. <js...@ac...> - 2010-01-28 07:26:33
|
On Wednesday 27 January 2010, Julian Seward wrote:
> Ok; managed to reproduce it. 2 threads were still stuck in some syscall
> (don't know which yet). Investigating.

I can reproduce it, but only in the case where there is no logging,
which isn't useful. If you have a logfile where it hangs for
--trace-syscalls=yes --trace-signals=yes, can you compress it and
send it to me? afaics the log is about 40MB long, but it should
bzip2 nicely.

J
|
|
From: Konstantin S. <kon...@gm...> - 2010-01-28 07:37:30
|
Sent a log off list.
With logging on it does not really want to hang.
Instead (with ~5% probability) it loops forever.
I think this is the same bug -- the process misses its own death time...

--kcc

On Thu, Jan 28, 2010 at 10:40 AM, Julian Seward <js...@ac...> wrote:
> I can reproduce it, but only in the case where there is no logging,
> which isn't useful. If you have a logfile where it hangs for
> --trace-syscalls=yes --trace-signals=yes, can you compress it and
> send it to me? afaics the log is about 40MB long, but it should
> bzip2 nicely.
>
> J
|
|
From: Konstantin S. <kon...@gm...> - 2010-02-02 13:24:43
|
Hi Julian,

Any luck with this hang?
Anything I can help with?

--kcc
|
|
From: Julian S. <js...@ac...> - 2010-02-05 16:42:05
|
The log is quite useful. It might be that there is a race
between the handling for sys_clone and for sys_exit_group. I'm not
sure I understand the details though.

sys_exit_group happens when the main thread exits. It marks
all other threads in the same thread group as "to be forced
to exit". If any of these threads are blocked in syscalls
then they are hit on the head with sigvgkill to get them out
of the syscall. Or something like that. (see function
PRE_(sys_exit_group)).

So, I suspect the problem is, there is a child thread
that has just been created by clone
(by a call to do_syscall_clone_amd64_linux)
but which is not yet marked
as being in the same thread group as its parent
(which happens a few hundred instructions after the child's
startup, in thread_wrapper (called by run_a_thread_NORETURN called
by ML_(start_thread_NORETURN), which is the start point
for the child on the host cpu)).

Then the parent exits, but the child is not marked as also-to-exit
because it is not marked as in the same thread group as
its parent. So it stays alive. This is I think what happened
to tid=281 in the logfile you sent.

It would be best to mark the child's thread group before
creating it. But I don't understand the meaning of thread groups,
and how these relate to what VG_(gettid) and VG_(getpid) return.

I could chase this if you can refine the test case into something
that reliably hangs every time -- the current 5% failure rate is going to
make it impossible to investigate.

One thing you could do is to insert a spin-wait loop in
ML_(start_thread_NORETURN) [make sure gcc doesn't just optimise it
away] to delay the point where the child sets up its .threadgroup
field. This might make the hang happen more often. Can you try that?

J

On Tuesday 02 February 2010, Konstantin Serebryany wrote:
> Hi Julian,
>
> Any luck with this hang?
> Anything I can help with?
|
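The ordering problem suspected in the message above can be modelled deterministically, without real threads. The sketch below is only an illustration of that hypothesis: `Slot`, `exit_group` and the two scenario functions are hypothetical names and do not correspond to actual Valgrind data structures or code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-thread record (not Valgrind's ThreadState).
struct Slot {
    int  threadgroup = -1;        // -1 means "not yet assigned to a group"
    bool exit_requested = false;  // set when the thread is told to exit
};

// Model of the exit_group handling as described: mark every other slot that
// is in the caller's thread group as "to be forced to exit".
void exit_group(std::vector<Slot>& slots, std::size_t caller) {
    assert(caller < slots.size());
    for (std::size_t i = 0; i < slots.size(); ++i) {
        if (i != caller && slots[i].threadgroup == slots[caller].threadgroup) {
            slots[i].exit_requested = true;
        }
    }
}

// Suspected racy ordering: the child fills in its own group only after it
// has started, and the parent's exit_group happens to run first.
bool child_survives_when_group_set_late() {
    std::vector<Slot> slots(2);
    slots[0].threadgroup = 1;  // parent
    // The child (slots[1]) exists after clone, but its group is still unset.
    exit_group(slots, 0);      // parent exits; the child is not matched
    slots[1].threadgroup = 1;  // child start-up code runs too late
    return !slots[1].exit_requested;
}

// Proposed ordering: the child's group is set before the child is created.
bool child_survives_when_group_set_early() {
    std::vector<Slot> slots(2);
    slots[0].threadgroup = 1;
    slots[1].threadgroup = 1;  // assigned before clone
    exit_group(slots, 0);
    return !slots[1].exit_requested;
}
```

In this model the first scenario leaves the child alive after the parent has exited, which matches the hang, while the second marks it for exit as intended.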
|
From: Konstantin S. <kon...@gm...> - 2010-02-07 18:53:20
|
On Fri, Feb 5, 2010 at 8:00 PM, Julian Seward <js...@ac...> wrote:
>
> The log is quite useful. It might be that there is a race
> between the handling for sys_clone and for sys_exit_group. I'm not
> sure I understand the details though.
>
> sys_exit_group happens when the main thread exits. It marks
> all other threads in the same thread group as "to be forced
> to exit". If any of these threads are blocked in syscalls
> then they are hit on the head with sigvgkill to get them out
> of the syscall. Or something like that. (see function
> PRE_(sys_exit_group)).
>
> So, I suspect the problem is, there is a child thread
> that has just been created by clone
> (by a call to do_syscall_clone_amd64_linux)
> but which is not yet marked
> as being in the same thread group as its parent
> (which happens a few hundred instructions after the child's
> startup, in thread_wrapper (called by run_a_thread_NORETURN called
> by ML_(start_thread_NORETURN), which is the start point
> for the child on the host cpu)).
>
> Then the parent exits, but the child is not marked as also-to-exit
> because it is not marked as in the same thread group as
> its parent. So it stays alive. This is I think what happened
> to tid=281 in the logfile you sent.
>
> It would be best to mark the child's thread group before
> creating it. But I don't understand the meaning of thread groups,
> and how these relate to what VG_(gettid) and VG_(getpid) return.
>
> I could chase this if you can refine the test case into something
> that reliably hangs every time -- the current 5% failure rate is going to
> make it impossible to investigate.
>
> One thing you could do is to insert a spin-wait loop in
> ML_(start_thread_NORETURN) [make sure gcc doesn't just optimise it
> away] to delay the point where the child sets up its .threadgroup
> field. This might make the hang happen more often. Can you try that?
>
> J

Indeed, the patch below makes the bug manifest itself every time.
The process either hangs (top shows it as a zombie) or continues to print
stuff forever.

--kcc

--- coregrind/m_syswrap/syswrap-linux.c (revision 11037)
+++ coregrind/m_syswrap/syswrap-linux.c (working copy)
@@ -214,11 +214,20 @@
    vg_assert(0);
 }
 
+static void spin_loop(int c, int tid) {
+  static volatile int z;
+  VG_(printf)("spinning: %d\n", tid);
+  while(c--) {
+    z++;
+  }
+  VG_(printf)("done: %d\n", tid);
+}
+
 Word ML_(start_thread_NORETURN) ( void* arg )
 {
    ThreadState* tst = (ThreadState*)arg;
    ThreadId tid = tst->tid;
-
+  spin_loop(1 << 25, tid);
    run_a_thread_NORETURN ( (Word)tid );
    /*NOTREACHED*/
    vg_assert(0);
|
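The request to "make sure gcc doesn't just optimise it away" is what the `volatile` qualifier in the patch addresses: accesses to a volatile object count as observable behaviour, so the compiler has to perform each one. A standalone sketch of the same idea follows; `spin_wait` is a hypothetical name and the code is independent of Valgrind's sources.

```cpp
#include <cassert>

// Busy-wait for `iterations` loop steps and return the final counter value.
// Because `counter` is volatile, every read and write must really happen,
// so the loop cannot be collapsed or removed by the optimiser.
int spin_wait(int iterations) {
    volatile int counter = 0;
    for (int i = 0; i < iterations; ++i) {
        counter = counter + 1;  // one volatile read and one volatile write
    }
    return counter;
}
```

With a plain `int` counter whose result is unused, an optimising compiler is free to delete the loop entirely, and the intended delay would disappear.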
|
From: Julian S. <js...@ac...> - 2010-02-09 20:24:07
|
Konstantin, please can you file a bug report on this?
Else it's in danger of falling through the cracks.
J
|
|
From: Konstantin S. <kon...@gm...> - 2010-02-10 07:52:52
|
Done: https://bugs.kde.org/show_bug.cgi?id=226116
|