From: David C. <dc...@gm...> - 2014-01-26 02:20:24
Hi,
I've got an issue with memcheck in Valgrind 3.8.1 hanging. I've left
processes running for weeks or even months but they don't complete
(normally these processes run in a few minutes tops, and they were working
fine with memcheck until a while ago).
Has anyone seen anything like this before? Here are the details:
options:
--quiet --track-origins=yes --free-fill=7a --child-silent-after-fork=yes
--fair-sched=no --log-file=/path/to/log
--suppressions=/path/to/suppression.file
strace shows:
Process 5223 attached - interrupt to quit
read(1027,
system details:
$ uname -a
Linux HOSTNAME 2.6.18-194.3.1.el5 #1 SMP Sun May 2 04:17:42 EDT 2010 x86_64
x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Top shows:
732 USER 25 0 2396m 1.6g 110m S 99.8 3.5 86263:19
memcheck-amd64-
25100 USER 25 0 2498m 1.7g 112m S 99.8 3.6 86272:55
memcheck-amd64-
Ps shows pipe wait in WCHAN:
0 S nbezj7v 25100 1864 98 85 0 - 639685 pipe_w 2013
? 59-22:03:14 /path/to/valgrind options
I have also seen it hang on fast mutexes, i.e. strace/ps will show the
process in FUTEX_WAIT status.
Thanks,
David.
From: Philippe W. <phi...@sk...> - 2014-01-26 13:07:54
On Sun, 2014-01-26 at 02:20 +0000, David Carter wrote:
> Hi,
>
>
> I've got an issue with memcheck in Valgrind 3.8.1 hanging. I've left
> processes running for weeks or even months but they don't complete
> (normally these processes run in a few minutes tops, and they were
> working fine with memcheck until a while ago.
>
>
> Has anyone seen anything like this before? Here are the details:
>
>
> options:
>
> --quiet --track-origins=yes --free-fill=7a
> --child-silent-after-fork=yes --fair-sched=no --log-file=/path/to/log
> --suppressions=/path/to/suppression.file
>
>
>
> strace shows:
>
> Process 5223 attached - interrupt to quit
>
> read(1027,
With --fair-sched=no, valgrind uses a pipe to implement a "big lock".
It is however not clear from what you have shown whether this 1027 is
the valgrind pipe big-lock fd. If yes, then it looks like a bug in
valgrind, as the above read means a thread wants to acquire the big
lock to run, but the thread currently holding the lock has not
released it.
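For reference, the mechanism can be modelled in a few lines of C; this is
only a simplified sketch of the idea (not Valgrind's actual sema.c code):
the lock token is a single byte sitting in a pipe, whoever read()s it holds
the lock, and write()ing it back releases it.

/* simplified pipe-based "big lock" sketch (illustration only) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int lock_fds[2];              /* [0] = read end, [1] = write end */

static void lock_init(void)
{
   if (pipe(lock_fds) != 0) { perror("pipe"); exit(1); }
   char token = 'A';
   if (write(lock_fds[1], &token, 1) != 1) { perror("write"); exit(1); }
}

static void lock_acquire(void)       /* blocks in read() until the token
                                        is available - the pipe_w state */
{
   char token;
   if (read(lock_fds[0], &token, 1) != 1) { perror("read"); exit(1); }
}

static void lock_release(void)       /* puts the token back on the pipe */
{
   char token = 'A';
   if (write(lock_fds[1], &token, 1) != 1) { perror("write"); exit(1); }
}

int main(void)
{
   lock_init();
   lock_acquire();
   /* if lock_release() is never called, any further lock_acquire()
      blocks forever in read() - which is what a hung read(1027, ...)
      would mean if that fd really is the big-lock pipe */
   lock_release();
   return 0;
}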
Here are various suggestions:
1. When you are in the above blocked state, use gdb+vgdb
to connect to your process and examine its state
(e.g. which thread is doing what). The most likely cause of the
deadlock/problem is your application, not valgrind, at least when
looking at your mail with a "valgrind developer hat on" :).
2. Upgrade to 3.9.0; many bugs have been fixed since 3.8.1
(probably not yours, I do not see anything related to a deadlock,
but one never knows).
3. Run with a lot more tracing, e.g.
-v -v -v -d -d -d --trace-sched=yes --trace-syscalls=yes --trace-signals=yes
and see if there is some suspicious output.
Philippe
From: David C. <dc...@gm...> - 2014-01-26 22:28:58
Thank you very much, Philippe,

The --fair-sched option was set in an attempt to fix this. I had read about
interminable FUTEX_WAIT status and I think that was one of the suggestions.
Clearly it doesn't make any difference.

I think I've tried 3.9.0, but I will double-check and run that one from now
on anyway.

I have tried connecting with gdb and there wasn't much visible. I'll try
again though, and also try vgdb - I was unaware of this tool.

Not sure what is getting locked, whether it's Valgrind or our code. We do
use threading but only in a limited way, and I'm pretty sure memcheck is
hanging up on single-threaded cases. Hopefully the extra logging etc. will
reveal something. I can't easily log onto the machine from here - I'll run
the experiments you suggest and report back in a short while.

One thing I didn't mention, which might be important, is that I run
valgrind through a Python-driven process pool. I use the multiprocess
module to spawn off a bunch of valgrinds. I don't think it's relevant, as
it was working fine for several weeks like this before the hang-ups
started.

Best wishes and thanks again,
David.
From: David C. <dc...@gm...> - 2014-01-31 16:34:53
Hi Philippe,
Upgraded to 3.9.0 as you suggested and ran with these options:
-v -v -v -d -d -d --trace-sched=yes --trace-syscalls=yes
--trace-signals=yes --quiet --track-origins=yes --free-fill=7a
--child-silent-after-fork=yes --fair-sched=no
After some time, a bunch of processes went into 'pipe_w' status. These
were single-threaded processes. Their logfiles (which were enormous -
hundreds of gigabytes!) all contained this line:
--23014-- SCHED[3]: TRC: YIELD
Each of the processes showed only one thread:
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.1)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <
http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and
redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type
"show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 2071
Reading symbols from
/apps1/pkgs/valgrind-3.9.0/lib/valgrind/memcheck-amd64-linux...done.
0x000000003804b559 in do_syscall_WRK ()
(gdb) where
#0 0x000000003804b559 in do_syscall_WRK ()
#1 0x000000003804b94a in vgPlain_do_syscall (sysno=1028,
a1=34516426208, a2=1, a3=18446744073709551615, a4=0, a5=0, a6=0, a7=0,
a8=0) at m_syscall.c:674
#2 0x0000000038035d44 in vgPlain_read (fd=1,
buf=0xffffffffffffffff, count=<value optimized out>) at m_libcfile.c:158
#3 0x00000000380daa98 in vgModuleLocal_sema_down
(sema=0x802001830, as_LL=0 '\000') at m_scheduler/sema.c:109
#4 0x0000000038083687 in vgPlain_acquire_BigLock_LL
(tid=1, who=0x80956dde0 "") at m_scheduler/scheduler.c:355
#5 vgPlain_acquire_BigLock (tid=1, who=0x80956dde0 "") at
m_scheduler/scheduler.c:277
#6 0x00000000380838f5 in vgPlain_scheduler (tid=<value
optimized out>) at m_scheduler/scheduler.c:1227
#7 0x00000000380b28b6 in thread_wrapper (tidW=1) at
m_syswrap/syswrap-linux.c:103
#8 run_a_thread_NORETURN (tidW=1) at
m_syswrap/syswrap-linux.c:156
#9 0x0000000000000000 in ?? ()
(gdb) info threads
* 1 process 2071 0x000000003804b559 in do_syscall_WRK ()
(gdb)
strace showed the same as before (i.e. a read on a high-numbered file
handle, around 1026 or 1027). Someone has suggested that this indicates
that valgrind is calling dup2 to create new file handles. Evidence from
lsof also bears this out, showing only 77 open files for each process. The
fds not relevant to our application are:
COMMAND   PID  USER    FD    TYPE DEVICE SIZE      NODE NAME
memcheck- 2071 nbezj7v 5r    FIFO 0,6         297571407 pipe
memcheck- 2071 nbezj7v 7u    sock 0,5         297780139 can't identify protocol
memcheck- 2071 nbezj7v 8w    FIFO 0,6         297571410 pipe
memcheck- 2071 nbezj7v 9r    CHR  1,3              3908 /dev/null
memcheck- 2071 nbezj7v 10r   DIR  253,0 4096          2 /
memcheck- 2071 nbezj7v 1025u REG  253,0 637     1114475 /tmp/valgrind_proc_2071_cmdline_ad8659c2 (deleted)
memcheck- 2071 nbezj7v 1026u REG  253,0 256     1114491 /tmp/valgrind_proc_2071_auxv_ad8659c2 (deleted)
memcheck- 2071 nbezj7v 1028r FIFO 0,6         297571563 pipe
memcheck- 2071 nbezj7v 1029w FIFO 0,6         297571563 pipe
memcheck- 2071 nbezj7v 1030r FIFO 253,0         1114706 /tmp/vgdb-pipe-from-vgdb-to-2071-by-USERNAME-on-???
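For what it's worth, pushing a descriptor up into a high range like that
can be reproduced with plain fcntl(F_DUPFD); a minimal sketch follows (the
1024 threshold and the helper name are assumptions for illustration, not
necessarily what valgrind does internally):

/* sketch: relocate an fd to the lowest free number >= min_fd */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int move_fd_high(int fd, int min_fd)
{
   int newfd = fcntl(fd, F_DUPFD, min_fd);  /* duplicate at or above min_fd */
   if (newfd < 0) { perror("fcntl(F_DUPFD)"); return fd; }
   close(fd);                               /* drop the low-numbered copy */
   return newfd;
}

int main(void)
{
   int fds[2];
   if (pipe(fds) != 0) { perror("pipe"); return 1; }
   int rd = move_fd_high(fds[0], 1024);
   int wr = move_fd_high(fds[1], 1024);
   printf("pipe relocated to fds %d and %d\n", rd, wr);
   return 0;
}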
I tried vgdb, but didn't have much luck. After invoking 'valgrind --vgdb=yes
--vgdb-error=0 /path/to/my/exe', I got this in another terminal:
$ gdb /path/to/my/exe
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.1)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <
http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and
redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type
"show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
"/path/to/my/exe": not in executable format: File truncated
(gdb) target remote | /apps1/pkgs/valgrind-3.9.0/bin/vgdb
--pid=30352
Remote debugging using |
/apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
relaying data between gdb and process 30352
Remote register badly formatted:
T0506:0000000000000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
here:
00000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
Try to load the executable by `file' first,
you may also check `set/show architecture'.
This also caused the vgdb server to hang up. I tried the 'file' command,
but it made no difference. The "not in executable format" is totally
expected - we run an optimised lightweight "test shell" process which loads
a bunch of heavy debug .so's.
What is the next stage? Can I try different options, or perhaps
instrument/change the source code in some way in order to figure out what
is happening?
Thanks,
David.
From: Philippe W. <phi...@sk...> - 2014-01-31 20:56:09
On Fri, 2014-01-31 at 16:34 +0000, David Carter wrote:
Hello David,
> Hi Philippe,
>
>
>
> Upgraded to 3.9.0 as you suggested and ran with these options:
>
>
>
> -v -v -v -d -d -d --trace-sched=yes
> --trace-syscalls=yes --trace-signals=yes --quiet --track-origins=yes
> --free-fill=7a --child-silent-after-fork=yes --fair-sched=no
>
>
>
> After some time, a bunch of processes went into 'pipe_w' status.
> These were single-threaded processes. Their logfiles (which were
> enormous - hundreds of gigabytes!) all contained this line:
>
>
>
> --23014-- SCHED[3]: TRC: YIELD
The above trace is strange: the SCHED[3] indicates that it is
valgrind thread id 3 that is producing the trace. That seems to indicate
that there was (at least at some point in time) more than one thread
in the game.
The YIELD scheduler trace is explained as:
case VEX_TRC_JMP_YIELD:
/* Explicit yield, because this thread is in a spin-lock
or something. Only let the thread run for a short while
longer. Because swapping to another thread is expensive,
we're prepared to let this thread eat a little more CPU
before swapping to another. That means that short term
spins waiting for hardware to poke memory won't cause a
thread swap. */
if (dispatch_ctr > 1000)
dispatch_ctr = 1000;
break;
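For illustration, a spin-wait of the kind below (using the x86-64 PAUSE
hint, which as far as I know valgrind translates to such a yield) is the
typical source of these traces; this is only a guess at where the yields in
your run come from, not a diagnosis:

/* busy-wait on a flag using the PAUSE spin-loop hint (illustration) */
static volatile int ready = 0;

static void spin_until_ready(void)
{
   while (!ready)
      __asm__ __volatile__("pause" ::: "memory");  /* "rep; nop" */
}

int main(void)
{
   ready = 1;              /* set immediately so this demo terminates */
   spin_until_ready();
   return 0;
}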
>
>
>
> Each of the processes showed only one thread:
I have already seen in the past that GDB is not always
able to show the various threads (it is not clear when; I suspect
this might happen with a static thread lib?).
To double-check the number of threads, you can do one of the following.
From the shell:
vgdb -l
and then, for each reported PIDNR:
vgdb --pid=<PIDNR> v.info scheduler
That will show the state of the valgrind scheduler and the list
of threads known by valgrind, and will ask valgrind to produce a
(guest) stack trace for each thread.
Or alternatively:
ls /proc/<PIDNR>/task
will show the list of thread ids at the Linux level.
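The same check can also be done programmatically if that is more
convenient; a small sketch (the pid 2071 is just the example pid from your
gdb session):

/* count Linux threads of a process by listing /proc/<pid>/task */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
   const char *dir = "/proc/2071/task";
   DIR *d = opendir(dir);
   if (!d) { perror(dir); return 1; }
   int n = 0;
   struct dirent *e;
   while ((e = readdir(d)) != NULL)
      if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
         n++;                      /* one entry per thread id */
   closedir(d);
   printf("%s: %d thread(s)\n", dir, n);
   return 0;
}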
> #3 0x00000000380daa98 in vgModuleLocal_sema_down
> (sema=0x802001830, as_LL=0 '\000') at m_scheduler/sema.c:109
>
> #4 0x0000000038083687 in vgPlain_acquire_BigLock_LL
> (tid=1, who=0x80956dde0 "") at m_scheduler/scheduler.c:355
>
> #5 vgPlain_acquire_BigLock (tid=1, who=0x80956dde0
> "") at m_scheduler/scheduler.c:277
The above indicates that the thread is trying to acquire the
valgrind "big lock". When using --fair-sched=no, the big lock is
implemented using a pipe. Writing a character to the pipe releases
the lock; reading the character from the pipe acquires it.
If the process is blocked reading from the pipe, then it looks
like the "lock" character was never written back?
Maybe strace -f the valgrind run and see what is happening with the lock
character. The lock character loops over A .. Z and then back to A.
Check that the lock is properly released some short time before
the attempt to re-acquire it.
>
>
> strace showed the same as before (i.e. read on a high-numbered
> filehandle, around 1026 or 1027). Someone has suggested that this
> would indicate that valgrind is calling dup2 to create new
> filehandles. Evidence from lsof also bears this out, showing only 77
> open files for each process. The fd's not relevant to our application
> are:
In the below, I guess that 1028 and 1029 are the pipe used for the
valgrind lock.
>
>
>
> COMMAND PID USER FD TYPE DEVICE SIZE
> NODE NAME
>
> memcheck- 2071 nbezj7v 5r FIFO 0,6
> 297571407 pipe
>
> memcheck- 2071 nbezj7v 7u sock 0,5
> 297780139 can't identify protocol
>
> memcheck- 2071 nbezj7v 8w FIFO 0,6
> 297571410 pipe
>
> memcheck- 2071 nbezj7v 9r CHR 1,3
> 3908 /dev/null
>
> memcheck- 2071 nbezj7v 10r DIR 253,0 4096
> 2 /
>
> memcheck- 2071 nbezj7v 1025u REG 253,0 637
> 1114475 /tmp/valgrind_proc_2071_cmdline_ad8659c2 (deleted)
>
> memcheck- 2071 nbezj7v 1026u REG 253,0 256
> 1114491 /tmp/valgrind_proc_2071_auxv_ad8659c2 (deleted)
>
> memcheck- 2071 nbezj7v 1028r FIFO 0,6
> 297571563 pipe
>
> memcheck- 2071 nbezj7v 1029w FIFO 0,6
> 297571563 pipe
>
> memcheck- 2071 nbezj7v 1030r FIFO 253,0
> 1114706 /tmp/vgdb-pipe-from-vgdb-to-2071-by-USERNAME-on-???
>
>
>
> I tried vgdb, but not a lot of luck. After invoking 'valgrind
> --vgdb=yes --vgdb-error=0 /path/to/my/exe', I then got this in another
> terminal:
>
>
>
> $ gdb /path/to/my/exe
> "/path/to/my/exe": not in executable format: File
> truncated
>
> (gdb) target remote
> | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
>
> Remote debugging using
> | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
>
> relaying data between gdb and process 30352
>
> Remote register badly formatted:
> T0506:0000000000000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
>
> here:
> 00000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
>
> Try to load the executable by `file' first,
>
> you may also check `set/show architecture'.
>
>
>
> This also caused the vgdb server to hang up. I tried with the 'file'
> command made no difference. The "not in executable format" is totally
> expected - we run a optimised lightweight "test shell" process which
> loads a bunch of heavy debug so's.
The "not in executable format" is not expected by me :).
What kind of executable is that ? I thought that gdb should be able
to "understand" the executables that are launchable on e.g. red hat 5
(if I guessed the distro properly).
In the shell, what is 'file /path/to/my/exe' telling ?
Maybe you could download and compile the last gdb version (7.6 or 7.7
if it has just been produced) and see if gdb is now more intelligent ?
>
>
>
> What is the next stage, can I try different options? Or perhaps
> instrument/change the source code in some way in order to figure out
> what is happening?
As detailed above, you could:
* confirm the list of threads in your process
(ls /proc/..., vgdb v.info scheduler)
* if v.info scheduler works, you might guess what your
application is doing from the stack trace(s)
* maybe you could debug using a newer gdb
* strace -f valgrind .... /path/to/my/exe
might also shed some light on what is happening with the valgrind big
lock.
Philippe
From: David C. <dc...@gm...> - 2014-01-31 22:00:00
Thank you again, Philippe. I will make some investigations next week and
report back.
Regards,
David.
|
On Tue, 2014-02-11 at 07:22 +0000, David Carter wrote:
> Hi Philippe,
>
> Thanks for your suggestions, I have got the first part of the
> information. It seems there is some contention over locale
> resources. Do you agree?
Well, difficult to say without looking more in depth at the code.
Taking into account that there are threads still running, that the
valgrind trace shows that threads are being scheduled, I guess the
problem is linked to the application, not to valgrind.
It looks to me that the easiest would be to have a way to debug the
application, trying e.g. a newer gdb and vgdb (if the newer gdb
supports the strange executable format).
Alternatively, just make a normal executable :).
At this stage, not much can be done from Valgrind side I am afraid.
Philippe
|
Hi Philippe,

The executable-not-recognised thing is a red herring, I'm sure. The problem
seems to be some kind of process-level lock on locale resources. A single
instance of valgrind over all our processes runs fine. The issue is not the
application, it is the parallelisation that I have introduced by using
Python's multiprocess module in order to run batches of valgrind together.
One valgrind instance takes about a week to run over all our processes,
which is why I started to explore the multiprocess route.

For the moment, I'll scale back and be patient until I understand the
locale issue.

Thanks for your help.

Regards,
David.