From: David C. <dc...@gm...> - 2014-01-31 22:00:00

Thank you again, Philippe. I will make some investigations next week and
report back.

Regards, David.

On Friday, January 31, 2014, Philippe Waroquiers <phi...@sk...> wrote:

> On Fri, 2014-01-31 at 16:34 +0000, David Carter wrote:
> Hello David,
>
> > Hi Philippe,
> >
> > Upgraded to 3.9.0 as you suggested and ran with these options:
> >
> > -v -v -v -d -d -d --trace-sched=yes --trace-syscalls=yes
> > --trace-signals=yes --quiet --track-origins=yes --free-fill=7a
> > --child-silent-after-fork=yes --fair-sched=no
> >
> > After some time, a bunch of processes went into 'pipe_w' status.
> > These were single-threaded processes. Their logfiles (which were
> > enormous - hundreds of gigabytes!) all contained this line:
> >
> >     --23014-- SCHED[3]: TRC: YIELD
>
> The above trace is strange: the SCHED[3] indicates that it is valgrind
> thread id 3 which is doing the trace. That seems to indicate that there
> was (at least at some point in time) more than one thread in the game.
>
> The YIELD scheduler trace is explained as:
>
>    case VEX_TRC_JMP_YIELD:
>       /* Explicit yield, because this thread is in a spin-lock
>          or something.  Only let the thread run for a short while
>          longer.  Because swapping to another thread is expensive,
>          we're prepared to let this thread eat a little more CPU
>          before swapping to another.  That means that short term
>          spins waiting for hardware to poke memory won't cause a
>          thread swap. */
>       if (dispatch_ctr > 1000)
>          dispatch_ctr = 1000;
>       break;
>
> > Each of the processes showed only one thread:
>
> I have already seen in the past that GDB is not always able to show the
> various threads (it is not clear when; I suspect this might happen with
> a static thread lib?).
>
> To double-check the number of threads, you can do one of the following.
> From the shell:
>
>     vgdb -l
>
> and then, for each reported PIDNR:
>
>     vgdb --pid=<PIDNR> v.info scheduler
>
> That will show the state of the valgrind scheduler and the list of
> threads known by valgrind, and ask valgrind to produce a (guest) stack
> trace for each thread.
>
> Or alternatively:
>
>     ls /proc/<PIDNR>/task
>
> will show the list of thread numbers at the Linux level.
>
> > #3  0x00000000380daa98 in vgModuleLocal_sema_down (sema=0x802001830,
> >     as_LL=0 '\000') at m_scheduler/sema.c:109
> > #4  0x0000000038083687 in vgPlain_acquire_BigLock_LL (tid=1,
> >     who=0x80956dde0 "") at m_scheduler/scheduler.c:355
> > #5  vgPlain_acquire_BigLock (tid=1, who=0x80956dde0 "")
> >     at m_scheduler/scheduler.c:277
>
> The above indicates that the thread is trying to acquire the valgrind
> "big lock". When using --fair-sched=no, the big lock is implemented
> using a pipe. Writing a character on the pipe releases the lock;
> reading the character from the pipe is the lock acquisition. If the
> process is blocked reading on the pipe, then it looks like the "lock"
> character was not written back?
>
> Maybe strace -f valgrind and see what is happening with the lock
> character. The lock character loops over A .. Z and then back to A.
> Check that the lock is properly released a short time before the
> attempt to re-acquire it.
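
As a rough illustration of the scheme described above (a minimal sketch
only, not Valgrind's actual m_scheduler/sema.c code), a pipe-based lock
works like this: the pipe holds a single "token" byte, acquiring the lock
blocks in read() until the token arrives, and releasing writes it back
while cycling the character over A..Z so a lost token is easy to spot in
strace output:

    /* Minimal sketch of a pipe-based lock (illustrative only, not
       Valgrind's sema.c).  The pipe holds exactly one token byte:
       acquire = read it (blocks while the lock is held), release =
       write it back, advancing the character A..Z so strace shows a
       recognisable sequence. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int  lock_fds[2];        /* [0] = read end, [1] = write end */
    static char lock_token = 'A';   /* current token character         */

    static void lock_init(void)
    {
       if (pipe(lock_fds) != 0) { perror("pipe"); exit(1); }
       /* Put the initial token in the pipe: the lock starts out free. */
       if (write(lock_fds[1], &lock_token, 1) != 1) { perror("write"); exit(1); }
    }

    static void lock_acquire(void)
    {
       char c;
       /* Blocks (the 'pipe_w' state seen above) until someone releases
          the lock by writing the token back into the pipe. */
       if (read(lock_fds[0], &c, 1) != 1) { perror("read"); exit(1); }
       lock_token = c;
    }

    static void lock_release(void)
    {
       /* Advance the token and write it back, waking one blocked reader. */
       lock_token = (lock_token == 'Z') ? 'A' : (char)(lock_token + 1);
       if (write(lock_fds[1], &lock_token, 1) != 1) { perror("write"); exit(1); }
    }

    int main(void)
    {
       lock_init();
       lock_acquire();
       printf("holding the lock, token is now '%c'\n", lock_token);
       lock_release();
       return 0;
    }

In such a scheme, if the token byte is never written back, every later
read() blocks forever, which matches the 'pipe_w' state reported above.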
> > strace showed the same as before (i.e. a read on a high-numbered
> > filehandle, around 1026 or 1027). Someone has suggested that this
> > would indicate that valgrind is calling dup2 to create new
> > filehandles. Evidence from lsof also bears this out, showing only 77
> > open files for each process. The fd's not relevant to our application
> > are:
>
> In the below, I guess that 1028 and 1029 are the pipe used for the
> valgrind lock.
>
> > COMMAND   PID   USER     FD     TYPE  DEVICE  SIZE       NODE  NAME
> > memcheck- 2071  nbezj7v  5r     FIFO  0,6         297571407  pipe
> > memcheck- 2071  nbezj7v  7u     sock  0,5         297780139  can't identify protocol
> > memcheck- 2071  nbezj7v  8w     FIFO  0,6         297571410  pipe
> > memcheck- 2071  nbezj7v  9r     CHR   1,3              3908  /dev/null
> > memcheck- 2071  nbezj7v  10r    DIR   253,0  4096        2   /
> > memcheck- 2071  nbezj7v  1025u  REG   253,0  637   1114475   /tmp/valgrind_proc_2071_cmdline_ad8659c2 (deleted)
> > memcheck- 2071  nbezj7v  1026u  REG   253,0  256   1114491   /tmp/valgrind_proc_2071_auxv_ad8659c2 (deleted)
> > memcheck- 2071  nbezj7v  1028r  FIFO  0,6         297571563  pipe
> > memcheck- 2071  nbezj7v  1029w  FIFO  0,6         297571563  pipe
> > memcheck- 2071  nbezj7v  1030r  FIFO  253,0         1114706  /tmp/vgdb-pipe-from-vgdb-to-2071-by-USERNAME-on-???
> >
> > I tried vgdb, but without a lot of luck. After invoking 'valgrind
> > --vgdb=yes --vgdb-error=0 /path/to/my/exe', I then got this in
> > another terminal:
> >
> >     $ gdb /path/to/my/exe
> >     "/path/to/my/exe": not in executable format: File truncated
> >     (gdb) target remote | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
> >     Remote debugging using | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
> >     relaying data between gdb and process 30352
> >     Remote register badly formatted:
> >       T0506:0000000000000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
> >     here:
> >       00000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
> >     Try to load the executable by `file' first,
> >     you may also check `set/show architecture'.
> >
> > This also caused the vgdb server to hang up. I tried the 'file'
> > command, but it made no difference. The "not in executable format" is
> > totally expected - we run an optimised lightweight "test shell"
> > process which loads a bunch of heavy debug .so's.
>
> The "not in executable format" is not expected by me :).
>
> What kind of executable is that? I thought that gdb should be able to
> "understand" the executables that are launchable on e.g. Red Hat 5
> (if I guessed the distro properly).
>
> In the shell, what does 'file /path/to/my/exe' report?
>
> Maybe you could download and compile the latest gdb version (7.6, or
> 7.7 if it has just been released) and see if gdb is now more
> intelligent?
>
> > What is the next stage, can I try different options? Or perhaps
> > instrument/change the source code in some way in order to figure out
> > what is happening?
>
> As detailed above, you could:
> * confirm the list of threads in your process
>   (ls /proc/..., vgdb v.info scheduler)
> * if v.info scheduler works, you might guess what your application is
>   doing from the stack trace(s)
> * maybe you could debug using a newer gdb
> * strace -f valgrind .... /path/to/my/exe
>   might also give some light on what happens with the valgrind big
>   lock.
>
> Philippe
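
As a small sketch of the /proc/<PIDNR>/task check suggested above (the
program and its usage, e.g. './count_threads <pid>', are made up here for
illustration), counting the entries of that directory gives the number of
threads the kernel sees, independently of what gdb manages to show:

    /* Illustrative helper: count the Linux-level threads of a process
       by listing /proc/<pid>/task, as suggested in the mail above. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
       char path[64];
       DIR *dir;
       struct dirent *entry;
       int nthreads = 0;

       if (argc != 2) {
          fprintf(stderr, "usage: %s <pid>\n", argv[0]);
          return 1;
       }
       snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);

       dir = opendir(path);
       if (dir == NULL) { perror(path); return 1; }

       while ((entry = readdir(dir)) != NULL) {
          /* Every entry except "." and ".." is a thread id. */
          if (entry->d_name[0] != '.') {
             printf("thread id %s\n", entry->d_name);
             nthreads++;
          }
       }
       closedir(dir);

       printf("%d thread(s) at the Linux level\n", nthreads);
       return 0;
    }

Comparing that count with the thread list reported by
'vgdb --pid=<PIDNR> v.info scheduler' should settle whether the process
really is single-threaded, or whether gdb is simply failing to display
the other threads.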