From: David C. <dc...@gm...> - 2014-01-31 22:00:00

Thank you again, Philippe. I will make some investigations next week and
report back.

Regards, David.

On Friday, January 31, 2014, Philippe Waroquiers <phi...@sk...> wrote:

> On Fri, 2014-01-31 at 16:34 +0000, David Carter wrote:
> Hello David,
>
> > Hi Philippe,
> >
> > Upgraded to 3.9.0 as you suggested and ran with these options:
> >
> > -v -v -v -d -d -d --trace-sched=yes --trace-syscalls=yes
> > --trace-signals=yes --quiet --track-origins=yes --free-fill=7a
> > --child-silent-after-fork=yes --fair-sched=no
> >
> > After some time, a bunch of processes went into 'pipe_w' status.
> > These were single-threaded processes. Their logfiles (which were
> > enormous - hundreds of gigabytes!) all contained this line:
> >
> >     --23014-- SCHED[3]: TRC: YIELD
>
> The above trace is strange: the SCHED[3] indicates that it is valgrind
> thread id 3 which is doing the trace. That seems to indicate that there
> was (at least at some point in time) more than one thread in the game.
>
> The YIELD scheduler trace is explained as:
>
>    case VEX_TRC_JMP_YIELD:
>       /* Explicit yield, because this thread is in a spin-lock
>          or something.  Only let the thread run for a short while
>          longer.  Because swapping to another thread is expensive,
>          we're prepared to let this thread eat a little more CPU
>          before swapping to another.  That means that short term
>          spins waiting for hardware to poke memory won't cause a
>          thread swap. */
>       if (dispatch_ctr > 1000)
>          dispatch_ctr = 1000;
>       break;
>
> > Each of the processes showed only one thread:
>
> I have already seen in the past that GDB is not always able to show the
> various threads (it is not clear when; I suspect this might happen with
> a static thread lib?).
>
> To double-check the number of threads, you can do one of the following.
> From the shell:
>
>     vgdb -l
>
> and then, for each reported PIDNR:
>
>     vgdb --pid=<PIDNR> v.info scheduler
>
> That will show the state of the valgrind scheduler and the list of
> threads known by valgrind, and ask valgrind to produce a (guest) stack
> trace for each thread.
>
> Or alternatively:
>
>     ls /proc/<PIDNR>/task
>
> will show the list of thread numbers at the Linux level.
>
> > #3  0x00000000380daa98 in vgModuleLocal_sema_down (sema=0x802001830,
> >     as_LL=0 '\000') at m_scheduler/sema.c:109
> > #4  0x0000000038083687 in vgPlain_acquire_BigLock_LL (tid=1,
> >     who=0x80956dde0 "") at m_scheduler/scheduler.c:355
> > #5  vgPlain_acquire_BigLock (tid=1, who=0x80956dde0 "")
> >     at m_scheduler/scheduler.c:277
>
> The above indicates that the thread is trying to acquire the valgrind
> "big lock". When using --fair-sched=no, the big lock is implemented
> using a pipe. Writing a character on the pipe releases the lock;
> reading the character from the pipe is the lock acquisition. If the
> process is blocked reading on the pipe, then it looks like the "lock"
> character was not written back?
>
> Maybe strace -f valgrind and see what is happening with the lock
> character. The lock character loops over A .. Z and then back to A.
> Check that the lock is properly released a short time before the
> attempt to re-acquire it.
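
As a rough illustration of the scheme described above (a minimal sketch
only, not Valgrind's actual m_scheduler/sema.c code), a pipe-based lock
works like this: the pipe holds a single "token" byte, acquiring the lock
blocks in read() until the token arrives, and releasing writes it back
while cycling the character over A..Z so a lost token is easy to spot in
strace output:

    /* Minimal sketch of a pipe-based lock (illustrative only, not
       Valgrind's sema.c).  The pipe holds exactly one token byte:
       acquire = read it (blocks while the lock is held), release =
       write it back, advancing the character A..Z so strace shows a
       recognisable sequence. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int  lock_fds[2];        /* [0] = read end, [1] = write end */
    static char lock_token = 'A';   /* current token character         */

    static void lock_init(void)
    {
       if (pipe(lock_fds) != 0) { perror("pipe"); exit(1); }
       /* Put the initial token in the pipe: the lock starts out free. */
       if (write(lock_fds[1], &lock_token, 1) != 1) { perror("write"); exit(1); }
    }

    static void lock_acquire(void)
    {
       char c;
       /* Blocks (the 'pipe_w' state seen above) until someone releases
          the lock by writing the token back into the pipe. */
       if (read(lock_fds[0], &c, 1) != 1) { perror("read"); exit(1); }
       lock_token = c;
    }

    static void lock_release(void)
    {
       /* Advance the token and write it back, waking one blocked reader. */
       lock_token = (lock_token == 'Z') ? 'A' : (char)(lock_token + 1);
       if (write(lock_fds[1], &lock_token, 1) != 1) { perror("write"); exit(1); }
    }

    int main(void)
    {
       lock_init();
       lock_acquire();
       printf("holding the lock, token is now '%c'\n", lock_token);
       lock_release();
       return 0;
    }

In such a scheme, if the token byte is never written back, every later
read() blocks forever, which matches the 'pipe_w' state reported above.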
> > strace showed the same as before (i.e. a read on a high-numbered
> > filehandle, around 1026 or 1027). Someone has suggested that this
> > would indicate that valgrind is calling dup2 to create new
> > filehandles. Evidence from lsof also bears this out, showing only 77
> > open files for each process. The fd's not relevant to our application
> > are:
>
> In the below, I guess that 1028 and 1029 are the pipe used for the
> valgrind lock.
>
> > COMMAND   PID   USER     FD     TYPE  DEVICE  SIZE       NODE  NAME
> > memcheck- 2071  nbezj7v  5r     FIFO  0,6         297571407  pipe
> > memcheck- 2071  nbezj7v  7u     sock  0,5         297780139  can't identify protocol
> > memcheck- 2071  nbezj7v  8w     FIFO  0,6         297571410  pipe
> > memcheck- 2071  nbezj7v  9r     CHR   1,3              3908  /dev/null
> > memcheck- 2071  nbezj7v  10r    DIR   253,0  4096        2   /
> > memcheck- 2071  nbezj7v  1025u  REG   253,0  637   1114475   /tmp/valgrind_proc_2071_cmdline_ad8659c2 (deleted)
> > memcheck- 2071  nbezj7v  1026u  REG   253,0  256   1114491   /tmp/valgrind_proc_2071_auxv_ad8659c2 (deleted)
> > memcheck- 2071  nbezj7v  1028r  FIFO  0,6         297571563  pipe
> > memcheck- 2071  nbezj7v  1029w  FIFO  0,6         297571563  pipe
> > memcheck- 2071  nbezj7v  1030r  FIFO  253,0         1114706  /tmp/vgdb-pipe-from-vgdb-to-2071-by-USERNAME-on-???
> >
> > I tried vgdb, but without a lot of luck. After invoking 'valgrind
> > --vgdb=yes --vgdb-error=0 /path/to/my/exe', I then got this in
> > another terminal:
> >
> >     $ gdb /path/to/my/exe
> >     "/path/to/my/exe": not in executable format: File truncated
> >     (gdb) target remote | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
> >     Remote debugging using | /apps1/pkgs/valgrind-3.9.0/bin/vgdb --pid=30352
> >     relaying data between gdb and process 30352
> >     Remote register badly formatted:
> >       T0506:0000000000000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
> >     here:
> >       00000000;07:30f0fffe0f000000;10:700aa05d38000000;thread:7690;
> >     Try to load the executable by `file' first,
> >     you may also check `set/show architecture'.
> >
> > This also caused the vgdb server to hang up. I tried the 'file'
> > command, but it made no difference. The "not in executable format" is
> > totally expected - we run an optimised lightweight "test shell"
> > process which loads a bunch of heavy debug .so's.
>
> The "not in executable format" is not expected by me :).
>
> What kind of executable is that? I thought that gdb should be able to
> "understand" the executables that are launchable on e.g. Red Hat 5
> (if I guessed the distro properly).
>
> In the shell, what does 'file /path/to/my/exe' report?
>
> Maybe you could download and compile the latest gdb version (7.6, or
> 7.7 if it has just been released) and see if gdb is now more
> intelligent?
>
> > What is the next stage, can I try different options? Or perhaps
> > instrument/change the source code in some way in order to figure out
> > what is happening?
>
> As detailed above, you could:
> * confirm the list of threads in your process
>   (ls /proc/..., vgdb v.info scheduler)
> * if v.info scheduler works, you might guess what your application is
>   doing from the stack trace(s)
> * maybe you could debug using a newer gdb
> * strace -f valgrind .... /path/to/my/exe
>   might also give some light on what happens with the valgrind big
>   lock.
>
> Philippe
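
As a small sketch of the /proc/<PIDNR>/task check suggested above (the
program and its usage, e.g. './count_threads <pid>', are made up here for
illustration), counting the entries of that directory gives the number of
threads the kernel sees, independently of what gdb manages to show:

    /* Illustrative helper: count the Linux-level threads of a process
       by listing /proc/<pid>/task, as suggested in the mail above. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
       char path[64];
       DIR *dir;
       struct dirent *entry;
       int nthreads = 0;

       if (argc != 2) {
          fprintf(stderr, "usage: %s <pid>\n", argv[0]);
          return 1;
       }
       snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);

       dir = opendir(path);
       if (dir == NULL) { perror(path); return 1; }

       while ((entry = readdir(dir)) != NULL) {
          /* Every entry except "." and ".." is a thread id. */
          if (entry->d_name[0] != '.') {
             printf("thread id %s\n", entry->d_name);
             nthreads++;
          }
       }
       closedir(dir);

       printf("%d thread(s) at the Linux level\n", nthreads);
       return 0;
    }

Comparing that count with the thread list reported by
'vgdb --pid=<PIDNR> v.info scheduler' should settle whether the process
really is single-threaded, or whether gdb is simply failing to display
the other threads.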