|
From: Tim B. <ti...@va...> - 2013-10-29 01:29:30
|
I'm using Valgrind 3.8.1 to run fuzzing tests on control software for an
embedded system. The tests are run on
Linux pod 3.11.4-201.fc19.x86_64 #1 SMP Thu Oct 10 14:11:18 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux
Valgrind itself is the standard Fedora 19 build with the following
arguments:
valgrind -v --leak-check=full
--leak-resolution=high
--show-possibly-lost=yes
--track-fds=yes
--num-callers=24
--vgdb=no
Normally all this works fine, but last night I did an overnight run on a
new dataset and after about eight hours valgrind started looping. I
wasn't sure what to do about that... there were no diagnostics either
from my software or from valgrind. When I attached to the program with
strace I got this endless series:
clock_gettime(CLOCK_MONOTONIC, {124051, 458677331}) = 0
rt_sigprocmask(SIG_SETMASK, ~[], ~[ILL TRAP BUS FPE KILL SEGV STOP], 8) = 0
rt_sigtimedwait([HUP INT ILL BUS FPE KILL SEGV TERM STOP WINCH SYS RTMIN
RT_1], 0x4028a5e30, {0, 0}, 8) = -1 EAGAIN (Resource temporarily
unavailable)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP BUS FPE KILL SEGV STOP], NULL, 8) = 0
gettid() = 30464
gettid() = 30464
write(1029, "A", 1) = 1
gettid() = 30464
read(1028, "A", 1) = 1
gettid() = 30464
clock_gettime(CLOCK_MONOTONIC, {124051, 469072047}) = 0
rt_sigprocmask(SIG_SETMASK, ~[], ~[ILL TRAP BUS FPE KILL SEGV STOP], 8) = 0
rt_sigtimedwait([HUP INT ILL BUS FPE KILL SEGV TERM STOP WINCH SYS RTMIN
RT_1], 0x4028a5e30, {0, 0}, 8) = -1 EAGAIN (Resource temporarily
unavailable)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP BUS FPE KILL SEGV STOP], NULL, 8) = 0
gettid() = 30464
gettid() = 30464
write(1029, "B", 1) = 1
gettid() = 30464
read(1028, "B", 1) = 1
gettid()
It loops through the English alphabet, one letter at a time, uppercase A
to uppercase Z, and then repeats. TID 30464 is the PID of the valgrind
process. File descriptors 1028 and 1029 appear to be the opposite ends
of a pipe, but I have no idea why the alphabet would be cycling through
it. The sigtimedwait() call is polling for one of those signals and not
getting it. We have no such calls explicitly in our code, and no RT
calls at all, which leads me to think this is valgrind's internal code,
but of course it could still be measuring something within the main program.
I ran gstack on it but got only:
gstack 30464
#0 0x0000000038033cb0 in vgMemCheck_helperc_LOADV64le ()
#1 0x0000000403290318 in ?? ()
#2 0x00000004028a5f00 in ?? ()
#3 0x0000000000000000 in ?? ()
which doesn't tell me a whole lot, and I don't find vgMemCheck_helperc
anywhere in the code, so I guess it's something synthesized at compile time.
My test is scripted so I can re-run it easily (given time), but could
use some pointers on how to approach this and get more information. I
imagine that rebuilding valgrind with -g might help with gstack, but is
there a way I can signal or stop Valgrind and have it display a stack
trace? It looks like SIGINT is blocked, and sending TERM merely stops
the process and I get nothing.
Thanks for any pointers!
Tim
|
|
From: Philippe W. <phi...@sk...> - 2013-10-29 20:49:02
|
On Tue, 2013-10-29 at 10:24 +0900, Tim Burress wrote:
> I'm using Valgrind 3.8.1 to run fuzzing tests on control software for an
> embedded system. The tests are run on
>
> Linux pod 3.11.4-201.fc19.x86_64 #1 SMP Thu Oct 10 14:11:18 UTC 2013
> x86_64 x86_64 x86_64 GNU/Linux
>
> Valgrind itself is the standard Fedora 19 build with the following
> arguments:
>
> valgrind -v --leak-check=full
> --leak-resolution=high
> --show-possibly-lost=yes
> --track-fds=yes
> --num-callers=24
> --vgdb=no
>
> Normally all this works fine, but last night I did an overnight run on a
> new dataset and after about eight hours valgrind started looping. I
> wasn't sure what to do about that... there were no diagnostics either
> from my software or from valgrind. When I attached to the program with
> strace I got this endless series:
>
> clock_gettime(CLOCK_MONOTONIC, {124051, 458677331}) = 0
> rt_sigprocmask(SIG_SETMASK, ~[], ~[ILL TRAP BUS FPE KILL SEGV STOP], 8) = 0
> rt_sigtimedwait([HUP INT ILL BUS FPE KILL SEGV TERM STOP WINCH SYS RTMIN
> RT_1], 0x4028a5e30, {0, 0}, 8) = -1 EAGAIN (Resource temporarily
> unavailable)
> rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP BUS FPE KILL SEGV STOP], NULL, 8) = 0
> gettid() = 30464
> gettid() = 30464
> write(1029, "A", 1) = 1
> gettid() = 30464
> read(1028, "A", 1) = 1
> gettid() = 30464
> clock_gettime(CLOCK_MONOTONIC, {124051, 469072047}) = 0
> rt_sigprocmask(SIG_SETMASK, ~[], ~[ILL TRAP BUS FPE KILL SEGV STOP], 8) = 0
> rt_sigtimedwait([HUP INT ILL BUS FPE KILL SEGV TERM STOP WINCH SYS RTMIN
> RT_1], 0x4028a5e30, {0, 0}, 8) = -1 EAGAIN (Resource temporarily
> unavailable)
> rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP BUS FPE KILL SEGV STOP], NULL, 8) = 0
> gettid() = 30464
> gettid() = 30464
> write(1029, "B", 1) = 1
> gettid() = 30464
> read(1028, "B", 1) = 1
> gettid()
>
> It loops through the English alphabet, one letter at a time, uppercase A
> to uppercase Z, and then repeats. TID 30464 is the PID of the valgrind
> process. File descriptors 1028 and 1029 appear to be the opposite ends
> of a pipe, but I have no idea why the alphabet would be cycling through
> it. The sigtimedwait() call is polling for one of those signals and not
> getting it. We have no such calls explicitly in our code, and no RT
> calls at all, which leads me to think this is valgrind's internal code,
> but of course it could still be measuring something within the main program.
All the above is caused by the internal Valgrind kitchen.
In particular, the pipes are used to implement a lock.
The write is an "unlock" and the read is a "lock".
>
> I ran gstack on it but got only:
>
> gstack 30464
> #0 0x0000000038033cb0 in vgMemCheck_helperc_LOADV64le ()
> #1 0x0000000403290318 in ?? ()
> #2 0x00000004028a5f00 in ?? ()
> #3 0x0000000000000000 in ?? ()
>
> which doesn't tell me a whole lot, and I don't find vgMemCheck_helperc
> anywhere in the code, so I guess it's something synthesized at compile time.
What you observe above is the host stack trace executing the guest code.
All the above indicates Valgrind is (still) executing the application
code.
Valgrind is (normally) compiled with -g. But in any case, if it was not
(or the Valgrind debug info was not installed on your fedora), this will
not help to get an understandable guest stack trace.
Rather, do not use --vgdb=no, rather keep the default (--vgdb=yes)
or use --vgdb=full (if needed).
You can then use gdb+vgdb to debug your application running under
Valgrind and see e.g. what your application is doing.
For more information look at Valgrind user manual section
3.2. Debugging your program using Valgrind gdbserver and GDB
http://www.valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.gdbserver
Philippe
NB: just for my own info, why are you using --vgdb=no ?
Any problems with it ?
|
|
From: Tim B. <ti...@va...> - 2013-10-30 02:06:30
|
On 10/30/2013 06:50 AM, Philippe Waroquiers wrote: > All the above is caused by the internal Valgrind kitchen. > In particular, the pipes are used to implement a lock. > The write is an "unlock" and the read is a "lock". I see. Thanks very much for explaining. It certainly did look unusual. I'm quickly getting a sense of what's under the hood in Valgrind... a lot more than I thought. > NB: just for my own info, why are you using --vgdb=no ? > Any problems with it ? I'm not exactly sure at this point. It was added to the test script quite some time ago. It may just be there to suppress the informational messages from Valgrind saying how to debug the program. By the way, I was able to find the problem in our software with a good amount of luck and by progressively whittling down the test. It uncovered an unconsidered corner case in some code that goes back ten years. Thanks! Tim |