|
From: Rich C. <rc...@wi...> - 2016-09-17 03:49:37
|
I'm trying to run valgrind on linux-4.7.2 with gcc-6.2.1.
gdb version is GNU gdb (GDB; openSUSE Tumbleweed) 7.11.1

I think the issue is a race condition or something wrong with gdb, as the
test hangs trying process gdbserver.

The only way to get the test to progress is to kill the test process with -9.

Any tips or suggestions for debugging the issue?
Below is some data on the first hung test.
Thanks.

# running test output
mcblocklistsearch: valgrind --tool=memcheck --vgdb=yes --vgdb-error=0 --vgdb-prefix=./vgdb-prefix-mcblocklistsearch -q ./../memcheck/tests/leak-tree
(progB: ./gdb --quiet -l 60 --nx 1>&2 ../memcheck/tests/leak-tree)

# gdb attached current process
where
#0 0x000000003809cc19 in do_syscall_WRK ()
#1 0x000000003809cd0d in vgPlain_do_syscall (sysno=sysno@entry=7, a1=a1@entry=955970576, a2=a2@entry=1, a3=a3@entry=18446744073709551615, a4=a4@entry=0, a5=a5@entry=0, a6=0, a7=0, a8=0)
at m_syscall.c:956
#2 0x0000000038084fd3 in vgPlain_poll (fds=fds@entry=0x38faf410 <remote_desc_pollfdread_activity>, nfds=nfds@entry=1, timeout=timeout@entry=-1)
at m_libcfile.c:623
#3 0x00000000380cd206 in vgPlain_poll_no_eintr (fds=fds@entry=0x38faf410 <remote_desc_pollfdread_activity>, nfds=nfds@entry=1, timeout=timeout@entry=-1)
at m_gdbserver/remote-utils.c:86
#4 0x00000000380cdf20 in readchar (single=0) at m_gdbserver/remote-utils.c:958
#5 0x00000000380ce85b in getpkt (buf=0x80237c150 "") at m_gdbserver/remote-utils.c:1017
#6 0x00000000380cfeb7 in server_main () at m_gdbserver/server.c:1214
#7 0x00000000380cb02c in call_gdbserver (tid=1, reason=reason@entry=core_reason) at m_gdbserver/m_gdbserver.c:721
#8 0x00000000380cbd12 in vgPlain_gdbserver (tid=<optimized out>) at m_gdbserver/m_gdbserver.c:788
#9 0x00000000380d77dc in vgPlain_scheduler (tid=tid@entry=1) at m_scheduler/scheduler.c:1240
#10 0x00000000380e7267 in thread_wrapper (tidW=1) at m_syswrap/syswrap-linux.c:103
#11 run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-linux.c:156
#12 0x0000000000000000 in ?? ()
(gdb) info thr
  Id   Target Id         Frame
* 1    process 32279 "memcheck-amd64-" 0x000000003809cc19 in do_syscall_WRK ()

# process list for test processes
UID   PID   PPID  C STIME TTY    TIME     CMD
coe   21990 22187 0 22:08 pts/21 00:00:00 make regtest
coe   31592 21990 0 22:09 pts/21 00:00:00 /bin/sh -c if /usr/bin/perl tests/vg_regtest gdbserver_tests memcheck cachegrind callgrind massif lackey none helgrind drd exp-sgcheck exp-bbv exp-dhat ; then \ tests/post_regtest_checks /usr/local/ext/src/vg/vg.git gdbserver_tests memcheck cachegrind callgrind massif lackey none helgrind drd exp-sgcheck exp-bbv exp-dhat; \ else \ tests/post_regtest_checks /usr/local/ext/src/vg/vg.git gdbserver_tests memcheck cachegrind callgrind massif lackey none helgrind drd exp-sgcheck exp-bbv exp-dhat; \ false; \ fi
coe   31593 31592 0 22:09 pts/21 00:00:00 /usr/bin/perl tests/vg_regtest gdbserver_tests memcheck cachegrind callgrind massif lackey none helgrind drd exp-sgcheck exp-bbv exp-dhat
coe   32277 31593 0 22:09 pts/21 00:00:00 sh -c VALGRIND_LIB=/usr/local/ext/src/vg/vg.git/.in_place VALGRIND_LIB_INNER=/usr/local/ext/src/vg/vg.git/.in_place /usr/local/ext/src/vg/vg.git/./coregrind/valgrind --command-line-only=yes --memcheck:leak-check=no --tool=mcblocklistsearch --tool=memcheck --vgdb=yes --vgdb-error=0 --vgdb-prefix=./vgdb-prefix-mcblocklistsearch -q ./../memcheck/tests/leak-tree > mcblocklistsearch.stdout.out 2> mcblocklistsearch.stderr.out
coe   32279 32277 0 22:09 pts/21 00:00:00 /usr/local/ext/src/vg/vg.git/./coregrind/valgrind --command-line-only=yes --memcheck:leak-check=no --tool=mcblocklistsearch --tool=memcheck --vgdb=yes --vgdb-error=0 --vgdb-prefix=./vgdb-prefix-mcblocklistsearch -q ./../memcheck/tests/leak-tree

--
Rich Coe rc...@wi...
|
|
From: Philippe W. <phi...@sk...> - 2016-09-17 08:51:45
|
On Fri, 2016-09-16 at 22:34 -0500, Rich Coe wrote:
> I'm trying to run valgrind on linux-4.7.2 with gcc-6.2.1.
> gdb version is GNU gdb (GDB; openSUSE Tumbleweed) 7.11.1
>
> I think the issue is a race condition or something wrong with gdb, as the
> test hangs trying process gdbserver.
>
> The only way to get the test to progress is to kill the test process with -9.
>
> Any tips or suggestions for debugging the issue?
> Below is some data on the first hung test.
The stacktrace shows that Valgrind gdbserver is waiting for
some input from gdb, at the very beginning of execution.
In the list of processes, I however do not see any gdb running.
So, maybe you will see something wrong in the gdb output files
(You should have this in the files
gdbserver_tests/mcblocklistsearch.stdoutB.out
and stderrB.out).
Another thing to try is to run the test manually:
In a window, start the test prog as indicated in
mcblocklistsearch.vgtest
In another window, start gdb (args also in the vgtest)
Then give gdb, one by one, the gdb commands that are in
the file mcblocklistsearch.stdinB.gdb
to investigate what/why/when something goes
wrong/blocks/hangs/crashes/...
See gdbserver_tests/README_DEVELOPERS for some background
info.
Hope this helps
Philippe
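
For readers following along, the first check suggested above (looking at the gdb output files) can be sketched in a few lines of Python. This is only an illustration: the helper name read_gdb_outputs is made up, and it assumes both files are named <test>.stdoutB.out and <test>.stderrB.out inside the test directory, as the message implies. The file content written in the demo is placeholder text, not real gdb output.

```python
import tempfile
from pathlib import Path


def read_gdb_outputs(test_dir, test_name):
    """Return the text of the gdb-side output files for one test.

    Assumption: the files are called <test_name>.stdoutB.out and
    <test_name>.stderrB.out. A missing file is reported as None.
    """
    outputs = {}
    for suffix in ("stdoutB.out", "stderrB.out"):
        path = Path(test_dir) / f"{test_name}.{suffix}"
        outputs[suffix] = path.read_text(errors="replace") if path.exists() else None
    return outputs


if __name__ == "__main__":
    # Demo on a throwaway directory; the text below is a placeholder.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "mcblocklistsearch.stderrB.out").write_text("example stderr text\n")
        for name, text in read_gdb_outputs(tmp, "mcblocklistsearch").items():
            print(name, "->", "missing" if text is None else text.strip())
```

In a real checkout one would call read_gdb_outputs("gdbserver_tests", "mcblocklistsearch") and read whatever gdb or vgdb printed.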
|
|
From: Rich C. <rc...@wi...> - 2016-09-18 13:50:39
|
On Sat, 17 Sep 2016 10:52:01 +0200
Philippe Waroquiers <phi...@sk...> wrote:
> On Fri, 2016-09-16 at 22:34 -0500, Rich Coe wrote:
> > Any tips or suggestions for debugging the issue?
> > Below is some data on the first hung test.
> The stacktrace shows that Valgrind gdbserver is waiting for
> some input from gdb, at the very beginning of execution.
> In the list of processes, I however do not see any gdb running.
> So, maybe you will see something wrong in the gdb output files
> (You should have this in the files
> gdbserver_tests/mcblocklistsearch.stdoutB.out
> and stderrB.out).
Thanks Philippe, that helped a lot.
From these messages in the output file
> no --pid= arg given and multiple valgrind pids found:
> use --pid=24697 for cscope -d
> use --pid=15216 for /opt/google/chrome/chrome [...]
I found vgdb was looking for named pipe files in the gdbserver directory.
There were 200 or so named pipe files there. I cleaned them out and
all the tests are running now.
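
A cleanup like the one described above could be scripted roughly as follows. This is a sketch under assumptions, not part of the Valgrind test suite: the function name find_stale_fifos is made up, and the default name prefix "vgdb-prefix-" is only inferred from the --vgdb-prefix=./vgdb-prefix-mcblocklistsearch option in the test command line. The script cannot tell whether a Valgrind process still uses a pipe, so check that no tests are running before deleting.

```python
import os
import stat
import tempfile


def find_stale_fifos(directory, prefix="vgdb-prefix-", delete=False):
    """List named pipes in `directory` whose name starts with `prefix`.

    The prefix is an assumption based on the --vgdb-prefix value used by
    the tests. With delete=True the listed pipes are removed as well.
    Regular files and pipes with other names are left alone.
    """
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.name.startswith(prefix):
                continue
            if stat.S_ISFIFO(entry.stat(follow_symlinks=False).st_mode):
                found.append(entry.path)
    if delete:
        for path in found:
            os.unlink(path)
    return sorted(os.path.basename(path) for path in found)


if __name__ == "__main__":
    # Demo on a throwaway directory with one made-up pipe name.
    with tempfile.TemporaryDirectory() as tmp:
        os.mkfifo(os.path.join(tmp, "vgdb-prefix-demo-from-vgdb-to-1"))
        print(find_stale_fifos(tmp))
```

Running it first without delete=True gives a list to review before anything is removed.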
I found the helgrind and drd tests bar_bad are hanging:
#0 vgModuleLocal_do_syscall_for_client_WRK () at m_syswrap/syscall-amd64-linux.S:173
#1 0x00000000380c0638 in do_syscall_for_client (syscall_mask=0x802c53d80, tst=0x802018f90, syscallno=202)
at m_syswrap/syswrap-main.c:339
#2 vgPlain_client_syscall (tid=tid@entry=1, trc=trc@entry=73) at m_syswrap/syswrap-main.c:2007
#3 0x00000000380bd02b in handle_syscall (tid=tid@entry=1, trc=73) at m_scheduler/scheduler.c:1118
#4 0x00000000380be697 in vgPlain_scheduler (tid=tid@entry=1) at m_scheduler/scheduler.c:1435
#5 0x00000000380cdc07 in thread_wrapper (tidW=1) at m_syswrap/syswrap-linux.c:103
#6 run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-linux.c:156
#7 0x0000000000000000 in ?? ()
x/i
=> 0x381257ac <vgModuleLocal_do_syscall_for_client_WRK+89>: pop %r8
I'm not sure yet why this is hanging on this instruction. I'll have to
investigate further.
Rich
--
Rich Coe rc...@wi...
|
|
From: Ivo R. <iv...@iv...> - 2016-09-18 19:45:16
|
2016-09-18 15:50 GMT+02:00 Rich Coe <rc...@wi...>:
> I found the helgrind and drd tests bar_bad are hanging:
>
> #0 vgModuleLocal_do_syscall_for_client_WRK () at
> m_syswrap/syscall-amd64-linux.S:173
> #1 0x00000000380c0638 in do_syscall_for_client (syscall_mask=0x802c53d80,
> tst=0x802018f90, syscallno=202)
> at m_syswrap/syswrap-main.c:339
> #2 vgPlain_client_syscall (tid=tid@entry=1, trc=trc@entry=73) at
> m_syswrap/syswrap-main.c:2007
> #3 0x00000000380bd02b in handle_syscall (tid=tid@entry=1, trc=73) at
> m_scheduler/scheduler.c:1118
> #4 0x00000000380be697 in vgPlain_scheduler (tid=tid@entry=1) at
> m_scheduler/scheduler.c:1435
> #5 0x00000000380cdc07 in thread_wrapper (tidW=1) at
> m_syswrap/syswrap-linux.c:103
> #6 run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-linux.c:156
> #7 0x0000000000000000 in ?? ()
>

On my Linux/Ubuntu box these tests hang even when run natively during the
fourth testcase described as
"destroy a barrier that has waiting threads". According to POSIX [1]:
"...The results are undefined if *pthread_barrier_destroy*() is called when
any thread is blocked on the barrier... "
which bar_bad precisely does.

Perhaps a timer which would time out this hang would be handy here.

I.

[1] http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_barrier_destroy.html
|
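
The timer idea can be illustrated at the level of a test runner. The sketch below is not how tests/vg_regtest works; it only shows the general technique with Python's standard subprocess module, and the helper name run_with_timeout is made up. A child that sleeps stands in for a hanging test such as bar_bad.

```python
import subprocess
import sys


def run_with_timeout(cmd, seconds):
    """Run `cmd`; return its exit code, or None if it exceeded `seconds`.

    subprocess.run() kills the child and waits for it when the timeout
    expires, so a hung test cannot block the whole run.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Stand-in for a hanging test: a child process that just sleeps.
    hanging = [sys.executable, "-c", "import time; time.sleep(30)"]
    result = run_with_timeout(hanging, 1)
    print("timed out" if result is None else f"exit code {result}")
```

A runner built this way would report the timed-out test as a failure and carry on, instead of needing a manual kill -9.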
|
From: Mark W. <mj...@re...> - 2016-09-18 20:18:27
|
On Sun, Sep 18, 2016 at 09:45:08PM +0200, Ivo Raisr wrote:
> 2016-09-18 15:50 GMT+02:00 Rich Coe <rc...@wi...>:
>
> > I found the helgrind and drd tests bar_bad are hanging:
> >
> > #0 vgModuleLocal_do_syscall_for_client_WRK () at
> > m_syswrap/syscall-amd64-linux.S:173
> > #1 0x00000000380c0638 in do_syscall_for_client (syscall_mask=0x802c53d80,
> > tst=0x802018f90, syscallno=202)
> > at m_syswrap/syswrap-main.c:339
> > #2 vgPlain_client_syscall (tid=tid@entry=1, trc=trc@entry=73) at
> > m_syswrap/syswrap-main.c:2007
> > #3 0x00000000380bd02b in handle_syscall (tid=tid@entry=1, trc=73) at
> > m_scheduler/scheduler.c:1118
> > #4 0x00000000380be697 in vgPlain_scheduler (tid=tid@entry=1) at
> > m_scheduler/scheduler.c:1435
> > #5 0x00000000380cdc07 in thread_wrapper (tidW=1) at
> > m_syswrap/syswrap-linux.c:103
> > #6 run_a_thread_NORETURN (tidW=1) at m_syswrap/syswrap-linux.c:156
> > #7 0x0000000000000000 in ?? ()
> >
>
> On my Linux/Ubuntu box these tests hang even when run natively during the
> fourth testcase described as
> "destroy a barrier that has waiting threads". According to POSIX [1]:
> "...The results are undefined if *pthread_barrier_destroy*() is called when
> any thread is blocked on the barrier... "
> which bar_bad precisely does.
>
> Perhaps a timer which would time out this hang would be handy here.

My apologies for not recognizing this earlier. We carry a patch in Fedora
to work around this test hang. The workaround is attached to this bug
report: https://bugsfiles.kde.org/attachment.cgi?id=96765

As you can read in that bug report the workaround isn't ideal because
it might still FAIL the test. If someone could take a look at the bug
and proposed patch to see if it can somehow be changed so that it
makes the test reliably pass with older and newer glibc that would be
very appreciated.

Thanks,

Mark
|
|
From: Mark W. <mj...@re...> - 2016-09-20 09:34:17
|
On Sun, 2016-09-18 at 22:18 +0200, Mark Wielaard wrote:
> > On my Linux/Ubuntu box these tests hang even when run natively during the
> > fourth testcase described as
> > "destroy a barrier that has waiting threads". According to POSIX [1]:
> > "...The results are undefined if *pthread_barrier_destroy*() is called when
> > any thread is blocked on the barrier... "
> > which bar_bad precisely does.
> >
> > Perhaps a timer which would time out this hang would be handy here.
>
> My apologies for not recognizing this earlier. We carry a patch in Fedora
> to work around this test hang. The workaround is attached to this bug
> report: https://bugsfiles.kde.org/attachment.cgi?id=96765
>
> As you can read in that bug report the workaround isn't ideal because
> it might still FAIL the test. If someone could take a look at the bug
> and proposed patch to see if it can somehow be changed so that it
> makes the test reliably pass with older and newer glibc that would be
> very appreciated.

I checked in my workaround as valgrind svn r15962 (adding a missing exp
file in svn r15966, sorry about that). It adds a sleeping thread that
tries to unblock the barrier in case the test hangs (plus new exp files
that describe that situation). It does seem to unblock the test, but it
adds (more) non-determinism that seems to make the test fail more than
before. So I kept the bug report open:
https://bugs.kde.org/show_bug.cgi?id=358213

Could people check whether with this new testcase the test no longer
hangs and whether or not it passes? If it fails could you double check
whether or not it fails always or just sometimes? And if you know why
that would be very helpful to know!

Thanks,

Mark
|
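
The "sleeping thread that unblocks the barrier" idea can be pictured with a small analogy. The sketch below is not the C code checked in as r15962; it uses Python's threading.Barrier rather than pthread barriers, the name wait_with_rescuer is made up, and it assumes that "unblocking" means the helper joins the barrier as the missing party.

```python
import threading
import time


def wait_with_rescuer(delay):
    """Block a worker on a 2-party barrier, then release it via a helper.

    The helper thread sleeps `delay` seconds and then waits on the same
    barrier, which completes the party count and frees the worker.
    Returns True if the worker got past the barrier.
    """
    barrier = threading.Barrier(2)
    released = threading.Event()

    def worker():
        barrier.wait()   # would block forever without a second party
        released.set()

    def rescuer():
        time.sleep(delay)
        barrier.wait()   # second party arrives; both threads proceed

    threads = [threading.Thread(target=worker), threading.Thread(target=rescuer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join(timeout=delay + 5)
    return released.is_set()


if __name__ == "__main__":
    print("worker released:", wait_with_rescuer(0.2))
```

The sleep makes the outcome depend on timing, which matches the extra non-determinism mentioned above: what the test prints can depend on which thread runs first.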
|
From: Rich C. <rc...@wi...> - 2016-09-20 18:15:10
|
On Tue, 20 Sep 2016 11:34:08 +0200
Mark Wielaard <mj...@re...> wrote:
> On Sun, 2016-09-18 at 22:18 +0200, Mark Wielaard wrote:
> Could people check whether with this new testcase the test no longer
> hangs and whether or not it passes? If it fails could you double check
> whether or not it fails always or just sometimes? And if you know why
> that would be very helpful to know!
>

Thanks Mark.
The bar_bad testcases are no longer hanging and they are passing.

Rich
--
Rich Coe rc...@wi...
|