|
From: Paul F. <pj...@wa...> - 2023-08-26 05:53:33
|
Hi
I was just looking at valgrind-testresults and there was a jump in the
number of failures on ppc64le on Aug 17th, just after the deferred
debuginfo reading change.
One random example
=================================================
./valgrind-old/drd/tests/tc16_byterace.stderr.diff
=================================================
--- tc16_byterace.stderr.exp 2023-08-17 03:01:09.168107928 +0000
+++ tc16_byterace.stderr.out 2023-08-17 03:28:20.030515805 +0000
@@ -1,8 +1,7 @@
Conflicting load by thread 1 at 0x........ size 1
at 0x........: main (tc16_byterace.c:34)
-Location 0x........ is 0 bytes inside bytes[4],
-a global variable declared at tc16_byterace.c:7
+Allocation context: BSS section of tc16_byterace
It does look to me like this is a debuginfo issue.
Can anyone take a look?
A+
Paul
|
|
From: Mark W. <ma...@kl...> - 2023-08-27 15:36:24
|
Hi Paul,
On Sat, Aug 26, 2023 at 06:26:29AM +0200, Paul Floyd wrote:
> I was just looking at valgrind-testresults and there was a jump in
> the number of failures on ppc64le on Aug 17th, just after the
> deferred debuginfo reading change.
>
> One random example
>
> =================================================
> ./valgrind-old/drd/tests/tc16_byterace.stderr.diff
> =================================================
> --- tc16_byterace.stderr.exp 2023-08-17 03:01:09.168107928 +0000
> +++ tc16_byterace.stderr.out 2023-08-17 03:28:20.030515805 +0000
> @@ -1,8 +1,7 @@
>
> Conflicting load by thread 1 at 0x........ size 1
> at 0x........: main (tc16_byterace.c:34)
> -Location 0x........ is 0 bytes inside bytes[4],
> -a global variable declared at tc16_byterace.c:7
> +Allocation context: BSS section of tc16_byterace
>
> It does look to me like this is a debuginfo issue.
You are correct, this was caused by:
commit 60f7e89ba32b54d73b9e36d49e28d0f559ade0b9
Author: Aaron Merey <am...@re...>
Date: Fri Jun 30 18:31:42 2023 -0400
Support lazy reading and downloading of DWARF debuginfo
That commit shouldn't have been architecture specific, but it
apparently was. I put some early analysis into the bug
https://bugs.kde.org/show_bug.cgi?id=471807#c16
The patch depends on a call to find_DiCfSI triggering a full debuginfo load.
find_DiCfSI is (indirectly) called when ML_(get_CFA) is called.
It looks like ppc64le doesn't call ML_(get_CFA) because we have the following in
coregrind/m_debuginfo/d3basics.c
#if defined(VGP_ppc32_linux) || defined(VGP_ppc64be_linux) \
|| defined(VGP_ppc64le_linux)
/* Valgrind on ppc32/ppc64 currently doesn't use unwind info. */
uw1 = ML_(read_Addr)((UChar*)regs->sp);
#else
uw1 = ML_(get_CFA)(regs->ip, regs->sp, regs->fp, 0, ~(UWord) 0);
#endif
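As a rough illustration of the dependency described above, here is a toy model. All names (ToyObject, toy_unwind_via_cfi, and so on) are made up for this sketch and are not Valgrind's data structures or API. It only shows why a code path that never performs the CFI lookup also never triggers the deferred load, so a later data-address description falls back to naming the section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one mapped object; NOT Valgrind's DebugInfo struct. */
typedef struct {
    const char *soname;
    bool full_debuginfo_loaded;  /* set once the deferred read has happened */
} ToyObject;

typedef enum { TOY_DESC_SECTION_ONLY, TOY_DESC_VARIABLE } ToyDesc;

/* Hypothetical lazy loader: reads the full debuginfo on first use. */
void toy_load_if_needed(ToyObject *obj)
{
    assert(obj != NULL);
    obj->full_debuginfo_loaded = true;
}

/* Unwind step that goes through a CFI lookup. In this model the lookup
   is what triggers the deferred load as a side effect. */
void toy_unwind_via_cfi(ToyObject *obj)
{
    toy_load_if_needed(obj);
    /* ... CFI-based computation of the caller's frame would go here ... */
}

/* Unwind step that just follows the back chain stored at sp, like the
   ppc branch quoted above. No debuginfo lookup, so no load happens. */
void toy_unwind_via_backchain(ToyObject *obj)
{
    (void)obj;
    /* ... the saved stack pointer would be read directly from memory ... */
}

/* A data address can be described as a named variable only if the full
   debuginfo is present; otherwise only the section is known. */
ToyDesc toy_describe_data_addr(const ToyObject *obj)
{
    assert(obj != NULL);
    return obj->full_debuginfo_loaded ? TOY_DESC_VARIABLE
                                      : TOY_DESC_SECTION_ONLY;
}
```

In this model, an object that has only been unwound via the back chain is described as TOY_DESC_SECTION_ONLY, which corresponds to the "BSS section of tc16_byterace" line in the failing diff.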
Cheers,
Mark
|
|
From: Carl L. <ce...@us...> - 2023-08-29 19:15:52
|
Mark, Paul, Aaron:
On Sun, 2023-08-27 at 17:36 +0200, Mark Wielaard wrote:
> Hi Paul,
>
> On Sat, Aug 26, 2023 at 06:26:29AM +0200, Paul Floyd wrote:
> > I was just looking at valgrind-testresults and there was a jump in
> > the number of failures on ppc64le on Aug 17th, just after the
> > deferred debuginfo reading change.
> >
> > One random example
> >
> > =================================================
> > ./valgrind-old/drd/tests/tc16_byterace.stderr.diff
> > =================================================
> > --- tc16_byterace.stderr.exp 2023-08-17 03:01:09.168107928 +0000
> > +++ tc16_byterace.stderr.out 2023-08-17 03:28:20.030515805 +0000
> > @@ -1,8 +1,7 @@
> >
> > Conflicting load by thread 1 at 0x........ size 1
> > at 0x........: main (tc16_byterace.c:34)
> > -Location 0x........ is 0 bytes inside bytes[4],
> > -a global variable declared at tc16_byterace.c:7
> > +Allocation context: BSS section of tc16_byterace
> >
> > It does look to me like this is a debuginfo issue.
>
> You are correct, this was caused by:
>
> commit 60f7e89ba32b54d73b9e36d49e28d0f559ade0b9
> Author: Aaron Merey <am...@re...>
> Date: Fri Jun 30 18:31:42 2023 -0400
>
> Support lazy reading and downloading of DWARF debuginfo
>
> That commit shouldn't have been architecture specific, but it
> apparently was. I put some early analysis into the bug
>
> https://bugs.kde.org/show_bug.cgi?id=471807#c16
>
> The patch depends on a call to find_DiCfSI triggering a full
> debuginfo load.
> find_DiCfSI is (indirectly called) when ML_(get_CFA) is called.
> It looks like ppc64le doesn't call ML_(get_CFA) because we have the
> following in
> coregrind/m_debuginfo/d3basics.c
>
> #if defined(VGP_ppc32_linux) || defined(VGP_ppc64be_linux) \
> || defined(VGP_ppc64le_linux)
> /* Valgrind on ppc32/ppc64 currently doesn't use unwind info. */
> uw1 = ML_(read_Addr)((UChar*)regs->sp);
> #else
> uw1 = ML_(get_CFA)(regs->ip, regs->sp, regs->fp, 0, ~(UWord) 0);
> #endif
I verified that the patch from Aaron causes regression failures on
Power 9 and Power 10. Per the comment above, not sure why PowerPC does
not support the get_CFA call? Unfortunately, I don't know much about
callgrind or the debuginfo stuff. Not obvious to me at first glance
how to fix the issue.
I would be happy to help test a patch or work on a patch if someone has
specific suggestions on how to fix the issue on PowerPC.
Carl
|
|
From: Carl L. <ce...@us...> - 2023-09-01 18:59:57
|
Mark:
On Fri, 2023-09-01 at 16:21 +0200, Mark Wielaard wrote:
> Hi Carl,
>
> On Thu, 2023-08-31 at 15:38 -0700, Carl Love wrote:
> > So, I then tried to run the same test on a Power 8LE system Ubuntu
> > 20.04.5 LTS (Focal Fossa). I get:
> >
> > valgrind --tool=memcheck -q ./memcheck/tests/doublefree > out-current
> >
> > valgrind: Fatal error at startup: a function redirection
> > valgrind: which is mandatory for this platform-tool combination
> > valgrind: cannot be set up. Details of the redirection are:
> > valgrind:
> > valgrind: A must-be-redirected function
> > valgrind: whose name matches the pattern: strlen
> > valgrind: in an object with soname matching: ld64.so.2
> > valgrind: was not found whilst processing
> > valgrind: symbols from the object with soname: ld64.so.2
> > valgrind:
> > valgrind: Possible fixes: (1, short term): install glibc's debuginfo
> > valgrind: package on this machine. (2, longer term): ask the packagers
> > valgrind: for your Linux distribution to please in future ship a non-
> > valgrind: stripped ld.so (or whatever the dynamic linker .so is called)
> > valgrind: that exports the above-named function using the standard
> > valgrind: calling conventions for this platform. The package you need
> > valgrind: to install for fix (1) is called
> > valgrind:
> > valgrind: On Debian, Ubuntu: libc6-dbg
> > valgrind: On SuSE, openSuSE, Fedora, RHEL: glibc-debuginfo
> > valgrind:
> > valgrind: Note that if you are debugging a 32 bit process on a
> > valgrind: 64 bit system, you will need a corresponding 32 bit debuginfo
> > valgrind: package (e.g. libc6-dbg:i386).
> > valgrind:
> > valgrind: Cannot continue -- exiting now. Sorry.
> >
> > When I put in my print statements, I see the call to
> > read_elf_symtab__normal instead of read_elf_symtab__ppc64be_linux as
> > expected. It appears that some of the image file is read as I see a
> > second call to di_notify_ACHIEVE_ACCEPT_STATE, read_elf_object which I
> > don't see on the BE system before the run fails.
>
> So the above is indeed not architecture, but Debian/Ubuntu specific.
> It is tracked as
> https://bugs.kde.org/show_bug.cgi?id=473745
>
> It is because the ld.so symtab is in a separate dbg package, which
> (now) isn't loaded early anymore when resolving the hardwired
> redirects. It doesn't happen on other distros because they keep
> symtab in ld.so.
I added the attached patch and tested on the four different platforms.
The git tree for all four systems was:
commit 053cf5ff31e4a2d65726af431824bf30172d21ed (HEAD -> master)
Author: Mark Wielaard <ma...@kl...>
Date: Fri Sep 1 19:10:17 2023 +0200
Explicitly load libc and any sonames that contain mandatory specs
https://bugs.kde.org/show_bug.cgi?id=473745
commit d76ddc0981862bde160a92baf362d3baf2633368
Author: Aaron Merey <am...@re...>
Date: Wed Aug 30 14:49:09 2023 -0400
Fix lazy debuginfo loading on ppc64le
Lazy debuginfo loading introduced in commit 60f7e89ba32 assumed that
either describe_IP or find_DiCfSI will be called before stacktrace
printing. describe_IP and find_DiCfSI cause debuginfo to be lazily
loaded before symtab lookup occurs during stacktraces.
However this assumption does not hold true on ppc64le, resulting in
debuginfo failing to load in time for stacktraces.
Fix this by loading debuginfo during get_StackTrace_wrk on ppc arches.
commit c934430d56c2add25002ea8e321bd8bdab80fc99 (origin/master, origin/HEAD)
Author: Paul Floyd <pj...@wa...>
Date: Thu Aug 31 15:32:21 2023 +0200
Bug 473870 - FreeBSD 14 applications fail early at startup
FreeBSD recently started adding some functions using
@gnu_indirect_function, specifically strpcmp which was causing
this crash. When running and encountering this ifunc Valgrind
looked for the ifunc_handler. But there wasn't one for FreeBSD
so Valgrind asserted.
The test results are:
machine     pre-lazy-load           current mainline        with ppc debuginfo fix  with Explicitly-load-
                                    (as of 8/31/2023)                               libc-and-any-sonames
Power 8 LE Red Hat Enterprise Linux Server 7.9 (Maipo)
            707 tests               708 tests               708 tests               708 tests
            4 stderr failures       280 stderr failures     247 stderr failures     4 stderr failures
            0 stdout failures       54 stdout failures      54 stdout failures      0 stdout failures
            13 stderrB failures     16 stderrB failures     16 stderrB failures     13 stderrB failures
            0 stdoutB failures      11 stdoutB failures     12 stdoutB failures     0 stdoutB failures
            9 post failures         13 post failures        9 post failures         9 post failures
Power 8 BE Ubuntu 20.04.5 LTS (Focal Fossa)
            742 tests               743 tests               743 tests               743 tests
            2 stderr failures       671 stderr failures     671 stderr failures     671 stderr failures
            0 stdout failures       152 stdout failures     152 stdout failures     152 stdout failures
            0 stderrB failures      14 stderrB failures     14 stderrB failures     14 stderrB failures
            2 stdoutB failures      20 stdoutB failures     20 stdoutB failures     20 stdoutB failures
            9 post failures         43 post failures        43 post failures        43 post failures
Power 9 LE Ubuntu 20.04.5 LTS (Focal Fossa)
            711 tests               712 tests               712 tests               712 tests
            4 stderr failures       280 stderr failures     247 stderr failures     4 stderr failures
            0 stdout failures       54 stdout failures      54 stdout failures      0 stdout failures
            13 stderrB failures     16 stderrB failures     16 stderrB failures     13 stderrB failures
            0 stdoutB failures      12 stdoutB failures     12 stdoutB failures     0 stdoutB failures
            9 post failures         13 post failures        9 post failures         9 post failures
Power 10 LE Red Hat Enterprise Linux 9.0 (Plow)
            719 tests               720 tests               720 tests               720 tests
            2 stderr failures       42 stderr failures      2 stderr failures       2 stderr failures
            0 stdout failures       0 stdout failures       0 stdout failures       0 stdout failures
            2 stderrB failures      2 stderrB failures      2 stderrB failures      2 stderrB failures
            10 stdoutB failures     10 stdoutB failures     10 stdoutB failures     10 stdoutB failures
            0 post failures         3 post failures         0 post failures         0 post failures
The Explicitly-load-libc-and-any-sonames-that-contain-ma.patch seems
to fix the issues across the various OS distributions for the LE
machines. It does appear that there is a separate issue with the
original patch to lazily load the debug info on PowerPC BE.
Hopefully we can sort that issue out when Aaron gets back from
vacation.
Thanks for your help with the latest patch.
Carl
|
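To make the redirect problem from the quoted text above more concrete, here is a minimal sketch of the decision that the commit title "Explicitly load libc and any sonames that contain mandatory specs" suggests. It is an assumption-laden toy, not the actual patch: ToyRedirSpec and toy_needs_eager_load are hypothetical names, and the exact string compare stands in for whatever pattern matching the real redirect specs use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy redirect spec; NOT Valgrind's real spec structure. */
typedef struct {
    const char *soname;   /* object the redirect applies to */
    const char *fn_name;  /* function that must be redirected */
    bool mandatory;       /* startup fails if this cannot be resolved */
} ToyRedirSpec;

/* Decide whether an object's symbols must be read as soon as it is
   mapped, instead of waiting for the deferred (lazy) load.
   Assumption: libc is recognised by a "libc.so" soname prefix, and
   spec sonames are compared exactly rather than as patterns. */
bool toy_needs_eager_load(const char *soname,
                          const ToyRedirSpec *specs, size_t n_specs)
{
    assert(soname != NULL);
    if (strncmp(soname, "libc.so", 7) == 0)
        return true;
    for (size_t i = 0; i < n_specs; i++) {
        if (specs[i].mandatory && strcmp(specs[i].soname, soname) == 0)
            return true;
    }
    return false;
}
```

With a mandatory spec for strlen in ld64.so.2, as in the error message quoted above, this toy check would request an eager load for ld64.so.2 and for libc, while leaving all other objects to the lazy path.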
|
From: Mark W. <ma...@kl...> - 2023-09-02 00:08:12
|
Hi Carl,
On Fri, Sep 01, 2023 at 11:59:39AM -0700, Carl Love via Valgrind-developers wrote:
> The test results are:
>
> machine     pre-lazy-load           current mainline        with ppc debuginfo fix  with Explicitly-load-
>                                     (as of 8/31/2023)                               libc-and-any-sonames
>
> Power 8 LE Red Hat Enterprise Linux Server 7.9 (Maipo)
>             707 tests               708 tests               708 tests               708 tests
>             4 stderr failures       280 stderr failures     247 stderr failures     4 stderr failures
>             0 stdout failures       54 stdout failures      54 stdout failures      0 stdout failures
>             13 stderrB failures     16 stderrB failures     16 stderrB failures     13 stderrB failures
>             0 stdoutB failures      11 stdoutB failures     12 stdoutB failures     0 stdoutB failures
>             9 post failures         13 post failures        9 post failures         9 post failures
>
> Power 8 BE Ubuntu 20.04.5 LTS (Focal Fossa)
>             742 tests               743 tests               743 tests               743 tests
>             2 stderr failures       671 stderr failures     671 stderr failures     671 stderr failures
>             0 stdout failures       152 stdout failures     152 stdout failures     152 stdout failures
>             0 stderrB failures      14 stderrB failures     14 stderrB failures     14 stderrB failures
>             2 stdoutB failures      20 stdoutB failures     20 stdoutB failures     20 stdoutB failures
>             9 post failures         43 post failures        43 post failures        43 post failures
>
> Power 9 LE Ubuntu 20.04.5 LTS (Focal Fossa)
>             711 tests               712 tests               712 tests               712 tests
>             4 stderr failures       280 stderr failures     247 stderr failures     4 stderr failures
>             0 stdout failures       54 stdout failures      54 stdout failures      0 stdout failures
>             13 stderrB failures     16 stderrB failures     16 stderrB failures     13 stderrB failures
>             0 stdoutB failures      12 stdoutB failures     12 stdoutB failures     0 stdoutB failures
>             9 post failures         13 post failures        9 post failures         9 post failures
>
> Power 10 LE Red Hat Enterprise Linux 9.0 (Plow)
>             719 tests               720 tests               720 tests               720 tests
>             2 stderr failures       42 stderr failures      2 stderr failures       2 stderr failures
>             0 stdout failures       0 stdout failures       0 stdout failures       0 stdout failures
>             2 stderrB failures      2 stderrB failures      2 stderrB failures      2 stderrB failures
>             10 stdoutB failures     10 stdoutB failures     10 stdoutB failures     10 stdoutB failures
>             0 post failures         3 post failures         0 post failures         0 post failures
>
> The Explicitly-load-libc-and-any-sonames-that-contain-ma.patch seems
> to fix the issues across the various OS distributions for the LE
> machines. It does appear that there is a separate issue with the
> original patch to lazily load the debug info on PowerPC BE.
> Hopefully we can sort that issue out when Aaron gets back from
> vacation.
The little endian results look really good. Thanks for all the
testing. I pushed both Aaron's and my commit. I don't yet fully
understand what is going on with big endian.
Cheers,
Mark
|
|
From: Aaron M. <am...@re...> - 2023-08-30 19:10:08
Attachments:
0001-Fix-lazy-debuginfo-loading-on-ppc64le.patch
|
Hi Carl,
Sorry for the delay. I'm currently away for the next couple weeks,
however I was able to take a look at these regressions.
It looks like debuginfo is not always lazily loaded on ppc64le since
it's possible for neither describe_IP or find_DiCfSI to be called before
symtab lookups during stacktrace. describe_IP and find_DiCfSI contain
calls to lazily load debuginfo, so if they are not called before
stacktrace printing it results in missing debuginfo and lower quality
stacktraces.
I've attached a patch that fixed the regressions for me when I tested
this on a ppc64le machine. It adds lazy debuginfo loading during ppc
get_StackTrace_wrk.
Aaron
On Tue, Aug 29, 2023 at 3:15 PM Carl Love <ce...@us...> wrote:
>
> Mark, Paul, Aaron:
>
> On Sun, 2023-08-27 at 17:36 +0200, Mark Wielaard wrote:
> > Hi Paul,
> >
> > On Sat, Aug 26, 2023 at 06:26:29AM +0200, Paul Floyd wrote:
> > > I was just looking at valgrind-testresults and there was a jump in
> > > the number of failures on ppc64le on Aug 17th, just after the
> > > deferred debuginfo reading change.
> > >
> > > One random example
> > >
> > > =================================================
> > > ./valgrind-old/drd/tests/tc16_byterace.stderr.diff
> > > =================================================
> > > --- tc16_byterace.stderr.exp 2023-08-17 03:01:09.168107928 +0000
> > > +++ tc16_byterace.stderr.out 2023-08-17 03:28:20.030515805 +0000
> > > @@ -1,8 +1,7 @@
> > >
> > > Conflicting load by thread 1 at 0x........ size 1
> > > at 0x........: main (tc16_byterace.c:34)
> > > -Location 0x........ is 0 bytes inside bytes[4],
> > > -a global variable declared at tc16_byterace.c:7
> > > +Allocation context: BSS section of tc16_byterace
> > >
> > > It does look to me like this is a debuginfo issue.
> >
> > You are correct, this was caused by:
> >
> > commit 60f7e89ba32b54d73b9e36d49e28d0f559ade0b9
> > Author: Aaron Merey <am...@re...>
> > Date: Fri Jun 30 18:31:42 2023 -0400
> >
> > Support lazy reading and downloading of DWARF debuginfo
> >
> > That commit shouldn't have been architecture specific, but it
> > apparently was. I put some early analysis into the bug
> >
> > https://bugs.kde.org/show_bug.cgi?id=471807#c16
> >
> > The patch depends on a call to find_DiCfSI triggering a full
> > debuginfo load.
> > find_DiCfSI is (indirectly called) when ML_(get_CFA) is called.
> > It looks like ppc64le doesn't call ML_(get_CFA) because we have the
> > following in
> > coregrind/m_debuginfo/d3basics.c
> >
> > #if defined(VGP_ppc32_linux) || defined(VGP_ppc64be_linux) \
> > || defined(VGP_ppc64le_linux)
> > /* Valgrind on ppc32/ppc64 currently doesn't use unwind info. */
> > uw1 = ML_(read_Addr)((UChar*)regs->sp);
> > #else
> > uw1 = ML_(get_CFA)(regs->ip, regs->sp, regs->fp, 0, ~(UWord) 0);
> > #endif
>
> I verified that the patch from Aaron causes regression failures on
> Power 9 and Power 10. Per the comment above, not sure why PowerPC does
> not support the get_CFA call? Unfortunately, I don't know much about
> callgrind or the debuginfo stuff. Not obvious to me at first glance
> how to fix the issue.
>
> I would be happy to help test a patch or work on a patch if someone has
> specific suggestions on how to fix the issue on PowerPC.
>
> Carl
>
|
|
From: Carl L. <ce...@us...> - 2023-08-30 22:48:42
|
Aaron:
On Wed, 2023-08-30 at 15:09 -0400, Aaron Merey wrote:
> Hi Carl,
>
> Sorry for the delay. I'm currently away for the next couple weeks,
> however
> I was able to take a look at these regressions.
>
> It looks like debuginfo is not always lazily loaded on ppc64le since
> it's
> possible for neither describe_IP or find_DiCfSI to be called before
> symtab
> lookups during stacktrace. describe_IP and find_DiCfSI contain calls
> to lazily load debuginfo, so if they are not called before stacktrace
> printing
> it results in missing debuginfo and lower quality stacktraces.
>
> I've attached a patch that fixed the regressions for me when I tested
> this on
> a ppc64le machine. It adds lazy debuginfo loading during ppc
> get_StackTrace_wrk.
>
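The fix described in the quoted text (loading debuginfo explicitly during the ppc stack walk) can be sketched roughly as follows. This is a simplified model under stated assumptions, not the attached patch: ToyMapping, toy_ensure_debuginfo_for_ip and toy_walk_stack are hypothetical names, and the frame addresses are handed in as an array rather than being read from a real back chain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy record of one mapped text range; NOT a Valgrind structure. */
typedef struct {
    unsigned long start, end;   /* text range [start, end) */
    bool debuginfo_loaded;      /* has the deferred read been done? */
} ToyMapping;

/* Find the mapping containing ip and make sure its deferred debuginfo
   has been read. Returns true if some mapping covers ip. */
bool toy_ensure_debuginfo_for_ip(ToyMapping *maps, size_t n_maps,
                                 unsigned long ip)
{
    for (size_t i = 0; i < n_maps; i++) {
        if (ip >= maps[i].start && ip < maps[i].end) {
            maps[i].debuginfo_loaded = true;  /* stands in for the real load */
            return true;
        }
    }
    return false;
}

/* Back-chain style walk: the frame addresses do not come from CFI, so
   nothing loads debuginfo as a side effect. The explicit call below is
   the only trigger in this model. Returns the number of frames stored. */
size_t toy_walk_stack(ToyMapping *maps, size_t n_maps,
                      const unsigned long *frame_ips, size_t n_frames,
                      unsigned long *out_ips, size_t max_out)
{
    assert(maps != NULL && frame_ips != NULL && out_ips != NULL);
    size_t n = 0;
    for (size_t i = 0; i < n_frames && n < max_out; i++) {
        toy_ensure_debuginfo_for_ip(maps, n_maps, frame_ips[i]);
        out_ips[n++] = frame_ips[i];
    }
    return n;
}
```

Only mappings that actually contain a walked frame address get loaded here, so the loading stays lazy for every object the stack trace never touches.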
Thanks for taking a look at the issue. I tested the patch on a variety
of machines and get mixed results. Here is what I am seeing before the
commit to add the lazy loading, with the current Valgrind mainline
(includes the lazy commit) and with the patch to fix the lazy load on
Power:
machine     pre-lazy-load           current mainline        with ppc debuginfo fix
Power 8 LE  707 tests               708 tests               708 tests
            4 stderr failures       280 stderr failures     247 stderr failures
            0 stdout failures       54 stdout failures      54 stdout failures
            13 stderrB failures     16 stderrB failures     16 stderrB failures
            0 stdoutB failures      11 stdoutB failures     12 stdoutB failures
            9 post failures         13 post failures        9 post failures
Power 8 BE  742 tests               743 tests               743 tests
            2 stderr failures       671 stderr failures     671 stderr failures
            0 stdout failures       152 stdout failures     152 stdout failures
            0 stderrB failures      14 stderrB failures     14 stderrB failures
            2 stdoutB failures      20 stdoutB failures     20 stdoutB failures
            9 post failures         43 post failures        43 post failures
Power 9 LE  711 tests               712 tests               712 tests
            4 stderr failures       280 stderr failures     247 stderr failures
            0 stdout failures       54 stdout failures      54 stdout failures
            13 stderrB failures     16 stderrB failures     16 stderrB failures
            0 stdoutB failures      12 stdoutB failures     12 stdoutB failures
            9 post failures         13 post failures        9 post failures
Power 10 LE 719 tests               720 tests               720 tests
            2 stderr failures       42 stderr failures      2 stderr failures
            0 stdout failures       0 stdout failures       0 stdout failures
            2 stderrB failures      2 stderrB failures      2 stderrB failures
            10 stdoutB failures     10 stdoutB failures     10 stdoutB failures
            0 post failures         3 post failures         0 post failures
So the patch has mixed results in fixing the issue. It feels like
there is still a timing issue to me. Perhaps there needs to be a check
to see if the lazy load has completed before the use? Just throwing
out ideas here.
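One way to read "mixed results" off the table is to compare each run against the pre-lazy-load baseline category by category. The helper below is only an illustrative sketch with made-up names (ToyFailCounts, toy_back_to_baseline); the numbers are copied from the Power 9 LE and Power 10 LE rows above.

```c
#include <assert.h>
#include <stdbool.h>

/* Failure counts for one regtest run, in the order used in the table. */
typedef struct {
    int err, out, errB, outB, post;
} ToyFailCounts;

/* Rows copied from the table above. */
const ToyFailCounts TOY_P9_PRE   = {   4,  0, 13,  0, 9 };
const ToyFailCounts TOY_P9_FIX   = { 247, 54, 16, 12, 9 };
const ToyFailCounts TOY_P10_PRE  = {   2,  0,  2, 10, 0 };
const ToyFailCounts TOY_P10_MAIN = {  42,  0,  2, 10, 3 };
const ToyFailCounts TOY_P10_FIX  = {   2,  0,  2, 10, 0 };

/* Sum of all failure categories for one run. */
int toy_total_failures(const ToyFailCounts *c)
{
    assert(c != 0);
    return c->err + c->out + c->errB + c->outB + c->post;
}

/* True when a run has no more failures than the baseline in every
   category, i.e. the regression is gone for that machine. */
bool toy_back_to_baseline(const ToyFailCounts *base, const ToyFailCounts *run)
{
    assert(base != 0 && run != 0);
    return run->err  <= base->err  && run->out  <= base->out &&
           run->errB <= base->errB && run->outB <= base->outB &&
           run->post <= base->post;
}
```

By this measure the Power 10 LE run with the ppc fix is back at its baseline, while the Power 9 LE run with the same fix is not, which matches the mixed picture in the table.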
Anyway, sounds like you are out of the office for a while. I am fine
with waiting until you are back to work on this some more. No need to
mess up your time off. I don't think there is a release coming soon so
I think we have some time to get this fixed up.
Thanks for the help with the initial patch fix.
Carl
|
|
From: Carl L. <ce...@us...> - 2023-08-31 22:38:31
|
Aaron, Mark:
On Wed, 2023-08-30 at 15:48 -0700, Carl Love wrote:
> Thanks for taking a look at the issue. I tested the patch on a variety
> of machines and get mixed results. Here is what I am seeing before the
> commit to add the lazy loading, with the current Valgrind mainline
> (includes the lazy commit) and with the patch to fix the lazy load on
> Power:
>
> machine pre-lazy-load current mainline with ppc debuginfo fix
> Power 8 LE 707 tests, 708 tests, 708 tests
> 4 stderr failures, 280 stderr failures, 247 stderr failures,
> 0 stdout failures, 54 stdout failures, 54 stdout failures,
> 13 stderrB failures, 16 stderrB failures, 16 stderrB failures,
> 0 stdoutB failures, 11 stdoutB failures, 12 stdoutB failures
> 9 post failures 13 post failures 9 post failures
>
> Power 8 BE 742 tests, 743 tests, 743 tests,
> 2 stderr failures, 671 stderr failures, 671 stderr failures,
> 0 stdout failures, 152 stdout failures, 152 stdout failures,
> 0 stderrB failures, 14 stderrB failures, 14 stderrB failures,
> 2 stdoutB failures, 20 stdoutB failures, 20 stdoutB failures,
> 9 post failures 43 post failures 43 post failures
>
> Power 9 LE 711 tests, 712 tests, 712 tests,
> 4 stderr failures, 280 stderr failures, 247 stderr failures,
> 0 stdout failures, 54 stdout failures, 54 stdout failures,
> 13 stderrB failures, 16 stderrB failures, 16 stderrB failures,
> 0 stdoutB failures, 12 stdoutB failures, 12 stdoutB failures
> 9 post failures 13 post failures 9 post failures
>
> Power 10 LE 719 tests 720 tests, 720 tests,
> 2 stderr failures, 42 stderr failures, 2 stderr failures,
> 0 stdout failures, 0 stdout failures, 0 stdout failures,
> 2 stderrB failures, 2 stderrB failures, 2 stderrB failures,
> 10 stdoutB failures, 10 stdoutB failures, 10 stdoutB failures,
> 0 post failures 3 post failures 0 post failures
I was thinking about what else could cause the differences in the test
results. I was wondering if the OS distribution might be an issue.
So, I tried some different OS distributions on the same hardware.
First here is the OS distribution for the above testing.
The Power 8 BE system is Red Hat Enterprise Linux Server 7.9 (Maipo)
The Power 8 LE system is Ubuntu 20.04.5 LTS (Focal Fossa)
The Power 9 LE system is Ubuntu 20.04.5 LTS (Focal Fossa)
The Power 10 LE system is Red Hat Enterprise Linux 9.0 (Plow)
I did some additional testing on Power 9 LE and Power 10 LE with
different OS distributions with the PPC fix patch applied.
Power 9 LE Red Hat Enterprise Linux 8.7 (Ootpa)
== 714 tests, 4 stderr failures, 0 stdout failures, 0 stderrB failures,
0 stdoutB failures, 9 post failures ==
Power 10 LE Ubuntu 22.04.2 LTS
== 721 tests, 303 stderr failures, 62 stdout failures, 11 stderrB
failures, 14 stdoutB failures, 0 post failures ==
So, it seems that Red Hat works well on Power 9 and Power 10. Ubuntu
doesn't work well on Power 10, Power 9 or Power 8. There seems to be
an OS issue, not a timing issue, that is causing the differences on the
various platforms that I tested.
Carl
|
|
From: Mark W. <ma...@kl...> - 2023-08-31 14:15:04
|
Hi Aaron, Hi Carl,
On Wed, Aug 30, 2023 at 03:48:20PM -0700, Carl Love via Valgrind-developers wrote:
> On Wed, 2023-08-30 at 15:09 -0400, Aaron Merey wrote:
> > Sorry for the delay. I'm currently away for the next couple
> > weeks, however I was able to take a look at these regressions.
Thanks. But don't feel you have to come back early just for this
technical issue. We might not be as quick as you, but we should be
able to figure it out :)
> > It looks like debuginfo is not always lazily loaded on ppc64le
> > since it's possible for neither describe_IP or find_DiCfSI to be
> > called before symtab lookups during stacktrace. describe_IP and
> > find_DiCfSI contain calls to lazily load debuginfo, so if they are
> > not called before stacktrace printing it results in missing
> > debuginfo and lower quality stacktraces.
> >
> > I've attached a patch that fixed the regressions for me when I
> > tested this on a ppc64le machine. It adds lazy debuginfo loading
> > during ppc get_StackTrace_wrk.
>
> Thanks for taking a look at the issue. I tested the patch on a variety
> of machines and get mixed results. Here is what I am seeing before the
> commit to add the lazy loading, with the current Valgrind mainline
> (includes the lazy commit) and with the patch to fix the lazy load on
> Power: [...]
It also doesn't seem to work for me on a power9 f38 system. Which is
surprising, since theoretically I think it should work. The
difference between ppc64le and other architectures is that all other
architectures use VG_(use_CF_info) for unwinding, which will
indirectly load the debuginfo for the pc. So explicitly loading it for
the pc in the ppc case should have worked, but it doesn't... :{
I'll keep poking to see if there is some other difference with the
other architectures.
Cheers,
Mark
|
|
From: Mark W. <ma...@kl...> - 2023-08-31 15:43:37
|
Hi,
On Thu, Aug 31, 2023 at 04:14:53PM +0200, Mark Wielaard wrote:
> It also doesn't seem to work for me on a power9 f38 system. Which is
> surprising, since theoretically I think it should work. The
> difference between ppc64le and other architectures is that all other
> architectures use VG_(use_CF_info) for unwinding, which will
> indirectly load the debuginfo for the pc. So explicitly loading it for
> the pc in the ppc case should have worked, but it doesn't... :{
>
> I'll keep poking if there is some other difference with the other
> architectures.
I take that back. I didn't apply Aaron's patch correctly (I had some
local hacks that conflicted with the second part). With a clean
current trunk and Aaron's patch applied the results look pretty good:
== 712 tests, 4 stderr failures, 0 stdout failures, 0 stderrB failures, 0 stdoutB failures, 0 post failures ==
memcheck/tests/bug340392 (stderr)
memcheck/tests/linux/debuginfod-check (stderr)
helgrind/tests/pth_mempcpy_false_races (stderr)
drd/tests/std_thread2 (stderr)
...checking makefile consistency
...checking header files and include directives
make: *** [Makefile:1438: regtest] Error 1
I think memcheck/tests/bug340392,
helgrind/tests/pth_mempcpy_false_races and drd/tests/std_thread2 are
known failures.
memcheck/tests/linux/debuginfod-check.stderr.diff:
--- debuginfod-check.stderr.exp 2023-04-27 15:25:16.209181780 +0000
+++ debuginfod-check.stderr.out 2023-08-31 14:21:46.438006283 +0000
@@ -2,5 +2,5 @@
at 0x........: main (debuginfod-check.c:5)
Address 0x........ is 1 bytes before a block of size 1 alloc'd
at 0x........: malloc (vg_replace_malloc.c:...)
- by 0x........: main (debuginfod-check.c:4)
+ ...
so that is interesting, have to figure out why the explicit debuginfod
testcase fails. But the rest does look good with Aaron's patch
applied.
Carl, can you look if the patch applied cleanly for you?
Cheers,
Mark
|
|
From: Carl L. <ce...@us...> - 2023-08-31 17:07:44
|
Mark:
On Thu, 2023-08-31 at 17:43 +0200, Mark Wielaard wrote:
> Hi,
>
> On Thu, Aug 31, 2023 at 04:14:53PM +0200, Mark Wielaard wrote:
> > It also doesn't seem to work for me on a power9 f38 system. Which
> > is
> > surprising, since theoretically I think it should work. The
> > difference between ppc64le and other architectures is that all
> > other
> > architectures use VG_(use_CF_info) for unwinding, which will
> > indirectly load the debuginfo for the pc. So explicitly loading it
> > for
> > the pc in the ppc case should have worked, but it doesn't... :{
> >
> > I'll keep poking if there is some other difference with the other
> > architectures.
>
> I take that back. I didn't apply Aaron's patch correctly (I had some
> local hacks that conflicted with the second part). With a clean
> current trunk and Aaron's patch applied the results look pretty good:
>
> == 712 tests, 4 stderr failures, 0 stdout failures, 0 stderrB
> failures, 0 stdoutB failures, 0 post failures ==
> memcheck/tests/bug340392 (stderr)
> memcheck/tests/linux/debuginfod-check (stderr)
> helgrind/tests/pth_mempcpy_false_races (stderr)
> drd/tests/std_thread2 (stderr)
>
> ...checking makefile consistency
> ...checking header files and include directives
> make: *** [Makefile:1438: regtest] Error 1
>
> I think memcheck/tests/bug340392,
> helgrind/tests/pth_mempcpy_false_races and drd/tests/std_thread2 are
> known failures.
>
> memcheck/tests/linux/debuginfod-check.stderr.diff:
>
> --- debuginfod-check.stderr.exp 2023-04-27 15:25:16.209181780 +0000
> +++ debuginfod-check.stderr.out 2023-08-31 14:21:46.438006283 +0000
> @@ -2,5 +2,5 @@
> at 0x........: main (debuginfod-check.c:5)
> Address 0x........ is 1 bytes before a block of size 1 alloc'd
> at 0x........: malloc (vg_replace_malloc.c:...)
> - by 0x........: main (debuginfod-check.c:4)
> + ...
>
> so that is interesting, have to figure out why the explicit
> debuginfod
> testcase fails. But the rest does look good with Aaron's patch
> applied.
>
> Carl, can you look if the patch applied cleanly for you?
I have test directories on each machine. I did a git pull, compiled,
ran the tests, then applied the fix patch, compiled, ran the tests, then I
rolled back the git repository to the commit prior to the initial
commit, compiled, and ran the tests.
I didn't see any issues when I applied the PPC fix patch.
So today, I cloned the current valgrind tree into an empty directory,
applied the PPC fix patch. The patch applied without any issues. Then
I configured, compiled and installed valgrind. I reran the tests on each
platform and got identical results as I posted yesterday. Looking at
the variability in the results before and after the PPC fix patch just
makes me wonder if there is a timing issue given what the patch did??
Valgrind is a single threaded program as far as I know so I am puzzled
how it could be a timing issue. I have tried running the tests
multiple times on the various platforms and always get consistent
results.
I will see if I can play around with the patch some today to see if I
can find anything.
Carl
|
|
From: Carl L. <ce...@us...> - 2023-08-31 17:40:26
|
Mark, Aaron:
So, I tried running the doublefree test by hand with the intention of
then adding some debug prints to see which routines were being called.
I am seeing the following:
valgrind --tool=memcheck -q ./memcheck/tests/doublefree > out-current
valgrind: m_debuginfo/image.c:1106 (vgModuleLocal_img_valid):
Assertion 'img != NULL' failed.
Segmentation fault
I rolled back the git tree to the commit prior to the initial patch to
do the lazy load,
commit 6ce0979884a8f246c80a098333ceef1a7b7f694d
Author: Paul Floyd <pj...@wa...>
Date: Mon Jul 24 22:06:00 2023 +0200
Bug 472219 - Syscall param ppoll(ufds.events) points to
uninitialised byte(s
Add checks that (p)poll fd is not negative. If it is negative,
don't check
the events field.
I re-compiled, re-installed and tested again and get:
valgrind --tool=memcheck -q ./memcheck/tests/doublefree > out-current
==124807== Invalid free() / delete / delete[] / realloc()
==124807== at 0x409B680: free (vg_replace_malloc.c:974)
==124807== by 0x1000063B: main (doublefree.c:10)
==124807== Address 0x42f0040 is 0 bytes inside a block of size 177
free'd
==124807== at 0x409B680: free (vg_replace_malloc.c:974)
==124807== by 0x1000063B: main (doublefree.c:10)
==124807== Block was alloc'd at
==124807== at 0x409858C: malloc (vg_replace_malloc.c:431)
==124807== by 0x1000061B: main (doublefree.c:8)
==124807==
So it seems that with the initial patch and the PPC patch applied we
are hitting an assertion failure. I will try to pursue this a bit more.
Carl
|
|
From: Carl L. <ce...@us...> - 2023-08-31 22:38:32
|
On Thu, 2023-08-31 at 10:31 -0700, Carl Love wrote:
> Mark, Aaron:
>
> So, I tried running the doublefree test by hand with the intention of
> then adding some debug prints to see which routines were being
> called.
> I am seeing the following:
>
> valgrind --tool=memcheck -q ./memcheck/tests/doublefree > out-
> current
>
> valgrind: m_debuginfo/image.c:1106 (vgModuleLocal_img_valid):
> Assertion 'img != NULL' failed.
> Segmentation fault
>
> I rolled back the git tree to the commit prior to the initial patch
> to
> do the lazy load,
>
> commit 6ce0979884a8f246c80a098333ceef1a7b7f694d
> Author: Paul Floyd <pj...@wa...>
> Date: Mon Jul 24 22:06:00 2023 +0200
>
> Bug 472219 - Syscall param ppoll(ufds.events) points to
> uninitialised byte(s
>
> Add checks that (p)poll fd is not negative. If it is negative,
> don't check
> the events field.
>
> I recompiled, reinstalled and tested again and get:
>
> valgrind --tool=memcheck -q ./memcheck/tests/doublefree > out-
> current
> ==124807== Invalid free() / delete / delete[] / realloc()
> ==124807== at 0x409B680: free (vg_replace_malloc.c:974)
> ==124807== by 0x1000063B: main (doublefree.c:10)
> ==124807== Address 0x42f0040 is 0 bytes inside a block of size 177
> free'd
> ==124807== at 0x409B680: free (vg_replace_malloc.c:974)
> ==124807== by 0x1000063B: main (doublefree.c:10)
> ==124807== Block was alloc'd at
> ==124807== at 0x409858C: malloc (vg_replace_malloc.c:431)
> ==124807== by 0x1000061B: main (doublefree.c:8)
> ==124807==
>
> So it seems with the initial patch and the PPC patch we are hitting
> an
> assertion issue. I will try and pursue a bit more.
The system I was testing on is a Power 8 BE system running Red Hat
Enterprise Linux Server 7.9 (Maipo).
The assertion is in function ML_(img_valid), file
coregrind/m_debuginfo/image.c. I put a print statement before each
of the 18 calls to determine which of the calls fails. The failure is
in readelf.c, at around line 609:
Bool get_elf_symbol_info (... )
{
   ...
   /* Now we want to know what's at that offset in the .opd
      section. We can't look in the running image since it won't
      necessarily have been mapped. But we can consult the oimage.
      opd_img is the start address of the .opd in the oimage.
      Hence: */
   ULong fn_descr[2]; /* is actually 3 words, but we need only 2 */
   VG_(printf)("CARLL, img_valid 2\n");
   if (!ML_(img_valid)(escn_opd->img, escn_opd->ioff + offset_in_opd,
                       sizeof(fn_descr))) {
      if (TRACE_SYMTAB_ENABLED) {
         HChar* sym_name = ML_(img_strdup)(escn_strtab->img,
                                           "di.gesi.6b", sym_name_ioff);
         TRACE_SYMTAB(" ignore -- invalid OPD fn_descr offset: %s\n",
                      sym_name);
         if (sym_name) ML_(dinfo_free)(sym_name);
      }
      return False;
   }
   ...
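As an alternative to hand-numbering a print before each of the calls, a wrapper macro can record the call site automatically. This is only a minimal sketch: the Image type, img_valid_at and IMG_VALID are made-up names, and the check here returns false on a NULL image, whereas the real ML_(img_valid) asserts.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an image handle; not Valgrind's DiImage. */
typedef struct { size_t size; } Image;

/* Hypothetical range check. Unlike the real ML_(img_valid), which
   asserts on a NULL image, this one reports it to the caller. */
static bool img_valid(const Image *img, size_t off, size_t len)
{
    if (img == NULL)
        return false;
    return off <= img->size && len <= img->size - off;
}

/* Runs the check and, on failure, prints where the call came from. */
static bool img_valid_at(const Image *img, size_t off, size_t len,
                         const char *file, int line)
{
    bool ok = img_valid(img, off, len);
    if (!ok)
        fprintf(stderr, "img_valid failed at %s:%d (img=%p)\n",
                file, line, (void *)img);
    return ok;
}

/* Every use of the macro is tagged with its own file and line. */
#define IMG_VALID(img, off, len) \
    img_valid_at((img), (off), (len), __FILE__, __LINE__)
```

With this in place, a call such as IMG_VALID(NULL, 0, 16) returns false and names the file and line of that particular call on stderr, so the failing call site can be read off directly.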
get_elf_symbol_info is called from
static
__attribute__((unused)) /* not referred to on all targets */
void read_elf_symtab__ppc64be_linux(
        struct _DebugInfo* di, const HChar* tab_name,
        DiSlice* escn_symtab,
        DiSlice* escn_strtab,
        DiSlice* escn_opd, /* ppc64be-linux only */
        Bool symtab_in_debug
     )
{
   ...
}
in the same file. There is an #if to select which of the two readers
to use
# if defined(VGP_ppc64be_linux)
read_elf_symtab = read_elf_symtab__ppc64be_linux;
# else
read_elf_symtab = read_elf_symtab__normal;
# endif
in function read_elf_object, which is called from
di_notify_ACHIEVE_ACCEPT_STATE in debuginfo.c.
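The compile-time selection described above can be sketched in isolation as follows. The names are illustrative only: USE_OPD_READER stands in for VGP_ppc64be_linux, and the readers return a tag instead of taking the real DebugInfo and DiSlice arguments. The unused attribute (a GCC/Clang extension) plays the same role as in the excerpt, since only one reader is referenced on any given target.

```c
/* Hypothetical tag so the caller can see which reader was chosen. */
typedef enum { READER_NORMAL, READER_OPD } ReaderKind;

/* Hypothetical reader signature; the real readers take more arguments. */
typedef ReaderKind (*SymtabReader)(const char *tab_name);

static __attribute__((unused)) /* not referred to on all targets */
ReaderKind read_symtab_normal(const char *tab_name)
{
    (void)tab_name;
    return READER_NORMAL;
}

static __attribute__((unused)) /* not referred to on all targets */
ReaderKind read_symtab_opd(const char *tab_name)
{
    (void)tab_name;
    return READER_OPD;   /* a real reader would also consult .opd */
}

/* The choice is made once by the preprocessor, not at run time. */
static SymtabReader select_symtab_reader(void)
{
#if defined(USE_OPD_READER)
    return read_symtab_opd;
#else
    return read_symtab_normal;
#endif
}
```

Because the choice is fixed when valgrind is built, a BE build always goes through the .opd-aware reader, which fits the assertion showing up only on the BE system.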
I believe we need to call read_elf_debug to actually load the image. I
am not seeing any calls to read_elf_debug. It is called in load_di,
addr_load_di and load_all_debuginfo. I don't see any of these
functions getting called. describe_IP calls load_di or addr_load_di;
find_DiCfSI will call load_di. Again, I don't see describe_IP or
find_DiCfSI being called.
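To illustrate the kind of failure I suspect, here is a minimal sketch of a deferred load in which a consumer must force the load before touching the image. All names (DebugObject, ensure_loaded, read_descriptor) are hypothetical and this is not Valgrind's code or the actual fix; it only shows why a check that asserts img != NULL trips when nothing has triggered the load yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not Valgrind's DiImage or DebugInfo. */
typedef struct { size_t size; } Image;

typedef struct {
    Image *img;       /* NULL until the deferred load has run */
    Image  storage;   /* backing store for this sketch */
    bool   loaded;
} DebugObject;

/* Plays the role of the deferred load step; runs at most once. */
static void ensure_loaded(DebugObject *obj)
{
    if (!obj->loaded) {
        obj->storage.size = 128;   /* pretend the image was read */
        obj->img = &obj->storage;
        obj->loaded = true;
    }
}

/* Like the failing check, this asserts on a NULL image. */
static bool img_valid(const Image *img, size_t off, size_t len)
{
    assert(img != NULL);
    return off <= img->size && len <= img->size - off;
}

/* A reader that needs the image forces the load first. */
static bool read_descriptor(DebugObject *obj, size_t off, size_t len)
{
    ensure_loaded(obj);
    return img_valid(obj->img, off, len);
}
```

Calling img_valid on obj->img without going through ensure_loaded first would hit the assertion, which matches what I am seeing on the BE system.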
----------------------------------------
So, I then tried to run the same test on a Power 8 LE system running
Ubuntu 20.04.5 LTS (Focal Fossa). I get:
valgrind --tool=memcheck -q ./memcheck/tests/doublefree > out-current
valgrind: Fatal error at startup: a function redirection
valgrind: which is mandatory for this platform-tool combination
valgrind: cannot be set up. Details of the redirection are:
valgrind:
valgrind: A must-be-redirected function
valgrind: whose name matches the pattern: strlen
valgrind: in an object with soname matching: ld64.so.2
valgrind: was not found whilst processing
valgrind: symbols from the object with soname: ld64.so.2
valgrind:
valgrind: Possible fixes: (1, short term): install glibc's debuginfo
valgrind: package on this machine. (2, longer term): ask the packagers
valgrind: for your Linux distribution to please in future ship a non-
valgrind: stripped ld.so (or whatever the dynamic linker .so is called)
valgrind: that exports the above-named function using the standard
valgrind: calling conventions for this platform. The package you need
valgrind: to install for fix (1) is called
valgrind:
valgrind: On Debian, Ubuntu: libc6-dbg
valgrind: On SuSE, openSuSE, Fedora, RHEL: glibc-debuginfo
valgrind:
valgrind: Note that if you are debugging a 32 bit process on a
valgrind: 64 bit system, you will need a corresponding 32 bit debuginfo
valgrind: package (e.g. libc6-dbg:i386).
valgrind:
valgrind: Cannot continue -- exiting now. Sorry.
When I put in my print statements, I see the call to
read_elf_symtab__normal instead of read_elf_symtab__ppc64be_linux, as
expected. It appears that some of the image file is read, as I see a
second call to di_notify_ACHIEVE_ACCEPT_STATE and read_elf_object,
which I don't see on the BE system before the run fails.
Carl
|
|
From: Mark W. <ma...@kl...> - 2023-09-01 14:21:34
|
Hi Carl,
On Thu, 2023-08-31 at 15:38 -0700, Carl Love wrote:
> So, I then tried to run the same test on a Power 8LE system Ubuntu
> 20.04.5 LTS (Focal Fossa). I get:
>
> valgrind --tool=memcheck -q ./memcheck/tests/doublefree > out-current
>
> valgrind: Fatal error at startup: a function redirection
> valgrind: which is mandatory for this platform-tool combination
> valgrind: cannot be set up. Details of the redirection are:
> valgrind:
> valgrind: A must-be-redirected function
> valgrind: whose name matches the pattern: strlen
> valgrind: in an object with soname matching: ld64.so.2
> valgrind: was not found whilst processing
> valgrind: symbols from the object with soname: ld64.so.2
> valgrind:
> valgrind: Possible fixes: (1, short term): install glibc's debuginfo
> valgrind: package on this machine. (2, longer term): ask the packagers
> valgrind: for your Linux distribution to please in future ship a non-
> valgrind: stripped ld.so (or whatever the dynamic linker .so is called)
> valgrind: that exports the above-named function using the standard
> valgrind: calling conventions for this platform. The package you need
> valgrind: to install for fix (1) is called
> valgrind:
> valgrind: On Debian, Ubuntu: libc6-dbg
> valgrind: On SuSE, openSuSE, Fedora, RHEL: glibc-debuginfo
> valgrind:
> valgrind: Note that if you are debugging a 32 bit process on a
> valgrind: 64 bit system, you will need a corresponding 32 bit debuginfo
> valgrind: package (e.g. libc6-dbg:i386).
> valgrind:
> valgrind: Cannot continue -- exiting now. Sorry.
>
>
> When I put in my print statements, I see the call to
> read_elf_symtab__normal instead of read_elf_symtab__ppc64be_linux as
> expected. It appears that some of the image file is read as I see a
> second call to di_notify_ACHIEVE_ACCEPT_STATE, read_elf_object which I
> don't see on the BE system before the run fails.
So the above is indeed not architecture, but Debian/Ubuntu specific.
It is tracked as https://bugs.kde.org/show_bug.cgi?id=473745
It is because the ld.so symtab is in a separate dbg package, which
(now) isn't loaded early anymore when resolving the hardwired
redirects. It doesn't happen on other distros because they keep symtab
in ld.so.
Cheers,
Mark
|
|
From: Mark W. <ma...@kl...> - 2023-09-01 17:47:42
|
On Fri, 2023-09-01 at 16:21 +0200, Mark Wielaard wrote:
> So the above is indeed not architecture, but Debian/Ubuntu specific.
> It is tracked as https://bugs.kde.org/show_bug.cgi?id=473745
>
> It is because the ld.so symtab is in a separate dbg package, which
> (now) isn't loaded early anymore when resolving the hardwired
> redirects. It doesn't happen on other distros because they keep symtab
> in ld.so.
I have attached a patch that seems to work for me.
|