From: Mark W. <ma...@kl...> - 2025-03-20 21:22:01
Hi Florian,

(Adding valgrind-developers to CC to see if someone else has some smart
ideas how to deal with this.)

On Thu, Mar 20, 2025 at 05:58:31PM +0100, Florian Weimer wrote:
> > With latest glibc on fedora rawhide (glibc-2.40.9000-37.fc43.x86_64) I
> > am seeing some extra frames in the call stack that I wonder whether to
> > specially handle in valgrind.
> >
> > Before we would report on some bad syscall argument like:
> >
> > ==1929378== Syscall param sendmsg(msg) points to uninitialised byte(s)
> > ==1929378==    at 0x4971514: sendmsg (sendmsg.c:28)
> > ==1929378==    by 0x40128B: main (sendmsg.c:46)
> > ==1929378==  Address 0x1ffefff640 is on thread 1's stack
> > ==1929378==  in frame #1, created by main (sendmsg.c:13)
> >
> > Now it looks like:
> >
> > ==2670784== Syscall param sendmsg(msg) points to uninitialised byte(s)
> > ==2670784==    at 0x48D9AE6: __internal_syscall_cancel (cancellation.c:64)
> > ==2670784==    by 0x48D9B03: __syscall_cancel (cancellation.c:75)
> > ==2670784==    by 0x49628F0: sendmsg (sendmsg.c:28)
> > ==2670784==    by 0x4005CB: main (sendmsg.c:46)
> > ==2670784==  Address 0x1ffeffff40 is on thread 1's stack
> > ==2670784==  in frame #3, created by main (sendmsg.c:13)
> >
> > Which I think is not as helpful to the user.
> > So I am wondering whether those extra frames should be handled
> > specially in valgrind and filtered out. But were these extra stack
> > frames added explicitly? And are they easily detected (symbol name
> > starting with __ and containing syscall might be a good heuristic)?
>
> I think __internal_syscall_cancel should get inlined into
> __syscall_cancel.

It isn't, I double checked with gdb and there are always two extra
frames on top of the call stack.

> There is also another out-of-line system call in __syscall_cancel_arch,
> which you probably don't see in your example because the process is
> single-threaded.

I did indeed see that in our gdb_server testsuite, I had to filter that
out of the gdb output to make our vgdb tests pass.

> It is necessary to concentrate all cancelable system calls in one place
> for correctness reasons because we need to know if the cancelling signal
> arrives within the system call or immediately after it. It's the only
> way to tell whether the effect of the system call has taken place or
> not. With all system calls in one place, this is a simple address
> check. With the previous inlining-based approach, we would have to have
> some sort of lookup table to determine whether the cancellation attempt
> happened while the system call was executing or not.
>
> This is the relevant bug:
>
>   Race conditions in pthread cancellation
>   <https://sourceware.org/bugzilla/show_bug.cgi?id=12683>
>
> And this commit fixed it:
>
> commit 89b53077d2a58f00e7debdfe58afabe953dac60d
> Author: Adhemerval Zanella <adh...@li...>
> Date:   Tue Jun 25 16:17:44 2024 -0300
>
>     nptl: Fix Race conditions in pthread cancellation [BZ#12683]

Interesting, so this is actually in 2.41? I should try the fedora 42
beta then. Do you happen to know whether people/distros have backported
this to earlier releases?

I think these extra __*syscall*cancel* frames are somewhat confusing to
the user and mess up existing suppressions. They also cause trouble
for the valgrind regtests.

I think the solution for valgrind is to just skip the top (two) frames
if they match the __*syscall*cancel* symbol address ranges. And we only
need to do that when we are creating a backtrace from a valgrind syscall
wrapper.

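To make the "skip the top frames" idea concrete, here is a minimal
sketch in Python. It is not valgrind code. The names LOAD_BASE,
CANCEL_RANGES and skip_cancel_frames are hypothetical, the load base is
an assumption picked for illustration, and the cap of two skipped
frames just mirrors the two extra frames seen in the trace above.

```python
from typing import List, Tuple

# Assumption: address where libc is mapped. A real tool would take this
# from the loaded object's mapping, not from a constant.
LOAD_BASE = 0x486C000

# (start, size) pairs for the glibc cancellation helpers, as absolute
# addresses. The offsets and sizes are sample symbol-table values.
CANCEL_RANGES: List[Tuple[int, int]] = [
    (LOAD_BASE + 0x6DA00, 87),   # __syscall_do_cancel
    (LOAD_BASE + 0x6DA60, 140),  # __internal_syscall_cancel
    (LOAD_BASE + 0x6DAF0, 64),   # __syscall_cancel
    (LOAD_BASE + 0x79840, 51),   # __syscall_cancel_arch
]


def in_cancel_range(addr: int, ranges=CANCEL_RANGES) -> bool:
    """Return True if addr lies in any half-open range [start, start + size)."""
    return any(start <= addr < start + size for start, size in ranges)


def skip_cancel_frames(frames: List[int], max_skip: int = 2,
                       ranges=CANCEL_RANGES) -> List[int]:
    """Drop up to max_skip leading frames that fall inside a helper range.

    frames[0] is the innermost frame. At least one frame is always kept.
    """
    skipped = 0
    while (skipped < max_skip
           and skipped < len(frames) - 1
           and in_cancel_range(frames[skipped], ranges)):
        skipped += 1
    return frames[skipped:]


# The four-frame trace from the report, innermost first.
trace = [0x48D9AE6, 0x48D9B03, 0x49628F0, 0x4005CB]
filtered = skip_cancel_frames(trace)
```

With the assumed load base, the two helper frames are dropped and the
trace starts at the sendmsg frame again, so existing suppressions
written against "sendmsg / main" would still match. Only the leading
frames are examined, so a helper address further down the stack is left
alone.
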
Looking at the glibc symtab I see four function symbols matching that
pattern:

  2140: 0000000000079840    51 FUNC    LOCAL  DEFAULT    4 __syscall_cancel_arch
  3561: 000000000006daf0    64 FUNC    LOCAL  DEFAULT    4 __syscall_cancel
  3700: 000000000006da60   140 FUNC    LOCAL  DEFAULT    4 __internal_syscall_cancel
  4566: 000000000006da00    87 FUNC    LOCAL  DEFAULT    4 __syscall_do_cancel

Can we rely on those names (and assume there are only 4) or is it better
to be flexible and just create a dynamic array for any glibc local
function that matches the __*syscall*cancel* pattern?

Thanks,

Mark
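
For reference, the "flexible" option could look roughly like the
following Python sketch, which collects every local function symbol
matching the pattern instead of hard-coding four names. It assumes
readelf-style symbol table lines with eight whitespace-separated
fields. The function name cancel_symbol_ranges is hypothetical, and
this is an illustration rather than how valgrind reads symbols.

```python
from fnmatch import fnmatchcase

PATTERN = "__*syscall*cancel*"

# Symbol table lines as quoted in the message above.
SYMTAB = """\
  2140: 0000000000079840    51 FUNC    LOCAL  DEFAULT    4 __syscall_cancel_arch
  3561: 000000000006daf0    64 FUNC    LOCAL  DEFAULT    4 __syscall_cancel
  3700: 000000000006da60   140 FUNC    LOCAL  DEFAULT    4 __internal_syscall_cancel
  4566: 000000000006da00    87 FUNC    LOCAL  DEFAULT    4 __syscall_do_cancel
"""


def cancel_symbol_ranges(symtab_text, pattern=PATTERN):
    """Return (name, value, size) for LOCAL FUNC symbols matching pattern.

    The result is sorted by symbol value. Lines that do not have exactly
    eight fields are ignored.
    """
    found = []
    for line in symtab_text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue
        _num, value, size, sym_type, bind, _vis, _ndx, name = fields
        if sym_type != "FUNC" or bind != "LOCAL":
            continue
        if fnmatchcase(name, pattern):
            # Value is hexadecimal; size is decimal (or 0x-prefixed hex).
            found.append((name, int(value, 16), int(size, 0)))
    return sorted(found, key=lambda entry: entry[1])


ranges = cancel_symbol_ranges(SYMTAB)
```

A list built this way grows or shrinks with whatever helpers a given
glibc build contains, at the cost of possibly matching an unrelated
local symbol that happens to fit the pattern.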