From: Tomas V. <tv...@fu...> - 2022-09-09 14:26:28
On 9/9/22 04:58, John Reiser wrote:
>>> 1. Describe the environment completely.
>
> Also: Any kind of threading (pthreads, or shm_open, or
> mmap(,,,MAP_SHARED,,)) must be mentioned explicitly. Multiple
> execution contexts which access the same address space instance are a
> significant complicating factor.
>
> If threading is involved, then try using "valgrind --tool=drd ..." or
> --tool=helgrind, because those tools specifically target detecting
> race conditions and other synchronization errors, much like
> --tool=memcheck [the default tool when no --tool= is mentioned]
> targets errors involving malloc() and free(), uninitialized
> variables, etc.

No threading is used. Postgres is multi-process, and uses shared memory
for the shared cache (through shm_open etc.). FWIW, as I mentioned
before, this works perfectly fine when the core is not generated by
valgrind.

>>> 4. Walk before attempting to run.
>>> Did you try a simple example? Write a half-page program with 5
>>> subroutines, each of which calls the next one, and the last one
>>> sends SIGABRT to the process.
>
>>> Does the .core file when run under valgrind give the correct
>>> traceback using gdb?
>
> Specifically: apply valgrind to the small program which causes a
> deliberate SIGABRT, and get a core file. Does gdb give the correct
> traceback for that core file? If not, then you have an ideal test case
> for filing a bug report against valgrind because even the simple core
> file is bad. If gdb does give a correct traceback for the simple core
> file, then you have to keep looking for the source of the problem on
> your larger program.

I'll try this once I have access to the machine early next week.

>>> 5. (Learn and) Use the built-in tools where possible.
>>> Run the process interactively, invoking valgrind with
>>> "--vgdb-error=0", and giving the debugger command "(gdb) continue"
>>> after establishing connectivity between vgdb and the process.
>>> See the valgrind manual, section 3.2.9 "vgdb command line options".
>>> When the SIGABRT happens, then vgdb will allow you to use all the
>>> ordinary gdb commands to get a backtrace, go up and down the stack,
>>> examine variables and other memory, run
>>>   (gdb) info proc
>>>   (gdb) shell cat /proc/$PID/maps
>>> to see exactly the layout of process memory, etc.
>>> There are also special commands to access valgrind functionality
>>> interactively, such as checking for memory leaks.
>>
>> I already explained why I don't want / can't use the interactive gdb.
>> I'm aware of the option, I've used it before, but in this case it's
>> not very practical.
>
> The gdb process does not *have* to be run interactively, it just takes
> more work and patience to run non-interactively. Run
> "valgrind --vgdb-error=0 ..." and notice the last part of the printed
> instructions:
>
>   and then give GDB the following command
>   ==215935== target remote | /path/to/libexec/valgrind/../../bin/vgdb --pid=215935
>   ==215935== --pid is optional if only one valgrind process is running
>
> So if there is only one valgrind process, then you do not need to know
> the pid. Thus you can run gdb with re-directed stdin/stdout/stderr, or
> perhaps use the -x command-line option. This allows a static,
> pre-scripted list of gdb commands; it may require a few iterations to
> get a good debug script. (Try the commands using the trivial SIGABRT
> case!) Also get the full gdb manual (more than 800 pages) and look at
> the "thread apply all ..." and "frame apply all ..." commands.
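
For concreteness, I take that to mean a gdb command file along these
lines (a rough sketch only - the file name, the vgdb path and the
postgres binary path are placeholders, and the exact command list is my
guess, not something spelled out above):

    # sketch.gdb - pre-scripted, non-interactive gdb session, run e.g. as
    #   gdb -batch -x sketch.gdb /path/to/postgres > gdb.log 2>&1
    # connect to the valgrind gdbserver (with several valgrind processes
    # running, this would also need --pid=<pid>)
    target remote | /path/to/libexec/valgrind/../../bin/vgdb
    # let the process run until valgrind stops it (error or signal)
    continue
    # then collect as much state as possible
    thread apply all bt full
    info proc
    info registers
    detach
    quit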
Sure, but that kind of scripted gdb session is more of a workaround - it
does not make the core file useful, it just provides an alternative way
to get to the same result. Plus it requires additional tooling/scripting,
and I'd prefer to keep the tooling as simple as possible. Postgres is a
multi-process system that runs a bunch of management processes plus
client processes (1:1 with connections). We don't know in which one an
issue might happen, so we'd have to attach a script to each of them.
Furthermore, there's the question of performance - we run these tests on
many machines (although only some of them run the tests under valgrind),
and valgrind already makes them fairly slow. If this vgdb approach makes
them even slower, that would be an issue. But I haven't measured it, so
maybe it's not as bad as I fear.

> It may be possible to perform some interactive "reconnaissance" to
> suggest good things for the script to try. Using --vgdb-error=0, put a
> breakpoint on a likely location for the error (or shortly before the
> error), and look around. In the logged traceback:
>
>   TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>   "reorderbuffer.c", Line: 902, PID: 536049)
>   (ExceptionalCondition+0x98)[0x8f5cec]
>   (+0x57a574)[0x682574]
>   (+0x579edc)[0x681edc]
>   (ReorderBufferAddNewTupleCids+0x60)[0x6864dc]
>   (SnapBuildProcessNewCid+0x94)[0x68b6a4]
>
> any of those named locations, or shortly before them, might be a good
> spot. When execution stops at any one of the breakpoints, then look
> around and see if you can find clues about
> "prev_first_lsn < cur_txn->first_lsn" even though the error has not
> yet occurred. Perhaps this will help identify location(s) that might
> be closer to the actual error when it does happen. This might suggest
> commands for the non-interactive gdb debugging script.

This does not work, I'm afraid. The issue is a (rare) race condition -
we run the assert thousands of times and it's fine 99.999% of the time.
The breakpoint and interactive reconnaissance are unlikely to find
anything 99% of the time, and they can easily make the race condition
go away by changing the timing. That's kind of the interesting thing
here - this is not an issue valgrind is meant to discover; it just seems
to change the timing enough to increase the probability of hitting it.


regards

Tomas