From: Tomas V. <tv...@fu...> - 2022-09-09 14:26:28
On 9/9/22 04:58, John Reiser wrote:
>>> 1. Describe the environment completely.
>
> Also: Any kind of threading (pthreads, or shm_open, or
> mmap(,,,MAP_SHARED,,)) must be mentioned explicitly. Multiple
> execution contexts which access the same address space instance are a
> significant complicating factor.
>
> If threading is involved, then try using "valgrind --tool=drd ..." or
> --tool=helgrind, because those tools specifically target detecting
> race conditions and other synchronization errors, much like
> --tool=memcheck [the default tool when no --tool= is mentioned]
> targets errors involving malloc() and free(), uninitialized
> variables, etc.

No threading is used. Postgres is multi-process, and uses shared memory
for the shared cache (through shm_open etc.). FWIW, as I mentioned
before, this works perfectly fine when the core is not generated by
valgrind.

>>> 4. Walk before attempting to run.
>>> Did you try a simple example? Write a half-page program with 5
>>> subroutines, each of which calls the next one, and the last one
>>> sends SIGABRT to the process.
>
>>> Does the .core file when run under valgrind give the correct
>>> traceback using gdb?
>
> Specifically: apply valgrind to the small program which causes a
> deliberate SIGABRT, and get a core file. Does gdb give the correct
> traceback for that core file? If not, then you have an ideal test case
> for filing a bug report against valgrind because even the simple core
> file is bad. If gdb does give a correct traceback for the simple core
> file, then you have to keep looking for the source of the problem on
> your larger program.

I'll try this once I have access to the machine early next week.

>>> 5. (Learn and) Use the built-in tools where possible.
>>> Run the process interactively, invoking valgrind with
>>> "--vgdb-error=0", and giving the debugger command "(gdb) continue"
>>> after establishing connectivity between vgdb and the process.
>>> See the valgrind manual, section 3.2.9 "vgdb command line options".
>>> When the SIGABRT happens, then vgdb will allow you to use all the
>>> ordinary gdb commands to get a backtrace, go up and down the stack,
>>> examine variables and other memory, run
>>>   (gdb) info proc
>>>   (gdb) shell cat /proc/$PID/maps
>>> to see exactly the layout of process memory, etc.
>>> There are also special commands to access valgrind functionality
>>> interactively, such as checking for memory leaks.
>>
>> I already explained why I don't want / can't use the interactive gdb.
>> I'm aware of the option, I've used it before, but in this case it's
>> not very practical.
>
> The gdb process does not *have* to be run interactively, it just takes
> more work and patience to run non-interactively. Run
> "valgrind --vgdb-error=0 ..." and notice the last part of the printed
> instructions:
>
>   and then give GDB the following command
>   ==215935== target remote | /path/to/libexec/valgrind/../../bin/vgdb --pid=215935
>   ==215935== --pid is optional if only one valgrind process is running
>
> So if there is only one valgrind process, then you do not need to know
> the pid. Thus you can run gdb with re-directed stdin/stdout/stderr, or
> perhaps use the -x command-line option. This allows a static,
> pre-scripted list of gdb commands; it may require a few iterations to
> get a good debug script. (Try the commands using the trivial SIGABRT
> case!) Also get the full gdb manual (more than 800 pages) and look at
> the "thread apply all ..." and "frame apply all ..." commands.
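
For concreteness, I take that to mean a gdb command file along these
lines (a rough sketch only - the file name, the vgdb path and the
postgres binary path are placeholders, and the exact command list is my
guess, not something spelled out above):

    # sketch.gdb - pre-scripted, non-interactive gdb session, run e.g. as
    #   gdb -batch -x sketch.gdb /path/to/postgres > gdb.log 2>&1
    # connect to the valgrind gdbserver (with several valgrind processes
    # running, this would also need --pid=<pid>)
    target remote | /path/to/libexec/valgrind/../../bin/vgdb
    # let the process run until valgrind stops it (error or signal)
    continue
    # then collect as much state as possible
    thread apply all bt full
    info proc
    info registers
    detach
    quit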
Sure, but that kind of scripted gdb session is more of a workaround - it
does not make the core file useful, it just provides an alternative way
to get to the same result. Plus it requires additional tooling/scripting,
and I'd prefer to keep the tooling as simple as possible. Postgres is a
multi-process system that runs a bunch of management processes plus
client processes (1:1 with connections). We don't know in which one an
issue might happen, so we'd have to attach a script to each of them.
Furthermore, there's the question of performance - we run these tests on
many machines (although only some of them run the tests under valgrind),
and valgrind already makes them fairly slow. If this vgdb approach makes
them even slower, that would be an issue. But I haven't measured it, so
maybe it's not as bad as I fear.

> It may be possible to perform some interactive "reconnaissance" to
> suggest good things for the script to try. Using --vgdb-error=0, put a
> breakpoint on a likely location for the error (or shortly before the
> error), and look around. In the logged traceback:
>
>   TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File:
>   "reorderbuffer.c", Line: 902, PID: 536049)
>   (ExceptionalCondition+0x98)[0x8f5cec]
>   (+0x57a574)[0x682574]
>   (+0x579edc)[0x681edc]
>   (ReorderBufferAddNewTupleCids+0x60)[0x6864dc]
>   (SnapBuildProcessNewCid+0x94)[0x68b6a4]
>
> any of those named locations, or shortly before them, might be a good
> spot. When execution stops at any one of the breakpoints, then look
> around and see if you can find clues about
> "prev_first_lsn < cur_txn->first_lsn" even though the error has not
> yet occurred. Perhaps this will help identify location(s) that might
> be closer to the actual error when it does happen. This might suggest
> commands for the non-interactive gdb debugging script.

This does not work, I'm afraid. The issue is a (rare) race condition -
we run the assert thousands of times and it's fine 99.999% of the time.
The breakpoint and interactive reconnaissance are unlikely to find
anything 99% of the time, and they can easily make the race condition
go away by changing the timing. That's kind of the interesting thing
here - this is not an issue valgrind is meant to discover; it just seems
to change the timing enough to increase the probability of hitting it.


regards

Tomas