From: Narayanan I. <na...@yo...> - 2022-03-14 15:27:30
|
Hi,

While running the automated test suite (which has hundreds of tests) for my application with valgrind, I occasionally see failures like the following in some of the tests.

==29753== Can't extend stack to 0x1ffeec7948 during signal delivery for thread 1:
==29753==   too small or bad protection modes
==29753==
==29753== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==29753==  Access not within mapped region at address 0x1FFEEC7948
==29753==    at 0x4849FD8: strncpy (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==29753==    by 0x489AE7C: cli_get_sub_quals (sr_unix/cli_parse.c:593)
==29753==    by 0x489ABC3: parse_arg (sr_unix/cli_parse.c:0)
==29753==    by 0x489BD6E: parse_triggerfile_cmd (sr_unix/cli_parse.c:1128)
==29753==    by 0x4BD2377: trigger_parse (sr_unix/trigger_parse.c:1416)
==29753==    by 0x4B12152: trigger_update_rec (sr_unix/trigger_update.c:1386)
==29753==    by 0x4B16171: trigger_update_rec_helper (sr_unix/trigger_update.c:2171)
==29753==    by 0x4B163B9: trigger_update (sr_unix/trigger_update.c:2224)
==29753==    by 0x4B86385: op_fnztrigger (sr_port/op_fnztrigger.c:248)
==29753==    by 0x5ABA384: _ydboctoplanhelpers (in YDBOcto/build/src/_ydbocto.so)
==29753==    by 0x1774F1EF: ???
==29753==    by 0xAAAAAAAAAAAAAAA9: ???
==29753==  If you believe this happened as a result of a stack
==29753==  overflow in your program's main thread (unlikely but
==29753==  possible), you can try to increase the size of the
==29753==  main thread stack using the --main-stacksize= flag.
==29753==  The main thread stack size used in this run was 268435456.
==29753== Invalid write of size 8
==29753==    at 0x483A124: _vgnU_freeres (in /usr/libexec/valgrind/vgpreload_core-amd64-linux.so)
==29753==  Address 0x1ffeec8808 is on thread 1's stack

If I rerun just the failing test, it passes fine. Every time the list of tests that fail keeps changing. If I run the test without valgrind, it passes all the time.

Originally I got a failure with --main-stacksize set to 16Mb, so I bumped it to 256Mb, and I still keep getting this failure in different tests. I also set the ulimit for the stack size to 256Mb just in case, and I still see the failures.

The application is single-threaded and I know for sure it does not use anywhere near 256Mb of stack space. The stack trace shown above keeps changing across the many random failures, but in all of those stack traces I believe only around .25Mb of stack space would be used at the most.

In this application, a SIGALRM signal is delivered every 1 second or so. The application does not set up any alternate stack (i.e. no sigaltstack() call). Not sure whether that is related to the random failures or not.

This is on an Ubuntu 20.04 system, and my application was compiled with gcc.

Not sure how to debug this further. Any help in this regard is appreciated.

Thanks,
Narayanan.
|
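The report mentions a SIGALRM roughly every second and no sigaltstack() call. One experiment that follows from this (a sketch only, not a confirmed fix for the valgrind error) is to run the SIGALRM handler on an alternate signal stack, so that the signal frame is not built on the main thread stack. The names on_alarm, alarm_ticks and install_alarm_handler_on_altstack, and the 64 KiB size, are hypothetical and chosen for illustration; the application's real handler would take the place of on_alarm.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* exposes sigaltstack() and SA_ONSTACK under strict -std= modes */
#endif
#include <signal.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical size for the alternate stack; an assumption, not a tuned value. */
#define ALT_STACK_SIZE (64 * 1024)

/* Counts delivered SIGALRMs; stands in for the application's real handler work. */
static volatile sig_atomic_t alarm_count = 0;

static void on_alarm(int signo)
{
    (void)signo;
    alarm_count++;
}

/* Number of SIGALRMs handled so far. */
int alarm_ticks(void)
{
    return (int)alarm_count;
}

/* Registers an alternate signal stack and installs on_alarm for SIGALRM
 * with SA_ONSTACK, so the handler runs on that stack instead of the main one.
 * Returns 0 on success, -1 on failure. */
int install_alarm_handler_on_altstack(void)
{
    stack_t ss;
    struct sigaction sa;

    ss.ss_sp = malloc(ALT_STACK_SIZE);
    if (ss.ss_sp == NULL)
        return -1;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        free(ss.ss_sp);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sa.sa_flags = SA_ONSTACK | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGALRM, &sa, NULL);
}
```

Calling install_alarm_handler_on_altstack() once at startup, before the timer is armed, would be enough for this experiment. If the failures persist with the handler on an alternate stack, that would suggest the signal frame placement is not the cause.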
From: Narayanan I. <na...@yo...> - 2022-03-14 15:30:23
|
One correction (not sure it matters). I believe the application uses 1.25Mb of stack space at the time of the failure (not .25 as I had originally mentioned).

Narayanan.
|
From: Philippe W. <phi...@sk...> - 2022-03-16 00:58:26
|
If you are not using the latest release of valgrind, you might try with the latest release.

I wonder whether the problem also happens with other tools (e.g. --tool=none).

Otherwise, you could try to debug your application while it runs under valgrind, at the point where it encounters the problem, e.g. with the arguments
  --vgdb=full --vgdb-error=1 --vgdb-stop-at=exit,valgrindabexit
(assuming the error you reported is the first error you encounter. If not, you should first fix your code to solve the errors previously reported by valgrind).

You could also compare the valgrind trace between a successful run and an unsuccessful run, with e.g. the valgrind debug switches
  -v -v -v -d -d -d --trace-signals=yes
and see if you detect a difference between the 2 runs. Note that with the above switches, you should see some debug log of the signal handling and of the stack extension mechanism.

Hope this helps
Philippe
|
From: Narayanan I. <na...@yo...> - 2022-03-16 14:35:58
|
Hi Philippe,

Thank you for your reply.

1) I am using valgrind-3.17.0 on an Ubuntu 21.10 box (sorry, I had incorrectly mentioned Ubuntu 20.04 in my original report). Not sure if this is the latest release or not.

2) I did not try it with --tool=none. It takes some time of testing for the failure to happen even with memcheck, so I am not sure the none tool would help.

3) Yes, this is the first error that I encounter. I had tried the approach that you suggest with --vgdb-error=1 and did attach to the process through gdb at exactly the point when the error is issued, but I did not know what more to do then. The application stack trace looks good, just like I would expect.

4) I am not sure how to make use of the debug switches in further analyzing this.

I have found a workaround for my issue, and that is to remove a 1Mb allocation on the stack and move it to the heap. For reasons not yet known, that made the error disappear, so I have decided to move on for now.

To me, the symptoms I have seen so far make it seem like a valgrind issue and not an application issue. If there is anything that you or other valgrind experts would like to know from the failing case, I can try to work with you.

Thanks again.
Narayanan.
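
The workaround described above (moving a 1Mb buffer from the stack to the heap) might look roughly like the following. This is a minimal sketch under assumptions: copy_qualifier and PARSE_BUF_SIZE are hypothetical names, and the body is not the actual code from cli_parse.c.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical scratch buffer size (1 MiB), matching the size mentioned in the thread. */
#define PARSE_BUF_SIZE (1024 * 1024)

/* Copies src into out (at most out_len - 1 bytes plus a terminating NUL),
 * going through a large scratch buffer.
 * Before the change, the scratch buffer would have been a local array:
 *     char buf[PARSE_BUF_SIZE];
 * which puts 1 MiB on the stack for the lifetime of the call.
 * Returns the number of bytes copied, or 0 if out_len is 0 or malloc fails. */
size_t copy_qualifier(const char *src, char *out, size_t out_len)
{
    char *buf;
    size_t n;

    if (out_len == 0)
        return 0;

    buf = malloc(PARSE_BUF_SIZE);   /* heap allocation replaces the stack array */
    if (buf == NULL)
        return 0;

    strncpy(buf, src, PARSE_BUF_SIZE - 1);
    buf[PARSE_BUF_SIZE - 1] = '\0';

    n = strlen(buf);
    if (n >= out_len)
        n = out_len - 1;            /* truncate to fit the caller's buffer */
    memcpy(out, buf, n);
    out[n] = '\0';

    free(buf);
    return n;
}
```

The trade-off is one malloc/free pair per call in exchange for a much smaller stack frame. Why this made the valgrind error disappear is not established in this thread.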