|
From: Uday R. <uda...@gm...> - 2012-11-05 09:37:21
|
I see invalid read messages from valgrind (3.6.1 on 64-bit Linux) -
the location in question is well within a buffer that was allocated
via malloc/realloc, as the message below indicates as well. As this
invalid read happens inside a library, I tried putting a memcpy just
before the library call to see if I was able to read from those
addresses without valgrind throwing an error and, surprisingly,
valgrind didn't complain about my memcpy. The code I inserted was:

    buf = malloc(756);
    memcpy(buf, buf_arg_of_mpi_send, 756);

==14178== Invalid read of size 2
==14178== at 0x4A094C0: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:653)
==14178== by 0x367B2D3C28: opal_convertor_pack (in
/usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==14178== by 0x944627B: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_btl_sm.so)
==14178== by 0x83E77E7: ??? (in
/usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so)
==14178== by 0x83DED1C: ??? (in
/usr/lib64/openmpi/lib/openmpi/mca_pml_ob1.so)
==14178== by 0x367B26522E: PMPI_Isend (in
/usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==14178== by 0x403C82: main (in /examples/1d/distopt)
==14178== Address 0x8b3256e is 750 bytes inside a block of size 40,704 alloc'd
==14178== at 0x4A074CD: malloc (vg_replace_malloc.c:236)
==14178== by 0x4A07657: realloc (vg_replace_malloc.c:525)
==14178== by 0x4C10074: rt_max_alloc (polyrt.c:139)
==14178== by 0x401D6F: main (in /examples/1d/distopt)
Why was the read shown as invalid here? The block was on the heap;
could it have been made unaddressable some way by code that executed
within PMPI_Isend?
Thanks.
|
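On the closing question (can a byte inside a live heap block become unaddressable?): Memcheck does let a program change the addressability of memory it owns through client requests, so it is possible in principle. The sketch below only illustrates that mechanism. The function name and the sizes (borrowed from the report above) are hypothetical, and nothing in this thread establishes that the Open MPI build in question actually does this.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Use the real Memcheck client requests when the header is installed;
 * otherwise define no-op stand-ins so the sketch still compiles. */
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_MEMCHECK_H 1
#  endif
#endif
#ifndef HAVE_MEMCHECK_H
#  define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void)(addr), (void)(len), 0)
#  define VALGRIND_MAKE_MEM_DEFINED(addr, len)  ((void)(addr), (void)(len), 0)
#endif

/* Sizes taken from the report above, purely for illustration. */
enum { BLOCK_SIZE = 40704, COPY_SIZE = 756, HIDE_OFF = 750, HIDE_LEN = 6 };

/* Hypothetical demo: returns 0 on success, -1 if allocation fails. */
int demo_noaccess_inside_block(void)
{
    unsigned char *block = calloc(1, BLOCK_SIZE);
    unsigned char *copy = malloc(COPY_SIZE);
    if (block == NULL || copy == NULL) {
        free(block);
        free(copy);
        return -1;
    }
    assert(HIDE_OFF + HIDE_LEN <= COPY_SIZE);

    /* First copy: every byte is addressable, so Memcheck stays quiet. */
    memcpy(copy, block, COPY_SIZE);

    /* Mark bytes 750..755 of the live heap block as unaddressable. */
    (void)VALGRIND_MAKE_MEM_NOACCESS(block + HIDE_OFF, HIDE_LEN);

    /* Second copy: under Memcheck this should be flagged as an invalid
     * read at an address inside the still-allocated block. */
    memcpy(copy, block, COPY_SIZE);

    /* Restore the range so the rest of the program sees a normal block. */
    (void)VALGRIND_MAKE_MEM_DEFINED(block + HIDE_OFF, HIDE_LEN);

    free(copy);
    free(block);
    return 0;
}
```

Run natively, the function is just two copies. Run under Memcheck, the first memcpy should pass and the second should be reported, which matches the pattern of a user-level memcpy being clean while a later read of the same bytes is not.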
|
From: John R. <jr...@bi...> - 2012-11-05 15:35:35
|
> ==14178== Invalid read of size 2
> ==14178== at 0x4A094C0: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:653)

> ==14178== Address 0x8b3256e is 750 bytes inside a block of size 40,704 alloc'd
> ==14178== at 0x4A074CD: malloc (vg_replace_malloc.c:236)
> ==14178== by 0x4A07657: realloc (vg_replace_malloc.c:525)

Something peculiar is happening because of the realloc().

Re-run with --track-origins=yes.

Re-run together with vgdb. Stop at this complaint, and use vgdb to print
the status of the entire buffer.

--
|
|
From: Uday R. <uda...@gm...> - 2012-11-05 17:20:52
|
On Mon, Nov 5, 2012 at 9:06 PM, John Reiser <jr...@bi...> wrote:
>> ==14178== Invalid read of size 2
>> ==14178== at 0x4A094C0: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:653)
>
>> ==14178== Address 0x8b3256e is 750 bytes inside a block of size 40,704 alloc'd
>> ==14178== at 0x4A074CD: malloc (vg_replace_malloc.c:236)
>> ==14178== by 0x4A07657: realloc (vg_replace_malloc.c:525)
>
> Something peculiar is happening because of the realloc().
>
> Re-run with --track-origins=yes.
>
> Re-run together with vgdb. Stop at this complaint, and use vgdb to print
> the status of the entire buffer.

Thanks, but I didn't understand the last part. What exactly do I need
to run to "print the status of the entire buffer"? I'm running with

--track-origins=yes --db-attach=yes

my 'man valgrind' has no mention of vgdb.

> _______________________________________________
> Valgrind-users mailing list
> Val...@li...
> https://lists.sourceforge.net/lists/listinfo/valgrind-users
|
|
From: Philippe W. <phi...@sk...> - 2012-11-05 21:47:00
|
On Mon, 2012-11-05 at 22:50 +0530, Uday Reddy wrote:
> On Mon, Nov 5, 2012 at 9:06 PM, John Reiser <jr...@bi...> wrote:
> >> ==14178== Invalid read of size 2
> >> ==14178== at 0x4A094C0: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:653)
> >
> >> ==14178== Address 0x8b3256e is 750 bytes inside a block of size 40,704 alloc'd
> >> ==14178== at 0x4A074CD: malloc (vg_replace_malloc.c:236)
> >> ==14178== by 0x4A07657: realloc (vg_replace_malloc.c:525)
> >
> > Something peculiar is happening because of the realloc().
> >
> > Re-run with --track-origins=yes.
> >
> > Re-run together with vgdb. Stop at this complaint, and use vgdb to print
> > the status of the entire buffer.
>
> Thanks, but I didn't understand the last part. What exactly do I need
> to run to "print the status of the entire buffer"? I'm running with
>
> --track-origins=yes --db-attach=yes
>
> my 'man valgrind' has no mention of vgdb.

vgdb is only available in Valgrind versions >= 3.7.0, so the best
option is to upgrade to the latest version (3.8.1).

With vgdb, you can connect gdb to the process under Valgrind and, e.g.,
verify the addressability or definedness of your buffer using memcheck
monitor commands.

See http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.monitor-commands
for more info about memcheck monitor commands.

See http://www.valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.gdbserver
for more info about how to use gdb with your program under Valgrind.

Philippe
|
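The addressability and definedness checks mentioned above can also be requested from inside the program, using the client-request macros in valgrind/memcheck.h instead of the interactive monitor commands. The following is a minimal sketch of that alternative, not the gdb/vgdb workflow itself; the helper names are hypothetical, and the fallback macros exist only so that the code compiles when the Valgrind headers are not installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Use the real Memcheck client requests when the header is installed;
 * otherwise define stand-ins that behave like a run outside Valgrind. */
#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_MEMCHECK_H 1
#  endif
#endif
#ifndef HAVE_MEMCHECK_H
#  define VALGRIND_CHECK_MEM_IS_ADDRESSABLE(addr, len) \
       ((void)(addr), (void)(len), 0UL)
#  define VALGRIND_CHECK_MEM_IS_DEFINED(addr, len) \
       ((void)(addr), (void)(len), 0UL)
#endif

/* Hypothetical helper: offset of the first byte in buf[0..len) that
 * Memcheck considers unaddressable, or -1 if there is none. Outside
 * Valgrind the client request returns 0, so the result is always -1. */
long first_unaddressable_offset(const void *buf, size_t len)
{
    assert(buf != NULL);
    uintptr_t bad = (uintptr_t)VALGRIND_CHECK_MEM_IS_ADDRESSABLE(buf, len);
    if (bad == 0)
        return -1;
    return (long)(bad - (uintptr_t)buf);
}

/* Hypothetical helper: same idea for definedness, i.e. the first byte
 * that is unaddressable or holds an uninitialised value. */
long first_undefined_offset(const void *buf, size_t len)
{
    assert(buf != NULL);
    uintptr_t bad = (uintptr_t)VALGRIND_CHECK_MEM_IS_DEFINED(buf, len);
    if (bad == 0)
        return -1;
    return (long)(bad - (uintptr_t)buf);
}
```

Calling something like first_unaddressable_offset(buf_arg_of_mpi_send, 756) just before and just after the MPI call would show whether the state of the buffer changes across the call. Under Memcheck, a failed check also prints an error with a stack trace at the call site.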
|
From: Uday R. <uda...@gm...> - 2012-11-06 04:01:44
|
On Mon, Nov 5, 2012 at 11:35 PM, John Reiser <jr...@bi...> wrote:
>> Thanks, but I didn't understand the last part. What exactly do I need
>> to run to "print the status of the entire buffer"? I'm running with
>>
>> --track-origins=yes --db-attach=yes
>>
>> my 'man valgrind' has no mention of vgdb.
>
> $ valgrind --help | grep vgdb
>
> Quoting Philippe Waroquiers of [valgrind-users] "Addresses marked as ???
> in Valgrind stack trace" on 2012-10-09:
> -----
> Since Valgrind 3.7.0, Valgrind contains an embedded gdbserver, to
> which you connect from gdb using vgdb as a relay application.
>
> The advantages of gdb/vgdb compared to --db-attach is
> that you get all the usual gdb commands
> (e.g. breakpoints, continue, jump, inferior function calls, ...)
> + interactive calls of Valgrind functionality (e.g. search for
> memleaks e.g. when a breakpoint is reached).
> vgdb also allows to look at a multi-threaded application,
> allows inferior function calls, etc.
> You can also start to debug your application under Valgrind from
> the beginning (so, before an error has been reported).
>
> To use it, give argument --vgdb-error=0 to Valgrind, and follow
> the instructions to connect your gdb using vgdb.

Thanks.

With vgdb/gdb, the debugger hangs after a couple of continues.

Program received signal SIGTRAP, Trace/breakpoint trap.
0x000000300b4ea3d7 in writev () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install hwloc-1.4.1-2.fc17.x86_64 keyutils-libs-1.5.5-2.fc17.x86_64 krb5-libs-1.10.2-6.fc17.x86_64 libcom_err-1.42.3-3.fc17.x86_64 libesmtp-1.0.6-3.fc17.x86_64 libgcc-4.7.2-2.fc17.x86_64 libibverbs-1.1.6-2.fc17.x86_64 librdmacm-1.0.15-1.fc17.x86_64 libselinux-2.1.10-3.fc17.x86_64 libtool-ltdl-2.4.2-3.1.fc17.x86_64 libxml2-2.7.8-9.fc17.x86_64 numactl-libs-2.0.7-6.fc17.x86_64 openmpi-1.5.4-5.fc17.1.x86_64 openssl-1.0.0j-2.fc17.x86_64 pciutils-libs-3.1.9-1.fc17.x86_64 zlib-1.2.5-7.fc17.x86_64
(gdb) c
Continuing.
[hung here...]

There is no progress after this. The CPU utilization is 0 and it looks
like memcheck has hung. On the valgrind side, I see

==29118== Syscall param writev(vector[...]) points to uninitialised byte(s)
==29118== at 0x300B4EA3D7: writev (in /usr/lib64/libc-2.15.so)
==29118== by 0x6256B92: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_oob_tcp.so)
==29118== by 0x62579AC: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_oob_tcp.so)
==29118== by 0x625B1CB: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_oob_tcp.so)
==29118== by 0x604E41E: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_rml_oob.so)
==29118== by 0x604E998: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_rml_oob.so)
==29118== by 0x66656EF: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so)
==29118== by 0x37CB84D5A2: ??? (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB864C17: PMPI_Init (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x40108A: main (jacobi-1d-imper.distopt.c:72)
==29118== Address 0x51fcc31 is 161 bytes inside a block of size 256 alloc'd
==29118== at 0x4A08A0E: realloc (vg_replace_malloc.c:662)
==29118== by 0x37CB8D1C97: ??? (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB8EAE0D: ??? (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB8B0955: orte_grpcomm_base_pack_modex_entries (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x66655CE: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_grpcomm_bad.so)
==29118== by 0x37CB84D5A2: ??? (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB864C17: PMPI_Init (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x40108A: main (jacobi-1d-imper.distopt.c:72)
==29118== Uninitialised value was created by a heap allocation
==29118== at 0x4A0881C: malloc (vg_replace_malloc.c:270)
==29118== by 0x37CB8E1D7B: ??? (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB8E2440: opal_ifcount (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x6255068: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_oob_tcp.so)
==29118== by 0x37CB8BC45A: mca_oob_base_init (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x604C7EB: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_rml_oob.so)
==29118== by 0x37CB8CAAAC: orte_rml_base_select (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB8AD204: orte_ess_base_app_setup (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x5E4A444: ??? (in /usr/lib64/openmpi/lib/openmpi/mca_ess_env.so)
==29118== by 0x37CB88E9C9: orte_init (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB84CBF8: ??? (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118== by 0x37CB864C17: PMPI_Init (in /usr/lib64/openmpi/lib/libmpi.so.1.0.2)
==29118==
==29118== (action on error) vgdb me ...
==29118== Continuing ...

I'm using valgrind 3.8.1 on x86-64 with options:
--vgdb-error=0 --track-origins=yes
code was compiled with gcc 4.7.2.

-- Uday

> -----
|
|
From: John R. <jr...@bi...> - 2012-11-06 15:03:32
|
> With vgdb/gdb, the debugger hangs after a couple of continues.
>
> Program received signal SIGTRAP, Trace/breakpoint trap.
> 0x000000300b4ea3d7 in writev () from /lib64/libc.so.6

Is there a real 'int3' instruction there, or was SIGTRAP sent
from somewhere else [using kill()]? Look with something like:
(gdb) x/20i 0x000000300b4ea3d7 - 0x20

[Philippe, please comment.]

> There is no progress after this. The CPU utilization is 0 and it looks
> like memcheck has hung. On the valgrind side, I see
> ==29118== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==29118== at 0x300B4EA3D7: writev (in /usr/lib64/libc-2.15.so)
> ==29118== by 0x6256B92: ??? (in
> /usr/lib64/openmpi/lib/openmpi/mca_oob_tcp.so)

What are the likely parameters to that writev()? How many different pieces,
of what sizes, etc.? Would the output be going to a disk file, or to a socket?

--
|
|
From: Uday R. <uda...@gm...> - 2012-11-06 15:12:23
|
On Tue, Nov 6, 2012 at 8:34 PM, John Reiser <jr...@bi...> wrote:
>> With vgdb/gdb, the debugger hangs after a couple of continues.
>>
>> Program received signal SIGTRAP, Trace/breakpoint trap.
>> 0x000000300b4ea3d7 in writev () from /lib64/libc.so.6
>
> Is there a real 'int3' instruction there, or was SIGTRAP sent
> from somewhere else [using kill()]? Look with something like:
> (gdb) x/20i 0x000000300b4ea3d7 - 0x20
>
> [Philippe, please comment.]
>
>> There is no progress after this. The CPU utilization is 0 and it looks
>> like memcheck has hung. On the valgrind side, I see
>> ==29118== Syscall param writev(vector[...]) points to uninitialised byte(s)
>> ==29118== at 0x300B4EA3D7: writev (in /usr/lib64/libc-2.15.so)
>> ==29118== by 0x6256B92: ??? (in
>> /usr/lib64/openmpi/lib/openmpi/mca_oob_tcp.so)
>
> What are the likely parameters to that writev()? How many different pieces,
> of what sizes, etc.? Would the output be going to a disk file, or to a socket?

The output would be going to a socket. I believe it's just sending out
a buffer. If it's just sending out uninitialized data, it should
perhaps not be related to the "invalid read" that I see via valgrind
at a later stage?

Thanks.
|
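For readers unfamiliar with the writev complaint discussed here, the pattern that usually produces it is a heap buffer that is only partly filled before the whole block is handed to the kernel. The sketch below is an assumed, stand-alone reproduction of that pattern over a local socket pair; the helper name is hypothetical, and the sizes (161 and 256) are borrowed from the report only for illustration. It does not claim to show what Open MPI does internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: allocate a 256-byte heap block, initialise only
 * the first `filled` bytes, and send the whole block over a local
 * socket pair. Returns the number of bytes written, or -1 on failure. */
ssize_t send_partly_filled(size_t filled)
{
    const size_t cap = 256;
    int sv[2];
    char *msg;
    struct iovec iov;
    ssize_t n;

    assert(filled <= cap);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    msg = malloc(cap);
    if (msg == NULL) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    /* Bytes [filled, cap) are never written, so they stay undefined. */
    memset(msg, 'x', filled);

    /* The kernel reads all `cap` bytes. Under Memcheck this should be
     * reported as a writev parameter pointing to uninitialised bytes. */
    iov.iov_base = msg;
    iov.iov_len = cap;
    n = writev(sv[0], &iov, 1);

    free(msg);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

Every byte in this buffer is addressable; only its definedness is in question. That is a different property from the one behind an "Invalid read", which is consistent with the suggestion above that the two reports may be unrelated.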
|
From: Philippe W. <phi...@sk...> - 2012-11-06 21:30:31
|
On Tue, 2012-11-06 at 20:42 +0530, Uday Reddy wrote:
> On Tue, Nov 6, 2012 at 8:34 PM, John Reiser <jr...@bi...> wrote:
> >> With vgdb/gdb, the debugger hangs after a couple of continues.
> >>
> >> Program received signal SIGTRAP, Trace/breakpoint trap.
> >> 0x000000300b4ea3d7 in writev () from /lib64/libc.so.6
> >
> > Is there a real 'int3' instruction there, or was SIGTRAP sent
> > from somewhere else [using kill()]? Look with something like:
> > (gdb) x/20i 0x000000300b4ea3d7 - 0x20
> >
> > [Philippe, please comment.]

When Valgrind encounters an error and --vgdb-error=xxxx tells it to let
GDB connect, Valgrind returns control to GDB by reporting that a SIGTRAP
has been encountered.
Valgrind gdbsrv never uses "real trap instructions", nor does it
generate a SIGTRAP signal with kill().
The SIGTRAP reported by GDB is purely because Valgrind gdbsrv reports
that the process has stopped its execution due to an error.
The gdbserver protocol requires a stop reason to be indicated: SIGTRAP
looks as good a reason as any other (the protocol does not have a
reason "stopped due to an error detected by Valgrind" :).

So, in summary, we can be reasonably sure that the SIGTRAP above is
caused by Valgrind reporting the "writev error" to GDB, to allow
the GDB user to dig deeper into this error.

> >> There is no progress after this. The CPU utilization is 0 and it looks
> >> like memcheck has hung. On the valgrind side, I see
> >> ==29118== Syscall param writev(vector[...]) points to uninitialised byte(s)
> >> ==29118== at 0x300B4EA3D7: writev (in /usr/lib64/libc-2.15.so)
> >> ==29118== by 0x6256B92: ??? (in
> >> /usr/lib64/openmpi/lib/openmpi/mca_oob_tcp.so)
> >
> > What are the likely parameters to that writev()? How many different pieces,
> > of what sizes, etc.? Would the output be going to a disk file, or to a socket?
>
> The output would be going to a socket. I believe it's just sending out
> a buffer. If it's just sending out uninitialized data, it should
> perhaps not be related to the "invalid read" that I see via valgrind
> at a later stage?

If memcheck appears to be hung, just type control-c in GDB.
This should interrupt the Valgrind process/memcheck.
Then you can use GDB commands such as backtrace to see what your
application is doing (I guess it is blocked in a system call if it
really consumes no CPU at all).

Philippe

NB: It is always possible that there is a bug in Valgrind and/or
in Valgrind gdbsrv which causes your application to be really
"abnormally" hung under Valgrind.
If that is the case, re-running with
-d -d -d -v -v -v --trace-syscalls=yes
could give an idea about what is going wrong.
|