From: Joan <joa...@gm...> - 2008-02-20 11:49:25

Hi everybody,

I've been using Valgrind to debug an application I wrote. Now I'm programming in an MPICH 1.0 environment, with distributed memory. I'd like to know how I can debug my application with Valgrind in that situation. I've been searching the net, but what I found (http://www.hlrs.de/people/keller/MPI/mpich_valgrind.html) has not been useful.

Could somebody explain to me how to do it?

Thank you very much

--
Blog: http://earenjoy.blogspot.com

From: Ashley P. <api...@co...> - 2008-02-20 11:59:45

On Wed, 2008-02-20 at 12:49 +0100, Joan wrote:
> I'd like to know how I can debug my application with Valgrind in
> that situation. I've been searching the net, but what I found
> (http://www.hlrs.de/people/keller/MPI/mpich_valgrind.html) has not
> been useful.

It depends on the parallel environment you are using, i.e. how you start your parallel program. Rainer's page used the command below. If you are using the mpirun provided with MPICH-1 you could try that; if not, let me know how you are starting jobs and I'll see if I can help you.

  /opt/mpich-1.2.6-gcc-ch_shmem/bin/mpirun -np 2 -dbg=valgrind ./mpi_murks

Ashley Pittman.

From: Ashley P. <api...@co...> - 2008-02-20 12:37:41

On Wed, 2008-02-20 at 13:19 +0100, Joan wrote:
> In fact I'm using mpirun to start my application, but it doesn't
> work. I don't know what exactly it could be.

It looks to be working to me. You haven't compiled your application with -g, which would allow Valgrind to resolve file/line numbers, but it is running your program under Valgrind. What was the full command line you used? You aren't getting complete stack traces; it looks like you are using the --num-callers=1 option.

For parallel environments I'd recommend using the --log-file-qualifier or --log-file option, depending on the version of Valgrind you are using. This will give you one output file per rank, which is easier to deal with.

Ashley

> I'm checking it with the source code that appears at
> http://www.hlrs.de/people/keller/MPI/mpich_valgrind.html and this is
> what I get from Valgrind:
>
> ==17677== Memcheck, a memory error detector.
> ==17677== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
> ==17677== Using LibVEX rev 1658, a library for dynamic binary translation.
> ==17677== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
> ==17677== Using valgrind-3.2.1-Debian, a dynamic binary instrumentation framework.
> ==17677== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
> ==17677== For more details, rerun with: -v
> ==17677==
> ==17677== Invalid read of size 4
> ==17677== at 0x4010DE9: (within /lib/ld-2.3.6.so)
> ==17677== Address 0x42BEAD4 is 36 bytes inside a block of size 37 alloc'd
> ==17677== at 0x401D38B: malloc (vg_replace_malloc.c:149)
> ==17677==
> ==17677== Invalid read of size 4
> ==17677== at 0x4010E17: (within /lib/ld-2.3.6.so)
> ==17677== Address 0x4315A9C is 28 bytes inside a block of size 31 alloc'd
> ==17677== at 0x401D38B: malloc (vg_replace_malloc.c:149)
> ==17677==
> ==17677== Invalid read of size 4
> ==17677== at 0x4010DD3: (within /lib/ld-2.3.6.so)
> ==17677== Address 0x4316118 is 32 bytes inside a block of size 35 alloc'd
> ==17677== at 0x401D38B: malloc (vg_replace_malloc.c:149)
> ==17677==
> ==17677== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==17677== at 0x4000792: (within /lib/ld-2.3.6.so)
> ==17677== Address 0xBECC8568 is on thread 1's stack
> ==17677==
> ==17677== Invalid write of size 1
> ==17677== at 0x80AD60F: MPID_CH_Eagerb_recv_short (in /home/aprat/provaValgrind/prova)
> ==17677== Address 0x4323290 is 0 bytes after a block of size 40 alloc'd
> ==17677== at 0x401D38B: malloc (vg_replace_malloc.c:149)
> ==17678==
> ==17678== ERROR SUMMARY: 4 errors from 3 contexts (suppressed: 37 from 1)
> ==17678== malloc/free: in use at exit: 549,776 bytes in 30 blocks.
> ==17678== malloc/free: 97 allocs, 67 frees, 559,818 bytes allocated.
> ==17678== For counts of detected errors, rerun with: -v
> ==17678== searching for pointers to 30 not-freed blocks.
> ==17678== checked 179,208 bytes.
> ==17678==
> ==17678== LEAK SUMMARY:
> ==17678== definitely lost: 156 bytes in 11 blocks.
> ==17678== possibly lost: 0 bytes in 0 blocks.
> ==17678== still reachable: 549,620 bytes in 19 blocks.
> ==17678== suppressed: 0 bytes in 0 blocks.
> ==17678== Use --leak-check=full to see details of leaked memory.
> ==17677==
> ==17677== ERROR SUMMARY: 14 errors from 5 contexts (suppressed: 42 from 2)
> ==17677== malloc/free: in use at exit: 3,044 bytes in 16 blocks.
> ==17677== malloc/free: 172 allocs, 156 frees, 606,678 bytes allocated.
> ==17677== For counts of detected errors, rerun with: -v
> ==17677== searching for pointers to 16 not-freed blocks.
> ==17677== checked 174,272 bytes.
> ==17677==
> ==17677== LEAK SUMMARY:
> ==17677== definitely lost: 204 bytes in 14 blocks.
> ==17677== possibly lost: 0 bytes in 0 blocks.
> ==17677== still reachable: 2,840 bytes in 2 blocks.
> ==17677== suppressed: 0 bytes in 0 blocks.
> ==17677== Use --leak-check=full to see details of leaked memory.
>
> It does not really seem like what I should see, does it?

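Ashley's two suggestions above can be sketched as commands. The compiler-wrapper name and whether plain `mpirun valgrind ...` works depend on the installation (with MPICH-1, Ashley's `-dbg=valgrind` route is the supported one); the source and binary names are taken from the trace in the thread:

```shell
# Recompile with -g so Valgrind can resolve file/line numbers
mpiCC -g -o prova mpiValgrind.cpp

# Full stack traces plus one log file per process; with valgrind
# 3.2.x, --log-file=<prefix> writes one <prefix>.pid<PID> file per rank
mpirun -np 2 valgrind --num-callers=12 --log-file=vg ./prova
```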
From: Julian S. <js...@ac...> - 2008-02-20 12:46:44

(This also partially addresses Robert Anderson's earlier query.)

(If you haven't already, read the section in the Memcheck manual on MPI:
http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap)

MPI is a difficult (but not impossible) target for Memcheck. The key reasons are:

(a) MPI libraries often replace malloc/free with their own allocator. This seriously reduces Memcheck's ability to find problems, and it also causes many false positives.

(b) MPI libraries usually do DMA direct from card to userspace. Again, Memcheck cannot see this happening, and you get many false errors and missed errors.

The most effective approach seems to be to adjust your MPI arrangements to avoid these problems, even if this loses some performance. At the moment it appears that the state of the art for MPI libraries does not simultaneously provide maximum performance and adequate visibility for Memcheck to work well.

I have had some success (in my very limited testing) with OpenMPI. It has flags which avoid (b) (and possibly (a), since malloc/free replacements are usually done to support (b)). At least when I tried OpenMPI a couple of years ago, the following helped:

  mpirun --mca btl tcp,self -np <nprocs> a.out

which (according to an OpenMPI developer) "disables shared memory by telling Open MPI to only use loopback and tcp". Note that you should ask for tcp between all MPI processes, even those running on the same core, since Valgrind does not handle synchronisation through shared memory properly.

Other people have had varying degrees of success using Memcheck with direct-to-userspace card transfers (b), although that generally makes life more difficult and requires suppressing large numbers of false errors to make it usable. I have heard before about (a) w.r.t. MPICH2, and it strikes me as essentially unsolvable; the only effective fix that I know of is to ask the library not to supply its own allocator.

All that said, developers of these various MPI libraries can surely give you better answers than those above. It may be worth asking them directly.

J

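Julian's OpenMPI recipe, combined with Ashley's per-rank logging advice, might look like the following. The process count, binary name, and wrapper-library path are illustrative, and the exact libmpiwrap file name is platform-dependent:

```shell
# Force all MPI traffic over TCP/loopback so Memcheck can see it,
# running each rank under valgrind with its own log file
mpirun --mca btl tcp,self -np 4 \
    valgrind --log-file=vg ./a.out

# Optionally preload Valgrind's MPI wrapper library to also get
# semantic checks on the MPI calls themselves (see the
# mc-manual.mpiwrap link above for the details)
LD_PRELOAD=<valgrind-prefix>/lib/valgrind/libmpiwrap-<platform>.so \
    mpirun --mca btl tcp,self -np 4 valgrind ./a.out
```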
From: Patrick O. <pat...@in...> - 2008-02-21 08:20:09

Hello all,

I hesitated to write this email because it could be seen as a plug for a commercial product and I wasn't sure whether that would be welcome here. I promise to keep the advertising to a minimum... For the purpose of full disclosure: I used to be the main developer of the MPI correctness checking feature in the Intel(R) Trace Analyzer and Collector, which works with Intel(R) MPI. Since 7.1, distributed memory checking with Valgrind is also supported.

Julian might remember that in the distant past I confirmed that the V-bits get/set functionality would be important and should be added again; he was so kind to add it again in the 3.x releases. This is what the distributed memory checking uses today: it takes the definedness information on the sender, transmits it on a side channel, then restores the definedness information on the receiver. This way, false positives about sending partially initialised data which is not used at the recipient are avoided. It also avoids false negatives at the recipient, where it would otherwise incorrectly assume that all received data is initialised. The drawback, of course, is delayed error reporting: a policy of "all outgoing data must be initialised" is easier to enforce and check.

One other use of Valgrind is to make memory inaccessible while it is semantically owned by MPI. That way, incorrect accesses by the application to that memory are flagged immediately by Valgrind.

I have a white paper lying around which describes these features in more detail, but it hasn't been published outside of Intel. If there is interest in it I could try to get it released.

On Wed, 2008-02-20 at 13:43 +0100, Julian Seward wrote:
> MPI is a difficult (but not impossible) target for Memcheck.
> The key reasons are
>
> (a) MPI libraries often replace malloc/free with their own
>     allocator. This seriously reduces Memcheck's ability to
>     find problems, and it also causes many false positives.

I don't think this is a problem with Intel MPI; if it is, the Intel MPI team is in the ideal position to fix it, because nowadays they are responsible for both products.

> (b) MPI libraries usually do DMA direct from card to userspace.
>     Again, Memcheck cannot see this happening, and you get many
>     false errors and missed errors.

With the approach that I described above this is not an issue, because before the application gets access to that memory the definedness information is updated. The MPI library's internal accesses to it might trigger reports; these are suppressed.

> All that said, developers of these various MPI libraries can surely
> give you better answers than those above. It may be worth asking
> them directly.

I took the first sentence as an invitation to speak up here at all, but further inquiries regarding the products that I mentioned are indeed better directed towards the normal Intel support channels.

--
Best Regards, Patrick Ohly

The content of this message is my personal opinion only and although I am an employee of Intel, the statements I make here in no way represent Intel's position on the issue, nor am I authorized to speak on behalf of Intel on this matter. The email footer below is automatically added to comply with company policy; for this email the "intended recipient(s)" are all human and non-human inhabitants of planet Earth.

---------------------------------------------------------------------
Intel GmbH
Dornacher Strasse 1
85622 Feldkirchen/Muenchen, Germany
Sitz der Gesellschaft: Feldkirchen bei Muenchen
Geschaeftsfuehrer: Douglas Lusk, Peter Gleissner, Hannes Schwaderer
Registergericht: Muenchen HRB 47456
Ust.-IdNr./VAT Registration No.: DE129385895
Citibank Frankfurt (BLZ 502 109 00) 600119052

This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.

From: Nicholas N. <nj...@cs...> - 2008-02-20 21:16:01

On Wed, 20 Feb 2008, Julian Seward wrote:
> (b) MPI libraries usually do DMA direct from card to userspace.
>     Again, Memcheck cannot see this happening, and you get many
>     false errors and missed errors.

Does the --ignore-ranges option help with that?

N

From: Julian S. <js...@ac...> - 2008-02-20 21:52:25

On Wednesday 20 February 2008 22:16, Nicholas Nethercote wrote:
> Does the --ignore-ranges option help with that?

Yes; and indeed this is the exact reason for its existence. You can tell Memcheck the address ranges where the card is mapped, and it will effectively ignore those (ignore all writes, and assume all reads return initialised data). But it's definitely a method of last resort:

* You have to somehow figure out where the card is mapped, and pray that it doesn't get mapped at different locations in different runs. In some cases I've seen, more than one address range is involved, which just makes it more difficult.

* It's indiscriminate. Presumably some parts of the address range are the actual end-user I/O buffers, and other parts are for controlling the card. Inside such an area, Memcheck can't check anything: whether you are overrunning the user data areas, or reading data from the user data area that wasn't sent to the card (so you're really reading uninitialised values), etc.

Far better for the MPI library to communicate via standard syscalls (over TCP or UDP) and in short be an absolutely standard userspace process, at least for debugging runs, since that has a bad effect on performance for production runs. I believe the flags I specified for OpenMPI do achieve that; using them, it seemed easy to get reasonable results (essentially zero false errors) with little effort. Throw in the MPI wrapper library and you additionally get MPI-level semantic checks on the library calls.

J

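As a concrete illustration of the last-resort option Julian describes (the address ranges here are invented; you would have to discover the card's real mapping, e.g. from /proc/<pid>/maps or the driver documentation):

```shell
# Tell Memcheck to ignore the mapped card memory: all writes into the
# range are discarded from checking, all reads count as initialised
valgrind --tool=memcheck \
    --ignore-ranges=0x57000000-0x57ffffff,0x58000000-0x58ffffff \
    ./mpi_app
```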
From: Ashley P. <api...@co...> - 2008-02-25 11:17:10
On Thu, 2008-02-21 at 12:38 +0100, Joan wrote:
> Hi Ashley,
>
> I've been applying your advice, which has really helped me.
>
> Now I get this trace:
>
> ==20977== Memcheck, a memory error detector.
> ==20977== Copyright (C) 2002-2006, and GNU GPL'd, by Julian
> Seward et al.
> ==20977== Using LibVEX rev 1658, a library for dynamic binary
> translation.
> ==20977== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks
> LLP.
> ==20977== Using valgrind-3.2.1-Debian, a dynamic binary
> instrumentation framework.
> ==20977== Copyright (C) 2000-2006, and GNU GPL'd, by Julian
> Seward et al.
> ==20977== For more details, rerun with: -v
> ==20977==
> ==20977== My PID = 20977, parent PID = 20976. Prog and args
> are:
> ==20977== /home/aprat/provaValgrind/prova
> ==20977== -p4pg
> ==20977== /home/aprat/provaValgrind/PI20893
> ==20977== -p4wd
> ==20977== /home/aprat/provaValgrind
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010C4E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010C5D: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010C6C: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010C7B: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010DDC: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010DE7: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010DDC: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4004B78: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4006792: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4010DE7: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4004B78: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4006792: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Invalid read of size 4
> ==20977== at 0x4010DE9: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4004B78: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4006792: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== Address 0x42BEAD4 is 36 bytes inside a block of
> size 37 alloc'd
> ==20977== at 0x401D38B: malloc (vg_replace_malloc.c:149)
> ==20977== by 0x4006B83: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4008ED5: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B7C4:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Conditional jump or move depends on uninitialised
> value(s)
> ==20977== at 0x4008B2E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B7C4:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x42377FF:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x4239275: __nss_hosts_lookup
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423BEF5: gethostbyname_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423B72D: gethostbyname
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Invalid read of size 4
> ==20977== at 0x4010E17: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4004B78: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4006792: (within /lib/ld-2.3.6.so)
> ==20977== by 0x400A1F6: (within /lib/ld-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x400A3CA: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B3D4:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== Address 0x4315A9C is 28 bytes inside a block of
> size 31 alloc'd
> ==20977== at 0x401D38B: malloc (vg_replace_malloc.c:149)
> ==20977== by 0x4006B83: (within /lib/ld-2.3.6.so)
> ==20977== by 0x400A1F6: (within /lib/ld-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x400A3CA: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B3D4:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Invalid read of size 4
> ==20977== at 0x4010DD3: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4004B78: (within /lib/ld-2.3.6.so)
> ==20977== by 0x4006792: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x439A169:
> (within /lib/tls/i686/cmov/libnss_compat-2.3.6.so)
> ==20977== by 0x439B28C: _nss_compat_getpwuid_r
> (in /lib/tls/i686/cmov/libnss_compat-2.3.6.so)
> ==20977== Address 0x4316118 is 32 bytes inside a block of
> size 35 alloc'd
> ==20977== at 0x401D38B: malloc (vg_replace_malloc.c:149)
> ==20977== by 0x4006B83: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425B36F:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425ADDE: _dl_open
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x425D5FC:
> (within /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x400B44E: (within /lib/ld-2.3.6.so)
> ==20977== by 0x425D65D: __libc_dlopen_mode
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x423770F: __nss_lookup_function
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977== by 0x439A169:
> (within /lib/tls/i686/cmov/libnss_compat-2.3.6.so)
> ==20977== by 0x439B28C: _nss_compat_getpwuid_r
> (in /lib/tls/i686/cmov/libnss_compat-2.3.6.so)
> ==20977== by 0x41E79D4: getpwuid_r
> (in /lib/tls/i686/cmov/libc-2.3.6.so)
> ==20977==
> ==20977== Syscall param write(buf) points to uninitialised
> byte(s)
> ==20977== at 0x4000792: (within /lib/ld-2.3.6.so)
> ==20977== by 0x808FF9D: net_send
> (in /home/aprat/provaValgrind/prova)
> ==20977== by 0x809127E: net_slave_info
> (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8091059: create_remote_processes (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808CB8D: p4_startup (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808C9B5: p4_create_procgroup (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809D40C: MPID_P4_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809C39C: MPID_CH_InitMsgPass (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8097670: MPID_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x806459B: MPIR_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8064383: PMPI_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x804D873: main (mpiValgrind.cpp:15)
> ==20977==  Address 0xBEA4D814 is on thread 1's stack
> ==20977==
> ==20977== Syscall param write(buf) points to uninitialised byte(s)
> ==20977==    at 0x4000792: (within /lib/ld-2.3.6.so)
> ==20977==    by 0x808FF9D: net_send (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808D933: send_proc_table (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808CBA5: p4_startup (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808C9B5: p4_create_procgroup (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809D40C: MPID_P4_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809C39C: MPID_CH_InitMsgPass (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8097670: MPID_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x806459B: MPIR_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8064383: PMPI_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x804D873: main (mpiValgrind.cpp:15)
> ==20977==  Address 0xBEA4E878 is on thread 1's stack
> ==20977==
> ==20977== Syscall param write(buf) points to uninitialised byte(s)
> ==20977==    at 0x4000792: (within /lib/ld-2.3.6.so)
> ==20977==    by 0x808FF9D: net_send (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808D9D8: send_proc_table (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808CBA5: p4_startup (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x808C9B5: p4_create_procgroup (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809D40C: MPID_P4_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809C39C: MPID_CH_InitMsgPass (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8097670: MPID_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x806459B: MPIR_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8064383: PMPI_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x804D873: main (mpiValgrind.cpp:15)
> ==20977==  Address 0xBEA4E878 is on thread 1's stack
> ==20977==
> ==20977== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==20977==    at 0x4000792: (within /lib/ld-2.3.6.so)
> ==20977==    by 0x8090361: net_send2 (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8093555: socket_send (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x80A552E: send_message (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x80A6053: subtree_broadcast_p4 (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x80A5EF9: p4_broadcastx (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809D4AA: MPID_P4_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x809C39C: MPID_CH_InitMsgPass (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8097670: MPID_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x806459B: MPIR_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x8064383: PMPI_Init (in /home/aprat/provaValgrind/prova)
> ==20977==    by 0x804D873: main (mpiValgrind.cpp:15)
> ==20977==  Address 0xBEA4F568 is on thread 1's stack
> ==20978==
> ==20978== ERROR SUMMARY: 20 errors from 13 contexts (suppressed: 21 from 1)
> ==20978== malloc/free: in use at exit: 549,776 bytes in 30 blocks.
> ==20978== malloc/free: 97 allocs, 67 frees, 559,818 bytes allocated.
> ==20978== For counts of detected errors, rerun with: -v
> ==20978== searching for pointers to 30 not-freed blocks.
> ==20978== checked 179,216 bytes.
> ==20978==
> ==20978== LEAK SUMMARY:
> ==20978==    definitely lost: 156 bytes in 11 blocks.
> ==20978==    possibly lost: 0 bytes in 0 blocks.
> ==20978==    still reachable: 549,620 bytes in 19 blocks.
> ==20978==         suppressed: 0 bytes in 0 blocks.
> ==20978== Use --leak-check=full to see details of leaked memory.
> ==20977==
> ==20977== ERROR SUMMARY: 28 errors from 17 contexts (suppressed: 21 from 1)
> ==20977== malloc/free: in use at exit: 595,864 bytes in 92 blocks.
> ==20977== malloc/free: 163 allocs, 71 frees, 606,358 bytes allocated.
> ==20977== For counts of detected errors, rerun with: -v
> ==20977== searching for pointers to 92 not-freed blocks.
> ==20977== checked 183,112 bytes.
> ==20977==
> ==20977== LEAK SUMMARY:
> ==20977==    definitely lost: 156 bytes in 11 blocks.
> ==20977==    possibly lost: 0 bytes in 0 blocks.
> ==20977==    still reachable: 595,708 bytes in 81 blocks.
> ==20977==         suppressed: 0 bytes in 0 blocks.
> ==20977== Use --leak-check=full to see details of leaked memory.
>
>
> Which is what I was hoping for. The only problem is the highlighted
> lines: they seem to indicate that all the problems are in line 15 of
> my source code, which is:
>
> MPI_Init (&argc,&argv);
>
> (main is, of course: int main (int argc, char* argv[]) )
The last four errors are coming from within MPI_Init(); the call stack
includes your function, but the error is most likely within MPI itself.
You also have lots of errors from program start-up which you wouldn't
normally see. Are you sure the version of valgrind you are using was
compiled on the same OS as the compute nodes?
Perhaps more importantly, you should have one output file per rank. It
looks to me like you are running rank 0 under valgrind but the other
processes natively.
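As a sketch, per-rank logging can be set up in either of two ways (the
mpirun path and binary names here are from the earlier message in this
thread; the exact spelling of the log-file option depends on your
valgrind version, so check `valgrind --help` first):

```shell
# mpich-1's mpirun can start every rank under a debugger wrapper:
/opt/mpich-1.2.6-gcc-ch_shmem/bin/mpirun -np 2 -dbg=valgrind ./mpi_murks

# Or wrap the program yourself. valgrind 3.2 appends the PID to the
# name given with --log-file, so each rank writes its own file:
mpirun -np 2 valgrind --log-file=vg.out ./mpi_murks

# Newer valgrind versions use a %p placeholder instead:
#   mpirun -np 2 valgrind --log-file=vg.out.%p ./mpi_murks
```

Either way you end up with one file per process, which is much easier
to read than interleaved output from all ranks on one terminal.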
> Shouldn't it show me other kinds of errors? I have things like:
>
> mpiValgrind.cpp:24:: printf ("(Rank %d) array[i]:%d
> \n",rank,array[500000]);
>
> where array is: array = (int*) malloc (10 * sizeof(int)); and rank
> has not been initialized...
> Shouldn't Valgrind report that?
If rank hasn't been initialised it should report the error.
array[500000] potentially won't give an error, as Valgrind can only
tell if you read from invalid memory, not whether you read from the
memory region you wanted to. There are redzones around allocations,
but these are in the region of 16 bytes IIRC.
> Thank you for you help, it's being very helpful!
If you post to valgrind-users you'll get a quicker, and possibly more
informed, response.
Ashley,
|