|
From: Raghu R. <rag...@no...> - 2013-01-21 14:50:14
|
I was wondering if anyone has successfully used valgrind with MPI
applications on SGI systems with MPT?
I am trying to track some memory errors (and possibly threading errors) in
an MPI application in the following environment:
SGI Ice cluster with Westmere processors and Infiniband network.
Running RHEL.
Intel FORTRAN and C compilers.
Using SGI MPT for MPI.
Using a non-MPI program (the simple example from the valgrind website
tutorial) works exactly as documented. However, an MPI hello world example
with the same error does not point out the error, even though there are
messages from the MPI wrappers.
The code that was used and the output generated are included below. The
code was compiled using the following compilation line:
mpicc -DBUG -g -O0 -o hello_mpi_c hello_mpi_c.c
/contrib/valgrind/valgrind-3.8.1/lib/valgrind/libmpiwrap-amd64-linux.so
The code was executed using the following line:
mpiexec_mpt -np 2 /contrib/valgrind/valgrind-3.8.1/bin/valgrind hello_mpi_c
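(For reference, my understanding is that the valgrind manual also describes an
LD_PRELOAD-based setup instead of linking libmpiwrap at compile time; I have
not confirmed whether mpiexec_mpt forwards the environment to the ranks, so
this is only a sketch:
export LD_PRELOAD=/contrib/valgrind/valgrind-3.8.1/lib/valgrind/libmpiwrap-amd64-linux.so
mpiexec_mpt -np 2 /contrib/valgrind/valgrind-3.8.1/bin/valgrind hello_mpi_c
Either way, the wrapper messages below suggest libmpiwrap itself is being
picked up.)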
Any suggestions on what may be missing or what I'm doing wrong? Has anyone
used valgrind with MPI applications?
Thanks,
--Raghu
The code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int ierr, myid, npes;
    int len;
    char name[MPI_MAX_PROCESSOR_NAME];

    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &npes);
    ierr = MPI_Get_processor_name( name, &len );

    printf("Hello from rank %d out of %d; procname = %s\n", myid, npes,
           name);

#ifdef BUG
    {
        int* x = (int*)malloc(10 * sizeof(int));
        x[10] = 0;                       // problem 1: heap block overrun
        printf("Print something %d\n", x[10]);
    }                                    // problem 2: memory leak -- x not freed
#endif

    ierr = MPI_Finalize();
    return 0;
}
The output:
+ np=2
+ export MPIWRAP_DEBUG=verbose
+ MPIWRAP_DEBUG=verbose
+ mpiexec_mpt -np 2 /contrib/valgrind/valgrind-3.8.1/bin/valgrind
hello_mpi_c
valgrind MPI wrappers 5398: Active for pid 5398
valgrind MPI wrappers 5398: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 5398: enter PMPI_Init
valgrind MPI wrappers 5397: Active for pid 5397
valgrind MPI wrappers 5397: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 5397: enter PMPI_Init
valgrind MPI wrappers 5397: enter PMPI_Init_thread
valgrind MPI wrappers 5398: enter PMPI_Init_thread
valgrind MPI wrappers 5398: exit PMPI_Init (err = 0)
valgrind MPI wrappers 5398: enter PMPI_Comm_rank
valgrind MPI wrappers 5398: exit PMPI_Comm_rank (err = 0)
valgrind MPI wrappers 5398: enter PMPI_Comm_size
valgrind MPI wrappers 5397: exit PMPI_Init (err = 0)
valgrind MPI wrappers 5398: exit PMPI_Comm_size (err = 0)
valgrind MPI wrappers 5397: enter PMPI_Comm_rank
valgrind MPI wrappers 5398: enter PMPI_Get_processor_name
valgrind MPI wrappers 5397: exit PMPI_Comm_rank (err = 0)
valgrind MPI wrappers 5397: enter PMPI_Comm_size
Hello from rank 1 out of 2; procname = r7i0n0
valgrind MPI wrappers 5397: exit PMPI_Comm_size (err = 0)
valgrind MPI wrappers 5397: enter PMPI_Get_processor_name
Hello from rank 0 out of 2; procname = r7i0n0
Print something 0
valgrind MPI wrappers 5398: enter PMPI_Finalize
Print something 0
valgrind MPI wrappers 5397: enter PMPI_Finalize
valgrind MPI wrappers 5398: exit PMPI_Finalize (err = 0)
valgrind MPI wrappers 5397: exit PMPI_Finalize (err = 0)
+ env
+ grep MPIWRAP
MPIWRAP_DEBUG=verbose
|
|
From: Julian S. <js...@ac...> - 2013-01-29 09:33:26
|
On 01/21/2013 03:49 PM, Raghu Reddy wrote:
> I was wondering if anyone has successfully used valgrind with MPI
> applications on SGI systems with MPT?
I don't know about on SGI w/ MPT (whatever MPT is). But for sure in general
on MPI, it works.
> Using a non-MPI program (the simple example from the valgrind website
> tutorial) works exactly as documented. However, an MPI hello world example
> with the same error does not point out the error, even though there are
> messages from the MPI wrappers.
Does your MPI hello world test work as expected (with -DBUG) if you remove
the MPI specifics and just run it as an ordinary executable?
J
|
|
From: Raghu R. <rag...@no...> - 2013-01-30 14:39:05
|
Hi Julian,
Thank you for your response. Here is some additional information. I
apologize that this is going to be a long message; I'm trying to provide
everything relevant to the problem.
I think I'm following the instructions as documented in the MPI section of
the valgrind web page. However, the fact that I don't see the valgrind
banner when I run an MPI application suggests that I'm missing something.
The only messages I get are those from valgrind's MPI wrappers.
Detailed responses to your questions are in line below.
Thank you very much for any information you can provide!
On 01/21/2013 03:49 PM, Raghu Reddy wrote:
>> I was wondering if anyone has successfully used valgrind with MPI
>> applications on SGI systems with MPT?
>
> I don't know about on SGI w/ MPT (whatever MPT is). But for sure in
general on MPI, it works.
The SGI implementation of MPI is called MPT (Message Passing Toolkit). So
it is simply another implementation of MPI and conforms to the MPI-2
standard.
>> Using a non-MPI program (the simple example from the valgrind website
>> tutorial) works exactly as documented. However, an MPI hello world
>> example with the same error does not point out the error, even though
>> there are messages from the MPI wrappers.
>
> Does your MPI hello world test work as expected (with -DBUG) if you remove
the MPI specifics and just run it as an
> ordinary executable?
Without Valgrind:
=============
If I compile my MPI example code with -DBUG (without linking with valgrind)
and launch it with the MPI launcher (on SGI systems it is not possible to
run an MPI program without using the MPI launcher), the program runs to
completion even though it has a bug (the complete code was included in the
original message; I wasn't sure whether it was appropriate to include it
again for completeness):
r31i2n2% mpicc -DBUG -g -O0 -o hello_mpi_c hello_mpi_c.c
r31i2n2% mpiexec_mpt -np 4 ./hello_mpi_c
Hello from rank 0 out of 4; procname = r31i2n2
Print something 0
Hello from rank 1 out of 4; procname = r31i2n2
Print something 0
Hello from rank 2 out of 4; procname = r31i2n2
Print something 0
Hello from rank 3 out of 4; procname = r31i2n2
Print something 0
r31i2n2%
If I make it a serial program by stripping out all the MPI calls, I can
execute it directly, and it runs to completion (even though there is a
bug):
r31i2n2% m mem-bug.c
#include <stdlib.h>

void f(void)
{
    int* x = malloc(10 * sizeof(int));
    x[10] = 0;       // problem 1: heap block overrun
}                    // problem 2: memory leak -- x not freed

int main(void)
{
    f();
    return 0;
}
r31i2n2%
r31i2n2% icc -o mem-bug -debug mem-bug.c
r31i2n2%
r31i2n2% ./mem-bug
r31i2n2%
With valgrind:
==========
When the serial program with no MPI is launched under valgrind, it does
point to the error, and valgrind works as expected:
r31i2n2% /contrib/valgrind/valgrind-3.8.1/bin/valgrind ./mem-bug
==9806== Memcheck, a memory error detector
==9806== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==9806== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9806== Command: ./mem-bug
==9806==
==9806== Invalid write of size 4
==9806== at 0x40051E: f (mem-bug.c:6)
==9806== by 0x40052E: main (mem-bug.c:11)
==9806== Address 0x5a70068 is 0 bytes after a block of size 40 alloc'd
==9806== at 0x4C278FE: malloc (vg_replace_malloc.c:270)
==9806== by 0x400508: f (mem-bug.c:5)
==9806== by 0x40052E: main (mem-bug.c:11)
==9806==
==9806==
==9806== HEAP SUMMARY:
==9806== in use at exit: 40 bytes in 1 blocks
==9806== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==9806==
==9806== LEAK SUMMARY:
==9806== definitely lost: 40 bytes in 1 blocks
==9806== indirectly lost: 0 bytes in 0 blocks
==9806== possibly lost: 0 bytes in 0 blocks
==9806== still reachable: 0 bytes in 0 blocks
==9806== suppressed: 0 bytes in 0 blocks
==9806== Rerun with --leak-check=full to see details of leaked memory
==9806==
==9806== For counts of detected and suppressed errors, rerun with: -v
==9806== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
r31i2n2%
But the problem is I am unable to get valgrind to point out the problem in
the MPI code. The output from that run is included below (if it is all
right, I will include the source code also):
r31i2n2% m hello_mpi_c.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int ierr, myid, npes;
    int len;
    char name[MPI_MAX_PROCESSOR_NAME];

    ierr = MPI_Init(&argc, &argv);
#ifdef MACROTEST
#define MACROTEST 10
#endif
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &npes);
    ierr = MPI_Get_processor_name( name, &len );

    printf("Hello from rank %d out of %d; procname = %s\n", myid, npes,
           name);

#ifdef MACROTEST
    printf("Test Macro: %d\n", MACROTEST);
#endif
#ifdef BUG
    {
        int* x = (int*)malloc(10 * sizeof(int));
        x[10] = 0;                       // problem 1: heap block overrun
        printf("Print something %d\n", x[10]);
    }                                    // problem 2: memory leak -- x not freed
#endif

    ierr = MPI_Finalize();
    return 0;
}
r31i2n2% mpicc -DBUG -g -O0 -o hello_mpi_c hello_mpi_c.c
/contrib/valgrind/valgrind-3.8.1/lib/valgrind/libmpiwrap-amd64-linux.so
r31i2n2%
r31i2n2% env MPIWRAP_DEBUG=verbose mpiexec_mpt -np 1
/contrib/valgrind/valgrind-3.8.1/bin/valgrind ./hello_mpi_c
valgrind MPI wrappers 9993: Active for pid 9993
valgrind MPI wrappers 9993: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 9993: enter PMPI_Init
valgrind MPI wrappers 9993: enter PMPI_Init_thread
valgrind MPI wrappers 9993: exit PMPI_Init (err = 0)
valgrind MPI wrappers 9993: enter PMPI_Comm_rank
valgrind MPI wrappers 9993: exit PMPI_Comm_rank (err = 0)
valgrind MPI wrappers 9993: enter PMPI_Comm_size
valgrind MPI wrappers 9993: exit PMPI_Comm_size (err = 0)
valgrind MPI wrappers 9993: enter PMPI_Get_processor_name
Hello from rank 0 out of 1; procname = r31i2n2
Print something 0
valgrind MPI wrappers 9993: enter PMPI_Finalize
valgrind MPI wrappers 9993: exit PMPI_Finalize (err = 0)
r31i2n2%
-----Original Message-----
From: Julian Seward [mailto:js...@ac...]
Sent: Tuesday, January 29, 2013 4:33 AM
To: Raghu Reddy
Cc: Val...@li...
Subject: Re: [Valgrind-users] Is it possible to use valgrind with MPI
applications (with SGI MPT)?
On 01/21/2013 03:49 PM, Raghu Reddy wrote:
> I was wondering if anyone has successfully used valgrind with MPI
> applications on SGI systems with MPT?
I don't know about on SGI w/ MPT (whatever MPT is). But for sure in general
on MPI, it works.
> Using a non-MPI program (the simple example from the valgrind website
> tutorial) works exactly as documented. However, an MPI hello world
> example with the same error does not point out the error, even though
> there are messages from the MPI wrappers.
Does your MPI hello world test work as expected (with -DBUG) if you remove
the MPI specifics and just run it as an ordinary executable?
J
|
|
From: Julian S. <js...@ac...> - 2013-01-30 15:11:57
|
> But the problem is I am unable to get valgrind to point out the problem in
> the MPI code. The output from that run is included below (if it is all
> right, I will include the source code also):
>
> r31i2n2% m hello_mpi_c.c
> #include <stdio.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
> int ierr, myid, npes;
> int len;
> char name[MPI_MAX_PROCESSOR_NAME];
>
> ierr = MPI_Init(&argc, &argv);
> #ifdef MACROTEST
> #define MACROTEST 10
> #endif
> ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
> ierr = MPI_Comm_size(MPI_COMM_WORLD, &npes);
> ierr = MPI_Get_processor_name( name, &len );
>
> printf("Hello from rank %d out of %d; procname = %s\n", myid, npes,
> name);
>
> #ifdef MACROTEST
> printf("Test Macro: %d\n", MACROTEST);
> #endif
> #ifdef BUG
> {
> int* x = (int*)malloc(10 * sizeof(int));
> x[10] = 0; // problem 1: heap block overrun
> printf("Print something %d\n",x[10]);
> } // problem 2: memory leak -- x not freed
> #endif
>
> ierr = MPI_Finalize();
>
> }
Two things:
(1) rerun the MPI version but with the extra argument -v for valgrind,
and post the results here. This will make it possible to see if
interception of malloc etc failed for some reason.
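For (1), I mean something along these lines (reusing your paths; adjust as
needed):
mpiexec_mpt -np 1 /contrib/valgrind/valgrind-3.8.1/bin/valgrind -v ./hello_mpi_c
With -v, memcheck normally prints its banner plus details of which malloc/free
functions it has intercepted, so the absence of that output would itself be a
useful data point.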
(2) send (in private email) the executable corresponding to the above
program to me, so I can have a look at the code for main and see if the
compiler optimised out the test, since the allocation and assignment have no
useful side effects.
J
|
|
From: Raghu R. <rag...@no...> - 2013-02-04 16:27:19
|
Hi Julian,
Additional responses to your questions are included inline.
Included below are outputs from two runs: the first is from a non-MPI
application, and the second is from an MPI application. Both codes do
essentially the same thing, except that the latter has some basic MPI calls
to make it an MPI application.
For the non-MPI application, valgrind prints a banner message.
For some reason I don't see the valgrind banner message for the MPI
application; the only messages are from valgrind's MPI wrappers.
r4i0n0% icc -o mem-bug -debug mem-bug.c
r4i0n0% /contrib/valgrind/valgrind-3.8.1/bin/valgrind ./mem-bug
==9629== Memcheck, a memory error detector
==9629== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==9629== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==9629== Command: ./mem-bug
==9629==
==9629== Invalid write of size 4
==9629== at 0x40051E: f (mem-bug.c:6)
==9629== by 0x40052E: main (mem-bug.c:11)
==9629== Address 0x5a70068 is 0 bytes after a block of size 40 alloc'd
==9629== at 0x4C278FE: malloc (vg_replace_malloc.c:270)
==9629== by 0x400508: f (mem-bug.c:5)
==9629== by 0x40052E: main (mem-bug.c:11)
==9629==
==9629==
==9629== HEAP SUMMARY:
==9629== in use at exit: 40 bytes in 1 blocks
==9629== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==9629==
==9629== LEAK SUMMARY:
==9629== definitely lost: 40 bytes in 1 blocks
==9629== indirectly lost: 0 bytes in 0 blocks
==9629== possibly lost: 0 bytes in 0 blocks
==9629== still reachable: 0 bytes in 0 blocks
==9629== suppressed: 0 bytes in 0 blocks
==9629== Rerun with --leak-check=full to see details of leaked memory
==9629==
==9629== For counts of detected and suppressed errors, rerun with: -v
==9629== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
r4i0n0%
r4i0n0%
r4i0n0%
r4i0n0% mpicc -DBUG -g -O0 -o hello_mpi_c hello_mpi_c.c
/contrib/valgrind/valgrind-3.8.1/lib/valgrind/libmpiwrap-amd64-linux.so
r4i0n0%
r4i0n0% env MPIWRAP_DEBUG=verbose mpiexec_mpt -np 1
/contrib/valgrind/valgrind-3.8.1/bin/valgrind -v ./hello_mpi_c
valgrind MPI wrappers 9681: Active for pid 9681
valgrind MPI wrappers 9681: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 9681: enter PMPI_Init
valgrind MPI wrappers 9681: enter PMPI_Init_thread
valgrind MPI wrappers 9681: exit PMPI_Init (err = 0)
valgrind MPI wrappers 9681: enter PMPI_Comm_rank
valgrind MPI wrappers 9681: exit PMPI_Comm_rank (err = 0)
valgrind MPI wrappers 9681: enter PMPI_Comm_size
valgrind MPI wrappers 9681: exit PMPI_Comm_size (err = 0)
valgrind MPI wrappers 9681: enter PMPI_Get_processor_name
Hello from rank 0 out of 1; procname = r4i0n0
Print something 0
valgrind MPI wrappers 9681: enter PMPI_Finalize
valgrind MPI wrappers 9681: exit PMPI_Finalize (err = 0)
r4i0n0%
Also, responses to your questions are included inline below.
Thanks,
--Raghu
||-----Original Message-----
||From: Julian Seward [mailto:js...@ac...]
||Sent: Wednesday, January 30, 2013 10:11 AM
||To: Raghu Reddy
||Cc: Val...@li...
||Subject: Re: [Valgrind-users] Is it possible to use valgrind with MPI
||applications (with SGI MPT)?
||
||
||> But the problem is I am unable to get valgrind to point out the
||> problem in the MPI code. The output from that run is included below
||> (if it is all right, I will include the source code also):
||>
||> r31i2n2% m hello_mpi_c.c
||> #include <stdio.h>
||> #include <mpi.h>
||>
||> int main(int argc, char **argv)
||> {
||> int ierr, myid, npes;
||> int len;
||> char name[MPI_MAX_PROCESSOR_NAME];
||>
||> ierr = MPI_Init(&argc, &argv);
||> #ifdef MACROTEST
||> #define MACROTEST 10
||> #endif
||> ierr = MPI_Comm_rank(MPI_COMM_WORLD, &myid);
||> ierr = MPI_Comm_size(MPI_COMM_WORLD, &npes);
||> ierr = MPI_Get_processor_name( name, &len );
||>
||> printf("Hello from rank %d out of %d; procname = %s\n", myid,
||> npes, name);
||>
||> #ifdef MACROTEST
||> printf("Test Macro: %d\n", MACROTEST); #endif #ifdef BUG
||> {
||> int* x = (int*)malloc(10 * sizeof(int));
||> x[10] = 0; // problem 1: heap block overrun
||> printf("Print something %d\n",x[10]);
||> } // problem 2: memory leak -- x not freed
||> #endif
||>
||> ierr = MPI_Finalize();
||>
||> }
||
||Two things:
||
||(1) rerun the MPI version but with the extra argument -v for valgrind, and
||post the results here. This will make it possible to see if interception of
||malloc etc failed for some reason.
The output with the -v option for valgrind is included above.
||
||(2) send (in private email) the executable corresponding to the above
||program to me, so I can have a look at the code for main and see if the
||compiler optimised out the test, since the allocation and assignment have no
||useful side effects.
||
In the output included above, the code does print the value of that memory
location, so I don't think the compiler could have optimized it out. Also, it
was compiled with no optimization (-O0) to further reduce that possibility.
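(If it helps, one quick check I can try on my side -- just a sketch, and the
exact objdump invocation may need tweaking -- is to disassemble main and look
for the store to the heap block:
objdump -d hello_mpi_c | sed -n '/<main>:/,/ret/p'
If the store and the following printf load are still there, the bug should be
visible to valgrind.)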
As requested I will e-mail the executable in a separate e-mail privately to
you.
Thank you very much!
|
|
From: Julian S. <js...@ac...> - 2013-02-05 12:04:33
|
On 02/04/2013 05:26 PM, Raghu Reddy wrote:
> r4i0n0% mpicc -DBUG -g -O0 -o hello_mpi_c hello_mpi_c.c
> /contrib/valgrind/valgrind-3.8.1/lib/valgrind/libmpiwrap-amd64-linux.so
> r4i0n0%
> r4i0n0% env MPIWRAP_DEBUG=verbose mpiexec_mpt -np 1
> /contrib/valgrind/valgrind-3.8.1/bin/valgrind -v ./hello_mpi_c
> valgrind MPI wrappers 9681: Active for pid 9681
> valgrind MPI wrappers 9681: Try MPIWRAP_DEBUG=help for possible options
> valgrind MPI wrappers 9681: enter PMPI_Init
> valgrind MPI wrappers 9681: enter PMPI_Init_thread
> valgrind MPI wrappers 9681: exit PMPI_Init (err = 0)
> valgrind MPI wrappers 9681: enter PMPI_Comm_rank
> valgrind MPI wrappers 9681: exit PMPI_Comm_rank (err = 0)
> valgrind MPI wrappers 9681: enter PMPI_Comm_size
> valgrind MPI wrappers 9681: exit PMPI_Comm_size (err = 0)
> valgrind MPI wrappers 9681: enter PMPI_Get_processor_name
> Hello from rank 0 out of 1; procname = r4i0n0
> Print something 0
> valgrind MPI wrappers 9681: enter PMPI_Finalize
> valgrind MPI wrappers 9681: exit PMPI_Finalize (err = 0)
It looks as if Valgrind didn't run at all in this case, that is, hello_mpi_c
is executed natively. This might be something to do with mpiexec -- we've had
strange interactions with such programs (normally called mpirun) before now.
Can you get any logging information out of mpiexec, to get some insight into
what is going on?
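One crude way to get that information (a sketch only -- I don't know
mpiexec_mpt, so treat the details as assumptions) would be to hand mpiexec a
tiny wrapper script instead of valgrind directly, so we can at least see
whether the valgrind binary is ever exec'd:
#!/bin/sh
# vg-wrap.sh (hypothetical): log that we got here, then run the real valgrind
echo "vg-wrap: pid $$ about to exec valgrind for: $*" >> /tmp/vg-wrap.log
exec /contrib/valgrind/valgrind-3.8.1/bin/valgrind "$@"
and then:
mpiexec_mpt -np 1 ./vg-wrap.sh ./hello_mpi_c
If /tmp/vg-wrap.log never appears, mpiexec is not running the command you gave
it; if it does appear but there is still no ==pid== banner, something is going
wrong inside valgrind's startup.
J
|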