|
From: Brian W. <br...@ls...> - 2008-11-07 17:13:49
|
I've been a long time user of valgrind, but am having serious problems with
recent versions. I don't know if it is the switch from ia32 to x86_64, the
switch from lam 7 to openmpi, or the switch from valgrind 2 to valgrind 3,
but here is my problem:
I have a small sample program that has definite, obvious errors in it.
When I build and run it on my ia32 system with LAM 7, valgrind 2.2
correctly reports the errors, whether it is compiled with or without MPI.
When I build the program WITHOUT MPI at all, on my x86_64 system with Intel
Fortran and GCC, valgrind 3.4 also correctly reports errors.
However, if I build the program with openmpi (or hp-mpi) on my x86_64
system, valgrind 3.4 reports no errors at all. This is a serious problem
for me, as in the past few weeks I've run into a few problems that crash
with openmpi/x86_64, but I can't debug them with valgrind. When I move the
code to the old IA32 system and use valgrind there, and find and fix the
errors, the resulting code runs fine on the openmpi/x86_64 system. This
says to me that the errors detected on the IA32 system are in fact causing
problems on the x86_64 system (usually resulting in an error in free() or
malloc() because the memory structures are corrupt). But valgrind isn't
seeing them at all....
I'm also getting TONS of "uninitialized value" errors with HP-MPI that I
never got before (some of which I have carefully tracked down; they are
bogus, as the values are clearly initialized), but that is another issue....
Any suggestions or info would be greatly appreciated.
(note: for compiling/testing on ia32 machine, change integer*8 to integer*4
and "long long" to "long", since pointers are 4 bytes long)
Here are my sample programs/makefile:
******************* makefile ****************************
all: tst_mpi tst_nompi

tst_nompi: tst_nompi.o mtst_nompi.o
	ifort -g -o tst_nompi tst_nompi.o mtst_nompi.o

tst_nompi.o: tst.F
	ifort -g -c tst.F -o tst_nompi.o

mtst_nompi.o: mtst.c
	cc -g -c mtst.c -o mtst_nompi.o

tst_mpi: tst_mpi.o mtst_mpi.o
	mpif77 -g -DUSEMPI -o tst_mpi tst_mpi.o mtst_mpi.o

tst_mpi.o: tst.F
	mpif77 -g -DUSEMPI -c tst.F -o tst_mpi.o

mtst_mpi.o: mtst.c
	mpicc -g -c mtst.c -o mtst_mpi.o
********************* tst.F ***************************************
      program test
      common /mem/ mp
      integer ia(1)
      pointer (mp,ia)
      integer*8 lmalloc
      external lmalloc
c
#ifdef USEMPI
      include 'mpif.h'
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,iam,ierr)
      call mpi_comm_size(mpi_comm_world,numproc,ierr)
#endif
c
      nwords = 100000000
      mp = lmalloc(nwords)
      call subtest(ia,nwords)
#ifdef USEMPI
      call mpi_finalize(ierr)
#endif
      end
c
      subroutine subtest(iw,nwords)
      integer iw(*)
c
      iw(10) = 10
c     write to word BEFORE BEGINNING of allocated memory
      iw(0) = 10
      iw(nwords) = 10
c     write to word AFTER END of allocated memory
      iw(nwords+1) = 10
c
      return
      end
********************** mtst.c *******************************
#include <stdio.h>
#include <stdlib.h>

long long lmalloc_(int *nwords)
{
  /* sizeof yields size_t, so cast before printing with %d */
  printf("Sanity check: 8 = %d\n", (int) sizeof(long long));
  return (long long) (void *) calloc(*nwords, sizeof(int));
}
|
|
From: tom f. <tf...@al...> - 2008-11-07 17:31:35
|
Brian Wainscott <br...@ls...> writes:
> I've been a long time user of valgrind, but am having serious problems with
> recent versions. I don't know if it is the switch from ia32 to x86_64, the
> switch from lam 7 to openmpi, or the switch from valgrind 2 to valgrind 3,
[snip]
> When I build the program WITHOUT MPI at all, on my x86_64 system with Intel
> Fortran and GCC, valgrind 3.4 also correctly reports errors.
valgrind 3.4 meaning the trunk?
> However, if I build the program with openmpi (or hp-mpi) on my x86_64
> system, valgrind 3.4 reports no errors at all.
Can you use Open MPI v 1.3? See:
http://www.open-mpi.de/faq/?category=debugging#memchecker_what
Note that 1.3 isn't released, it's only in the repository at this point.
> Any suggestions or info would be greatly appreciated.
I also use OpenMPI quite frequently, and I haven't quite taken the
dive to check out 1.3 and memchecker... I was given some other ideas
that I still haven't found time to try, first. So, while I'm hoping
this helps you, I'm also trying to use you as a guinea pig to evaluate
that setup <g>
Best,
-tom
|
|
From: Julian S. <js...@ac...> - 2008-11-07 22:37:13
|
> I'm also getting TONS of "uninitialized value" errors with HP-MPI that I
> never got before (and some of which I have carefully tracked down, and
> they are bogus, the values are clearly initialized), but that is another
> issue....

Debugging MPI apps with Valgrind is a bit tricky, but it's certainly
doable. What you describe sounds like it could have three possible causes.
You'll have to investigate.

1. MPI implementations like to map the network card(s) into user space and
push data through them, bypassing the kernel. In this situation Memcheck
hasn't got a clue what's going on, especially for data arriving at a node,
and you get flooded with false errors.

2. For similar reasons, if two processes on the same node communicate via a
shared memory region, the results will be bad.

3. It may be that OpenMPI is providing its own implementations of malloc,
free, new, delete, etc. that Memcheck doesn't know about, which will also
cause chaos.

Re (1) and (2), it helps if you can get OpenMPI to make all processes (even
those on the same node) communicate via standard TCP/IP networking, so that
Memcheck can see data going into/out of each process correctly. Some time
ago I was told by an OpenMPI developer that this can be done by passing
--mca btl tcp,self to mpirun.

Re (3), that's more difficult to ascertain.

I suggest also that asking the OpenMPI developers is worthwhile. I've found
them in the past to be knowledgeable and helpful, and I believe they are
long-time users of Valgrind/Memcheck.

I believe you should be able to get to essentially zero false errors with a
suitable OpenMPI configuration. I managed that in my testing with OpenMPI a
couple of years back, although I should say that was very limited testing.

Since you are upgrading from Valgrind 2.2: once you achieve a clean run,
you might want to consider using Memcheck's MPI-checking wrapper library
for extra validation at the PMPI_* function interface level, if you haven't
already discovered it. See
http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap
for details. I wouldn't recommend this before you get a clean run, though;
the results will be confusing.

J
|
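[Editorial sketch, not part of the original mail: the two suggestions above might be combined into an invocation like the following. The option spelling is as Julian gives it; the number of ranks, binary name, and the libmpiwrap path are illustrative and installation-dependent.]

```shell
# Force all communication (even intra-node) through TCP so that Memcheck
# can see data entering and leaving each process.
mpirun --mca btl tcp,self -np 4 valgrind ./tst_mpi

# Once runs are clean, preload Memcheck's MPI-checking wrapper library
# (path varies by installation) for extra checking at the PMPI_* level.
LD_PRELOAD=/usr/lib/valgrind/libmpiwrap-amd64-linux.so \
    mpirun --mca btl tcp,self -np 4 valgrind ./tst_mpi
```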
|
From: Brian W. <br...@ls...> - 2008-11-07 23:05:20
|
Julian,

Thanks for the info, and I'll check into OpenMPI 1.3, which I'm told has
memcheck-related developments in it.

The problems I'm having have nothing to do with MPI directly -- my test
program has no MPI calls at all in it other than initialization and
finalization, and I've tried writing to memory up to 512 bytes outside of
the allocated memory block (both before and after), and get no error from
valgrind. I suspect this suggestion is correct:

> 3. It may be that OpenMPI is providing its own implementations of
> malloc, free, new, delete, etc. that Memcheck doesn't know about,
> which will also cause chaos.
>
> I suggest also that asking the OpenMPI developers is worthwhile. I've
> found them in the past to be knowledgeable and helpful, and I believe
> they are long-time users of Valgrind/Memcheck.
>
> I believe you should be able to get to essentially zero false errors
> with a suitable OpenMPI configuration. I managed that in my testing
> with OpenMPI a couple of years back, although I should say that was
> very limited testing.

I will contact them and see if they have any information. Thanks.

--
Brian
|
|
From: Brian W. <br...@ls...> - 2008-11-08 00:02:47
|
Julian,

Yes, that was it. OpenMPI 1.3 does NOT, by default, use its own memory
allocation routines -- it is an option. In OpenMPI 1.2 it is the default.

I downloaded and installed OpenMPI 1.3, and with it, valgrind 3.4 is
finding the memory write errors as I'd expected. So I can once again make
good use of this wonderful tool.

Thanks.

--
Brian

Julian Seward wrote:
> 3. It may be that OpenMPI is providing its own implementations of
> malloc, free, new, delete, etc. that Memcheck doesn't know about,
> which will also cause chaos.
>
> J
|
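[Editorial sketch, not part of the original mail: for anyone stuck on the 1.2 series, the internal allocator can reportedly also be disabled when OpenMPI itself is built. The flag name is believed correct for that era but should be verified against your version's ./configure --help; the prefix path is illustrative.]

```shell
# Build OpenMPI without its internal memory manager, so applications use
# the system malloc/free, which Memcheck knows how to intercept.
./configure --prefix=$HOME/openmpi-nomem --without-memory-manager
make all install
```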