From: Ashley P. <as...@pi...> - 2009-11-27 12:39:43
On Fri, 2009-11-27 at 10:26 +0000, Indi Tristanto wrote:
> I am trying to debug a large iterative solver, compiled with Intel
> Fortran 10 and Open MPI 1.2.6 and run on a SLES10-based PC cluster,
> using Valgrind 3.5.0. To suppress the Open MPI error messages I've
> passed the MCA argument "--mca btl tcp,self"; otherwise the log file
> is simply filled with Open MPI messages, which may amount to 20 MB.
> I am not trying to debug Open MPI here.

Open MPI 1.2.6 is fairly old now; there have been some major
improvements in the 1.3 series.

> The problem appears at random after several iterations, which
> suggests a memory-related problem. I expect Memcheck to report an
> error when the offending line is executed for the first time.

Like threaded applications, parallel applications can suffer from race
conditions that occur only periodically, or only when a particular set
of timing conditions holds.

> Does that error message suggest that the dynamic memory allocation
> within the Open MPI allreduce operation is at fault? If so, could I
> capture the problem earlier by removing the MPI suppression?

I thought you said the program was clean up until that point? One
thing you should know is that Open MPI has at times in the past
replaced the libc malloc with its own version; if it has done this
then Memcheck will be able to do considerably fewer checks, as it
probably won't know about those allocations. I'd recommend downloading
the latest Open MPI release (I believe 1.3.3 is still the latest;
1.3.4 is due real soon now, probably when folks get back after
Thanksgiving) and compiling it without the malloc hooks.
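Something along these lines should do the build; the install prefix
here is only an example, so adjust it for your own setup:

    # Build Open MPI 1.3.x with its ptmalloc2 malloc hooks disabled,
    # so Memcheck sees the application's allocations directly.
    ./configure --prefix=$HOME/openmpi-1.3.3 --without-memory-manager
    make all install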
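You can then run the solver under Memcheck much as you do now
("solver" stands in for your binary, and do check the path of the
suppression file, which I'm quoting from memory; 1.3 ships one that
hides Open MPI's own harmless reports):

    mpirun -np 4 --mca btl tcp,self \
        valgrind --log-file=vg.%p.log \
            --suppressions=$HOME/openmpi-1.3.3/share/openmpi/openmpi-valgrind.supp \
            ./solver

Ashley,

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk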