From: Ashley P. <as...@pi...> - 2009-12-02 12:42:19
On Tue, 2009-12-01 at 22:49 +0000, Indi Tristanto wrote:
> Ashley, Tom
>
> Thank you for your reply.
>
> I added --mca btl tcp,self as an mpirun argument following Julian's
> suggestion on Tom's posting. Indeed this managed to suppress the Open MPI
> errors to the extent that the valgrind log file shrank from 20Mb to only a
> few kb. However, I found the discussion through a Google search, so I'd
> appreciate it if you could point me to the title of the FAQ entry so that
> I can follow the complete discussion.

The "--mca btl tcp,self" option tells OpenMPI to communicate only via TCP
and loopback. Of note here is that "sm", or shared memory, is missing: it's
the shared-memory FIFOs that confuse valgrind and cause the 20Mb of errors
you'd have seen. These would most likely all be false positives.

> Valgrind never prints any error until the program crashes with the
> messages I wrote in my original posting. Since it happens in the middle
> of an iterative process, the line where the program crashes has been
> passed during the previous iterations without problem. As far as I can
> see, the program is trying to call mpi_reduce when it crashes. Looking at
> valgrind's behaviour, I have the impression that the problem lies in
> Open MPI 1.2.6 rather than in my program. (Valgrind did not report
> anything when I checked the same program in a sequential environment, nor
> under MPICH with GNU compilation.) However, I was expecting valgrind to
> report the problem earlier, i.e. when mpi_allreduce is called by the same
> program line at the very first iteration.

If you look at your stack trace more closely, it claims to be in malloc,
which is itself in libopen-pal. Normally malloc is in libc and valgrind
intercepts it as such; OpenMPI by default replaces the libc malloc with its
own version in libopen-pal, which valgrind won't have intercepted.
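Putting the two points above together, a launch line might look like the
following. This is a sketch only: the executable name "./my_app", the
process count, and the per-process log option are placeholders, not taken
from this thread (check your valgrind version's manual for --log-file
support).

```shell
# Leave "sm" out of the BTL list to avoid the shared-memory FIFO
# false positives; keep "self" so a rank can still message itself.
# --log-file=vg.%p.log gives each MPI process its own valgrind log.
mpirun -np 4 --mca btl tcp,self \
    valgrind --log-file=vg.%p.log ./my_app
```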
What this means is that you aren't getting the full value of memcheck, as
buffer over-runs and under-runs aren't being caught (valgrind doesn't know
what has been malloc'ed and what hasn't). Internally mpi_reduce is calling
malloc, which is then crashing, most likely because its private data
structures have been overwritten by a buffer over-run. If you re-link your
program without libopen-pal and re-run, you will be using the libc malloc
and valgrind will be able to do a lot more checks, hopefully including
catching the buffer over-run that went on to cause the problem you are
seeing.

> Also, have I been naive in expecting this sort of error to be reported
> cleanly?

No, that's not naive.

> My problem is that I have more than one call to mpi_allreduce, and the
> failure seems to happen randomly at any of the calls as well as at any
> iteration count. So my valgrind error log changes from run to run; what
> I quoted in my posting is the "typical" error that I have seen.

I hope the above explains this: there will be many calls to malloc
throughout the code, and once you've overwritten the meta-data it's
essentially pot luck which one goes on to crash.

> By the way, thank you for the suggestion on upgrading Open MPI.

I believe the latest OpenMPI doesn't replace malloc by default; I'm not
entirely sure of this, however. Certainly you can configure it not to.

Ashley.

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk