From: Indi T. <iht...@ho...> - 2009-12-01 22:49:59
Ashley, Tom,

Thank you for your replies. Following Julian's suggestion in Tom's posting, I added "--mca btl tcp,self" as an mpirun argument. Indeed, this suppressed the Open MPI errors to the extent that the valgrind log file shrank from 20 MB to only a few kB. However, I found that discussion through a Google search, so I'd appreciate it if you could point me to the title of the FAQ entry so that I can follow the complete discussion.

As for my problem: the program did crash, presumably with a segfault. I say "presumably" because valgrind simply hung the whole parallel computation after printing the last error message. Valgrind never printed any error until the program crashed with the messages I quoted in my original posting. Since the crash happens in the middle of an iterative process, the offending line had already been executed in previous iterations without problem. As far as I can tell, the program was calling mpi_allreduce when it crashed. Judging from valgrind's behaviour, I have the impression that the problem lies in Open MPI 1.2.6 rather than in my program (valgrind reported nothing when I checked the same program in a sequential build, nor under MPICH compiled with the GNU compilers). However, I was expecting valgrind to report the problem earlier, i.e. when mpi_allreduce was called from the same program line in the very first iteration. What I am wondering at the moment is whether the problem is hidden among the OMPI-suppressed errors. Also, have I been naive in expecting this sort of error to be reported cleanly?

My difficulty is that I have more than one call to mpi_allreduce, and the failure seems to happen at random at any of the calls and at any iteration count, so my valgrind error log changes from run to run; what I quoted in my posting is the "typical" error I have seen.

By the way, thank you for the suggestion about upgrading Open MPI.
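For concreteness, my invocation looks roughly like the sketch below. The solver name is a placeholder, and the suppression-file path is an assumption on my part, based on Open MPI shipping an openmpi-valgrind.supp under its install prefix; the essential part is the "--mca btl tcp,self" restriction that Julian suggested:

```shell
# Run each MPI rank under valgrind, restricting Open MPI to the tcp
# and self BTLs so the shared-memory transport (a common source of
# spurious valgrind reports) is never used.
#
# The suppression-file path is an assumption -- Open MPI installs one
# as share/openmpi/openmpi-valgrind.supp under its prefix.
mpirun -np 4 --mca btl tcp,self \
    valgrind --log-file=vg.%p.log \
             --suppressions=/usr/local/share/openmpi/openmpi-valgrind.supp \
             ./solver
```

The --log-file=vg.%p.log option gives each rank its own log file keyed on its pid, which makes it easier to see which process produced which error.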
Regards,
Indi

> Subject: Re: [Valgrind-users] memcheck behaviour in random failure of an open mpi based code.
> From: as...@pi...
> To: tf...@al...
> CC: iht...@ho...
> Date: Mon, 30 Nov 2009 18:50:00 +0000
>
> On Mon, 2009-11-30 at 10:20 -0700, tom fogal wrote:
> > Ashley Pittman <as...@pi...> writes:
> > > On Sun, 2009-11-29 at 19:34 -0700, tom fogal wrote:
> > > > Indi Tristanto <iht...@ho...> writes:
> > > > > I am trying to debug a large iterative solver that has been
> > > > > compiled using intel fortran 10 and open mpi 1.2.6.
> > > >
> > > > This looks (very) familiar to an issue I brought up with the OpenMPI
> > > > folks earlier this year. See ticket 1942:
> > > >
> > > > https://svn.open-mpi.org/trac/ompi/ticket/1942
> > >
> > > The error in that ticket is about uninitialised reads which do happen
> > > and are semi-expected with socket programming.
> >
> > No, it is not. It is about valgrinding OpenMPI programs. It links to
> > a thread which originally started with uninitialized reads, but if you
> > follow the thread you'll note that the discussion became much wider
> > than the original posting.
> >
> > > The error in this email is about a crash (segfault) in the open mpi
> > > library, I doubt the two are related.
> >
> > At no point in Indi's email did he mention that the application
> > segfaulted or crashed.
>
> I was thinking about the "Adress 0x10 is not stack'd, malloc'd or
> (recently) free'd" message. Actually there's a spelling mistake in that
> error message; I assume this comes from transmission somewhere rather
> than from the actual valgrind output.
>
> Given that OpenMPI has its own malloc implementation, it's likely that
> allocations aren't being intercepted, buffer over-runs aren't being
> caught, and quite possibly the error is being caused by an invalid
> write that valgrind isn't catching.
>
> Ashley.
>
> --
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
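P.S. On Ashley's point about Open MPI having its own malloc implementation: as I understand it, the way to let valgrind intercept allocations is to rebuild Open MPI with its internal memory manager disabled. The sketch below is my understanding for the 1.2.x series, not something I have verified, and the install prefix is a placeholder:

```shell
# Rebuild Open MPI without its internal memory manager (ptmalloc2),
# so that valgrind's own malloc interposition sees every allocation
# made by the library and by the application.
./configure --prefix=$HOME/ompi-novg --without-memory-manager
make -j4 && make install
```

With a build like this, valgrind should at least be able to track heap blocks allocated inside the library, which may turn the silent segfault into a proper invalid-write report.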