I thought I'd post an update to this thread since we've still been working on tracking down the issue here and there.  In the course of debugging, one of our HPC admins built another software stack from the ground up but decided to use MVAPICH2 instead of OpenMPI.  We haven't yet been able to replicate the issue with the same problem on the new stack.  Everything is identical except for the MPI library... so, while we still have a number of configurations to try, it's looking like it could be an OpenMPI issue.  Again, this hang has been very difficult to replicate and only occurs after a very significant number of time steps on a large number of processors.  All attempts to create a version that hangs on our workstations (which, incidentally, also use OpenMPI) have failed.

If we can say definitively that it's an OpenMPI bug, then I'll report back.

Cody

On Thu, Jan 24, 2013 at 8:25 AM, Cody Permann <codypermann@gmail.com> wrote:



On Thu, Jan 24, 2013 at 8:00 AM, Dmitry Karpeev <karpeev@mcs.anl.gov> wrote:


On Wed, Jan 23, 2013 at 9:41 PM, Cody Permann <codypermann@gmail.com> wrote:
On Wed, Jan 23, 2013 at 11:05 AM, Kirk, Benjamin (JSC-EG311) <benjamin.kirk-1@nasa.gov> wrote:

> Are these ghosted vectors?
>
> Can't imagine how it could happen, but if the ghost indices are not
> symmetric where they should be you could have processor m waiting on a
> message from processor n that is not coming...
>

Yes, ghosted vectors.  Well, I guess that's somewhere to start looking.  I
found the location of the branch down inside of PETSc where the paths
diverge (wait_all vs. wait_any), but I admit I have no idea what's happening
at that level.  I haven't been able to get the code to hang on my local
workstation with 8-10 processor jobs, and sadly it runs for a long time
before hanging on the cluster-sized jobs.

Also, I haven't tried a full debug build yet because of the size of the
problem, but I'll put that on the "to do" list too.  If we're lucky,
perhaps we'll hit an assert if we ever get there.  I'll keep you posted.

It would be easier if we had line numbers, but it's virtually certain that your Waitany is waiting on a recv
while Waitall is finalizing the sends ($PETSC_DIR/src/vec/vec/utils/vpscat.c:VecScatterEnd_).
Basically, the Waitany proc has an nrecvs count too high to be satisfied by all of the senders,
which suggests the problem is in VecScatterCreate() and ultimately, most likely, in the arguments to VecCreateGhost().
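
For concreteness, the wait pattern those two traces show looks schematically like this (a sketch in plain MPI, not the actual vpscat.c source):

/* Receivers drain their nrecvs incoming messages one at a time with
 * MPI_Waitany(), unpacking each as it lands; afterwards every rank
 * finalizes its outgoing sends with MPI_Waitall().  If one rank's
 * nrecvs is too high, its Waitany blocks forever on a message nobody
 * will ever send, while the remaining ranks sit in Waitall -- exactly
 * the 32-vs-64 split in the stack dump quoted below. */
#include <mpi.h>

void scatter_end_schematic(int nrecvs, int nsends,
                           MPI_Request *recv_reqs, MPI_Request *send_reqs)
{
  int i, idx;
  for (i = 0; i < nrecvs; ++i) {
    MPI_Waitany(nrecvs, recv_reqs, &idx, MPI_STATUS_IGNORE);
    /* ... unpack the buffer for completed receive 'idx' ... */
  }
  MPI_Waitall(nsends, send_reqs, MPI_STATUSES_IGNORE);
}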

That petsc code hasn't changed substantially in quite a while, with the exception of adding optional one-sided
stuff (and some CUDA-related code), so I doubt this problem would depend on using a particular (relatively recent) version of petsc.

Is this an AMR run?  That's the only way I would imagine the number of sends and receives changing midway through.
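
For reference, ghost indices enter PETSc through VecCreateGhost(); a minimal two-rank sketch of that call site follows (sizes and indices are made up for illustration, and this is not the libMesh code). The hypothesis above is that after AMR the rebuilt ghost lists no longer match the new ownership, so the scatter built here expects traffic that never arrives.

#include <petscvec.h>

/* Minimal ghosted-vector sketch: run with exactly two ranks.  Each rank
 * owns four entries and ghosts one entry owned by its neighbor, so the
 * exchange is symmetric.  An asymmetric ghost list passed here would
 * produce the mismatched send/recv counts described above. */
int main(int argc, char **argv)
{
  Vec         v;
  PetscMPIInt rank;
  PetscInt    nlocal = 4, ghosts[1];

  PetscInitialize(&argc, &argv, NULL, NULL);
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);

  /* Rank 0 owns global indices 0-3, rank 1 owns 4-7; each ghosts one
   * entry owned by the other rank. */
  ghosts[0] = (rank == 0) ? 4 : 3;
  VecCreateGhost(PETSC_COMM_WORLD, nlocal, PETSC_DECIDE, 1, ghosts, &v);

  /* This pair exercises the scatter that later hangs in VecScatterEnd(). */
  VecGhostUpdateBegin(v, INSERT_VALUES, SCATTER_FORWARD);
  VecGhostUpdateEnd(v, INSERT_VALUES, SCATTER_FORWARD);

  VecDestroy(&v);
  PetscFinalize();
  return 0;
}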

 
Yes, this is an AMR run.  We are definitely doing some weird things that we normally don't do in this sim (i.e., we are changing the solution vector manually after the solve to inject features that can't be captured by our equations, AND we are doing multiple adaptivity cycles in between solves).

Thanks, it sounds like Ben's original hunch was correct and that we've developed some sort of asymmetry in our send lists.  I'll try serializing the solution and will continue to shrink the problem down to ease the debugging process.
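
One way to catch such an asymmetry directly rather than waiting for the hang: before the scatter, have every rank publish how many messages it intends to send to each other rank, then compare against what each rank expects to receive.  This is a diagnostic sketch, not existing MOOSE or PETSc code, and the function and parameter names are made up:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* sends_to[r]     = messages this rank will send to rank r
 * expects_from[r] = messages this rank expects to receive from rank r */
void check_scatter_symmetry(int *sends_to, int *expects_from, MPI_Comm comm)
{
  int rank, size, r;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  int *incoming = malloc(size * sizeof(int));
  /* After the all-to-all, incoming[r] is what rank r says it will send us. */
  MPI_Alltoall(sends_to, 1, MPI_INT, incoming, 1, MPI_INT, comm);

  for (r = 0; r < size; ++r)
    if (incoming[r] != expects_from[r])
      fprintf(stderr, "[%d] asymmetry with rank %d: it will send %d, "
                      "we expect %d\n", rank, r, incoming[r], expects_from[r]);
  free(incoming);
}

Any mismatch printed here identifies the offending processor pair before anything blocks.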

Ben,  
The restart idea would be nice; however, we are a little ways off from being able to do that with this particular simulation.  It uses user-defined data structures that would have to be serialized and saved as well in order to restart.  We recently disabled that capability for a refactor that we just went through.  We'll hopefully be bringing that back online soon.  I should be able to run the whole sim in debug or devel mode, though.

Thanks for all the suggestions, I'll let you know when I have more data to look at,
Cody 
 
Dmitry.

 


Cody


>
> -Ben
>
>
>
>
> On Jan 23, 2013, at 12:00 PM, "Cody Permann" <codypermann@gmail.com>
> wrote:
>
> > Alright, I could use more sets of eyeballs to help me find the source of a
> > hanging job.  We have a user running MOOSE on our cluster here and the job
> > hangs after several steps.  It's an end user code so explaining every
> > single piece of the application would be rather long winded.  There are a
> > couple of highlights though that I'll mention here.
> >
> > 1. This application goes through multiple mesh adaptivity cycles between
> > solves.  i.e.  We compute error vectors, mark the mesh, refine and coarsen
> > multiple times without another solve in-between.
> >
> > 2. We also forcefully change select values in the solution vector(s) at
> > the end of the timestep, before the next solve.
> >
> > Derek and I have written a small python script which attaches a debugger
> > to a hung job on a cluster and prints the number of processes in each
> > "unique" state (unique determined by the stack trace).  Here is the output:
> >
> > Unique Stack Traces
> > ************************************
> > Count: 32
> > #0  0x00002b2dea658696 in poll () from /lib64/libc.so.6
> > #1  0x00002b2de97a3ef0 in poll_dispatch ()
> > #2  0x00002b2de97a2c23 in opal_event_base_loop ()
> > #3  0x00002b2de9797901 in opal_progress ()
> > #4  0x00002b2de8f1a0c5 in ompi_request_default_wait_any ()
> > #5  0x00002b2de8f4775d in PMPI_Waitany ()
> > #6  0x0000000001387389 in VecScatterEnd_1 ()
> > #7  0x0000000001382cf4 in VecScatterEnd ()
> > #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> > #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> > #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> > #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> > #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> > #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> > #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> > #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> > #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> > #17 0x00000000009a0efb in Transient::execute() ()
> > #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> > #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> > #20 0x000000000097fd95 in MooseApp::run() ()
> > #21 0x00000000006d93e5 in main ()
> >
> > ************************************
> > Count: 64
> > #0  0x00002adad987f696 in poll () from /lib64/libc.so.6
> > #1  0x00002adad89caef0 in poll_dispatch ()
> > #2  0x00002adad89c9c23 in opal_event_base_loop ()
> > #3  0x00002adad89be901 in opal_progress ()
> > #4  0x00002adad814122d in ompi_request_default_wait_all ()
> > #5  0x00002adad816e5ad in PMPI_Waitall ()
> > #6  0x0000000001387050 in VecScatterEnd_1 ()
> > #7  0x0000000001382cf4 in VecScatterEnd ()
> > #8  0x0000000000d8aaa8 in libMesh::PetscVector<double>::localize(libMesh::NumericVector<double>&) const ()
> > #9  0x0000000000ae831b in libMesh::DofMap::enforce_constraints_exactly(libMesh::System const&, libMesh::NumericVector<double>*, bool) const ()
> > #10 0x0000000000e336ed in libMesh::System::project_vector(libMesh::NumericVector<double> const&, libMesh::NumericVector<double>&) const ()
> > #11 0x0000000000e340dc in libMesh::System::project_vector(libMesh::NumericVector<double>&) const ()
> > #12 0x0000000000dfed42 in libMesh::System::restrict_vectors() ()
> > #13 0x0000000000dca50b in libMesh::EquationSystems::reinit() ()
> > #14 0x00000000008ca72d in FEProblem::meshChanged() ()
> > #15 0x00000000008b12d3 in FEProblem::adaptMesh() ()
> > #16 0x00000000006e24f5 in MeshSolutionModify::endStep() ()
> > #17 0x00000000009a0efb in Transient::execute() ()
> > #18 0x0000000000980f66 in MooseApp::runInputFile() ()
> > #19 0x00000000009868ce in MooseApp::parseCommandLine() ()
> > #20 0x000000000097fd95 in MooseApp::run() ()
> > #21 0x00000000006d93e5 in main ()
> >
> > Take a look at frame 5 in each of these stack traces: this is all the way
> > down inside of PETSc but appears to be the source of the problem.  Does
> > anyone have any ideas of how we might "split" branches in libMesh or PETSc
> > and end up in this state?
> >
> > Thanks for any ideas you may have,
> > Cody
> >