From: Roy S. <roy...@ic...> - 2010-04-26 21:03:11
On Sat, 24 Apr 2010, Boyce Griffith wrote:

> It appears that one problem is that the innermost parallel_reduce is
> being called within a parallel_for.  Regardless of the number of
> threads, calls to parallel_XXX call BoolAcquire(in_threads), and so the
> innermost call to parallel_reduce results in an assertion failure.

That's definitely the root of the problem - BoolAcquire is not designed
to be used nested.

> I'm not sure that this is the best solution, but everything seems to
> work if I comment out the assertion failures in the constructor and
> destructor for BoolAcquire.

That would have been my solution, too, and it should be just fine in the
default, n_threads=1 case, but I suspect there may be a bug in the
n_threads>1 case which your workaround is just hiding.  Let's see:

>>> #8 0x00dbd674 in BoolAcquire (this=0xbfff85cc, b=@0x1b4e7da) at
>>> threads.h:64
>>> #9 0x01470936 in parallel_reduce<ConstNodeRange, <unnamed>::FindBBox>
>>> (range=@0xbfff8c40, body=@0xbfff8bf4) at threads.h:292
>>> #10 0x01466764 in MeshTools::bounding_box (mesh=@0xbfffd77c) at
>>> src/mesh/mesh_tools.C:290

Here's where the inner threaded loop is - bounding_box is multithreaded,
naturally.
>>> #11 0x0168afec in PointLocatorTree::init (this=0x35656c0,
>>> build_type=Trees::NODES) at src/utils/point_locator_tree.C:117
>>> #12 0x0168a96b in PointLocatorTree (this=0x35656c0, mesh=@0xbfffd77c,
>>> master=0x0) at src/utils/point_locator_tree.C:42
>>> #13 0x01686f03 in PointLocatorBase::build (t=MeshEnums::TREE,
>>> mesh=@0xbfffd77c, master=0x0) at src/utils/point_locator_base.C:64
>>> #14 0x01398e11 in MeshBase::point_locator (this=0xbfffd77c) at
>>> src/mesh/mesh_base.C:293
>>> #15 0x00dbd210 in PeriodicBoundaries::neighbor (this=0x354f764,
>>> boundary_id=3, mesh=@0xbfffd77c, e=0x3532340, side=3) at
>>> src/base/dof_map_constraints.C:1342
>>> #16 0x00eb2ff3 in FEBase::compute_periodic_constraints
>>> (constraints=@0x354f740, dof_map=@0x354f610, boundaries=@0x354f764,
>>> mesh=@0xbfffd77c, variable_number=0, elem=0x3532340) at
>>> src/fe/fe_base.C:1983
>>> #17 0x00fa3a86 in FEInterface::compute_periodic_constraints
>>> (constraints=@0x354f740, dof_map=@0x354f610, boundaries=@0x354f764,
>>> mesh=@0xbfffd77c, variable_number=0, elem=0x3532340) at
>>> src/fe/fe_interface.C:1995
>>> #18 0x00dab668 in operator() (this=0xbfffc6b4, range=@0xbfffc600) at
>>> src/base/dof_map_constraints.C:89
>>> #19 0x00dbd28f in parallel_for<ConstElemRange,
>>> <unnamed>::ComputeConstraints> (range=@0xbfffc600, body=@0xbfffc6b4) at
>>> threads.h:267
>>> #20 0x00dabdf9 in DofMap::create_dof_constraints (this=0x354f610,
>>> mesh=@0xbfffd77c) at src/base/dof_map_constraints.C:168

And here's the outer threaded loop - create_dof_constraints is also
naturally multithreaded.

Hmm.. and there's one possible bug (multiple PointLocator creation)
averted here, thanks to Derek, but there are still two very dangerous
situations:

First: we only use one global spinlock right now.  Derek is using said
spinlock correctly, to make sure that MeshBase::point_locator isn't
creating and leaking multiple PointLocatorTree objects.
But the rest of the code feels free to assume that it can also use the
same spinlock, because we assume we're never in nested Threads:: loops...
That assumption is clearly incorrect.  And in this case it's rightfully
incorrect - we want all but one thread to give up on constraint
calculations until we've built our tree, but we'd like every thread to
join in where possible on constructing the tree.  So we need some way to
handle multiple nested Threads:: loops, any of which may need to do
locking.

Second: we had decided that we should never call MPI-using Parallel::
functions from within TBB-using Threads:: functions.  But here we are.
bounding_box() has to use MPI to be accurate on parallel meshes, and
create_dof_constraints needs to be threaded to be efficient.  When
bounding_box() needs MPI, every other thread should be either finished
(the ones internal to bounding_box()) or locked (the ones waiting to
continue compute_*_constraints), so the Parallel:: call should be safe.
But asserting it's safe is no longer as simple as asserting that we're
not in the middle of a Threads:: call.

There are ways to fix all this (an IntIncrement instead of BoolAcquire,
a per-loop lock created by each Threads::parallel_* call, and a
Parallel-specific lock created for MPI calls) but I'm not sure these are
the best fixes; I need to give it some more thought.  Anyone with other
ideas, feel free to jump in.

We can take our time on this one, since it looks like nothing is broken
other than an overzealous assertion *right now*; it's just that we've
got a design which makes it too easy to accidentally add a deadlock
without realizing it.

---
Roy