On Tue, 27 Apr 2010, Derek Gaston wrote:
> Right... this whole idea of ONE global spinlock isn't right. With TBB each
> place where a spinlock is needed should have its own spinlock defined just
> in the local compilation unit (often they do this at the top of the .C).
I hate global variables sitting at the top of a .C, and for larger .C
files you could still be "overloading" a lock. To be really safe we
want the lock to be local to the data whose concurrent access it
guards - the per-GetPot lock should be a good example of that. There
are probably instances where that idiom won't work, but I'll bet that
so far we can naturally put each lock target into one of our classes.
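To illustrate the idiom I mean (a hypothetical sketch, not actual libMesh code; std::mutex stands in here for tbb::spin_mutex, and the class is an invented stand-in for GetPot):

```cpp
#include <map>
#include <mutex>
#include <string>

// Sketch: the lock lives next to the data it guards, as a member of
// the owning class, instead of at file scope at the top of some .C.
class GetPotLike
{
public:
  void set(const std::string &key, int value)
  {
    std::lock_guard<std::mutex> guard(_lock); // scoped lock/unlock
    _values[key] = value;
  }

  int get(const std::string &key)
  {
    std::lock_guard<std::mutex> guard(_lock);
    return _values[key];
  }

private:
  std::map<std::string, int> _values;
  std::mutex _lock; // per-object, not per-compilation-unit
};
```

Two GetPotLike objects then never contend on each other's lock, which is exactly what a single file-scope lock can't give you.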
> We could just use a compilation unit local spinlock in the Parallel namespace
As an aside: in hindsight, the Parallel:: namespace should have been
the Parallel:: class, which should have been the encapsulation of
Parallel::Communicator. Every function there needs a communicator
argument anyway, and calling libMesh::Comm_World.sum(a) would have
been a more natural expression of that than tacking on the argument
with a default value of Comm_World.
Probably too late to change now without too much inconvenience.
> where the MPI functions are and lock / unlock it as we go through the
> parallel calls. The only issue is speed... I don't know how much overhead
> that will add when we're calling those functions and we don't need to create
> / destroy the spinlock.
That's not the only issue, unfortunately. That would be a decent fix
for letting people run with non-thread-safe MPI implementations, and
we could let configure turn off the Parallel::thread_lock if a
thread-safe implementation is detected.
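Something like the following, roughly (a sketch only; the LIBMESH_HAVE_THREADSAFE_MPI macro name is invented here for whatever configure would actually define, and the MPI call is elided):

```cpp
#include <mutex>

// Sketch: one lock in the Parallel namespace, taken around every MPI
// call via an RAII guard, compiled away when configure detects a
// thread-safe MPI implementation.
namespace Parallel
{
  std::mutex thread_lock;

  struct LockGuard
  {
    LockGuard()
    {
#ifndef LIBMESH_HAVE_THREADSAFE_MPI
      thread_lock.lock();
#endif
    }
    ~LockGuard()
    {
#ifndef LIBMESH_HAVE_THREADSAFE_MPI
      thread_lock.unlock();
#endif
    }
  };

  int max(int local) // stand-in for a real MPI_Allreduce wrapper
  {
    LockGuard guard;
    // ... the guarded MPI_Allreduce would go here ...
    return local;
  }
}
```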
The big issue is: we currently depend on each rank hitting each
Parallel:: call in a sane order. But if e.g. thread A on each rank
is doing MeshTools::n_p_levels() and calls Parallel::max(nl) there,
and thread B on each rank is doing MeshTools::n_active_levels() and
calls Parallel::max(nl) there, then the results of both calls will
basically be garbage. I can think of other cases that will result in
deadlocks instead. Even asynchronous Parallel:: functions aren't
safe, because it's possible for the same caller function (and
thus the same tag) to be called by multiple threads in different
orders on different ranks.
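Here's a toy simulation of the mismatch (no real MPI or threads; each "rank" is just the list of reductions its threads happened to issue, and collectives match purely by arrival position, the way untagged collectives do):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// (caller name, local value) - the name is only for bookkeeping;
// a real collective has no idea which function issued it.
typedef std::pair<std::string, int> Call;

// Reduce position i on rank 0 against position i on rank 1. If thread
// scheduling ordered the calls differently on the two ranks, values
// from *different* logical calls get max'd together: garbage.
std::vector<int> run_collectives(const std::vector<Call> &rank0,
                                 const std::vector<Call> &rank1)
{
  std::vector<int> results;
  for (std::size_t i = 0; i != rank0.size(); ++i)
    results.push_back(std::max(rank0[i].second, rank1[i].second));
  return results;
}
```

Feed it rank 0 arriving as (n_p_levels, n_active_levels) and rank 1 arriving as (n_active_levels, n_p_levels), and both "answers" come out wrong, deterministically.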
I'm glad we don't actually have examples of such bugs yet, but they'd
be too easy to write, and I haven't thought of a good way to assert()
such bugs out of existence yet. "we want to break up elements for
threading, but when we hit an element in set A we'll need to call some
parallel code to instantiate a resource, and when we hit an element in
set B we'll need to call different parallel code to instantiate a
different resource" is a devastating counterexample to every attempt
so far. Even if we create a unique MPI communicator for each thread,
how do we know *which* thread gets which communicator? In the
PointLocator example, it's pretty much random which thread hits the
"create a PointLocator now" lock first.
I'm actually starting to lean back towards enforcing "no Parallel::
calls from within Threads:: calls". In the periodic constraints
PointLocator example again, on a ParallelMesh it's possible for an
interior partition to never hit the PointLocator creation call - at
that point in opt mode the whole run would stall waiting for that
interior partition's rank to join in on bounding_box calculations, or
in debug mode we'd die in a parallel_only() assertion failure. Maybe
the bugfix here is for create_periodic_constraints to check whether it
will need a PointLocator, and if necessary create one *before* it ever
gets into its own Threads:: loop. And maybe something like that is
the most practical bugfix for *every* Parallel-within-Threads
problem.
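The hoisting pattern I have in mind looks roughly like this (a sketch under invented names - PointLocatorStandIn and constrain_elements are placeholders for the real create_periodic_constraints machinery, and the collective construction is elided):

```cpp
#include <memory>
#include <vector>

// Stand-in for a resource whose constructor makes parallel_only()
// collective calls, so every rank must participate in building it.
struct PointLocatorStandIn
{
  PointLocatorStandIn() { /* imagine bounding_box collectives here */ }
  bool locate(int /*elem*/) const { return true; }
};

// Decide up front whether the threaded loop will need the resource,
// and if so build it *before* any Threads:: work starts, while all
// ranks are still on the same serial code path. Returns how many
// elements were located, just so the sketch has observable output.
int constrain_elements(const std::vector<int> &elems,
                       bool needs_locator)
{
  std::unique_ptr<PointLocatorStandIn> locator;
  if (needs_locator)                        // hoisted out of the
    locator.reset(new PointLocatorStandIn); // threaded region

  // A Threads::parallel_for would go here; worker threads can now use
  // *locator freely without ever triggering a Parallel:: call.
  int located = 0;
  for (std::size_t i = 0; i != elems.size(); ++i)
    if (locator && locator->locate(elems[i]))
      ++located;
  return located;
}
```

The point is that the "do we need it?" test runs on every rank, including interior partitions that would otherwise never hit the lazy creation path.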