On Mon, 23 May 2011, Derek Gaston wrote:
> Here's some more info on this subject:
> 1. The old System::update() is really segfaulting. It's pretty reproducible with ~60 million DoFs.
I really wish you'd stopped after "reproducible"...
> 2. Using the old System::update() with a solution->close() at the
> beginning is _not_ sufficient! It still segfaults!
This is astonishing.
This is on PETSc 2.3.3 still? Any chance you could give it a shot
with PETSc 3.1, and/or a debug-compiled PETSc?
> 3. Using the new System::update() works.
> I'm still investigating and will let you know more when I can. Kinda hard to debug on ~5,000 procs....
Sadly, it's even harder to debug by emailing the guy with ~5,000
procs. ;-) The standard Lonestar queues max out at 4104 cores, and
I'm very thankful that I've never had to debug anything that didn't
fit downstairs on 112.
I've been thinking of adding a LibMeshInit handler for SIGSEGV, which
would do a libmesh_write_traceout() and then hand off to any
previously registered handler. Would this be helpful, or do you
already have a stack trace from the segfault?
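For the record, the handler itself would be simple enough; here's a
minimal sketch of the chaining idea (names made up:
write_traceout_stub() stands in for libmesh_write_traceout(), and
install_segv_handler() for whatever LibMeshInit would actually call):

  #include <signal.h>
  #include <string.h>
  #include <unistd.h>

  static struct sigaction previous_segv_action; // handler we displaced

  // Stand-in for libmesh_write_traceout(); restricted to the
  // async-signal-safe write() since we're inside a signal handler.
  static void write_traceout_stub()
  {
    const char msg[] = "*** caught SIGSEGV, writing traceout ***\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)n;
  }

  extern "C" void segv_chain_handler(int sig)
  {
    write_traceout_stub();

    // Reinstall whatever handler we displaced, then re-raise so it
    // (or the default action) sees the signal next.
    sigaction(SIGSEGV, &previous_segv_action, nullptr);
    raise(sig);
  }

  void install_segv_handler() // hypothetical LibMeshInit hook
  {
    struct sigaction action;
    memset(&action, 0, sizeof(action));
    action.sa_handler = segv_chain_handler;
    sigemptyset(&action.sa_mask);
    sigaction(SIGSEGV, &action, &previous_segv_action);
  }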
> How about the issues you brought up below? Any clarity on those
> yet? In particular the Trilinos problem shouldn't be happening.
> Trilinos support in libMesh doesn't support GHOSTED vectors (as far
> as I know anyway).... so you really shouldn't be able to compile
> with Trilinos and Ghosted both on...
If you couldn't compile with Trilinos and Ghosted both on, most of our
regression tests would fail. libMesh doesn't just support
PETSc or Trilinos, it supports PETSc-and-Trilinos, and when both are
enabled the factory-built linear algebra defaults to the former.
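Roughly, the factory dispatches on a solver-package argument whose
default is PETSc whenever PETSc is compiled in. A toy version, with
made-up names rather than the real libMesh signatures:

  #include <memory>
  #include <stdexcept>

  enum class SolverPackage { PETSC, TRILINOS };

  struct NumericVectorBase { virtual ~NumericVectorBase() = default; };
  struct PetscVectorImpl  : NumericVectorBase {};
  struct EpetraVectorImpl : NumericVectorBase {};

  // When both packages are built in, PETSc wins by default.
  std::unique_ptr<NumericVectorBase>
  build_vector(SolverPackage pkg = SolverPackage::PETSC)
  {
    switch (pkg)
      {
      case SolverPackage::PETSC:
        return std::make_unique<PetscVectorImpl>();
      case SolverPackage::TRILINOS:
        return std::make_unique<EpetraVectorImpl>();
      }
    throw std::logic_error("unknown solver package");
  }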
The way our Trilinos interface is supposed to work is by implementing
GHOSTED vectors as SERIAL - i.e. the old inefficient way we used to do
all current_local_solution type vectors.
The trouble is that while most operations you'd want to perform on a
GHOSTED vector work fine (just less efficiently) on a SERIAL vector,
operator=(PARALLEL vector) is not yet one of them.
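A toy model of the situation (deliberately not libMesh code, just the
shape of the fallback):

  #include <cassert>
  #include <cstddef>
  #include <vector>

  enum class ParallelType { SERIAL, PARALLEL, GHOSTED };

  struct ToyVector
  {
    ParallelType type;
    std::vector<double> entries; // full local copy when SERIAL

    void init(std::size_t n, ParallelType t)
    {
      // The fallback: a request for GHOSTED storage silently becomes
      // SERIAL, i.e. every processor holds all n entries.
      type = (t == ParallelType::GHOSTED) ? ParallelType::SERIAL : t;
      entries.assign(n, 0.0);
    }

    ToyVector & operator= (const ToyVector & other)
    {
      // Copying between full local copies is trivial; copying *from*
      // a PARALLEL vector would need a gather of off-processor
      // entries, and that's exactly the piece that's missing.
      assert(other.type != ParallelType::PARALLEL &&
             "operator=(PARALLEL) not implemented in the fallback");
      type    = other.type;
      entries = other.entries;
      return *this;
    }
  };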
Anyway, that's why I haven't worried about the Trilinos problem: I
think I understand the missing feature that's causing it, that would
be an easy enough feature to add if we needed it, and anyway it'll be
fixed automatically when we figure out how to fix (and can thus again
safely use) localize().
The PETSc-noMPI problem is much more troubling, but the bad news is
that I've been swamped with other stuff and haven't looked into it yet.
The good news is that "other stuff" includes ParallelMesh, which is
now starting to pass tests with adaptive coarsening. The catch is
that redistribute() still needs work, so you have to partition in
serial (or read from a partitioned file, I guess?) and you're stuck
without load-rebalancing. There are probably bugs I haven't run into
yet (in fact, I think there are bugs with redistribute() that go beyond
the DofObject communication work we previously discussed), and if you
guys have any time to play with it, I'd appreciate it. I intend to
have some large adaptive ParallelMesh results by mid-October, but for
now we're just working on it to enable some finer-grid uniform runs.