Below is the email from Jens Thomas describing the problem. It looks like this affects all versions of MPQC.
Hello,
I'm trying to get to grips with MPQC on a cluster running Suse 9.3 on
Opteron processors and using SCORE (http://www.pccluster.org/).
I built the code with ARMCI from version 4.0.6 of the Global Arrays,
using the "--with-default-parallel=armcimpi" option to the configure script.
The validation for check0 ran fine on 4 processors (1 node), but hung
running the mp2 examples when I tried running on 32 processors.
After a bit of playing about running mp2h2o.in, I think I've
tracked the problem down to the following:
The code hangs just after the first SCF when it prints:
Memory used for integral storage: 31826758 Bytes
The problem manifests itself in the call:
mem->set_localsize(size_t(nijmax)*nbasis*nbasis*sizeof(double));
in the file:
mpqc-2.3.1/src/lib/chemistry/qc/mbpt/csgrad.cc
When run on 21 processors, processor 21 does not have any work to do, so
set_localsize is called with an argument of 0.
This call appears to be resolved in the file:
mpqc-2.3.1/src/lib/util/group/memarmci.cc
where, within the set_localsize function, there is a statement:
if (localsize == 0) return;
I think this is because set_localsize is called with an argument of 0 by
ARMCIMemoryGrp::finalize, and in this case the call just clears up the
memory and mutexes.
However, in the case that hangs, only processor 21 returns at that
point, and the rest all proceed to the call:
ARMCI_Create_mutexes(1);
which is a collective operation and therefore hangs as processor 21
never makes this call.
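
The mismatch described above can be illustrated with a small toy model. This is only a sketch and not MPQC or ARMCI code: it makes no ARMCI calls, all function names are hypothetical, and the "guarded" variant (deciding on an explicit finalizing flag instead of the local size) is an assumed workaround, not a confirmed fix. It only shows that a decision based on a per-processor value can split the processors, while a decision based on a value that is the same everywhere cannot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: no ARMCI or MPI calls are made here.
// A collective operation completes only if every rank enters it.

// Mirrors the early return quoted above: a rank whose localsize is 0
// returns before reaching the collective mutex creation.
bool enters_collective_buggy(std::size_t localsize) {
    return localsize != 0;
}

// Hypothetical alternative: skip the collective only during teardown,
// signalled by an explicit flag that has the same value on every rank.
bool enters_collective_guarded(std::size_t localsize, bool finalizing) {
    (void)localsize;  // the local size no longer influences the decision
    return !finalizing;
}

// True when all ranks made the same choice (all enter, or none do).
// Any disagreement corresponds to the hang described in the report.
bool ranks_agree(const std::vector<bool>& enters) {
    assert(!enters.empty());
    for (bool e : enters) {
        if (e != enters.front()) return false;
    }
    return true;
}

// Hypothetical helper: per-rank sizes where exactly one rank has no work.
std::vector<std::size_t> example_sizes(std::size_t nranks, std::size_t idle_rank) {
    std::vector<std::size_t> sizes(nranks, 1024);
    sizes.at(idle_rank) = 0;
    return sizes;
}

// Apply the buggy rule on every rank and report whether they agree.
bool buggy_agrees(const std::vector<std::size_t>& sizes) {
    std::vector<bool> enters;
    for (std::size_t s : sizes) enters.push_back(enters_collective_buggy(s));
    return ranks_agree(enters);
}

// Apply the guarded rule on every rank and report whether they agree.
bool guarded_agrees(const std::vector<std::size_t>& sizes, bool finalizing) {
    std::vector<bool> enters;
    for (std::size_t s : sizes) {
        enters.push_back(enters_collective_guarded(s, finalizing));
    }
    return ranks_agree(enters);
}
```

In this model, buggy_agrees(example_sizes(32, 20)) is false because one rank skips the collective while the others wait in it, whereas guarded_agrees gives true for the same sizes whether or not the finalizing flag is set.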
As I've only just started using MPQC and have only just got through the
first few pages of my first C++ book ;-) , I wanted to check
whether this is indeed a problem with the code or a problem with my
use/understanding of it.
I'll happily supply any more information if that would be of use.
Best wishes,
Jens