Hi Lawrence,
(Warning: long email ahead... I figured I'd post all the details to
the mailing list for posterity.)
APBS was designed to enable electrostatics calculations on very large
biological systems where single processor calculations are not
feasible due to memory restrictions.
Here's a summary of a typical situation where APBS's parallel
capabilities are useful...
Assume we have a protein (or chunk of protein) of dimensions (x, y, z)
whose potential we want to calculate. In general, we first choose the
fine (lxf, lyf, lzf) and coarse (lxc, lyc, lzc) grid lengths of the
calculation. We then specify either the number of grid points (nx, ny,
nz) used by the solver or the grid spacings (hx, hy, hz). These two
quantities are related by:
nx = lxf/hx + 1
ny = lyf/hy + 1
nz = lzf/hz + 1
The amount of memory an APBS calculation requires is directly
proportional to (nx*ny*nz).
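As a sanity check, the grid-point and memory relationships above can be sketched in a few lines of Python. The per-point byte count here is purely an illustrative assumption, not an APBS constant, and the grid dimensions are hypothetical:

```python
def grid_points(l_fine, h):
    """Points along one axis from the relation n = l/h + 1."""
    return int(round(l_fine / h)) + 1

# Hypothetical fine grid: 100 A on a side at 0.5 A spacing.
nx = ny = nz = grid_points(100.0, 0.5)           # 201 points per axis
BYTES_PER_POINT = 200                            # illustrative assumption only
mem_mb = nx * ny * nz * BYTES_PER_POINT / 2**20  # memory scales as nx*ny*nz
```

Halving the spacing along each axis multiplies the memory by roughly eight, which is why a single machine runs out of room quickly for large systems.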
APBS focuses the solution from the coarse to the fine mesh (using the
same number of grid points) in a controlled manner by requiring the
ratio of grid lengths between two successive focusing levels to lie
between a specified value eps and 1:
eps <= lx2/lx1 < 1
eps <= ly2/ly1 < 1
eps <= lz2/lz1 < 1
In doing so, it sets the number of focusing levels in a calculation:
m = max_{i = x,y,z} log(lif/lic)/log(eps)
For a given number of grid points, the length of time that an APBS
calculation runs is directly proportional to the number of focusing
levels (m).
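To make the level count concrete, here is a small Python sketch of the formula for m; the grid lengths are hypothetical, chosen only to show how eps controls the number of levels:

```python
import math

def focusing_levels(l_fine, l_coarse, eps):
    """m = ceil(log(l_fine/l_coarse) / log(eps)); both logarithms are
    negative, so m grows as the fine grid shrinks relative to the coarse."""
    return math.ceil(math.log(l_fine / l_coarse) / math.log(eps))

# Hypothetical example: a 10 A fine grid inside a 200 A coarse grid.
m_aggressive   = focusing_levels(10.0, 200.0, 0.01)  # 1 level
m_conservative = focusing_levels(10.0, 200.0, 0.25)  # 3 levels
```

With the aggressive eps the whole reduction happens in one focusing step; the conservative eps spreads it over three steps, roughly tripling the run time for the same number of grid points.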
The only thing that parallel focusing does differently from a typical
sequential calculation is to choose smaller fine grid lengths
lxfp ~ (1 + 2*sigma) lxf/npx
lyfp ~ (1 + 2*sigma) lyf/npy
lzfp ~ (1 + 2*sigma) lzf/npz
based on the size of the processor array (npx, npy, npz) and the
desired overlap between processor grids (sigma).
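Under those definitions, the per-processor fine grid length is straightforward to compute; a minimal sketch with hypothetical numbers:

```python
def parallel_fine_length(l_fine, n_proc, sigma):
    """Per-processor fine grid length, l_p ~ (1 + 2*sigma) * l_fine / n_proc,
    where sigma is the fractional overlap with neighboring processor grids."""
    return (1.0 + 2.0 * sigma) * l_fine / n_proc

# Hypothetical: a 120 A fine grid split across npx = 2 processors
# with 10% overlap on each side.
lxfp = parallel_fine_length(120.0, 2, 0.1)  # ~72 A per processor
```

Each processor then focuses onto its own smaller fine grid, so the fine-to-coarse length ratio (and hence m) changes with the processor count.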
Now, consider a PBE calculation on a large biomolecular system where
we choose the number of grid points (nx, ny, nz) based on the
available memory per processor and the grid lengths (lxf, lyf, lzf)
and (lxc, lyc, lzc) based on the size of the protein. For a large
molecule, we'll find that these platform- and protein-specific settings
give grid spacings (hx, hy, hz) that are too large for accurate PBE
calculations and, therefore, we'll need to use parallel focusing.
Based on the arguments above, you can see that the number of focusing
levels (and therefore the computational time) will scale as
m ~ max_{i = x, y, z} log(lif/(npi*lic))/log(eps)
In the original implementation, eps was so small (~1/100) that the
logarithmic dependence was negligible on the available computational
platforms. This gives us a claim to linear scaling in a manner
entirely analogous to the fast multipole method (the similarity
between the methods is more than superficial, BTW).
This choice seemed to work well for the cases I examined -- mainly
ligand binding and comparison of average potentials away from protein
surfaces. However, I later discovered that the resulting potentials
with small eps did not give reliable results for protein-protein
interactions. Therefore, I chose a very conservative value (0.25) as
the default in recent versions of APBS. This value is overkill, but it
ensures that users will likely never receive "surprising" errors with
APBS parallel focusing -- it is also causing the less-than-linear
scaling you're observing. It can be modified (see VREDRAT in
src/generic/apbs/vhal.h) to values appropriate to your application.
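The effect of the two eps choices on scaling can be seen numerically. In this hypothetical sketch the per-processor fine grid length shrinks as lif/npi (overlap ignored for simplicity), and the level count m is recomputed for each processor count:

```python
import math

def focusing_levels(l_fine, l_coarse, eps):
    """m = ceil(log(l_fine/l_coarse) / log(eps))."""
    return math.ceil(math.log(l_fine / l_coarse) / math.log(eps))

# Hypothetical grids: 60 A global fine grid inside a 200 A coarse grid.
for np_i in (1, 2, 4, 8):
    m_small   = focusing_levels(60.0 / np_i, 200.0, 0.01)  # stays at 1 level
    m_default = focusing_levels(60.0 / np_i, 200.0, 0.25)  # grows: 1, 2, 2, 3
```

With eps ~ 1/100 the level count is flat across this range of processor counts (the linear-scaling regime); with the conservative default of 0.25 it creeps up logarithmically, which matches the less-than-linear timings described above.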
The upshot of this long email is that APBS allows users to look at
large systems that are not possible with a single machine. Situations
where this is warranted are indicated by the psize.py utility provided
with APBS. The algorithm is completely latency-tolerant (we have a
version that requires no communication) and scales linearly under
certain circumstances.
Thanks,
Nathan
Lawrence Hannon <hannone@...> (03-31-2004 11:07:50 -0600):
>I'm an IBM'er working with Celera on code optimization. They've asked me
>to take a look at apbs/MPI. They have not been big MPI users up to now.
>
>I built the system with the following
>
>CC=mpcc_r
>F77=mpxlf_r
>CFLAGS="-O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto -qmaxmem"
>FFLAGS="-qfixed=132 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto
>-qmaxmem"
>LDFLAGS="-bmaxdata:0x80000000 -bmaxstack:0x10000000 -L/usr/local/lib
>-lmass -lessl "
>
>For maloc
>configure --prefix=<install directory> --enable_mpi --enable_blas=no
>gmake install
>
>For apbs-0.2.6
>configure --prefix=<install directory> --with_blas="-L/usr/lib -lblas"
>gmake install
>
>It seems to have hooked into IBM's MPI (poe) because it complains if
>MP_PROCS or other MPI related environment variables are set incorrectly. I
>ran it using the apbsPARALLEL.in input file in examples/actindimer
>(modified to run with 1, 2, 4, & 8 cpus). I'm not seeing any speedup when
>I add cpus. As a matter of fact, it seems to run in the same amount of
>time or longer as I add CPUs. When I profile the code, I see most of the
>time is spent in "ivdwAccExclus". The time spent in this routine is the
>same or more for each thread even when it's run on multiple cpu's. I must
>be doing something wrong.
>
>If I look at which MPI routines are being used, I see MPI_Comm_size,
>MPI_Comm_rank, and MPI_Allreduce all being used only once. When I look
>thru the source, it's very hard to determine exactly how parallelism is
>entering the problem. Can anyone help?
>
>Thanks,
>
>Lawrence Hannon

Nathan A. Baker, Assistant Professor
Washington University in St. Louis School of Medicine
Dept. of Biochemistry and Molecular Biophysics
Center for Computational Biology
700 S. Euclid Ave., Campus Box 8036, St. Louis, MO 63110
Phone: (314) 362-2040, Fax: (314) 362-0234
URL: http://www.biochem.wustl.edu/~baker
PGP key: http://cholla.wustl.edu/~baker/pubkey.asc
