From: Nathan A. Baker <baker@ch...>  2004-03-31 19:20:29

Hi Lawrence (Warning: long email ahead... figured I'd post all the details to the mailing list for posterity),

APBS was designed to enable electrostatics calculations on very large biological systems where single-processor calculations are not feasible due to memory restrictions. Here's a summary of a typical situation where APBS's parallel capabilities are useful.

Assume we have a protein (or chunk of protein) of dimensions (x, y, z) whose potential we want to calculate. In general, we first choose the fine (lxf, lyf, lzf) and coarse (lxc, lyc, lzc) grid lengths of the calculation. We then specify either the number of grid points (nx, ny, nz) used by the solver or the grid spacings (hx, hy, hz). These two quantities are related by:

  nx = lxf/hx + 1
  ny = lyf/hy + 1
  nz = lzf/hz + 1

The amount of memory an APBS calculation requires is directly proportional to (nx*ny*nz).

APBS focuses the solution from the coarse mesh onto the fine mesh (using the same number of grid points at every level) in a controlled manner by requiring that the grid lengths shrink by no more than a specified ratio eps between two consecutive focusing levels:

  eps <= lx2/lx1 < 1
  eps <= ly2/ly1 < 1
  eps <= lz2/lz1 < 1

In doing so, it sets the number of focusing levels in a calculation:

  m = max_{i = x,y,z} log(lif/lic) / log(eps)

For a given number of grid points, the length of time that an APBS calculation runs is directly proportional to the number of focusing levels (m). The only thing that parallel focusing does differently from a typical sequential calculation is to choose smaller fine grid lengths

  lxfp ~ (1 + 2*sigma) * lxf/npx
  lyfp ~ (1 + 2*sigma) * lyf/npy
  lzfp ~ (1 + 2*sigma) * lzf/npz

based on the size of the processor array (npx, npy, npz) and the desired overlap between processor grids (sigma).
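The relationships above can be sketched in a few lines. This is a minimal illustration with hypothetical function names and example dimensions; it is not taken from the APBS sources:

```python
import math

def focusing_levels(l_fine, l_coarse, eps):
    """Minimum number of focusing levels such that the grid lengths
    shrink by no more than a factor of eps per level (per-axis maximum)."""
    return max(
        math.ceil(math.log(lf / lc) / math.log(eps))
        for lf, lc in zip(l_fine, l_coarse)
    )

def parallel_fine_lengths(l_fine, nproc, sigma):
    """Per-processor fine grid lengths under parallel focusing:
    lifp ~ (1 + 2*sigma) * lif / npi on each axis."""
    return [(1 + 2 * sigma) * lf / npi for lf, npi in zip(l_fine, nproc)]

# Hypothetical 60 A fine grid inside a 600 A coarse grid:
fine, coarse = (60.0, 60.0, 60.0), (600.0, 600.0, 600.0)
print(focusing_levels(fine, coarse, eps=0.25))            # -> 2
print(focusing_levels(fine, coarse, eps=0.01))            # -> 1
print(parallel_fine_lengths(fine, (2, 2, 2), sigma=0.1))  # -> [36.0, 36.0, 36.0]
```

Note how a larger (more conservative) eps forces more focusing levels for the same pair of grid lengths, while parallel focusing simply shrinks each processor's fine grid by the processor count plus the overlap margin.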
Now, consider a PBE calculation on a large biomolecular system where we choose the number of grid points (nx, ny, nz) based on the available memory per processor and the grid lengths (lxf, lyf, lzf) and (lxc, lyc, lzc) based on the size of the protein. For a large molecule, we'll find that these platform- and protein-specific settings give grid spacings (hx, hy, hz) that are too large for accurate PBE calculations and, therefore, we'll need to use parallel focusing. Based on the arguments above, you can see that the number of focusing levels (and therefore the computational time) will scale as

  m ~ max_{i = x,y,z} log( lif / (lic * npi) ) / log(eps)

In the original implementation, eps was so small (~1/100) that the logarithmic dependence was negligible on the available computational platforms. This gives us a claim to linear scaling in a manner entirely analogous to the fast multipole method (the similarity between the methods is more than superficial, BTW). This choice seemed to work well for the cases I examined, mainly ligand binding and comparison of average potentials away from protein surfaces. However, I later discovered that the resulting potentials with small eps did not give reliable results for protein-protein interactions. Therefore, I chose a very conservative value (0.25) as the default in recent versions of APBS. This value is overkill, but it ensures that users will likely never receive "surprising" errors with APBS parallel focusing; this value is also what's causing the less-than-linear scaling you're observing. It can be modified (see VREDRAT in src/generic/apbs/vhal.h) to values appropriate to your application.

The upshot of this long email is that APBS allows users to look at large systems that are not possible with a single machine. Situations where this is warranted are indicated by the psize.py utility provided with APBS.
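To make the eps trade-off concrete, here is a rough back-of-the-envelope sketch (illustrative numbers and a hypothetical helper, not APBS code) of how the per-processor level count grows with the processor count for each choice of eps:

```python
import math

def parallel_levels(lf, lc, npi, eps):
    """Focusing levels on one axis under parallel focusing.
    lf/lc: fine/coarse grid lengths; npi: processors along the axis;
    eps: maximum per-level length reduction ratio."""
    return math.ceil(math.log(lf / (lc * npi)) / math.log(eps))

# Hypothetical 60 A fine grid inside a 600 A coarse grid on one axis:
for npi in (1, 2, 4):
    levels_conservative = parallel_levels(60.0, 600.0, npi, eps=0.25)
    levels_aggressive = parallel_levels(60.0, 600.0, npi, eps=0.01)
    print(npi, levels_conservative, levels_aggressive)
```

With eps = 0.25 the level count (and hence the runtime per processor) creeps up from 2 to 3 as processors are added, eating into the speedup; with eps = 0.01 it stays at 1, which is the near-linear regime described above.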
The algorithm is completely latency-tolerant (we have a version that requires no communication) and scales linearly under certain circumstances.

Thanks,

Nathan

Lawrence Hannon <hannone@...> (03-31-2004 11:07:50 -0600):
>I'm an IBM'er working with Celera on code optimization. They've asked me
>to take a look at apbs/MPI. They have not been big MPI users up to now.
>
>I built the system with the following
>
>CC=mpcc_r
>F77=mpxlf_r
>CFLAGS="-O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto -qmaxmem"
>FFLAGS="-qfixed=132 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto
>-qmaxmem"
>LDFLAGS="-bmaxdata:0x80000000 -bmaxstack:0x10000000 -L/usr/local/lib
>-lmass -lessl"
>
>For maloc:
>./configure --prefix=<install directory> --enable_mpi --enable_blas=no
>gmake install
>
>For apbs-0.2.6:
>./configure --prefix=<install directory> --with_blas="-L/usr/lib -lblas"
>gmake install
>
>It seems to have hooked into IBM's MPI (poe) because it complains if
>MP_PROCS or other MPI-related environment variables are set incorrectly. I
>ran it using the apbs-PARALLEL.in input file in examples/actin-dimer
>(modified to run with 1, 2, 4, & 8 CPUs). I'm not seeing any speedup when
>I add CPUs. As a matter of fact, it seems to run in the same amount of
>time or longer as I add CPUs. When I profile the code, I see most of the
>time is spent in "ivdwAccExclus". The time spent in this routine is the
>same or more for each thread even when it's run on multiple CPUs. I must
>be doing something wrong.
>
>If I look at which MPI routines are being used, I see MPI_Comm_size,
>MPI_Comm_rank, and MPI_Allreduce all being used only once. When I look
>through the source, it's very hard to determine exactly how parallelism is
>entering the problem. Can anyone help?
>
>Thanks,
>
>Lawrence Hannon

--
Nathan A. Baker, Assistant Professor
Washington University in St. Louis School of Medicine
Dept. of Biochemistry and Molecular Biophysics
Center for Computational Biology
700 S. Euclid Ave., Campus Box 8036, St. Louis, MO 63110
Phone: (314) 362-2040, Fax: (314) 362-0234
URL: http://www.biochem.wustl.edu/~baker
PGP key: http://cholla.wustl.edu/~baker/pubkey.asc