
## Re: [Apbs-users] lack of speedup for apbs on AIX 5L using MPI

Re: [Apbs-users] lack of speedup for apbs on AIX 5L using MPI
From: Nathan A. Baker - 2004-03-31 19:20:29

```
Hi Lawrence --

(Warning -- long e-mail ahead... figured I'd post all the details to the
mailing list for posterity.)

APBS was designed to enable electrostatics calculations on very large
biological systems where single-processor calculations are not feasible
due to memory restrictions. Here's a summary of a typical situation where
APBS's parallel capabilities are useful...

Assume we have a protein (or chunk of protein) of dimensions (x, y, z)
whose potential we want to calculate. In general, we first choose the fine
(lxf, lyf, lzf) and coarse (lxc, lyc, lzc) grid lengths of the calculation.
We then specify either the number of grid points (nx, ny, nz) used by the
solver or the grid spacings (hx, hy, hz). These two quantities are related
by:

    nx = lxf/hx + 1
    ny = lyf/hy + 1
    nz = lzf/hz + 1

The amount of memory an APBS calculation requires is directly proportional
to nx*ny*nz.

APBS focuses the solution from the coarse to the fine mesh (using the same
number of grid points) in a controlled manner by requiring that the grid
length ratio between two successive focusing levels stay above a specified
value eps:

    eps <= lx2/lx1 < 1
    eps <= ly2/ly1 < 1
    eps <= lz2/lz1 < 1

In doing so, it sets the number of focusing levels in a calculation:

    m = max_{i = x,y,z} log(lif/lic)/log(eps)

For a given number of grid points, the time an APBS calculation takes is
directly proportional to the number of focusing levels m. The only thing
parallel focusing does differently from a typical sequential calculation is
to choose smaller fine grid lengths

    lxfp ~ (1 + 2*sigma) * lxf/npx
    lyfp ~ (1 + 2*sigma) * lyf/npy
    lzfp ~ (1 + 2*sigma) * lzf/npz

based on the size of the processor array (npx, npy, npz) and the desired
overlap between processor grids (sigma).
Now, consider a PBE calculation on a large biomolecular system where we
choose the number of grid points (nx, ny, nz) based on the available memory
per processor and the grid lengths (lxf, lyf, lzf) and (lxc, lyc, lzc)
based on the size of the protein. For a large molecule, we'll find that
these platform- and protein-specific settings give grid spacings
(hx, hy, hz) that are too large for accurate PBE calculations; therefore,
we'll need to use parallel focusing. Based on the arguments above, you can
see that the number of focusing levels (and therefore the computational
time) will scale as

    m ~ max_{i = x,y,z} log(lif/lic/npi)/log(eps)

In the original implementation, eps was so small (~1/100) that the
logarithmic dependence was negligible on the available computational
platforms. This gives us a claim to linear scaling in a manner entirely
analogous to the fast multipole method (the similarity between the methods
is more than superficial, BTW). This choice seemed to work well for the
cases I examined -- mainly ligand binding and comparison of average
potentials away from protein surfaces. However, I later discovered that the
resulting potentials with small eps did not give reliable results for
protein-protein interactions. Therefore, I chose a very conservative value
(0.25) as the default in recent versions of APBS. This value is overkill,
but it ensures that users will likely never encounter "surprising" errors
with APBS parallel focusing -- it is also what causes the less-than-linear
scaling you're observing. This value can be modified (see VREDRAT in
src/generic/apbs/vhal.h) to values appropriate to your application.

The upshot of this long e-mail is that APBS allows users to look at large
systems that are not possible on a single machine. Situations where this is
warranted are indicated by the psize.py utility provided with APBS.
The algorithm is completely latency-tolerant (we have a version that
requires no communication) and scales linearly under certain circumstances.

Thanks,

Nathan

Lawrence Hannon (03-31-2004 11:07:50-0600):
>I'm an IBM'er working with Celera on code optimization. They've asked me
>to take a look at apbs/MPI. They have not been big MPI users up to now.
>
>I built the system with the following
>
>CC=mpcc_r
>F77=mpxlf_r
>CFLAGS="-O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto -qmaxmem"
>FFLAGS="-qfixed=132 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto
>-qmaxmem"
>LDFLAGS="-bmaxdata:0x80000000 -bmaxstack:0x10000000 -L/usr/local/lib
>-lmass -lessl"
>
>For maloc
>configure --prefix --enable_mpi --enable_blas=no
>gmake install
>
>For apbs-0.2.6
>configure --prefix --with_blas="-L/usr/lib -lblas"
>gmake install
>
>It seems to have hooked into IBM's MPI (poe) because it complains if
>MP_PROCS or other MPI-related environment variables are set incorrectly. I
>ran it using the apbs-PARALLEL.in input file in examples/actin-dimer
>(modified to run with 1, 2, 4, & 8 CPUs). I'm not seeing any speedup when
>I add CPUs. As a matter of fact, it seems to run in the same amount of
>time or longer as I add CPUs. When I profile the code, I see most of the
>time is spent in "ivdwAccExclus". The time spent in this routine is the
>same or more for each thread even when it's run on multiple CPUs. I must
>be doing something wrong.
>
>If I look at which MPI routines are being used, I see MPI_Comm_size,
>MPI_Comm_rank, and MPI_Allreduce all being used only once. When I look
>through the source, it's very hard to determine exactly how parallelism is
>entering the problem. Can anyone help?
>
>Thanks,
>
>Lawrence Hannon

End of message from Lawrence Hannon.

--
Nathan A. Baker, Assistant Professor
Washington University in St. Louis School of Medicine
Dept. of Biochemistry and Molecular Biophysics
Center for Computational Biology
700 S. Euclid Ave., Campus Box 8036, St. Louis, MO 63110
Phone: (314) 362-2040, Fax: (314) 362-0234
URL: http://www.biochem.wustl.edu/~baker
PGP key: http://cholla.wustl.edu/~baker/pubkey.asc
```
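Nathan's grid arithmetic can be sketched numerically. The following Python snippet is a minimal illustration (not APBS code; the function names and example lengths are invented): it computes per-axis grid-point counts, the number of focusing levels m implied by a reduction ratio eps, and the smaller per-processor fine grid lengths used by parallel focusing.

```python
import math

def grid_points(l_fine, h):
    """Grid points per axis from fine grid length and spacing: n = l/h + 1."""
    return tuple(round(l / s) + 1 for l, s in zip(l_fine, h))

def focusing_levels(l_fine, l_coarse, eps):
    """Focusing levels needed: m = max_i ceil(log(lif/lic)/log(eps))."""
    return max(math.ceil(math.log(lf / lc) / math.log(eps))
               for lf, lc in zip(l_fine, l_coarse))

def parallel_fine_lengths(l_fine, nproc, sigma):
    """Per-processor fine grid lengths: lifp ~ (1 + 2*sigma) * lif/npi."""
    return tuple((1 + 2 * sigma) * lf / n for lf, n in zip(l_fine, nproc))

# Hypothetical example: a 200 A coarse grid focused down to a 50 A fine grid.
coarse = (200.0, 200.0, 200.0)
fine = (50.0, 50.0, 50.0)

# Memory scales with nx*ny*nz; at 0.5 A fine-grid spacing:
nx, ny, nz = grid_points(fine, (0.5, 0.5, 0.5))  # 101 points per axis

# Sequential focusing needs one level at the conservative default eps = 0.25:
m_seq = focusing_levels(fine, coarse, 0.25)  # 1

# Parallel focusing on a 4x4x4 processor array shrinks the fine lengths...
fine_p = parallel_fine_lengths(fine, (4, 4, 4), sigma=0.1)  # 15 A per axis

# ...which adds a focusing level at eps = 0.25 but not at the original
# aggressive eps ~ 1/100, matching the scaling argument in the e-mail:
m_par_conservative = focusing_levels(fine_p, coarse, 0.25)  # 2
m_par_aggressive = focusing_levels(fine_p, coarse, 0.01)    # 1
```

This shows why the conservative default costs wall-clock time: the extra focusing level from splitting the fine grid across processors is exactly the log(npi)/log(eps) term that eats into linear scaling.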

[Apbs-users] lack of speedup for apbs on AIX 5L using MPI
From: Lawrence Hannon - 2004-03-31 19:07:56
Attachments: Message as HTML

```
I'm an IBM'er working with Celera on code optimization. They've asked me to
take a look at apbs/MPI. They have not been big MPI users up to now.

I built the system with the following

CC=mpcc_r
F77=mpxlf_r
CFLAGS="-O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto -qmaxmem"
FFLAGS="-qfixed=132 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto -qmaxmem"
LDFLAGS="-bmaxdata:0x80000000 -bmaxstack:0x10000000 -L/usr/local/lib -lmass -lessl"

For maloc
configure --prefix --enable_mpi --enable_blas=no
gmake install

For apbs-0.2.6
configure --prefix --with_blas="-L/usr/lib -lblas"
gmake install

It seems to have hooked into IBM's MPI (poe) because it complains if
MP_PROCS or other MPI-related environment variables are set incorrectly. I
ran it using the apbs-PARALLEL.in input file in examples/actin-dimer
(modified to run with 1, 2, 4, & 8 CPUs). I'm not seeing any speedup when I
add CPUs. As a matter of fact, it seems to run in the same amount of time
or longer as I add CPUs. When I profile the code, I see most of the time is
spent in "ivdwAccExclus". The time spent in this routine is the same or
more for each thread even when it's run on multiple CPUs. I must be doing
something wrong.

If I look at which MPI routines are being used, I see MPI_Comm_size,
MPI_Comm_rank, and MPI_Allreduce all being used only once. When I look
through the source, it's very hard to determine exactly how parallelism is
entering the problem. Can anyone help?

Thanks,

Lawrence Hannon
```
Re: [Apbs-users] lack of speedup for apbs on AIX 5L using MPI
From: Robert Konecny - 2004-03-31 20:27:05

```
Hi Lawrence,

just to add to Nathan's informative email: how did you modify the
actin-dimer input files to study this scaling? Just changing the pdime
value won't give you any speedup, since all processors will still be
solving the same grid-size problem (as defined by dime). Increasing pdime
while keeping the other parameters constant will increase the effective
grid resolution of the system (i.e., lower the grid spacing), but you
won't observe any speedup, and you may even see a (very) slight increase
in wall-clock time due to increased communication.

To study strong scaling, one has to simultaneously increase pdime and
decrease dime in such a way that the total system grid size stays roughly
constant. So I guess that's where APBS differs from most other MPI/parallel
programs, where simply increasing the number of processors you throw at
the system will get you the answer faster. But as Nathan mentioned in his
email, APBS was designed to address a different class of computational
problems.

regards, robert

On Wed, Mar 31, 2004 at 11:07:50AM -0600, Lawrence Hannon wrote:
> I'm an IBM'er working with Celera on code optimization. They've asked
> me to take a look at apbs/MPI. They have not been big MPI users up to
> now.
> I built the system with the following
> CC=mpcc_r
> F77=mpxlf_r
> CFLAGS="-O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto -qmaxmem"
> FFLAGS="-qfixed=132 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -qcache=auto
> -qmaxmem"
> LDFLAGS="-bmaxdata:0x80000000 -bmaxstack:0x10000000 -L/usr/local/lib
> -lmass -lessl"
> For maloc
> configure --prefix --enable_mpi --enable_blas=no
> gmake install
> For apbs-0.2.6
> configure --prefix --with_blas="-L/usr/lib -lblas"
> gmake install
> It seems to have hooked into IBM's MPI (poe) because it complains if
> MP_PROCS or other MPI-related environment variables are set
> incorrectly. I ran it using the apbs-PARALLEL.in input file in
> examples/actin-dimer (modified to run with 1, 2, 4, & 8 CPUs). I'm not
> seeing any speedup when I add CPUs. As a matter of fact, it seems to
> run in the same amount of time or longer as I add CPUs. When I profile
> the code, I see most of the time is spent in "ivdwAccExclus". The time
> spent in this routine is the same or more for each thread even when
> it's run on multiple CPUs. I must be doing something wrong.
> If I look at which MPI routines are being used, I see MPI_Comm_size,
> MPI_Comm_rank, and MPI_Allreduce all being used only once. When I look
> through the source, it's very hard to determine exactly how parallelism
> is entering the problem. Can anyone help?
> Thanks,
> Lawrence Hannon
```
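Robert's pdime/dime point can be made concrete with a small sketch. The helper below is hypothetical (it is not part of APBS, and it ignores APBS's multigrid constraint that real dime values take forms like 65, 97, 129, 161, 193): it estimates the per-processor dime needed to hold the global resolution fixed as the processor array grows, which is the strong-scaling setup Robert describes.

```python
import math

def per_processor_dime(global_dime, pdime, overlap=0.1):
    """Approximate per-processor grid points needed to keep the *global*
    resolution fixed when the domain is split across a pdime processor
    array, with a fractional overlap between neighboring subgrids."""
    return tuple(math.ceil((1 + 2 * overlap) * n / p)
                 for n, p in zip(global_dime, pdime))

# Strong scaling: a 193^3 global grid split over a 2x2x2 processor array
# needs only about 116^3 points per processor -- the same total resolution
# with roughly 4-5x less work per processor, so wall-clock time can drop.
print(per_processor_dime((193, 193, 193), (2, 2, 2)))  # (116, 116, 116)

# By contrast, raising pdime while leaving dime fixed means every processor
# still solves a full dime-sized problem; the effective resolution improves
# but the per-processor work (and hence the runtime) stays the same, which
# is exactly the "no speedup" behavior reported in this thread.
```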