Re: [Libmesh-devel] optimized function code (fwd) From: Jed Brown - 2009-11-19 21:07 Roy Stogner wrote: > > On Thu, 19 Nov 2009, Jed Brown wrote: > >> You still don't need to assemble the adjoint, you just need to be able >> to multiply with the adjoint. Finite differencing can't do this for >> you, but AD can and it's easy to do yourself when you have a Galerkin >> method (basically just swap test and trial functions in the right >> place). > > Good point. We're wary of AD (some nasty constitutive equations > require calls to external codes which can't easily be retrofit with > Sacado). Yeah, AD has serious software engineering ramifications. > But if memory does become a problem, our current "backup plan" is to > only build the adjoint matrix one element submatrix at a time, use > those for local submatrix-subvector multiplies, then assemble the > resulting subvector products and throw away the submatrices. Yep, note that you can multiply by element matrices more cheaply than you can assemble them. Also, it is frequently possible to isolate space-efficient intermediate forms that are cheaper to apply than rebuilding the whole thing. My simplest example of this technique (unfortunately symmetric) comes from a p-Laplacian where you end up with a conductivity tensor looking like a 1 + b \nabla u \otimes \nabla u (a and b are nonlinear in u, the state we are evaluating the Jacobian at). You get cheap products if you store, e.g. (a,\sqrt b \nabla u). This generalizes to many non-Newtonian flow problems. Even if you haven't done anything clever, you still only need to store this stuff at quadrature points so it's O(q f^2) instead of O(p^2 f^2) like the element submatrices. Jed 

[Libmesh-devel] optimized function code (fwd) Roy Stogner <roystgnr@ic...>
 Re: [Libmesh-devel] optimized function code (fwd) From: Jed Brown - 2009-11-19 11:53 Attachments: signature.asc Roy Stogner wrote: > > There's an optimization project proposal being put together at > UT-Austin, which is likely to end up getting some new eyes towards > optimizing libMesh; possibly PETSc later. (not to insinuate that > PETSc isn't already better optimized than we are, but since it ends up > being a big chunk of many libMesh apps' runtime then any possible > further improvements would be welcome) The core PETSc kernels have had a bit of attention, but there is certainly more loop optimization that would be possible (and restrict annotation, many compilers support it even though C++ refuses to mandate it). But this will have very limited effect because almost nothing is CPU-bound (less every year), instead we're limited by memory bandwidth and L1 reuse. This paper is still very relevant ftp://info.mcs.anl.gov/pub/tech_reports/reports/P833.pdf My lightweight p-version FEM library (implemented as a PETSc DM) performs two memory "optimizations" that have significant effects on the solver: 1. Interlace fields in vector problems, so e.g. deformation dofs in elasticity would be ordered [u0,v0,w0,u1,v1,w1,...]. This allows better L1 reuse (better hardware prefetch because there are fewer pointers to track, you only pay for the latency of a cache miss once for all 3 fields rather than once per field). It also allows blocked matrix formats (BAIJ) which are smaller (one column index per block rather than one per entry, thus sizeof(MatScalar) + sizeof(PetscInt)/(bs*bs) per entry instead of sizeof(MatScalar) + sizeof(PetscInt) Furthermore, matrix kernels (multiplication, factorization, and triangular solves) are fully unrolled within blocks and SOR turns into block SOR (stronger). Insertion is also faster with blocked formats, note that you can use MatSetValuesBlocked() (and variants) even with scalar formats, you just need to call MatSetBlockSize(). Even when you are using a scalar format (AIJ), interlacing fields offers more than just better cache reuse because PETSc uses inodes (by default) for the critical kernels. Inodes identify consecutive rows with identical nonzero structure and store the column indices once, then matrix kernels will be unrolled over the identical rows meaning that you are only seeing sizeof(MatScalar) + sizeof(PetscInt)/bs per entry. Also SOR again becomes block SOR. 2. Reorder dofs within subdomains to reduce bandwidth. This is *not* to speed up direct solves, it is a myth that direct solvers try to minimize bandwidth. It is to improve L1 reuse in matrix kernels and to make relaxation and *incomplete* factorization stronger. Try running src/ksp/ksp/examples/tutorials/ex2 (or any other) with -pc_type lu -pc_factor_mat_ordering_type nd -mat_view_ordering_draw \ -draw_pause 2 -mat_view_draw -info |grep fill Nested dissection (nd) is the default ordering for LU, it is *not* low-bandwidth. Compare to a low-bandwidth ordering like reverse Cuthill-McKee (rcm) which will produce more fill. Now compare iteration count with -pc_type ilu and the ordering choices above, ND is a horrible ordering (much worse than the natural ordering in this case) for incomplete factorization. I reorder nodes within subdomains so that my natural ordering is approximately RCM (by default), thus incomplete factorization with the natural ordering is as strong (and fast) as possible. Beyond these, large improvements are mostly algorithmic (physics-based preconditioners, latency tolerant Krylov methods, improved globalization schemes, stiff time integration, etc). Jed 

 Re: [Libmesh-devel] optimized function code (fwd) From: Roy Stogner - 2009-11-19 19:36 On Thu, 19 Nov 2009, Jed Brown wrote: > 1. Interlace fields in vector problems, so e.g. deformation dofs in > elasticity would be ordered [u0,v0,w0,u1,v1,w1,...]. This allows better > L1 reuse (better hardware prefetch because there are fewer pointers to > track, you only pay for the latency of a cache miss once for all 3 > fields rather than once per field). It also allows blocked matrix > formats (BAIJ) which are smaller (one column index per block rather than > one per entry, thus > > sizeof(MatScalar) + sizeof(PetscInt)/(bs*bs) > > per entry instead of > > sizeof(MatScalar) + sizeof(PetscInt) There's something in the PETSc FAQ about: "the AIJ format automatically searches for matching rows and thus still takes advantage of the natural blocks in your matrix to obtain good performance" - does that at least imply that we'd be getting a 1/bs improvement? libMesh would be using SBAIJ by now, except that problems which are sufficiently complex to need the improvement often don't have constant block size. Mixed finite elements with different polynomial degree used to be the only offender, but now per-subdomain variables make things still more complicated. That's a shame, too. One application I'm working on right now is going to end up with roughly a half dozen variables on a big subdomain and one or two dozen on a much bigger subdomain. Storing block structure alone would have been nice. --- Roy 

 Re: [Libmesh-devel] optimized function code (fwd) From: Jed Brown - 2009-11-19 20:02 Attachments: signature.asc Roy Stogner wrote: > > On Thu, 19 Nov 2009, Jed Brown wrote: > >> 1. Interlace fields in vector problems, so e.g. deformation dofs in >> elasticity would be ordered [u0,v0,w0,u1,v1,w1,...]. This allows better >> L1 reuse (better hardware prefetch because there are fewer pointers to >> track, you only pay for the latency of a cache miss once for all 3 >> fields rather than once per field). It also allows blocked matrix >> formats (BAIJ) which are smaller (one column index per block rather than >> one per entry, thus >> >> sizeof(MatScalar) + sizeof(PetscInt)/(bs*bs) >> >> per entry instead of >> >> sizeof(MatScalar) + sizeof(PetscInt) > > There's something in the PETSc FAQ about: > > "the AIJ format automatically searches for matching rows and thus > still takes advantage of the natural blocks in your matrix to obtain > good performance" - does that at least imply that we'd be getting a > 1/bs improvement? These are the inodes, you benefit from them any time you have consecutive rows with identical nonzero pattern (and it's worth occasionally storing zeros so that this is more common). > libMesh would be using SBAIJ by now, except that problems which are > sufficiently complex to need the improvement often don't have constant > block size. Mixed finite elements with different polynomial degree > used to be the only offender, but now per-subdomain variables make > things still more complicated. That could be tricky but you might be happy with just computing a low-bandwidth ordering and assembling into that, e.g. MatSetLocalToGlobalMapping, then insert with MatSetValuesLocal. The ordering should cluster rows with identical sparsity pattern so you would benefit from inodes. Note that incomplete factorization for mixed FEM is very sensitive to ordering, therefore it can be desirable for, e.g. pressure unknowns to always follow velocity unknowns rather than be interlaced where they minimize bandwidth. Black-box preconditioners are really bad for mixed FEM (mostly due to the resulting system being indefinite) so I don't assemble those Jacobians, just the blocks that will be needed for preconditioning. The only viable alternative is to use exotic DD (Schwarz and Schur) schemes where you carefully choose a coarse level and subdomains so that they remove low-energy modes and satisfy the necessary divergence theorem, then use a direct subdomain solver and redundant or approximate coarse level solver. Such schemes have yet to be made generic (they are tied to a specific discretization of a specific physics), though I have some ideas for implementing BDDC/FETI-DP in a somewhat more flexible way than I've yet seen. (Alas, I have a thesis to finish so I won't be working on it soon). > That's a shame, too. One application I'm working on right now is > going to end up with roughly a half dozen variables on a big subdomain > and one or two dozen on a much bigger subdomain. Storing block > structure alone would have been nice. Sounds like a lot of storage. Are the blocks dense? Have you considered preconditioning with a sparser representation? Jed 

 Re: [Libmesh-devel] optimized function code (fwd) From: Roy Stogner - 2009-11-19 20:35 On Thu, 19 Nov 2009, Jed Brown wrote: > That could be tricky but you might be happy with just computing a > low-bandwidth ordering and assembling into that, e.g. > MatSetLocalToGlobalMapping, then insert with MatSetValuesLocal. The > ordering should cluster rows with identical sparsity pattern so you > would benefit from inodes. Block variable ordering with "--node_major_dofs" is perhaps the least well-documented option in libMesh, which is saying a lot, but we're already using it. >> That's a shame, too. One application I'm working on right now is >> going to end up with roughly a half dozen variables on a big subdomain >> and one or two dozen on a much bigger subdomain. Storing block >> structure alone would have been nice. > > Sounds like a lot of storage. Well, the definition of "big" and "much bigger" is in the eye of the beholder (and is going to depend on how much refinement we turn out to need). We're setting up a 128 core system for local testing; based on some extrapolation and some results with a different code, I'm hoping we'll be able to fit even the full problem on 64 cores without exceeding per-node RAM. > Are the blocks dense? Not entirely, but close enough that we're not trying to shave a few bytes on the exceptions, especially since it might just cost more bytes in the sparsity pattern. > Have you considered preconditioning with a sparser representation? Barely, but we'll have to cross that bridge when we come to it. For the smaller problems we're using this code on now, there's still memory to spare for ILU. For later I'm not sure. We'll have to construct the true Jacobian for the adjoint problem, but might try to set up a preconditioner based on reduced physics or a coarser grid. --- Roy 

 Re: [Libmesh-devel] optimized function code (fwd) From: Jed Brown - 2009-11-19 20:43 Attachments: signature.asc Roy Stogner wrote: > Block variable ordering with "--node_major_dofs" is perhaps the least > well-documented option in libMesh, which is saying a lot, but we're > already using it. Neat. > We'll have to construct the true Jacobian for the adjoint problem, but > might try to set up a preconditioner based on reduced physics or a > coarser grid. You still don't need to assemble the adjoint, you just need to be able to multiply with the adjoint. Finite differencing can't do this for you, but AD can and it's easy to do yourself when you have a Galerkin method (basically just swap test and trial functions in the right place). Jed 

 Re: [Libmesh-devel] optimized function code (fwd) From: Roy Stogner - 2009-11-19 20:51 On Thu, 19 Nov 2009, Jed Brown wrote: > You still don't need to assemble the adjoint, you just need to be able > to multiply with the adjoint. Finite differencing can't do this for > you, but AD can and it's easy to do yourself when you have a Galerkin > method (basically just swap test and trial functions in the right > place). Good point. We're wary of AD (some nasty constitutive equations require calls to external codes which can't easily be retrofit with Sacado). But if memory does become a problem, our current "backup plan" is to only build the adjoint matrix one element submatrix at a time, use those for local submatrix-subvector multiplies, then assemble the resulting subvector products and throw away the submatrices. --- Roy 

 Re: [Libmesh-devel] optimized function code (fwd) From: Jed Brown - 2009-11-19 21:07 Attachments: signature.asc Roy Stogner wrote: > > On Thu, 19 Nov 2009, Jed Brown wrote: > >> You still don't need to assemble the adjoint, you just need to be able >> to multiply with the adjoint. Finite differencing can't do this for >> you, but AD can and it's easy to do yourself when you have a Galerkin >> method (basically just swap test and trial functions in the right >> place). > > Good point. We're wary of AD (some nasty constitutive equations > require calls to external codes which can't easily be retrofit with > Sacado). Yeah, AD has serious software engineering ramifications. > But if memory does become a problem, our current "backup plan" is to > only build the adjoint matrix one element submatrix at a time, use > those for local submatrix-subvector multiplies, then assemble the > resulting subvector products and throw away the submatrices. Yep, note that you can multiply by element matrices more cheaply than you can assemble them. Also, it is frequently possible to isolate space-efficient intermediate forms that are cheaper to apply than rebuilding the whole thing. My simplest example of this technique (unfortunately symmetric) comes from a p-Laplacian where you end up with a conductivity tensor looking like a 1 + b \nabla u \otimes \nabla u (a and b are nonlinear in u, the state we are evaluating the Jacobian at). You get cheap products if you store, e.g. (a,\sqrt b \nabla u). This generalizes to many non-Newtonian flow problems. Even if you haven't done anything clever, you still only need to store this stuff at quadrature points so it's O(q f^2) instead of O(p^2 f^2) like the element submatrices. Jed