There's an optimization project proposal being put together at
UT-Austin, which is likely to bring some new eyes to optimizing
libMesh, and possibly PETSc later. (Not to insinuate that PETSc
isn't already better optimized than we are, but since it accounts
for a big chunk of many libMesh apps' runtime, any further
improvements would be welcome.)
One of the recent discussions on this topic proved very interesting to
me. With Martin's permission, I'm forwarding a few messages to the
lists for everyone to see.
---------- Forwarded message ----------
Date: Tue, 17 Nov 2009 16:21:46 -0600
From: Martin Burtscher <burtscher@...>
To: Roy Stogner <roystgnr@...>
Subject: optimized function code
I got around to optimizing the most important loop in the ex18 code. I've
attached the new code. Just diff it against the original file to see the
changes. Only about 20 lines are different. The new loop runs about 35% faster, giving
an overall speedup of about 5% for this application. I'll take a look at a few
other important code sections as soon as I have time, but this may take a
while, which is why I'm sending you the optimized code for this one loop now.
Please let me know if you have questions. Thanks and I'll keep you posted.
Have a nice day,
From: Roy Stogner <roystgnr@ic...> - 2009-11-19 19:36:06
On Thu, 19 Nov 2009, Jed Brown wrote:
> 1. Interlace fields in vector problems, so e.g. deformation dofs in
> elasticity would be ordered [u0,v0,w0,u1,v1,w1,...]. This allows better
> L1 reuse (better hardware prefetch because there are fewer pointers to
> track, you only pay for the latency of a cache miss once for all 3
> fields rather than once per field). It also allows blocked matrix
> formats (BAIJ) which are smaller (one column index per block rather than
> one per entry, thus
> sizeof(MatScalar) + sizeof(PetscInt)/(bs*bs)
> per entry instead of
> sizeof(MatScalar) + sizeof(PetscInt)
There's something in the PETSc FAQ about this:
"the AIJ format automatically searches for matching rows and thus
still takes advantage of the natural blocks in your matrix to obtain
good performance" - does that at least imply that we'd be getting
some of that benefit already?
libMesh would be using SBAIJ by now, except that problems which are
sufficiently complex to need the improvement often don't have constant
block size. Mixed finite elements with different polynomial degree
used to be the only offender, but now per-subdomain variables make
things still more complicated.
That's a shame, too. One application I'm working on right now is
going to end up with roughly a half dozen variables on a big subdomain
and one or two dozen on a much bigger subdomain. Getting the storage
savings from block structure alone would have been nice.