From: Roy S. <roy...@ic...> - 2009-11-18 23:53:04
---------- Forwarded message ----------
Date: Tue, 17 Nov 2009 17:37:27 -0600 (CST)
From: Roy Stogner <roy...@ic...>
To: Martin Burtscher <bur...@ic...>
Subject: Re: optimized function code

On Tue, 17 Nov 2009, Martin Burtscher wrote:

> I got around to optimizing the most important loop in the ex18 code.
> I've attached the new code. Just diff it with the original file to
> see the changes. Only about 20 lines are different. The new loop
> runs about 35% faster, giving an overall speedup of about 5% for
> this application. I'll take a look at a few other important code
> sections as soon as I have time, but this may take a while, which is
> why I'm sending you the optimized code for this one loop now. Please
> let me know if you have questions.

There don't seem to be any loop order changes here - is that because
the use of temporary variables like phiiqp was enough to ameliorate
the problems with the bad loop order that you pointed out before, or
is it just that you didn't want to make that more major change?

Also, there are two types of change I see, and I've got a question
about each:

1. Pre-indexing common variables like JxWqp, phiiqp, etc. (see
sketch 1 below).

Is the compiler not allowed to do this on its own because it's not
allowed to assume that JxW or phi isn't overlapping (and thus changed
by writes to) K?  Freaking aliasing.  It looks like even C++0x won't
be getting the "restrict" keyword from C99; Fortran users will still
have something to gloat about for another decade.  (Sketch 2 below
shows the compiler-extension workaround.)

For that matter, if phi[i][qp] can't be assumed constant, then the
compiler can't precompute shared calculations either, which makes
your second optimization all the more important...

2. Precomputing shared calculations like your mb1/2/3/4/5 variables.

This seems like an obvious optimization, but even with aliasing out
of the way it's one the compiler is often forbidden to make, because
reordering floating-point operations changes the rounding error. Of
course, in this case we're fine with reordering operations,
especially if it buys a 35% speedup.

However, there are two other ways this might be attempted:

- With compiler flags: "-fassociative-math" on g++ or
  "-IPF_fp_relaxed" with icpc.

- By changing the order of operations so that common factors are
  actually computed in a common fashion, e.g. changing

    Kuw(i,j) += JxW[qp] * -Reynolds*u_z*phi[i][qp]*phi[j][qp];

  to

    Kuw(i,j) += JxW[qp] * -Reynolds * phi[i][qp] * phi[j][qp] * u_z;

  and then hoping that the compiler is smart enough to realize that
  it only needs to perform the first two multiplies outside the j
  loop and the third multiply at the start of the j loop (sketch 3
  below spells out the hand-hoisted version).

Do you think either of these methods might work, once the aliasing
problem is resolved via optimization (1)?

I was disillusioned with hand-optimizing out common factors when,
after learning to do it in Matlab, I discovered via experimenting in
C that the compiler was usually doing it for me. But I suppose
there's a limit to the complexity that the compiler can deal with.

I'd like to make whatever optimizations I can to the libMesh
examples, but there's a tradeoff between making the code less clear
(which would make the example apps less useful as a teaching tool)
and leaving the code slower (which would make the examples teach some
of the wrong habits). Caching phi[i][qp] et al. is straightforward
enough that I may change all the official examples to do so. Explicit
precomputations are uglier; that's why I'm still hoping that the
compiler can be coerced into doing it for us.
---
Roy
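
Sketch 1 - the pre-indexing idea from point (1). A minimal
illustration, not Martin's attached code: the loop bounds n_qp and
n_dofs and the use of a single Kuw term are assumed for the example,
while JxWqp and phiiqp are the temporary names mentioned above.

  for (unsigned int qp = 0; qp < n_qp; qp++)
    {
      // Read JxW[qp] into a local once per qp; without this the
      // compiler must assume writes to Kuw may modify JxW and
      // reload it on every iteration.
      const Real JxWqp = JxW[qp];

      for (unsigned int i = 0; i < n_dofs; i++)
        {
          // Likewise cache phi[i][qp] once per i.
          const Real phiiqp = phi[i][qp];

          for (unsigned int j = 0; j < n_dofs; j++)
            Kuw(i,j) += JxWqp * -Reynolds * u_z * phiiqp * phi[j][qp];
        }
    }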
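
Sketch 2 - the aliasing workaround. Standard C++ indeed has no
"restrict", but g++ and icpc both accept the __restrict__ extension
on pointers, which promises the compiler that the arrays don't
overlap. A hypothetical flattened kernel, just to show the syntax
(the function name, layout, and bounds are made up for illustration):

  // K, JxW, and phi are promised not to alias one another, so the
  // compiler is free to hoist the JxW[qp] and phi loads by itself.
  void accumulate (double * __restrict__ K,
                   const double * __restrict__ JxW,
                   const double * __restrict__ phi,
                   unsigned int n_qp, unsigned int n_dofs)
  {
    for (unsigned int qp = 0; qp < n_qp; qp++)
      for (unsigned int i = 0; i < n_dofs; i++)
        for (unsigned int j = 0; j < n_dofs; j++)
          K[i*n_dofs + j] += JxW[qp] * phi[i*n_qp + qp]
                                     * phi[j*n_qp + qp];
  }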
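
Sketch 3 - the hand-hoisted form that the reordering in point (2)
hopes the compiler will reach on its own. Again illustrative: the
temporary names are made up, and whether the trailing multiplies are
best kept per-j or shared with the other K blocks depends on the
surrounding code.

  for (unsigned int qp = 0; qp < n_qp; qp++)
    {
      const Real JxWRe = JxW[qp] * -Reynolds;        // hoisted, per qp
      for (unsigned int i = 0; i < n_dofs; i++)
        {
          const Real JxWRephii = JxWRe * phi[i][qp]; // hoisted, per i
          for (unsigned int j = 0; j < n_dofs; j++)
            // only the j-dependent multiplies remain inside
            Kuw(i,j) += JxWRephii * phi[j][qp] * u_z;
        }
    }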