From: Roy S. <roy...@ic...> - 2009-11-18 23:53:50
---------- Forwarded message ----------
Date: Tue, 17 Nov 2009 18:19:04 -0600
From: Martin Burtscher <bur...@ic...>
To: Roy Stogner <roy...@ic...>
Subject: Re: optimized function code

Hi Roy,

Thanks for your detailed email!

> There don't seem to be any loop order changes here - is that because
> the use of temporary variables like phiiqp were enough to ameliorate
> the problems with the bad loop order that you pointed out before, or
> is it just that you didn't want to make that more major change?

It's the latter, along with the fact that I just haven't had time to do
more extensive optimizations yet. I'm planning to optimize the code
more, but this will take a while as I'm busy with several projects (who
isn't?). I did try one other optimization, but it didn't help, so I
didn't include it in the file I sent you.

> 1. Pre-indexing common variables like JxWqp, phiiqp, etc. Is the
> compiler not allowed to do this because it's not allowed to assume
> that JxW or phi isn't overlapping (and thus changed by writes to) K?

The compiler is supposed to do this, but it often gets confused and
doesn't. Basically, if it cannot statically prove that an optimization
is safe under all conditions, it will not perform it. These proofs
generally use heuristics and simplifications to keep them fast, which,
unfortunately, often results in the loss of essential information. As a
result, only very simple pieces of code get optimized by the compiler.

> Freaking aliasing. It looks like even C++0x won't be getting the
> "restrict" keyword from C99; Fortran users will still have something
> to gloat about for another decade.

Yes, "restrict" would be nice...

> 2. Precomputing shared calculations like your mb1/2/3/4/5 variables
> seems like an obvious optimization, but even with aliasing out of the
> way it's an optimization which the compiler is often forbidden to make
> because rounding error is changed by reordering operations.
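[To make point (1) concrete: the pre-indexing under discussion copies JxW[qp] and phi[i][qp] into local temporaries so the compiler no longer has to prove that writes through K don't alias them. A minimal sketch -- the names JxWqp/phiiqp mirror the variables mentioned above, but the std::vector types, the function name assemble_sketch, and the loop bounds are stand-ins for illustration, not libMesh's actual API:]

```cpp
#include <vector>

// Sketch of pre-indexing (point 1). Without the local copies, the
// compiler may reload JxW[qp] and phi[i][qp] on every j iteration,
// since it cannot rule out that K overlaps JxW or phi.
void assemble_sketch(std::vector<std::vector<double>> &K,
                     const std::vector<double> &JxW,
                     const std::vector<std::vector<double>> &phi,
                     unsigned n_qp)
{
  const unsigned n = phi.size();
  for (unsigned qp = 0; qp < n_qp; ++qp)
  {
    const double JxWqp = JxW[qp];        // hoisted once per qp
    for (unsigned i = 0; i < n; ++i)
    {
      const double phiiqp = phi[i][qp];  // hoisted once per i
      for (unsigned j = 0; j < n; ++j)
        K[i][j] += JxWqp * phiiqp * phi[j][qp];
    }
  }
}
```

[The local const scalars make the loop invariance explicit regardless of what the alias analysis can prove.]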
> Of course, in this case we're fine with reordering operations,
> especially if it's a 35% speedup. However, there are two other ways
> this might be attempted:
>
> With compiler flags: "-fassociative-math" on g++ or "-IPF_fp_relaxed"
> with icpc.

The problem is that not all compilers support this, and even the ones
that have these flags are not guaranteed to perform this optimization,
due to the problems outlined above. In practice, compilers often don't
even optimize things that look really simple to humans. Moreover, some
of the shared computations in this code involve templates, which all
but guarantee that the compiler will not apply any sophisticated
optimizations such as exploiting associativity.

> By changing the order of operations to make sure that common factors
> are actually computed in a common fashion, e.g. changing
>   Kuw(i,j) += JxW[qp] * -Reynolds*u_z*phi[i][qp]*phi[j][qp];
> to
>   Kuw(i,j) += JxW[qp] * -Reynolds * phi[i][qp] * phi[j][qp] * u_z;
> and then hoping that the compiler is smart enough to realize that it
> only needs to perform the first two multiplies outside the j loop and
> the third multiply at the start of the j loop.

Yes, this might help.

> Do you think either of these methods might work, once the aliasing
> problem is resolved via optimization (1)? I was disillusioned with
> hand-optimizing-out common factors when, after learning to do it in
> Matlab, I discovered via experimenting in C that the compiler was
> usually doing it for me. But I suppose there's a limit to the
> complexity that the compiler can deal with.

Precisely. It will often work beautifully on a simple example but fail
in real application code.

It is quite possible that not all the steps I took are needed to get
most, if not all, of the performance improvement. I didn't try partial
solutions but just factored out whatever I could in a single step. You
would have to incrementally undo what I did to find out what helps how
much.
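[The reordering described above can also be carried all the way by hand, so that nothing depends on the compiler noticing the common factors or being allowed to reassociate. A minimal sketch under the same assumptions as before -- Kuw, JxW, phi, Reynolds, and u_z come from the snippet quoted above, but the std::vector stand-ins, the function name add_convection_sketch, and the loop bounds are made up for illustration:]

```cpp
#include <vector>

// Sketch of hand-hoisting the common factors of
//   Kuw(i,j) += JxW[qp] * -Reynolds * u_z * phi[i][qp] * phi[j][qp];
// so only one multiply remains in the innermost loop. This changes the
// rounding order, which is exactly the reordering the compiler is
// normally forbidden to do without -fassociative-math or similar.
void add_convection_sketch(std::vector<std::vector<double>> &Kuw,
                           const std::vector<double> &JxW,
                           const std::vector<std::vector<double>> &phi,
                           double Reynolds, double u_z, unsigned n_qp)
{
  const unsigned n = phi.size();
  for (unsigned qp = 0; qp < n_qp; ++qp)
  {
    const double c = -Reynolds * u_z * JxW[qp];  // common to all (i,j)
    for (unsigned i = 0; i < n; ++i)
    {
      const double ci = c * phi[i][qp];          // common to the j loop
      for (unsigned j = 0; j < n; ++j)
        Kuw[i][j] += ci * phi[j][qp];
    }
  }
}
```

[This is the "explicit precomputation" style: uglier, but it works on every compiler, at the cost of making the assembly loop less recognizable as the weak form.]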
> I'd like to make whatever optimizations I can to the libMesh examples,
> but there's a tradeoff between making the code less clear (which would
> make the example apps less useful as a teaching tool) and leaving the
> code slower (which would make the examples teach some of the wrong
> habits). Caching phi[i][qp] et al. is straightforward enough that I
> may change all the official examples to do so. Explicit
> precomputations are uglier; that's why I'm still hoping that the
> compiler can be coerced into doing it for us.

I agree with you and am all for changing as little as possible while
reaping as much of a benefit as possible.

I'll let you know what else I find.

Have a nice evening,
Martin
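[Editorial note on the "restrict" exchange above: although standard C++ never adopted C99's restrict, g++, clang, and Intel's compiler accept __restrict__ as an extension (MSVC spells it __restrict), which gives the aliasing guarantee directly instead of working around it with temporaries. A minimal sketch -- the function name scaled_accumulate and the data are made up, not libMesh code:]

```cpp
#include <cstddef>

// __restrict__ promises the compiler that y and x never overlap, which
// legalizes hoisting loads out of the loop and vectorizing freely --
// the guarantee the optimizer otherwise has to prove for itself.
// Non-standard extension, but widely supported.
void scaled_accumulate(double *__restrict__ y,
                       const double *__restrict__ x,
                       double a, std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    y[i] += a * x[i];
}
```

[If the two pointers did alias, this declaration would be a lie and the program's behavior undefined, so the extension shifts the burden of proof from the compiler to the programmer.]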