From: Roy S. <roy...@ic...> - 2009-11-18 23:53:04
---------- Forwarded message ----------
Date: Tue, 17 Nov 2009 17:37:27 -0600 (CST)
From: Roy Stogner <roy...@ic...>
To: Martin Burtscher <bur...@ic...>
Subject: Re: optimized function code

On Tue, 17 Nov 2009, Martin Burtscher wrote:

> I got around to optimizing the most important loop in the ex18 code.
> I've attached the new code. Just diff it with the original file to
> see the changes. Only about 20 lines are different. The new loop
> runs about 35% faster, giving an overall speedup of about 5% for
> this application. I'll take a look at a few other important code
> sections as soon as I have time, but this may take a while, which is
> why I'm sending you the optimized code for this one loop now. Please
> let me know if you have questions.

There don't seem to be any loop order changes here - is that because
the use of temporary variables like phiiqp was enough to ameliorate
the problems with the bad loop order that you pointed out before, or
is it just that you didn't want to make that more major change?

Also, there are two types of change I see, and I've got a question
about each:

1. Pre-indexing common variables like JxWqp, phiiqp, etc. (see
sketch 1 below).

Is the compiler not allowed to do this on its own because it's not
allowed to assume that JxW or phi isn't overlapping (and thus changed
by writes to) K?  Freaking aliasing.  It looks like even C++0x won't
be getting the "restrict" keyword from C99; Fortran users will still
have something to gloat about for another decade.  (Sketch 2 below
shows the compiler-extension workaround.)

For that matter, if phi[i][qp] can't be assumed constant, then the
compiler can't precompute shared calculations either, which makes
your second optimization all the more important...

2. Precomputing shared calculations like your mb1/2/3/4/5 variables.

This seems like an obvious optimization, but even with aliasing out
of the way it's one the compiler is often forbidden to make, because
reordering floating-point operations changes the rounding error. Of
course, in this case we're fine with reordering operations,
especially if it buys a 35% speedup.

However, there are two other ways this might be attempted:

- With compiler flags: "-fassociative-math" on g++ or
  "-IPF_fp_relaxed" with icpc.

- By changing the order of operations so that common factors are
  actually computed in a common fashion, e.g. changing

    Kuw(i,j) += JxW[qp] * -Reynolds*u_z*phi[i][qp]*phi[j][qp];

  to

    Kuw(i,j) += JxW[qp] * -Reynolds * phi[i][qp] * phi[j][qp] * u_z;

  and then hoping that the compiler is smart enough to realize that
  it only needs to perform the first two multiplies outside the j
  loop and the third multiply at the start of the j loop (sketch 3
  below spells out the hand-hoisted version).

Do you think either of these methods might work, once the aliasing
problem is resolved via optimization (1)?

I was disillusioned with hand-optimizing out common factors when,
after learning to do it in Matlab, I discovered via experimenting in
C that the compiler was usually doing it for me. But I suppose
there's a limit to the complexity that the compiler can deal with.

I'd like to make whatever optimizations I can to the libMesh
examples, but there's a tradeoff between making the code less clear
(which would make the example apps less useful as a teaching tool)
and leaving the code slower (which would make the examples teach some
of the wrong habits). Caching phi[i][qp] et al. is straightforward
enough that I may change all the official examples to do so. Explicit
precomputations are uglier; that's why I'm still hoping that the
compiler can be coerced into doing it for us.
---
Roy
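
Sketch 1 - the pre-indexing idea from point (1). A minimal
illustration, not Martin's attached code: the loop bounds n_qp and
n_dofs and the use of a single Kuw term are assumed for the example,
while JxWqp and phiiqp are the temporary names mentioned above.

  for (unsigned int qp = 0; qp < n_qp; qp++)
    {
      // Read JxW[qp] into a local once per qp; without this the
      // compiler must assume writes to Kuw may modify JxW and
      // reload it on every iteration.
      const Real JxWqp = JxW[qp];

      for (unsigned int i = 0; i < n_dofs; i++)
        {
          // Likewise cache phi[i][qp] once per i.
          const Real phiiqp = phi[i][qp];

          for (unsigned int j = 0; j < n_dofs; j++)
            Kuw(i,j) += JxWqp * -Reynolds * u_z * phiiqp * phi[j][qp];
        }
    }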
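
Sketch 2 - the aliasing workaround. Standard C++ indeed has no
"restrict", but g++ and icpc both accept the __restrict__ extension
on pointers, which promises the compiler that the arrays don't
overlap. A hypothetical flattened kernel, just to show the syntax
(the function name, layout, and bounds are made up for illustration):

  // K, JxW, and phi are promised not to alias one another, so the
  // compiler is free to hoist the JxW[qp] and phi loads by itself.
  void accumulate (double * __restrict__ K,
                   const double * __restrict__ JxW,
                   const double * __restrict__ phi,
                   unsigned int n_qp, unsigned int n_dofs)
  {
    for (unsigned int qp = 0; qp < n_qp; qp++)
      for (unsigned int i = 0; i < n_dofs; i++)
        for (unsigned int j = 0; j < n_dofs; j++)
          K[i*n_dofs + j] += JxW[qp] * phi[i*n_qp + qp]
                                     * phi[j*n_qp + qp];
  }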
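
Sketch 3 - the hand-hoisted form that the reordering in point (2)
hopes the compiler will reach on its own. Again illustrative: the
temporary names are made up, and whether the trailing multiplies are
best kept per-j or shared with the other K blocks depends on the
surrounding code.

  for (unsigned int qp = 0; qp < n_qp; qp++)
    {
      const Real JxWRe = JxW[qp] * -Reynolds;        // hoisted, per qp
      for (unsigned int i = 0; i < n_dofs; i++)
        {
          const Real JxWRephii = JxWRe * phi[i][qp]; // hoisted, per i
          for (unsigned int j = 0; j < n_dofs; j++)
            // only the j-dependent multiplies remain inside
            Kuw(i,j) += JxWRephii * phi[j][qp] * u_z;
        }
    }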