On Mar 4, 2009, at 6:24 AM, Gábor Melis wrote:
> This function is compiled and run with a varying
> number of NOPs inserted before the loop.
> What I see on my Core Duo is that execution time varies
> greatly (~15%) depending on the number of NOPs. Furthermore, the
> results are counter-intuitive to me because the 16 byte aligned loop
> (with 3 NOPs added) is slow. There are even worse performers,
> NOPs=9,10,11, 25,26,27 and so on with 16 increments.
> On a P4 results are much steadier.
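
To make the numbers above easier to compare, here is a small sketch of
where the loop would start inside its 16-byte block for each NOP count.
It assumes single-byte NOPs and takes the quoted statement that 3 NOPs
gives a 16-byte-aligned loop at face value; the name loop_offset is
made up for illustration and is not anything from SBCL.

```python
# Assumptions (not measured here): each NOP is one byte, and the loop
# is 16-byte aligned when exactly 3 NOPs are inserted, as quoted above.
ALIGN = 16
ALIGNED_NOPS = 3


def loop_offset(nops, aligned_nops=ALIGNED_NOPS, align=ALIGN):
    """Byte offset of the loop start within its 16-byte block."""
    return (nops - aligned_nops) % align


if __name__ == "__main__":
    # NOP counts mentioned in the quoted message.
    for nops in (3, 9, 10, 11, 25, 26, 27):
        print(f"NOPs={nops:2d}  offset in block={loop_offset(nops):2d}")
```

Under those assumptions the reported worst cases (9, 10, 11 and 25, 26,
27) all land at offsets 6, 7 and 8, which is just the "16 increments"
pattern restated as an offset. It says nothing about why those offsets
are slow.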
I found this comment from http://x264dev.multimedia.cx/?p=51
interesting (the site is down right now, unfortunately):
> One interesting feature of the Nehalem is that it appears to not
> have the “code alignment” problem that the Core 2 had. For a reason
> we have yet to figure out (though we have much speculation), the
> Core 2 has this odd habit of execution times changing dramatically
> solely due to alignment of the code itself. That is, one could
> literally speed up a small segment of code just by inserting random
> numbers of nops before it until it got faster. This wasn’t useful
> for optimization, as it was not only random but misaligning one set
> of code would hurt another set of code just as much. We suspected it
> was due to some weirdness involving the cache or TLB, but regardless
> of what it was, the Nehalem seems to be much more consistent in this
> regard, making measurements of optimizations’ effectiveness much
If the developers of highly optimized video compression software cannot
figure out how to align code for the C2D, I don't think there's much
hope of modifying SBCL to get the best alignment.
Google cache: http://22.214.171.124/search?q=cache:o79pNhk3208J:x264dev.multimedia.cx/%3Fp%3D51