Douglas Katzman wrote:
> I find 'rep stos' to outperform the tight loop that is currently emitted
> for filling, except when beneath the threshold number of iterations at
> which the fixed setup cost outweighs a small loop.
> On the Macbook, 'rep stos' seems to be as much as 2x to 3x faster, and
> on a fast workstation, as much as 5.5x faster.
> On the Macbook, the threshold below which you wouldn't prefer 'rep stos'
> is somewhere between 20 and 40 qwords, but on the workstation, it's in
> the range of 15 to 20.
How bad is the slowdown on really short loops?
FWIW, I see next to no reason to expand FILL inline. A couple of assembly
routines could easily exploit instructions like rep stos, and save on
I$. There are certainly other such cases where we seem to be inlining
out of laziness, e.g. logcount.
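
To make the out-of-line idea concrete, here is a rough sketch in C (with
GCC/Clang inline asm) of a single shared fill routine that uses a small
loop below a cutoff and 'rep stosq' above it. This is only an
illustration, not what the compiler emits today: the name fill_qwords
and the cutoff REP_STOS_THRESHOLD are made up, and the value 32 is just
a guess inside the 15 to 40 qword range quoted above, so it would have
to be measured per machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff, in qwords. Assumption: picked from the 15-40
   range reported in the thread; it is not a measured value. */
#define REP_STOS_THRESHOLD 32

/* Hypothetical out-of-line fill routine: stores `value` into `n`
   consecutive qwords starting at `dst`. Kept noinline so that every
   call site shares one copy of the code. */
__attribute__((noinline))
void fill_qwords(uint64_t *dst, uint64_t value, size_t n)
{
    /* Short fills: a plain loop avoids the fixed setup cost of rep stos. */
    if (n < REP_STOS_THRESHOLD) {
        for (size_t i = 0; i < n; i++)
            dst[i] = value;
        return;
    }
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* rep stosq stores RAX to [RDI], RCX times, advancing RDI by 8 each
       time. The direction flag is clear on function entry per the
       x86-64 ABI, so the stores go upward in memory. */
    __asm__ volatile("rep stosq"
                     : "+D"(dst), "+c"(n)
                     : "a"(value)
                     : "memory");
#else
    /* Portable fallback for other targets or compilers. */
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
#endif
}
```

A call such as fill_qwords(buf, 0, 1000) would take the rep stosq path,
while fill_qwords(buf, 0, 4) stays in the small loop. The cost of the
call itself on very short fills is part of what the question above
about short loops would need to measure.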