First, thanks to Nikodemus for accepting the patch!
Nathan Froyd wrote:
> On Thu, Mar 18, 2010 at 7:17 AM, Nikodemus Siivola
> <nikodemus@...> wrote:
> > I didn't adopt your technique to MOVE-TO-SINGLE-STACK, as I could not
> > easily come up with a case that exhibited it, and so was unable to
> > evaluate its performance there -- maybe Nathan can post one?
> Um. The particular badness I was seeing (DESCRIPTOR-REG -> XMM ->
> SINGLE-STACK) came up on a toy function some redditor was playing
> around with. I think if you play around with the functions from:
> You can see the MOVE-TO-SINGLE-STACK VOP invoked. I don't think
> Lutz's technique is applicable to that VOP, though.
I concur with Nathan here: this is not applicable to MOVE-TO-SINGLE-STACK.
But neither is it necessary; the VOP works just fine the way it is now.
Nikodemus, please forgive me if I state the obvious, but: the changes
I proposed to MOVE-TO-SINGLE (now MOVE-TO-SINGLE-REG) were for reading
from the control stack, where single floats are stored as tagged
entities with their payload in the upper 32 bits of the 64-bit
stack slot. The access pattern common in function argument passing
violates AMD's recommendation because all 64 bits are first written to
memory and then only the upper 32 bits are read back.
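To make the slot layout concrete, here is a small Python model (not SBCL code) of a tagged single float in a 64-bit control stack slot. The tag value and all function names are made up for illustration; the real widetag is defined by SBCL and is not given in this thread. The second reader is only one conceivable way to avoid the narrow read, and I do not claim it is what the patch does.

```python
import struct

# Hypothetical tag value for illustration only; not SBCL's real widetag.
HYPOTHETICAL_SINGLE_FLOAT_TAG = 0x19

def float_to_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    """Reinterpret an unsigned 32-bit pattern as a single float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def tagged_slot(x):
    """Model the 64-bit store: payload in the upper 32 bits, tag below."""
    word = (float_to_bits(x) << 32) | HYPOTHETICAL_SINGLE_FLOAT_TAG
    return struct.pack("<Q", word)  # little-endian, as on x86-64

def read_upper_half(slot):
    """32-bit load from byte offset 4: a narrow read of a wide store."""
    return bits_to_float(struct.unpack_from("<I", slot, 4)[0])

def read_whole_and_shift(slot):
    """64-bit load followed by a right shift by 32 bits."""
    word = struct.unpack("<Q", slot)[0]
    return bits_to_float(word >> 32)
```

Both readers return the same float; they differ only in the width of the load that follows the 64-bit store.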
In contrast, the single-stack contains untagged floats whose payload
is in the lower 32 bits. Nearly all VOPs that access the
single-stack do so by reading or writing only the lower 32 bits of
the 64-bit stack slot, so there is no problem. In particular,
MOVE-TO-SINGLE-STACK does so, too.
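For comparison, a minimal Python model (again not SBCL code; the function names and the stale byte pattern are invented for illustration) of a single-stack slot, where both the store and the load are 32 bits wide and touch only the low half:

```python
import struct

def write_single_stack(slot, x):
    """32-bit store of the raw float into the low half of the slot."""
    struct.pack_into("<f", slot, 0, x)

def read_single_stack(slot):
    """32-bit load from the low half; the upper half is never touched."""
    return struct.unpack_from("<f", slot, 0)[0]

slot = bytearray(b"\xaa" * 8)   # stale contents from an earlier use
write_single_stack(slot, 2.5)
value = read_single_stack(slot)
upper_half = bytes(slot[4:])    # still the stale bytes
```

Because the store and the load have the same width and address, the upper half of the slot never matters here.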
AFAIK there are only two VOPs that access the single-stack 64-bit-wise:
MAKE-SINGLE-FLOAT and SINGLE-FLOAT-BITS. MAKE-SINGLE-FLOAT only writes,
which poses no problems according to AMD's documentation.
SINGLE-FLOAT-BITS may be problematic in that it reads 64 bits from the
single-stack, which AMD advises against if only 32 bits have been written
to the same address beforehand (this is a "narrow-to-wide store-to-load
forwarding restriction"). I have not yet evaluated how often this case
occurs or whether it poses a performance problem.
The code SINGLE-FLOAT-BITS generates could be improved anyway.
For example, when the source is on the single-stack it generates

    MOV RDX, [RBP-8]
    SHL RDX, 32
    SAR RDX, 32

which is a roundabout way to sign-extend a value;

    MOVSXD RDX, DWORD PTR [RBP-8]

would be better. That change also happens to shorten the read from the
single-stack from 64 to 32 bits, thus addressing the issue at hand.
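The equivalence of the two sequences can be checked with a small Python model of the register arithmetic. This is only a sketch: the function names are invented, and the upper four bytes of the slot are arbitrary filler standing in for whatever was left there before.

```python
import struct

MASK64 = (1 << 64) - 1

def to_signed64(v):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    return v - (1 << 64) if v & (1 << 63) else v

def load_shl_sar(slot, offset=0):
    """Model MOV (64-bit load), then SHL 32, then SAR 32."""
    reg = struct.unpack_from("<Q", slot, offset)[0]  # MOV RDX, [slot]
    reg = (reg << 32) & MASK64                       # SHL RDX, 32
    return to_signed64(reg) >> 32                    # SAR RDX, 32

def load_movsxd(slot, offset=0):
    """Model MOVSXD: 32-bit load, sign-extended to 64 bits."""
    return struct.unpack_from("<i", slot, offset)[0]

# Low half holds the float bits, upper half holds arbitrary filler.
negative = struct.pack("<f", -1.0) + b"\xde\xad\xbe\xef"
positive = struct.pack("<f", 1.0) + b"\xde\xad\xbe\xef"
```

Both functions give the same signed result for either slot, but only the first one needs to load the upper four bytes.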
Unfortunately, I can't promise that I will get around to working on
this in the near future.
All the best,