|
From: Patrick J. L. <lop...@gm...> - 2012-08-21 22:59:04
|
Hi, Julian. I finally decided to take a crack at this but I have some
questions.
I took the code from host_x86_isel.c that handles 64-bit returns from
Ist_Dirty statements on x86, and adapted it like so for 128-bit
returns on x86_64:
--- VEX/priv/host_amd64_isel.c (revision 2472)
+++ VEX/priv/host_amd64_isel.c (working copy)
@@ -3912,6 +3912,15 @@
return;
retty = typeOfIRTemp(env->type_env, d->tmp);
+ if (retty == Ity_I128) {
+ HReg dstHi, dstLo;
+ /* The returned value is in %rdx:%rax. Park it in the
+ register-pair associated with tmp. */
+ lookupIRTempPair( &dstHi, &dstLo, env, d->tmp);
+ addInstr(env, mk_iMOVsd_RR(hregAMD64_RDX(),dstHi) );
+ addInstr(env, mk_iMOVsd_RR(hregAMD64_RAX(),dstLo) );
+ return;
+ }
if (retty == Ity_I64 || retty == Ity_I32
|| retty == Ity_I16 || retty == Ity_I8) {
/* The returned value is in %rax. Park it in the register
This assumes the dirty helper will return an Ity_I128 as a pair of
64-bit numbers, consistent with the x86_64 ABI.
So, first question: Does this patch look reasonable? If so, how
would you feel about applying it right away? :-)
Next, once my dirty helper returns its value as an Ity_I128 temporary
(representing the validity bits for a 128-bit vector load), I need to
move it into an Ity_V128 temporary (the actual shadow register). It
looks like I want an "Iop_I128toV128" operand, which does not exist
(?). But I think I can compose it from Iop_I128to64, Iop_I128HIto64,
and Iop_64HLtoV128. Does this sound right to you?
One more question, this time about procedure. In general, would you
prefer that I propose patches on this mailing list, or that I attach
them to my bug report?
Thanks.
- Pat
|
|
From: Julian S. <js...@ac...> - 2012-08-23 14:00:16
|
> --- VEX/priv/host_amd64_isel.c (revision 2472)
> +++ VEX/priv/host_amd64_isel.c (working copy)
> @@ -3912,6 +3912,15 @@
> return;
>
> retty = typeOfIRTemp(env->type_env, d->tmp);
> + if (retty == Ity_I128) {
> + HReg dstHi, dstLo;
> + /* The returned value is in %rdx:%rax. Park it in the
> + register-pair associated with tmp. */
> + lookupIRTempPair( &dstHi, &dstLo, env, d->tmp);
> + addInstr(env, mk_iMOVsd_RR(hregAMD64_RDX(),dstHi) );
> + addInstr(env, mk_iMOVsd_RR(hregAMD64_RAX(),dstLo) );
> + return;
> + }
> if (retty == Ity_I64 || retty == Ity_I32
> || retty == Ity_I16 || retty == Ity_I8) {
>
> /* The returned value is in %rax. Park it in the register
>
> This assumes the dirty helper will return an Ity_I128 as a pair of
> 64-bit numbers, consistent with the x86_64 ABI.
>
> So, first question: Does this patch look reasonable?
Yes. I take it that the x86_64 ABI guarantees that all 128 bit
values will be returned in RDX:RAX (yes?) so the above is always correct.
> If so, how would you feel about applying it right away? :-)
I would prefer that we work through some more of the details first,
in order to be sure we have a good design. In particular I am concerned
to arrive at something which also works for 256 bit return values,
and for other architectures.
> Next, once my dirty helper returns its value as an Ity_I128 temporary
> (representing the validity bits for a 128-bit vector load), I need to
> move it into an Ity_V128 temporary (the actual shadow register). It
> looks like I want an "Iop_I128toV128" operand, which does not exist
> (?). But I think I can compose it from Iop_I128to64, Iop_I128HIto64,
> and Iop_64HLtoV128. Does this sound right to you?
That would work, but it is inefficient and burdens the IR generators
with having to insert convoluted IR to work around limitations of a
particular back end. A simpler solution would be to state the
return type of the helper to be Ity_V128 (after all, that is what it is,
essentially) and change the back end to generate code to get the
result value into a 128-bit vector class register:
> + addInstr(env, mk_iMOVsd_RR(hregAMD64_RDX(),dstHi) );
> + addInstr(env, mk_iMOVsd_RR(hregAMD64_RAX(),dstLo) );
then push RAX and RDX on the stack and load into a vector register.
Copy whatever is done for Iop_64HLtoV128. Overall, we should try to
do this better using MOVD et al so as to avoid going via memory, but
I am not sure we can do that and still guarantee to generate SSE2-only
insns. Anyway, a problem for a different day.
> One more question, this time about procedure. In general, would you
> prefer that I propose patches on this mailing list, or that I attach
> them to my bug report?
Patches on bug reports are way better, since those on mailing lists
tend to get lost. The mail list is good for kicking people into
doing reviews, taking notice of patch updates, etc, though.
J
|
|
From: Patrick J. L. <lop...@gm...> - 2012-08-23 16:29:59
|
Hi, Julian. Thank you for your reply.
On Thu, Aug 23, 2012 at 6:47 AM, Julian Seward <js...@ac...> wrote:
>
> Yes. I take it that the x86_64 ABI guarantees that all 128 bit
> values will be returned in RDX:RAX (yes?) so the above is always correct.
Short answer: Yes, for our purposes.
Long answer: Doubles are returned in %xmm0. So are SSE2 values
(__m128i). But integer aggregates -- including all arms of Valgrind's
V128 union -- are returned in RDX:RAX according to the ABI.
So, assuming dirty helpers are written in portable C (or "portable"
GCC), and that they only return integers or integer aggregates, then
yes, you can rely on the values being in RDX:RAX.
Of course, there is nothing in principle preventing dirty helpers from
being platform dependent, but I assume you want to avoid that (?)
On a related note, do you prefer portable C, or do you like GCC
extensions? In particular, GCC allows this:
typedef unsigned long V128 __attribute__ ((mode(TI)));
...and then you can say "V128 x = 37; ++x;" and GCC will actually
generate code to perform 128-bit arithmetic on 64- or 32-bit
platforms.
I assume you prefer portable C, especially since the above does not
extend to V256 and beyond, but I thought I would throw it out there.
> I would prefer that we work through some more of the details first,
> in order to be sure we have a good design. In particular I am concerned
> to arrive at something which also works for 256 bit return values,
> and for other architectures.
Well, an ABI is what it is, so there is not much to design if we want
to allow dirty helpers to return V128. The ABI determines this
convention; we just have to implement it for each platform.
In portable C, the representation for anything larger than 64 bits
will always be an aggregate. V128 is already an aggregate; V256 will
be another aggregate; V512 yet another; and so on. If we want helpers
to be able to return V128, V256, etc., we need to implement the ABI
for each platform, whatever it is. And on any platform, at some point
(256 bit return, 512 bit return, 1024 bit return, etc.), the
convention will become "pass a pointer to some stack space and let the
callee fill it in".
An alternative would to implement our own portable convention for
dirty helpers that want to return these types. So instead of:
V128 dirty_helper(ULong arg1, ULong arg2) { ... }
...we would write:
void dirty_helper(ULong arg1, ULong arg2, V128 *result) { ... }
The advantage of this approach is that it would be
platform-independent for arbitrary-size aggregates (V128, V256, etc).
The disadvantage is that it would require an entirely new dirty call
infrastructure, with the ability to marshall IExprs (or whatever)
across this call boundary, presumably via the stack (?).
So I guess my #1 question is this: Which one do you want? Implement
the platform-dependent ABI for each platform to allow helpers to
return V128? Or implement platform-independent marshalling to pass
"V128 *" to helpers?
> A simpler solution would be to state the
> return type of the helper to be Ity_V128 (after all, that is what it is,
> essentially) and change the back end to generate code to get the
> result value into a 128-bit vector class register:
>
>> + addInstr(env, mk_iMOVsd_RR(hregAMD64_RDX(),dstHi) );
>> + addInstr(env, mk_iMOVsd_RR(hregAMD64_RAX(),dstLo) );
>
> then push RAX and RDX on the stack and load into a vector register.
> Copy whatever is done for Iop_64HLtoV128.
I understand. But first I need to know the answer to my question,
especially if you want me to implement this for all platforms.
> Patches on bug reports are way better, since those on mailing lists
> tend to get lost. The mail list is good for kicking people into
> doing reviews, taking notice of patch updates, etc, though.
Once we agree on an approach and I have a patch in some kind of
reasonable state, I will attach it to a bug report.
Thanks again.
- Pat
|
|
From: Julian S. <js...@ac...> - 2012-08-23 17:08:28
|
> Long answer: Doubles are returned in %xmm0. So are SSE2 values > (__m128i). But integer aggregates -- including all arms of Valgrind's > V128 union -- are returned in RDX:RAX according to the ABI. Ok, good. > So, assuming dirty helpers are written in portable C (or "portable" > GCC), and that they only return integers or integer aggregates, then > yes, you can rely on the values being in RDX:RAX. > > Of course, there is nothing in principle preventing dirty helpers from > being platform dependent, but I assume you want to avoid that (?) > > On a related note, do you prefer portable C, or do you like GCC > extensions? In particular, GCC allows this: > > typedef unsigned long V128 __attribute__ ((mode(TI))); > > ...and then you can say "V128 x = 37; ++x;" and GCC will actually > generate code to perform 128-bit arithmetic on 64- or 32-bit > platforms. I prefer portable C. The ability of gcc to generate such 128 bit arithmetic is pretty much useless since we do zero 128 bit arithmetic except for some very obscure multiply cases, perhaps. > I assume you prefer portable C, especially since the above does not > extend to V256 and beyond, Yep. > So I guess my #1 question is this: Which one do you want? Implement > the platform-dependent ABI for each platform to allow helpers to > return V128? Or implement platform-independent marshalling to pass > "V128 *" to helpers? My inclination is to go for the platform-dependent ABI. After all we have to do ABI dependent stuff anyway as it is. There might be one complication though. If this works out, we will have a situation where helpers can return V128/V256 values, so that 128/256 bit shadow loads can be done in a single unit and hence we can do the partial- loads-ok game, which IIUC is the motivation for all this. However, it does mean Memcheck will then have an asymmetry w.r.t. processing shadow stores, in the sense that they will still be chopped up into 64-bit pieces. So it would ideally be nice to have a solution where we can also pass V128 and V256 as arguments. Is that doable also using the platform dependent ABIs? IIUC it's just more of the same game of passing the args on the stack, right? J |
|
From: Patrick J. L. <lop...@gm...> - 2012-08-23 17:54:10
|
On Thu, Aug 23, 2012 at 9:55 AM, Julian Seward <js...@ac...> wrote: > > There might be one complication though. If this works out, we will have > a situation where helpers can return V128/V256 values, so that 128/256 bit > shadow loads can be done in a single unit and hence we can do the partial- > loads-ok game, which IIUC is the motivation for all this. However, it does > mean Memcheck will then have an asymmetry w.r.t. processing shadow stores, > in the sense that they will still be chopped up into 64-bit pieces. So it > would ideally be nice to have a solution where we can also pass V128 and V256 > as arguments. Is that doable also using the platform dependent ABIs? IIUC > it's just more of the same game of passing the args on the stack, right? Doable, yes. But it does require implementation for every platform. Other than aesthetics, is there a technical reason for wanting this? (For example, do you think the performance would be improved for 128-bit stores?) It just seems like a bunch of work, both implementation and maintenance, if it is not something we technically need. Also, do you want my initial implementation to include V256 -- which, as far as I can tell, does not yet exist as a typedef -- or is V128 enough? - Pat |
|
From: Julian S. <js...@ac...> - 2012-08-23 18:21:10
|
> Other than aesthetics, is there a technical reason for wanting this? > (For example, do you think the performance would be improved for > 128-bit stores?) True, partly for aesthetics, but also partly because as you say, it would probably improve performance for 128/256 bit stores. Anyway, that's something for the future; not now. > Also, do you want my initial implementation to include V256 -- which, > as far as I can tell, does not yet exist as a typedef -- or is V128 > enough? I'm unclear what you mean. Do you mean should you introduce a V256 type, or do 256 bit returns? Ideally both. With 128 bit returns only, we will have to split 256s into two 128s, which still leaves the original style overrun problems in cases where vectorising compilers (icc) generates AVX vector loops. J |
|
From: Patrick J. L. <lop...@gm...> - 2012-08-24 01:08:10
|
On Thu, Aug 23, 2012 at 11:21 AM, Julian Seward <js...@ac...> wrote: > >> Also, do you want my initial implementation to include V256 -- which, >> as far as I can tell, does not yet exist as a typedef -- or is V128 >> enough? > > I'm unclear what you mean. Do you mean should you introduce a V256 type, > or do 256 bit returns? Ideally both. I was sort of conflating the two, assuming that if we handle 256-bit return values, it would be to allow helpers to return "V256". I will take a crack at V128 and V256 return values for x86 and x86_64 early next week. Last question: Do I need to implement the ABI for all supported platforms? I do not even have access to most of them... - Pat |