|
From: Patrick J. L. <lop...@gm...> - 2012-02-21 22:38:20
|
On Tue, Feb 21, 2012 at 1:12 PM, Julian Seward <js...@ac...> wrote:
>
> Why are we not having this discussion on developers@ ? There are
> other people who lurk there and have views on this stuff.
No reason at all :-). I have added valgrind-developers to the recipient list.
> On Tuesday, February 21, 2012, you wrote:
>> I did not understand the JIT well enough to "teach the backends to
>> generate code so that the helpers can write the results into a 16 byte
>> area". I agree that would be cleaner, and in fact it was my first
>> thought... But I got stuck trying to figure out how to do it :-)
>>
>> I certainly agree we need to be thinking about 256-bit, 512-bit, and
>> beyond. I tried to use a scheme that would extend in a
>> straightforward, if "dirty", way.
>>
>> Anyhow, if you can provide some sort of template for returning a
>> 128-bit value from a helper, I can take a crack at filling in the
>> details.
>
> Well, it also entails passing 128 bit values to helpers; not just
> returns.
My first attempt handled that by passing two 64-bit arguments "ULong
al, ULong ah" (effective but sloppy). But I could not see how to
return a pair of values from a helper, so I wound up storing static
state and returning it in two separate helper calls (effective but
very, very sloppy).
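To make the "two 64-bit halves" scheme concrete, here is a minimal sketch with a hypothetical helper (the name and operation are illustrative, not from VEX): the caller splits a 128-bit value into low and high ULongs and passes them as separate arguments.

```c
typedef unsigned long long ULong;

/* Hypothetical helper illustrating the sloppy-but-effective scheme:
   a 128-bit input arrives as two 64-bit halves.  This one just counts
   the set bits across both halves (GCC builtin). */
static ULong helper_PopCount128 ( ULong aLo, ULong aHi )
{
   return (ULong)__builtin_popcountll(aLo)
        + (ULong)__builtin_popcountll(aHi);
}
```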
> Let me contemplate it more .. it's not entirely simple, and I'm up to
> my ears in review requests etc. Plus out of town on Friday. Basically
> it's a case of messing with doHelperCall in host_amd64_isel.c, and in
> particular its handling of the argument array (the IRExpr** args bit).
>
> When it comes to an arg of type Ity_V128, instead of just asserting, it
> needs to emit insns to push the arg on the stack and pass an argument
> to that location. And if the return type is Ity_V128 then it needs to
> preallocate a 16 byte area on the stack and pass a pointer to that as
> the first arg. Or something.
>
> What this means is that there has to be a certain discipline in writing
> the helpers themselves (which we'll need to decide on):
>
> * if a helper "returns" a V128, then it will be given a pointer to it
> as the first arg, and it must park the result there
>
> * if a helper takes a V128 arg, then it gets a pointer to it in the
> relevant arg position. But this is a readonly arg; writes to it
> will be ignored. (IOW we can't use the standard C scheme of
> getting a pointer to something which the routine both reads and
> writes)
What we want, I think, is a data type to represent 128-bit values,
just as "ULong" represents 64-bit values now. That could just be a
typedef for a struct containing an array of 2 ULongs. (Or 4 ULongs,
or 8, etc., for larger vectors.)
Another option, I suppose, is to use GCC's __attribute__((mode(TI)))
to get a 128-bit integer type. Not sure how tightly coupled you want
to be to GCC-isms, but it might allow direct arithmetic on the V128 in C.
Anyway, the arguments to the helpers can be declared "const V128 *",
making it clear that they are read-only, at least for someone looking
at the helpers for the first time.
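Putting the proposed conventions together, a minimal sketch (the type layout and helper name are illustrative, not existing VEX declarations): the result is parked through a pointer passed as the first argument, and vector inputs arrive as read-only pointers.

```c
typedef unsigned long long ULong;

/* Sketch of a 128-bit value type as a struct of two ULongs. */
typedef struct { ULong w64[2]; } V128;

/* Hypothetical helper following the proposed discipline: the result
   goes through the first (output) pointer; the inputs are read-only. */
static void helper_Xor_V128 ( /*OUT*/ V128 *res,
                              const V128 *argL, const V128 *argR )
{
   res->w64[0] = argL->w64[0] ^ argR->w64[0];
   res->w64[1] = argL->w64[1] ^ argR->w64[1];
}
```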
> Anyway, you can maybe try to hack it up if you are crazy enough. Dodging
> around the extra complexities of "slow" vs "fast" marshalling (see big
> comment on the function) and of also passing the baseblock pointer ("BBP",
> ditto) should keep you suitably entertained for a while. Generating calling
> code which is both fast and correct is difficult.
>
> Getting this sorted out properly would remove the biggest
> difficulty in fixing the --partial-loads-ok swamp and handling SSEised
> code better, so I am all in favour of it really.
I may take a crack at it, but I am traveling myself next weekend. If
you decide to work on it yourself, please let me know so I do not
"entertain" myself needlessly...
> If you want to get into this, you'll need to make friends with
> --trace-flags=00001000 (for the final IR), 00000100 for the
> vregisterised insns, and 00000010 for the real-registerised insns.
>
>
>> But... Let's get my proposed patch for the 8-byte case sorted first (?)
>
> Yes.
>
>> I guess my question still is: Do you agree with my proposed semantic
>> change to partial_loads_ok?
>
> I think in principle I do. And I see you cut down the patch so it
> only messes with mc_LOADVn_slow -- thanks for that.
>
> So .. what would actually make me happy (and more inclined to take the
> patch) is a test set (more than just one) that comprehensively exercises
> your new logic and makes it clear that it actually works right. Memcheck
> is complex and I'm very reluctant to take changes to it without some
> way to know we aren't regressing. Also .. bear in mind that although
> you may be interested only in SSE, I have to also be interested in keeping
> it working properly on all targets, and that tends to add complexity.
I understand. I will see if I can come up with some more test cases
and attach them to my bug report
(https://bugs.kde.org/show_bug.cgi?id=294523). Although at this
point, the patch is so small that it "obviously" does not change any
behavior unless --partial-loads-ok=yes.
Another option is to make --partial-loads-ok a tri-state: "no", "yes",
and "conservative". "no" and "yes" can mean the same as they do now,
while "conservative" can mean the behavior I want (i.e., allow
partially-accessible aligned loads but mark the inaccessible bytes as
undefined). The advantage of the tri-state is that it would be 100%
backward-compatible. The disadvantage is that it would add a
maintenance burden, both now and going forward with 128-bit loads and
beyond. Personally, I would just make "yes" behave the way I want,
because false negatives are so dangerous... But I will implement the
tri-state if you think that is better.
...
Ultimately, I am looking for three things, in this order:
1) Eliminate false negatives for --partial-loads-ok=yes
2) Implement --partial-loads-ok=yes for SSE loads
3) Correctly propagate validity bits for PMOVMSKB
It is turning out to be quite hard to eliminate Memcheck's false
positives on my current code base (using the Intel compiler with full
SSE4.2 optimization). And it is only going to get worse as compilers
get smarter. I believe these three things together would eliminate
most -- possibly all -- of my false positives without introducing any
false negatives.
I am willing to do the work to make these happen, although of course
any help would be welcome.
Thanks!
- Pat
|
|
From: Julian S. <js...@ac...> - 2012-02-22 11:53:58
|
> > What this means is that there has to be a certain discipline in writing
> > the helpers themselves (which we'll need to decide on):
> >
> > * if a helper "returns" a V128, then it will be given a pointer to it
> > as the first arg, and it must park the result there
> >
> > * if a helper takes a V128 arg, then it gets a pointer to it in the
> > relevant arg position. But this is a readonly arg; writes to it
> > will be ignored. (IOW we can't use the standard C scheme of
> > getting a pointer to something which the routine both reads and
> > writes)
So .. the above is nonsense, really. The idea that VEX's code generation
for calls is limited to functions that take machine words as args and return
one or two words as results, is based on the incorrect assumption that
C doesn't allow passing and returning arbitrary structs by value.
But it does, and gcc compiles the test fragments below into almost
sane assembly (on x86, amd64 and arm), especially if you tell it
-fno-stack-protector.
So we can forget about the above conventions, and instead expand
VEX's call/return code generation to be able to handle passing and
returning V128s by value in accordance with whatever the relevant
platform ABIs specify. (Which is presumably what GCC is generating
for.)
I guess the scary thing is if either the ABIs don't specify how to
pass/return structs by value, or different OSs on the same architecture
do it differently, or GCC doesn't comply with them somehow.
> I may take a crack at it, but I am traveling myself next weekend. If
> you decide to work on it yourself, please let me know so I do not
> "entertain" myself needlessly...
This is not on my stuff-to-hack-on-soon list, so you're OK for the
time being.
J
typedef struct { unsigned char c[16]; } V128;

__attribute__((regparm(3)))
int takesV128 ( int x, V128 y, int z ) {
   int i, s = 0;
   for (i = 0; i < 16; i++) s += y.c[i];
   return x + s + z;
}

__attribute__((regparm(1)))
V128 returnsV128 ( char c ) {
   int i;
   V128 v;
   for (i = 0; i < 16; i++) v.c[i] = (c+i) & 0xFF;
   return v;
}

__attribute__((regparm(2)))
V128 vectorSum_16x8 ( V128 a, V128 b ) {
   int i;
   V128 v;
   for (i = 0; i < 16; i++) v.c[i] = a.c[i] + b.c[i];
   return v;
}
|
|
From: Patrick J. L. <lop...@gm...> - 2012-02-22 15:42:58
|
On Wed, Feb 22, 2012 at 3:52 AM, Julian Seward <js...@ac...> wrote:
>
> So .. the above is nonsense, really. The idea that VEX's code generation
> for calls is limited to functions that take machine words as args and return
> one or two words as results, is based on the incorrect assumption that
> C doesn't allow passing and returning arbitrary structs by value.
Oh! I just assumed you had a reason for this limitation :-)
> So we can forget about the above conventions, and instead expand
> VEX's call/return code generation to be able to handle passing and
> returning V128s by value in accordance with whatever the relevant
> platform ABIs specify. (Which is presumably what GCC is generating
> for.)
Yes, this should all be documented by the ABI specs for the various
platforms. AMD64 Unix is here:
http://www.x86-64.org/documentation/
(In particular, see section 3.2. __int128 is at least somewhat
interesting.)
- Pat
|
|
From: Patrick J. L. <lop...@gm...> - 2012-02-22 17:15:53
|
On Wed, Feb 22, 2012 at 3:52 AM, Julian Seward <js...@ac...> wrote:
>
>> > * if a helper "returns" a V128, then it will be given a pointer to it
>> > as the first arg, and it must park the result there
>> >
>> > * if a helper takes a V128 arg, then it gets a pointer to it in the
>> > relevant arg position. But this is a readonly arg; writes to it
>> > will be ignored. (IOW we can't use the standard C scheme of
>> > getting a pointer to something which the routine both reads and
>> > writes)
By the way, for the helpers I need, passing a V128 is not needed,
because addresses are still "only" 64 bits. So if and when I do take a
crack at this, I will most likely implement the return path first.
- Pat
|
|
From: Julian S. <js...@ac...> - 2012-02-23 09:14:44
|
On Wednesday, February 22, 2012, Patrick J. LoPresti wrote:
> On Wed, Feb 22, 2012 at 3:52 AM, Julian Seward <js...@ac...> wrote:
> >> > * if a helper "returns" a V128, then it will be given a pointer to it
> >> > as the first arg, and it must park the result there
> >> >
> >> > * if a helper takes a V128 arg, then it gets a pointer to it in the
> >> > relevant arg position. But this is a readonly arg; writes to it
> >> > will be ignored. (IOW we can't use the standard C scheme of
> >> > getting a pointer to something which the routine both reads and
> >> > writes)
>
> By the way, for the helpers I need, passing a V128 is not needed,
> because addresses are still "only" 64 bits.
>
> So if and when I do take a crack at this, I will most likely implement
> the return path first.
Sure. Of course, you'll wind up with a temporary asymmetry between how
the 128 bit load and store helpers work, until such time as the store
path uses direct V128 passing. But that's no big deal.
Yeah, so looks like x86_64 ELF uses rdx:rax for 128 bit integer
returns, which means you could probably copy the relevant code
fragments more or less directly from host_x86_isel.c (the 32 bit insn
selector), since 64 bit returns on x86 are done in edx:eax. If you see
what I mean.
J
|
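The rdx:rax return convention mentioned above can be seen directly with GCC's 128-bit integer type on x86_64 (compiler- and target-specific; the function name here is just an illustration): the low 64 bits of the result come back in rax and the high 64 bits in rdx, analogous to edx:eax for 64-bit returns on 32-bit x86.

```c
/* GCC/Clang on x86_64: unsigned __int128 results are returned in
   rdx:rax per the System V AMD64 ABI. */
typedef unsigned __int128 U128;

static U128 make_U128 ( unsigned long long hi, unsigned long long lo )
{
   return ((U128)hi << 64) | lo;   /* hi ends up in rdx, lo in rax */
}
```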
|
From: Julian S. <js...@ac...> - 2012-02-23 08:18:27
|
> Another option is to make --partial-loads-ok a tri-state: "no", "yes",
> and "conservative". "no" and "yes" can mean the same as they do now,
> while "conservative" can mean the behavior I want (i.e., allow
> partially-accessible aligned loads but mark the inaccessible bytes as
> undefined). The advantage of the tri-state is that it would be 100%
> backward-compatible. The disadvantage is that it would add a
> maintenance burden, both now and going forward with 128-bit loads and
> beyond. Personally, I would just make "yes" behave the way I want,
I tend to agree, providing it does not cause serious problems when
tested. Another disadvantage of adding a third option is that we would
have to explain the semantics to users, and they are in general already
confused enough by the plethora of options available.
> Ultimately, I am looking for three things, in this order:
>
> 1) Eliminate false negatives for --partial-loads-ok=yes
> 2) Implement --partial-loads-ok=yes for SSE loads
> 3) Correctly propagate validity bits for PMOVMSKB
Sounds sane.
> It is turning out to be quite hard to eliminate Memcheck's false
> positives on my current code base (using the Intel compiler with full
> SSE4.2 optimization).
What specific problems are you having?
In my uncommitted work on this, I added a new IROp
   Iop_GetMSBs8x8,   /* I64 -> I8 */
with behaviour
+UInt h_generic_calc_GetMSBs8x8 ( ULong xx )
+{
+ UInt r = 0;
+ if (xx & (1ULL << (64-1))) r |= (1<<7);
+ if (xx & (1ULL << (56-1))) r |= (1<<6);
+ if (xx & (1ULL << (48-1))) r |= (1<<5);
+ if (xx & (1ULL << (40-1))) r |= (1<<4);
+ if (xx & (1ULL << (32-1))) r |= (1<<3);
+ if (xx & (1ULL << (24-1))) r |= (1<<2);
+ if (xx & (1ULL << (16-1))) r |= (1<<1);
+ if (xx & (1ULL << ( 8-1))) r |= (1<<0);
+ return r;
+}
and used that to implement pmovmskb, rather than using a helper
function in the front end. This allows Memcheck to instrument it
exactly, since the same operation exactly describes the V bit
propagation. Back end then generates calls to this function
and that handles both the real computation and Memcheck's
instrumentation of it. LMK if you want the diff.
J
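For reference, the helper quoted in the diff above can be exercised standalone once the diff markers are stripped (type names approximated with plain C typedefs): it gathers the most significant bit of each of the eight bytes of xx into the low 8 bits of the result, which is exactly the PMOVMSKB semantics for a 64-bit MMX register.

```c
typedef unsigned int UInt;
typedef unsigned long long ULong;

/* Collect the MSB of each byte of xx into an 8-bit mask, byte 7's
   MSB in bit 7 down to byte 0's MSB in bit 0 (PMOVMSKB behaviour). */
static UInt h_generic_calc_GetMSBs8x8 ( ULong xx )
{
   UInt r = 0;
   if (xx & (1ULL << (64-1))) r |= (1<<7);
   if (xx & (1ULL << (56-1))) r |= (1<<6);
   if (xx & (1ULL << (48-1))) r |= (1<<5);
   if (xx & (1ULL << (40-1))) r |= (1<<4);
   if (xx & (1ULL << (32-1))) r |= (1<<3);
   if (xx & (1ULL << (24-1))) r |= (1<<2);
   if (xx & (1ULL << (16-1))) r |= (1<<1);
   if (xx & (1ULL << ( 8-1))) r |= (1<<0);
   return r;
}
```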
|
|
From: Patrick J. L. <lop...@gm...> - 2012-02-24 00:45:40
|
On Thu, Feb 23, 2012 at 12:17 AM, Julian Seward <js...@ac...> wrote:
>
> > Ultimately, I am looking for three things, in this order:
> >
> > 1) Eliminate false negatives for --partial-loads-ok=yes
> > 2) Implement --partial-loads-ok=yes for SSE loads
> > 3) Correctly propagate validity bits for PMOVMSKB
>
> Sounds sane.
>
> > It is turning out to be quite hard to eliminate Memcheck's false
> > positives on my current code base (using the Intel compiler with full
> > SSE4.2 optimization).
>
> What specific problems are you having?
Mostly strlen(). Every time you call this function, the Intel compiler
inlines a version that chops off the low bits of the address, loads 16
aligned bytes, and uses PCMPEQB + PMOVMSKB + BSF to see if the first
null character of the string occurs in those 16 bytes. If not, it
then calls an optimized strlen() function. (The optimized strlen()
function is easily suppressed, but all the inlined code is much
harder.)
I say "mostly" because there are occasional warnings from
__intel_memcpy() and such, and I have not examined the underlying
assembly thoroughly. I have been steadily adding suppression rules
based on instinct, but there are now so many that I am worried I may
have silenced real problems.
> In my uncommitted work on this, I added a new IROp
> Iop_GetMSBs8x8, /* I64 -> I8 */ with behaviour
[snip]
> and used that to implement pmovmskb, rather than using a helper
> function in the front end. This allows Memcheck to instrument it
> exactly, since the same operation exactly describes the V bit
> propagation. Back end then generates calls to this function
> and that handles both the real computation and Memcheck's
> instrumentation of it. LMK if you want the diff.
I am definitely interested. I think this will add great value after
we have (1) and (2).
I am traveling this weekend, but I will take a first crack at (2)
next week. Thanks.
- Pat
|
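The inlined-strlen() idiom described above can be sketched with SSE2 intrinsics (x86-specific; the function name is hypothetical, not what the Intel compiler actually emits): round the pointer down to a 16-byte boundary, do one aligned load, and use PCMPEQB + PMOVMSKB + BSF to look for a NUL in that block. The aligned load may touch bytes outside the string, which is precisely the pattern --partial-loads-ok exists for.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Scan the 16-byte aligned block containing s for a NUL.  On success,
   set *found and return the string length; otherwise return the number
   of bytes scanned so far (caller would fall back to a full strlen). */
static size_t sse_strlen_first_block ( const char *s, int *found )
{
   uintptr_t addr = (uintptr_t)s;
   const __m128i *block = (const __m128i *)(addr & ~(uintptr_t)15);
   int offset = (int)(addr & 15);

   __m128i v  = _mm_load_si128(block);                   /* aligned 16-byte load */
   __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());  /* PCMPEQB against zero */
   int mask   = _mm_movemask_epi8(eq);                   /* PMOVMSKB */

   mask >>= offset;                 /* ignore bytes before the string start */
   if (mask) {
      *found = 1;
      return (size_t)__builtin_ctz(mask);  /* BSF: index of the first NUL */
   }
   *found = 0;
   return (size_t)(16 - offset);
}
```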