|
From: Patrick J. L. <lop...@gm...> - 2012-10-18 21:51:15
|
I have just filed the following two bugs: bsfl validity bit propagation is imprecise https://bugs.kde.org/show_bug.cgi?id=308626 pmovmskb validity bit propagation is imprecise https://bugs.kde.org/show_bug.cgi?id=308627 I would like to take a crack at fixing these myself. (I believe they should be easier than my main problem in https://bugs.kde.org/show_bug.cgi?id=294285) Could someone familiar with the source give me a few words to point me in the right direction? For example, what file(s) I will likely have to edit and (roughly) what to look for inside them? Thank you. - Pat |
|
From: Julian S. <js...@ac...> - 2012-10-19 09:58:15
|
Timely. Fixing the memcheck+vectorised code swamp properly is approaching the top of my priority list. Certainly it is something that needs to happen before the next major release. > pmovmskb validity bit propagation is imprecise > https://bugs.kde.org/show_bug.cgi?id=308627 A couple of years back I hacked up a fix to this, but never landed it. It needs rebasing, generalising to 128- and 256-bit cases, and testing properly, but I think it basically works. I'll attach the patch to the bug. > bsfl validity bit propagation is imprecise > https://bugs.kde.org/show_bug.cgi?id=308626 Have a look at commented out verbose_Clz32() at the end of guest_ppc_toIR.c. It might be useful (as a source of ideas). It relies on the fact that Memcheck tracks definedness exactly correctly for left shifts, right shifts, and, or and not. J |
|
From: Stephen M. <sm...@al...> - 2012-10-19 17:34:36
|
PJL> bsfl validity bit propagation is imprecise PJL> https://bugs.kde.org/show_bug.cgi?id=308626 >>>>> "Julian" == Julian Seward <js...@ac...> writes: Julian> Have a look at commented out verbose_Clz32() at the end of Julian> guest_ppc_toIR.c. It might be useful (as a source of ideas). Julian> It relies on the fact that Memcheck tracks definedness exactly Julian> correctly for left shifts, right shifts, and, or and not. The propagation part of that, which just uses bitwise operations, should be completely precise. However my intuition would be that the bitcount part, which uses addition, would sometimes introduce imprecision even if you make Memcheck use the more precise carry-aware translation. For Patrick's example it would probably go OK because the undefinedness would be eliminated during the propagation. Another approach that had occurred to me would be to make the "lower bound" value where all the undefined bits are replaced by zero, and the "upper bound" value where they're replaced by one, and run Ctz on both of them. The case where the results are the same is the important one where it's completely defined. (However this still isn't completely precise: for instance consider Ctz8(010?0?0?) = 00000??0.) -- Stephen |
|
From: John R. <jr...@bi...> - 2012-10-19 18:19:29
|
> consider Ctz8(010?0?0?) = 00000??0. ctz8 with actual Uninit input bits should be implemented by a helper function of two inputs (the operand, the bitmask of Valid bits) which uses table lookup. If the host lacks ctz8 (or similar) then the helper function should be called regardless of bit state. -- |
|
From: Lok K. Y. <lo...@sy...> - 2012-10-19 19:40:09
|
Can you please help me understand the problem a bit more by answering the following questions? 1. Where is addition being used in CTZ - and especially expensiveCTZ. From the code, I only see subtraction being used. 2. Why is PCast used for CTZ (and subsequently for BSF) if all of the upper bits are 0 as long as the operand is non-zero? That is, why not and the PCast with 0x1E since 31 is the maximum output value. 3. In the case of BSF, what about PCast( BSF(va) <= BSF(a) )? Enjoy, Lok ________________________________________ From: Stephen McCamant [sm...@al...] Sent: Friday, October 19, 2012 1:34 PM To: Julian Seward Cc: val...@li... Subject: Re: [Valgrind-developers] bsfl and pmovmskb validity bit propagation PJL> bsfl validity bit propagation is imprecise PJL> https://bugs.kde.org/show_bug.cgi?id=308626 >>>>> "Julian" == Julian Seward <js...@ac...> writes: Julian> Have a look at commented out verbose_Clz32() at the end of Julian> guest_ppc_toIR.c. It might be useful (as a source of ideas). Julian> It relies on the fact that Memcheck tracks definedness exactly Julian> correctly for left shifts, right shifts, and, or and not. The propagation part of that, which just uses bitwise operations, should be completely precise. However my intuition would be that the bitcount part, which uses addition, would sometimes introduce imprecision even if you make Memcheck use the more precise carry-aware translation. For Patrick's example it would probably go OK because the undefinedness would be eliminated during the propagation. Another approach that had occurred to me would be to make the "lower bound" value where all the undefined bits are replaced by zero, and the "upper bound" value where they're replaced by one, and run Ctz on both of them. The case where the results are the same is the important one where it's completely defined. (However this still isn't completely precise: for instance consider Ctz8(010?0?0?) = 00000??0.) -- Stephen ------------------------------------------------------------------------------ Everyone hates slow websites. So do we. Make your web apps faster with AppDynamics Download AppDynamics Lite for free today: http://p.sf.net/sfu/appdyn_sfd2d_oct _______________________________________________ Valgrind-developers mailing list Val...@li... https://lists.sourceforge.net/lists/listinfo/valgrind-developers |
|
From: Patrick J. L. <lop...@gm...> - 2012-11-05 01:32:25
|
On Fri, Oct 19, 2012 at 12:24 PM, Lok Kwong Yan <lo...@sy...> wrote: > > 3. In the case of BSF, what about PCast( BSF(va) <= BSF(a) )? I ended up using this approach. I have added two new attachments to <https://bugs.kde.org/show_bug.cgi?id=308627>. They re-base Julian's old patches for VEX and memcheck, respectively, to current SVN. I would appreciate a code review and also some advice. This new VEX+memcheck does perform better validity bit propagation for Iop_Ctz32 and Iop_Ctz64. But guest_amd64_to_IR contains the following unfortunate code in dis_bs_E_G (which handles BSF): /* Generate an 8-bit expression which is zero iff the original is zero, and nonzero otherwise */ assign( src8, unop(Iop_1Uto8, binop(Iop_CmpNE64, mkexpr(src64), mkU64(0))) ); This emits a CmpNE64 instruction against the BSF argument. Later that result goes to IRExpr_Mux0X() to handle the semantics for a zero input. The net result is that memcheck thinks the entire bsfl result is always invalid if any input bits are invalid, even though Iop_Ctz32 and Iop_Ctz64 now propagate the validity bits correctly. What is the right way to handle this? I have confirmed that this is the problem by forcing memcheck to call expensiveCmpEQorNE() unconditionally for Iop_CmpNE64. This allows it to see that any valid "1" bit means the value is not zero, and this actually lets my "bsfl" test case pass. Thanks. - Pat |
|
From: Stephen M. <sm...@al...> - 2012-10-19 20:32:32
|
>>>>> "LY" == Lok Kwong Yan <lo...@sy...> writes:
LY> Can you please help me understand the problem a bit more by
LY> answering the following questions?
LY> 1. Where is addition being used in CTZ - and especially
LY> expensiveCTZ. From the code, I only see subtraction being used.
I think we may be looking at different code. The addition I was
referring to is in the commented-out verbose_Clz32 at the bottom of
guest_ppc_toIR.c. It uses addition to implement an efficient parallel
add for counting the number of 1 bits in a word, as in the following C
code in the comment:
// /* unsigned int bitcount ( unsigned int n )
// {
// n = n - ((n >> 1) & 0x55555555);
// n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
// n = (n + (n >> 4)) & 0x0F0F0F0F;
// n = n + (n >> 8);
// n = (n + (n >> 16)) & 0x3F;
// return n;
// }
// */
For instance there's a "binop(Iop_Add32" on line 17863 of the current
SVN revision r2537.
LY> 2. Why is PCast used for CTZ (and subsequently for BSF) if all of
LY> the upper bits are 0 as long as the operand is non-zero? That
LY> is, why not and the PCast with 0x1E since 31 is the maximum
LY> output value.
I believe you're right that ANDing the PCast with 31 would be sound
and improve precision; my guess would be that the current
implementation of a plain PCast was chosen simply for ease of
implementation under the assumption that Ctz would not be commonly
used on partially-defined values.
LY> 3. In the case of BSF, what about PCast( BSF(va) <= BSF(a) )?
I think an approach like that could work pretty well too. It captures
the same defined-result common case that I agree is important, and it
uses fewer additional operations (though still two Ctzs) compared to
the sort of thing I was thinking of.
-- Stephen
|
|
From: Lok K. Y. <lo...@sy...> - 2012-10-20 00:35:36
|
> LY> Can you please help me understand the problem a bit more by
> LY> answering the following questions?
>
> LY> 1. Where is addition being used in CTZ - and especially
> LY> expensiveCTZ. From the code, I only see subtraction being used.
>
I> think we may be looking at different code. The addition I was
> referring to is in the commented-out verbose_Clz32 at the bottom of
> guest_ppc_toIR.c. It uses addition to implement an efficient parallel
> add for counting the number of 1 bits in a word, as in the following C
> code in the comment:
Thank you for the clarification. It makes more sense now.
I don't think the additions should affect the precision of the resulting VBits. The VBits calculation do not have to be calculated exactly as bitcount is calculated. In the case of expensiveCTZ addition is no longer used. Or in the case of the BSF(va) <= BSF(a), only the output of bitcount is used and not the output VBits of the operations used to emulate bitcount.
>I believe you're right that ANDing the PCast with 31 would be sound
>and improve precision; my guess would be that the current
>implementation of a plain PCast was chosen simply for ease of
>implementation under the assumption that Ctz would not be commonly
>used on partially-defined values.
>I think an approach like that could work pretty well too. It captures
>the same defined-result common case that I agree is important, and it
>uses fewer additional operations (though still two Ctzs) compared to
>the sort of thing I was thinking of.
Thanks for the clarification again. I guess I forgot about the need for good performance.
Enjoy,
Lok
________________________________________
From: Stephen McCamant [sm...@cs...] on behalf of Stephen McCamant [sm...@al...]
Sent: Friday, October 19, 2012 4:32 PM
To: Lok Kwong Yan
Cc: val...@li...
Subject: Re: [Valgrind-developers] bsfl and pmovmskb validity bit propagation
>>>>> "LY" == Lok Kwong Yan <lo...@sy...> writes:
LY> Can you please help me understand the problem a bit more by
LY> answering the following questions?
LY> 1. Where is addition being used in CTZ - and especially
LY> expensiveCTZ. From the code, I only see subtraction being used.
I think we may be looking at different code. The addition I was
referring to is in the commented-out verbose_Clz32 at the bottom of
guest_ppc_toIR.c. It uses addition to implement an efficient parallel
add for counting the number of 1 bits in a word, as in the following C
code in the comment:
// /* unsigned int bitcount ( unsigned int n )
// {
// n = n - ((n >> 1) & 0x55555555);
// n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
// n = (n + (n >> 4)) & 0x0F0F0F0F;
// n = n + (n >> 8);
// n = (n + (n >> 16)) & 0x3F;
// return n;
// }
// */
For instance there's a "binop(Iop_Add32" on line 17863 of the current
SVN revision r2537.
LY> 2. Why is PCast used for CTZ (and subsequently for BSF) if all of
LY> the upper bits are 0 as long as the operand is non-zero? That
LY> is, why not and the PCast with 0x1E since 31 is the maximum
LY> output value.
I believe you're right that ANDing the PCast with 31 would be sound
and improve precision; my guess would be that the current
implementation of a plain PCast was chosen simply for ease of
implementation under the assumption that Ctz would not be commonly
used on partially-defined values.
LY> 3. In the case of BSF, what about PCast( BSF(va) <= BSF(a) )?
I think an approach like that could work pretty well too. It captures
the same defined-result common case that I agree is important, and it
uses fewer additional operations (though still two Ctzs) compared to
the sort of thing I was thinking of.
-- Stephen
|