|
From: Stephen M. <sm...@al...> - 2012-10-19 20:32:32
|
>>>>> "LY" == Lok Kwong Yan <lo...@sy...> writes:
LY> Can you please help me understand the problem a bit more by
LY> answering the following questions?
LY> 1. Where is addition being used in CTZ - and especially
LY> expensiveCTZ. From the code, I only see subtraction being used.
I think we may be looking at different code. The addition I was
referring to is in the commented-out verbose_Clz32 at the bottom of
guest_ppc_toIR.c. It uses addition to implement an efficient parallel
add for counting the number of 1 bits in a word, as in the following C
code in the comment:
// /* unsigned int bitcount ( unsigned int n )
// {
// n = n - ((n >> 1) & 0x55555555);
// n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
// n = (n + (n >> 4)) & 0x0F0F0F0F;
// n = n + (n >> 8);
// n = (n + (n >> 16)) & 0x3F;
// return n;
// }
// */
For instance there's a "binop(Iop_Add32" on line 17863 of the current
SVN revision r2537.
LY> 2. Why is PCast used for CTZ (and subsequently for BSF) if all of
LY> the upper bits are 0 as long as the operand is non-zero? That
LY> is, why not and the PCast with 0x1E since 31 is the maximum
LY> output value.
I believe you're right that ANDing the PCast with 31 would be sound
and improve precision; my guess would be that the current
implementation of a plain PCast was chosen simply for ease of
implementation under the assumption that Ctz would not be commonly
used on partially-defined values.
LY> 3. In the case of BSF, what about PCast( BSF(va) <= BSF(a) )?
I think an approach like that could work pretty well too. It captures
the same defined-result common case that I agree is important, and it
uses fewer additional operations (though still two Ctzs) compared to
the sort of thing I was thinking of.
-- Stephen
|