|
From: Vince W. <vi...@cs...> - 2007-03-22 21:47:16
|
Hello,
I know in general I shouldn't expect floating point to be _exact_, but
I've found a problem where valgrind is just slightly off and it
significantly affects results.
I've made a valgrind plugin that calculates Basic Block Vectors for use
with the Simpoint analysis tool. It gets an instruction count using
methods similar to cachegrind and I've validated it with performance
counters on a P3 system (one special case has to be added; with the
"rep" prefix and string instructions an actual machine counts up to 4096
reps as one instruction retired, not as 4096 separate ones).
In any case I've run this on the spec2k benchmarks, and all of them are
close except for art. Under valgrind, the art benchmark finishes in
about half the number of instructions it should.
It turns out that art is using the "==" operator to compare two floating
point numbers. And valgrind returns values that have the LSB wrong
on 64-bit fmul and fadd instructions. This is enough to make the program
finish early.
Looking through the valgrind code, I am guessing maybe this is a problem
with the rounding mode, but I haven't been able to track down a good fix.
I've attached code after this that shows the problem.
On a native system I get
xr=0.426335 qr=0.505253
3fdb4914520a783a
3fe02b07a0efb19b
v=5.478862 4015ea5ace4c4585
Under valgrind with --tool=none I get
xr=0.426335 qr=0.505253
3fdb4914520a783a
3fe02b07a0efb19b
v=5.478862 4015ea5ace4c4586
Notice that only the very last bit of the result is off, which is why I
think it might be rounding related.
Any help with this problem would be appreciated... I am using
valgrind 3.2.3
Thanks,
Vince
#include <stdio.h>
#include <string.h>
/* Print the raw 64-bit pattern of a double. */
void print_hex(double value) {
    unsigned long long bits;
    memcpy(&bits, &value, sizeof bits);
    printf("%llx\n", bits);
}
int main(int argc, char **argv) {
    unsigned long long xr_l=0x3fdb4914520a783aULL;
    unsigned long long qr_l=0x3fe02b07a0efb19bULL;
    double xr,qr,v;
    unsigned short cw; /* only used by the commented-out fstcw check */
    /* Load the exact bit patterns into the doubles. */
    memcpy(&xr, &xr_l, sizeof xr);
    memcpy(&qr, &qr_l, sizeof qr);
    printf("xr=%lg qr=%lg\n",xr,qr);
    print_hex(xr); print_hex(qr);
    /* x87 control word: rounding control is bits 10-11. */
    // asm ("fstcw %0":"=m"(cw)::"memory");
    // printf("cw=%x, rounding=%d\n",cw,(cw>>10)&3);
    v=xr+10.0*qr;
    printf(" v=%lf ",v);
    print_hex(v);
    return 0;
}
|
|
From: Julian S. <js...@ac...> - 2007-03-23 14:35:38
|
Vince
Valgrind's handling of x86 FP is something of a kludge. In short
it regards all operations internally as 64-bit, to increase commonality
of Valgrind's internals with other platforms and reduce overall
engineering effort. Unfortunately this can give rise to the kinds
of problem you saw. Fixing it properly would take a significant amount
of time hacking around the internals of VEX.
I peered at various bits of vex re your test case, but didn't see
any other obvious bugs. I suspect the problem is intrinsic in
using 64-bit FP to simulate 80-bit FP. There is a problem with
FP spilling in the x86 code generator which I should look further
at, but it doesn't affect your test case.
The only helpful suggestions I can offer are:
- redo your experiments on a 64-bit x86 platform. In 64 bit
mode, the native FP size is 64 bits anyway (not 80)
since FP is done by default on the lower halves of SSE
registers, and experience shows these accuracy problems are
much reduced
- better still, redo your experiments on a ppc32-linux or ppc64-linux
platform. Unlike its x86 cousins, the Valgrind ppc simulation
produces bit-exact floating point results.
J
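A minimal sketch of the 64-bit versus 80-bit effect described above,
using the operands from the test case. It assumes `long double` maps to
the x87 80-bit format (typical for gcc on x86 Linux, not guaranteed
elsewhere). The function names are illustrative only; they are not part
of Valgrind or of the original test program.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double (no pointer aliasing). */
double bits_to_double(uint64_t bits) {
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Reinterpret a double as its 64-bit pattern. */
uint64_t double_to_bits(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Every intermediate is stored as a 64-bit double, so the product and
 * the sum are each rounded to 53 bits of mantissa. */
double eval_double(double xr, double qr) {
    volatile double prod = 10.0 * qr;
    volatile double sum = xr + prod;
    return sum;
}

/* Intermediates are kept in long double (assumed 80-bit here) and the
 * result is rounded to double only once, at the end. */
double eval_extended(double xr, double qr) {
    volatile long double prod = 10.0L * (long double)qr;
    volatile long double sum = (long double)xr + prod;
    return (double)sum;
}
```

Printing double_to_bits(eval_double(xr, qr)) next to
double_to_bits(eval_extended(xr, qr)) shows whether the two strategies
disagree in the last bit on a given platform. The thread reports
...4585 natively and ...4586 under Valgrind for these operands.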
|
|
From: Nicholas N. <nj...@cs...> - 2007-03-24 01:19:45
|
On Fri, 23 Mar 2007, Julian Seward wrote:
> Valgrind's handling of x86 FP is something of a kludge. In short
> it regards all operations internally as 64-bit, to increase commonality
> of Valgrind's internals with other platforms and reduce overall
> engineering effort. Unfortunately this can give rise to the kinds
> of problem you saw. Fixing it properly would take a significant amount
> of time hacking around the internals of VEX.
To augment Julian's comment: for x86, Valgrind simulates a machine with
64-bit FP registers. The important question here: is that 64-bit
implementation correct? Ie. is 'art' relying on the extra 16 bits of
accuracy? If so, then 'art' is arguably (some would disagree) at fault,
because it's doing non-portable things. Given that it is in SPEC, that
would be surprising. Or, Valgrind's 64-bit FP implementation may have bugs.
Nick
|
|
From: Vince W. <vi...@cs...> - 2007-03-24 01:52:20
|
> To augment Julian's comment: for x86, Valgrind simulates a machine with
> 64-bit FP registers. The important question here: is that 64-bit
> implementation correct? Ie. is 'art' relying on the extra 16 bits of
> accuracy? If so, then 'art' is arguably (some would disagree) at fault,
> because it's doing non-portable things. Given that it is in SPEC, that
> would be surprising. Or, Valgrind's 64-bit FP implementation may have bugs.
I think the problem does lie with 'art', but unfortunately it's a bit
late to do anything about this (it's the spec2k version of art, I don't
think it is even included in spec2k6).
I've done the same experiment with art compiled with -msse2 and found
that the problem goes away; since the sse2 floating point code only uses
64-bit math this seems to indicate valgrind is behaving properly with
regards to 64-bit math.
Unfortunately for me the machine I am using to do the performance
counter measurements is an older pentium3 that doesn't have sse2 support,
so I am going to have to find another way to work around this for now.
Thanks for the help,
Vince
|
|
From: Julian S. <js...@ac...> - 2007-03-24 02:05:55
|
> Unfortunately for me the machine I am using to do the performance
> counter measurements is an older pentium3 that doesn't have sse2 support,
> so I am going to have to find another way to work around this for now.
If this comparison is not in an inner loop, can you do some nasty
kludge like masking off the lowest couple of mantissa bits before
doing the comparison?
J
|
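The masking kludge suggested above could be sketched as follows. This is
an illustration under assumptions, not code from art or Valgrind: the
helper names are hypothetical, and the two test values are the native
and Valgrind results reported earlier in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double (no pointer aliasing). */
double bits_to_double(uint64_t bits) {
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Return the bit pattern of d with the nbits lowest mantissa bits
 * cleared. A double has 52 explicit mantissa bits. */
uint64_t masked_bits(double d, unsigned nbits) {
    uint64_t bits;
    assert(nbits <= 52);
    memcpy(&bits, &d, sizeof bits);
    return bits & ~((UINT64_C(1) << nbits) - 1);
}

/* Compare two doubles while ignoring their nbits lowest mantissa bits.
 * Note: this compares bit patterns, so +0.0 and -0.0 differ and NaNs
 * with identical payloads compare equal, unlike the "==" operator. */
int equal_ignoring_low_bits(double a, double b, unsigned nbits) {
    return masked_bits(a, nbits) == masked_bits(b, nbits);
}
```

With nbits set to 2, the patterns 0x4015ea5ace4c4585 and
0x4015ea5ace4c4586 compare equal. Masking is still a kludge: two values
one LSB apart can straddle a mask boundary (for example ...87 and ...88)
and will then compare unequal.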
|
From: Vince W. <vi...@cs...> - 2007-03-24 02:21:03
|
> If this comparison is not in an inner loop, can you do some nasty
> kludge like masking off the lowest couple of mantissa bits before
> doing the comparison?
I was thinking about trying that as a last resort.
I came across an interesting paper:
http://www.wrcad.com/linux_numerics.txt
Maybe I can try forcing art to use FPU_DOUBLE mode instead of
FPU_EXTENDED in the manner described in the paper...
Vince
|
|
From: Julian S. <js...@ac...> - 2007-03-24 02:10:22
|
On Friday 23 March 2007 23:00, Nicholas Nethercote wrote:
> On Fri, 23 Mar 2007, Julian Seward wrote:
> > Valgrind's handling of x86 FP is something of a kludge. In short
> > it regards all operations internally as 64-bit, to increase commonality
> > of Valgrind's internals with other platforms and reduce overall
> > engineering effort. Unfortunately this can give rise to the kinds
> > of problem you saw. Fixing it properly would take a significant amount
> > of time hacking around the internals of VEX.
>
> To augment Julian's comment: for x86, Valgrind simulates a machine with
> 64-bit FP registers. The important question here: is that 64-bit
> implementation correct?
Having looked at Vince's test case, I didn't see any place where
Valgrind incorrectly double-rounds the value. At least that's one
good thing, even though it doesn't help Vince.
I did notice that valgrind's register allocator was using 64-bit
loads/stores to spill FP registers, which isn't really right -- it means
a spill-reload event isn't "transparent" to the value. I fixed it to
do 80-bit spilling. This made no difference whatsoever to the big FP
suite I use for testing (GNU gsl 1.6), alas.
J
|
|
From: Bart V. A. <bar...@gm...> - 2007-03-24 08:23:18
|
On 3/24/07, Nicholas Nethercote <nj...@cs...> wrote:
> To augment Julian's comment: for x86, Valgrind simulates a machine with
> 64-bit FP registers. The important question here: is that 64-bit
> implementation correct? Ie. is 'art' relying on the extra 16 bits of
> accuracy? If so, then 'art' is arguably (some would disagree) at fault,
> because it's doing non-portable things. Given that it is in SPEC, that
> would be surprising. Or, Valgrind's 64-bit FP implementation may have
> bugs.
My opinion is that the art program is flawed: it is never a good idea to
compare floating point numbers with the "==" or "!=" operator. Floating
point numbers must be compared via fabs(... - ...) < ... or
fabs(... - ...) > ... This is something you can find in any decent FAQ
about numerical computing.
Bart.
|
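The fabs-based comparison recommended above might look like the sketch
below. The function name and the tolerance values are assumptions chosen
for illustration; a suitable tolerance depends on the algorithm and is
not something the thread specifies.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 64-bit pattern as a double (no pointer aliasing). */
double bits_to_double(uint64_t bits) {
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Return 1 if a and b differ by no more than the larger of abs_tol and
 * rel_tol times the larger magnitude, else 0. The relative term handles
 * large values; the absolute term handles values close to zero. */
int nearly_equal(double a, double b, double rel_tol, double abs_tol) {
    double diff = fabs(a - b);
    double mag_a = fabs(a);
    double mag_b = fabs(b);
    double scale = mag_a > mag_b ? mag_a : mag_b;
    double limit = rel_tol * scale;
    if (limit < abs_tol)
        limit = abs_tol;
    return diff <= limit;
}
```

For the two results reported in the thread, which differ only in the
last bit, nearly_equal(native, valgrind, 1e-12, 0.0) returns 1 even
though native == valgrind is false.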
|
From: Julian S. <js...@ac...> - 2007-03-24 12:18:50
|
> On Saturday 24 March 2007 08:23, Bart Van Assche wrote:
>
> My opinion is that the art program is flawed: it is never a good idea to
> compare floating point numbers with the "==" or "!=" operator. [...]
I agree, comparing floating point values for equality is not good. On
the other hand, SPEC CPU is designed to be portable and I would be
amazed if the SPEC folks had not looked into these problems in depth.
Perhaps they fixed all problems they encountered in testing, but this
one did not happen at that time, and it is only triggered by Valgrind's
extra inaccuracy on x86. Who knows.
J
|
|
From: Vince W. <vi...@cs...> - 2007-03-24 18:27:40
|
On Sat, 24 Mar 2007, Julian Seward wrote:
> > On Saturday 24 March 2007 08:23, Bart Van Assche wrote:
> >
> > My opinion is that the art program is flawed: it is never a good idea to
> > compare floating point numbers with the "==" or "!=" operator. [...]
>
> I agree, floating point comparison is not good. On the other hand,
> SPEC CPU is designed to be portable and I would be amazed if the SPEC
> folks had not looked into these problems in depth. Perhaps they
> fixed all problems they encountered in testing, but this one did not
> happen at that time, and it is only triggered by Valgrind's extra
> inaccuracy on x86. Who knows.
While you would think SPEC would have done a good job picking portable
benchmarks, in actual fact they are a mess. What passed for acceptable
code ~1998 when the spec2000 codes were frozen just wouldn't fly today.
They've had to release a number of service packs along the way because
probably at least half the original spec2k code release won't compile
with a gcc more recent than 2.8 or so. Even now, with the most recent
spec2k release, you can't compile the 'vortex' benchmark with
optimizations turned on with gcc 4.0 or it will crash on x86-linux.
I do wonder if any of the compiler vendors noticed this problem with art..
you could in theory make your compiler look better on the FP score by
having art finish in half the time if you made sure it ran in 64-bit
rather than 80-bit mode on x86...
For my purposes I hacked the art code and added a few lines of code at
the beginning to force the x87 fpu state to be FPU_DOUBLE (instead of
FPU_EXTENDED) and that is enough to make the valgrind runs match the
actual perf counter runs.
From what I gather from the document I linked to earlier by Whiteley, it
is only Linux x86 that even shows this behavior; other OSes like x86-BSD
and Windows don't enable extended mode by default.
Thanks for all the help looking into this,
Vince
|
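The lines Vince added to art are not shown in the thread. The sketch
below is one way to force the x87 precision-control field to double
precision, assuming gcc-style inline assembly on x86. The function and
constant names are hypothetical. On glibc, <fpu_control.h> offers the
same operation through _FPU_GETCW, _FPU_SETCW, _FPU_EXTENDED and
_FPU_DOUBLE.

```c
#include <stdint.h>

#if (defined(__i386__) || defined(__x86_64__)) && defined(__GNUC__)
#define HAVE_X87_CW 1
#else
#define HAVE_X87_CW 0
#endif

/* Precision control occupies bits 8-9 of the x87 control word:
 * 00 = single, 10 = double (53-bit), 11 = extended (64-bit). */
enum {
    X87_PC_MASK     = 0x0300,
    X87_PC_DOUBLE   = 0x0200,
    X87_PC_EXTENDED = 0x0300
};

/* Return the current x87 control word, or -1 where unsupported. */
int x87_read_cw(void) {
#if HAVE_X87_CW
    uint16_t cw;
    __asm__ volatile ("fnstcw %0" : "=m"(cw));
    return cw;
#else
    return -1;
#endif
}

/* Set precision control to double, leaving rounding mode and exception
 * masks untouched. Returns 1 if applied, 0 where unsupported. */
int x87_force_double(void) {
#if HAVE_X87_CW
    uint16_t cw;
    __asm__ volatile ("fnstcw %0" : "=m"(cw));
    cw = (uint16_t)((cw & ~X87_PC_MASK) | X87_PC_DOUBLE);
    __asm__ volatile ("fldcw %0" : : "m"(cw));
    return 1;
#else
    return 0;
#endif
}
```

Calling x87_force_double() once at the start of main(), before any
floating point work, makes x87 arithmetic round to a 53-bit mantissa.
It does not narrow the exponent range, so results can still differ from
pure 64-bit arithmetic in overflow and underflow cases.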
|
From: Nicholas N. <nj...@cs...> - 2007-03-24 23:46:07
|
On Sat, 24 Mar 2007, Vince Weaver wrote:
> I do wonder if any of the compiler vendors noticed this problem with art..
> you could in theory make your compiler look better on the FP score by
> having art finish in half the time if you made sure it ran in 64-bit
> rather than 80-bit mode on x86...
Surely the SPEC output checking would catch this, if you are doing proper,
reportable runs?
Nick
|
|
From: Vince W. <vi...@cs...> - 2007-03-25 20:04:45
|
On Sun, 25 Mar 2007, Nicholas Nethercote wrote:
> On Sat, 24 Mar 2007, Vince Weaver wrote:
> > I do wonder if any of the compiler vendors noticed this problem with art..
> > you could in theory make your compiler look better on the FP score by
> > having art finish in half the time if you made sure it ran in 64-bit
> > rather than 80-bit mode on x86...
>
> Surely the SPEC output checking would catch this, if you are doing proper,
> reportable runs?
This is rapidly getting more and more off-topic, for which I apologize...
The output from the 'art' benchmark is identical in all cases...
the difference is that it converges twice as fast when using 64-bit
math rather than 80-bit math.
I only noticed this problem because the experiments I am doing depend on
the instructions_retired metric to be roughly the same across all the
tools I am testing.
Vince
|
|
From: Nicholas N. <nj...@cs...> - 2007-03-26 01:57:31
|
On Sun, 25 Mar 2007, Vince Weaver wrote:
> The output from the 'art' benchmark is identical in all cases...
> the difference is that it converges twice as fast when using 64-bit
> math rather than 80-bit math.
That's awful. Is SPEC2006 any better than SPEC2000?
Nick
|