From: Thomas R. <tr...@st...> - 2013-05-02 13:24:49

Hi,

This is somewhat prompted by the "what do we want to go in for 3.9"
thread, but I didn't want to hijack it with hypothetical features.

I've been working on AMD64 rounding mode support on and off, largely
because we need this at $DAYJOB. The ticket

  https://bugs.kde.org/show_bug.cgi?id=136779

keeps getting stalled. I don't want to sound overly pushy, but it would
help me a bit to at least know generally what is holding it up. General
disinterest? Disappointment with a 4% slowdown in some cases? Bad
design? Broken code?

I still want to add a thorough regression test and run it through some
stricter testing if I can find some workable tests (probably from the
CGAL test suite, if I can get access). But we have been using this in
production since before I posted the first version, and that's now more
than 2.5 years ago.

I have again rebased it and made another big commit to modify the new
AVX instruction support to look out for rounding; the relevant commits
are currently at

  https://github.com/trast/valgrind/commits/amd64-rounding

for lack of a better place to keep them.

I have some other work-in-progress there, notably the regalloc speed
improvement hack at

  https://bugs.kde.org/show_bug.cgi?id=318030

but that is relatively minor.

Thanks,

--
Thomas Rast
trast@{inf,student}.ethz.ch
From: Julian S. <js...@ac...> - 2013-05-03 08:52:55

> keeps getting stalled. I don't want to sound overly pushy, but it would
> help me a bit to at least know generally what is holding it up. General
> disinterest? Disappointment with a 4% slowdown in some cases?

I think the fact that this is somewhat of a niche item, plus the
potential slowdown. But, no matter. Can you tell me exactly the set of
patches that I should review/try out? I'll try to get some feedback to
you by early next week.

> https://bugs.kde.org/show_bug.cgi?id=318030

I looked at that before now, but didn't come to much conclusion.
Personally I would also like ARM to benefit from that, since that's a
target on which I run large codes on relatively low-powered hardware.

J
From: Sebastian F. <seb...@gm...> - 2013-05-03 14:26:46

On Fri, May 3, 2013 at 10:52 AM, Julian Seward <js...@ac...> wrote:
>
>> keeps getting stalled. I don't want to sound overly pushy, but it would
>> help me a bit to at least know generally what is holding it up. General
>> disinterest? Disappointment with a 4% slowdown in some cases?
>
> I think the fact that this is somewhat of a niche item, plus the potential
> slowdown.

Which slowdown? Are you serious? Shouldn't valgrind first try to get
things running *correct* before trying to optimize performance? There
are already enough applications like spice or matlab which grossly
misbehave, produce cropped results (80bit fp gets reduced to 64bit fp
math, resulting in gross matlab malfunctions) or just plainly crash.
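The 80-bit-to-64-bit reduction described here can be probed natively
with a short check like the one below. This is a sketch, not code from
the thread: the function name is invented, and it assumes an x86-64
Linux toolchain where 'long double' is the x87 80-bit type and plain
'double' arithmetic uses SSE (GCC's default on that target).

```c
#include <stdbool.h>

/* Invented probe: returns true when 'long double' really carries more
 * mantissa bits than 'double' (the x87 80-bit case on x86-64 Linux).
 * Under a simulator that silently computes long double at 64-bit
 * double precision, it would return false. */
static bool long_double_is_wider(void)
{
    /* 2^-60 is far below double's half-ulp at 1.0 (2^-53), but well
     * within the x87 64-bit mantissa, so only the long double sum
     * moves away from 1.0. */
    volatile long double ld_one = 1.0L, ld_tiny = 0x1p-60L;
    volatile double d_one = 1.0, d_tiny = 0x1p-60;
    return (ld_one + ld_tiny > 1.0L) && (d_one + d_tiny == 1.0);
}
```

Running a program like this natively and under valgrind makes the
precision gap directly visible.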

From: Thomas R. <tr...@st...> - 2013-05-03 14:56:12

Sebastian Feld <seb...@gm...> writes:
> On Fri, May 3, 2013 at 10:52 AM, Julian Seward <js...@ac...> wrote:
>>
>>> keeps getting stalled. I don't want to sound overly pushy, but it would
>>> help me a bit to at least know generally what is holding it up. General
>>> disinterest? Disappointment with a 4% slowdown in some cases?
>>
>> I think the fact that this is somewhat of a niche item, plus the potential
>> slowdown.
>
> Which slowdown? Are you serious? Shouldn't valgrind first try to get
> things running *correct* before trying to optimize performance? There
> are already enough applications like spice or matlab which grossly
> misbehave, produce cropped results (80bit fp gets reduced to 64bit fp
> math, resulting in gross matlab malfunctions) or just plainly crash.
Note that handling the x87 instructions, extending them to 80 bits, is
much harder. I merely made the SSE instructions respect the SSE
rounding mode. This proves sufficient for our purposes, because GCC
defaults to -mfpmath=sse on x86-64 and we do not use 'long double'.
--
Thomas Rast
trast@{inf,student}.ethz.ch

From: John R. <jr...@bi...> - 2013-05-03 16:49:42

On 05/03/2013 07:26 AM, Sebastian Feld wrote:
> There
> are already enough applications like spice or matlab which grossly
> misbehave, produce cropped results (80bit fp gets reduced to 64bit fp
> math, resulting in gross matlab malfunctions) or just plainly crash.
Do you run with the following, every time, all the time, no exceptions?
void *malloc(size_t size) { return calloc(1, size); }
That prevents *ALL* uninit errors which might arise from malloc()ed blocks.
Typical costs are about 3% or less in time. Make a shared library,
name that library in LD_PRELOAD, and *ALL* your programs benefit.
If you don't do this, then you care less than 3% about *ROBUST* computing.
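The shim described above can be written out as a complete source file.
This is a sketch under the stated assumptions: the file name is
invented, and it relies on glibc/ELF symbol interposition, where a
preloaded definition of malloc() shadows the libc one while calloc()
still resolves to the real allocator (so there is no recursion).

```c
/* zap_malloc.c (invented name) - every malloc() becomes calloc(1, size),
 * so all heap blocks start zeroed and Memcheck's uninitialised-heap
 * reports from malloc()ed memory disappear.
 *
 * Build:  cc -shared -fPIC -o zap_malloc.so zap_malloc.c
 * Use:    LD_PRELOAD=./zap_malloc.so ./your_program
 */
#include <stdlib.h>

void *malloc(size_t size)
{
    /* Only malloc is interposed; calloc() here binds to the real libc
     * allocator, which also performs the zeroing. */
    return calloc(1, size);
}
```

The same definition works compiled directly into a program, which is an
easy way to check it before building the preload library.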
Next: local variables. Compile with "-pg -mfentry", and implement an
mcount() or __fentry__() which zeroes the new stack frame. That prevents
*ALL* uninit errors for stack-resident locals. Typical cost varies,
but most often 3% to 10%; sometimes 15% to 20%. [Do you pay sales tax (VAT)?]
The only case left for uninit is register-only local variables. Beat your
compiler over the head with a club until it warns for every declaration
of a local scalar that does not contain an initialization.
With uninit eradicated, the only error left is out-of-bounds.
Sorry, that's a halting problem [probably unsolvable].
Next: spice and matlab. The subroutines themselves are *correct*.
The only possible errors are due to incorrect parameters: client code.
So construct an indirection layer: check overlap of input arrays,
pre-zero output arrays, etc. Use this for every run, all the time.
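Such an indirection layer might look like the following sketch.
dot_product() and its checked wrapper are invented stand-ins, not spice
or matlab APIs; the point is the parameter validation placed in front
of the trusted routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for any trusted library subroutine. */
static double dot_product(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* True when the byte ranges [p, p+pn) and [q, q+qn) intersect. */
static int ranges_overlap(const void *p, size_t pn,
                          const void *q, size_t qn)
{
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    return a < b + qn && b < a + pn;
}

/* Indirection layer: reject bad parameters before the real call. */
static double dot_product_checked(const double *a, const double *b,
                                  size_t n)
{
    assert(a != NULL && b != NULL);
    /* Aliasing happens to be harmless for a dot product; the check
     * shows the pattern for routines where overlap breaks the
     * contract. */
    assert(a == b ||
           !ranges_overlap(a, n * sizeof *a, b, n * sizeof *b));
    return dot_product(a, b, n);
}
```

For routines with output buffers, the wrapper would additionally
memset() the outputs to zero before the call, as described above.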
Next: 80-bit floating point. Put your money where your mouth is.
Recently DFP (Decimal Floating Point as in s390) was added to memcheck+VEX.
Adding 80-bit FP would be a similar effort, and can even use much of
the DFP code as a guide. I'll do it for N * $10,000; 'N' to be negotiated.
--

From: Julian S. <js...@ac...> - 2013-05-07 13:16:44

On 05/03/2013 04:26 PM, Sebastian Feld wrote:
> Which slowdown? Are you serious? Shouldn't valgrind first try to get
> things running *correct* before trying to optimize performance?

There are various places -- particularly within Memcheck's definedness
tracking -- where there is a tradeoff between correctness and
performance. This is not the only case. We have to choose some
tradeoff between accuracy and performance which satisfies the majority
of users. There are always going to be some set of users for whom this
tradeoff doesn't work well, so improvements are necessary.

J

From: John R. <jr...@bi...> - 2013-05-03 15:18:53

On 05/03/2013 07:26 AM, Sebastian Feld wrote:
> On Fri, May 3, 2013 at 10:52 AM, Julian Seward <js...@ac...> wrote:
>>
>>> keeps getting stalled. I don't want to sound overly pushy, but it would
>>> help me a bit to at least know generally what is holding it up. General
>>> disinterest? Disappointment with a 4% slowdown in some cases?
>>
>> I think the fact that this is somewhat of a niche item, plus the potential
>> slowdown.
>
> Which slowdown? Are you serious? Shouldn't valgrind first try to get
> things running *correct* before trying to optimize performance?

Late answers are also incorrect answers, even for a correctness
checker. memcheck is almost too slow already. There are more than a
few cases in which memcheck is not used because memcheck takes too
long.

A 4% slowdown means that a memcheck run that used to take an entire day
(for a program that takes an hour at normal speed) would take a whole
hour longer. That can be too slow. The probabilistic risk of an
unrevealed error can be less than the guaranteed cost of memcheck.
This happens often enough that it causes serious consideration of the
trade-offs.

--

From: Thomas R. <tr...@st...> - 2013-05-03 18:30:47

John Reiser <jr...@bi...> writes:
> On 05/03/2013 07:26 AM, Sebastian Feld wrote:
>> On Fri, May 3, 2013 at 10:52 AM, Julian Seward <js...@ac...> wrote:
>>>
>>>> keeps getting stalled. I don't want to sound overly pushy, but it would
>>>> help me a bit to at least know generally what is holding it up. General
>>>> disinterest? Disappointment with a 4% slowdown in some cases?
>>>
>>> I think the fact that this is somewhat of a niche item, plus the potential
>>> slowdown.
>>
>> Which slowdown? Are you serious? Shouldn't valgrind first try to get
>> things running *correct* before trying to optimize performance?
>
> Late answers are also incorrect answers, even for a correctness checker.
> memcheck is almost too slow already. There are more than a few cases
> in which memcheck is not used because memcheck takes too long.
>
> A 4% slowdown means that a memcheck run that used to take an entire day
> (for a program that takes an hour at normal speed) would take a whole hour
> longer. That can be too slow. The probabilistic risk of an unrevealed error
> can be less than the guaranteed cost of memcheck. This happens often enough
> that it causes serious consideration of the trade-offs.
Just to make sure this does not go off on a 4% tangent: the actual
statistics I have are
-- bigcode1 --
bigcode1 vg-upstream:0.10s no: 1.4s (13.9x, -----) me: 2.6s (26.4x, -----)
bigcode1 vg-round :0.10s no: 1.4s (13.8x, 0.7%) me: 2.7s (26.6x, -0.8%)
-- bigcode2 --
bigcode2 vg-upstream:0.10s no: 3.2s (32.5x, -----) me: 6.8s (68.4x, -----)
bigcode2 vg-round :0.10s no: 3.3s (32.6x, -0.3%) me: 6.9s (69.0x, -0.9%)
-- bz2 --
bz2 vg-upstream:0.62s no: 2.1s ( 3.5x, -----) me: 6.8s (11.0x, -----)
bz2 vg-round :0.62s no: 2.1s ( 3.5x, 0.5%) me: 6.6s (10.6x, 3.4%)
-- fbench --
fbench vg-upstream:0.22s no: 0.9s ( 4.0x, -----) me: 3.5s (16.1x, -----)
fbench vg-round :0.22s no: 0.9s ( 4.0x, -2.3%) me: 3.7s (16.6x, -3.4%)
-- ffbench --
ffbench vg-upstream:0.19s no: 0.8s ( 4.3x, -----) me: 2.6s (13.9x, -----)
ffbench vg-round :0.19s no: 0.8s ( 4.3x, 0.0%) me: 2.6s (13.9x, 0.4%)
-- heap --
heap vg-upstream:0.12s no: 0.7s ( 5.8x, -----) me: 4.6s (38.5x, -----)
heap vg-round :0.12s no: 0.7s ( 5.8x, -0.0%) me: 4.6s (38.2x, 0.6%)
-- heap_pdb4 --
heap_pdb4 vg-upstream:0.13s no: 0.8s ( 5.8x, -----) me: 7.1s (54.7x, -----)
heap_pdb4 vg-round :0.13s no: 0.8s ( 5.8x, 0.0%) me: 7.0s (53.5x, 2.3%)
-- many-loss-records --
many-loss-records vg-upstream:0.01s no: 0.2s (17.0x, -----) me: 1.1s (108.0x, -----)
many-loss-records vg-round :0.01s no: 0.2s (17.0x, 0.0%) me: 1.1s (109.0x, -0.9%)
-- many-xpts --
many-xpts vg-upstream:0.04s no: 0.3s ( 6.5x, -----) me: 1.6s (39.7x, -----)
many-xpts vg-round :0.04s no: 0.3s ( 6.5x, 0.0%) me: 1.6s (39.7x, -0.0%)
-- sarp --
sarp vg-upstream:0.02s no: 0.2s (10.0x, -----) me: 2.1s (104.5x, -----)
sarp vg-round :0.02s no: 0.2s ( 9.0x, 10.0%) me: 2.1s (103.0x, 1.4%)
-- tinycc --
tinycc vg-upstream:0.16s no: 1.4s ( 8.8x, -----) me: 8.3s (52.0x, -----)
tinycc vg-round :0.16s no: 1.4s ( 8.9x, -1.4%) me: 8.5s (53.1x, -2.2%)
Consider them as superseding those posted on the ticket; I generated the
latter on a laptop that has some severe thermal issues with resulting
highly unstable clock.
So it's by no means a clear-cut 4% slowdown. It *is* a slowdown for
some tests, but it is actually a speedup for others.
--
Thomas Rast
trast@{inf,student}.ethz.ch

From: Lionel C. <lio...@go...> - 2013-05-03 17:42:01

On 3 May 2013 16:26, Sebastian Feld <seb...@gm...> wrote:
> On Fri, May 3, 2013 at 10:52 AM, Julian Seward <js...@ac...> wrote:
>>
>>> keeps getting stalled. I don't want to sound overly pushy, but it would
>>> help me a bit to at least know generally what is holding it up. General
>>> disinterest? Disappointment with a 4% slowdown in some cases?
>>
>> I think the fact that this is somewhat of a niche item, plus the potential
>> slowdown.
>
> Which slowdown? Are you serious? Shouldn't valgrind first try to get
> things running *correct* before trying to optimize performance? There
> are already enough applications like spice or matlab which grossly
> misbehave, produce cropped results (80bit fp gets reduced to 64bit fp
> math, resulting in gross matlab malfunctions) or just plainly crash.

You're using the wrong tool. valgrind is great for testing and aiding
development of GUI applications or games with little or no precise fp,
but it is unsuited for scientific applications. Just from experience
with CERN's toolchain - valgrind is a low end *toy* in such cases.

Lionel

From: Philippe W. <phi...@sk...> - 2013-05-03 17:56:15

On Fri, 2013-05-03 at 19:41 +0200, Lionel Cons wrote:
> You're using the wrong tool. valgrind is great for testing and aiding
> development of GUI applications or games with little or no precise fp,
> but it is unsuited for scientific applications. Just from experience
> with CERN's toolchain - valgrind is a low end *toy* in such cases.

If there is enough interest, maybe the CERN (or CERN people) would be
willing to contribute work (or contribute € or $) to have better fp
support in Valgrind? Or maybe a CERN internship?

Philippe

From: Irek S. <isz...@gm...> - 2013-05-03 18:42:02

On Fri, May 3, 2013 at 7:56 PM, Philippe Waroquiers
<phi...@sk...> wrote:
> On Fri, 2013-05-03 at 19:41 +0200, Lionel Cons wrote:
>
>> You're using the wrong tool. valgrind is great for testing and aiding
>> development of GUI applications or games with little or no precise fp,
>> but it is unsuited for scientific applications. Just from experience
>> with CERN's toolchain - valgrind is a low end *toy* in such cases.
>
> If there is enough interest, maybe the CERN (or CERN people)
> would be willing to contribute work (or contribute € or $) to have
> a better fp support in Valgrind ?
> Or maybe a CERN internship ?

How do you justify the funding for sites which already have Rational
tools licensed? CERN has such a campus license.

You're facing a chicken-and-egg problem as soon as money becomes
involved: if sites have money they just buy Rational, and if they don't
have money they can't fund contributions either.

I've been there with GE Healthcare. The bean counters told me that it
is cheaper to pay for a Rational license to get an immediate solution
than funding valgrind development which *may* provide one a year later.

Irek

From: Philippe W. <phi...@sk...> - 2013-05-03 19:12:04

On Fri, 2013-05-03 at 20:41 +0200, Irek Szczesniak wrote:
> How do you justify the funding for sites which already have
> Rational tools licensed? CERN has such a campus license.

Maybe because Valgrind is not an exact equivalent of Purify/Quantify
(e.g. in terms of bit precision of memcheck, and/or more tools such as
helgrind or drd or ...)?

Maybe because there are some complaints from CERN people about
Valgrind? (I guess that if the Rational tools were all that is needed,
there wouldn't be much feedback about Valgrind from CERN :).

> You're facing a chicken and egg problem as soon as money becomes involved:
> If sites have money they just buy Rational and if they don't have
> money they can't fund contributions either.
> I've been there with GE Healthcare. The bean counters told me that it
> is cheaper to pay for a Rational license to get an immediate solution
> than funding valgrind development which *may* provide one a year
> later.

I do not know if better 80-bit fp will take one year of work. John
Reiser said "I'll do it for N * $10,000; 'N' to be negotiated." Maybe
CERN can negotiate starting from N = 0 and increase to 1 maybe? :)
And maybe an internship is relevant? If Valgrind has correct fp
support, it might help to detect a few bugs that might cost a lot
otherwise.

Philippe

From: Thomas R. <tr...@st...> - 2013-05-03 19:40:20

Julian Seward <js...@ac...> writes:
>> keeps getting stalled. I don't want to sound overly pushy, but it would
>> help me a bit to at least know generally what is holding it up. General
>> disinterest? Disappointment with a 4% slowdown in some cases?
>
> I think the fact that this is somewhat of a niche item, plus the potential
> slowdown. But, no matter. Can you tell me exactly the set of patches
> that I should review/try out? I'll try to get some feedback to you by
> early next week.
That would be awesome. I'll go through everything again and post a
revised set to the ticket.
--
Thomas Rast
trast@{inf,student}.ethz.ch

From: Thomas R. <tr...@st...> - 2013-05-04 09:32:25

Thomas Rast <tr...@st...> writes:

> Julian Seward <js...@ac...> writes:
>
>>> keeps getting stalled. I don't want to sound overly pushy, but it would
>>> help me a bit to at least know generally what is holding it up. General
>>> disinterest? Disappointment with a 4% slowdown in some cases?
>>
>> I think the fact that this is somewhat of a niche item, plus the potential
>> slowdown. But, no matter. Can you tell me exactly the set of patches
>> that I should review/try out? I'll try to get some feedback to you by
>> early next week.
>
> That would be awesome. I'll go through everything again and post a
> revised set to the ticket.

So I think I got them cleaned up. The only changes I made to them as
compared to what was on github yesterday include a slight reordering
and some explanations in the commit messages. I also posted all 7
patches to bugzilla in their current form.

Note for other readers: if you just want to test this, it's probably
easier, if a bit wasteful on bandwidth, to run

  cd /tmp
  git clone -b amd64-rounding git://github.com/trast/valgrind.git
  cd valgrind
  git clone -b amd64-rounding git://github.com/trast/valgrind-VEX.git VEX

--
Thomas Rast
trast@{inf,student}.ethz.ch

From: Thomas R. <tr...@st...> - 2013-05-04 09:44:42

Julian Seward <js...@ac...> writes:

>> https://bugs.kde.org/show_bug.cgi?id=318030
>
> I looked at that before now, but didn't come to much conclusion. Personally
> I would also like ARM to benefit from that, since that's a target on which
> I run large codes on relatively low-powered hardware.

Are you saying I should make the same change across all platforms?

I can give it a stab, but I can only test on x86, amd64 and perhaps ARM
if the raspi doesn't strain my patience too much.

--
Thomas Rast
trast@{inf,student}.ethz.ch

From: Julian S. <js...@ac...> - 2013-05-07 13:05:32

(changing the name of this sub-thread to reduce confusion)

>>> https://bugs.kde.org/show_bug.cgi?id=318030
>>
>> I looked at that before now, but didn't come to much conclusion. Personally
>> I would also like ARM to benefit from that, since that's a target on which
>> I run large codes on relatively low-powered hardware.
>
> Are you saying I should make the same change across all platforms?
>
> I can give it a stab, but I can only test on x86, amd64 and perhaps ARM
> if the raspi doesn't strain my patience too much.

Don't hack on platforms you can't test on -- that just leads to
problems. If you can get amd64 and ARM working with your speedups,
though, that would be great.

J

From: Thomas R. <tr...@st...> - 2013-05-13 12:03:49

Julian Seward <js...@ac...> writes:

> (changing the name of this sub-thread to reduce confusion)
>
>>>> https://bugs.kde.org/show_bug.cgi?id=318030
>>>
>>> I looked at that before now, but didn't come to much conclusion. Personally
>>> I would also like ARM to benefit from that, since that's a target on which
>>> I run large codes on relatively low-powered hardware.
>>
>> Are you saying I should make the same change across all platforms?
>>
>> I can give it a stab, but I can only test on x86, amd64 and perhaps ARM
>> if the raspi doesn't strain my patience too much.
>
> Don't hack on platforms you can't test on -- that just leads to problems.
> If you can get amd64 and ARM working with your speedups, though, that would
> be great.

I found out (stupid me) that a raspberry pi isn't the sort of ARM that
valgrind likes. Is there a painless way to try it on something else
that you can recommend, or is it ok if I just do it for x86 and x86_64?

--
Thomas Rast
trast@{inf,student}.ethz.ch