From: Julian S. <js...@ac...> - 2002-10-04 10:44:56
|
Cobbling together a response to this from the archives, since I didn't
get it via the normal routes.

> This patch makes FPU state changes lazy, so there should only be one
> save/restore pair per basic block.  With this change in place,
> FPU-intensive programs (in my case, some 3D code using OpenGL) are
> significantly faster.

Interesting.  This is something I'd wondered about doing at the time I
did the FPU stuff in the first place.  How much faster is
"significantly faster"?

So, my main point.  I think this patch is unsafe and will lead to
hard-to-find problems down the line.  The difficulty is that it allows
the simulated FPU state to hang around in the real FPU for long
periods, up to a whole basic block's worth of execution (if I
understand it right).  We only need a skin to call out to a helper
function which modifies the real FPU state on some obscure path, and
we're hosed.  Since we don't have any control over what skins people
might plug in, this seems like an unsafe modification to the core.

The modification I had in mind for a while was a lot more
conservative, and more along the lines of a peephole optimisation.
Essentially, if we see an FPU-no-mem op followed by another FPU-no-mem
op, we can skip the save at the end of the first and the restore at
the start of the second.  Looking at the stable branch vg_from_ucode.c
and the codegen cases for FPU, FPU_R and FPU_W, it's clear we can also
do the same for FPU_R/W followed by FPU, since there are no calls to
helpers in the gap between these two.  Or am I missing something?

It would definitely be good to speed up the FPU stuff a bit, but I
need to be convinced that you've got this 100% tied down in a
not-too-complex way, in the face of arbitrary actions carried out by
skins-not-invented-yet.

J
From: Jeremy F. <je...@go...> - 2002-10-04 15:44:12
|
On Fri, 2002-10-04 at 03:51, Julian Seward wrote:
> How much faster is "significantly faster"?

I haven't measured it in detail, but the frame time dropped from about
1100ms/frame to 800-900ms/frame.  I'll do some more scientific
measurements soon.

> So, my main point.  I think this patch is unsafe and will lead to
> hard-to-find problems down the line.  The difficulty is that it
> allows the simulated FPU state to hang around in the real FPU for
> long periods, up to a whole basic block's worth of execution (if I
> understand it right).  We only need a skin to call out to a helper
> function which modifies the real FPU state on some obscure path, and
> we're hosed.  Since we don't have any control over what skins people
> might plug in, this seems like an unsafe modification to the core.
>
> The modification I had in mind for a while was a lot more
> conservative, and more along the lines of a peephole optimisation.
> Essentially, if we see an FPU-no-mem op followed by another
> FPU-no-mem op, we can skip the save at the end of the first and the
> restore at the start of the second.

What I'm doing is not conceptually different from caching an ArchReg
in a RealReg for the lifetime of a basic block.  The general idea is
that the FP state is pulled in just before the first FPU/FPU_[RW]
instruction, and saved again just before:

 - JMP
 - CCALL
 - any skin UInstr

I can't see how a skin can introduce any instrumentation which would
be able to catch the FP state unsaved (is there any way for a skin to
do instrumentation or call a C function without using either CCALL or
its own UInstrs?).

Your idea is basically the same, except we add a fourth saving
condition:

 - any non-FPU instruction

This would only be necessary if you imagine a non-FPU instruction
which can inspect the architectural state of the FPU (in other words,
a memory access offset into the baseBlock: something which skins can't
generate directly).

In summary, I think this is actually pretty conservative, simple and
safe.

J
From: Jeremy F. <je...@go...> - 2002-10-04 20:42:41
Attachments:
fptest.c
valgrind-lazy-fp.diff
|
On Fri, 2002-10-04 at 03:51, Julian Seward wrote:
> How much faster is "significantly faster"?

OK, I've quantified this now.

Using the attached test program (a matrix multiply extracted from
Mesa), I'm getting the following timings (fptest compiled with -O;
600MHz PIII laptop; --skin=none):

  native execution:    0.38s
  baseline valgrind:  65.35s
  lazy-fp valgrind:    4.05s

In other words, Valgrind is currently 172 times slower than running
native for FP-intensive code; the lazy save/restore improves this by a
factor of 16 or so, leaving valgrind only about 11 times slower than
native.  I'd say that's significant.

I'm attaching my current diff against HEAD; the previous one left out
saving before JIFZ.

J
From: Julian S. <js...@ac...> - 2002-10-04 20:48:32
|
On Friday 04 October 2002 9:43 pm, Jeremy Fitzhardinge wrote:
> On Fri, 2002-10-04 at 03:51, Julian Seward wrote:
> > How much faster is "significantly faster"?
>
> OK, I've quantified this now.
>
> Using the attached test program (a matrix multiply extracted from
> Mesa), I'm getting the following timings (fptest compiled with -O;
> 600MHz PIII laptop; --skin=none):
>
>   native execution:    0.38s
>   baseline valgrind:  65.35s
>   lazy-fp valgrind:    4.05s
>
> In other words, Valgrind is currently 172 times slower than running
> native for FP-intensive code; the lazy save/restore improves this by
> a factor of 16 or so, leaving valgrind only about 11 times slower
> than native.

Hmm, not bad.  What are the numbers for --skin=memcheck and
--skin=addrcheck?  I know the improvement factor will be a lot less,
but I'd still like to know what it is.

/me is suitably impressed, just in case you were getting any other
impression, btw.

J
From: Jeremy F. <je...@go...> - 2002-10-04 23:02:59
|
On Fri, 2002-10-04 at 13:55, Julian Seward wrote:
> Hmm, not bad.  What are the numbers for --skin=memcheck and
> --skin=addrcheck?  I know the improvement factor will be a lot less,
> but I'd still like to know what it is.

            baseline   lazy fp
addrcheck     77.33     41.10
memcheck      82.98     44.76

J