|
From: Paul M. <pa...@sa...> - 2004-03-07 22:40:52
|
I have just put a new version of my PPC port of valgrind up at http://ozlabs.org/~paulus. There is a patch against valgrind-2.1.0 plus a tarball (.tar.bz2) there. I can now successfully valgrind Mozilla and OpenOffice on PPC.

However, it is painfully slow and it detects a lot of errors. Mozilla, for instance, runs about 500 times slower under valgrind than it does natively on my G5. It's not executing 500x as many instructions, so there must be something about the kinds of instruction sequences I am generating that cause the CPU to run a lot more slowly than it does on "normal" code. Maybe I am getting a lot of cache misses.

I have started merging my changes into the current CVS version. It's going to take me a little while, though, to understand the new startup sequence.

Paul.
|
|
From: Johan R. <jry...@ni...> - 2004-03-07 23:58:47
|
Paul Mackerras <pa...@sa...> wrote:

: However, it is painfully slow and it detects a lot of errors.
: Mozilla, for instance, runs about 500 times slower under valgrind than
: it does natively on my G5. [...]

What about chaining between blocks? That normally increases performance by a magnitude.

best regards
j
|
|
From: Paul M. <pa...@sa...> - 2004-03-08 00:19:31
|
Johan Rydberg writes:

> What about chaining between blocks? That normally increases performance
> by a magnitude.

Yes, I do block chaining. Usually I see about 80% of jumps being chained (i.e. 20% unchained). But the chaining only seems to increase performance by about 5%. I'm sure I must be doing something wrong somewhere but I just can't put my finger on it.

Paul.
|
|
From: Julian S. <js...@ac...> - 2004-03-08 01:05:27
|
On Monday 08 March 2004 00:04, Paul Mackerras wrote:

> Johan Rydberg writes:
> > What about chaining between blocks? That normally increases performance
> > by a magnitude.

No, not for us.

> Yes, I do block chaining. Usually I see about 80% of jumps being
> chained (i.e. 20% unchained). But the chaining only seems to increase
> performance by about 5%. I'm sure I must be doing something wrong
> somewhere but I just can't put my finger on it.

5% gain with block chaining sounds roughly on a par with what we got in x86 land, so looks like you're OK there at least. Have you tried with and without your new ultra-accurate add-tracking sequence -- the one with the min and max? That looks a bit expensive to me.

J
|
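[The "min and max" style of accurate add tracking Julian refers to can be sketched roughly as follows — a hedged reconstruction with invented names, not Paul's actual code. Each value carries a V-bit mask (a set bit means that bit of the value is unknown); for an add, compute the smallest and largest values each operand could take, add both pairs, and mark as unknown any result bit where the two extreme sums disagree — this catches carries propagating out of undefined bits, which a cheap "OR the masks" scheme misses:]

```c
#include <stdint.h>

/* Hypothetical min/max definedness tracking for a 32-bit add.
   'va'/'vb' are V-bit masks: a set bit means that bit of the
   corresponding value is undefined. */
static uint32_t vbits_add32(uint32_t a, uint32_t va,
                            uint32_t b, uint32_t vb)
{
    uint32_t a_min = a & ~va, a_max = a | va;   /* extremes of possible a */
    uint32_t b_min = b & ~vb, b_max = b | vb;   /* extremes of possible b */
    /* Any bit where the two extreme sums differ may depend on a carry
       from an undefined bit; OR in the operands' own undefined bits. */
    return (va | vb) | ((a_min + b_min) ^ (a_max + b_max));
}
```

[Note the cost per guest add: two extra additions plus several mask operations, versus a single OR for the cheap scheme — presumably why Julian flags it as expensive.]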
|
From: Julian S. <js...@ac...> - 2004-03-08 01:03:37
|
On Monday 08 March 2004 00:04, Paul Mackerras wrote:
> Johan Rydberg writes:
> > What about chaining between blocks? That normally increases performance
> > by a magnitude.
>
> Yes, I do block chaining. Usually I see about 80% of jumps being
> chained (i.e. 20% unchained). But the chaining only seems to increase
> performance by about 5%. I'm sure I must be doing something wrong
> somewhere but I just can't put my finger on it.
On x86 we unexpectedly got hammered (lost a huge number of cycles) due to
instructions to save and restore the cpu's flags register in memory.
Surprisingly this apparently-trivial action seems to cause PIII,
P4 and Athlon to stop until all pipelines are empty, losing 20-40
cycles for what is basically a simple load or store.
I wonder if something bad like that is happening to you. I tracked
this down by running an ultra-trivial loop on V; something like
for (i = 0; i < 100000000; i++)
;
and so you get a single translated bb jumping back to itself.
That means the code in it is simple enough to inspect and perhaps
that might lead you to something.
Other things I can think of are some kind of Icache coherency
problem due to dynamic code generation? Does writing at some
address invalidate all Icache entries in the vicinity?
J
|
|
From: Jeremy F. <je...@go...> - 2004-03-08 01:21:33
|
On Sun, 2004-03-07 at 14:26, Paul Mackerras wrote:

> I can now successfully valgrind Mozilla and OpenOffice on PPC.

Good news.

> However, it is painfully slow and it detects a lot of errors.

What kinds of errors? The error paths, even for suppressed or duplicate errors, are pretty slow compared to the non-error paths; I could imagine a pretty significant performance hit from just those.

> Mozilla, for instance, runs about 500 times slower under valgrind than
> it does natively on my G5. It's not executing 500x as many
> instructions, so there must be something about the kinds of
> instruction sequences I am generating that cause the CPU to run a lot
> more slowly than it does on "normal" code. Maybe I am getting a lot
> of cache misses.

500x is a bit of a surprise - it could just be a result of "lots of errors". I'd look to see if there are issues with sharing code and data on the same page. The current translation cache puts a structure immediately before each BB. I don't think it's modified much, but there could be issues. I think we should probably consider separating the data and code pieces of the TC anyway.

Also, I presume you flush the icache when generating new blocks of code; is that a global flush, or just parts of the icache?

Does linux-ppc support oprofile? I have a little hack which allocates the TC with mmap to a file rather than in anonymous memory, which allows oprofile to give good overall results to see whether time is being spent in generated code or in Valgrind core code (though mapping the generated code addresses to something meaningful is trickier).

> I have started merging my changes into the current CVS version. It's
> going to take me a little while, though, to understand the new startup
> sequence.

Tell me if I can help in any way. I'm interested to know if I've made any assumptions which won't work for ppc (32 or 64). I already know that we're going to have to do things slightly differently for x86-64, because we can't put the Valgrind code at a very high address (the toolchain doesn't support code outside of 4G, even if pointers are large), so we're going to have to do something like move Valgrind very low, reserving all the low addresses for its code (which would also work for most x86-32 programs... hmm).

J
|
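[The mmap-to-a-file hack Jeremy describes might look something like this — a Linux-specific sketch with invented names and trimmed error handling, not his actual patch. Mapping the translation cache from a real file rather than anonymous memory means profilers that read /proc/<pid>/maps can attribute samples in generated code to a named region instead of lumping them into anonymous memory:]

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map 'size' bytes of RWX memory backed by a temporary file, so the
   region shows up with a name in /proc/<pid>/maps.  A real profiling
   run might keep the file on disk; here we unlink it immediately
   (it then shows as "... (deleted)" in maps, which still tags it). */
static void *map_tc(size_t size)
{
    char path[] = "/tmp/vg-tc-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return NULL;
    unlink(path);
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```

[As Jeremy notes, this only tells you how much time is spent in generated code overall; mapping individual sample addresses back to guest code remains the harder problem.]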
|
From: Tom H. <th...@cy...> - 2004-03-08 07:28:14
|
In message <1078708420.28976.15.camel@localhost.localdomain>
Jeremy Fitzhardinge <je...@go...> wrote:
> Tell me if I can help in any way. I'm interested to know if I've made
> any assumptions which won't work for ppc (32 or 64). I already know
> that we're going to have to do things slightly differently for x86-64,
> because we can't put the Valgrind code at a very high address (the
> toolchain doesn't support code outside of 4G, even if pointers are
> large), so we're going to have to do something like move Valgrind very
> low, reserving all the low addresses for its code (which would also work
> for most x86-32 programs... hmm).
I don't think that's true at all - our x86-64 box seems to map code
outside the 4G range. Look at libc in this map:
gill [~] % uname -a
Linux gill.uk.cyberscience.com 2.4.22-1.2166.nptl #1 Fri Jan 30 13:44:52 EST 2004 x86_64 x86_64 x86_64 GNU/Linux
gill [~] % cat /proc/self/maps
0000000000400000-0000000000404000 r-xp 0000000000000000 03:41 868386 /bin/cat
0000000000504000-0000000000505000 rw-p 0000000000004000 03:41 868386 /bin/cat
0000000000505000-0000000000526000 rwxp 0000000000000000 00:00 0
0000002a95556000-0000002a9556b000 r-xp 0000000000000000 03:41 786436 /lib64/ld-2.3.2.so
0000002a9556b000-0000002a9556c000 rw-p 0000000000000000 00:00 0
0000002a9557d000-0000002a9557e000 rw-p 0000000000000000 00:00 0
0000002a9566a000-0000002a9566b000 rw-p 0000000000014000 03:41 786436 /lib64/ld-2.3.2.so
0000002a9566b000-0000002a957a6000 r-xp 0000000000000000 03:41 6307844 /lib64/tls/libc-2.3.2.so
0000002a957a6000-0000002a9586b000 ---p 000000000013b000 03:41 6307844 /lib64/tls/libc-2.3.2.so
0000002a9586b000-0000002a958ab000 rw-p 0000000000100000 03:41 6307844 /lib64/tls/libc-2.3.2.so
0000002a958ab000-0000002a958af000 rw-p 0000000000000000 00:00 0
0000007fbfffd000-0000007fc0000000 rwxp ffffffffffffe000 00:00 0
Tom
--
Tom Hughes (th...@cy...)
Software Engineer, Cyberscience Corporation
http://www.cyberscience.com/
|
|
From: Jeremy F. <je...@go...> - 2004-03-08 10:25:41
|
On Sun, 2004-03-07 at 23:19, Tom Hughes wrote:

> I don't think that's true at all - our x86-64 box seems to map code
> outside the 4G range. Look at libc in this map:

I think the restriction is on executables; they can't be outside the 2G limit. Shared objects use EIP-relative addressing, so they don't really care where they're placed.

The x86-64 ABI talks about small, kernel, medium and large models; small is where text and data are below 2G; kernel is mapped into the negative 2G part of the address space; medium forces text to be under 2G, but data can be higher; large has no restrictions. gcc/binutils doesn't implement large.

The upshot is that I think we can fit all of Valgrind's static text below the start of the client executable (4 MBytes should be enough space), and put the .so's and data way above the client address space.

J
|
|
From: Nicholas N. <nj...@ca...> - 2004-03-08 09:25:05
|
On Mon, 8 Mar 2004, Paul Mackerras wrote:

> However, it is painfully slow and it detects a lot of errors.
> Mozilla, for instance, runs about 500 times slower under valgrind than
> it does natively on my G5. It's not executing 500x as many
> instructions, so there must be something about the kinds of
> instruction sequences I am generating that cause the CPU to run a lot
> more slowly than it does on "normal" code. Maybe I am getting a lot
> of cache misses.

How does Memcheck compare with Nulgrind (--skin=none) and Addrcheck? A comparison there could be instructive.

N
|