From: Jeremy F. <je...@go...> - 2002-12-07 08:46:47
I fixed the LDT problem from the other day. The LDT code wasn't dealing
properly with a child thread inheriting a copy of the parent's LDT
state.
I did a comparison between the relative slowdown of P3 native:valgrind,
vs P4 native:valgrind. Previously the P4 was about twice as slow as the
P3, proportionally (that is, for a given benchmark, on the P3 a program
may have run, say, 10 times slower, whereas the same test run on a P4
would be about 20 times slower).
I'm pleased to say that the P3 and P4 are now equally slow - they're
both 5-10 times slower than native when run under Valgrind
(--skin=none). I suspect this is mostly to do with flags handling
improvements; pushf/popf must be proportionally worse for P4 than P3.
I also ran some experiments to batch together larger chunks of
compilation. I added the idea of "speculative translation", where
translating one basic block would attempt to follow jumps and translate
their targets too. Not surprisingly, doing this to every jump was
somewhat slower.
What is surprising is that when the speculation was reduced to following
the only direct jump in a basic block (ie, a jump to a basic block which
*must* be executed next), it is still a speed loss. I would have thought
that translating multiple basic blocks at once would take advantage of
the compiler being in cache, and amortize the cost of various
self-modifying-code interlocks, etc.
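To make the restricted form of speculation concrete, here is a rough sketch of "follow only the direct jump". This is not the actual Valgrind code: the Block type, translate_one, translate_speculative and the max_follow limit are all hypothetical names made up for illustration, and "translation" is reduced to setting a flag.

```c
#include <stddef.h>

#define NO_SUCC (-1)

/* Hypothetical toy model of a basic block (not Valgrind's data structure). */
typedef struct {
    int direct_succ;  /* index of the block an unconditional direct jump
                         leads to, or NO_SUCC if there is none */
    int translated;   /* nonzero once the block has been "translated" */
} Block;

/* Stand-in for the real translator: just marks the block.
   Returns 1 if it did work, 0 if the block was already translated. */
static int translate_one(Block *blocks, int idx)
{
    if (blocks[idx].translated)
        return 0;
    blocks[idx].translated = 1;
    return 1;
}

/* Translate block idx, then speculatively follow the chain of direct
   jumps for at most max_follow further blocks. Stops early at a block
   that is already translated, which also terminates loops.
   Returns the number of blocks newly translated. */
static int translate_speculative(Block *blocks, size_t nblocks,
                                 int idx, int max_follow)
{
    int done = 0;
    int followed = 0;

    while (idx >= 0 && (size_t)idx < nblocks) {
        if (!translate_one(blocks, idx))
            break;
        done++;
        if (followed == max_follow)
            break;
        idx = blocks[idx].direct_succ;
        followed++;
    }
    return done;
}
```

With max_follow set to 0 this degenerates to translating one block at a time, so the two strategies can be compared by changing a single parameter.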
I suspect that VG_(search_transtab) is the problem, since it collapses
into a linear scan of the entire TT when it is full and the address
you're searching for isn't present. Maybe some hashing will help.
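The failure mode I mean can be shown with a toy table. This is only a sketch under assumptions, not the real VG_(search_transtab): it assumes an open-addressed table with linear probing, and TT_SIZE, TTEntry, tt_hash, tt_insert and tt_search are hypothetical names. The point is that a failed lookup can only stop early at an empty slot, so once the table is full every miss examines every entry.

```c
#include <stddef.h>
#include <stdint.h>

#define TT_SIZE 8u  /* tiny hypothetical size, for illustration only */

typedef struct {
    uintptr_t orig_addr;  /* original code address; 0 marks an empty slot */
    void     *trans;      /* pointer to the translation (unused here) */
} TTEntry;

static TTEntry tt[TT_SIZE];

static unsigned tt_hash(uintptr_t addr)
{
    return (unsigned)(addr % TT_SIZE);
}

/* Insert with linear probing. Returns 1 on success, 0 if the table is full. */
static int tt_insert(uintptr_t addr, void *trans)
{
    unsigned i = tt_hash(addr);
    for (unsigned n = 0; n < TT_SIZE; n++) {
        if (tt[i].orig_addr == 0 || tt[i].orig_addr == addr) {
            tt[i].orig_addr = addr;
            tt[i].trans = trans;
            return 1;
        }
        i = (i + 1) % TT_SIZE;
    }
    return 0;
}

/* Look up addr; *probes receives the number of slots examined.
   A miss ends at the first empty slot, or after TT_SIZE probes
   when there is no empty slot left. */
static TTEntry *tt_search(uintptr_t addr, unsigned *probes)
{
    unsigned i = tt_hash(addr);
    *probes = 0;
    while (*probes < TT_SIZE) {
        (*probes)++;
        if (tt[i].orig_addr == addr)
            return &tt[i];
        if (tt[i].orig_addr == 0)
            return NULL;
        i = (i + 1) % TT_SIZE;
    }
    return NULL;
}
```

Keeping some slots empty (or otherwise bounding the probe length) keeps misses cheap; that is one possible way of avoiding the full scan, not necessarily the right fix for the real table.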
J