From: Ivo R. <iv...@iv...> - 2017-08-24 08:39:18

Dear Valgrind developers,

I had just a thought: would it be possible for Valgrind/Memcheck to instrument a binary just once, save the instrumented code in a pre-image, and then execute it many times already instrumented?

My motivation is a huge binary which takes a lot of time to instrument and which is executed frequently during many, many test-suite runs without any human intervention.

What would be the major challenges here? My preliminary idea was that the trans-cache could request blocks either from VEX or from the pre-image.

Please let me know your thoughts!
Thank you,
I.

From: John R. <jr...@bi...> - 2017-08-24 13:02:09

> My motivation is a huge binary which takes a lot of time to instrument

Please show measurements to substantiate this claim. What is the total /usr/bin/size in bytes of all executable code in the main program and shared libraries?

Instrumenting is performed Just-In-Time, with a cache. Is the cache too small for the working set of instructions, and therefore thrashed much of the time, forcing the re-instrumentation of too many basic blocks? Or does even the first instrumentation of each basic block (that is executed) take too long? And how long is "too long"?

From: Diane M <dia...@gm...> - 2017-08-30 16:29:35

Ivo,

On Aug 24, 2017, at 4:39 AM, Ivo Raisr <iv...@iv...> wrote:

> I had just a thought if it would be possible for Valgrind/Memcheck to
> instrument a binary just once, save the instrumented code in a pre-image
> and then execute it many times already instrumented.
>
> My motivation is a huge binary which takes a lot of time to instrument and
> which is executed frequently during many many test suite runs without any
> human intervention.
>
> What would be the major challenges here?
> My preliminary idea was that trans-cache could request blocks either
> from VEX or from the pre-image.

So do you mean that valgrind would store translated blocks in a (memory-mapped) file and then search for them before translating and instrumenting? If so, this is interesting, but I suspect that it will not speed things up much. The search would have to be a lot faster than the disassembly plus instrument in order to make a big difference.

If you are considering translating the entire program and caching it, I think that would be much faster, but probably not possible with the way that valgrind works.

Diane

From: Ivo R. <iv...@iv...> - 2017-08-30 19:47:01

2017-08-30 18:24 GMT+02:00 Diane Meirowitz <Dia...@or...>:

>> I had just a thought if it would be possible for Valgrind/Memcheck to
>> instrument a binary just once, save the instrumented code in a pre-image
>> and then execute it many times already instrumented.
>> [...]
>
> So do you mean that valgrind would store translated blocks in a
> (memory-mapped) file and then search for them before translating and
> instrumenting? If so, this is interesting, but I suspect that it will not
> speed things up much. The search would have to be a lot faster than the
> disassembly plus instrument in order to make a big difference.
>
> If you are considering translating the entire program and caching it, I
> think that would be much faster, but probably not possible with the way
> that valgrind works.

You are right. Independently, Julian did some measurements recently which showed that the cost of JIT'ing perf/bz2 under Memcheck is only approx. 1/40 of the total run. More precisely, the whole Valgrind/Memcheck run of that program executed approx. 40 Gi instructions, of which the JIT subsystem executed just 1 Gi. I expect similar ratios for other programs, depending on their nature of course.

I.

From: Julian S. <js...@ac...> - 2017-08-31 08:25:14

> What would be the major challenges here?
> My preliminary idea was that trans-cache could request blocks either
> from VEX or from the pre-image.

I've thought about this a couple of times in the past but never did anything about it. One of the reasons is that I thought it would be difficult to do and actually get a win. Eg, for starting Firefox on Memcheck, the JIT needs to process about 500,000 blocks, giving about 300MB of instrumented code. If we say (perhaps somewhat optimistically) that the JIT can process about 10000 blocks/sec, then that is 50 seconds of computation. In order to get a win, we'd need to be able to at least compute a hash of the block to be jitted (based on the instruction bytes), find the offset of the block in our memory-mapped file, and pull in the relevant translation, all in around 100 microseconds. I might be persuaded that this is doable if the cache file is in the filesystem cache, but as soon as we hit backing storage (especially if it's a rotating disk) I think our prospects are poor.

None of that is the real reason I didn't pursue it, though. The real reason is address space layout randomization. Because different libraries get loaded at different addresses in subsequent runs, the hit rate on the cache will be zero for the libraries involved. This implies that the load address for the library somehow needs to be incorporated in the cache keys that we're using. That is so because the front ends (guest_amd64_toIR.c, etc) bake into the IR values derived from the program counter: branch target addresses, and PC-relative load/store addresses. I can't see any way around this without major re-engineering of the JITs, because we'd need to somehow parameterise the cache so that we could look up a translation independent of its load address, and then, if found, patch up the old version so it works for the "new" address.

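[Editorial sketch: the back-of-the-envelope arithmetic in the message above can be reproduced in a few lines; all figures are Julian's estimates from this message, not measurements.]

```python
# Break-even arithmetic for a persistent translation cache, using the
# estimates above: 500,000 blocks, 10,000 blocks/sec JIT throughput.
blocks = 500_000        # blocks JITted while starting Firefox on Memcheck
jit_rate = 10_000       # blocks per second (perhaps somewhat optimistic)

jit_seconds = blocks / jit_rate
print(jit_seconds)      # 50.0 seconds of JIT computation

# For the cache to win, hashing a block's instruction bytes, finding its
# offset in the memory-mapped file, and pulling in the translation must
# fit inside the per-block JIT time:
per_block_us = jit_seconds / blocks * 1_000_000
print(per_block_us)     # 100.0 microseconds per block
```
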
> If you are considering translating the entire program and caching it, I
> think that would be much faster,

Mhm, but then you have the problem of finding all the code that is part of the program, which is equivalent to solving the halting problem.

-----

For these reasons, my preference is to make the JIT faster, and ultimately to move to having a "two speed" JIT. That is, code is initially instrumented using a fast, low-quality JIT, to reduce latency and to gather branch and block-use statistics. When we decide a particular path is hot enough, those blocks are given to a slower, optimising JIT, so we ultimately get both low latency for cold paths and high performance for hot paths. This seems to be the "modern way".

Also, the optimising JIT can run in a helper thread, so in effect we never have to wait for it, because we can just use the unoptimised version of a (super)block until the optimised version is ready.

J
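[Editorial sketch: a minimal illustration of the two-speed policy described above. The threshold, the names, and the queue are invented for the sketch; this is not Valgrind code.]

```python
from collections import Counter

HOT_THRESHOLD = 1000          # invented; a real JIT would tune this

exec_counts = Counter()       # per-block execution counts in the fast tier
optimised = {}                # block addr -> optimised translation
opt_queue = []                # work for the (conceptually background) optimiser

def fast_translate(addr):
    return ("fast", addr)     # stand-in for the low-latency, low-quality JIT

def slow_translate(addr):
    return ("opt", addr)      # stand-in for the slower, optimising JIT

def dispatch(addr):
    """Choose a translation for the block at addr, tracking hotness."""
    if addr in optimised:                 # optimised version ready: use it
        return optimised[addr]
    exec_counts[addr] += 1
    if exec_counts[addr] == HOT_THRESHOLD:
        opt_queue.append(addr)            # hand off; keep running fast code
    return fast_translate(addr)

def run_optimiser():
    """What the helper thread would do: optimise the queued hot blocks."""
    while opt_queue:
        addr = opt_queue.pop()
        optimised[addr] = slow_translate(addr)
```

Until run_optimiser() has processed a block, dispatch() keeps returning the unoptimised translation, so execution never has to wait on the optimiser.
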
From: John R. <jr...@bi...> - 2017-09-01 14:03:56

>> If you are considering translating the entire program and caching it, I
>> think that would be much faster,
>
> Mhm, but then you have the problem of finding all the code that is part of
> the program, which is equivalent to solving the halting problem.

In practice it is not that hard; I've done it twice. Transitive closure of lexical calls, using as roots the .e_entry and all the function symbols, goes a long way. The rest is covered by C++ virtual function tables (or equivalent), and by recognizing the code for 'switch' statements. (Yeah, that's ugly + heuristic + compiler-dependent, but it works well enough after a few iterations.)
--
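
[Editorial sketch: the discovery scheme John describes reduces to a standard worklist closure. The call graph and names below are made up; in reality they would come from disassembly, the symbol table, vtables, and switch-table recognition.]

```python
def reachable(roots, direct_calls):
    """Transitive closure of lexical calls from a set of root functions.

    direct_calls maps a function to the set of functions it calls
    directly; indirect targets (vtables, switch tables) must be fed in
    as extra roots by separate heuristics, as the message notes.
    """
    seen, work = set(), list(roots)
    while work:
        fn = work.pop()
        if fn not in seen:
            seen.add(fn)
            work.extend(direct_calls.get(fn, ()))
    return seen

# Toy graph: _start (the .e_entry) calls main; 'handler' is only ever
# called through a function pointer, but it has a symbol, so it's a root.
calls = {"_start": {"main"}, "main": {"f", "g"}, "f": {"g"}}
print(sorted(reachable({"_start", "handler"}, calls)))
# ['_start', 'f', 'g', 'handler', 'main']
```
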
From: Julian S. <js...@ac...> - 2017-08-31 08:29:09

On 31/08/17 10:25, Julian Seward wrote:

> None of that is the real reason I didn't pursue it, though. The real reason
> is address space layout randomization. Because different libraries get loaded
> at different addresses in subsequent runs,

Duh. What I meant is "because *the same* library gets loaded at different ..".

J

From: John R. <jr...@bi...> - 2017-08-31 11:11:16

>> None of that is the real reason I didn't pursue it, though. The real reason
>> is address space layout randomization. Because different libraries get loaded
>> at different addresses in subsequent runs,
>
> Duh. What I meant is "because *the same* library gets loaded at different .."

Apply 'prelink' to [a copy of] each shared library, specifying the particular address at which the library was loaded when the VEX translation was performed.
--

From: John R. <jr...@bi...> - 2017-08-31 11:46:09

> My motivation is a huge binary which takes a lot of time to instrument and
> which is executed frequently during many many test suite runs without any
> human intervention.

If you have only a hammer [memcheck] then everything begins to look like a nail. Probably you should enlarge your toolbox instead of trying to optimize memcheck.

Please be less coy about the huge binary. How big is it? How many shared libraries? What is the total /usr/bin/size, particularly .text? What programming languages does it employ? How much address space does it use at run time? How much time does memcheck take? How many machines are you running round-the-clock for this test suite? [Yes, the numerical answers to each question do matter.]

Probably the software has very poor quality: few unit tests, undocumented design and implementation strategy, little or no consideration of testability. So: apply profiling to the subroutine call graph. Use code coverage analysis. Look at the bugs in the last year. Look at the changes to the source code in the last year [each change is a proxy for a bug]. Identify the 20% of the source that is responsible for 80% of the bugs. Attack that 20% using divide-and-conquer: there should be a three-level hierarchy of pieces. Develop the unit tests for the lowest-level pieces and the integration tests for each node in the hierarchy. Apply profiling + coverage + memcheck at each node.

(Expect 6 months. Hire two graduate students and a manager [perhaps yourself: but it will take 25% of your time].)
--

From: Diane M <dia...@gm...> - 2017-09-01 13:51:11

Julian,

Somehow this message wound up in my spam folder. This is a very interesting thread. Please see my comments below.

On Thu, Aug 31, 2017 at 4:25 AM, Julian Seward <js...@ac...> wrote:

>> If you are considering translating the entire program and caching it, I
>> think that would be much faster,
>
> Mhm, but then you have the problem of finding all the code that is part of
> the program, which is equivalent to solving the halting problem.

It depends how you go about it. Oracle's Discover does this, but it depends upon annotations inserted by the Studio compilers to tell it where functions start and end, when things in the .text section are actually not code, etc. So Discover is function-based, and Valgrind is code block-based. Obviously the latter is much more robust, but unfortunately slower.

> For these reasons, my preference is to make the JIT faster, and ultimately
> to move to having a "two speed" JIT. That is, where code initially is
> instrumented using a fast and low quality JIT, to reduce latency and to
> gather branch and block-use statistics. When we decide a particular path
> is hot enough then those blocks are given to a slower, optimising JIT, so
> we ultimately get both low latency for cold paths and high performance for
> hot paths. This seems to be the "modern way".
>
> Also, the optimising JIT can run in a helper thread, so in effect we never
> have to wait for it, because we can just use the unoptimised version of
> a (super)block until the optimised version is ready.

Those optimizing JITs can take a very long time, but I like the idea of using the slower code while waiting for the faster code to be optimized by a separate thread. Though there are issues with doing that, also having to do with maintaining correct program state when switching between the two.

Diane

From: Diane M <dia...@gm...> - 2017-09-01 13:52:42

Forgot to mention that we might be able to look up code in a cache using offsets from function symbols, rather than addresses... I still don't think this will be faster though.

Diane

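[Editorial sketch: Diane's suggestion amounts to keying the cache by position-independent coordinates rather than absolute addresses. The key shape and the instruction-byte hash guard are assumptions for illustration, not an existing Valgrind mechanism.]

```python
import hashlib

def translation_key(soname, symbol, offset, guest_bytes):
    """Cache key that survives ASLR: identify a block by which object
    and symbol it lives in, plus its offset within that symbol, rather
    than by its load address.  A hash of the instruction bytes guards
    against a rebuilt library silently reusing a stale translation."""
    digest = hashlib.sha256(guest_bytes).hexdigest()[:16]
    return (soname, symbol, offset, digest)

# Two runs that map libfoo at different base addresses still produce
# the same key for the same block:
k_run1 = translation_key("libfoo.so.1", "do_work", 0x42, b"\x55\x48\x89\xe5")
k_run2 = translation_key("libfoo.so.1", "do_work", 0x42, b"\x55\x48\x89\xe5")
print(k_run1 == k_run2)   # True
```

The open problem Julian raises still applies, though: even with a stable key, the cached translation body contains baked-in absolute addresses that would need patching on reuse.
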