|
From: Peng Du <imd...@gm...> - 2009-08-17 21:19:03
|
Hello, everyone

A newbie question, according to Valgrind's manual for the lackey tool:

"It (lackey) could be made to run a lot faster by doing a slightly more sophisticated job of the instrumentation ..."

Now I need a very simple memory read/write counting tool, just like lackey. But the tool has to be fast.

Can anyone elaborate a little on how to make lackey A LOT FASTER? Or has anyone done so? If yes, do you mind sharing the source?

Thank you in advance!
|
From: Nicholas N. <n.n...@gm...> - 2009-08-17 22:24:58
|
On Tue, Aug 18, 2009 at 7:18 AM, Peng Du<imd...@gm...> wrote:
> Hello, everyone
>
> A newbie question, according to Valgrind's manual for the lackey tool:
>
> "It (lackey) could be made to run a lot faster by doing a slightly more
> sophisticated job of the instrumentation ..."
>
> Now I need a very simple memory read/write counting tool, just like
> lackey. But the tool has to be fast.
>
> Can anyone elaborate a little on how to make lackey A LOT FASTER? Or
> has anyone done so? If yes, do you mind sharing the source?

I see two possibilities:

- Currently there is one C function call per original instruction. You could batch these up to a degree, e.g. count multiple instructions with a single C function call. Cachegrind does something like this.

- You could use inline (Vex IR) instrumentation rather than C function calls. See VEX/pub/libvex_ir.h for details of Vex IR.

You could also do both of these together. Doing so is left as an exercise to the reader.

Even if you do all that, the tool would still have a significant slowdown -- the limit case is Nulgrind (--tool=none), which does no instrumentation and typically has a 5x slow-down. I don't know if that is fast enough for your purposes.

You could look at Pin or DynamoRIO as alternative instrumentation frameworks that are better suited to simple tools such as the one you need. Pin in particular may have such a tool included in its distribution. Or you could look at hardware performance counters, if they provide the information you need.

Nick
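To make the batching idea concrete, here is a minimal sketch in Python. It is a language-neutral model, not Valgrind or Cachegrind code: `Block`, `Counters`, `run_per_instruction` and `run_batched` are hypothetical names invented for illustration. It assumes the number of loads and stores of each instruction is known when the block is instrumented, and that a block always runs to completion once entered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical basic block: one (n_loads, n_stores) pair per instruction."""
    instrs: tuple


class Counters:
    """Running totals, plus how many helper calls were needed to maintain them."""

    def __init__(self):
        self.loads = 0
        self.stores = 0
        self.helper_calls = 0

    def add(self, loads, stores):
        # Stands in for the C helper function called from instrumented code.
        self.helper_calls += 1
        self.loads += loads
        self.stores += stores


def run_per_instruction(blocks, trace, counters):
    """One helper call per executed instruction (the unbatched scheme)."""
    for block_index in trace:
        for loads, stores in blocks[block_index].instrs:
            counters.add(loads, stores)


def run_batched(blocks, trace, counters):
    """One helper call per executed block, using totals computed up front."""
    # "Instrumentation time": sum each block's counts once.
    totals = [
        (sum(l for l, _ in b.instrs), sum(s for _, s in b.instrs))
        for b in blocks
    ]
    # "Run time": a single call per block execution.
    for block_index in trace:
        counters.add(*totals[block_index])
```

Both schemes produce the same load and store totals; only the number of helper calls differs. The run-to-completion assumption matters: if a block can be left early, the pending counts would have to be flushed before each exit, otherwise instructions that never ran get counted.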
|
From: Peng Du <imd...@gm...> - 2009-08-17 23:07:29
|
Nick,

Thanks for the hints. The penalty is expected. I've played with PIN and its pinatrace and MemTrace tools for a while. They worked and the performance is not too bad, though a bit slower than Valgrind.

I need the virtual address of each read/write, so HW counters can't do the job. I did consider Precise Event Based Sampling (PEBS), but PEBS only supports L1/L2 cache load miss events, so it can't do the job either. If HW counters could work, they would be the best solution, since they are the most lightweight. My second choice is a binary instrumentation framework like Valgrind.

I still have a question: is the memory trace generated by Valgrind or PIN representative enough to model real program behaviour, considering multi-threading?

Thanks
|
From: Nicholas N. <n.n...@gm...> - 2009-08-18 00:36:10
|
On Tue, Aug 18, 2009 at 9:07 AM, Peng Du<imd...@gm...> wrote:
>
> Thanks for the hints. The penalty is expected. I've played with PIN and
> its pinatrace and MemTrace tools for a while. They worked and the
> performance is not too bad, though a bit slower than Valgrind.

That's interesting. The conventional wisdom is that Valgrind is slower than Pin, especially for simple tools like this.

> Now I still have a question: is the memory trace generated by Valgrind
> or PIN representative enough to model real program behaviours, considering
> multi-threading?

Look at the comment at the top of lackey/lk_main.c. Towards the bottom of that comment there is a discussion of the inaccuracies in the memory trace.

Nick
|
From: John R. <jr...@bi...> - 2009-08-18 00:34:04
|
> Now I need a very simple memory read/write counting tool, just like
> lackey. But the tool has to be fast.

Memory is not simple any more. You must precisely define what is a read and what is a write. For example, what is a cache hit (level 1 [L1])? An L2 hit? L3? Coalesced consecutive writes?

If what you count is architectural read/write to data, then disassemble enough to recognize instructions and basic blocks. Modify the code in-place to increment a counter at the end of every basic block.
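As a small illustration of why the definition matters, the sketch below counts the same access trace in two ways. The trace format `(op, addr, size)`, the function names, and the coalescing rule are assumptions made up for this example; the rule (a write that starts exactly where the previous write ended joins the same burst, and any read ends the burst) is only one possible definition and does not describe any particular hardware.

```python
def count_architectural(trace):
    """Count every access individually. Returns (reads, writes)."""
    reads = sum(1 for op, _, _ in trace if op == "R")
    writes = sum(1 for op, _, _ in trace if op == "W")
    return reads, writes


def count_coalesced_writes(trace):
    """Count write bursts under the assumed coalescing rule described above."""
    bursts = 0
    prev_end = None  # address just past the previous write, or None if no open burst
    for op, addr, size in trace:
        if op == "W":
            if addr != prev_end:
                bursts += 1  # not contiguous with the previous write: new burst
            prev_end = addr + size
        else:
            prev_end = None  # an intervening read closes the current burst
    return bursts
```

A run of adjacent 4-byte stores counts as several architectural writes but as a single burst under this rule, so the two definitions can give quite different totals for the same program.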