|
From: Josef W. <Jos...@gm...> - 2005-08-30 18:44:03
|
On Tuesday 30 August 2005 17:00, you wrote:
> Hi Josef
>
> We just released Valgrind 3.0.1 and I was wondering what the
> situation with callgrind is -- I did not hear from you for some
> weeks. I just looked at your web site and no callgrind for
> 3.0.X is visible :-(

Hi Julian,

yes, I have not had much time recently.

My current version works quite stably, but has known issues. Actually, with
the cache simulation switched on, results should be pretty OK already (for
AMD64, too).

IMHO two things are missing for a release:
1) Changing internals a little bit to support the multiple conditional
   exits in a BB
2) A solution to trigger special handling of the runtime linker

(1) is important for --collect-jumps=yes, i.e. for jump tracking. That
actually can be regarded as lower priority, as this is currently only useful
for assembler annotation. Trivially mapping jumps back to source annotation
currently is confusing at best; this would need a better solution in
KCachegrind (e.g. a control flow graph with loops marked).
Still, when cache simulation is switched off (default), I attribute an "Ir"
(instruction fetched) event to every instruction of a BB execution, and this
is wrong for BBs with multiple exits.

For (2) I need a heuristic for detecting the function that does lazy symbol
relocation in the runtime linker. Some background:
When I see a JMP crossing functions (e.g. A calls B, B jumps to C), I have two
possibilities to map this onto a call graph:
 a) A calls B, returns to A, A calls C
 b) A calls B, B calls C
I default to (b), as this gives nice graphs, e.g. with tail recursion.
With one exception: every call to a shared lib usually first calls the runtime
linker, which jumps to the right function at the end. So using (b) here
gives call chains with the relocation function showing up multiple times,
and this looks like recursive cycles to KCachegrind. This leads to profiles
showing most functions in a recursive cycle, which is bad. So I want (a) for
the relocation function.

x86 used "push <addr>;ret" at the end of relocation, and in a bad hack I used
this to decide for (a). But AMD64 uses a normal JMP instruction.
The thing is that with a stripped runtime linker, the relocation function
gets no symbol name (how is this possible?). How should I decide that I want
(a) in this case?
Ideas welcome.

> It would be really great to have callgrind/kcachgrind working for
> Valgrind 3.X/. Your tool is an excellent profiler -- I heard many
> people say so.

I have the impression that the call graph with exact call counts is often
all that people need. Cache simulator results are quite low level already.
To go further, you should use statistical sampling with performance counters
(e.g. OProfile with a good set of events). Simulation and sampling complement
each other quite nicely.

Josef

PS: I'm on vacation with my family for the next 10 days, but I'll try to put
up a beta version of callgrind/VG3 in the next few hours.

>
> Let me know if there is anything we can do to help.
>
> J
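[Editor's illustration] The difference between mappings (a) and (b) can be sketched with a small shadow-stack replay. This is only a toy model under stated assumptions, not Callgrind's implementation: the names build_call_graph, has_cycle, the "resolve" function and the event trace are all hypothetical, and real return detection is reduced to an explicit "ret" event.

```python
from collections import Counter


def build_call_graph(events, return_first=frozenset()):
    """Replay (kind, target) events against a shadow call stack.

    kind is "call", "jump" or "ret". A jump into another function is
    recorded as a call made by the jumping function (policy b), unless
    the jumping function is in return_first: then it is treated as
    having returned before its caller calls the target (policy a).
    """
    # Each frame is (function, entered_by_jump). A real "ret" unwinds all
    # jump-entered frames plus the call-entered frame underneath them.
    stack = [("main", False)]
    edges = Counter()
    for kind, target in events:
        if kind == "ret":
            while stack.pop()[1]:
                pass
            continue
        via_jump = kind == "jump"
        if via_jump and stack[-1][0] in return_first:
            # Policy (a): drop the jumper; the target takes over its frame.
            _, via_jump = stack.pop()
        edges[(stack[-1][0], target)] += 1
        stack.append((target, via_jump))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the caller -> callee graph."""
    graph = {}
    for caller, callee in edges:
        graph.setdefault(caller, []).append(callee)
    state = {}  # node -> 1 while on the current DFS path, 2 when finished

    def visit(node):
        if state.get(node) == 1:
            return True
        if state.get(node) == 2:
            return False
        state[node] = 1
        found = any(visit(nxt) for nxt in graph.get(node, []))
        state[node] = 2
        return found

    return any(visit(node) for node in list(graph))


# Hypothetical trace: main calls foo and foo calls bar, both via the resolver.
trace = [
    ("call", "resolve"), ("jump", "foo"),
    ("call", "resolve"), ("jump", "bar"),
    ("ret", None), ("ret", None),  # bar returns, then foo returns
]
policy_b = build_call_graph(trace)
policy_a = build_call_graph(trace, return_first={"resolve"})
print(has_cycle(policy_b), sorted(policy_b))
print(has_cycle(policy_a), sorted(policy_a))
```

With (b) everywhere, the resolver sits between every caller and callee, so the graph contains resolve -> foo -> resolve. Applying (a) to the resolver alone attaches foo and bar directly to their real callers and the cycle disappears.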
|
From: Josef W. <Jos...@gm...> - 2005-08-31 00:34:15
|
On Tuesday 30 August 2005 20:43, Josef Weidendorfer wrote:
> but I'll try to put up
> a beta version of callgrind/VG3 in the next few hours.

OK, here is a preview version of Callgrind for Valgrind 3.0.x.
It works on x86 and AMD64 with the regular KCachegrind release.
When using --dump-instr=yes, KCachegrind is able to show annotated
AMD64 assembler.

Download from:
  http://kcachegrind.sf.net/callgrind-0.9.13-VG30-alpha.tar.gz

Known issues:
* Without cache simulation (default), some instructions (<2%) are shown as
  executed even though they were not, leading to >100% cost in some cases.
  So better use --simulate-cache=yes.
* --collect-jumps=yes sometimes shows wrong sources/targets for jumps inside
  a function.
* On AMD64, calls into shared libs are routed via _dl_runtime_resolve,
  leading to false recursive cycles. Use --separate-recs=20 to minimize
  this effect.

Please send me feedback if there are compile/runtime errors or failed
assertions.

Thanks,
Josef
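[Editor's illustration] The first known issue comes from charging every instruction of a BB on each execution, even when the BB is left through an earlier exit. A minimal sketch of the arithmetic, with hypothetical function names and made-up numbers (this is not how Callgrind stores its counters):

```python
def ir_naive(bb_len, exit_counts):
    """Charge every instruction of the BB once per execution."""
    return bb_len * sum(exit_counts.values())


def ir_exact(exit_counts):
    """Charge only the instructions executed up to the exit that was taken.

    exit_counts maps "number of instructions executed before leaving the
    BB" to "how often the BB was left through that exit".
    """
    return sum(n_instr * count for n_instr, count in exit_counts.items())


# Hypothetical BB of 5 instructions with a side exit after instruction 2:
# 4 executions leave early, 6 run to the end.
exits = {2: 4, 5: 6}
print(ir_naive(5, exits), ir_exact(exits))  # 50 38
```

In this made-up case the naive count is 50 Ir against 38 actually fetched, so the instructions after the side exit are over-counted.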
|
From: Tom H. <to...@co...> - 2005-09-01 07:47:30
|
In message <200...@gm...>
Josef Weidendorfer <Jos...@gm...> wrote:
> For (2) I need a heuristic for detecting the function for lazy symbol
> relocation in the runtime linker. Some background:
> When I see a JMP crossing functions (e.g. A calls B, B jumps to C), I have two
> possibilities to map this onto a call graph:
> a) A calls B, returns to A, A calls C
> b) A calls B, B calls C
> I default to (b), as this gives nice graphs e.g. with tail recursion.
> With one exception: every call to a shared lib usually first calls the runtime
> linker, which jumps to the right function at the end. So using (b) here
> gives call chains with the relocation function multiple times showing up,
> and this looks like recursive cycles for KCachegrind. This leads to profiles
> showing most functions in a recursive cycle, which is bad. So I want (a) for
> the relocation function.
> x86 used "push <addr>;ret" at end of relocation, and in a bad hack I used this
> to decide for (a). But AMD64 uses a normal JMP instruction.
> The thing is, that with a stripped runtime linker, the relocation function
> gets no symbol name (how is this possible?). How should I decide that I want
> (a) in this case?
> Ideas welcome.
When this happens do you still know the original address that the
first jump went to? So if foo() calls bar() which is in a different
shared library then foo jumps to the PLT entry which jumps to the
dynamic linker which eventually jumps to bar. Do you know when that
last jump to bar happens the target address of the original jump from
foo into the PLT?
If you do then the trick is that when you read the ELF header for
each shared library you remember the address and size of the PLT
section for each one and if that original jump was into a PLT then
you know you've just done a lazy symbol resolution.
I guess the problem is that future calls will still go through
the PLT, and if the called function then makes a tail call it will get
confusing.
Maybe you have to look for a jump into the PLT followed by a jump
into ld.so followed by another jump... It's all getting a bit
horrible though ;-)
Tom
--
Tom Hughes (to...@co...)
http://www.compton.nu/
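[Editor's illustration] The bookkeeping Tom suggests, remembering the address and size of each object's PLT section and testing whether a jump target falls inside one, could look roughly like this. The class name, object names and addresses are hypothetical; reading the section headers from the ELF file is left out.

```python
import bisect


class PltRanges:
    """Sorted, non-overlapping [start, end) ranges of known PLT sections."""

    def __init__(self):
        self._starts = []
        self._ranges = []  # (start, end, object_name), sorted by start

    def add(self, start, size, object_name):
        """Record the PLT section of one loaded object."""
        idx = bisect.bisect_left(self._starts, start)
        self._starts.insert(idx, start)
        self._ranges.insert(idx, (start, start + size, object_name))

    def lookup(self, addr):
        """Return the object whose PLT contains addr, or None."""
        idx = bisect.bisect_right(self._starts, addr) - 1
        if idx < 0:
            return None
        _start, end, name = self._ranges[idx]
        return name if addr < end else None


plt = PltRanges()
plt.add(0x4000, 0x100, "libfoo.so")  # made-up addresses
plt.add(0x1000, 0x80, "app")
print(plt.lookup(0x1010), plt.lookup(0x2000))  # app None
```

A jump whose target maps to a PLT range is a candidate for lazy resolution; as noted above, that alone does not say whether the resolver actually ran.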
|
|
From: Josef W. <Jos...@gm...> - 2005-09-11 23:32:24
|
Hi Tom,

On Thursday 01 September 2005 09:47, Tom Hughes wrote:
> In message <200...@gm...>
>
> Josef Weidendorfer <Jos...@gm...> wrote:
> > For (2) I need a heuristic for detecting the function for lazy symbol
> > relocation in the runtime linker. Some background:
> > When I see a JMP crossing functions (e.g. A calls B, B jumps to C), I
> > have two possibilities to map this onto a call graph:
> > a) A calls B, returns to A, A calls C
> > b) A calls B, B calls C
> > I default to (b), as this gives nice graphs e.g. with tail recursion.
> > With one exception: every call to a shared lib usually first calls the
> > runtime linker, which jumps to the right function at the end. So using
> > (b) here gives call chains with the relocation function multiple times
> > showing up, and this looks like recursive cycles for KCachegrind. This
> > leads to profiles showing most functions in a recursive cycle, which is
> > bad. So I want (a) for the relocation function.
> > x86 used "push <addr>;ret" at end of relocation, and in a bad hack I used
> > this to decide for (a). But AMD64 uses a normal JMP instruction.
> > The thing is, that with a stripped runtime linker, the relocation
> > function gets no symbol name (how is this possible?). How should I decide
> > that I want (a) in this case?
> > Ideas welcome.
>
> When this happens do you still know the original address that the
> first jump went to? So if foo() calls bar() which is in a different
> shared library then foo jumps to the PLT entry which jumps to the
> dynamic linker which eventually jumps to bar.

Yes.

> Do you know when that
> last jump to bar happens the target address of the original jump from
> foo into the PLT?

Sorry, I do not understand this question...

> If you do then the trick is that when you read the ELF header for
> each shared library you remember the address and size of the PLT
> section for each one and if that original jump was into a PLT then
> you know you've just done a lazy symbol resolution.
>
> I guess the problem is that future calls will still go through
> the PLT and if the called functions then tail calls it will get
> confusing.

Yes. Calls will always go through PLT. At least on x86/AMD64, the PLT code
does an indirect jump, which usually first jumps into the runtime linker.
The "usually" is a problem here: for prelinked code, this is of course not
true. And then, a call into the runtime linker does not have to be about
symbol resolution at all...

Detecting PLT sections is no problem, and looking up if a function was
jumped to from PLT would also be possible, but this all is getting quite
complex and probably fragile.

> Maybe you have to look for a jump into the PLT followed by a jump
> into ld.so followed by a another jump... It's all getting a bit
> horrible though ;-)

Yes.
The easiest way would be to look at the symbol "dl_runtime_resolve".
I just do not understand why this symbol on some systems seems not to be
available. Does this happen on your system?

I think I'll go with the symbol name, which can be configured via command
line option.

Josef

>
> Tom
> ... The easiest way would be to look at the symbol "dl_runtime_resolve".
> I just do not understand why this symbol on some systems seems not to be
> available. ...

The glibc developers have decided that _dl_runtime_resolve is a private
symbol internal to glibc, and that it is nobody else's business. So in a
somewhat-recent "sanitizing pass", they removed _dl_runtime_resolve from
the list of exported symbols. (They did the same thing earlier with
_dl_relocate_object, but that one is so important that I re-export it when
I build audited glibc. See http://BitWagon.com/glibc-audit/glibc-audit.html )

For any module that actually does any resolving, _dl_runtime_resolve will
appear in _GLOBAL_OFFSET_TABLE_[2]. This is specified in the Application
Binary Interface (ABI), and the client code in the executable application's
Procedure Linkage Table (PLT) depends on it, so it will "never" change.
See "System V Application Binary Interface, x86-64 Architecture Processor
Supplement" [google for it]. The catch is that a module which is 100%
pre-linked [with no conflicts] does not do any resolving, so GOT[2] will
be 0 in that module.

For each particular architecture, the implementation of _dl_runtime_resolve
is in hand-written assembly language. Thus it changes rarely, has a
distinctive signature, and can be found quickly and reliably by a string
search of the entire .text PT_LOAD of the runtime loader. To figure out
the string to search for, inspect a non-stripped ld-linux with gdb.

--
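[Editor's illustration] The string search described above could be sketched as follows. Everything here is an assumption for illustration: the function names are hypothetical, the segment contents are synthetic, and the signature bytes are placeholders, not the real instruction bytes of any glibc build (those would have to be taken from a non-stripped ld-linux as described).

```python
def find_signature(text, load_addr, signature):
    """Return the addresses of all occurrences of signature in text."""
    hits = []
    pos = text.find(signature)
    while pos != -1:
        hits.append(load_addr + pos)
        pos = text.find(signature, pos + 1)
    return hits


def locate_resolver(text, load_addr, signature):
    """Accept the match only if it is unambiguous."""
    hits = find_signature(text, load_addr, signature)
    return hits[0] if len(hits) == 1 else None


# Placeholder bytes, NOT the real instruction bytes of any glibc build.
SIGNATURE = bytes.fromhex("deadbeef0102")
segment = bytes(16) + SIGNATURE + bytes(8)  # synthetic text segment
print(hex(locate_resolver(segment, 0x7000, SIGNATURE)))  # 0x7010
```

Rejecting zero or multiple matches keeps the heuristic from silently picking the wrong function when the signature is too short or the loader's code has changed.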