|
From: Milind C. <Mil...@ri...> - 2013-08-13 19:38:08
|
Hello, I am considering using Valgrind for heavy-weight instrumentation that would instrument each memory access. For detailed information about each access, I want to collect its full call path. Can someone tell me (at a high level) what call path collection techniques exist in Valgrind, and give me some details of the feature, such as the algorithmic complexity of getting the call stack, especially per instruction? Is it done using a stack unwinding technique, or is it built on the fly via call/ret instructions? How is a call path represented (is it an identifier for a call path object that can be queried at a later time)? Thanks Milind |
|
From: Milind C. <Mil...@ri...> - 2013-08-14 15:44:50
|
Resending my question. Is this the right mailing list to ask Valgrind internals questions? On Tue, Aug 13, 2013 at 12:19 PM, Milind Chabbi <Mil...@ri...> wrote: > [...] |
|
From: Philippe W. <phi...@sk...> - 2013-08-16 10:27:38
|
On Wed, 2013-08-14 at 08:44 -0700, Milind Chabbi wrote: > Resending my question. Is this the right mailing list to ask valgrind > internals question? Yes. > [...] A stack trace is obtained by calling VG_(get_StackTrace) (see pub_tool_stacktrace.h). For frequent events (e.g. stack traces of malloc, free, ... in memcheck), there will be a lot of identical stack traces. These can be maintained in a "dictionary of stack traces"; see pub_tool_execontext.h. So, an execontext corresponds more or less to an "identifier ... that can be queried at a later time". Getting stack traces is done via an unwind technique, so it is quite costly. Doing it for each memory access instruction will be very slow. The callgrind tool maintains the call stack on the fly via call/ret, but this call stack maintenance is specialised for callgrind. It might be an interesting project to generalise this callgrind call stack maintenance and make it "general" (i.e. part of the valgrind core framework) rather than tool specific. What kind of tool are you thinking of? It might be easier to extend callgrind, which already captures memory-access-related information and the call stack. Philippe |
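The "dictionary of stack traces" idea above can be pictured as a small interning table: identical traces are stored once and mapped to a stable integer id, so a per-event record costs one word instead of a whole trace. The sketch below is purely illustrative C under that assumption; all names are hypothetical, and Valgrind's real mechanism lives behind pub_tool_execontext.h and uses hashing rather than a linear scan.

```c
#include <stdint.h>
#include <string.h>

/* Toy "dictionary of stack traces": identical traces are interned and
   mapped to a small integer id.  Hypothetical names; not Valgrind's API. */

#define MAX_TRACES 1024
#define MAX_DEPTH  8

static uintptr_t traces[MAX_TRACES][MAX_DEPTH];
static int       trace_len[MAX_TRACES];
static int       n_traces = 0;

/* Return a stable id for the given trace, adding it if unseen.
   Linear scan for clarity; a real implementation would hash. */
int intern_trace(const uintptr_t *ips, int len)
{
    for (int i = 0; i < n_traces; i++) {
        if (trace_len[i] == len &&
            memcmp(traces[i], ips, len * sizeof(uintptr_t)) == 0)
            return i;
    }
    memcpy(traces[n_traces], ips, len * sizeof(uintptr_t));
    trace_len[n_traces] = len;
    return n_traces++;
}
```

With this shape, a tool that records one event per memory access only needs to store the returned id in shadow memory and can resolve it to a full trace later.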
|
From: Julian S. <js...@ac...> - 2013-08-16 14:29:53
|
On 08/16/2013 12:28 PM, Philippe Waroquiers wrote: > Getting stacktraces is done via unwind technique, so is quite costly. > Doing it for each memory access instruction will be very slow. Yes, it will slow down execution by a factor of several thousand compared to native execution -- I'd guess -- so you'll wind up with something that is unusably slow on anything except the smallest problems. J |
|
From: Milind C. <Mil...@ri...> - 2013-08-16 18:37:49
|
Philippe, >> The callgrind tool maintains the callstack on-the-fly via call-ret, but this callstack maintenance is specialised for callgrind. I noticed that Helgrind shows the full call stacks of the 2 conflicting accesses in a data race. Does it maintain a call path identifier associated with each memory access in shadow memory? If it does, doesn't Helgrind need to obtain the call stack on each memory access instruction? Does Helgrind also build the stack on the fly via call/ret and maintain a dictionary of call paths? On Fri, Aug 16, 2013 at 3:28 AM, Philippe Waroquiers < phi...@sk...> wrote: > [...] |
|
From: Philippe W. <phi...@sk...> - 2013-08-16 18:24:48
|
On Fri, 2013-08-16 at 11:16 -0700, Milind Chabbi wrote: > I noticed that Helgrind shows the full call stacks of the 2 conflicting > accesses in a data race. Does it maintain a call path identifier > associated with each memory access in shadow memory? If it does, > doesn't Helgrind need to obtain the call stack on each memory access > instruction? Does Helgrind also build the stack on the fly via > call/ret and maintain a dictionary of call paths? helgrind does not use call/ret, but uses VG_(get_StackTrace) (limited to 8 frames from what I can see). I do not know if helgrind does that for all memory accesses or only for a subset of them. Philippe |
|
From: Niall D. <ndo...@bl...> - 2013-08-16 14:41:11
|
> Getting stacktraces is done via unwind technique, so is quite costly. > Doing it for each memory access instruction will be very slow. > > The callgrind tool maintains the callstack on-the-fly via call-ret, but this callstack > maintenance is specialised for callgrind. > It might be an interesting project to generalise this callgrind callstack > maintenance and make it "general" (i.e. part of the valgrind core framework) > rather than tool specific. Note that for non-x86/x86_64 targets, callgrind's optimized callstack generator isn't reliable. There is scope to greatly improve valgrind's callstack generator in general if we had an unwind table parser (see -funwind-tables in GCC). libunwind has one, but it needs libc, which makes it unsuitable for valgrind. Also, Julian has some ARM EXIDX unwind table parser code he can point you at if you're interested. As for how to start on a callpath collection tool, I'd say an improved callstack generator for valgrind would be an excellent beginning, especially as it would greatly improve all valgrind tooling. Niall --- Opinions expressed here are my own and do not necessarily represent those of BlackBerry Inc. |
|
From: Philippe W. <phi...@sk...> - 2013-08-16 15:26:18
|
On Fri, 2013-08-16 at 14:25 +0000, Niall Douglas wrote: > Note that for non-x86/x86 callgrind's optimized callstack generator isn't > reliable. Is there a way to identify (at instrumentation time or at runtime) the instructions which lead to this unreliability? Then for such instructions, a more reliable but slower technique could be used (e.g. unwind information, see below). > There is scope to greatly improve valgrind's callstack generator in general > if we had an unwind table parser (see -funwind-tables in GCC). libunwind has > one, but it need libc which makes it unsuitable for valgrind. Not too sure how to understand the above: Valgrind today generates stacktraces using unwind information; see e.g. in coregrind pub_core_debuginfo.h, m_stacktrace.c, and the files in coregrind/m_debuginfo. > Also, Julian > has some ARM EXIDX unwind table parser code he can point you at if you're > interested. What is the difference between these unwind tables and the DWARF unwind tables? Philippe |
|
From: Niall D. <ndo...@bl...> - 2013-08-16 15:37:39
|
> > Note that for non-x86/x86 callgrind's optimized callstack generator > > isn't reliable. > Is there a way to identify (at instrumentation time or at runtime) the > instructions which lead to this unreliability? > Then for such instructions, a more reliable but slower technique could be used > (e.g. unwind information, see below). Josef and Julian will know the answer to this, but essentially the problem is that it is very hard to (quickly) reconstruct what's going on precisely on ARM from just the stack alone. > > There is scope to greatly improve valgrind's callstack generator in > > general if we had an unwind table parser (see -funwind-tables in GCC). > > libunwind has one, but it needs libc which makes it unsuitable for > > valgrind. > Not too sure how to understand the above: Valgrind today generates > stacktraces using unwind information, see e.g. in coregrind > pub_core_debuginfo.h, m_stacktrace.c and files in coregrind/m_debuginfo. Correct, though not in callgrind nor cachegrind. It is rather slow and memory hungry, though. > > Also, Julian > > has some ARM EXIDX unwind table parser code he can point you at if > > you're interested. > What is the difference between these unwind tables and the dwarf unwind > tables ? ARM EXIDX unwind tables are specified by the ARM ABI for C++ exception unwinding. They are not a DWARF format, though they can be (slowly) converted into a subset of DWARF format, which is what Julian's code does. A native ARM EXIDX unwind table parser would very significantly improve stack backtrace performance on ARM valgrind, as well as making callgrind and cachegrind actually useful on ARM. Niall |
|
From: Vasily G. <vas...@gm...> - 2013-08-22 09:30:02
|
Dear Mr. Douglas, I have also started some work with ARM EXIDX and Valgrind. You can see some code here: https://breakpad.appspot.com/480002/ I also asked Mr. Seward about EXIDX and, as far as I understand (maybe wrongly!), it is not a good way forward for Valgrind (EXIDX will be deprecated on ARM?!). The second point is my request at http://news.gmane.org/gmane.comp.debugging.valgrind ("Request for ideas for Valgrind ..."). So, if we can estimate that EXIDX would somehow improve call stacks in Valgrind on ARM (is there a simple example?), I will be really happy to support the implementation of this feature. If you can spend some time writing some more description of the problem, and of which parts of Valgrind's source I can look into to understand the current algorithm and its problems, that would be great! There is also a bug at https://bugs.kde.org/show_bug.cgi?id=289578 Thank you in advance, Vasily Golubev On Fri, Aug 16, 2013 at 7:37 PM, Niall Douglas <ndo...@bl...> wrote: > [...] -- Best Regards, Vasily |
|
From: Josef W. <Jos...@gm...> - 2013-08-23 23:47:08
|
On 16.08.2013 17:37, Niall Douglas wrote: >>> Note that for non-x86/x86 callgrind's optimized callstack generator >>> isn't reliable. > Josef and Julian will know the answer to this, but essentially the problem is > that it is very hard to (quickly) reconstruct what's going on precisely on ARM > from just the stack alone. I think that's one important point, yes. To be robust in the face of manually written assembly code and longjmps, callgrind regularly checks whether its shadow stack is still in sync with the real stack, with the help of the stack pointer. For this strategy to work, calls/returns should change the stack pointer, which is not the case on ARM. Further, e.g. symbol resolution by the dynamic linker and tail recursion have special casing, taking x86 ABI conventions into account. This is needed to make the resulting call paths more useful (e.g. to get rid of PLT dispatching, or to detect jumps between functions as tail recursion optimization). This is more about using the right heuristics, tuned for the architecture and ABI conventions. In the end, I think that with careful heuristics we can make the call stack tracing on ARM similarly useful as it is on x86 now. But I never found the time for that. It may involve syncing the shadow stack with the real stack via unwind tables (stopping at the right time to keep it fast). More or less all of the call/return tracing heuristics are found in the ugly setup_bbcc function in callgrind/bbcc.c - a first step would be to clean it up to allow for different heuristics on different architectures in the first place. And yes, it would be nice to somehow make it part of the Valgrind core. However, I am not sure about the best interface. It could be some internal instrumentation which feeds call/return events to tool functions, and allows requesting the current state of an internal shadow stack. Josef |
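The stack-pointer consistency check Josef describes can be sketched in a few lines: each shadow frame records the stack pointer at call time, and a resync pass pops every frame whose saved SP lies below the current SP, which is how a longjmp over several frames gets absorbed. This is an illustrative C simulation under stated assumptions (downward-growing stack, hypothetical names); the real, far more involved logic is in callgrind/bbcc.c.

```c
/* Toy shadow call stack in the spirit of callgrind's heuristic:
   frames record SP at call time, and shadow_sync() drops frames the
   real stack has already unwound.  Hypothetical sketch, not real code. */

#define MAX_FRAMES 64

typedef struct { unsigned long fn; unsigned long sp_at_call; } Frame;

static Frame shadow[MAX_FRAMES];
static int   depth = 0;

void shadow_call(unsigned long fn, unsigned long sp)
{
    if (depth < MAX_FRAMES) {
        shadow[depth].fn = fn;
        shadow[depth].sp_at_call = sp;
        depth++;
    }
}

/* Resync: assuming the stack grows downward, a current SP above a
   frame's call-time SP means that frame has already returned (or was
   skipped by longjmp), so pop it.  Returns the new shadow depth. */
int shadow_sync(unsigned long cur_sp)
{
    while (depth > 0 && shadow[depth - 1].sp_at_call < cur_sp)
        depth--;
    return depth;
}
```

The point of the sketch is the failure mode on ARM: if calls do not move SP, `sp_at_call` carries no information and this resync cannot tell live frames from dead ones.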
|
From: Emilio C. <er...@gm...> - 2013-08-26 13:36:01
|
Hi all, > More or less all of the call/return tracing heuristic is found in > the ugly setup_bbcc function in callgrind/bbcc.c - a first step > would be to clean it up to allow for different heuristics on > different architectures in the first place. You can also check aprof's callstack code (https://code.google.com/p/aprof/source/browse/branches/stable/aprof/callstack.c), which is highly inspired by callgrind's code. > As how to start into a callpath collection tool, I'd say an improved > callstack generator for valgrind would be an excellent beginning, especially > as it would greatly improve all valgrind tooling. I am really interested in this. I am not a Valgrind expert but I will be happy to help. Emilio. |
|
From: Vasily G. <vas...@gm...> - 2013-09-18 11:09:24
|
Hello, Mr. Coppa. I read your articles "Input-Sensitive Profiling" and "Multithreaded Input-Sensitive Profiling". As far as I understand, aprof measures the number of BBs executed in a particular function. Could I kindly ask you to explain why you measure BBs rather than the NUMBER OF INSTRUCTIONS? One BB may contain 3 - 50 INSNs. You said that "Measuring basic blocks rather than time has several other advantages, very well motivated in [8]". I read [8], "Measuring Empirical Computational Complexity", too. But they measure the number of times each BB was executed, while in aprof you measure approximately (number of BBs executed in a function) * (number of calls to this function) for a constant amount of input data (RMS). As far as I understand, these are completely different values. Thank you in advance, Vasily On Mon, Aug 26, 2013 at 5:35 PM, Emilio Coppa <er...@gm...> wrote: > [...] -- Best Regards, Vasily |
|
From: Vasily G. <vas...@gm...> - 2013-08-29 04:41:56
|
Mr. Coppa, Thank you. It is very helpful to read the two code bases (callgrind + aprof) to better understand the underlying algorithms. Unfortunately, I still don't clearly understand entry/exit tracing on x86 and, of course, have no idea yet for ARM. But I am working on it. Mr. Weidendorfer, I noticed that the heuristics for x86 are based on the fact that we can identify all entries/exits by analyzing each BB at its start. I mean the fact that entry into or exit from a function can occur only at the first IMark of a BB (superblock). Is this true in general? Or have I misunderstood the logic of the analysis? Vasily On Mon, Aug 26, 2013 at 5:35 PM, Emilio Coppa <er...@gm...> wrote: > [...] -- Best Regards, Vasily |
|
From: Josef W. <Jos...@gm...> - 2013-08-30 00:14:51
|
On 29.08.2013 06:41, Vasily Golubev wrote: > I noticed that heuristics for x86 based on the fact that we can identify > all entry/exit analyzing BB for BB at start. I mean the fact that entry > in/exit from function can occur only as the first IMark at BB > (superblock). Callgrind is asking VEX to feed BBs (not SBs) of guest code to its instrumentation function. > Is it true in general? As a BB is a contiguous run of bytes in memory: yes, it is true that a function entry can only start at the beginning of a (dynamic) BB, and a function exit needs to be at the end of a BB (which is then detected at the beginning of the next BB). > Or I misunderstand logic of analysis? No, your analysis was correct. Josef |
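The consequence of Josef's answer is that an instrumenter never has to look inside a BB for a function entry: comparing the BB's start address against a table of known function entry points is enough, and the comparison can happen once at instrumentation time. A minimal illustrative C sketch, with a hypothetical linear-scan table (a real tool would use a sorted table or hash):

```c
#include <stddef.h>

/* Because a translated BB is a contiguous run of guest code, a function
   entry can only coincide with the first address of a BB.  So entry
   detection reduces to one lookup per BB.  Hypothetical sketch. */
int is_fn_entry(unsigned long bb_addr,
                const unsigned long *fn_starts, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (fn_starts[i] == bb_addr)
            return 1;
    return 0;
}
```

Note this only covers the entry side; exits are detected indirectly, at the start of the *next* BB, as the thread above explains.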
|
From: Vasily G. <vas...@gm...> - 2013-09-18 15:42:02
|
Hello, Mr. Weidendorfer. I am trying to propose a technique for improving the call graph on ARM. Could I kindly ask you to review one idea? I think that we can find the ENTRANCE to and EXIT from a function by analyzing the sequence of BBs. At runtime we can have a table mapping Name_of_function -> Start_address. It is available from debug info or from the tables available in the binary (and libraries). As far as I understand, we can test (bb_addr == one_of_the_addresses in our table) and if true -> we are ENTERING a function. So, we can track ALL entrances into functions. And I propose to add one internal structure (in fact, a simple STACK). If we find an ENTRANCE, we push guest_lr on our STACK. During the test (bb_addr(bb) == one_of_the_addresses in our table), if it fails we can also test (bb_addr(bb) == TOP_ELEMENT_IN_OUR_STACK). And if true, we are EXITING a function. So, we can track all EXITS from functions. Some problems: 1. I am not sure that we can test all conditions at runtime if we want to be fast enough. Maybe we can somehow organize a set of tables: at first, identify the object (file) and then analyze within it. 2. Maybe we have to test not only against the TOP_OF_STACK, but also search deeper (in case of some optimizations). 3. As far as I understand, each exit from a function (the next instruction to run, in fact) will be the start of a new BB. Or maybe I am wrong? 4. I am not sure that tracking ENTRY and EXIT will be enough for callgrind's purposes. Thank you in advance, Vasily On Fri, Aug 30, 2013 at 4:14 AM, Josef Weidendorfer < Jos...@gm...> wrote: > [...] -- Best Regards, Vasily |
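The proposal above can be simulated in a few lines of illustrative C: on a BB whose address is a known function start, push the guest link register (the would-be return address) onto a stack; if instead the BB address equals the top of that stack, treat it as a function exit and pop. All names are hypothetical, and the follow-up reply in this thread explains why trusting raw LR values makes this fragile (jumps that are not calls leave stale entries).

```c
#include <stddef.h>

/* Simulation of the proposed entry/exit detection via a guest_lr stack.
   Purely illustrative; not how callgrind actually does it. */

#define MAXD 64
static unsigned long ret_stack[MAXD];
static int ret_depth = 0;

/* Returns +1 for a detected entry, -1 for a detected exit, 0 otherwise. */
int classify_bb(unsigned long bb_addr, unsigned long guest_lr,
                const unsigned long *fn_starts, size_t n_fns)
{
    for (size_t i = 0; i < n_fns; i++) {
        if (fn_starts[i] == bb_addr) {          /* BB starts a function */
            if (ret_depth < MAXD)
                ret_stack[ret_depth++] = guest_lr;
            return 1;
        }
    }
    if (ret_depth > 0 && ret_stack[ret_depth - 1] == bb_addr) {
        ret_depth--;                            /* BB is a return target */
        return -1;
    }
    return 0;
}
```

Problem 2 from the list above shows up directly here: a longjmp would leave several stale entries below the top, so a real implementation would need to search (and discard) deeper than one slot.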
|
From: Josef W. <Jos...@gm...> - 2013-09-18 19:18:20
|
Hi,
Am 18.09.2013 17:41, schrieb Vasily Golubev:
> I am trying to propose some technique for improvement of callgraph on
> ARM. Could I kindly ask you to review one idea?
> I think that we can find ENTRANCE and EXIT of function by analysis of
> sequence of BBs. In runtime we can have Table with
> Name_of_function->Start_address.
> It is available from debug info or from available tables in binary (and
> libraries). As far as I understand, we can test (bb_addr ==
> one_of_address in our Table) and if true -> we are ENTERING in function.
> So, we can track ALL entrances in functions.
So you unconditionally assume a function entry whenever a given
instruction is executed? What if a function has an empty prolog, and
a loop always jumps back to the first instruction of the function?
Anyway. You do not need a table for that, as this property can be
detected at instrumentation time.
> And I propose to add one internal structure (in fact, simple STACK).
Callgrind already maintains a shadow stack.
> If
> we find ENTRANCE, we push guest_lr on our STACK. During testing
> (bb_addr(bb) == one_of_address in our Table) if fail we can also test
> (bb_addr(bb) == TOP_ELEMENT_IN_OUR_STACK). And if true, we are EXITING
> from function. So, we can track all EXITS from functions.
So you assume that the LR found when executing the first BB of a
function always represents a return address?
If the code just jumps to a function, LR may have an arbitrary value,
which never gets popped from the shadow stack. This may easily result
in an overflow. What to do then?
> Some problems:
> 1. I am not sure that we can test all conditions at runtime if we want
> to work fast enough. Maybe we can somehow organize a set of Tables. At
> first, identify the object (file) and then analyze within it.
> 2. Maybe we have to test not only with TOP_OF_STACK, but also try to
> find deeper (in case of some optimizations).
Yes. It should work with longjmps, returning from multiple stack frames
at once.
For that reason, Callgrind uses the stack pointer to regularly sync its
shadow stack. Exactly this is not really working on ARM, as not every
entry/exit of a function changes the stack pointer.
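A minimal sketch of this SP-based resynchronisation (illustrative names and structure only, not Callgrind's actual data structures):

```c
/* Shadow call stack resynced via the guest stack pointer.  On targets
 * where the stack grows downwards, any frame whose saved SP is below
 * the current SP must already have been left (return, longjmp, ...). */
typedef struct {
    unsigned long ret_addr;  /* expected return address          */
    unsigned long sp;        /* guest SP at the time of the call */
} Frame;

static Frame shadow[4096];
static int depth = 0;

/* On a detected call: record the return address and the current SP. */
static void on_call(unsigned long ret_addr, unsigned long sp)
{
    if (depth < 4096) {
        shadow[depth].ret_addr = ret_addr;
        shadow[depth].sp = sp;
        depth++;
    }
}

/* Regular sync point: drop every frame left behind by returns that
 * were never observed as explicit return instructions. */
static void sync_with_sp(unsigned long sp)
{
    while (depth > 0 && shadow[depth - 1].sp < sp)
        depth--;
}
```

This is exactly the part that breaks down on ARM, where a leaf function can be entered and left without SP ever moving.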
> 3. As far as I understand, each exit of function (next running
> instruction, in fact) will be the start of new BB. Or maybe I am wrong?
VEX works on superblocks (SBs), not basic blocks. Thus, it can
instrument multiple BBs at once if they are linked by unconditional
jumps ("chasing").
With tail recursion optimization, a function may end by jumping to
another function (in original source, this is a call).
Thus, the beginning of the next function could be inside an SB, and
would not be recognized by your approach.
However, Callgrind bypasses this issue by setting
VG_(clo_vex_control).guest_chase_thresh = 0;
So, the answer to your question is "yes".
> 4. I am not sure that it will be enough (track ENTRY and EXIT) for
> callgrind's purposes.
The main idea of callgrind's call-graph tracing is to produce a sensible
call graph for compiler-generated code, i.e. the graph
that is expected by a programmer, reflecting his program code.
However, this is not always possible, as different source may
have the same compiler output (e.g. iterative vs. recursive
version of the same algorithm).
And the technique must be robust on hand-crafted machine code.
I do not really know the ABI conventions used by compilers on
ARM, so I cannot really say if your proposal is useful/working.
In the end, you need to check the outcome of your proposal yourself
against what you would expect for a reasonably large code base.
It would be nice to have tests which check that the right call
graph was detected for a given code.
Josef
>
> Thank you in advance,
> Vasily
>
>
> On Fri, Aug 30, 2013 at 4:14 AM, Josef Weidendorfer
> <Jos...@gm...> wrote:
>
> Am 29.08.2013 06:41, schrieb Vasily Golubev:
> > I noticed that heuristics for x86 based on the fact that we can
> identify
> > all entry\exit analyzing BB for BB at start. I mean the fact that
> entry
> > in\exit from function can occur only as the first IMark at BB
> > (superblock).
>
> Callgrind is asking VEX to feed BBs (not SBs) of guest code
> to its instrumentation function.
>
> Is it true in general?
>
> As a BB is a continuous number of bytes in memory: yes, it is true that
> a function entry can only start at the beginning of a (dynamic) BB, and
> a function exit needs to be at the end of a BB (which then is detected
> at beginning of the next BB).
>
> Or I misunderstand logic of analysis?
>
> No, your analysis was correct.
>
> Josef
>
> >
> > Vasily
> >
> >
> > On Mon, Aug 26, 2013 at 5:35 PM, Emilio Coppa <er...@gm...> wrote:
> >
> > Hi all,
> >
> > More or less all of the call/return tracing heuristic is
> found in
> > the ugly setup_bbcc function in callgrind/bbcc.c - a first
> step
> > would be to clean it up to allow for different heuristics on
> > different architectures in the first place.
> >
> >
> > You can also check aprof's callstack code
> >
> <https://code.google.com/p/aprof/source/browse/branches/stable/aprof/callstack.c>,
> > highly inspired by callgrind's code.
> >
> > As how to start into a callpath collection tool, I'd say
> an improved
> > callstack generator for valgrind would be an excellent
> > beginning, especially
> > as it would greatly improve all valgrind tooling.
> >
> >
> > I am really interested in this. I am not a Valgrind expert but I
> > will be happy to help.
> >
> > Emilio.
> >
> >
> >
> > _______________________________________________
> > Valgrind-developers mailing list
> > Val...@li...
> > https://lists.sourceforge.net/lists/listinfo/valgrind-developers
> >
> >
> >
> >
> > --
> > Best Regards,
> > Vasily
> >
> >
> >
> >
> >
> >
> >
>
>
>
>
> --
> Best Regards,
> Vasily
|
|
From: Vasily G. <vas...@gm...> - 2013-10-06 05:52:55
|
Hello, all.
I would like to propose for discussion one approach to function-boundary
detection on ARM. I tried it on "expedit" - test N 48. It is not the
simplest case, but we can see from the sources (comparing them with
Callgrind's output) that Callgrind does not correctly identify the
return from _op_copy_p_dp_neon. In fact, if someone can provide a
minimal case that exposes the call-graph problems on ARM, it would be
really great to test my proposal on it.
First, I consulted the gcc-arm community. GCC does not (normally!)
generate an "unclear" call to a function on ARM. That is, the compiler
uses BL (which moves the return address into LR) or otherwise arranges
that at the first instruction of a function, LR indeed holds the return
address, in line with the calling convention. So I think it is
reasonable to rely on this fact.
Under Valgrind we have a set of basic blocks (BBs) generated
dynamically, and we can insert some code right before the first
instruction of any BB. The algorithm is:
1. At the start of a BB we ask Valgrind's core whether the address of
the BB is the start of some function. According to the rules of BB
generation, the start of any function will also be the start of some BB.
Valgrind's core uses the available information from .dynsym, .symtab and
debug info to answer.
2. If it is the ENTRY of a function, we push LR onto our local stack
structure (one per thread for multithreaded programs), because
according to the calling convention it is guaranteed that LR == return
address of this function.
3. If it is not an ENTRY, we compare the address of the BB with the top
of the local stack for the current thread. If they are equal, we have
found a return from a function, and we pop the local stack. Hint: to
handle tail-call optimizations like:
{
main:
    mov r0, r0
    bl func_1
    …
func_1:
    mov r0, r0
    b func_2
    // Nothing else; LR is unchanged.
func_2:
    // LR still holds the return address into main.
    mov pc, lr
}
we pop as long as the value at the top of the stack equals the address
of the BB.
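The three steps above could be sketched roughly as follows (all names are hypothetical; is_func_start() is a toy stand-in for the query to Valgrind's core):

```c
#define MAX_DEPTH 1024

static unsigned long lr_stack[MAX_DEPTH];  /* pushed LR values */
static int top = 0;

/* Toy stand-in for the core query against .dynsym/.symtab/debug info:
 * here, functions start at 0x100 (func_1) and 0x200 (func_2). */
static int is_func_start(unsigned long addr)
{
    return addr == 0x100 || addr == 0x200;
}

/* Instrumentation hook, run before the first instruction of each BB. */
static void on_bb_start(unsigned long bb_addr, unsigned long guest_lr)
{
    if (is_func_start(bb_addr)) {
        /* Step 2: function ENTRY - LR holds the return address. */
        if (top < MAX_DEPTH)
            lr_stack[top++] = guest_lr;
    } else {
        /* Step 3: pop every frame whose recorded return address we
         * just reached; tail calls leave several equal entries. */
        while (top > 0 && lr_stack[top - 1] == bb_addr)
            top--;
    }
}
```

Note how the tail-call case in the asm snippet above falls out naturally: entering func_2 pushes the unchanged LR a second time, and the return then pops both equal entries at once.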
Of course, there are some limitations, and I think I can provide
reasonable solutions:
1. At the first instruction of a function, LR may hold a "wrong value"
rather than the return address (some crazy asm implementation, no
calling convention).
    a. Solution: implement a new flag --skip-functions="name_1, name_2,
etc." to skip such functions in the call graph (aggregating their cost
into the caller). At this point the developer has to know the names of
these functions.
2. Some functions (like _Exit) never return. In gcc there is the
attribute "noreturn".
    a. Solution: implement a new flag --skip-noreturn-functions="name_1,
name_2, etc." to skip them in the call graph (aggregating their cost
into the caller). It is possible to analyze the source code manually to
prepare this list. The assumption is that it will not be a huge one.
3. If a function has an empty prolog and contains a loop that jumps
back to its first instruction, we will detect many function entries
instead of a loop.
    a. Solution: maybe this is not a common situation in real code? We
usually have to save some registers on the stack if we want free
registers in the body of a function... In any case, it can be handled
as in #1.
4. What about the time consumption of the proposed algorithm vs. the
current one?
    a. Solution: test, and think about a local cache mechanism. The
simplest implementation is 4 times slower than current Callgrind on
expedit. I think we can ask Valgrind's core for the size of a function
(once we have found its start) and maintain (A_start_previous_func,
A_end_previous_func). If bb_addr(BB) > A_start_previous_func and
bb_addr(BB) < A_end_previous_func, we need not call Valgrind's core to
ask whether this BB is the start of some function. This may
significantly decrease the time.
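A rough sketch of this (A_start_previous_func, A_end_previous_func) cache; core_lookup() is a hypothetical stand-in for the real (expensive) core query:

```c
static unsigned long cached_start = 0, cached_end = 0;
static int core_queries = 0;   /* counts expensive queries, for demo */

/* Hypothetical expensive query: reports the bounds of the function
 * containing addr and whether addr is a function entry.  Toy layout:
 * a single function occupying [0x100, 0x180). */
static int core_lookup(unsigned long addr,
                       unsigned long *start, unsigned long *end)
{
    core_queries++;
    *start = 0x100;
    *end   = 0x180;
    return addr == 0x100;
}

/* Fast path: while the BB address stays strictly inside the bounds of
 * the previously seen function, it cannot be a function entry, so the
 * core query is skipped entirely. */
static int is_entry(unsigned long bb_addr)
{
    if (bb_addr > cached_start && bb_addr < cached_end)
        return 0;
    return core_lookup(bb_addr, &cached_start, &cached_end);
}
```

The cache pays off because consecutive BBs of the same function dominate execution, so most calls hit the fast path.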
Any comments/suggestions are welcome. If someone has good test cases,
please provide them. For now I am trying to "hack" call-graph creation
based on my approach in the simplest way (one thread, static allocation
of my local stack structure, no (A_start, A_end), etc.). When I have
something to show, I will post it. Maybe it will be ready for the
FOSDEM meeting; I am not sure.
Best regards,
Vasily
On Wed, Sep 18, 2013 at 11:18 PM, Josef Weidendorfer
<Jos...@gm...> wrote:
> Hi,
>
> Am 18.09.2013 17:41, schrieb Vasily Golubev:
>> I am trying to propose some technique for improvement of callgraph on
>> ARM. Could I kindly ask you to review one idea?
>> I think that we can find ENTRANCE and EXIT of function by analysis of
>> sequence of BBs. In runtime we can have Table with
>> Name_of_function->Start_address.
>> It is available from debug info or from available tables in binary (and
>> libraries). As far as I understand, we can test (bb_addr ==
>> one_of_address in our Table) and if true -> we are ENTERING in function.
>> So, we can track ALL entrances in functions.
>
> So you unconditionally assume a function entry whenever a given
> instruction is executed? What if a function has an empty prolog, and
> a loop always jumps back to the first instruction of the function?
>
> Anyway. You do not need a table for that, as this property can be
> detected at instrumentation time.
>
>> And I propose to add one internal structure (in fact, simple STACK).
>
> Callgrind already maintains a shadow stack.
>
>> If
>> we find ENTRANCE, we push guest_lr on our STACK. During testing
>> (bb_addr(bb) == one_of_address in our Table) if fail we can also test
>> (bb_addr(bb) == TOP_ELEMENT_IN_OUR_STACK). And if true, we are EXITING
>> from function. So, we can track all EXITS from functions.
>
> So you assume that the LR found when executing the first BB of a
> function always represents a return address?
> If the code just jumps to a function, LR may have an arbitrary value,
> which never gets popped from the shadow stack. This may easily result
> in an overflow. What to do then?
>
>> Some problems:
>> 1. I am not sure that we can test all conditions at runtime if we want
>> to work fast enough. Maybe we can somehow organize a set of Tables. At
>> first, identify the object (file) and then analyze within it.
>> 2. Maybe we have to test not only with TOP_OF_STACK, but also try to
>> find deeper (in case of some optimizations).
>
> Yes. It should work with longjmps, returning from multiple stack frames
> at once.
>
> For that reason, Callgrind uses the stack pointer to regularly sync its
> shadow stack. Exactly this is not really working on ARM, as not every
> entry/exit of a function changes the stack pointer.
>
>> 3. As far as I understand, each exit of function (next running
>> instruction, in fact) will be the start of new BB. Or maybe I am wrong?
>
> VEX works on superblocks (SBs), not basic blocks. Thus, it can
> instrument multiple BBs at once if they are linked by unconditional
> jumps ("chasing").
> With tail recursion optimization, a function may end by jumping to
> another function (in original source, this is a call).
> Thus, the beginning of the next function could be inside an SB, and
> would not be recognized by your approach.
> However, Callgrind bypasses this issue by setting
> VG_(clo_vex_control).guest_chase_thresh = 0;
>
> So, the answer to your question is "yes".
>
>> 4. I am not sure that it will be enough (track ENTRY and EXIT) for
>> callgrind's purposes.
>
> The main idea of callgrind's call-graph tracing is to produce a sensible
> call graph for compiler-generated code, i.e. the graph
> that is expected by a programmer, reflecting his program code.
> However, this is not always possible, as different source may
> have the same compiler output (e.g. iterative vs. recursive
> version of the same algorithm).
> And the technique must be robust on hand-crafted machine code.
>
> I do not really know the ABI conventions used by compilers on
> ARM, so I cannot really say if your proposal is useful/working.
> In the end, you need to check the outcome of your proposal yourself
> against what you would expect for a reasonably large code base.
>
> It would be nice to have tests which check that the right call
> graph was detected for a given code.
>
> Josef
>
>
>>
>> Thank you in advance,
>> Vasily
>>
>>
>> On Fri, Aug 30, 2013 at 4:14 AM, Josef Weidendorfer
>> <Jos...@gm...> wrote:
>>
>> Am 29.08.2013 06:41, schrieb Vasily Golubev:
>> > I noticed that heuristics for x86 based on the fact that we can
>> identify
>> > all entry\exit analyzing BB for BB at start. I mean the fact that
>> entry
>> > in\exit from function can occur only as the first IMark at BB
>> > (superblock).
>>
>> Callgrind is asking VEX to feed BBs (not SBs) of guest code
>> to its instrumentation function.
>>
>> Is it true in general?
>>
>> As a BB is a continuous number of bytes in memory: yes, it is true that
>> a function entry can only start at the beginning of a (dynamic) BB, and
>> a function exit needs to be at the end of a BB (which then is detected
>> at beginning of the next BB).
>>
>> Or I misunderstand logic of analysis?
>>
>> No, your analysis was correct.
>>
>> Josef
>>
>> >
>> > Vasily
>> >
>> >
>> > On Mon, Aug 26, 2013 at 5:35 PM, Emilio Coppa <er...@gm...> wrote:
>> >
>> > Hi all,
>> >
>> > More or less all of the call/return tracing heuristic is
>> found in
>> > the ugly setup_bbcc function in callgrind/bbcc.c - a first
>> step
>> > would be to clean it up to allow for different heuristics on
>> > different architectures in the first place.
>> >
>> >
>> > You can also check aprof's callstack code
>> >
>> <https://code.google.com/p/aprof/source/browse/branches/stable/aprof/callstack.c>,
>> > highly inspired by callgrind's code.
>> >
>> > As how to start into a callpath collection tool, I'd say
>> an improved
>> > callstack generator for valgrind would be an excellent
>> > beginning, especially
>> > as it would greatly improve all valgrind tooling.
>> >
>> >
>> > I am really interested in this. I am not a Valgrind expert but I
>> > will be happy to help.
>> >
>> > Emilio.
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Vasily
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>>
>>
>>
>> --
>> Best Regards,
>> Vasily
>
--
Best Regards,
Vasily
|