Re: [Valgrind-developers] callpath collection in valgrind

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hello, all.

I would like to propose for discussion one assumption for function's
boundaries detection on ARM. I tried it on "expedit" - test N 48. It
is not a simplest case, but we can see from sources (comparing with
Callgrind's output) that Callgrind doesn't identify correctly return
from _op_copy_p_dp_neon. In fact, if someone can provide simplest case
that identifies problems of callgraph on ARM - it will be really great
to test my proposal on it.

At first, I consulted with gcc-arm Community. Gcc doesn't (normally!)
generates "unclear" call to function on ARM. It means: compilers uses
BL (move in LR return address) or somehow synthesizes the situation
when at the first instruction of function we indeed have in LR the
return address according to the Caller Conventions. So, I think it is
normal idea to rely on this fact.

Under Valgrind we have a set of Basic Blocks (BB) generated
dynamically. And we can insert some code right before the first
instruction of any BB. Let’s describe the algorithm:
1.    At the start of BB we ask Valgrind’s core if the address of the
BB is the start of some function. According to the rules of generation
of BB the start of any function will be the start of some BB.
Valgrind’s core used available information from .dynsym, .symtab and
debug-info to answer us.
2.    If it is the ENTRY in function we save LR at our local stack
structure (1 per each tread for multithreaded program), because
according to the calling convenient it is guaranteed that LR == return
address from this function.
3.    If it is not the ENTRY we compare the address of BB with the top
of the local STACK for current thread. If they are equal then we found
return from function and we do pop from local STACK. Hint: for working
with optimization like:
{
main:
mov r0, r0
bl func_1
…
func_1:
mov r0, r0
b func_2
// Nothing.

func_2:
// Now in LR we have RA to the main.
mov PC, LR
}
We will pop until value at the top of the STACK equals to the address of BB.

Of course, there are some limitations and I think I can provide
reasonable solution:

1.    At the first instruction of function in LR we have “wrong value”
rather than return address (some crazy asm implementation, no Calling
Convention) .
a.    Solution: Implement new flag –skip-functions=”name_1, name_2,
etc.” to skip it in call graph (aggregate cost in upper function). At
this point developer have to know names of this functions.
2.    Some functions (like _Exit) has no return. In gcc there is
attribute “noreturn”
a.    Solution: Implement new flag –skip-noreturn-functions=”name_1,
name_2, etc.” to skip it in call graph (aggregate cost in upper
function). It is possible manually to analyze source code to prepare
this list. There is the assumption that it will be not a huge one.
3.    If function has zero prolog and some loop that each time jumps
to the first instruction of this function we will detect a lot of
functions, not a loop.
a.    Solution: Maybe it is not a common situation in real code? I
mean we have to save some registers at stack if we want to have free
registers in body of function... In any case - it can be #1.
5.    What about the time consumption of the algorithm proposed VS
current algorithm.
a.    Solution: Test, think about local cache mechanism. Simplest
implementation is 4 times slower that current Callgrind on expedit. I
think we can ask Core (valgrind's) about size of function (if we found
start of function) and maintain (A_start_previous_func,
A_end_previous_func). If bb_addr(BB)  > A_start_previous_func and
bb_addr(BB) < A_end_previous_funct we haven't call to Valgrind's core
and ask if this BB is the start of some function. It may significantly
decrease time.

Any comments\suggestions are welcome. If someone have good test cases
- please, provide them. Now I am trying to "hack" creation of
callgraph based on my approach in the simplest way (1 thread, static
allocation of my local stack structure, not (A_start, A_end), etc.).
When I will have smth to show I will do it. And, maybe, it will be
ready for FOSDEM meeting, not sure.

Best regards,
Vasily

On Wed, Sep 18, 2013 at 11:18 PM, Josef Weidendorfer
<Jos...@gm...> wrote:
> Hi,
>
> Am 18.09.2013 17:41, schrieb Vasily Golubev:
>> I am trying to propose some technique for improvement of callgraph on
>> ARM. Could I kindly ask you to review one idea?
>> I think that we can find ENTRANCE and EXIT of function by analysis of
>> sequence of BBs. In runtime we can have Table with
>> Name_of_function->Start_address.
>> It is available from debug info or from available tables in binary (and
>> libraries). As far as I understand, we can test (bb_addr ==
>> one_of_address in our Table) and if true -> we are ENTERING in function.
>> So, we can track ALL entrances in functions.
>
> So you unconditionally assume a function entry whenever a given
> instruction is executed? What if a function has an empty prolog, and
> a loop always jumps back to the first instruction of the function?
>
> Anyway. You do not need a table for that, as this property can be
> detected at instrumentation time.
>
>> And I propose to add one internal structure (in fact, simple STACK).
>
> Callgrind already maintains a shadow stack.
>
>  If
>> we find ENTRANCE, we push guest_lr on our STACK. During testing
>> (bb_addr(bb) == one_of_address in our Table) if fail we can also test
>> (bb_addr(bb) == TOP_ELEMENT_IN_OUR_STACK). And if true, we are EXITING
>> from function. So, we can track all EXITS from functions.
>
> So you assume that the LR found when executing the first BB of a
> function always represents a return address?
> If the code just jumps to a function, LR may have an arbitrary value,
> which never gets popped from the shadow stack. This may easily result
> in an overflow. What to do then?
>
>> Some problems:
>> 1. I am not sure that we can tests all conditions in runtime if we want
>> to work fast enough. Maybe we can somehow organize a set of Tables. At
>> firs, identify object (file) and then analyze in it.
>> 2. Maybe we have to test not only with TOP_OF_STACK, but also try to
>> find deeper (in case of some optimizations).
>
> Yes. It should work with longjmps, returning from multiple stack frames
> at once.
>
> For that reason, Callgrind uses the stack pointer to regularly sync its
> shadow stack. Exactly this is not really working on ARM, as not every
> entry/exit of a function changes the stack pointer.
>
>> 3. As far as I understand, each exit of function (next running
>> instruction, in fact) will be the start of new BB. Or maybe I am wrong?
>
> VEX is working on superblocks (SB), not basic blocks. Thus, in can
> instrument multiple BBs at once, if they are linked by unconditional
> jumps ("chasing").
> With tail recursion optimization, a function may end by jumping to
> another function (in original source, this is a call).
> Thus, the beginning of the next function could be inside an SB, and
> would not be recognized by your approach.
> However, Callgrind bypasses this issue by setting
>   VG_(clo_vex_control).guest_chase_thresh = 0;
>
> So, the anwser to your question is "yes".
>
>> 4. I am not sure that it will be enough (track ENTRY and EXIT) for
>> callgrind's purposes.
>
> The main idea of callgrind's call graph tracing is to produce a sensible
> call graph for compiler generated code, ie. the graph
> that is expected by a programmer, reflecting his program code.
> However, this is not always possible, as different source may
> have the same compiler output (e.g. iterative vs. recursive
> version of the same algorithm).
> And the technique must be robust on hand-crafted machine code.
>
> I do not really know the ABI conventions used by compilers on
> ARM, so I cannot really say if your proposal is useful/working.
> In the end, you need to check the outcome of your proposal against
> what you would expect with a reasonable large code yourself.
>
> It would be nice to have tests which check that the right call
> graph was detected for a given code.
>
> Josef
>
>
>>
>> Thank you in advance,
>> Vasily
>>
>>
>> On Fri, Aug 30, 2013 at 4:14 AM, Josef Weidendorfer
>> <Jos...@gm... <mailto:Jos...@gm...>> wrote:
>>
>>     Am 29.08.2013 06:41, schrieb Vasily Golubev:
>>     > I noticed that heuristics for x86 based on the fact that we can
>>     identify
>>     > all entry\exit analyzing BB for BB at start. I mean the fact that
>>     entry
>>     > in\exit from function can occur only as the first IMark at BB
>>     > (superblock).
>>
>>     Callgrind is asking VEX to feed BBs (not SBs) of guest code
>>     to its instrumentation function.
>>
>>      Is it true in general?
>>
>>     As a BB is a continuous number of bytes in memory: yes, it is true that
>>     a function entry can only start at the beginning of a (dynamic) BB, and
>>     a function exit needs to be at the end of a BB (which then is detected
>>     at beginning of the next BB).
>>
>>      Or I misunderstand logic of analysis?
>>
>>     No, your analysis was correct.
>>
>>     Josef
>>
>>     >
>>     > Vasily
>>     >
>>     >
>>     > On Mon, Aug 26, 2013 at 5:35 PM, Emilio Coppa <er...@gm...
>>     <mailto:er...@gm...>
>>     > <mailto:er...@gm... <mailto:er...@gm...>>> wrote:
>>     >
>>     >     Hi all,
>>     >
>>     >         More or less all of the call/return tracing heuristic is
>>     found in
>>     >         the ugly setup_bbcc function in callgrind/bbcc.c - a first
>>     step
>>     >         would be to clean it up to allow for different heuristics on
>>     >         different architectures in the first place.
>>     >
>>     >
>>     >     You can also check aprof's callstack code
>>     >
>>     <https://code.google.com/p/aprof/source/browse/branches/stable/aprof/callstack.c>,
>>     >     highly inspired by callgrind's code.
>>     >
>>     >         As how to start into a callpath collection tool, I'd say
>>     an improved
>>     >         callstack generator for valgrind would be an excellent
>>     >         beginning, especially
>>     >         as it would greatly improve all valgrind tooling.
>>     >
>>     >
>>     >     I am really interested in this. I am not a Valgrind expert but I
>>     >     will be happy to help.
>>     >
>>     >     Emilio.
>>     >
>>     >
>>     >
>>     ------------------------------------------------------------------------------
>>     >     Introducing Performance Central, a new site from SourceForge and
>>     >     AppDynamics. Performance Central is your source for news,
>>     insights,
>>     >     analysis and resources for efficient Application Performance
>>     Management.
>>     >     Visit us today!
>>     >
>>     http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk
>>     >     _______________________________________________
>>     >     Valgrind-developers mailing list
>>     >     Val...@li...
>>     <mailto:Val...@li...>
>>     >     <mailto:Val...@li...
>>     <mailto:Val...@li...>>
>>     >     https://lists.sourceforge.net/lists/listinfo/valgrind-developers
>>     >
>>     >
>>     >
>>     >
>>     > --
>>     > Best Regards,
>>     > Vasily
>>     >
>>     >
>>     >
>>     ------------------------------------------------------------------------------
>>     > Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
>>     > Discover the easy way to master current and previous Microsoft
>>     technologies
>>     > and advance your career. Get an incredible 1,500+ hours of
>>     step-by-step
>>     > tutorial videos with LearnDevNow. Subscribe today and save!
>>     >
>>     http://pubads.g.doubleclick.net/gampad/clk?id=58040911&iu=/4140/ostg.clktrk
>>     >
>>     >
>>     >
>>     > _______________________________________________
>>     > Valgrind-developers mailing list
>>     > Val...@li...
>>     <mailto:Val...@li...>
>>     > https://lists.sourceforge.net/lists/listinfo/valgrind-developers
>>     >
>>
>>
>>
>>
>> --
>> Best Regards,
>> Vasily
>

-- 
Best Regards,
Vasily