|
From: Roel M. <r.j...@gm...> - 2007-06-06 08:44:19
|
Dear reader, =46or a paper we need to publish relatively soon we need to profile several= =20 applications on the powerpc core avaliable on the Xilinx Virtex-II pro FPGA= =20 board. These are powerpc 405 models. We managed to compile valgrind and too= ls=20 for this platform and we also were able to generate results. However the=20 results do not seem to make any sense. In MPEG-2 for example, the function= =20 fprintf is reported to call motion_estimation. Which of course is not reall= y=20 sensible. Do you guys know what is going on? Is the powerpc 405 not supported? Are we= =20 doing something wrong? If you would like to see the source and profiles, le= t=20 me know where to send it greetings, Kamana Sigdel and Roel Meeuws P.S. the command we ran was:=20 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0valgrind --tool=3Dcallgrind= ../mpeg2encode test.par test.m2v =2D-=20 Roel Meeuws Delft University of Technology =46aculty of Electrical Engineering Mathematics and Computer Science Computer Engineering Laboratory Mekelweg 4, 2628 CD Delft, The Netherlands =2D------------------------------------------- Email:r.j...@gm... Office phone: +31 (0)6 10 82 44 01 =2D------------------------------------------- |
|
From: Josef W. <Jos...@gm...> - 2007-06-06 09:29:34
|
On Wednesday 06 June 2007, Roel Meeuws wrote: > Dear reader, >=20 > For a paper we need to publish relatively soon we need to profile several= =20 > applications on the powerpc core avaliable on the Xilinx Virtex-II pro FP= GA=20 > board. These are powerpc 405 models. We managed to compile valgrind and t= ools=20 > for this platform and we also were able to generate results. However the= =20 > results do not seem to make any sense. In MPEG-2 for example, the functio= n=20 > fprintf is reported to call motion_estimation. Which of course is not rea= lly=20 > sensible. >=20 > Do you guys know what is going on? Is the powerpc 405 not supported? Are = we=20 > doing something wrong? If you would like to see the source and profiles, = let=20 > me know where to send it In short: Call graph tracing is experimental on PPC. Despite being quite robust, call graph tracing is far from giving reliable call graphs. The problem is that I am totally incompetent about the PPC ISA/ABI, and the PPC ISA has no explicit call/return instructions, so there have to be heuristics about what e.g. branch in a binary was meant to be a call. [I recently added this fact to the callgrind manual in SVN; but it probably would be better to put it into the FAQ list] I am quite sure that ABI knowledge and debug info can help a lot here. I am really very interested in any patch/discussion which makes this work in a better way. A good start already would be to analyze why the above bogus call "fprintf =3D=3D> motion_estimation" was found, and try to make this work. Cheers, Josef >=20 > greetings, >=20 > Kamana Sigdel and Roel Meeuws >=20 > P.S. the command we ran was:=20 > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0valgrind --tool=3Dcallgri= nd ../mpeg2encode test.par test.m2v |
Josef Weidendorfer wrote: > the PPC ISA has no explicit call/return instructions, > so there have to be heuristics about what e.g. branch in a binary was meant > to be a call. Please give an example or two. The PowerPC does have explicit call and return instructions. Any instruction whose top 6 bits (major opcode) are 16, 18, or 19 (decimal), and whose bottom bit (the LK bit) is a 1, is a call [and/or return.] The PowerPC has conditional call and conditional return in addition to unconditional call and unconditional return. Any "computed GOTO" must use the link register (lr) or the count register (ctr) to hold the address. In particular, any return-from-subroutine [where the destination is not lexically known] is a computed GOTO which must use the count register or link register. The code for calling a shared library, generating absolute addresses from PIC code, eliminating tail recursion via branch, etc., are similarly "complex" on the PowerPC as on other architectures [such as x86.] Inlining a subroutine might be easier (because of three-address instructions and more reigsters) but is otherwise similar. The code for a C-language 'switch' is just as recognizable on PowerPC as on other architectures. The PowerPC ISA does have glaring deficiencies (cannot branch on CArry which is generated by an add or subtract [must use a compare], lacks a subtract which sets both CArry and condition codes for an unsigned operation, lacks "compare with immediate shifted by 16 bits", always sets condition codes for "and immediate", does not offer "fill low bits with ones" for "and immediate shifted", etc.) But explicit call and return are present in PowerPC ISA, and compilers use them. -- John Reiser, jreiser@BitWagon.com |
|
From: Julian S. <js...@ac...> - 2007-06-06 12:12:37
|
> On Wednesday 06 June 2007 12:49, John Reiser wrote:
> Josef Weidendorfer wrote:
> > the PPC ISA has no explicit call/return instructions,
> > so there have to be heuristics about what e.g. branch in a binary was
> > meant to be a call.
I don't think that's exactly true, but there are ambiguities to do
with returns. See below.
> Please give an example or two.
The usual form of 'call to known destination' is
branch-pc-relative and write NIA to LR ("bl")
where NIA = Next Instruction Address (& this_insn + 4)
It follows (although I do not know this for sure) that 'call to
unknown destination' must be one of these
branch-to-LR and write NIA to LR ("blrl")
branch-to-CTR and write NIA to LR ("bctrl")
The usual form of 'return' is
branch-to-LR ("blr")
Assuming the above is correct, I think the problem is
branch-to-LR could also be any old computed goto (eg, a switch)
How to distinguish this case from a return?
J
|
Julian Seward wrote:
> The usual form of 'return' is
>
> branch-to-LR ("blr")
>
> Assuming the above is correct, I think the problem is
>
> branch-to-LR could also be any old computed goto (eg, a switch)
True, just like the x86 "jmp *%ecx" is a computed goto of any kind:
C-language 'switch', return from subroutine, hard-coded state machine,
etc.
> How to distinguish this case from a return?
As compiled by gcc, the "branch to link regiser [or count register]"
for nearly all C-language 'switch' statements is preceded by a
bounds check of the switch value, and an indexed fetch from a table
of addresses or offsets. This can be analyzed by a backwards scan
of dataflow through instructions at preceding descending addresses.
(Once in a while the scan must follow labels and jumps.)
It is not particularly hard to do, but a new release of gcc may
use a different style that an old analyzer does not recognize.
The style also varies with position-independent code (-fPIC),
optimization level (-O2), complexity of nearby expressions, etc.
The C-language runtime support library glibc does have some hand code
that uses computed GOTO which does not have this bounds check, because
other code checks it implicitly; "AND immediate with 7" guarantees
0<=value<=7. In theory sometimes gcc could do a range analysis
of the index, then subsume the test, but I have not seen it.
Note that x86 has the same generic difficulty. "jmp *%ecx"
could be a switch or a return, and a 'ret' _also_ could be a switch
or a return. Both are "merely" a computed GOTO. Stylistically
nearly every 'ret' is a subroutine return, and nearly every "jmp *reg"
is not; but there are legitimate exceptions both ways.
--
John Reiser, jreiser@BitWagon.com
|
|
From: Josef W. <Jos...@gm...> - 2007-06-06 15:30:55
|
On Wednesday 06 June 2007, Julian Seward wrote:
> > On Wednesday 06 June 2007 12:49, John Reiser wrote:
> > Josef Weidendorfer wrote:
> > > the PPC ISA has no explicit call/return instructions,
> > > so there have to be heuristics about what e.g. branch in a binary was
> > > meant to be a call.
>
> I don't think that's exactly true, but there are ambiguities to do
> with returns. See below.
Yes, this statement probably was quite provocative ;-)
On Wednesday 06 June 2007, John Reiser wrote:
> Julian Seward wrote:
>
> > The usual form of 'return' is
> >
> > branch-to-LR ("blr")
> >
> > Assuming the above is correct, I think the problem is
> >
> > branch-to-LR could also be any old computed goto (eg, a switch)
>
> True, just like the x86 "jmp *%ecx" is a computed goto of any kind:
> C-language 'switch', return from subroutine, hard-coded state machine,
> etc.
>
> > How to distinguish this case from a return?
The main problem for callgrind is that it only sees the intermediate
IR generated by VEX, which has annotations for the jump kind, like
"this jump is a return instruction". This information is reliable for
x86, but not for PPC, as in the example given by Julian.
Another thing is that I almost did not touch any of the heuristics
of callgrind for PPC; it is still "tuned" for x86.
It would be nice to incorporate heuristics into callgrind
which allow to reconstruct the call relations as given in the
original high level language (C/C++/Fortran/ADA/...), independent
of the ISA used for the binary.
> As compiled by gcc, the "branch to link regiser [or count register]"
> for nearly all C-language 'switch' statements is preceded by a
> bounds check of the switch value, and an indexed fetch from a table
> of addresses or offsets. This can be analyzed by a backwards scan
> of dataflow through instructions at preceding descending addresses.
> (Once in a while the scan must follow labels and jumps.)
> It is not particularly hard to do, but a new release of gcc may
> use a different style that an old analyzer does not recognize.
> The style also varies with position-independent code (-fPIC),
> optimization level (-O2), complexity of nearby expressions, etc.
>
> The C-language runtime support library glibc does have some hand code
> that uses computed GOTO which does not have this bounds check, because
> other code checks it implicitly; "AND immediate with 7" guarantees
> 0<=value<=7. In theory sometimes gcc could do a range analysis
> of the index, then subsume the test, but I have not seen it.
Perhaps the best heuristic for deciding whether a computed goto
actually is a return, is to look at debug info (same vs. different
function/ELF object) and equality of target to an expected return
address on callgrinds shadow stack.
> Note that x86 has the same generic difficulty. "jmp *%ecx"
> could be a switch or a return, and a 'ret' _also_ could be a switch
> or a return. Both are "merely" a computed GOTO. Stylistically
> nearly every 'ret' is a subroutine return, and nearly every "jmp *reg"
> is not; but there are legitimate exceptions both ways.
In practice, callgrind currently is working really well with x86
(and x86-64) code.
Using a "ret" only as subroutine return is not only a stylistic
issue: on modern x86 processors, using a return stack for branch prediction
is quite common. Ie. it gives bad performance when you use a "ret" as
a computed goto, as you confuse the branch predictor.
Josef
|
>>Note that x86 has the same generic difficulty. "jmp *%ecx" >>could be a switch or a return, and a 'ret' _also_ could be a switch >>or a return. Both are "merely" a computed GOTO. Stylistically >>nearly every 'ret' is a subroutine return, and nearly every "jmp *reg" >>is not; but there are legitimate exceptions both ways. > > > In practice, callgrind currently is working really well with x86 > (and x86-64) code. > > Using a "ret" only as subroutine return is not only a stylistic > issue: on modern x86 processors, using a return stack for branch prediction > is quite common. Ie. it gives bad performance when you use a "ret" as > a computed goto, as you confuse the branch predictor. The branch predictor gets confused only because Intel refused to allow the programmer to inform the branch predictor of the program's intent. The opcode FF /7 is available to say "PUSH a return address", and the opcodes 8F /1 through 8F /7 are available to say "POP a return address". These are right next to the existing opcodes FF /6 (PUSH r/m) and 8F /0 (POP r/m) so there would be almost no cost in the hardware decoding. [In the typical case where r/m is in fact a register, then the programmer pays 1 byte for using each new opcode, in contrast to a single-byte PUSH/POP register instruction (50/58 series opcodes.)] Sending the PUSH/POP signal from the decoder to the on-chip return address stack also would use existing wires. For want of two gates and two multiplexors, measurable performance is lost. -- |
|
From: Julian S. <js...@ac...> - 2007-06-07 06:32:36
|
> Using a "ret" only as subroutine return is not only a stylistic > issue: on modern x86 processors, using a return stack for branch prediction > is quite common. Ie. it gives bad performance when you use a "ret" as > a computed goto, as you confuse the branch predictor. That's true, and it's also true on ppc. If a branch-to-LR appears, the branch predictor needs to decide whether to predict from the return stack or using the normal mechanism. Given that it has no obvious way to decide, the ppc optimisation guides suggest that in practice it is best to use branch-to-LR only for procedure returns, and use branch-to-CTR only for other computed gotos, and so the cpu has that knowledge embedded in it. If you see what I mean. (LR and CTR are different registers). Indeed, coregrind/m_dispatch in V on ppc used to use branch-to-LR to jump to translations. When I changed that to branch-to-CTR I got a measurable performance increase on POWER4, presumably due to not trashing the branch predictors so much. J |