Re: [Tack-devel] Bringing up PowerPC again...


At 10:49 AM -0400 5/23/07, David Given wrote:
>Well... not necessarily. The convention is that the caller is responsible for
>both pushing parameters onto the stack and then popping them off again
>afterwards, so the callee doesn't need to know about such things. All it gets
>is, in effect, a pointer to the first parameter.

Yeah, this is one noticeable difference between Forth and EM (if I'm
understanding your statement above accurately).  AFAIK, in Forth parameters
pushed onto a stack are consumed by the callee, and the caller is
responsible for popping the result back off.  If you need the original
parameters, the caller must do "Ndup", which duplicates the top N values
on the stack.  Movement of the stack pointer is transparent to the code,
though.
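
To make the contrast concrete, here is a minimal C sketch of the two
conventions on a toy downward-growing stack.  Every name in it (push, pop,
forth_add, em_add and so on) is hypothetical, and returning the EM-style
result as a plain C value is an assumption made for brevity; it is not how
either Forth or EM is actually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Toy operand stack; all names here are hypothetical. */
enum { STACK_MAX = 64 };
static int stack[STACK_MAX];
static size_t sp = STACK_MAX;   /* index of the top value; stack grows downward */

static void push(int v) { assert(sp > 0); stack[--sp] = v; }
static int pop(void) { assert(sp < STACK_MAX); return stack[sp++]; }
static size_t depth(void) { return STACK_MAX - sp; }

/* Forth style: the callee consumes its two arguments and leaves one result. */
static void forth_add(void)
{
    int b = pop();
    int a = pop();
    push(a + b);
}

/* EM style as described above: the callee only sees a pointer to its
 * first parameter and never moves the stack pointer itself. */
static int em_add(const int *args)
{
    return args[0] + args[1];
}

/* Caller for the Forth convention: push arguments, call, pop the result. */
static int call_forth_add(int a, int b)
{
    push(a);
    push(b);
    forth_add();
    return pop();
}

/* Caller for the EM convention: push arguments (last one first, so the
 * first parameter ends up on top), call, then pop the arguments again. */
static int call_em_add(int a, int b)
{
    int result;

    push(b);
    push(a);
    result = em_add(&stack[sp]);
    (void)pop();                /* caller cleans up parameter a */
    (void)pop();                /* caller cleans up parameter b */
    return result;
}
```

In both cases the stack is back at its original depth after the call; the
only difference is which side does the popping.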

>[...]
>> Therefore,
>> the only missing link is the MES statement letting the backend know how
>> many parameters are passed to the EM-based subroutine/function call.  (The
>> prolog and epilog then prepare the local variables in registers and save
>> and restore non-volatile registers.)
>
>That is, in fact, exactly what happens when using the regvars extension. The
>compilers generate hints to say 'Local #n is used X times, it'd be nice if you
>could optimise that a bit'. The platform-independent part of the code
>generator figures out which locals go in which registers, and then the
>platform-dependent part prepares the registers by copying the values out of
>the stack frame into the registers. (See MES 3.)

:-O...the platform-independent part determines what registers to use for
locals?  Pardon my ignorance on this, but all I've seen so far is that
registers are described in generic terms, but the specifics of the
registers are left to the backend.  If this isn't true, then I would argue
this breaks true platform-independence.  Beyond a few bookkeeping
registers (which arguably could reside in memory), EM shouldn't have any
requirements on the final machine-dependent code.

>(Of course, this will only work if the code isn't referring to the address of
>those stack slots.)
>
>But this doesn't affect the calling convention; the parameters still get
>*passed* in memory.

I don't understand this at all, because my understanding is that EM isn't
executable except through an interpreter (therefore doesn't do any actual
"passing").  EM still needs to be taken to machine code before the
application can execute.  The stack is a "virtual" stack during this
intermediate stage.  Again, it could be my ignorance on the matter.

Every function call after main() is a subroutine, so in theory this only
has to be solved for one subroutine and then applied to all of the others.
In the environment I am developing, main() doesn't get passed parameters
(POSIX is not a concern, everything is done with message passing).
Regardless, though, if registers are initialized before entering main() and
that convention is kept, there shouldn't be any overlap.

One aspect that comes to mind is that under the stack model as described it
appears functions can access parameters not local to themselves, simply by
reading further down the stack.  That would violate local scope rules.  Am
I misunderstanding this?
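
A small C sketch of what I mean, under the assumption that the callee gets
nothing but a pointer to its first parameter.  The names (sum_params,
peek_past_params, demo_caller) and the layout of the caller's area are
hypothetical.  As far as I can tell, nothing in the mechanism itself stops
the second function; scope would have to be enforced by the compiler front
end rather than by the stack.

```c
#include <assert.h>

enum { NPARAMS = 2 };

/* Well-behaved callee: 'args' points at its first parameter and it only
 * touches the NPARAMS words that belong to it. */
static int sum_params(const int *args)
{
    return args[0] + args[1];
}

/* Misbehaving callee: reads one word past its own parameters, which in
 * this model is storage owned by the caller. */
static int peek_past_params(const int *args)
{
    return args[NPARAMS];
}

/* Models the caller's stack area: two outgoing parameters followed by
 * one of the caller's own locals. */
static int demo_caller(int caller_local)
{
    int area[NPARAMS + 1];

    area[0] = 10;               /* first parameter */
    area[1] = 20;               /* second parameter */
    area[2] = caller_local;     /* not a parameter of the callee */

    assert(sum_params(area) == 30);
    return peek_past_params(area);
}
```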

>You can see the EM bytecode if you compile with -c.e. The EM white paper
>contains a reasonably complete description of what they all do...

Yes; however, I've been trying to find that tie between EM and ncg.

>
>Typically, I'd expect ACK on a register-centric architecture like the PowerPC
>to reserve, say, eight registers for expression evaluation, have a few for
>housekeeping, and to use all the rest for local storage. So:
>
>int x(int i1, int i2, int i3, int i4, int i5, int i6)
>{
>        i1 = i1+i2+i3+i4+i5+i6 + (int)&i3;
>}
>
>becomes (hand-compilation, omitting the prologue and epilogue boilerplate):
>
>; preload registers
>lwz r8, 4(sp)           ; r8 = local #0 = i1
>lwz r9, 8(sp)           ; i2
>lwz r10, 16(sp)         ; i4
>lwz r11, 20(sp)         ; i5
>lwz r12, 24(sp)         ; i6
>; perform calculation
>add r1, r8, r9          ; x = i1 + i2
>lwz r2, 12(sp)          ; load i3, not cached in register
>add r1, r1, r2
>add r1, r1, r10         ; x += i4
>add r1, r1, r11         ; x += i5
>add r1, r1, r12         ; x += i6
>addi r2, sp, 12         ; get address of i3
>add r8, r1, r2          ; result goes directly into the i1 register
>
>The prologue and epilogue would need to save and reload r8-r12, of course. By
>carefully tweaking how the registers are used you may be able to do this in
>one instruction. r1-r7 are scratch and don't need saving.

Except the above is really bad code (no insult intended, you are giving a
concrete depiction of a typical output) and would be unbelievably slow on
PowerPC.  Even if you manage to get all of the stack on the same cache
line, you will almost certainly stall significantly at some point - like
the next time the routine is called with the stack in a different
location.  If the parameters/stack straddle a cache line boundary, the
stalls will be enormously painful to performance.

gcc with -O3 could quite likely produce something like (i1 is in r3 and i6
is in r8)

stwu r5, 0(sp)  ; store i3 so that it has an address
add r8, r7, r8  ; i5+i6 to r8
add r3, r4, r3  ; i1+i2 to r3
add r3, r5, r3  ; i3+(i1+i2) to r3
add r3, r6, r3  ; i4+(i1+i2+i3) to r3
add r3, r3, sp  ; adding the address of i3, which was stored on the local stack
add r3, r8, r3  ; (i5+i6) to everything else

The result is already in r3, and memory was never accessed.  Granted, most
likely gcc would choke and shove some stuff into non-volatile registers,
but sometimes the optimizations are pretty decent.  (And of course, the
results are going to be highly irregular and differ depending on
optimization levels and stack location.)

What would be the EM code for the x function, including the MES notes?

>[...]
>> Isn't EM basically a representation of logic?  Although an interpreter can
>> take EM opcodes and convert them on the fly, the representation of the
>> programming logic isn't going to be affected during EM generation, and EM
>> generation doesn't affect the final object code.  Therefore, the backend is
>> still responsible for the realities of the underlying architecture.  EM
>> might represent values being on a stack, but that's still just a "virtual"
>> stack.
>
>Unfortunately, not always. EM specifies a particular format for the stack
>frame, and there's a magic EM pseudo-register that points to it. Parameters
>are then defined at particular offsets from this stack frame. There are EM
>opcodes that will either read or write single or double-word values, or else
>take the address of a particular frame slot --- there's no difference between
>'lol 3' (load word local #3) or 'lal 3; loi 4' (load address of word local #3;
>dereference word). What's more, there's no information about types, either; a
>double-word local simply occupies two frame slots, and it's possible to read
>or write the high and low words separately.
>
>(32 bit words, here. Also, EM uses 'local' to refer to function parameters and
>function temporaries.)

I still don't see where this implies or requires some adherence to
accessing the parameters in memory.  The MES notes state what size the
local variables are, and where on the (virtual) stack they are.  If a
parameter is half a word, it requires opcodes that only fill half the
register.  This can be done by the caller before jumping.  EM lays out a
roadmap, but the backend does the actual translation to something
appropriate to the machine.
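
For reference, here is how I read the quoted description, as a C sketch.
The frame is modelled as an array of 32-bit slots indexed by slot number;
the function names only mirror the mnemonics quoted above, the slot
indexing is an assumption taken from that description rather than from the
EM white paper, and the low-word-first ordering of a double-word local is
likewise assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_WORDS = 8 };

/* Hypothetical frame: slot n is the n-th 32-bit word of the frame. */
typedef struct {
    uint32_t slot[FRAME_WORDS];
} frame_t;

/* 'lol n' as described above: load the word held in local slot n. */
static uint32_t lol(const frame_t *f, unsigned n)
{
    assert(n < FRAME_WORDS);
    return f->slot[n];
}

/* 'lal n': produce the address of local slot n. */
static const unsigned char *lal(const frame_t *f, unsigned n)
{
    assert(n < FRAME_WORDS);
    return (const unsigned char *)&f->slot[n];
}

/* 'loi 4': load four bytes through an address (only this size is modelled). */
static uint32_t loi4(const unsigned char *addr)
{
    uint32_t v;

    memcpy(&v, addr, sizeof v);
    return v;
}

/* A double-word local occupies two adjacent slots; the low-word-first
 * ordering used here is an assumption. */
static void store_double(frame_t *f, unsigned n, uint64_t v)
{
    assert(n + 1 < FRAME_WORDS);
    f->slot[n] = (uint32_t)(v & 0xffffffffu);
    f->slot[n + 1] = (uint32_t)(v >> 32);
}
```

The two access paths return the same word, and the halves of a double-word
local can be read separately.  Nothing in the sketch cares whether a slot
lives in memory or in a register until somebody calls lal on it, which is
the case David's "(Of course, this will only work if ...)" caveat covers.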

It does, in essence, posit an ABI that each function call will recognize,
from top to bottom, and that is determined and implemented by the backend.  The
compromise/solution might resemble something like SPARC's register window
(similar to what you described above), but I think enforcing a
memory-centric model on a register-centric CPU is not really portable.
I've already seen too much of forcing x86-centric models onto PowerPC and
the resulting devastating effects on PowerPC performance to go down that
road again.

>The ARM code generator is probably the best one to look at, but it's a bit
>cryptic (there's a lot of support for the ARM's odd addressing modes, which is
all entirely irrelevant for the simple PowerPC). The SPARC code generator
>actually uses an entirely different and unhelpful code generator mechanism
>that I haven't bothered to make work (because it makes lousy code).

Ah.  I was hoping for something that had the opcodes all ready to go so I
could focus on the optimizations.  Then I could take the optimization for
register-centric CPUs and write the tables for the opcodes for PowerPC.
That way I don't have to try both at the same time.

>Anything in the mach directory with a ncg/table file is a new-style code
>generator.
>
>...incidentally, you may want to investigate using qemu as a testbed; it
>supports ARM, i386, MIPS, PowerPC, x86_64 and sparc and will allow 'hardware'
>debugging of the emulated machine (clunkily, via gdb).

Good point.  Someone else I know had suggested that as well, some time ago,
for a different project.

tim

Gregory T. (tim) Kelly
Owner
Dialectronics.com

P.O. Box 606
Newberry, SC 29108

"Anything war can do, peace can do better."  -- Bishop Desmond Tutu