Re: [Tack-devel] Bringing up PowerPC again...
From: tim k. <gt...@di...> - 2007-05-23 16:58:30
At 10:49 AM -0400 5/23/07, David Given wrote:
>Well... not necessarily. The convention is that the caller is responsible for
>both pushing parameters onto the stack and then popping them off again
>afterwards, so the callee doesn't need to know about such things. All it gets
>is, in effect, a pointer to the first parameter.

Yeah, this is one noticeable difference between Forth and EM (if I'm understanding your statement above accurately). AFAIK, in Forth parameters pushed onto a stack are consumed by the callee, and the caller is responsible for popping the result back off. If you need the original parameters, the caller must do "Ndup" which duplicates N number of values on the stack. Movement of the stack pointer is transparent to the code, though.

>[...]
>> Therefore,
>> the only missing link is the MES statement letting the backend know how
>> many parameters are passed to the EM-based subroutine/function call. (The
>> prolog and epilog then prepare the local variables in registers and save
>> and restore non-volatile registers.)
>
>That is, in fact, exactly what happens when using the regvars extension. The
>compilers generate hints to say 'Local #n is used X times, it'd be nice if you
>could optimise that a bit'. The platform-independent part of the code
>generator figures out which locals go in which registers, and then the
>platform-dependent part prepares the registers by copying the values out of
>the stack frame into the registers. (See MES 3.)

:-O...the platform-independent part determines what registers to use for locals? Pardon my ignorance on this, but all I've seen so far is that registers are described in generic terms, but the specifics of the registers are left to the backend. If this isn't true, then I would argue this has spanned true platform-independence. Beyond a few bookkeeping registers (which arguably could reside in memory), EM shouldn't have any requirements on the final machine-dependent code.
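
To make the regvars idea concrete, here is a minimal sketch in C of how usage hints of the form "local #n is used X times" could drive a register assignment. It is not ACK's actual algorithm: the struct local_hint type, the assign_registers function and the greedy "most used first" policy are all assumptions made for illustration. The only rules taken from the thread are that hints carry a use count and that a local whose address is taken has to stay in its stack slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one local, loosely modelled on the
   "local #n is used X times" hint discussed in the thread. */
struct local_hint {
    int uses;           /* how often the compiler says the local is used */
    int address_taken;  /* nonzero if the code takes the slot's address */
    int reg;            /* output: register number, or -1 to stay in memory */
};

/* Greedy sketch: hand registers first_reg, first_reg+1, ... to the
   most-used locals that are eligible. Ties go to the lower local number.
   Returns the number of registers handed out. */
int assign_registers(struct local_hint *locals, size_t n,
                     int first_reg, int num_regs)
{
    int assigned = 0;

    assert(locals != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        locals[i].reg = -1;             /* default: leave it in the frame */

    while (assigned < num_regs) {
        size_t best = n;                /* n means "no candidate found" */
        for (size_t i = 0; i < n; i++) {
            if (locals[i].reg != -1)      continue; /* already placed */
            if (locals[i].address_taken)  continue; /* must stay addressable */
            if (locals[i].uses <= 0)      continue; /* never used */
            if (best == n || locals[i].uses > locals[best].uses)
                best = i;
        }
        if (best == n)
            break;                      /* nothing left worth a register */
        locals[best].reg = first_reg + assigned;
        assigned++;
    }
    return assigned;
}
```

Fed six locals where only the third has its address taken, and five registers starting at r8, this sketch puts locals 0, 1, 3, 4 and 5 in r8 to r12 and leaves local 2 in memory.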
>(Of course, this will only work if the code isn't referring to the address of
>those stack slots.)
>
>But this doesn't affect the calling convention; the parameters still get
>*passed* in memory.

I don't understand this at all, because my understanding is that EM isn't executable except through an interpreter (therefore doesn't do any actual "passing"). EM still needs to be taken to machine code before the application can execute. The stack is a "virtual" stack during this intermediate stage. Again, it could be my ignorance on the matter.

Every function call after main() is a subroutine, so in theory this only has to be solved for one subroutine and then applied to all of the others. In the environment I am developing, main() doesn't get passed parameters (POSIX is not a concern, everything is done with message passing). Regardless, though, if registers are initialized before entering main() and that convention is kept, there shouldn't be any overlap.

One aspect that comes to mind is that under the stack model as described it appears functions can access parameters not local to themselves, simply by reading further down the stack. That would violate local scope rules. Am I misunderstanding this?

>You can see the EM bytecode if you compile with -c.e. The EM white paper
>contains a reasonably complete description of what they all do...

Yes; however, I've been trying to find that tie between EM and ncg.

>
>Typically, I'd expect ACK on a register-centric architecture like the PowerPC
>to reserve, say, eight registers for expression evaluation, have a few for
>housekeeping, and to use all the rest for local storage.
>So:
>
>int x(int i1, int i2, int i3, int i4, int i5, int i6)
>{
>    i1 = i1+i2+i3+i4+i5+i6 + (int)&i3;
>}
>
>becomes (hand-compilation, omitting the prologue and epilogue boilerplate):
>
>; preload registers
>lwz r8, 4(sp) ; r8 = local #0 = i1
>lwz r9, 8(sp) ; i2
>lwz r10, 16(sp) ; i4
>lwz r11, 20(sp) ; i5
>lwz r12, 24(sp) ; i6
>; perform calculation
>add r1, r8, r9 ; x = i1 + i2
>lwz r2, 12(sp) ; load i3, not cached in register
>add r1, r1, r2
>add r1, r1, r10 ; x += i4
>add r1, r1, r11 ; x += i5
>add r1, r1, r12 ; x += i6
>addi r2, sp, 12 ; get address of i3
>add r8, r1, r2 ; result goes directly into the i1 register
>
>The prologue and epilogue would need to save and reload r8-r12, of course. By
>carefully tweaking how the registers are used you may be able to do this in
>one instruction. r1-r7 are scratch and don't need saving.

Except the above is really bad code (no insult intended, you are giving a concrete depiction of a typical output) and would be unbelievably slow on PowerPC. Even if you manage to get all of the stack on the same cache line, you will almost certainly stall significantly at some point - like the next time the routine was called with the stack in a different location. If the parameters/stack span a cache line, stalling will be enormously painful to performance.

gcc with -O3 could quite likely produce something like (i1 is in r3 and i6 is in r8)

stwu r5, 0(sp)
add r8, r7, r8 ; i5+i6 to r8
add r3, r4, r3 ; i1+i2 to r3
add r3, r6, r3 ; i4+(i1+i2) to r3
add r3, r5, r3 ; i3+(i4+i1+i2) to r3
add r3, r3, sp ; adding the address of i3, which was stored on the local stack
add r3, r8, r3 ; (i5+i6) to everything else

The result is already in r3, and memory was never accessed. Granted, most likely gcc would choke and shove some stuff into non-volatile registers, but sometimes the optimizations are pretty decent. (And of course, the results are going to be highly irregular and differ depending on optimization levels and stack location.)
What would be the EM code for the x function, including the MES notes?

>[...]
>> Isn't EM basically a representation of logic? Although an interpreter can
>> take EM opcodes and convert them on the fly, the representation of the
>> programming logic isn't going to be affected during EM generation, and EM
>> generation doesn't affect the final object code. Therefore, the backend is
>> still responsible for the realities of the underlying architecture. EM
>> might represent values being on a stack, but that's still just a "virtual"
>> stack.
>
>Unfortunately, not always. EM specifies a particular format for the stack
>frame, and there's a magic EM pseudo-register that points to it. Parameters
>are then defined at particular offsets from this stack frame. There are EM
>opcodes that will either read or write single or double-word values, or else
>take the address of a particular frame slot --- there's no difference between
>'lol 3' (load word local #3) or 'lal 3; loi 4' (load address of word local #3;
>dereference word). What's more, there's no information about types, either; a
>double-word local simply occupies two frame slots, and it's possible to read
>or write the high and low words separately.
>
>(32 bit words, here. Also, EM uses 'local' to refer to function parameters and
>function temporaries.)

I still don't see where this implies or requires some adherence to accessing the parameters in memory. The MES notes state what size the local variables are, and where on the (virtual) stack they are. If a parameter is half a word, it requires opcodes that only fill half the register. This can be done by the caller before jumping. EM lays out a roadmap, but the backend does the actual translation to something appropriate to the machine. It does, in essence, posit an ABI that each function call will recognize, from top to bottom, determined and implemented by the backend.
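
The frame-slot behaviour described in the quoted text can be pictured with a small C model. This is a simplified sketch, not EM itself: frame_t, the eight-slot frame size, the slot-index addressing and the low-word-first layout of a double word are all assumptions, and the helper names merely echo the 'lol', 'lal' and 'loi' mnemonics mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORD_SIZE   4   /* 32-bit words, as stated in the quoted text */
#define FRAME_SLOTS 8   /* arbitrary frame size for this sketch */

/* The frame is just untyped words; slot n plays the role of "local #n". */
typedef struct { uint32_t slot[FRAME_SLOTS]; } frame_t;

/* In the spirit of 'lol n': load the word held in local #n. */
uint32_t load_local(const frame_t *f, unsigned n)
{
    assert(n < FRAME_SLOTS);
    return f->slot[n];
}

/* In the spirit of 'lal n': take the address of local #n. */
const void *address_of_local(const frame_t *f, unsigned n)
{
    assert(n < FRAME_SLOTS);
    return &f->slot[n];
}

/* In the spirit of 'loi 4': load one word through an address. */
uint32_t load_indirect_word(const void *addr)
{
    uint32_t w;
    memcpy(&w, addr, WORD_SIZE);
    return w;
}

/* A double-word local occupies two adjacent slots. Putting the low word
   first is an assumption of this sketch. */
void store_double(frame_t *f, unsigned n, uint64_t v)
{
    assert(n + 1 < FRAME_SLOTS);
    f->slot[n]     = (uint32_t)(v & 0xffffffffu);
    f->slot[n + 1] = (uint32_t)(v >> 32);
}
```

In this model load_local(&f, 3) and load_indirect_word(address_of_local(&f, 3)) always return the same word, and the two halves of a double-word local can be read as ordinary single-word locals. A backend that keeps local #3 only in a register has nothing to hand back for the address-of step, which is the constraint being debated here.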
The compromise/solution might resemble something like SPARC's register window (similar to what you described above), but I think enforcing a memory-centric model on a register-centric CPU is not really portable. I've already seen too much of forcing x86-centric models onto PowerPC and the resulting devastating effects on PowerPC performance to go down that road again.

>The ARM code generator is probably the best one to look at, but it's a bit
>cryptic (there's a lot of support for the ARM's odd addressing modes, which is
>all entirely irrelevant for the simple PowerPC). The SPARC code generator
>actually uses an entirely different and unhelpful code generator mechanism
>that I haven't bothered to make work (because it makes lousy code).

Ah. I was hoping for something that had the opcodes all ready to go so I could focus on the optimizations. Then I could take the optimization for register-centric CPUs and write the tables for the opcodes for PowerPC. That way I don't have to try both at the same time.

>Anything in the mach directory with a ncg/table file is a new-style code
>generator.
>
>...incidentally, you may want to investigate using qemu as a testbed; it
>supports ARM, i386, MIPS, PowerPC, x86_64 and sparc and will allow 'hardware'
>debugging of the emulated machine (clunkily, via gdb).

Good point. Someone else I know had suggested that as well, some time ago, for a different project.

tim

Gregory T. (tim) Kelly
Owner
Dialectronics.com
P.O. Box 606
Newberry, SC 29108

"Anything war can do, peace can do better." -- Bishop Desmond Tutu