[Tack-devel] incomplete changes to PowerPC ncg, ego
Moved to https://github.com/davidgiven/ack
From: George K. <ke...@gm...> - 2017-02-20 23:03:18
My branch https://github.com/kernigh/ack/tree/kernigh-linuxppc has some incomplete changes to PowerPC ncg and ego. No pull request, because my branch has at least one problem with 4-byte floats. I will not be active for the next several days, so my branch will remain as is.

We have two PowerPC back ends, ncg and mcg. The new code generator (ncg) is really the old one. PowerPC ncg existed by 2007, got some important fixes in September 2016, and can now compile some but not all C and Modula-2 programs. David Given's modern code generator (mcg) existed by October 2016, and can now compile most programs, but often emits wrong code. When I last checked, printf() in C worked in ncg but not in mcg.

In my branch, I tried to complete the old back end, PowerPC ncg. I added the missing conversions between integers and 4-byte floats. I implemented the EM instruction lxl, for nested procedures in Modula-2. I also made some changes to register allocation.

Since 2007, PowerPC ncg had defined an individual register class for each of the 32 general-purpose and 32 floating-point registers. These 64 classes had names like GPR3, FPR3, GPR4, FPR4, and so on. The table used these classes to coerce values from the EM stack into specific registers. For example, the rule for EM instruction aar coerced 3 values into GPR3, GPR4, GPR5.

But ncg's register allocator is too slow with so many classes. A rule using 3 GPRs would take about 2 seconds to allocate them. So in October 2016, I added REG_PAIR to speed up some rules. REG_PAIR meant to allocate a pair of GPRs from a list of only 4 pairs.

In http://tack.sourceforge.net/olddocs.html, I found warnings against too many register classes. Frank Doodeman's m68020 paper said,

> Since Hans van Staveren in his document [4] clearly states that *cg* execution time is negatively influenced by the number of properties, only four different properties have been defined.
van Staveren's ncg paper said,

> Every extra property means the register set is more unorthogonal and *cg* execution time is influenced by that, because it has to take into account a larger set of registers that are not equivalent. So try to keep the number of different register classes to a minimum. When faced with the choice between two possible code rules for a nonfrequent EM sequence, one being elegant but requiring an extra property, and the other less elegant, elegance should probably loose.

In my branch, I removed 63 of the 64 individual register classes. (I left a singleton class for register r3.) I also removed REG_PAIR. Register allocation becomes much faster, because each allocation picks from only 1 or 2 classes. Compilation with ack -O1 is quick; compilation with ack -O2 or higher uses most time to run ego, the EM global optimizer.

To remove the register classes, I changed libem. When rules in ncg call libem, they can no longer coerce stack values to registers (except r3). So I changed libem to pass most values on the real stack, not in registers. This is slower. (Regular calls to C functions or Modula-2 procedures continue to use the real stack, and are as slow as always.) Because of these libem changes, I needed to delete all my PowerPC .o files.

My branch also made changes to register variables. These use a second method of register allocation, where the registers get preserved across function calls. The method, in ncg, simply maps EM local variables into registers. There is an RA phase in ego that rearranges the local variables so ncg can emit better code.

In the default branch, PowerPC ncg has regvars only for integers, not for floats. We run the RA phase in ego. Platform osxppc runs ego with the descr file, but platform linuxppc runs ego without a descr. (When I wrote powerpc.descr, I enabled it for osxppc but forgot to enable it for other platforms.) I find that it harms code generation to run the RA phase without a descr file.
Each EM local variable has a register score. Before ego runs, this score is roughly the number of times that the var appears in the code. If the score is bigger than about 3, then ncg would try to allocate a regvar. If ego runs the RA phase, it changes each score to 0 or 10000. The number of locals with score 10000 is never greater than the number of registers in the descr file. But if there's no descr, the phase changes all the scores to zero. When linuxppc runs ego without a descr, if we run the RA phase, we disable regvars in ncg. So we can emit better code for linuxppc by running ack -O1 or -O2, because -O3 enables the RA phase.

In my branch, I tried to add floating-point regvars to PowerPC ncg. But in ncg, all float regvars must have the same size. I added only 8-byte float regvars, because 8-byte floats seem more common than 4-byte floats. I added the float regvars to ego's descr. But the RA phase assumed that all float regvars hold 4-byte floats. It changed all the scores for 8-byte floats to zero, so ncg never allocated the float regvars! I then changed ego to put both 4-byte floats and 8-byte floats in registers.

My branch has a problem. When the RA phase puts a local in a register, it also frees the stack space for the local. So the RA phase can put a 4-byte float in a register and free its stack space. Then PowerPC ncg refuses to put the 4-byte float in a register (because it only has 8-byte float regvars), and ncg tries to use the stack space that ego freed. This doesn't work, so I am observing corruption of 4-byte floats in programs. I have not fixed this problem.

My branch https://github.com/kernigh/ack/tree/kernigh-linuxppc will remain as is (with the 4-byte float problem) for at least the next several days, while I am not active.

-George Koehler