[Tack-devel] incomplete changes to PowerPC ncg, ego

My branch https://github.com/kernigh/ack/tree/kernigh-linuxppc has
some incomplete changes to PowerPC ncg and ego. No pull request,
because my branch has at least one problem with 4-byte floats. I will
not be active for the next several days, so my branch will remain as
is.

We have two PowerPC back ends, ncg and mcg. The new code generator
(ncg) is really the old one. PowerPC ncg existed by 2007, got some
important fixes in September 2016, and can now compile some but not
all C and Modula-2 programs. David Given's modern code generator (mcg)
existed by October 2016, and can now compile most programs, but often
emits wrong code. When I last checked, printf() in C worked in ncg but
not in mcg.

In my branch, I tried to complete the old back end, PowerPC ncg. I
added the missing conversions between integers and 4-byte floats. I
implemented the EM instruction lxl, for nested procedures in Modula-2.
I also made some changes to register allocation.

Since 2007, PowerPC ncg had defined an individual register class for
each of the 32 general-purpose and 32 floating-point registers. These
64 classes had names like GPR3, FPR3, GPR4, FPR4, and so on. The table
used these classes to coerce values from the EM stack into specific
registers. For example, the rule for EM instruction aar coerced 3
values into GPR3, GPR4, GPR5.

But ncg's register allocator is too slow with so many classes. A rule
using 3 GPRs would take about 2 seconds to allocate them. So in
October 2016, I added REG_PAIR to speed up some rules. REG_PAIR
allocates a pair of GPRs from a list of only 4 pairs.

In http://tack.sourceforge.net/olddocs.html, I found warnings against
too many register classes. Frank Doodeman's m68020 paper said,

> Since Hans van Staveren in his document [4] clearly states that *cg* execution time is negatively influenced by the number of properties, only four different properties have been defined.

van Staveren's ncg paper said,

> Every extra property means the register set is more unorthogonal and *cg* execution time is influenced by that, because it has to take into account a larger set of registers that are not equivalent. So try to keep the number of different register classes to a minimum. When faced with the choice between two possible code rules for a nonfrequent EM sequence, one being elegant but requiring an extra property, and the other less elegant, elegance should probably loose.

In my branch, I removed 63 of the 64 individual register classes. (I
left a singleton class for register r3.) I also removed REG_PAIR.
Register allocation becomes much faster, because each allocation picks
from only 1 or 2 classes. Compilation with ack -O1 is quick;
compilation with ack -O2 or higher spends most of its time running
ego, the EM global optimizer.

To remove the register classes, I changed libem. When rules in ncg
call libem, they can no longer coerce stack values to registers
(except r3). So I changed libem to pass most values on the real stack,
not in registers. This is slower. (Regular calls to C functions or
Modula-2 procedures continue to use the real stack, and are as slow as
always.) Because of these libem changes, I needed to delete all my
PowerPC .o files.

My branch also made changes to register variables. These use a second
method of register allocation, where the registers get preserved
across function calls. The method, in ncg, simply maps EM local
variables into registers. There is an RA phase in ego that rearranges
the local variables so ncg can emit better code.

In the default branch, PowerPC ncg has regvars only for integers, not
for floats. We run the RA phase in ego. Platform osxppc runs ego with
the descr file, but platform linuxppc runs ego without a descr. (When
I wrote powerpc.descr, I enabled it for osxppc but forgot to enable it
for other platforms.)

I find that it harms code generation to run the RA phase without a
descr file. Each EM local variable has a register score. Before ego
runs, this score is about the number of times that the var appears in
the code. If the score is bigger than about 3, then ncg would try to
allocate a regvar. If ego runs the RA phase, it changes each score to
0 or 10000. The number of registers with score 10000 is never greater
than the number of registers in the descr file. But if there's no
descr, the phase changes all the scores to zero. When linuxppc runs
ego without a descr, if we run the RA phase, we disable regvars in
ncg. So linuxppc gets better code from ack -O1 or -O2 than from -O3,
because -O3 enables the RA phase.
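
To make the score behaviour concrete, here is a toy model in Python. It is only a sketch of what I described, not ego's or ncg's real code: the function names are made up, the threshold of 3 is the "about 3" from above, and I assume the RA phase keeps the locals with the highest scores (the real selection is more involved).

```python
# Toy model of register scores; not ego's or ncg's real code.
REGVAR_THRESHOLD = 3      # "about 3" above; the exact value is an assumption
RA_CHOSEN_SCORE = 10000   # score the RA phase gives to locals it keeps


def initial_scores(use_counts):
    """Before ego runs, a local's score is roughly its number of uses."""
    return dict(use_counts)


def ra_phase(scores, descr_registers):
    """Rewrite every score to 0 or 10000.

    descr_registers is the register count from the descr file,
    or 0 when ego runs without a descr. Assumption: the phase
    keeps the highest-scoring locals.
    """
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    chosen = set(ranked[:descr_registers])
    return {name: (RA_CHOSEN_SCORE if name in chosen else 0)
            for name in scores}


def ncg_regvars(scores):
    """ncg tries a regvar for each local whose score beats the threshold."""
    return sorted(name for name, score in scores.items()
                  if score > REGVAR_THRESHOLD)


scores = initial_scores({"i": 12, "n": 5, "tmp": 1})
without_ego = ncg_regvars(scores)                # RA phase not run
with_descr = ncg_regvars(ra_phase(scores, 1))    # descr lists 1 register
without_descr = ncg_regvars(ra_phase(scores, 0)) # no descr: all scores 0
```

In this model, skipping the RA phase leaves two candidate regvars, a descr with one register leaves one, and running the RA phase with no descr leaves none, which is the linuxppc situation.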

In my branch, I tried to add floating-point regvars to PowerPC ncg.
But in ncg, all float regvars must have the same size. I added only
8-byte float regvars, because 8-byte floats seem more common than
4-byte floats. I added the float regvars to ego's descr. But the RA
phase assumed that all float regvars hold 4-byte floats. It changed
all the scores for 8-byte floats to zero, so ncg never allocated the
float regvars! I then changed ego to put both 4-byte floats and 8-byte
floats in registers.

My branch has a problem. When the RA phase puts a local in a register,
it also frees the stack space for the local. So the RA phase can put a
4-byte float in a register and free its stack space. Then PowerPC ncg
refuses to put the 4-byte float in a register (because it only has
8-byte float regvars), and ncg tries to use the stack space that ego
freed. This doesn't work, so I am observing corruption of 4-byte
floats in programs. I have not fixed this problem.
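
The disagreement between ego and ncg can be modelled in a few lines of Python. This is only an illustration of the failure as I understand it; the data layout and function names are hypothetical and do not match the real ego or ncg sources.

```python
# Toy model of the 4-byte float problem; names and layout are hypothetical.
NCG_FLOAT_REGVAR_SIZE = 8   # in my branch, ncg only has 8-byte float regvars


def ego_ra_phase(float_locals):
    """ego puts both 4-byte and 8-byte floats in registers
    and frees the stack slot of each one."""
    return {name: {"size": size, "in_register": True, "has_stack_slot": False}
            for name, size in float_locals.items()}


def ncg_place(plan):
    """ncg accepts a float regvar only if it has the one supported size;
    otherwise it falls back to the local's stack slot."""
    placement = {}
    for name, info in plan.items():
        if info["in_register"] and info["size"] == NCG_FLOAT_REGVAR_SIZE:
            placement[name] = "register"
        elif info["has_stack_slot"]:
            placement[name] = "stack"
        else:
            # ego already freed this slot, so something else may use it
            placement[name] = "freed stack slot (corrupt)"
    return placement


result = ncg_place(ego_ra_phase({"x": 8, "f": 4}))
```

Here the 8-byte local `x` lands in a register, but the 4-byte local `f` ends up in a stack slot that ego already gave away. A fix would need ego and ncg to agree on which float sizes can become regvars.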

My branch https://github.com/kernigh/ack/tree/kernigh-linuxppc will
remain as is (with the 4-byte float problem) for at least the next
several days, while I am not active.

-George Koehler
