Last night I decided to take a crack at fixing the PPC machine
definition to cooperate with the instruction scheduler. The instruction
scheduler in SBCL assumes that the target machine acts like a MIPS.
That is, instructions take a certain number of cycles before their
results become available, and reading a result before then produces
bogus values. Christophe had, in times past, hacked the PPC machine
definition to include appropriate :DELAY values on all the
instructions. However, because of the scheduler's assumptions about
what :DELAY meant, many NOPs were inserted into the code, thus
slowing things down.
Attached is a patch which tweaks the PPC machine definition as well
as the assembler to use "modern" assumptions about how instructions
work. Each instruction is now blessed with a :LATENCY, which specifies
the number of cycles we'd like to see pass before executing it.
Unlike :DELAY, however, NOPs are not inserted in spare cycles; the
scheduler simply schedules the dependents of potentially long-running
instructions and lets the machine sort it out.
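To make the difference between the two models concrete, here is a toy sketch in Python. It is not SBCL's scheduler or assembler; the Insn record, the register names, and the three-cycle load are all made up for illustration. The first function pads with NOPs whenever a source register is not yet readable (the :DELAY reading), while the second never emits NOPs and only tries to move independent instructions into the gap (the :LATENCY reading).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    name: str          # label only; not a real encoding
    dest: str          # register written
    srcs: tuple        # registers read
    cycles: int = 1    # slots until the result is usable (hypothetical)


def pad_with_nops(insns):
    """:DELAY-style model: a result is bogus until `cycles` slots have
    passed, so gaps in program order are filled with NOPs."""
    out = []
    ready_at = {}  # register -> first slot at which it may be read
    for insn in insns:
        need = max((ready_at.get(r, 0) for r in insn.srcs), default=0)
        while len(out) < need:
            out.append("nop")
        out.append(insn.name)
        ready_at[insn.dest] = len(out) - 1 + insn.cycles
    return out


def conflicts(earlier, later):
    """True if `later` must stay after `earlier`."""
    return (earlier.dest in later.srcs       # read-after-write
            or later.dest in earlier.srcs    # write-after-read
            or later.dest == earlier.dest)   # write-after-write


def schedule_by_latency(insns):
    """:LATENCY-style model: never emit NOPs. Prefer an instruction whose
    operands are already available; otherwise emit the oldest legal one
    and let the hardware stall."""
    pending = list(insns)
    out = []
    ready_at = {}
    while pending:
        slot = len(out)
        choice = None
        for i, insn in enumerate(pending):
            if any(conflicts(e, insn) for e in pending[:i]):
                continue  # depends on something not yet emitted
            if all(ready_at.get(r, 0) <= slot for r in insn.srcs):
                choice = i
                break
            if choice is None:
                choice = i  # fallback: oldest legal instruction
        insn = pending.pop(choice)
        out.append(insn.name)
        ready_at[insn.dest] = slot + insn.cycles
    return out


# A slow load, its dependent, and two unrelated instructions.
program = [
    Insn("lwz r3", "r3", ("r10",), cycles=3),
    Insn("add r4", "r4", ("r3", "r3")),
    Insn("addi r5", "r5", ("r11",)),
    Insn("addi r6", "r6", ("r12",)),
]

print(pad_with_nops(program))
print(schedule_by_latency(program))
```

On this input the NOP-padding model emits six slots, two of them NOPs, while the latency model emits the same four instructions with the dependent add moved to the end. When there is no independent work to move (for example, only the load and the add), the latency model just emits them back to back and relies on the machine to wait.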
The patch is a little hackish (mostly out of hopes that no changes
would be necessary for the MIPS or SPARC ports, since those two
ports use :DELAY extensively), but seems to work.
I compiled SB-MD5 both with the scheduler enabled and without the
scheduler enabled to see how much of a difference the scheduler makes.
To test, I disassembled SB-MD5's core routine, UPDATE-MD5-BLOCK, and
took a diff between the two disassemblies (modulo the addresses and
so forth). The diff is attached. Suffice it to say, the results are
somewhat disappointing; only minor rearrangements are being done.
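For anyone wanting to repeat the comparison, a diff "modulo the addresses" can be sketched roughly as follows in Python. This is an assumption-laden sketch, not the procedure actually used for the attached diff: it assumes each disassembly line starts with "; <hex address>: " and simply strips that prefix before diffing. The mnemonics in the sample listings are placeholders, not real output.

```python
import difflib
import re

# Assumed line shape: "; 1000: ADD A, B" (semicolon, hex address, colon).
ADDRESS = re.compile(r"^;\s*[0-9A-Fa-f]+:\s*")


def normalize(listing):
    """Drop the leading address from each line and skip blank lines."""
    lines = []
    for line in listing.splitlines():
        line = ADDRESS.sub("", line.rstrip())
        if line:
            lines.append(line)
    return lines


def diff_listings(unscheduled, scheduled):
    """Unified diff of two disassembly listings, ignoring addresses."""
    return list(difflib.unified_diff(
        normalize(unscheduled), normalize(scheduled),
        "unscheduled", "scheduled", lineterm=""))


# Placeholder listings: same code at different addresses, then reordered.
a = "; 1000: ADD A, B\n; 1004: SUB C, D\n"
b = "; 2000: ADD A, B\n; 2004: SUB C, D\n"
c = "; 2000: SUB C, D\n; 2004: ADD A, B\n"

print(diff_listings(a, b))            # same code, different addresses
print("\n".join(diff_listings(a, c)))  # reordered instructions
```

Listings that differ only in their addresses produce an empty diff, so whatever remains is down to reordering or different instruction selection.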
I think the problem is the lack of available non-descriptor registers;
MD5 has four chaining variables, leaving only three non-descriptor
registers for temporary use. This constrains the possible reorderings
of the instructions. Furthermore, neither compile run used CFUNC as
the seventh non-descriptor register, further reducing the available
space. (Not quite sure why; this may be worth investigating.)
Better register allocation would likely help the instruction scheduler.
Note: I have not actually timed the two routines, nor have I inspected
the disassemblies from other pieces of code.
Nathan | From Man's effeminate slackness it begins. --Paradise Lost