Hi developers,

I've looked at previous starts on this idea (notably Jonathan Smith's plus some chatter between Nathan and Nikodemus > .5 decades ago), and I've thought to take up the cause again.  I have a working piece of code that operates in "finalize-segment" after having buffered up all instructions in their abstract (not machine-encoded) format, with TNs as their operands.  This works nicely modulo a bug in my understanding of TN liveness.  

To motivate this discussion I'll show two examples of what the thing should be capable of doing, subject to an assumption (presently false) that its def/use tracking at the register level is correct:

* (disassemble '(lambda (a) (setf (svref a 3) #xf00d)))
;; with assembler debugging
@ 15 "eliminate tmp reg for CMP"
  (MOV RDX QWORD PTR [RCX-7]),(MOV RAX 6),(CMP RDX RAX) -> ((MOV RAX 6) (CMP QWORD PTR [RCX-7] RAX))
@ 18 "Imm-to-mem move"
  (MOV RDX 122906),(MOV QWORD PTR [RBX+RAX*4+1] RDX) -> ((MOV QWORD PTR [RBX+RAX*4+1] 122906))

(I see it missed that it could use an immediate 6, subject to RAX being dead, for an operand in an instruction that was already edited one time)
Despite that, it makes mincemeat out of OPTIMIZATIONS item #35:

* (disassemble '(lambda (a i) (aref (the simple-vector a) i)))
;; with assembler debugging
@ 17 "eliminate tmp reg for CMP"
  (MOV RBX RDI),(CMP RBX 0) -> ((CMP RDI 0))
@ 19 "eliminate tmp reg for CMP"
  (MOV RBX RDI),(CMP RBX QWORD PTR [+#S(SB-C:FIXUP :NAME NIL :FLAVOR CODE-OBJECT :OFFSET L1)]) -> ((CMP RDI QWORD PTR [+#S(SB-C:FIXUP :NAME NIL :FLAVOR CODE-OBJECT :OFFSET L1)]))
@ 20 "Relocate BREAK"
  (JMP LE L2),(MOV RAX Const5),(BREAK),(.TRAP-INFO 10 31 RDI RAX) -> ((JMP NLE L3))
@ 17 "Signed compares -> unsigned compare"
  (CMP RDI 0),(JMP L L3),(CMP RDI QWORD PTR [+#S(SB-C:FIXUP :NAME NIL :FLAVOR CODE-OBJECT :OFFSET L1)]),(JMP NLE L3) -> ((CMP RDI QWORD PTR [+#S(SB-C:FIXUP :NAME NIL :FLAVOR CODE-OBJECT :OFFSET L1)]) (JMP A L3))
@ 19 "eliminate tmp reg for CMP"
  (MOV RBX QWORD PTR [RDX-7]),(MOV RSI RDI),(CMP RBX RSI) -> ((MOV RSI RDI) (CMP QWORD PTR [RDX-7] RSI))
@ 19 "eliminate tmp reg for vector index in safe SVREF"
  (MOV RSI RDI),(CMP QWORD PTR [RDX-7] RSI),(JMP BE L4),(MOV RDX QWORD PTR [RDX+RDI*4+1]) -> ((CMP QWORD PTR [RDX-7] RDI) (JMP BE L4) (MOV RDX QWORD PTR [RDX+RSI*4+1]))
@ 17 "vector bound check subsumes INDEX type check"
  (CMP RDI QWORD PTR [+#S(SB-C:FIXUP :NAME NIL :FLAVOR CODE-OBJECT :OFFSET L1)]),(JMP A L3),(CMP QWORD PTR [RDX-7] RDI),(JMP BE L4) -> ((CMP QWORD PTR [RDX-7] RDI) (JMP BE L4))

Here's the fly in the ointment -

My first cut performed optimizations in a vacuum, without knowing IR2 stuff.
It was highly conservative, examining only straight-line code; and as well as being unmaintainable and inflexible crud, could not perform interesting transformations.
For what little it was worth, this could operate on anybody's x86 assembly source code doing only very trivial rewrites such as
 "AND RAX,imm;AND RAX,other-imm" -> "AND RAX,(AND imm other-imm)"

My next approach looked for TNs that could be deleted with impunity since SBCL's abstract assembly code (not machine-encoded) has the TNs.  When we find for example "MOV t1, [mem]; MOV t2, t1" with no other apparent use of t1 then we can maybe substitute "MOV t2, [mem]" 
This approach scanned the assembly segment for references to context TNs (the fragment of code in which replacement occurs) outside that context. This admits more optimization but is still wrong because the helpful MOVE function elides references to TNs in the same register, so despite that you see no uses of a TN by looking at instructions, that's not necessarily true.

My third way got rid of the scanning logic that didn't work, and avails itself of the TN-READS slot but:
 (1) it doesn't work because code emitters can use MAKE-RANDOM-TN to put something in an instruction that is unreflective of what the VOP operands were,
AND
 (2) I thought I understood TN-REFs but I don't.

My sense was that sufficed to ensure that each TN in the instructions being edited (the context), to be a candidate for elision, must have exactly one read ref and that the ref is also in the context, except that some ref can read a TN from anywhere else. (This glosses over some points but that's the general idea)  These conditions holding, all the above replacements that say "eliminate ..." would be admissible.

I suspect that a hybrid approach, with some way of detecting random TN creation will win, and ultimately be inordinately simpler than abstract interpretation and register tracing.  I preserved my hack to keep "do-nothing" TN moves just to see them in case of need; then they're deleted when calling the real assembler.

Anyway, my changes to 'assem.lisp' are significantly less invasive than one would think; the hair is in other files plus some ancillary cleanups.  All the EMIT-foo macros just buffer stuff up, and then FINALIZE-SEGMENT calls the rewriter which then calls back to the real EMIT encoders.
Nothing in the scheduling assembler logic is altered - this could be a pre-pass to instruction-level scheduling if so desired. (They're at different levels of abstraction)

On the plus side, SBCL is able to compile itself and pass all tests in the following regimes:
- the target-feature to bring in vm-specific rewriting rules is absent (i.e. the few hooks in 'assem' are in, the new rule driving loop is in, but the rewriter has nothing to do).
- the rules are present but those which use the mythical TN-ELIDABLE-P test are disabled
These give confidence that the assembler still works.

I'm certainly happy to compose a patch of everything, especially since at least a few parts of it are factorable out such as adding an assembler directive for '.trap-info' rather than having to reverse engineer that "BYTE;BYTE;BYTE..." was perhaps a list of SC-OFFSETs for a trap handler. This allows for example possible elision of a register-register move for a register holding
the bound of a vector - you have to alter the trap data too. (Basically asm-time copy-propagation)

But rather than my mailing this around as is, might I ask who would be willing to help me understand the piece I lack, namely a TN-ELIDABLE-P predicate?

Regards and thanks,

Doug