Menu

#929 Redundant generated code

None
open
nobody
redundancy (1)
5
2024-06-29
2024-06-17
No

I'm testing Version 4.4.1 #14887 (MINGW64). As before, I'm trying to optimise some fixed point math and even at insane level of optimisation (10000000 nodes) the resulting code (after hours of processing) seems suboptimal.

This is the C

x1 =  (cp*x0+sp*z0)/256;
z2 =  (cp*z0-sp*x0)/128;

For the first line I get:

    ld  de, (_MoveEnemy_x0_20001_650)
    ld  hl, (_cp)
    call    __mulint
    push    de
    ld  de, (_MoveEnemy_z0_20001_650)
    ld  hl, (_sp)
    call    __mulint
    ex  de, hl            
    pop de               ;<- why not popping hl ?
    add hl, de           ; it is neglecting that HL+DE=DE+HL
    ex  de, hl
    ld  b, d             ; why do we need bc ?
    bit 7, d
    jr  Z, 00122$
    ld  c, e
    ld  b, d
    dec bc
    inc b
00122$:
    ld  l, b
    ld  a, l
    rlca
    sbc a, a
    ld  h, a
    ld  (_MoveEnemy_x1_20001_650), hl

It could have been:

    ld  de, (_MoveEnemy_x0_20001_650)
    ld  hl, (_cp)
    call    __mulint
    push    de
    ld  de, (_MoveEnemy_z0_20001_650)
    ld  hl, (_sp)
    call    __mulint
    pop hl                  ; pop directly in HL
    add hl, de
    bit 7, h                ; avoid BC
    jr  Z, 00122$
    dec hl
    inc h                   ; if HL<0 add 255
00122$:
    ld  l, h
    ld  a, l
    rlca
    sbc a, a
    ld  h, a
    ld  (_MoveEnemy_x1_20001_650), hl
    ~~~

I know that the difference is only few opcodes/bytes, but this is a sequence repeated for each point in my code many hundred of times per second.


For the second line of C (z2 =  (cp*z0-sp*x0)/128) I get this:
ld  de, (_MoveEnemy_z0_20001_650)
ld  hl, (_cp)
call    __mulint
ex  de, hl                        ; WHY NOT DE?
push    hl
ld  de, (_MoveEnemy_x0_20001_650)
ld  hl, (_sp)
call    __mulint
pop hl
cp  a, a
sbc hl, de
ex  de, hl                      ; WHY NOT HL DIRECLY?
ld  c, e
ld  b, d
bit 7, d
jr  Z, 00123$
ld  hl, #0x007f                 ; WHY NOT INC H/DEC HL ?
add hl, de
ld  c, l
ld  b, h

00123$:
sra b
rr c
sra b
rr c
sra b
rr c
sra b
rr c
sra b
rr c
sra b
rr c
sra b
rr c
ld (_MoveEnemy_z2_20001_650), bc

Even here, it could have done better in this way
ld  de, (_MoveEnemy_z0_20001_650)
ld  hl, (_cp)
call    __mulint
push    de
ld  de, (_MoveEnemy_x0_20001_650)
ld  hl, (_sp)
call    __mulint
pop hl
cp  a, a
sbc hl, de
bit 7, h
jr  Z, 00123$
inc h
dec hl

00123$:
add hl, hl ; instead of shifting 7 times right, shift once left and 8 times right
ld l,h
ld a, l
rlca
sbc a, a
ld h, a
ld (_MoveEnemy_z2_20001_650), hl
~~~

This time the difference is huge, not only for the redundancy in register manipulation, but also because of the shift.
As general optimisation, shifting left 16 bit quantities, thanks to ADD HL,HL, is more efficient that shifting right.

Discussion

1 2 > >> (Page 1 of 2)
  • Ragozini Arturo

    Ragozini Arturo - 2024-06-17

    For the sake of precision, considering the worst branch, in the former case, we pass from 251 cycles and 42 bytes, to 226 cycles and 37 bytes, in the latter case, we pass from 397 cycles and 71 bytes to 248 cycles and 40 bytes.

     

    Last edit: Ragozini Arturo 2024-06-17
  • Philipp Klaus Krause

    I'm a bit busy with other stuff for the moment, but if you can provide a small, self-contained, compileable example, I should be able to look into it at the end of the month.

     
  • Ragozini Arturo

    Ragozini Arturo - 2024-06-17

    Here it is! The corresponding ASM is generated with an optimization level of 200.000 nodes ... and it is even worse than mine as the compiler now is using IY to shift the second result (!!)
    Not only it is doing 7+7 8 bit shift operation instead of one 16 bit shift, but it is also using the slowest register

     

    Last edit: Ragozini Arturo 2024-06-17
  • Maarten Brock

    Maarten Brock - 2024-06-18

    As for why not HL or why not DE, I think the register allocator cannot change the allocated register(s) halfway through the lifetime. So first the mulint result is stored in the registers chosen for the intermediate and then it is spilt to the stack and later restored.

     
    • Ragozini Arturo

      Ragozini Arturo - 2024-06-18

      In this case, probably, the problem is in the register allocator that is doing plenty of dummy jugglering passing values from here to there and back from there to here....
      This is PUSH DE

      ex  de, hl
      push    hl
      

      and this

      ex  de, hl                      
      ld  c, e
      ld  b, d
      bit 7, d
      

      is equivalent to use HL in place of BC (and DE )

      Moreover pushing a register on the stack and popping the value to another register is a way to pass a value from a 16bit register to another. Probably, once a value is pushed on the stack, the register allocator should reset its choices and repeat the evaluation when it is time to pop it.

       

      Last edit: Ragozini Arturo 2024-06-19
  • Ragozini Arturo

    Ragozini Arturo - 2024-06-19

    note: I did a mistake in the rounding for negative shifts.
    The code should add 127, instead of 255, before shifting right 7 bits.
    Actually, if you shift left 1 position, add 255 and then shift right 8 places all goes right
    The correct shifting should work like this

    add hl,hl
    jr  nc, 00123$   ; here C holds the sign of HL
    inc h
    dec hl
    00123$:
    ld l,h
    ld a, l
    rlca
    sbc a, a
    ld h, a
    

    In this way, it is even shorter and faster because ADD HL,HL replaces also BIT 7,H

    The complete code results:

    ld  de, (_MoveEnemy_z0_20001_650)
    ld  hl, (_cp)
    call    __mulint
    push    de
    ld  de, (_MoveEnemy_x0_20001_650)
    ld  hl, (_sp)
    call    __mulint
    pop hl
    cp  a, a
    sbc hl, de
    add hl,hl
    jr  nc, 00123$   ; here C holds the sign of HL
    inc h
    dec hl
    00123$:
    ld l,h
    ld a, l
    rlca
    sbc a, a
    ld h, a
    ld (_MoveEnemy_z2_20001_650), hl
    

    corresponding to 238 cycles and 38 bytes.

     

    Last edit: Ragozini Arturo 2024-06-19
    • Maarten Brock

      Maarten Brock - 2024-06-20

      I don't think this is correct. The RLCA throws away the "sign of HL" in the carry flag. You also don't need to load A with L.

       
      • Ragozini Arturo

        Ragozini Arturo - 2024-06-20

        I do not see your point.
        RLCA is used to load in C the bit 7 of H. This has to be done AFTER you have added the rounding factor, which in turn is applied only when HL is negative.
        What you mean?

        The sole change could be the use of RLC H instead of LD A, L / RLCA
        But both takes two bytes and cost 10 cycles so here they are totally equivalent

         
        • Maarten Brock

          Maarten Brock - 2024-06-20

          I believe the intent was to divide by 128. Your implementation is to first shift 1 bit left and then 8 bits right. ADD HL,HL performs the shift left and places the original MSbit in the carry flag. Since this is the sign bit you must sign extend this bit in the MSByte H. Instead you shift the second MSbit into the carry using RLCA and sign extend that one into H. Using RLC H would also shift the second MSbit into the carry.

           
          • Ragozini Arturo

            Ragozini Arturo - 2024-06-20

            No, you have to extend the sign bit after having added the rounding factor.
            There are cases where the two quantities differ.

            If the sign bit would have been the one you say you could have branched two separate paths on the sole test on the sign of HL, after ADD HL,HL.
            This is not the case.

            This code would be the consequence of what you propose

                ld  de, (_z0)
                ld  hl, (_cp)
                call    __mulint
                push    de
                ld  de, (_x0)
                ld  hl, (_sp)
                call    __mulint
                pop hl
                cp  a
                sbc hl, de
                add hl,hl
                jr  nc,positive   ; here C holds the sign of HL
                inc h               ; negative
                dec hl
                ld l,h  
                ld h,#-1
                ld (_z), hl
                jp cont:
            positive:
                ld l,h
                ld h,#0
                ld (_z), hl
            cont:
            

            The problem is that adding 127 in the negative case can lead to a change of sign in overflow cases. You would get a "wrong" result when overflow occurs.
            I know it could be considered marginal but you cannot know how the result will be used later.

            Consider -1 : you expect -1/128 = 0
            If you take the sign AFTER the rounding (addition of 127) you get 01111110b>>7 which is 0
            If you take the sign BEFORE the rounding, you get -1

             

            Last edit: Ragozini Arturo 2024-06-20
  • Ragozini Arturo

    Ragozini Arturo - 2024-06-20

    Anyway, you are right in the sense that, having shifted everything 1 bit left, after adding the 255 I should test the sign of the 17th bit (which is missing) .
    In this case, willing to test bit 17, you cannot use DEC HL/INC H
    The sole option is to use ADD HL,DE and resort to the CF

    This code is fully equivalent to the 7 bit shifting

        ld  de, (_z0)
        ld  hl, (_cp)
        call    __mulint
        push    de
        ld  de, (_x0)
        ld  hl, (_sp)
        call    __mulint
        pop hl
        xor  a
        sbc hl, de
        add hl,hl
        jr  nc,positive     ; here C holds the sign of HL
        ld de,#255
        add hl,de
        scf                 ; if CF then the sign has to become positive
        sbc a, a
    positive:
        ld l,h
        ld h,a
        ld (_z), hl
    

    Which costs 244 cycles and 39 bytes

     

    Last edit: Ragozini Arturo 2024-06-20
  • Philipp Klaus Krause

    Ticket moved from /p/sdcc/bugs/3747/

    Can't be converted:

    • _category: Z80
     
  • Philipp Klaus Krause

        ex  de, hl  
        pop de               ;<- why not popping hl ?
        add hl, de           ; it is neglecting that HL+DE=DE+HL
        ex  de, hl
    

    Code generation can't pop it into a different register than where it was pushed from. It is one variable, placed in de, and having it in hl instead would mean hl also before the push, which doesn't look easily possible. So this part would have to be done by the peephole optimizer. There we get an extra complication: a rule that replaces the first three lines wouldn't trigger, since the new code would leave a different value in de, and the peephole optimizer wouldn't realize that the value in de after add hl, de in unused (after all it is read by the ex de, hl instruction, and the peephole optimizer is not smart enough to keep looking to check what happens to hl after that.

    I think a good approach here would be to first find out why the result of the addition is put into de instead of just being left in hl. Once register allocation and codegen put the addition result into hl, then a peephole optimizer rule could use the pop hl, to get us finally to what you want:

        pop hl                  ; pop directly in HL
        add hl, de
    
     
    • Philipp Klaus Krause

      This part is implemented in [r14897].

       
      👍
      1

      Related

      Commit: [r14897]

      • Ragozini Arturo

        Ragozini Arturo - 2024-06-24

        Now the peephole optimizer could replace

            ex  de, hl            
            pop de               ;<- why not popping hl ?
            add hl, de           ; it is neglecting that HL+DE=DE+HL
        

        by

            pop hl
            add hl, de 
        

        Couldn't it ?

         

        Last edit: Ragozini Arturo 2024-06-24
        • Philipp Klaus Krause

          It can, with the right rules: [r14898].

           
          👍
          1

          Related

          Commit: [r14898]

  • Ragozini Arturo

    Ragozini Arturo - 2024-06-24

    Ok, the first step is to undersand why this sequence

        ex  de, hl
        ld  b, d             ; why do we need bc ?
        bit 7, d
        jr  Z, 00122$
        ld  c, e
        ld  b, d
        dec bc
        inc b
    

    isn't generated like this

        bit 7, h
        jr  Z, 00122$
        dec hl
        inc h
    
     
    • Philipp Klaus Krause

      The decision to have a copy is made by earlier stages of the compiler. The register allocator then decides to put that copy in bc. Compiling with --i-code-in-asm shows the iTemps (i.e. the variables the register allocator has to allocate) by name in the asm output. Compiling with --dump-i-code shows shows at which stage the iTemp was introduced.

       
      👍
      1
  • Philipp Klaus Krause

    In [r14899], better code is generated for the right shift.

     
    😄
    1
    👍
    1

    Related

    Commit: [r14899]

  • Ragozini Arturo

    Ragozini Arturo - 2024-06-26

    I've just tested Version 4.4.1 #14899 (MINGW32). I get an error in the linking step, but this could be a problem in MSXGL. Anyway, the resulting code for the rotation now is this one and it is improved a lot! Thanks!

    ;./ManageEnemy.c:74: z2 =  (cp*z0-sp*x0)/128;   // 8.8 * 15.0 + 8.8 * 16.0
        ld  de, (_AnimateExplosion_z0_20001_651)
        ld  hl, (_cp)
        call    __mulint
        push    de
        ld  de, (_AnimateExplosion_x0_20001_651)
        ld  hl, (_sp)
        call    __mulint
        pop hl
        cp  a, a
        sbc hl, de
        ld  e, l              ; this is strange
        ld  d, h              ; this is strange
        bit 7, d
        jr  Z, 00119$
        ld  hl, #0x007f
        add hl, de
    00119$:
        add hl, hl
        ld  l, h
        ld  h, #0x00
        jr  nc,00224$
        dec h
    00224$:
        ld  (_AnimateExplosion_z2_20001_651), hl
    

    I've only one small complain about the fact that after SBC HL,DE the compiler is copying HL to DE instead of testing the sign of HL directly.
    The passage could have been this:

        sbc hl, de
        bit 7, h
        jr  Z, 00119$
        ld  de, #0x007f
        add hl, de
    00119$:
        add hl, hl
    

    Could the peephole optimizer fix this problem ?

    The other line of C is converted in this way

    ;./ManageEnemy.c:73: x1 =  (cp*x0+sp*z0)/256;   // 8.8 * 16.0 + 8.8 * 15.0
        ld  de, (_AnimateExplosion_x0_20001_651)
        ld  hl, (_cp)
        call    __mulint
        push    de
        ld  de, (_AnimateExplosion_z0_20001_651)
        ld  hl, (_sp)
        call    __mulint
        ex  de, hl              ; the peephole optimizer missed this 
        pop de
        add hl, de
        ld  b, h                ; why h is copied to b? it could keep using HL
        bit 7, h
        jr  Z, 00118$
        ld  c, l                ; why BC ?
        ld  b, h
        dec bc
        inc b
    00118$:
        ld  l, b
        ld  a, l
        rlca
        sbc a, a
        ld  h, a
        ld  (_AnimateExplosion_x1_20001_651), hl
    

    This time there is something strange about the need of copying HL to BC. It could have been avoided in this way

        add hl, de
        bit 7, h
        jr  Z, 00118$
        dec hl
        inc h
    00118$:
        ld  l, h
        ld  a, l
        rlca
        sbc a, a
        ld  h, a
        ld  (_AnimateExplosion_x1_20001_651), hl
    

    Moreover this section

        ex  de, hl
        pop de
        add hl, de
    

    could become just

        pop hl
        add hl, de
    

    The peephole optimizer probably here didn't work

     

    Last edit: Ragozini Arturo 2024-06-27
  • Ragozini Arturo

    Ragozini Arturo - 2024-06-27

    The peephole optimizer needs an additional rule for ex de,hl/ pop de/addl hl,de
    Maybe this could work

     replace restart {
        ex  de, hl
        pop hl
        add hl,de
    } by {
        pop de
        add hl,de    
    } if notUsed( 'de' )
    
     

    Last edit: Ragozini Arturo 2024-06-27
    • Philipp Klaus Krause

      That is current rule 140b.

       
      • Ragozini Arturo

        Ragozini Arturo - 2024-06-27

        So, why it is not applied to the above code?

         
        • Philipp Klaus Krause

          Don't know. Need C source to reproduce, along with the options that result in the asm where you think it should be applied. Then I can use a build of sdcc with the #if 0 in

          #if 0
          #define D(_s) { printf _s; fflush(stdout); }
          #else
          #define D(_s)
          #endif
          

          in src/z80/peep.c changed to #if 1, to get a trace of why an attempt to apply that rule would fail.

           
          👍
          1

          Last edit: Philipp Klaus Krause 2024-06-27
          • Ragozini Arturo

            Ragozini Arturo - 2024-06-27

            Here it is. Note in the ASM at line 103 the sequence

                ex  de, hl
                pop de
                add hl, de
            

            Also at line 106 and following the use of BC where HL would have been more efficient. Instead of :

                ld  b, h
                bit 7, h
                jr  Z, 00103$
                ld  c, l
                ld  b, h
                dec bc
                inc b
            00103$:
                ld  hl, #_test_x1_10000_5
                ld  a,b
                ld  (hl),a
                rlca
                sbc a, a
                ld  (_test_x1_10000_5+1), a
            

            it should have done

                bit 7, h
                jr  Z, 00103$
                dec hl
                inc h
            00103$:
               ld l,h
               ld a,h
               sbc a,a
               ld h,a
                ld  (_test_x1_10000_5), hl
            

            Look at line 131 and following the use of DE where HL would have been more efficient.
            Instead of

                ld  e, l
                ld  d, h
                bit 7, d
                jr  Z, 00104$
                ld  hl, #0x007f
                add hl, de
            

            it should have used

                bit 7, h
                jr  Z, 00104$
                ld  de #0x007f
                add hl, de
            
             

            Last edit: Ragozini Arturo 2024-06-27
1 2 > >> (Page 1 of 2)

Log in to post a comment.