Small Device C Compiler (SDCC) / Feature Requests / #929 Redundant generated code

Ragozini Arturo - 2024-06-17

For the sake of precision, considering the worst branch, in the former case, we pass from 251 cycles and 42 bytes, to 226 cycles and 37 bytes, in the latter case, we pass from 397 cycles and 71 bytes to 248 cycles and 40 bytes.

Last edit: Ragozini Arturo 2024-06-17

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2024-06-17

I'm a bit busy with other stuff for the moment, but if you can provide a small, self-contained, compileable example, I should be able to look into it at the end of the month.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2024-06-17

Here it is! The corresponding ASM is generated with an optimization level of 200.000 nodes ... and it is even worse than mine as the compiler now is using IY to shift the second result (!!)
Not only it is doing 7+7 8 bit shift operation instead of one 16 bit shift, but it is also using the slowest register

Last edit: Ragozini Arturo 2024-06-17

template.asm

template.c

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Maarten Brock - 2024-06-18

As for why not HL or why not DE, I think the register allocator cannot change the allocated register(s) halfway through the lifetime. So first the mulint result is stored in the registers chosen for the intermediate and then it is spilt to the stack and later restored.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Ragozini Arturo - 2024-06-18
  
  In this case, probably, the problem is in the register allocator that is doing plenty of dummy jugglering passing values from here to there and back from there to here....
  This is PUSH DE
  
  ex de, hl push hl
  
  and this
  
  ex de, hl ld c, e ld b, d bit 7, d
  
  is equivalent to use HL in place of BC (and DE )
  
  Moreover pushing a register on the stack and popping the value to another register is a way to pass a value from a 16bit register to another. Probably, once a value is pushed on the stack, the register allocator should reset its choices and repeat the evaluation when it is time to pop it.
  
  Last edit: Ragozini Arturo 2024-06-19
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2024-06-19

note: I did a mistake in the rounding for negative shifts.
The code should add 127, instead of 255, before shifting right 7 bits.
Actually, if you shift left 1 position, add 255 and then shift right 8 places all goes right
The correct shifting should work like this

add hl,hl jr nc, 00123$ ; here C holds the sign of HL inc h dec hl 00123$: ld l,h ld a, l rlca sbc a, a ld h, a

In this way, it is even shorter and faster because ADD HL,HL replaces also BIT 7,H

The complete code results:

ld de, (_MoveEnemy_z0_20001_650) ld hl, (_cp) call __mulint push de ld de, (_MoveEnemy_x0_20001_650) ld hl, (_sp) call __mulint pop hl cp a, a sbc hl, de add hl,hl jr nc, 00123$ ; here C holds the sign of HL inc h dec hl 00123$: ld l,h ld a, l rlca sbc a, a ld h, a ld (_MoveEnemy_z2_20001_650), hl

corresponding to 238 cycles and 38 bytes.

Last edit: Ragozini Arturo 2024-06-19
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Maarten Brock - 2024-06-20
  
  I don't think this is correct. The RLCA throws away the "sign of HL" in the carry flag. You also don't need to load A with L.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Ragozini Arturo - 2024-06-20
    
    I do not see your point.
    RLCA is used to load in C the bit 7 of H. This has to be done AFTER you have added the rounding factor, which in turn is applied only when HL is negative.
    What you mean?
    
    The sole change could be the use of RLC H instead of LD A, L / RLCA
    But both takes two bytes and cost 10 cycles so here they are totally equivalent
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Maarten Brock - 2024-06-20
      
      I believe the intent was to divide by 128. Your implementation is to first shift 1 bit left and then 8 bits right. ADD HL,HL performs the shift left and places the original MSbit in the carry flag. Since this is the sign bit you must sign extend this bit in the MSByte H. Instead you shift the second MSbit into the carry using RLCA and sign extend that one into H. Using RLC H would also shift the second MSbit into the carry.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Ragozini Arturo - 2024-06-20
        
        No, you have to extend the sign bit after having added the rounding factor.
        There are cases where the two quantities differ.
        
        If the sign bit would have been the one you say you could have branched two separate paths on the sole test on the sign of HL, after ADD HL,HL.
        This is not the case.
        
        This code would be the consequence of what you propose
        
        ld de, (_z0) ld hl, (_cp) call __mulint push de ld de, (_x0) ld hl, (_sp) call __mulint pop hl cp a sbc hl, de add hl,hl jr nc,positive ; here C holds the sign of HL inc h ; negative dec hl ld l,h ld h,#-1 ld (_z), hl jp cont: positive: ld l,h ld h,#0 ld (_z), hl cont:
        
        The problem is that adding 127 in the negative case can lead to a change of sign in overflow cases. You would get a "wrong" result when overflow occurs.
        I know it could be considered marginal but you cannot know how the result will be used later.
        
        Consider -1 : you expect -1/128 = 0
        If you take the sign AFTER the rounding (addition of 127) you get 01111110b>>7 which is 0
        If you take the sign BEFORE the rounding, you get -1
        
        Last edit: Ragozini Arturo 2024-06-20
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2024-06-20

Anyway, you are right in the sense that, having shifted everything 1 bit left, after adding the 255 I should test the sign of the 17th bit (which is missing) .
In this case, willing to test bit 17, you cannot use DEC HL/INC H
The sole option is to use ADD HL,DE and resort to the CF

This code is fully equivalent to the 7 bit shifting

ld de, (_z0) ld hl, (_cp) call __mulint push de ld de, (_x0) ld hl, (_sp) call __mulint pop hl xor a sbc hl, de add hl,hl jr nc,positive ; here C holds the sign of HL ld de,#255 add hl,de scf ; if CF then the sign has to become positive sbc a, a positive: ld l,h ld h,a ld (_z), hl

Which costs 244 cycles and 39 bytes

Last edit: Ragozini Arturo 2024-06-20
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2024-06-24

Ticket moved from /p/sdcc/bugs/3747/

Can't be converted:

_category: Z80
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2024-06-24

ex de, hl pop de ;<- why not popping hl ? add hl, de ; it is neglecting that HL+DE=DE+HL ex de, hl

Code generation can't pop it into a different register than where it was pushed from. It is one variable, placed in de, and having it in hl instead would mean hl also before the push, which doesn't look easily possible. So this part would have to be done by the peephole optimizer. There we get an extra complication: a rule that replaces the first three lines wouldn't trigger, since the new code would leave a different value in de, and the peephole optimizer wouldn't realize that the value in de after add hl, de in unused (after all it is read by the ex de, hl instruction, and the peephole optimizer is not smart enough to keep looking to check what happens to hl after that.

I think a good approach here would be to first find out why the result of the addition is put into de instead of just being left in hl. Once register allocation and codegen put the addition result into hl, then a peephole optimizer rule could use the pop hl, to get us finally to what you want:

pop hl ; pop directly in HL add hl, de
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2024-06-24
  
  This part is implemented in [r14897].
  
  👍
  1
  
  Related
  
  Commit: [r14897]
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Ragozini Arturo - 2024-06-24
    
    Now the peephole optimizer could replace
    
    ex de, hl pop de ;<- why not popping hl ? add hl, de ; it is neglecting that HL+DE=DE+HL
    
    by
    
    pop hl add hl, de
    
    Couldn't it ?
    
    Last edit: Ragozini Arturo 2024-06-24
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Philipp Klaus Krause - 2024-06-24
      
      It can, with the right rules: [r14898].
      
      👍
      1
      
      Related
      
      Commit: [r14898]
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2024-06-24

Ok, the first step is to undersand why this sequence

ex de, hl ld b, d ; why do we need bc ? bit 7, d jr Z, 00122$ ld c, e ld b, d dec bc inc b

isn't generated like this

bit 7, h jr Z, 00122$ dec hl inc h
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2024-06-24
  
  The decision to have a copy is made by earlier stages of the compiler. The register allocator then decides to put that copy in bc. Compiling with --i-code-in-asm shows the iTemps (i.e. the variables the register allocator has to allocate) by name in the asm output. Compiling with --dump-i-code shows shows at which stage the iTemp was introduced.
  
  👍
  1
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2024-06-25

In [r14899], better code is generated for the right shift.

😄
1
👍
1

Related

Commit: [r14899]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

I've just tested Version 4.4.1 #14899 (MINGW32). I get an error in the linking step, but this could be a problem in MSXGL. Anyway, the resulting code for the rotation now is this one and it is improved a lot! Thanks!

;./ManageEnemy.c:74: z2 =  (cp*z0-sp*x0)/128;   // 8.8 * 15.0 + 8.8 * 16.0
    ld  de, (_AnimateExplosion_z0_20001_651)
    ld  hl, (_cp)
    call    __mulint
    push    de
    ld  de, (_AnimateExplosion_x0_20001_651)
    ld  hl, (_sp)
    call    __mulint
    pop hl
    cp  a, a
    sbc hl, de
    ld  e, l              ; this is strange
    ld  d, h              ; this is strange
    bit 7, d
    jr  Z, 00119$
    ld  hl, #0x007f
    add hl, de
00119$:
    add hl, hl
    ld  l, h
    ld  h, #0x00
    jr  nc,00224$
    dec h
00224$:
    ld  (_AnimateExplosion_z2_20001_651), hl

I've only one small complain about the fact that after SBC HL,DE the compiler is copying HL to DE instead of testing the sign of HL directly.
The passage could have been this:

    sbc hl, de
    bit 7, h
    jr  Z, 00119$
    ld  de, #0x007f
    add hl, de
00119$:
    add hl, hl

Could the peephole optimizer fix this problem ?

The other line of C is converted in this way

;./ManageEnemy.c:73: x1 =  (cp*x0+sp*z0)/256;   // 8.8 * 16.0 + 8.8 * 15.0
    ld  de, (_AnimateExplosion_x0_20001_651)
    ld  hl, (_cp)
    call    __mulint
    push    de
    ld  de, (_AnimateExplosion_z0_20001_651)
    ld  hl, (_sp)
    call    __mulint
    ex  de, hl              ; the peephole optimizer missed this 
    pop de
    add hl, de
    ld  b, h                ; why h is copied to b? it could keep using HL
    bit 7, h
    jr  Z, 00118$
    ld  c, l                ; why BC ?
    ld  b, h
    dec bc
    inc b
00118$:
    ld  l, b
    ld  a, l
    rlca
    sbc a, a
    ld  h, a
    ld  (_AnimateExplosion_x1_20001_651), hl

This time there is something strange about the need of copying HL to BC. It could have been avoided in this way

    add hl, de
    bit 7, h
    jr  Z, 00118$
    dec hl
    inc h
00118$:
    ld  l, h
    ld  a, l
    rlca
    sbc a, a
    ld  h, a
    ld  (_AnimateExplosion_x1_20001_651), hl

Moreover this section

    ex  de, hl
    pop de
    add hl, de

could become just

    pop hl
    add hl, de

The peephole optimizer probably here didn't work

Last edit: Ragozini Arturo 2024-06-27

Ragozini Arturo - 2024-06-27

The peephole optimizer needs an additional rule for ex de,hl/ pop de/addl hl,de
Maybe this could work

replace restart { ex de, hl pop hl add hl,de } by { pop de add hl,de } if notUsed( 'de' )

Last edit: Ragozini Arturo 2024-06-27
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2024-06-27
  
  That is current rule 140b.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Ragozini Arturo - 2024-06-27
    
    So, why it is not applied to the above code?
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Philipp Klaus Krause - 2024-06-27
      
      Don't know. Need C source to reproduce, along with the options that result in the asm where you think it should be applied. Then I can use a build of sdcc with the #if 0 in
      
      #if 0 #define D(_s) { printf _s; fflush(stdout); } #else #define D(_s) #endif
      
      in src/z80/peep.c changed to #if 1, to get a trace of why an attempt to apply that rule would fail.
      
      👍
      1
      
      Last edit: Philipp Klaus Krause 2024-06-27
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Ragozini Arturo - 2024-06-27
        
        Here it is. Note in the ASM at line 103 the sequence
        
        ex de, hl pop de add hl, de
        
        Also at line 106 and following the use of BC where HL would have been more efficient. Instead of :
        
        ld b, h bit 7, h jr Z, 00103$ ld c, l ld b, h dec bc inc b 00103$: ld hl, #_test_x1_10000_5 ld a,b ld (hl),a rlca sbc a, a ld (_test_x1_10000_5+1), a
        
        it should have done
        
        bit 7, h jr Z, 00103$ dec hl inc h 00103$: ld l,h ld a,h sbc a,a ld h,a ld (_test_x1_10000_5), hl
        
        Look at line 131 and following the use of DE where HL would have been more efficient.
        Instead of
        
        ld e, l ld d, h bit 7, d jr Z, 00104$ ld hl, #0x007f add hl, de
        
        it should have used
        
        bit 7, h jr Z, 00104$ ld de #0x007f add hl, de
        
        Last edit: Ragozini Arturo 2024-06-27
        
        template.asm
        
        template.c
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Redundant generated code

The Small Device C Compiler (SDCC), targeting 8-bit architectures

Group

Searches

Help

#929 Redundant generated code

Discussion

Related

Related

Related