Small Device C Compiler (SDCC) / Bugs / #3687 Regression in code size in newer versions (> 3..5.0)

marian - 2023-12-24

Other big differences in code like dbuf[0] = ledtable[(hour/10)]; in 3.5 we have

mov b,#0x0A mov a,r6 div ab mov dptr,#_ledtable movc a,@a+dptr mov r7,a mov _dbuf,r7

while in 4.4 we have

~~~
mov r7,#0x00
mov __divsint_PARM_2,#0x0a
mov (__divsint_PARM_2 + 1),r7
mov dpl, r6
mov dph, r7
push ar7
push ar6
lcall __divsint
mov r4, dpl
mov r5, dph
pop ar6
pop ar7
mov a,r4
add a, #_ledtable
mov dpl,a
mov a,r5
addc a, #(_ledtable >> 8)
mov dph,a
clr a
movc a,@a+dptr
mov r5,a
mov _dbuf,r5
~~~
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-12-27

Comparing the SDCC 3.5.0 code you posted to what current SDCC 4.4.0RC1 generates, I see the code size regression.
In general, there has indeed been a trend of increasing code size in the mcs51 port up to about SDCC 4.3.0 that reversed only recently. Naturally, the intent of the corresponding changes in SDCC was not to increase code size, but to ensure correctness in some corner cases.

Can you post the main.asm from SDCC 3.5.0 when using --fverbose-asm --i-code-in-asm?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

marian - 2023-12-27

Here it is - I used sdcc -o /tmp/build/ -mmcs51 --opt-code-size --model-small --less-pedantic --fverbose-asm --i-code-in-asm src/main.c.
Don't hesitate to as more details if necessary ; the binary has 2735 bytes, too big to entre the 2K of the flash!

main.asm

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2023-12-27
  
  Thanks. That main.asm is from SDCC 4.2.0, and the same as what I see with SDCC 4.4.0 RC1. Could you post the output from SDCC 3.5.0, please?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - marian - 2023-12-27
    
    Sorry, my mistake! I hope now is OK.
    
    main.asm
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-12-27

For those increment functions, the change is that , basically, 3.5.0 use the global variable multiple times, while newer SDCC instead uses a temporary. Which one is better in terms of code size depends on the cost of using the temporary vs accessing the global variable, but for the small memory model, as we see here, we do get this regression.

But, we do not have a choice here: the code generated by newer SDCC is correct, the one generated by SDCC 3.5.0 is wrong. clock_second is volatile. As we now (AFAIR, the C standard was clarified wrt. to this) know, ++timer_second needs to read timer_second exactly once, then write it exactly once, no matter if the resulting value is used. We could optimize the code by keeping the value in a instead of using r7, but that would save only one byte.

Last edit: Philipp Klaus Krause 2023-12-27

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- marian - 2023-12-27
  
  the code generated by newer SDCC is correct
  
  Maybe, but the old compiler version generated code works (on AT89C2051), so I wouldn't say that it is incorrect! Anyhow, the "old" code is 1972 bytes, so 76 bytes remains free to use...
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Philipp Klaus Krause - 2023-12-27
    
    Well, it works for you, the way you use it, but in general it is not correct.
    The perspective for improvement I see here is: Make the register allocator use a instead of r7 - saves one byte. Then for the if (--next_blink == 0), etc (i.e. the ones where we test for equality to 0, the peephole optimizer could optimize out the use of a, so we directly decrement again (for the jump we don't need to read the volatile value again, as we can just rely on the flags).
    That won't get us to the code size of 3.5.0, but it would already improve the situation a lot.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Maarten Brock - 2023-12-28
      
      The 3.5.0 generated code is wrong for --next_blink == 0 with the read-modify-write dec followed by reading volatile next_blink again. But if the compiler used djnz it would be correct. There is no need for any register use here.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Philipp Klaus Krause - 2023-12-28
        
        I don't think it would be easy to directly emit the djnz in codegen here. I think a good way would be to have the register allocator use a instead of r7, so we'd get this from codegen:
        
        mov a,_next_second dec a mov _next_second,a jz 00145$ ljmp 00104$ 00145$:
        
        which would then be optimized further by the peephole optimizer, until we'd end up with just the djnz.
        
        P.S.: In [r14559], in the next branch, I added peephole rule 303 to eliminate the use of r7 (in the long term, the register allocator should use a directly, so r7 is available for other uses), bringing the code size down to 2069B.
        A further peephole, like this one
        
        replace restart { mov a, %1 dec a mov %1, a } by { dec %1 ; Peephole 304 decremented in %1 instead of a. } if notUsed('a')
        
        could then eliminate the use of a (not possible yet, since notUsed apparently doesn't work well enough here).
        
        P.P.S.: By improving notUsed a bit, the same peephole rule 303 could also eliminate the use of r7 at the increments discussed here.
        
        Related
        
        Commit: [r14559]
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-12-27

I haven't looked into the division / modulo regression much yet, but I suspect that a change was made to improve integer promotion correctness, but we fail to later optimize the ints back into unsigned 8-bit types here. Maybe the optimization can be implemented in generalized constant propagation (we already have a similar optimization for multiplications there).

Last edit: Philipp Klaus Krause 2023-12-27

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2023-12-27
  
  In [r14555], in the next branch (to be merged to trunk after the 4.4.0 release), we now have an optimization that for this code results in div instructions being emitted instead of the support function calls.
  
  Related
  
  Commit: [r14555]
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-12-27

There is one remaining issue not yet discussed in my replies above: The ledtable accesses. In SDCC 3.5.0, these used movc a,@a+dptr for the indexing, which was quite efficient. But now I see the value in a cast to 16 bits, then a full 16.bit addition done.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2023-12-27
  
  Turns out, there is nothing really fancy going on there. Newer SDCC just happens to generate slightly different code (not really better or worse than old SDCC), but there is peephole optimizer rule 186.d that triggers for the old code from codegen.
  So the quick-and-dirty approach here is to just add another rule 186.f that handles the new code.
  In the long term, a better solution would be to implement support for a right operand for GET_VALUE_AT_ADDRESS iCodes (like e.g. in the z80 and stm8 ports).
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Philipp Klaus Krause - 2023-12-27
    
    In [r14556], there is a new peephole rule 186.f in the next branch, that optimizes many of these array accesses. IMO, we got all the low-hanging fruit now; code size is now at 2132 B, I think; further improvements can come at the time of the major improvements in the mcs51 port planned for sometime next year.
    
    Related
    
    Commit: [r14556]
    
    Last edit: Philipp Klaus Krause 2023-12-27
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

marian - 2023-12-30

Another increase of code (from 8 to 19 instructions) is in the C code dbuf[3] = ledtable[(alarm_minute%10)];

In 3.5 we had:

mov b,#0x0A mov a,_alarm_minute div ab mov a,b mov dptr,#_ledtable movc a,@a+dptr mov r7,a mov (_dbuf + 0x0003),r7

In 4.4 we now have

mov r6,_alarm_minute
mov r7,#0x00
mov __modsint_PARM_2,#0x0a
mov (__modsint_PARM_2 + 1),r7
mov dpl, r6
mov dph, r7
lcall __modsint
mov r6, dpl
mov r7, dph
mov a,r6
add a, #_ledtable
mov dpl,a
mov a,r7
addc a, #(_ledtable >> 8)
mov dph,a
clr a
movc a,@a+dptr
mov r7,a
mov (_dbuf + 0x0003),r7
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2023-12-30
  
  That one is already covered by changes in the next branch (i.e. features to be merged to trunk after the 4.4.0 release). Using the next branch, I see no calls to division or modulo support functions in your code, everything uses the div instruction.
  
  In the next branch:
  
  ; src/main.c:641: dbuf[3] = ledtable[(alarm_minute%10)]; mov r7,_alarm_minute mov b,#0x0a mov a,r7 div ab mov a,b add a, #_ledtable mov dpl,a mov a,r6 addc a, #(_ledtable >> 8) mov dph,a clr a movc a,@a+dptr mov (_dbuf + 0x0003), a sjmp 00216$ 00215$:
  
  Not perfect, but better than 4.4.0.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - marian - 2023-12-30
    
    so basically it means the fix will be available in 4.4.1 ? Is it possible to get that branch and test it on my AT80C2051?
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Maarten Brock - 2023-12-31
      
      That is correct.
      You can get that branch by checking out the sources in that branch from subversion and building SDCC yourself. There are no snapshots created.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Regression in code size in newer versions (> 3..5.0)

The Small Device C Compiler (SDCC), targeting 8-bit architectures

Searches

Help

#3687 Regression in code size in newer versions (> 3..5.0)

Discussion

Related

Related

Related