Menu

#3687 Regression in code size in newer versions (> 3..5.0)

open
nobody
None
MCS51
5
2025-02-27
2023-12-24
marian
No

Please look at the project https://github.com/tlovie/at89c2051_clock (that is a fork of https://github.com/ruthsarian/at89c2051_clock) - the code is supposed to enter the 2k AT80C2051 flash.
And indeed with SDCC version 3.5.0 the result code is 1972 bytes only; however compiling with more recent versions results in a larger code: 2730 with the 4.4.0 version (and even 2735 with 4.2.0).

The code is compiled with the options -mmcs51 --opt-code-size --model-small.

Attached there are the assembly listings of both versions; basically there's an increase of 2 instructions in 4.4. from 3.5 in 'ifs' such as if (++timer_second == SEC_MAX) { that produces in 3.5

inc _timer_second
mov a,#0x3C
cjne    a,_timer_second,00103$

and in 4.4

mov a,_timer_second
inc a
mov r7,a
mov _timer_second,r7
cjne    r7,#0x3c,00103$

Even worse in C code like hour = display_hour % 12; - in 3.5 it generates

mov b,#0x0C
mov a,r7
div ab

while in 4.4 we have

mov ar5,r7
mov r6,#0x00
mov __modsint_PARM_2,#0x0c
mov (__modsint_PARM_2 + 1),r6
mov dpl, r5
mov dph, r6
lcall   __modsint
2 Attachments

Discussion

  • marian

    marian - 2023-12-24

    Other big differences in code like dbuf[0] = ledtable[(hour/10)]; in 3.5 we have

     mov    b,#0x0A
    mov a,r6
    div ab
    mov dptr,#_ledtable
    movc    a,@a+dptr
    mov r7,a
    mov _dbuf,r7
    

    while in 4.4 we have

    ~~~
    mov r7,#0x00
    mov __divsint_PARM_2,#0x0a
    mov (__divsint_PARM_2 + 1),r7
    mov dpl, r6
    mov dph, r7
    push ar7
    push ar6
    lcall __divsint
    mov r4, dpl
    mov r5, dph
    pop ar6
    pop ar7
    mov a,r4
    add a, #_ledtable
    mov dpl,a
    mov a,r5
    addc a, #(_ledtable >> 8)
    mov dph,a
    clr a
    movc a,@a+dptr
    mov r5,a
    mov _dbuf,r5
    ~~~

     
  • Philipp Klaus Krause

    Comparing the SDCC 3.5.0 code you posted to what current SDCC 4.4.0RC1 generates, I see the code size regression.
    In general, there has indeed been a trend of increasing code size in the mcs51 port up to about SDCC 4.3.0 that reversed only recently. Naturally, the intent of the corresponding changes in SDCC was not to increase code size, but to ensure correctness in some corner cases.

    Can you post the main.asm from SDCC 3.5.0 when using --fverbose-asm --i-code-in-asm?

     
  • marian

    marian - 2023-12-27

    Here it is - I used sdcc -o /tmp/build/ -mmcs51 --opt-code-size --model-small --less-pedantic --fverbose-asm --i-code-in-asm src/main.c.
    Don't hesitate to as more details if necessary ; the binary has 2735 bytes, too big to entre the 2K of the flash!

     
    • Philipp Klaus Krause

      Thanks. That main.asm is from SDCC 4.2.0, and the same as what I see with SDCC 4.4.0 RC1. Could you post the output from SDCC 3.5.0, please?

       
      • marian

        marian - 2023-12-27

        Sorry, my mistake! I hope now is OK.

         
  • Philipp Klaus Krause

    For those increment functions, the change is that , basically, 3.5.0 use the global variable multiple times, while newer SDCC instead uses a temporary. Which one is better in terms of code size depends on the cost of using the temporary vs accessing the global variable, but for the small memory model, as we see here, we do get this regression.

    But, we do not have a choice here: the code generated by newer SDCC is correct, the one generated by SDCC 3.5.0 is wrong. clock_second is volatile. As we now (AFAIR, the C standard was clarified wrt. to this) know, ++timer_second needs to read timer_second exactly once, then write it exactly once, no matter if the resulting value is used. We could optimize the code by keeping the value in a instead of using r7, but that would save only one byte.

     

    Last edit: Philipp Klaus Krause 2023-12-27
    • marian

      marian - 2023-12-27

      the code generated by newer SDCC is correct

      Maybe, but the old compiler version generated code works (on AT89C2051), so I wouldn't say that it is incorrect! Anyhow, the "old" code is 1972 bytes, so 76 bytes remains free to use...

       
      • Philipp Klaus Krause

        Well, it works for you, the way you use it, but in general it is not correct.
        The perspective for improvement I see here is: Make the register allocator use a instead of r7 - saves one byte. Then for the if (--next_blink == 0), etc (i.e. the ones where we test for equality to 0, the peephole optimizer could optimize out the use of a, so we directly decrement again (for the jump we don't need to read the volatile value again, as we can just rely on the flags).
        That won't get us to the code size of 3.5.0, but it would already improve the situation a lot.

         
        • Maarten Brock

          Maarten Brock - 2023-12-28

          The 3.5.0 generated code is wrong for --next_blink == 0 with the read-modify-write dec followed by reading volatile next_blink again. But if the compiler used djnz it would be correct. There is no need for any register use here.

           
          • Philipp Klaus Krause

            I don't think it would be easy to directly emit the djnz in codegen here. I think a good way would be to have the register allocator use a instead of r7, so we'd get this from codegen:

                mov a,_next_second
                dec a
                mov _next_second,a
                jz  00145$
                ljmp    00104$
            00145$:
            

            which would then be optimized further by the peephole optimizer, until we'd end up with just the djnz.

            P.S.: In [r14559], in the next branch, I added peephole rule 303 to eliminate the use of r7 (in the long term, the register allocator should use a directly, so r7 is available for other uses), bringing the code size down to 2069B.
            A further peephole, like this one

            replace restart {
                mov a, %1
                dec a
                mov %1, a
            } by {
                dec %1
                ;   Peephole 304 decremented in %1 instead of a.
            } if notUsed('a')
            

            could then eliminate the use of a (not possible yet, since notUsed apparently doesn't work well enough here).

            P.P.S.: By improving notUsed a bit, the same peephole rule 303 could also eliminate the use of r7 at the increments discussed here.

             

            Related

            Commit: [r14559]

  • Philipp Klaus Krause

    I haven't looked into the division / modulo regression much yet, but I suspect that a change was made to improve integer promotion correctness, but we fail to later optimize the ints back into unsigned 8-bit types here. Maybe the optimization can be implemented in generalized constant propagation (we already have a similar optimization for multiplications there).

     

    Last edit: Philipp Klaus Krause 2023-12-27
    • Philipp Klaus Krause

      In [r14555], in the next branch (to be merged to trunk after the 4.4.0 release), we now have an optimization that for this code results in div instructions being emitted instead of the support function calls.

       

      Related

      Commit: [r14555]

  • Philipp Klaus Krause

    There is one remaining issue not yet discussed in my replies above: The ledtable accesses. In SDCC 3.5.0, these used movc a,@a+dptr for the indexing, which was quite efficient. But now I see the value in a cast to 16 bits, then a full 16.bit addition done.

     
    • Philipp Klaus Krause

      Turns out, there is nothing really fancy going on there. Newer SDCC just happens to generate slightly different code (not really better or worse than old SDCC), but there is peephole optimizer rule 186.d that triggers for the old code from codegen.
      So the quick-and-dirty approach here is to just add another rule 186.f that handles the new code.
      In the long term, a better solution would be to implement support for a right operand for GET_VALUE_AT_ADDRESS iCodes (like e.g. in the z80 and stm8 ports).

       
      • Philipp Klaus Krause

        In [r14556], there is a new peephole rule 186.f in the next branch, that optimizes many of these array accesses. IMO, we got all the low-hanging fruit now; code size is now at 2132 B, I think; further improvements can come at the time of the major improvements in the mcs51 port planned for sometime next year.

         

        Related

        Commit: [r14556]


        Last edit: Philipp Klaus Krause 2023-12-27
  • marian

    marian - 2023-12-30

    Another increase of code (from 8 to 19 instructions) is in the C code dbuf[3] = ledtable[(alarm_minute%10)];

    In 3.5 we had:

        mov b,#0x0A
        mov a,_alarm_minute
        div ab
        mov a,b
        mov dptr,#_ledtable
        movc    a,@a+dptr
        mov r7,a
        mov (_dbuf + 0x0003),r7
    

    In 4.4 we now have

    mov r6,_alarm_minute
    mov r7,#0x00
    mov __modsint_PARM_2,#0x0a
    mov (__modsint_PARM_2 + 1),r7
    mov dpl, r6
    mov dph, r7
    lcall __modsint
    mov r6, dpl
    mov r7, dph
    mov a,r6
    add a, #_ledtable
    mov dpl,a
    mov a,r7
    addc a, #(_ledtable >> 8)
    mov dph,a
    clr a
    movc a,@a+dptr
    mov r7,a
    mov (_dbuf + 0x0003),r7

     
    • Philipp Klaus Krause

      That one is already covered by changes in the next branch (i.e. features to be merged to trunk after the 4.4.0 release). Using the next branch, I see no calls to division or modulo support functions in your code, everything uses the div instruction.

      In the next branch:

      ;   src/main.c:641: dbuf[3] = ledtable[(alarm_minute%10)];
          mov r7,_alarm_minute
          mov b,#0x0a
          mov a,r7
          div ab
          mov a,b
          add a, #_ledtable
          mov dpl,a
          mov a,r6
          addc    a, #(_ledtable >> 8)
          mov dph,a
          clr a
          movc    a,@a+dptr
          mov (_dbuf + 0x0003), a
          sjmp    00216$
      00215$:
      

      Not perfect, but better than 4.4.0.

       
      • marian

        marian - 2023-12-30

        so basically it means the fix will be available in 4.4.1 ? Is it possible to get that branch and test it on my AT80C2051?

         
        • Maarten Brock

          Maarten Brock - 2023-12-31

          That is correct.
          You can get that branch by checking out the sources in that branch from subversion and building SDCC yourself. There are no snapshots created.

           

Log in to post a comment.

MongoDB Logo MongoDB