Please look at the project https://github.com/tlovie/at89c2051_clock (that is a fork of https://github.com/ruthsarian/at89c2051_clock) - the code is supposed to enter the 2k AT80C2051 flash.
And indeed with SDCC version 3.5.0 the result code is 1972 bytes only; however compiling with more recent versions results in a larger code: 2730 with the 4.4.0 version (and even 2735 with 4.2.0).
The code is compiled with the options -mmcs51 --opt-code-size --model-small.
Attached there are the assembly listings of both versions; basically there's an increase of 2 instructions in 4.4. from 3.5 in 'ifs' such as if (++timer_second == SEC_MAX) { that produces in 3.5
inc _timer_second
mov a,#0x3C
cjne a,_timer_second,00103$
and in 4.4
mov a,_timer_second
inc a
mov r7,a
mov _timer_second,r7
cjne r7,#0x3c,00103$
Even worse in C code like hour = display_hour % 12; - in 3.5 it generates
mov b,#0x0C
mov a,r7
div ab
while in 4.4 we have
mov ar5,r7
mov r6,#0x00
mov __modsint_PARM_2,#0x0c
mov (__modsint_PARM_2 + 1),r6
mov dpl, r5
mov dph, r6
lcall __modsint
Other big differences in code like
dbuf[0] = ledtable[(hour/10)];in 3.5 we havewhile in 4.4 we have
~~~
mov r7,#0x00
mov __divsint_PARM_2,#0x0a
mov (__divsint_PARM_2 + 1),r7
mov dpl, r6
mov dph, r7
push ar7
push ar6
lcall __divsint
mov r4, dpl
mov r5, dph
pop ar6
pop ar7
mov a,r4
add a, #_ledtable
mov dpl,a
mov a,r5
addc a, #(_ledtable >> 8)
mov dph,a
clr a
movc a,@a+dptr
mov r5,a
mov _dbuf,r5
~~~
Comparing the SDCC 3.5.0 code you posted to what current SDCC 4.4.0RC1 generates, I see the code size regression.
In general, there has indeed been a trend of increasing code size in the mcs51 port up to about SDCC 4.3.0 that reversed only recently. Naturally, the intent of the corresponding changes in SDCC was not to increase code size, but to ensure correctness in some corner cases.
Can you post the main.asm from SDCC 3.5.0 when using --fverbose-asm --i-code-in-asm?
Here it is - I used
sdcc -o /tmp/build/ -mmcs51 --opt-code-size --model-small --less-pedantic --fverbose-asm --i-code-in-asm src/main.c.Don't hesitate to as more details if necessary ; the binary has 2735 bytes, too big to entre the 2K of the flash!
Thanks. That main.asm is from SDCC 4.2.0, and the same as what I see with SDCC 4.4.0 RC1. Could you post the output from SDCC 3.5.0, please?
Sorry, my mistake! I hope now is OK.
For those increment functions, the change is that , basically, 3.5.0 use the global variable multiple times, while newer SDCC instead uses a temporary. Which one is better in terms of code size depends on the cost of using the temporary vs accessing the global variable, but for the small memory model, as we see here, we do get this regression.
But, we do not have a choice here: the code generated by newer SDCC is correct, the one generated by SDCC 3.5.0 is wrong.
clock_secondisvolatile. As we now (AFAIR, the C standard was clarified wrt. to this) know,++timer_secondneeds to readtimer_secondexactly once, then write it exactly once, no matter if the resulting value is used. We could optimize the code by keeping the value in a instead of using r7, but that would save only one byte.Last edit: Philipp Klaus Krause 2023-12-27
Maybe, but the old compiler version generated code works (on AT89C2051), so I wouldn't say that it is incorrect! Anyhow, the "old" code is 1972 bytes, so 76 bytes remains free to use...
Well, it works for you, the way you use it, but in general it is not correct.
The perspective for improvement I see here is: Make the register allocator use a instead of r7 - saves one byte. Then for the
if (--next_blink == 0), etc (i.e. the ones where we test for equality to 0, the peephole optimizer could optimize out the use of a, so we directly decrement again (for the jump we don't need to read the volatile value again, as we can just rely on the flags).That won't get us to the code size of 3.5.0, but it would already improve the situation a lot.
The 3.5.0 generated code is wrong for
--next_blink == 0with the read-modify-writedecfollowed by readingvolatile next_blinkagain. But if the compiler useddjnzit would be correct. There is no need for any register use here.I don't think it would be easy to directly emit the djnz in codegen here. I think a good way would be to have the register allocator use a instead of r7, so we'd get this from codegen:
which would then be optimized further by the peephole optimizer, until we'd end up with just the
djnz.P.S.: In [r14559], in the next branch, I added peephole rule 303 to eliminate the use of
r7(in the long term, the register allocator should useadirectly, sor7is available for other uses), bringing the code size down to 2069B.A further peephole, like this one
could then eliminate the use of
a(not possible yet, since notUsed apparently doesn't work well enough here).P.P.S.: By improving notUsed a bit, the same peephole rule 303 could also eliminate the use of r7 at the increments discussed here.
Related
Commit: [r14559]
I haven't looked into the division / modulo regression much yet, but I suspect that a change was made to improve integer promotion correctness, but we fail to later optimize the ints back into unsigned 8-bit types here. Maybe the optimization can be implemented in generalized constant propagation (we already have a similar optimization for multiplications there).
Last edit: Philipp Klaus Krause 2023-12-27
In [r14555], in the next branch (to be merged to trunk after the 4.4.0 release), we now have an optimization that for this code results in div instructions being emitted instead of the support function calls.
Related
Commit: [r14555]
There is one remaining issue not yet discussed in my replies above: The ledtable accesses. In SDCC 3.5.0, these used
movc a,@a+dptrfor the indexing, which was quite efficient. But now I see the value in a cast to 16 bits, then a full 16.bit addition done.Turns out, there is nothing really fancy going on there. Newer SDCC just happens to generate slightly different code (not really better or worse than old SDCC), but there is peephole optimizer rule 186.d that triggers for the old code from codegen.
So the quick-and-dirty approach here is to just add another rule 186.f that handles the new code.
In the long term, a better solution would be to implement support for a right operand for GET_VALUE_AT_ADDRESS iCodes (like e.g. in the z80 and stm8 ports).
In [r14556], there is a new peephole rule 186.f in the next branch, that optimizes many of these array accesses. IMO, we got all the low-hanging fruit now; code size is now at 2132 B, I think; further improvements can come at the time of the major improvements in the mcs51 port planned for sometime next year.
Related
Commit: [r14556]
Last edit: Philipp Klaus Krause 2023-12-27
Another increase of code (from 8 to 19 instructions) is in the C code
dbuf[3] = ledtable[(alarm_minute%10)];In 3.5 we had:
In 4.4 we now have
mov r6,_alarm_minute
mov r7,#0x00
mov __modsint_PARM_2,#0x0a
mov (__modsint_PARM_2 + 1),r7
mov dpl, r6
mov dph, r7
lcall __modsint
mov r6, dpl
mov r7, dph
mov a,r6
add a, #_ledtable
mov dpl,a
mov a,r7
addc a, #(_ledtable >> 8)
mov dph,a
clr a
movc a,@a+dptr
mov r7,a
mov (_dbuf + 0x0003),r7
That one is already covered by changes in the next branch (i.e. features to be merged to trunk after the 4.4.0 release). Using the next branch, I see no calls to division or modulo support functions in your code, everything uses the div instruction.
In the next branch:
Not perfect, but better than 4.4.0.
so basically it means the fix will be available in 4.4.1 ? Is it possible to get that branch and test it on my AT80C2051?
That is correct.
You can get that branch by checking out the sources in that branch from subversion and building SDCC yourself. There are no snapshots created.