I'm currently looking into calling conventions. The basic idea is this:
See what calling conventions are the most efficient (can depend on function type). Choose an efficient one.
The basis is the noasm2 branch, which replaces all asm functions in the standard library by C code, so no asm code needs to be rewritten when experimenting with different calling conventions.
Also, the code generators need to be fixed so they generate correct code for different calling conventions. Basically: Try a new calling convention in a local copy of sdcc based on the noasm2 branch, if it breaks, fix the bugs in trunk.
So far I want to consider the stm8, hc08-related and z80-related ports. Aspects of the calling convention looked into will include registers used for return value and the use of register parameters.
This is long-term work; I hope to have results sometime in between the SDCC 4.1.0 and 4.2.0 releases.
Is there any particular reason why x would not be eligible as frame pointer?
Technically, their roles could be swapped.
However, x is just so useful to have available to the register allocator, that we don't want to use it for something else.
Many instructions are 1 byte shorter when using x. On the other hand, the frame pointer typically is not used that much. After all, we can reach 255 B of the stack using cheap sp-relative addressing, and only need y for the rest.
I took a bit more time than expected, since I was busy with other stuff this week, and also ran into some bugs and inefficiencies in SDCC that I had to fix first.
But now there are some more results for stm8. I have tried various registers for return values and arguments, and callee vs caller stack cleanup.
Here is the best I saw in score and code size:
This gives us a good idea of what can and what can't be achieved by making changes to the calling convention. However, which convention was best varied between the benchmarks, and between score and size.
However, there are some insights I got:
My idea is that for now, I won't do further experiments on parameters and return value. For the return value, I'll use the a/x/xy approach SDCC already uses, and for parameters the first in a/x, second in a/x approach that turned out to work well (and is essentially the same as the one Raisonance uses, and in many cases also the same as IAR uses).
But I need to do some further experiments, where caller vs callee stack cleanup depends on the type of the return value and on the parameters. I also need to look into caller vs callee-cleanup for functions that return void.
P.S.: For the experiments so far, I used attached makepatches.c, which writes a large number of patches, each of which can be applied to the callingconvention branch of sdcc to try a calling convention, e.g.
Will use a/x for the first 8/16-bit parameter, second parameter always on stack. 8-bit return values in xh, 16-bit return values in y, 32-bit return values in yx (just _ would indicate all parameters on the stack, 2a_2x would try to also pass the second parameter in a/x). The digit is a bit mask: 1 indicates that functions that return 8-bit values use callee-cleanup, but other functions use caller-cleanup (2 for 16-bit, 4 for 32-bit; e.g. 6 would mean callee-cleanup for functions returning 16- and 32-bit values, but not for other functions).
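The digit in the patch name can be read as a bit mask over return-value widths. Here is a minimal sketch of that decoding in C; the function and constant names are made up for illustration and are not taken from makepatches.c, and anything that is not an 8-, 16- or 32-bit return value is simply treated as caller-cleanup.

```c
#include <assert.h>
#include <stdbool.h>

/* Bits of the digit in the patch name, as described above.
   The names are hypothetical, not taken from makepatches.c. */
enum {
    CALLEE_CLEANUP_RET8  = 1, /* functions returning 8-bit values  */
    CALLEE_CLEANUP_RET16 = 2, /* functions returning 16-bit values */
    CALLEE_CLEANUP_RET32 = 4  /* functions returning 32-bit values */
};

/* Returns true if a function whose return value is ret_bits wide
   uses callee-cleanup under the given mask digit (0..7).
   Any other return width falls back to caller-cleanup in this sketch. */
bool uses_callee_cleanup(unsigned mask, unsigned ret_bits)
{
    assert(mask <= 7);
    switch (ret_bits) {
    case 8:  return (mask & CALLEE_CLEANUP_RET8) != 0;
    case 16: return (mask & CALLEE_CLEANUP_RET16) != 0;
    case 32: return (mask & CALLEE_CLEANUP_RET32) != 0;
    default: return false;
    }
}
```

For the digit 6, this gives callee-cleanup for 16- and 32-bit return values and caller-cleanup for 8-bit ones.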
P.P.S.: After the caller/callee experiments I intend to do two more rounds of experiments:
Last edit: Philipp Klaus Krause 2021-06-20
While I will do some more experiments, callee vs caller stack cleanup is likely to not be that easy to decide:
So far I have only done experiments for the medium memory model. Consistent with the above reasoning, the results from Whetstone experiments clearly indicate that callee-cleanup can substantially improve code size, but comes at a noticeable cost in speed.
Caller cleanup cost is typically 2 bytes, 2 cycles. So ignoring potential tail call optimization, for free x, we benefit from callee-cleanup for every function called at least twice, for free y we benefit for every function called at least thrice. With neither x nor y free, code size benefits for each function called at least thrice, but code speed suffers.
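The break-even reasoning above can be written down as a tiny size model. This is only a sketch: the helper name is hypothetical, and the one-time callee overheads of 3 bytes (free x) and 5 bytes (free y) are my assumptions, chosen to be consistent with the "twice" and "thrice" thresholds stated above. Speed and tail call optimization are not modeled.

```c
#include <assert.h>

/* Smallest number of call sites at which callee-cleanup yields smaller
   code than caller-cleanup, under a simple model:
   - each call site saves per_call_bytes (no cleanup needed there),
   - the callee grows once by callee_extra_bytes (longer return sequence).
   We need the smallest n with n * per_call_bytes > callee_extra_bytes. */
unsigned min_calls_for_size_benefit(unsigned per_call_bytes,
                                    unsigned callee_extra_bytes)
{
    assert(per_call_bytes > 0);
    return callee_extra_bytes / per_call_bytes + 1;
}
```

With the assumed numbers, min_calls_for_size_benefit(2, 3) gives 2 and min_calls_for_size_benefit(2, 5) gives 3.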
However, callee-cleanup hinders tail call optimization; the effect is stronger for the large memory model, but happens for both medium and large memory model.
P.S.:
A small part of the above is, however, more an artifact of current codegen: With caller-cleanup, we use ret to return; for callee-cleanup with free x we use popw x; addw sp, #d; jp (x). For callee-cleanup we are forced to use popw x and jp (x), which are 1 byte more and 1 cycle faster than ret. For caller-cleanup we have a choice. That choice is currently always ret. If we didn't use ret, code size and speed would be the same for callee vs caller cleanup with free x or y (again ignoring any effects on tail call optimization).
P.P.S.:
I still want to do more experiments, but at the moment I think it would be a good choice to use callee-cleanup for the medium memory model and keep caller-cleanup for the large memory model.
P.P.P.S.:
Another aspect: callee-cleanup prevents optimizations of early returns. A jp to a label that is just a ret can be optimized into a ret by the peephole optimizer (and code generation could also be made to just emit the ret in some cases). A jp to a popw x; addw sp, #d; jp (x) can't be optimized that way by the peephole optimizer (as it would increase code size). Code generation could do it in some cases for speed, but it would also have a cost in code size. For eligible functions we'd then be in a situation where caller-cleanup needs cleanup code at every call site, while callee-cleanup needs cleanup code at every return statement.
Last edit: Philipp Klaus Krause 2021-06-21
Last night and today, I compiled 1024 different approaches to caller vs callee-cleanup. Basically every combination of caller / callee-cleanup depending:
Benchmarks are again Whetstone, Dhrystone, Coremark, stdcbench. I looked into code size and the medium memory model only this time.
Results:
¹ This single aspect makes a difference of about 2% in code size for Whetstone.
Now I think a reasonable way to proceed for the future calling convention would be:
This seems simple and efficient enough to me. Except for Dhrystone, this convention is always within a few bytes of code size of the best one for the respective benchmark. For Dhrystone, it has code size 0.3 % higher than the best one.
My next steps: Implement __sdccoldcall and __sdccnewcall calling conventions in trunk. The former will do nothing for now, the latter will use the proposed new convention. The users can try it out, so I'll tell them on sdcc-user.
I would next look into experimenting on calling conventions for z80 and related. Maybe while doing those experiments I get some ideas that I want to try on stm8.
Last edit: Philipp Klaus Krause 2021-06-25
Philipp,
Can you please fix the two "calle-cleaup" entries above? You left out the one and only discriminating character.
Done. Thanks.
Regarding the z80 experiments, I am still in an early phase (encountering and fixing bugs exposed by different calling conventions, not having tried caller- vs callee- restore much).
For now, I tried 900 different calling conventions. So far it looks like this:
ld d(sp), hl, which makes it cheap (2 bytes) to move a register parameter to the stack in the callee, while z80 needs two ld d(ix), r (total of 6 bytes). Also, large stack pointer adjustments are expensive for z80, but even more so when hl is in use (e.g. due to holding the return value or a register parameter), while r3ka has add sp, #d, which is cheap and doesn't need a free hl.
There is a need to investigate why passing more than one parameter in registers is not better than the old behavior. It is possible that the compiler generates suboptimal code for callees, because in the worst case code size and speed are unchanged (it makes no difference where the arguments are put on the stack):
There are no extra ld (ix), r operations.
Pros:
1. Saves space on PUSH instructions at each call
2. Speed-up, if arguments can be used without storing them
3. Saves space and time on argument deallocation
Cons:
1. Only inline FP adjustment (costs space, but is faster) if HL is in use.
That would require some mechanism that puts spilled register parameters in stack locations suitable for the use of push. We currently don't have that, and it would take effort to implement.
Currently all spilt variables are handled the same. The stack allocator decides where to put them. The stack allocator does not consider possible use of push/pop; its optimization goal is to use as little stack space as possible.
After further experiments, the picture is a bit more complex, also wrt. multiple register arguments. I still didn't do experiments for callee-cleanup of stack, but by now tried about 1500 different conventions for return values and register arguments for the 4 benchmarks.
The next step will be to eliminate all register argument / return value combinations that are far from the best, and try different caller vs callee stack cleanup approaches.
I've eliminated some conventions that were optimal for none of the benchmarks. Now there are 20 left (before considering caller vs callee stack cleanup). Funnily, the convention that is best for stdcbench for z80 is worse than the current one for Whetstone for z80. And the one that is best for Coremark for r3ka is worse than the current one for Coremark for r3ka.
Still, I think I'll find a good compromise that gives us some improvement for all benchmarks; and I'll look into the caller / callee stuff next (though a few experiments I've already done indicate that for z80 and r3ka the choice of registers for return values and arguments makes a much bigger difference than the caller vs callee stuff).
To me it now looks as if the following would be good choices. So far I have only checked correctness and looked into code size. Performance experiments might follow next week. I also have not yet really done any experiments relevant to gbz80, so I don't have any opinion on that one yet.
For all:
For z80 (and z180, z80n):
For r3ka (and r2k, r2ka, tlcs90, ez80_z80):
For those doing some experiments with these conventions, the attached patches can be applied to the callingconvention branch:
Last edit: Philipp Klaus Krause 2021-07-28
For gbz80, from the first results it is very clear that hl shouldn't be used in the calling convention. I guess with hl being the only pointer register, and stack access as well as many accesses to global variables going through pointers in hl, this makes sense.
For now, it looks like a and e will be used for 8-bit parameters, de and bc for 16-bit parameters, bcde or debc for 32-bit parameters. a for 8-bit return values, de for 16-bit return values, bcde or debc for 32-bit return values.
Some gbz80 experiments on caller- vs callee-restore of stack will be next.
After another round of experiments, this one looks best for gbz80:
Are 1-byte arguments aligned to 2 bytes?
No. Only the pdk ports and __smallc do that.
Would it be better to do it now? Byte alignment takes extra code and saves very little stack space.
This is how a call is prepared:
While the question is still open in general, I think that at least for stm8 and gbz80 we don't want to pass 8-bit arguments as 16 bit:
For z80, the situation is more complicated. With my background (the ColecoVision has only 1 KiB of data memory vs. 32 KiB of code memory) I tend to assume that saving those bytes on the stack is worth it. But there are also platforms where both data and code reside in the same RAM.
One extra byte makes little difference to memory use, because most parameters will be passed in registers. Functions with many parameters (more than 3) are never a good idea anyway, and byte arguments are not used very often either. So I expect stack memory use to rise by no more than 2-5%.
As with other aspects of the calling convention, we can construct examples where one or the other is better. Clearly, not having that extra inc sp in the caller helps save a bit at each call site (unless the peephole optimizer can use a single 16-bit push to pass two parameters). On the other hand, the stack cleanup can get longer (especially when it is done by a sequence of pop af or similar). And in the callee the parameter is further from the stack pointer, which means that in some situations we can no longer use push / pop to get at stack parameters quickly. And then there are those saved bytes on the stack.
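The stack-space side of this trade-off is easy to quantify for a given parameter list. Here is a minimal sketch with a hypothetical helper; it only counts the bytes occupied by stack parameters and does not model code size at the call site, in the cleanup, or in the callee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes of stack occupied by the parameters that are passed on the stack.
   sizes[i] is the size in bytes of the i-th stack parameter.
   If widen8 is true, every 1-byte parameter occupies 2 bytes
   (upper byte unspecified); otherwise it occupies 1 byte. */
size_t stack_param_bytes(const unsigned char *sizes, size_t n, bool widen8)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(sizes[i] > 0);
        total += (widen8 && sizes[i] == 1) ? 2 : sizes[i];
    }
    return total;
}
```

For stack parameters of 1, 2 and 1 bytes, this gives 4 bytes when 8-bit values are passed as 8 bits and 6 bytes when they are widened to 16 bits.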
To get a better idea, I did some simple experiments, using __sdcccall(1) vs. a variant of __sdcccall(1) that passes all 8-bit parameters as 16 bit (unspecified value in upper byte). For this, I used the callingconvention branch, and disabled 5 tests (apparently my implementation for the 8->16 bit transformation caused a bug in argument promotion of vararg functions that affected these 5 tests).
With 8-bit stack params passed as 8 bit:
With 8-bit stack params passed as 16 bit:
The difference is not much, but apparently, when looking at the big picture, the current behavior (passing 8-bit values as 8 bits) is a little bit better.
While I only looked at z80, I don't think we need further experiments for the other targets here. IMO, the data looks sufficient to keep passing 8-bit stack parameters using just 8 bits in __sdcccall(1).
Can you test for case where third parameter is passed in c/bc in case where only three parameters are passed?
I had done some such experiments early, though I don't remember the impact on code size or speed.
The problem was that the code for calls through function pointers gets really complicated when there is no free register pair (for any architecture; I first noticed it on stm8, but AFAIR the z80 situation is similar). In current z80 codegen, we just run into an assertion if a function is called via a function pointer but all of bc, de and hl are in use for function arguments.
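The condition behind that assertion can be stated as a deliberately simplified model: an indirect call needs some register pair to hold the target address, and that fails once bc, de and hl all carry arguments. The names below are hypothetical and this is not how SDCC's code generator is written; it only illustrates the constraint.

```c
#include <assert.h>
#include <stdbool.h>

/* One bit per z80 register pair that can carry an argument
   (hypothetical encoding for this sketch). */
enum { PAIR_BC = 1, PAIR_DE = 2, PAIR_HL = 4, PAIR_ALL = 7 };

/* True if at least one of bc, de, hl is not holding an argument and
   could therefore hold the target address of a call through a
   function pointer. */
bool has_free_pair_for_call(unsigned pairs_used_by_args)
{
    /* Only the three bits defined above may be set. */
    assert((pairs_used_by_args & ~(unsigned)PAIR_ALL) == 0);
    return pairs_used_by_args != (unsigned)PAIR_ALL;
}
```

In this model, passing a third parameter in bc on top of arguments in de and hl leaves no pair free, which is the case that currently hits the assertion.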