
#732 Break the world: Calling Convention

None
open
None
5
2023-10-09
2021-02-03
No

I'm currently looking into calling conventions. The basic idea is this:

See what calling conventions are the most efficient (can depend on function type). Choose an efficient one.

The basis is the noasm2 branch, which replaces all asm functions in the standard library by C code, so no asm code needs to be rewritten when experimenting with different calling conventions.

Also, the code generators need to be fixed so they generate correct code for different calling conventions. Basically: Try a new calling convention in a local copy of sdcc based on the noasm2 branch, if it breaks, fix the bugs in trunk.

So far I want to consider the stm8, hc08-related and z80-related ports. Aspects of the calling convention looked into will include registers used for return value and the use of register parameters.

This is long-term work; I hope to have results sometime in between the SDCC 4.1.0 and 4.2.0 releases.

Discussion

<< < 1 2 3 > >> (Page 2 of 3)
  • Maarten Brock

    Maarten Brock - 2021-06-11

    SDCC requires y as frame pointer for functions where stack offsets greater than 255 are needed.

    Is there any particular reason why x would not be eligible as frame pointer?

     
    • Philipp Klaus Krause

      Technically, their roles could be swapped.

      However, x is just so useful to have available to the register allocator, that we don't want to use it for something else.
      Many instructions are 1 byte shorter when using x. On the other hand, the frame pointer typically is not used that much. After all we can reach 255 B of the stack using cheap sp-relative addressing, and only need y for the rest.

       
  • Philipp Klaus Krause

    I took a bit more time than expected, since I was busy with other stuff this week, and also ran into some bugs and inefficiencies in SDCC that I had to fix first.

    But now there are some more results for stm8. I have tried various registers for return values and arguments, and callee vs caller stack cleanup.

    Here is the best I saw in score and code size:

    Benchmark   Score     Size
    Whetstone   +0.2 %    -3.6 %
    Dhrystone   +2.6 %    -2.9 %
    Coremark    +1.4 %    -1.9 %
    stdcbench   +5.4 %    -3.8 %

    This gives us a good idea of what can and what can't be achieved by making changes to the calling convention. However, which convention was best varied between the benchmarks, and between score and size.

    However, there are some insights I got:

    • Using a / x for the first argument if it is 8 / 16 bits and then using the other one for the second argument was always among the best results.
    • With the exception of Dhrystone scores, using a for 8-bit return values was always among the best results.
    • Using y for arguments rarely gives an advantage over using x, and makes code generation really complicated for calls through function pointers and calls from functions with many local variables (i.e. those that need a frame pointer).
    • For return values, using a / x / xy for 8 / 16 / 32 bit is good. It is often the best. In some cases yl for 8 bit, y for 16 bit or yx for 32 bit is a bit better, but the difference is small.
    • The results on caller vs callee stack cleanup don't give me a good idea on how to proceed. Even when making this depend on the size of the returned value, I don't see a clear picture. Roughly, Coremark benefits (score and size) from having callee-cleanup for functions that return 8 or 16 bit, but not for 32 bit. stdcbench benefits (score and size) from having callee-cleanup for functions that return 16 bit, for other functions it doesn't matter. Dhrystone size suffers from callee-cleanup for functions that return 32 bit, for other functions it doesn't matter. Dhrystone score benefits from callee-cleanup for functions that return 16 bit, for other functions it doesn't matter. Whetstone size benefits from callee-cleanup for functions that return 32 bits, for other functions it doesn't matter. Whetstone score suffers from callee-cleanup for functions that return 32 bits, for other functions it doesn't matter.

    My idea is that for now, I won't do further experiments on parameters and return value. For the return value, I'll use the a/x/xy approach SDCC already uses, and for parameters the first in a/x, second in a/x approach that turned out to work well (and is essentially the same as the one Raisonance uses, and in many cases also the same as IAR uses).
    But I need to do some further experiments, where caller vs callee stack cleanup depends on the type of the return value and on the parameters. I also need to look into caller vs callee-cleanup for functions that return void.

    P.S.: For the experiments so far, I used attached makepatches.c, which writes a large number of patches, each of which can be applied to the callingconvention branch of sdcc to try a calling convention, e.g.

    sdcc_args_a_x_ret_xh_y_yx_callee_1.patch
    

    This will use a/x for the first 8/16-bit parameter, with the second parameter always on the stack; 8-bit return values go in xh, 16-bit return values in y, 32-bit return values in yx (just _ would indicate all parameters on the stack; 2a_2x would try to also pass the second parameter in a/x). The digit is a bit mask: 1 indicates that functions that return 8-bit values use callee-cleanup, but other functions use caller-cleanup (2 for 16-bit, 4 for 32-bit; e.g. 6 would mean callee-cleanup for functions returning 16- and 32-bit values but not for other functions).
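The bit mask at the end of the patch name can be decoded as in the following minimal sketch. The enum and function names are made up for illustration and are not taken from makepatches.c; only the bit values 1, 2 and 4 come from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* Bits of the trailing digit in a patch name, as described in the post.
   The identifiers are hypothetical. */
enum cleanup_bit {
    CALLEE_RET8  = 1, /* functions returning 8-bit values use callee-cleanup  */
    CALLEE_RET16 = 2, /* functions returning 16-bit values use callee-cleanup */
    CALLEE_RET32 = 4  /* functions returning 32-bit values use callee-cleanup */
};

/* Returns true if a function returning a value of ret_bits bits
   (8, 16 or 32) uses callee-cleanup under the given mask. */
static bool uses_callee_cleanup(unsigned mask, unsigned ret_bits)
{
    assert(ret_bits == 8 || ret_bits == 16 || ret_bits == 32);
    switch (ret_bits) {
    case 8:  return (mask & CALLEE_RET8) != 0;
    case 16: return (mask & CALLEE_RET16) != 0;
    default: return (mask & CALLEE_RET32) != 0;
    }
}
```

With mask 6, uses_callee_cleanup(6, 16) and uses_callee_cleanup(6, 32) are true, while uses_callee_cleanup(6, 8) is false.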

    P.P.S.: After the caller/callee experiments I intend to do two more rounds of experiments:

    • Find out if it makes sense to use a different convention for some common function types.
    • Find out if optimization for code size or speed or stronger optimization affects which calling convention works best.
     

    Last edit: Philipp Klaus Krause 2021-06-20
    • Philipp Klaus Krause

      While I will do some more experiments, callee vs caller stack cleanup is likely to not be that easy to decide:

      • Functions are typically called from multiple places, so callee-cleanup tends to result in smaller code size (only one place needs code for cleanup instead of all call sites).
      • Stack cleanup is harder to do in the callee than in the caller, so callee-cleanup tends to result in slower code. The extra effort depends on the return value:
      • If x is free (i.e. not used for the return value), overhead for callee cleanup is +1 byte and -1 cycle vs caller cleanup.
      • If x is not free, but y is, callee cleanup is +3 bytes and -1 cycle vs caller cleanup.
      • If neither x nor y is free, but a is free (typical when returning a 32-bit value), callee cleanup is +6 bytes and +4 cycles vs caller cleanup.
      • The above numbers are for the medium memory model. For the large model, the following hold instead (though there is a bit of potential for improvement, I guess I could bring the numbers down to about +12 bytes and +10 cycles for both a free and not):
      • If a is free, callee cleanup is +12 bytes and +12 cycles vs caller cleanup.
      • If a is not free, callee cleanup is +14 bytes and +14 cycles vs caller cleanup.

      So far I have only done experiments for the medium memory model. Consistent with the above reasoning, the results from Whetstone experiments clearly indicate that callee-cleanup can substantially improve code size, but comes at a noticeable cost in speed.

      Caller cleanup cost is typically 2 bytes, 2 cycles. So ignoring potential tail call optimization, for free x, we benefit from callee-cleanup for every function called at least twice, for free y we benefit for every function called at least thrice. With neither x nor y free, code size benefits for each function called at least thrice, but code speed suffers.
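The break-even reasoning can be written down as a small calculation. This is a simplified sketch under the assumption that callee-cleanup costs the caller-cleanup size plus the stated overhead once per function, while caller-cleanup costs its size at every call site; the function name is hypothetical.

```c
#include <assert.h>

/* Smallest number of call sites at which callee-cleanup gives smaller
   code than caller-cleanup in this simplified model:
   callee-cleanup total = caller_bytes + extra_bytes (paid once),
   caller-cleanup total = caller_bytes * number of call sites. */
static unsigned break_even_call_sites(unsigned caller_bytes, unsigned extra_bytes)
{
    unsigned callee_total;
    unsigned n = 1;

    assert(caller_bytes > 0); /* otherwise the loop would not terminate */
    callee_total = caller_bytes + extra_bytes;
    while (n * caller_bytes <= callee_total)
        n++;
    return n;
}
```

With the 2-byte caller cleanup from above, break_even_call_sites(2, 1) gives 2 for free x and break_even_call_sites(2, 3) gives 3 for free y.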

      However, callee-cleanup hinders tail call optimization; the effect is stronger for the large memory model, but happens for both medium and large memory model.

      P.S.:
      A small part of the above is, however, more an artifact of current codegen: With caller-cleanup, we use ret to return; for callee-cleanup with free x we use popw x; addw sp, #d; jp (x). For callee-cleanup we are forced to use popw x and jp (x), which are 1 byte more and 1 cycle faster than ret. For caller-cleanup we have a choice. That choice is currently always ret. If we didn't use ret, code size and speed would be the same for callee vs caller cleanup with free x or y (again ignoring any effects on tail call optimization).

      P.P.S.:
      I still want to do more experiments, but at the moment I think it would be a good choice to use callee-cleanup for the medium memory model and keep caller-cleanup for the large memory model.

      P.P.P.S.:
      Another aspect: callee-cleanup prevents optimizations of early returns. A jp to a label that is just a ret can be optimized into a ret by the peephole optimizer (and code generation could also be made to just emit the ret in some cases). A jp to a popw x; addw sp, #d; jp (x) can't be optimized that way by the peephole optimizer (as it would increase code size). Code generation could do it in some cases for speed, but it would also have a cost in code size. For eligible functions we'd then be in a situation where caller-cleanup needs cleanup code at every call site, while callee-cleanup needs cleanup code at every return statement.
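As an illustration of the kind of function this affects, here is a hypothetical C function with several parameters and three return statements. Assuming some of its parameters are passed on the stack, each return can be a plain ret under caller-cleanup, while under callee-cleanup each return has to either jump to a shared epilogue or carry its own copy of the cleanup sequence.

```c
#include <assert.h>

/* Hypothetical example: adds a and b and clamps the sum to [lo, hi]. */
static int clamp_add(int a, int b, int lo, int hi)
{
    int sum;

    assert(lo <= hi);
    sum = a + b;
    if (sum < lo)
        return lo;  /* early return */
    if (sum > hi)
        return hi;  /* early return */
    return sum;     /* final return */
}
```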

       

      Last edit: Philipp Klaus Krause 2021-06-21
      • Philipp Klaus Krause

        Last night and today, I compiled 1024 different approaches to caller vs callee-cleanup. Basically every combination of caller / callee-cleanup depending on:

        • return type: void vs. 8-bit vs 16-bit vs. 32-bit non-float vs 32-bit float
        • Type of first argument float vs. non-float.
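One way to arrive at 1024 is to treat each of the 5 x 2 = 10 combinations of return type and first-argument type as a function class and give each class one bit (caller vs callee-cleanup). The post does not spell out the encoding, so the following sketch and all its names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

enum ret_class {
    RET_VOID, RET_8, RET_16, RET_32_NONFLOAT, RET_32_FLOAT,
    RET_CLASSES /* number of return classes */
};

/* 5 return classes x 2 kinds of first argument (float / non-float). */
#define FUNC_CLASSES (RET_CLASSES * 2)

/* Index of a function class in 0 .. FUNC_CLASSES - 1. */
static unsigned class_index(enum ret_class ret, bool first_arg_float)
{
    assert(ret < RET_CLASSES);
    return (unsigned)ret * 2u + (first_arg_float ? 1u : 0u);
}

/* A variant is a bit set over the classes; a set bit means callee-cleanup. */
static bool variant_uses_callee(unsigned variant, enum ret_class ret,
                                bool first_arg_float)
{
    return ((variant >> class_index(ret, first_arg_float)) & 1u) != 0;
}

/* Total number of variants: one independent choice per class. */
static unsigned variant_count(void)
{
    return 1u << FUNC_CLASSES;
}
```

Here variant_count() is 2^10 = 1024, which matches the number of compiled approaches.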

        Benchmarks are again Whetstone, Dhrystone, Coremark, stdcbench. I looked into code size and the medium memory model only this time.

        Results:

        • With one exception, all benefit from callee-cleanup for functions returning void or an 8-bit value. The exception is functions returning void in stdcbench, but there the difference is small.
        • For functions returning a 16-bit value, Coremark and stdcbench benefit from callee-cleanup, while Whetstone and Dhrystone suffer.
        • All benchmarks suffer from callee-cleanup for functions that return a non-float 32-bit value.
        • Whetstone benefits from callee-cleanup for functions returning float that have a first parameter of type float. A lot¹. It suffers from callee-cleanup for functions that return float, but have a non-float first parameter.
        • Dhrystone suffers from callee-cleanup for functions that return float, no matter the parameters.

        ¹ This single aspect makes a difference of about 2% in code size for Whetstone.

        Now I think a reasonable way to proceed for the future calling convention would be:

        • Pass 8-bit result in a, 16-bit in x, 24-bit in yh:x, 32 bit in xy.
        • Pass first argument in a/x if 8/16-bit and there are no varargs. Pass second in a/x if 8/16-bit and the first is in a register.
        • Use caller-cleanup for large memory model. For medium memory model, use callee-cleanup for functions not having variable arguments as follows: for functions returning void or 8 bits or 16 bits and for functions that return float and also have float for first parameter.

        This seems simple and efficient enough to me. Except for Dhrystone, this convention is always within a few bytes of codesize of the best one for the respective benchmark. For Dhrystone, it has code size 0.3 % higher than the best one.
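The proposed stack cleanup rule for stm8 can be sketched as a predicate. The types and names below are hypothetical and do not correspond to SDCC's internal data structures; the logic follows the third bullet of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum memory_model { MODEL_MEDIUM, MODEL_LARGE };
enum ret_kind { RET_VOID, RET_8, RET_16, RET_32_NONFLOAT, RET_FLOAT };

/* Hypothetical summary of the properties of a function that matter here. */
struct func_info {
    enum ret_kind ret;
    bool has_varargs;
    bool first_param_is_float;
};

/* Returns true if the callee cleans up the stack under the proposal. */
static bool stm8_callee_cleanup(enum memory_model model,
                                const struct func_info *f)
{
    assert(f != NULL);
    if (model == MODEL_LARGE)
        return false;               /* large model: always caller-cleanup */
    if (f->has_varargs)
        return false;               /* variable arguments: caller-cleanup */
    switch (f->ret) {
    case RET_VOID:
    case RET_8:
    case RET_16:
        return true;
    case RET_FLOAT:
        return f->first_param_is_float;
    default:
        return false;               /* everything else: caller-cleanup */
    }
}
```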

        My next steps: Implement __sdccoldcall and __sdccnewcall calling conventions in trunk. The former will do nothing for now, the latter will use the proposed new convention. The users can try it out, so I'll tell them on sdcc-user.
        I would next look into experimenting on calling conventions for z80 and related. Maybe while doing those experiments I get some ideas that I want to try on stm8.

         

        Last edit: Philipp Klaus Krause 2021-06-25
  • Philipp Klaus Krause

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -2,10 +2,10 @@
    
     See what calling conventions are the most efficient (can depend on function type). Choose an efficient one.
    
    -The basis is the noasm2 branch, which replaces all asm functions int he standard library by C code, so no asm code needs to be rewritten when experimenting with different calling conventions.
    +The basis is the noasm2 branch, which replaces all asm functions in the standard library by C code, so no asm code needs to be rewritten when experimenting with different calling conventions.
    
     Also, the code generators need to be fixed so they generate correct code for different calling conventions. Basically: Try a new calling convention in a local copy of sdcc based on the noasm2 branch, if it breaks, fix the bugs in trunk.
    
    -So far I want to consider the stm8, hc08-relatzed and z80-related ports. Aspects of the calling convention looked into will include registers used for return value and the use of register parameters.
    +So far I want to consider the stm8, hc08-related and z80-related ports. Aspects of the calling convention looked into will include registers used for return value and the use of register parameters.
    
     This is long-term work; I hope to have results sometime in between the SDCC 4.1.0 and 4.2.0 releases.
    
     
    • Maarten Brock

      Maarten Brock - 2021-06-20

      Philipp,
      Can you please fix the two "calle-cleaup" entries above? You left out the one and only discriminating character.

       
      • Philipp Klaus Krause

        Done. Thanks.

         
  • Philipp Klaus Krause

    Regarding the z80 experiments, I am still in an early phase (encountering and fixing bugs exposed by different calling conventions, not having tried caller- vs callee- restore much).
    For now, I tried 900 different calling conventions. So far it looks like this:

    • Our current convention for the return value is good, but we could do a bit better by passing 8-bit return values in a instead of l, and passing 32 bit return values in hlde (as we already do for gbz80) instead of dehl.
    • For z80 it doesn't matter much which register pair is used for 16-bit return values. For r3ka, hl and de are better than bc.
    • For z80, I so far see the best results by passing only the first parameter in registers; for r3ka by passing the first two parameters in registers. I think this is due to r3ka having ld d(sp), hl, which allows cheaply (2 bytes) moving a register parameter to the stack in the callee, while z80 needs two ld d(ix), r (total of 6 bytes). Also, large stack pointer adjustments are expensive for z80, but even more so when hl is in use (e.g. due to holding the return value or a register parameter), while r3ka has add sp, #d which is cheap and doesn't need a free hl.
    • For 8-bit register parameters, a is the best choice.
    • For 32-bit register parameters, there is not much difference between hlde, dehl and hlbc (haven't tried others yet).
    • For 16-bit register parameters, the difference between bc, de and hl matters, but which one is best depends on benchmark and target.
     
    • Sergey Belyashov

      We need to investigate why passing more than 1 parameter in registers is not better than the old behavior. It is possible that the compiler generates suboptimal code for callees, because in the worst case code size and speed should not change (it makes no difference where the arguments are put on the stack):

      _func: ; parameters are in DE, HL, BC; return value in DE
      ;only inline FP adjusting is possible
          push ix
          ld ix,#0
          add ix,sp
          ld sp,ix
      ;these 3 PUSHes are same as in caller before
          push bc
          push hl
          push de
      ;standard locals allocation
          ld hl, #<n>
          add hl, sp
          ld sp, hl
      ;function logic...
          ....
      ;epilogue automatically deallocates stored arguments too
          ld sp, ix
          pop ix
          ret
      

      There are no extra ld (ix),r operations.
      Pros:
      1. Space economy on PUSH instructions on each call
      2. Speed up, if arguments may be used without storing
      3. Space economy and speed up on arguments deallocation
      Cons:
      1. Only inline FP adjusting (lost space, but speed up) if HL in use.

       
      • Philipp Klaus Krause

        That would require some mechanism that puts spilled register parameters in stack locations suitable for the use of push. We currently don't have that, and it would take effort to implement.
        Currently all spilt variables are handled the same. The stack allocator decides where to put them. The stack allocator does not consider possible use of push/pop; its optimization goal is to use as little stack space as possible.

         
    • Philipp Klaus Krause

      After further experiments, the picture is a bit more complex, also wrt. multiple register arguments. I still didn't do experiments for callee-cleanup of stack, but by now tried about 1500 different conventions for return values and register arguments for the 4 benchmarks.

      • r3ka clearly benefits from having both the first and second argument in registers, both for 8-bit and 16-bit register arguments. Whetstone and Dhrystone, but not the other two benchmarks also benefit from having the 1st 32-bit argument in registers. This needs further investigation. Whetstone has lots of functions with a single 32-bit argument, but in Coremark, the 32-bit argument typically is one of many, so maybe that's what matters here.
      • Regarding return values, the previous results hold.
      • For z80, I do see benefits from having the first two 8 or 16 bit arguments in registers in Dhrystone. For stdcbench and Whetstone this doesn't make much difference vs. having the first only. For stdcbench, it is better to only have the first such argument in registers. Again, Whetstone and Dhrystone benefit from having the first 32-bit argument in registers, while stdcbench and Coremark suffer.
      • It is useful to have a free (i.e. not used for register arguments) 16-bit register pair at the call, especially for calls through function pointers. For r3ka, this free pair should be bc (as a, de and hl make good register arguments). For z80, the choice of the free pair doesn't make a big difference: In Dhrystone, bc works best, in Whetstone de, in Coremark hl, for stdcbench, having both de and hl free seems best.

      The next step will be to eliminate all register argument / return value combinations that are far from the best, and try different caller vs callee stack cleanup approaches.

       
      • Philipp Klaus Krause

        I've eliminated some conventions that were optimal for none of the benchmarks. Now there's 20 left (before considering caller vs callee stack cleanup). Funnily, the convention that is best for stdcbench for z80, is worse than the current one for Whetstone for z80. And the one that is best for Coremark for r3ka is worse than the current one for Coremark for r3ka.
        Still, I think I'll find a good compromise that gives us some improvement for all benchmarks; and I'll look into the caller / callee stuff next (though a few experiments I've already done indicate that for z80 and r3ka the choice of registers for return values and arguments makes a much bigger difference than the caller vs callee stuff).

         
        • Philipp Klaus Krause

          To me it now looks as if the following would be good choices. So far I have only checked correctness and looked into code size. Performance experiments might follow next week. I also have not yet really done any experiments relevant to gbz80, so I don't have any opinion on that one yet.

          For all:

          • Pass first parameter in a if 8 bit, in hl if 16 bit, in hlde if 32 bit.
          • If the first parameter is in register a, and the second has 8 bits, pass in register l.
          • Stack parameters are cleaned up by the caller, with two exceptions, where they are cleaned up by the callee: functions that return void, and functions that return float where the first parameter is float.

          For z80 (and z180, z80n):

          • If the first parameter is in a or hl, and the second has 16 bits, pass in register de.
          • Pass return value in a, de or hlde.

          For r3ka (and r2k, r2ka, tlcs90, ez80_z80):

          • If the first parameter is in register a, and the second has 16 bits, pass in register hl.
          • If the first parameter is in hl or hlde, and the second has 8 bits, pass in register a.
          • Pass return value in a, hl or hlde.
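The register assignment for the first two parameters can be summarized in a short sketch. The enums and function names are hypothetical; the rules are the ones listed above, and any case the lists do not mention is treated here as "on the stack", which is an assumption.

```c
#include <assert.h>

enum port_family { FAMILY_Z80, FAMILY_R3KA };
enum param_loc { LOC_STACK, LOC_A, LOC_L, LOC_HL, LOC_DE, LOC_HLDE };

/* Location of the first parameter, by size in bits (all ports). */
static enum param_loc first_param_loc(unsigned bits)
{
    switch (bits) {
    case 8:  return LOC_A;
    case 16: return LOC_HL;
    case 32: return LOC_HLDE;
    default: return LOC_STACK;
    }
}

/* Location of the second parameter, given that the first one is in a
   register, i.e. first_bits is 8 (a), 16 (hl) or 32 (hlde). */
static enum param_loc second_param_loc(enum port_family fam,
                                       unsigned first_bits,
                                       unsigned second_bits)
{
    assert(first_bits == 8 || first_bits == 16 || first_bits == 32);
    if (first_bits == 8 && second_bits == 8)
        return LOC_L;                       /* all ports */
    if (fam == FAMILY_Z80) {
        if (first_bits != 32 && second_bits == 16)
            return LOC_DE;                  /* first in a or hl */
    } else {
        if (first_bits == 8 && second_bits == 16)
            return LOC_HL;                  /* first in a */
        if (first_bits != 8 && second_bits == 8)
            return LOC_A;                   /* first in hl or hlde */
    }
    return LOC_STACK;
}
```

For a function taking two 16-bit parameters, this puts the second one in de for z80 and on the stack for r3ka.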

          For those doing some experiments with these conventions, the attached patches can be applied to the callingconvention branch:

          • sdcc_args_al_hlde_hlde_ret_a_de_hlde_callee_41.patch for the approach for z80
          • sdcc_args_2al_2hl_hlde_ret_a_hl_hlde_callee_41.patch for the approach for r3ka
           

          Last edit: Philipp Klaus Krause 2021-07-28
          • Philipp Klaus Krause

            For gbz80, from the first results it is very clear that hl shouldn't be used in the calling convention. I guess with hl being the only pointer register, and stack access as well as many accesses to global variables going through pointers in hl, this makes sense.

            For now, it looks like a and e will be used for 8-bit parameters, de and bc for 16-bit parameters, bcde or debc for 32-bit parameters. a for 8-bit return values, de for 16-bit return values, bcde or debc for 32-bit return values.

            Some gbz80 experiments on caller- vs callee-restore of stack will be next.

             
            • Philipp Klaus Krause

              After another round of experiments, this one looks best for gbz80:

              • First parameter is passed in a if 8-bit, de if 16 bits, debc if 32 bit.
              • If the first parameter is in a, the second is in e if 8 bit, in bc if 16 bit, debc if 32 bit.
              • If the first parameter is in bc, the second is passed in e if 8 bit, de if 16 bit.
              • Return values are in a if 8 bit, bc if 16 bit, debc if 32 bit.
              • Stack parameters are cleaned by the caller if there is a return value of at least 16 bits.
               
          • Sergey Belyashov

            Are 1-byte arguments aligned to 2 bytes?

             
            • Philipp Klaus Krause

              No. Only the pdk ports and __smallc do that.

               
              • Sergey Belyashov

                Wouldn't it be better to do it now? Byte alignment takes extra code and saves very little stack space.

                 
                • Sergey Belyashov

                  This is how a call is prepared:

                          ld      b, 4 (ix)
                          lea     hl, ix, #-34
                          ld      c, l
                          ld      a, h
                          push    hl
                          push    bc
                          inc     sp
                          ld      b, a
                          push    bc
                          ex      de, hl
                          call    ___itoa
                  
                   
                • Philipp Klaus Krause

                  While the question is still open in general, I think that at least for stm8 and gbz80 we don't want to pass 8-bit arguments as 16 bit:

                  • stm8 has efficient push a, passing as 16 bit would mean ld xl, a followed by pushw x, which takes one more register and is slower.
                  • On gbz80, we typically have much more code memory than data memory. The platform supports up to 8 MiB of code memory vs. 8 KiB of data memory. So I think we really want to save that one byte on the stack even if it costs a bit of extra code size.

                  For z80, the situation is more complicated. With my background (the ColecoVision has only 1 KiB of data memory vs. 32 KiB of code memory) I tend to assume that saving those bytes on the stack is worth it. But there are also platforms where both data and code reside in the same RAM.

                   
                  • Sergey Belyashov

                    One extra byte does not make a real difference to memory use, because most parameters will be passed in registers. Functions with many parameters (more than 3) are never a good idea anyway. Byte arguments are not used very often either. So I expect stack memory use to rise by no more than 2-5%.

                     
                • Philipp Klaus Krause

                  As with other aspects of the calling convention, we can construct examples where one or the other is better. Clearly, not having that extra inc sp in the caller helps save a bit at each call site (unless the peephole optimizer can use a single 16-bit push to pass two parameters). On the other hand, the stack cleanup can get longer (especially when it is done by a sequence of pop af or similar). And in the callee the parameter is further from the stack pointer, which means that in some situations we can no longer use push / pop to get at stack parameters quickly. And then there are those saved bytes on the stack.

                  To get a better idea, I did some simple experiments, using __sdcccall(1) vs. a variant of __sdcccall(1) that passes all 8-bit parameters as 16 bit (unspecified value in upper byte). For this, I used the callingconvention branch, and disabled 5 tests (apparently my implementation for the 8->16 bit transformation caused a bug in argument promotion of vararg functions that affected these 5 tests).

                  With 8-bit stack params passed as 8 bit:

                  Summary for 'ucz80': 0 failures, 23289 tests, 2821 test cases, 4625012 bytes, 911170398 ticks
                  

                  With 8-bit stack params passed as 16 bit:

                  Summary for 'ucz80': 0 failures, 23289 tests, 2821 test cases, 4626768 bytes, 911205058 ticks
                  

                  The difference is not much, but apparently, when looking at the big picture, the current behavior (passing 8-bit values as 8 bits) is a little bit better.

                  While I only looked at z80, I don't think we need further experiments for the other targets here. IMO, the data looks sufficient to keep passing 8-bit stack parameters using just 8 bits in __sdcccall(1).
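To put the two summary lines into proportion, the relative change can be computed directly from the reported byte and tick counts. The helper name is made up for this sketch.

```c
#include <assert.h>

/* Relative change from base to other, in percent. */
static double percent_change(unsigned long base, unsigned long other)
{
    assert(base != 0);
    return ((double)other - (double)base) * 100.0 / (double)base;
}
```

percent_change(4625012UL, 4626768UL) gives the code size increase (1756 bytes, about 0.04 %), and percent_change(911170398UL, 911205058UL) gives the increase in ticks (34660 ticks, about 0.004 %).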

                   
          • Sergey Belyashov

            Can you test the case where the third parameter is passed in c/bc when only three parameters are passed?

             
            • Philipp Klaus Krause

              I had done some such experiments early, though I don't remember the impact on code size or speed.

              The problem was that the code for calls through function pointers gets really complicated when there is no free register pair (for any architecture; I first noticed it on stm8, but AFAIR the z80 situation is similar). In current z80 codegen, we just run into an assertion if a function is called via a function pointer but all of bc, de and hl are in use for function arguments.

               