#732 Break the world: Calling Convention

None
open
None
5
2023-10-09
2021-02-03
No

I'm currently looking into calling conventions. The basic idea is this:

Find out which calling conventions are the most efficient (this can depend on the function type), then choose an efficient one.

The basis is the noasm2 branch, which replaces all asm functions in the standard library by C code, so no asm code needs to be rewritten when experimenting with different calling conventions.

Also, the code generators need to be fixed so they generate correct code for different calling conventions. Basically: try a new calling convention in a local copy of sdcc based on the noasm2 branch; if it breaks, fix the bugs in trunk.

So far I want to consider the stm8, hc08-related and z80-related ports. Aspects of the calling convention looked into will include registers used for return value and the use of register parameters.

This is long-term work; I hope to have results sometime in between the SDCC 4.1.0 and 4.2.0 releases.

Discussion

  • Philipp Klaus Krause

    For now, I start by experimenting with different return registers for z80, essentially by trying the use of a different RETURN_ASMOP.

     
  • Philipp Klaus Krause

    First achieved subgoal: In the noasm2 branch, now only two places (one in z80/peep.c, one in z80/gen.c) need to be changed to change the registers for return values.
    When changing that to hlde instead of the currently used dehl, nearly all regression tests still pass for z80. The only failures are in the tests for atomic_flag, setjmp and the test for __z88dk_fastcall. The former two need asm code, and thus cannot be fixed easily. The __z88dk_fastcall issue might be peephole-related.

    The next steps will be checking if the reverse can be done for gbz80, and then trying other register combinations for return values.

     

    Last edit: Philipp Klaus Krause 2021-02-03
    • Philipp Klaus Krause

      As of [r12055], for z80, even weirder return value registers, such as bcea, work. Exceptions: tests for banked functions fail (they use register a to pass extra info), and tests for critical functions fail (they use pop af at the end of the function). Also, the --profile option had to be removed from regression tests, since it uses register a to pass information.

       
  • Sergey Belyashov

    Good job.
    Benchmarks are required.

    Also, please check conditions:
    ASMOP_RETURN->regs[C_IDX] < 0 || ASMOP_RETURN->regs[C_IDX] > retsize at z80/gen.c:5349
    and
    ASMOP_RETURN->regs[A_IDX] > 0 && ASMOP_RETURN->regs[A_IDX] < retsize at z80/gen.c:5400
    This looks like a broken inversion.

     
    • Philipp Klaus Krause

      Thanks. Good catch. Fixed in [r12118] (along with other improvements). Saves about 10 Bytes in the gbz80 regression tests.

       
      • Sergey Belyashov

        Note that line 5400 has a strict "above zero" condition. Is that correct?

        Line 5060 has the same condition that line 5349 had. Also, I suggest moving ASMOP_RETURN->regs[regIdx] >= 0 && ASMOP_RETURN->regs[regIdx] < retsize into a separate function to improve readability.
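A minimal sketch of the helper suggested above, not the code that went into SDCC. It assumes that regs[i] holds the byte position register i occupies in the return value, or a negative value if the register is unused. The struct asmop_sketch type, the enum ordering and the name reg_holds_return_byte are hypothetical stand-ins, not SDCC's own definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for SDCC's register indices; the order is arbitrary here. */
enum { A_IDX, C_IDX, B_IDX, E_IDX, D_IDX, L_IDX, H_IDX, NUM_REGS };

/* Hypothetical stand-in for an asmop: regs[i] is the byte position that
   register i holds in the operand, or -1 if the register is not used. */
struct asmop_sketch {
    signed char regs[NUM_REGS];
};

/* True if register reg_idx carries one of the first retsize bytes of the
   return value. Byte position 0 is a valid position, so the lower bound
   has to be >= 0 rather than a strict > 0. */
static bool reg_holds_return_byte(const struct asmop_sketch *ret,
                                  int reg_idx, int retsize)
{
    return ret->regs[reg_idx] >= 0 && ret->regs[reg_idx] < retsize;
}
```

With a dehl-style layout (l = byte 0, h = byte 1, e = byte 2, d = byte 3), register e is reported as used for a 4-byte return value but not for a 2-byte one.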

         
        • Philipp Klaus Krause

          Doesn't look correct, and I changed it now. But I suspect that there are more problems in that code. I don't think critical functions that return values are tested or used much, so there might be more bugs lurking.

           
  • Philipp Klaus Krause

    Scope:
    Goal: Find a good default calling convention.
    Ports: Ports that currently pass all arguments on the stack by default: stm8, z80 and related.
    Aspects of calling convention to consider:
    - Passing of arguments (registers vs. stack, which registers)
    - Who cleans up stack arguments (caller vs. callee)
    - How results are returned (which registers to use for return value)

    So we need infrastructure that allows us to easily change these to run experiments. For the experiments, the callingconvention branch is used (which is a branch of the noasm2 branch).

    Recent progress:
    - Implemented support for cleanup of stack arguments by callee (via __z88dk_callee)
    - Implemented support for easily changing registers for return value (everything should just rely on aopRet now, so there is only one place that needs changing to try different registers).
    - Regression tests pass for a few first experiments (8-Bit return value in a for z80, in xl for stm8).

     
    • Sergey Belyashov

      It is good work!
      What are the benchmark results for A as return value on Z80?

       
      • Philipp Klaus Krause

        No benchmark results yet. For today, I'm just trying different registers for return values, and fixing the bugs I encounter.

         
    • Philipp Klaus Krause

      I've tested the following registers for return values in all z80-related ports:
      8-Bit: l, a, c, e, b, h.
      16-Bit: hl, de, bc.
      32-Bit: hlde, dehl, hlbc.
      Some bugs were found, and fixed, regression tests now pass. The next step will be testing the callee cleanup of stack parameters. Register parameters will take more time.

       
    • Philipp Klaus Krause

      Now regression tests also pass with __z88dk_callee as the default for non-vararg functions. I've also improved code generation for stack adjustment a bit, in particular for __z88dk_callee functions.
      Also, there is progress on register arguments. I've refactored the code to remove hardcoded __z88dk_fastcall-isms.

      As a very first test, by adding the line

      return (i == 1 ? ASMOP_HL : ASMOP_DE);
      

      at the end of aopArg (which obviously breaks code generation for most functions), the following

      int f (int i, int j)
      {
          return i + j;
      }
      
      int g(void)
      {
          return f(23, 42);
      }
      

      compiles into

      _f::
          add hl, de
          ret
      
      _g::
          ld  de, #0x002a
          ld  hl, #0x0017
          jp  _f
      

      I'm sure there is still a lot of testing and bugfixing to do.
      And I don't really know how to do the final benchmarking yet either: I have boards for stm8 and r3ka to run some automated tests. But I expect the results to be quite different for z80 vs. r3ka. I guess I'll have to see if I can get something to work using a Z80-MBC2. But even then I don't know how to test for gbz80.
      Maybe I'll have to test on uCsim instead of hardware?

       
      • Sergey Belyashov

        Looks great! It is wonderful!

        TODO: such small functions can be silently inlined :-)

         
        • Philipp Klaus Krause

          I don't want to open the inlining can of worms now. It is surely something we will want in the future, but to support it well, we'll need substantial work both in the front-end and the linker.

           
      • Philipp Klaus Krause

        As of now, the following work for me in the callingconvention branch:
        z80 first param in a, d or e if 8 bit.
        z80 first param in bc or de if 16 bit.
        For the z80-related ports other than z80 itself, there are still a few regression test failures. There are lots of failures for all z80-related ports when using hl.
        I'll look into making hl work for z80 and into register parameters for stm8 next.

         
        • Philipp Klaus Krause

          As of now, the hl situation has improved. Even the following now mostly works (I still see some regtest failures for gbz80, but not for the other z80-related ports) in the branch:

          First parameter in l or hl, second parameter in e or de.
          I.e. adding the following in aopArg.

            if (i == 1 && getSize (args->type) == 1)
              return ASMOP_L;
          
            if (i == 1 && getSize (args->type) == 2)
              return ASMOP_HL;
          
            if (i == 2 && aopArg (ftype, 1) && getSize (args->next->type) == 1)
              return ASMOP_E;
          
            if (i == 2 && aopArg (ftype, 1) && getSize (args->next->type) == 2)
              return ASMOP_DE;
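The selection rule in the snippet above can be modelled outside the compiler. This is a self-contained sketch, assuming argument sizes are given in bytes; enum arg_loc and arg_location are hypothetical names for illustration and are not part of SDCC. As in the snippet, the second argument only goes into a register when the first one did.

```c
#include <stddef.h>

/* Hypothetical location names; SDCC itself returns asmop objects such as ASMOP_HL. */
enum arg_loc { LOC_STACK, LOC_L, LOC_HL, LOC_E, LOC_DE };

/* Location of the i-th argument (1-based), given the sizes in bytes of all
   nargs arguments. Anything not covered by a rule goes on the stack. */
static enum arg_loc arg_location(const unsigned *sizes, size_t nargs, size_t i)
{
    if (i == 0 || i > nargs)
        return LOC_STACK;

    if (i == 1) {
        if (sizes[0] == 1)
            return LOC_L;
        if (sizes[0] == 2)
            return LOC_HL;
        return LOC_STACK;
    }

    /* Second argument: only considered if the first is a register argument. */
    if (i == 2 && arg_location(sizes, nargs, 1) != LOC_STACK) {
        if (sizes[1] == 1)
            return LOC_E;
        if (sizes[1] == 2)
            return LOC_DE;
    }

    return LOC_STACK;
}
```

For int f(int, int) with 2-byte int this gives hl and de, matching the earlier f/g example; for a function whose first argument is 4 bytes wide, both arguments stay on the stack.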
          
           
        • Philipp Klaus Krause

          Register arguments for z80 and related seem to mostly just work now in various configurations (at least for up to 2 register arguments per function). There are still a few corner cases where a regression test fails, and it'll take a few more days to track down and fix some more of them.
          For stm8, I now have a working case for the first time (regression tests pass for the medium memory model, with still a few failures for the large model): First argument in x, if it is 16 bits.
          Besides experimenting more with register arguments for stm8, I also want to look into the interaction of register arguments, __z88dk_callee and tail call optimization. Compared to current code generation, I think there is some potential there to be able to optimize a few more tail calls.

          I guess that in a few days, everything will work well enough to start thinking about benchmarking.

           
  • Philipp Klaus Krause

    By now, SDCC in the calling convention branch basically generates correct and reasonable code for any calling convention I throw at it (registers for return value, register and stack parameters, caller vs callee stack cleanup).
    Since the underlying infrastructure got backported to trunk, it is very easy to add support for additional calling conventions. So I added support for the STM8 calling conventions of Raisonance, IAR and Cosmic in trunk.
    If there is user demand, I could similarly add support for the Z80 calling conventions of HITECH-C, Aztec C and BDS C (I can't find documentation on the IAR Z80/Z180 calling convention).

     
  • Philipp Klaus Krause

    The attached file summarizes¹ function types and calls to them in a small sample of code consisting of the standard library, and the Whetstone, Dhrystone, stdcbench and Coremark benchmarks. The data has been obtained at register allocation time, so inlined function calls are not in it, but calls to support library functions are. However, it also means we cannot distinguish between char and unsigned char.

    While surely not representative of general use of SDCC, there are still a few interesting aspects in the data:

    • There are many functions that are called just once or twice. Such functions tend to not benefit from having the callee clean up the stack.
    • There are lots (233) of calls to float function (float, float) (actually the most common call) and _Bool function (float, float) (4th common at 48 calls). These are mostly support routine calls: SDCC targets don't have hardware floating-point support, so every addition, multiplication, comparison of float results in a support function call.
    • Second (82) is to int function ( unsigned char *, ...), such as printf. Here we need to pass all parameters on the stack, and have the caller clean up the stack so our only choice is the register for the return value. And that one doesn't matter much, since users rarely use the return value of printf.
    • Third (49) is calls to int function ( int, int). Here we have a lot of freedom, so experimenting with different calling conventions could give interesting insight.
    • Fifth (46) common are calls to int function ( unsigned char *, unsigned char *, unsigned int). This is a common function signature for string functions from the standard library.

    While we have many calls to functions of the 5 types listed above, when we look at the types of functions, we see a different picture (i.e. the above data is about few functions that get called often, but there are also many functions that only get called from few locations each):

    • The most common function signature (21 functions) is void function ( void), where we can't do anything about the calling convention.
    • Second (20 functions) is int function ( int)
    • Third (17 functions) is float function ( float). I guess that's mostly floating-point functions in the standard library.
    • Fourth (7 each) are void function ( struct *) and float function ( float, float).

    Another interesting aspect is looking at calls through function pointers. Here void function ( unsigned-char, void *) is the most common (17) followed by unsigned-char function ( void) (5).

    ¹ The data was summarized by removing qualifiers and considering all struct types to be the same.

    P.S.: The data was obtained using -mstm8. The difference from -mz80 should be small, apart from z80 having a few fewer calls to standard string library functions (z80 has builtins for some of them).

     

    Last edit: Philipp Klaus Krause 2021-05-21
    • Sergey Belyashov

      How many register arguments are supported for now?
      printf can be optimized too, if the fixed arguments are passed in registers and only the last one is also put on the stack.

       
      • Philipp Klaus Krause

        There is no hard limit on the number of register arguments. But I haven't done any testing with more than 2 yet.
        We can't use register arguments for printf: For variable arguments, the last argument before the variable ones needs to be on the stack (it is the last argument where we know the type at compile time and is passed to the va_start macro, where its address on the stack is used to compute the location of the variable arguments). Since printf has only one argument before the variable ones, that means no argument can be passed in a register.
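For readers less familiar with variable arguments, here is a small standard C sketch of the mechanism described above. The function sum_ints is a hypothetical example, not part of SDCC. Its only named parameter, count, is the one handed to va_start, which is why a stack-based va_start as described above needs that parameter to live on the stack.

```c
#include <stdarg.h>

/* Sums `count` int values passed after the last named parameter. */
static int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    /* va_start takes the last named parameter; the variable arguments
       are located relative to it. */
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);

    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) returns 6. printf has the same shape, with the format string as the single named parameter.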

         
        • Sergey Belyashov

          You misunderstood me. I mean passing the argument in both places: register and stack.

           
          • Philipp Klaus Krause

            That could be possible, but:
            1) Is not supported by the current infrastructure in SDCC, so we'd need more work there.
            2) Is most likely less efficient: Putting the parameter both in a register and on the stack is extra work for the caller. On the callee side, printf being a large function, the callee would most likely prefer to receive the argument on the stack. Large functions tend to have lots of variables that they want in registers, so the typically less often accessed parameters will go onto the stack to keep the temporaries in registers. I.e. when a large function receives a register parameter, the first thing it usually does is put it onto the stack, which tends to be less efficient in the callee than having it done by the caller.

             
  • Philipp Klaus Krause

    The very first experimental results are in.

    I used the stm8 target, as it is simpler than z80 (fewer registers). For those unfamiliar with it, there is an 8-bit accumulator a, and two 16-bit registers x and y. Most instructions available for x are also available for y, but many of them are 1 byte larger in code size. SDCC requires y as frame pointer for functions where stack offsets greater than 255 are needed. Code generation for function calls through pointers requires at least one of x or y to not be used for register arguments. A few instructions can access 8-bit parts of x or y as xl, xh, yl, yh.

    So far, I only considered a few variations: Callee-cleanup of stack vs. caller-cleanup (same choice for all non-vararg functions). Registers for return values (a, xl, xh, yl, yh for 8-bit, x, y for 16 bit, xy, yx for 32 bit). First argument in x, y, a, stack. Second argument in x, y, a, stack.
    To decide where an argument or return value goes, I only used its size. Later I will want to look into more fine-grained experiments that depend on the types of arguments and return values (e.g. consider using a different register for void * vs. int).

    The results so far (looking at code size only):

    • I can compile sdcc and the library for 320 different calling conventions, and build Dhrystone 2.1 with them all overnight, so I have the results by morning. I think that with a few more benchmarks, and even when executing them, it should still run in about 8 hours. This is on a Ryzen 4800H.
    • The results for everything that uses register y to pass arguments look wrong. Probably a code generation bug I need to look into.
    • Using y or part of it to pass 8- or 16-bit return values tends to have the highest code size.
    • For 32-bit return values, the difference between using xy and yx is negligible.
    • The current choice of return values (a for 8-bit, x for 16-bit, xy for 32-bit) is best. Though passing 32-bit in yx or passing 8-bit in xl or xh has similar code size.
    • Having the callee clean up the stack reduces code size vs. the current default of having the caller do it. But when some arguments are passed in registers, we see the opposite.
    • So far the best found is: Pass first argument in a if 8 bits, in x if 16 bits. If the first is in a, and the second has 16 bits, pass in x. If the first is in x and second has 8 bits, pass in a. It gives a 3% reduction in code size compared to the current default. Interestingly, this is the calling convention of the non-free Raisonance compiler, known for generating the smallest code among all STM8 C compilers (but also the slowest).
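The best convention found so far can be written down as a small decision function. This is a sketch that reads the rule in the last bullet literally, using argument sizes in bytes and 0 for "no such argument"; anything the rule does not mention is assumed to go on the stack. The names stm8_loc, stm8_args and assign_args are hypothetical and not taken from SDCC.

```c
/* Hypothetical location names for the first two arguments on stm8. */
enum stm8_loc { ON_STACK, IN_A, IN_X };

struct stm8_args {
    enum stm8_loc first;
    enum stm8_loc second;
};

/* size1 and size2 are the sizes in bytes of the first and second argument
   (0 if the argument does not exist). */
static struct stm8_args assign_args(unsigned size1, unsigned size2)
{
    struct stm8_args r = { ON_STACK, ON_STACK };

    /* First argument: a if 8 bits, x if 16 bits. */
    if (size1 == 1)
        r.first = IN_A;
    else if (size1 == 2)
        r.first = IN_X;

    /* Second argument: takes the other register if its size fits it. */
    if (r.first == IN_A && size2 == 2)
        r.second = IN_X;
    else if (r.first == IN_X && size2 == 1)
        r.second = IN_A;

    return r;
}
```

Under this reading, a function taking (char, int) gets both arguments in registers, while for (int, int) only the first goes into x and the second stays on the stack.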

    This week, I want to fix the issue that affected calling conventions using y as argument, extend the experiments to further benchmarks (probably stdcbench, Whetstone, Coremark), and also get benchmark scores in addition to code size.
    Next week, I probably won't have much time for this, but I hope to find some the following one.

     

    Last edit: Philipp Klaus Krause 2021-06-11