#237 Advanced tail call optimization

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2019-11-05
Created: 2008-02-21
Private: No

When the last instruction before a return is a call to another function (a tail call), there is potential for optimization.

Feature request #1893626 implemented this on the Z80 for functions without parameters, as in the following example (making it as efficient as if a loop had been used):

extern volatile unsigned char ready;
extern volatile unsigned char data;

unsigned char poll_rec(void)
{
    if(ready)
        return(data);
    return(poll_rec());
}

However, for functions with parameters this seems too complicated to do in the peepholes.

When the callee's parameters do not use more stack space than the caller's, the parameters could be passed in the stack space used by the caller's parameters (the caller's parameters are no longer needed, since there are no instructions after the call). This would make it possible to optimize tail calls as in the following example (currently this would run out of stack space):

extern volatile unsigned char *readies;
extern volatile unsigned char *datas;

unsigned char poll_rec_i(unsigned char i)
{
    if(readies[i])
        return(datas[i]);
    return(poll_rec_i(i));
}
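
For illustration only (this is not compiler output, and poll_rec_i_loop is just a name for this sketch): with the tail call optimized as described, poll_rec_i would be as efficient as the following hand-written loop, which reuses the same stack space for i instead of growing the stack on every call.

extern volatile unsigned char *readies;
extern volatile unsigned char *datas;

/* Equivalent of poll_rec_i with the recursive tail call turned into a
   jump back to the top of the function. */
unsigned char poll_rec_i_loop(unsigned char i)
{
    for(;;)
    {
        if(readies[i])
            return(datas[i]);
    }
}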

Philipp

Discussion

  • Maarten Brock

    Maarten Brock - 2008-02-21

    I think this is more difficult than you think. What if you need to pass parameters on the stack to the function, and they are already on the stack, but maybe in a different place?
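
    To illustrate the problem with a hypothetical example (not from the request above): here the callee's parameters are exactly the caller's parameters, but in a different order, so they would have to be shuffled in place on the stack before jumping to the callee.

    int g(int a, int b);

    /* Tail call whose arguments are already on the stack, but swapped:
       reusing the caller's parameter space would require moving both
       values around without clobbering either one. */
    int f(int a, int b)
    {
        return g(b, a);
    }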

    And I also see no need to let the compiler solve a memory leak created by the user with this recursive calling.

     
  • Sarun Rattanasiri

    Hey Maarten,

    I am here because Philipp pointed me here from this thread:
    https://sourceforge.net/p/sdcc/discussion/1864/thread/ff9323e58b/

    I have practically zero knowledge about compilers.
    So, for that matter, I will leave it to you, Philipp, and others.

    However, I am a software architect, and I want to argue, and hopefully persuade you,
    that this feature is crucial and has value far beyond what you might imagine.
    It is not about fixing memory leaks or enhancing recursive algorithms.

    I am developing a powerful technique for embedded systems,
    based on the idea of composability:
    https://en.wikipedia.org/wiki/Composability

    To make this quick, please look at my demo
    https://github.com/midnight-wonderer/composed-1hz/blob/master/src/main.c
    The program is just your plain old LED blinking.
    The LED blinks at 1 Hz using timer interrupts.

    What matters is that the program is a composition of two solid, carefully designed modules, independent of each other:
    https://github.com/midnight-wonderer/stm8s-tim4-periodic-timer
    https://github.com/the-cave/time-prescaler
    STM8S TIM4 Periodic Timer is firmware (which depends on hardware),
    while
    Time Prescaler is software (which is independent of hardware).

    To achieve composability in the C language, I exploit function pointers.
    Like jigsaw pieces, functions can be composed cleanly with function pointers.
    There are principles involved in designing such modules, but to keep this brief, I won't go into detail here.
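
    To make the idea concrete, here is a minimal sketch of the pattern (all names are made up for this example, not taken from the repositories above): a module does its own work and then hands control to the next module through a function pointer, and that hand-off is a tail call, which is why this optimization matters to the technique.

    typedef void (*handler_t)(void);

    static handler_t prescaler_output;        /* wired up once, at composition time */

    void prescaler_connect(handler_t next)
    {
        prescaler_output = next;
    }

    void prescaler_tick(void)
    {
        static unsigned char count;
        if(++count < 8)
            return;
        count = 0;
        if(prescaler_output)
            prescaler_output();               /* tail call into the next module */
    }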

    For a full-blown system (where everything is composable), the call stack could get really deep.
    It would be one long chain of functions and their pointers.
    And this particular optimization happens to prevent that shortcoming.

    But what does this mean to you and the world?
    (let's assume everyone is using the technique)

    • No one has to write a time prescaler again, no matter what microcontroller they are using, because I already did it, and I made it composable.
    • Debouncing a button too, no one has to do it again, because I did it too.
      https://github.com/the-cave/button-debounce, one of the best IMO.
    • There are tons of other things that other people will do; they would raise the abstraction level of embedded software and make it even more fun.
    • STM8S TIM4 Periodic Timer is an example of firmware development, something that works out of the box. Got another microcontroller, maybe pdk14/pdk15? Just throw away the firmware and plug another in, keep the rest of the software the same. Push the boundary of "portability". The firmware should be composable too.
    • Debugging C code with an in-circuit debugger? No way, unless you are debugging firmware. You can take a composable module and compile it with clang on x86. (As long as the bug is not related to compilers or specific instruction sets, i.e., it is a logic bug of the software itself.)
    • Co-operative schedulers? Does not matter anymore; the composable modules never block.
    • Pre-emptive schedulers? Not for normal use-cases, only for CPU intensive tasks.
    • Co-routines? They would be outdated and duplicate the effort.
    • Fewer bugs, more reliability; just a glance at a small, specific module decreases the chance of a bug even without writing an automated test. The technique also enables more automated tests, and the tests can be executed on x86 machines. Having more people use the same piece of software makes it more reliable and trustworthy.
    • Operating systems? Fine, but we don't really need one, at least for controlling jobs.
    • We can even have C package managers for embedded systems too, one like https://www.npmjs.com/. And different hardware from different vendors can share C code. The industry as a whole can profit from hobbyists too.
    • Web and enterprise software development has some advanced techniques here and there; it has bugged me for so long that the embedded community moves slowly and never unites, but couldn't it?

    And you know what?
    I can't do any of what I said. I can't really build a community. I can't even alter the functionality of the compiler I am using.
    But I try to enable things on my part: showing people what is possible, inventing better ways to do things.
    So we are a step closer to something better.

    And this optimization would really help.

    Plug section:
    If you aren't really interested in my composability technique, fine.
    However, if by chance you want to read more about it,
    I plan to publish it one day on my blog (currently nothing there), once I have more time to work on it.
    https://slime.systems/

    TL;DR
    The things that you claim to be bugs are actually features, and the use case for the compiler is valid.

     
  • Philipp Klaus Krause

    I agree that a first step could be to do the optimization in cases where there are no parameters, where we would just move the stack frame adjustment in front of the call.
    But even there we need to be careful, and analyze the code before we do so:

    extern int i;
    void a(char *p);
    void f(void);
    void g(void);

    void test3(void)
    {
        char c[16];
        a(c);
        if(i)
            f();
        else
            g();
    }
    

    a could store c in a global pointer variable that f or g could access, so in this case we could not do such an optimization.
    It matters whether the caller (test3 here) has any variables that have their address taken.
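
    By contrast, in a hypothetical variant like the one below (not part of the example above), no local variable has its address taken, so nothing that f or g could observe lives in the caller's frame, and the frame could be released before the tail call.

    extern int i;
    void a2(char x);
    void f(void);
    void g(void);

    void test4(void)
    {
        char c = 42;
        a2(c);             /* c is passed by value; its address never escapes */
        if(i)
            f();
        else
            g();
    }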

     
  • Philipp Klaus Krause

    Today, I made some improvements to tail call optimization in the stm8, z80, z180, gbz80, r2k, r3ka, tlcs90, ez80-z80, pdk13, pdk14, pdk15 backends. In particular, parts of tail call optimization that used to be done in the peephole optimizer are now done in code generation instead.

    As of [r11425], for stm8, tail calls should always be optimized when the following conditions hold:

    • The callee has no arguments
    • The callee's return value is not used
    • The caller has no local aggregates or unions
    • The caller has no local variables that have their address taken
    • There are at most 250 bytes of local variables in the caller

    Sarun, is this sufficient for you?

    For some simple cases, tail calls will be optimized even if some of these conditions don't hold.
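
    For example, a hypothetical caller like the one below should meet all of the conditions above: the callee takes no arguments, its return value is unused, and the caller has only a small scalar local whose address is never taken, so the final call should become a jump.

    extern volatile unsigned char busy;
    void step(void);

    void run(void)
    {
        unsigned char retries = 3;
        while(busy && retries--)
            ;
        step();            /* tail position: call can be replaced by a jump */
    }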

     

    Last edit: Philipp Klaus Krause 2019-10-13
  • Sarun Rattanasiri

    Whoa,
    The response is quite fast.
    But I just realized something: the problem is quite tough indeed.

    The main blocker for me is: "The callee has no arguments".
    I just realized that the arguments on the stack are cleared by the caller.

    _two_millisecond_handler:
    ;   src/main.c: 20: time_prescaler8__tick(&prescaler_config, &prescaler_state);
        push    #<(_prescaler_state + 0)
        push    #((_prescaler_state + 0) >> 8)
        push    #<(_prescaler_config + 0)
        push    #((_prescaler_config + 0) >> 8)
        call    _time_prescaler8__tick
        addw    sp, #4
    ;   src/main.c: 21: }
        ret
    

    If the caller is the one who clears the stack, the call is not a tail call anymore.
    So "a function that takes parameters can't be optimized by tail call elimination".

    I wonder why the callee is not the one who cleans up after itself.
    However, I suspect that the reason, whatever it is, is quite valid as well.

    For me the problem is still there, but with your optimization the call stack usage efficiency would double (half of the tail calls have been eliminated). Thanks.

    This is not a show stopper for me.
    I have a plan B:
    https://github.com/the-cave/deferred-execution
    which, under the same condition as the optimization (the callee takes no arguments), can flatten the stack frame, albeit with some downsides.

    I don't know much about compilers, but may I ask why it is designed this way? (The caller clears the stack as opposed to the callee doing it.)

     
  • Sarun Rattanasiri

    @spth the last change introduces a bug in ISRs.
    In the case of an ISR, the optimization leads to ret instead of iret.

     
    • Philipp Klaus Krause

      Thanks for reporting. It is fixed now [r11446].

       

      Last edit: Philipp Klaus Krause 2019-11-05
