#237 Advanced tail call optimization

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2019-11-05
Created: 2008-02-21
Private: No

When the last instruction before a return is a call to another function (a tail call), there is potential for optimization.

Feature request #1893626 implemented this on the Z80 for functions without parameters, as in the following example (making it as efficient as if a loop had been used):

extern volatile unsigned char ready;
extern volatile unsigned char data;

unsigned char poll_rec(void)
{
    if(ready)
        return(data);
    return(poll_rec());
}

However, for functions with parameters this seems too complicated to do in the peepholes.

When the callee's parameters do not use more stack space than the caller's, the parameters could be passed in the stack space used by the caller's parameters (the caller's parameters are no longer needed, since there are no instructions after the call). This would make it possible to optimize tail calls as in the following example (currently this would run out of stack space):

extern volatile unsigned char *readies;
extern volatile unsigned char *datas;

unsigned char poll_rec_i(unsigned char i)
{
    if(readies[i])
        return(datas[i]);
    return(poll_rec_i(i));
}
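
For illustration only (this is not compiler output, and poll_rec_i_loop is just a name for this sketch): with the tail call optimized as described, poll_rec_i would be as efficient as the following hand-written loop, which reuses the same stack space for i instead of growing the stack on every call.

extern volatile unsigned char *readies;
extern volatile unsigned char *datas;

/* Equivalent of poll_rec_i with the recursive tail call turned into a
   jump back to the top of the function. */
unsigned char poll_rec_i_loop(unsigned char i)
{
    for(;;)
    {
        if(readies[i])
            return(datas[i]);
    }
}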

Philipp

Discussion

  • Maarten Brock

    Maarten Brock - 2008-02-21

    I think this is more difficult than you think. What if you need to pass parameters on the stack to the function, and they are already on the stack, but maybe in a different place?
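
    To illustrate the problem with a hypothetical example (not from the request above): here the callee's parameters are exactly the caller's parameters, but in a different order, so they would have to be shuffled in place on the stack before jumping to the callee.

    int g(int a, int b);

    /* Tail call whose arguments are already on the stack, but swapped:
       reusing the caller's parameter space would require moving both
       values around without clobbering either one. */
    int f(int a, int b)
    {
        return g(b, a);
    }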

    And I also see no need to let the compiler solve a memory leak created by the user with this recursive calling.

     
  • Sarun Rattanasiri

    Hey Maarten,

    I am here because Philipp pointed me here from this thread:
    https://sourceforge.net/p/sdcc/discussion/1864/thread/ff9323e58b/

    I have practically zero knowledge about compilers.
    So, for that matter, I will leave it to you, Philipp, and others.

    However, I am a software architect, and I want to argue, and hopefully persuade you,
    that this feature is crucial and has value far beyond what you might imagine.
    It is not about fixing memory leaks or enhancing recursive algorithms.

    I am developing a powerful technique for embedded systems,
    based on the idea of composability:
    https://en.wikipedia.org/wiki/Composability

    To make this quick, please look at my demo
    https://github.com/midnight-wonderer/composed-1hz/blob/master/src/main.c
    The program is just your plain old LED blinking.
    The LED blinks at 1 Hz using timer interrupts.

    What matters is that the program is a composition of two solid, carefully designed modules, independent of each other:
    https://github.com/midnight-wonderer/stm8s-tim4-periodic-timer
    https://github.com/the-cave/time-prescaler
    STM8S TIM4 Periodic Timer is firmware (which depends on hardware),
    while
    Time Prescaler is software (which is independent of hardware).

    To achieve composability in the C language, I exploit function pointers.
    Like jigsaw pieces, functions can be composed cleanly with function pointers.
    There are principles involved in designing such modules, but to keep this brief, I won't go into detail here.
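
    To make the idea concrete, here is a minimal sketch of the pattern (all names are made up for this example, not taken from the repositories above): a module does its own work and then hands control to the next module through a function pointer, and that hand-off is a tail call, which is why this optimization matters to the technique.

    typedef void (*handler_t)(void);

    static handler_t prescaler_output;        /* wired up once, at composition time */

    void prescaler_connect(handler_t next)
    {
        prescaler_output = next;
    }

    void prescaler_tick(void)
    {
        static unsigned char count;
        if(++count < 8)
            return;
        count = 0;
        if(prescaler_output)
            prescaler_output();               /* tail call into the next module */
    }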

    For a full-blown system (where everything is composable), the call stack could get really deep.
    It would be one long chain of functions and their pointers.
    And this particular optimization happens to prevent that shortcoming.

    But what does this mean to you and the world?
    (let's assume everyone is using the technique)

    • No one has to write a time prescaler again, no matter what microcontroller they are using, because I already did it, and I made it composable.
    • Debouncing a button too, no one has to do it again, because I did it too.
      https://github.com/the-cave/button-debounce, one of the best IMO.
    • There are tons of other things that other people will do; they would raise the abstraction level of embedded software and make it even more fun.
    • STM8S TIM4 Periodic Timer is an example of firmware development, something that works out of the box. Got another microcontroller, maybe pdk14/pdk15? Just throw away the firmware and plug another in, keep the rest of the software the same. Push the boundary of "portability". The firmware should be composable too.
    • Debugging C code with an in-circuit debugger? No way, unless you are debugging firmware. You can take a composable module and compile it with clang on x86. (As long as the bug is not related to compilers or specific instruction sets, i.e., it is a logic bug of the software itself.)
    • Co-operative schedulers? Does not matter anymore; the composable modules never block.
    • Pre-emptive schedulers? Not for normal use-cases, only for CPU intensive tasks.
    • Co-routines? They would be outdated and duplicate the effort.
    • Fewer bugs, more reliability; just a glance at a small, specific module decreases the chance of a bug even without writing an automated test. The technique also enables more automated tests, and the tests can be executed on x86 machines. Having more people use the same piece of software makes it more reliable and trustworthy.
    • Operating systems? Fine, but we don't really need one, at least for controlling jobs.
    • We can even have C package managers for embedded systems too, one like https://www.npmjs.com/. And different hardware from different vendors can share C code. The industry as a whole can profit from hobbyists too.
    • Web and enterprise software development has some advanced techniques here and there; it has bugged me for so long that the embedded community moves slowly and never unites, but couldn't it?

    And you know what?
    I can't do any of what I said. I can't really build a community. I can't even alter the functionality of the compiler I am using.
    But I try to enable things on my part: showing people what is possible, inventing better ways to do things.
    So we are a step closer to something better.

    And this optimization would really help.

    Plug section:
    If you aren't really interested in my composability technique, fine.
    However, if by chance you want to read more about it,
    I plan to publish it one day on my blog (currently nothing there), once I have more time to work on it.
    https://slime.systems/

    TL;DR
    The things that you claim to be bugs are actually features, and the use case for the compiler is valid.

     
  • Philipp Klaus Krause

    I agree that a first step could be to do the optimization in cases where there are no parameters, where we would just move the stack frame adjustment in front of the call.
    But even there we need to be careful, and analyze the code before we do so:

    extern int i;
    void a(char *p);
    void f(void);
    void g(void);

    void test3(void)
    {
        char c[16];
        a(c);
        if(i)
            f();
        else
            g();
    }
    

    a could store c in a global pointer variable that f or g could access, so in this case we could not do such an optimization.
    It matters whether the caller (test3 here) has any variables that have their address taken.
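
    By contrast, in a hypothetical variant like the one below (not part of the example above), no local variable has its address taken, so nothing that f or g could observe lives in the caller's frame, and the frame could be released before the tail call.

    extern int i;
    void a2(char x);
    void f(void);
    void g(void);

    void test4(void)
    {
        char c = 42;
        a2(c);             /* c is passed by value; its address never escapes */
        if(i)
            f();
        else
            g();
    }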

     
  • Philipp Klaus Krause

    Today, I made some improvements to tail call optimization in the stm8, z80, z180, gbz80, r2k, r3ka, tlcs90, ez80-z80, pdk13, pdk14, pdk15 backends. In particular, parts of tail call optimization that used to be done in the peephole optimizer are now done in code generation instead.

    As of [r11425], for stm8, tail calls should always be optimized when the following conditions hold:

    • The callee has no arguments
    • The callee's return value is not used
    • The caller has no local aggregates or unions
    • The caller has no local variables that have their address taken
    • There are at most 250 bytes of local variables in the caller

    Sarun, is this sufficient for you?

    For some simple cases, tail calls will be optimized even if some of these conditions don't hold.
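
    For example, a hypothetical caller like the one below should meet all of the conditions above: the callee takes no arguments, its return value is unused, and the caller has only a small scalar local whose address is never taken, so the final call should become a jump.

    extern volatile unsigned char busy;
    void step(void);

    void run(void)
    {
        unsigned char retries = 3;
        while(busy && retries--)
            ;
        step();            /* tail position: call can be replaced by a jump */
    }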

     

    Last edit: Philipp Klaus Krause 2019-10-13
  • Sarun Rattanasiri

    Whoa,
    The response is quite fast.
    But I just realized something: the problem is quite tough indeed.

    The main blocker for me is: "The callee has no arguments".
    I just realized that the arguments on the stack are cleared by the caller.

    _two_millisecond_handler:
    ;   src/main.c: 20: time_prescaler8__tick(&prescaler_config, &prescaler_state);
        push    #<(_prescaler_state + 0)
        push    #((_prescaler_state + 0) >> 8)
        push    #<(_prescaler_config + 0)
        push    #((_prescaler_config + 0) >> 8)
        call    _time_prescaler8__tick
        addw    sp, #4
    ;   src/main.c: 21: }
        ret
    

    If the caller is the one who clears the stack, the call is not a tail call anymore.
    So "a function that takes parameters can't be optimized by tail call elimination".

    I wonder why the callee is not the one who cleans up after itself.
    However, I suspect that the reason, whatever it is, is quite valid as well.

    For me the problem is still there, but with your optimization the call stack usage efficiency would double (half of the tail calls have been eliminated). Thanks.

    This is not a show stopper for me.
    I have a plan B:
    https://github.com/the-cave/deferred-execution
    which, under the same condition as the optimization (the callee takes no arguments), can flatten the stack frame, albeit with some downsides.

    I don't know much about compilers, but may I ask why it is designed this way? (The caller clears the stack as opposed to the callee doing it.)

     
  • Sarun Rattanasiri

    @spth the last change introduces a bug in ISRs.
    In the case of an ISR, the optimization leads to ret instead of iret.

     
    • Philipp Klaus Krause

      Thanks for reporting. It is fixed now [r11446].

       

      Last edit: Philipp Klaus Krause 2019-11-05
