I've been using this patch in my local version for a while now. As the title says, it implements stack probing for mcs51,but not for xstack, only the regular stack.
It does so by injecting the stack pointer check in the prologue of each function and assumes that the conditional branch is normally not taken if the stack check passes.
This is not perfect of course. It doesn't catch wraparound of the stack pointer and also doesn't catch a stack overflow after allocating the stack frame inside of the function (because the check is done before that). But it's better than nothing at all and has proven useful for me.
One issue I don't know how to deal with is how to extend the device lib set with the permutation for the option '--stack-probe'. The idea is to have stack probing also enabled for rt / c std device libs (printf can overflow the stack easily for example) when it's specified in the user application.
@spth do you have any suggestion?
The only option is to rebuild the library with that option.
Is the user supposed to bring his own stack_probe_err() function?
Why not implement a default one in the library?
Why are you adding 8 to the SP? Why not the size of the stack frame?
Thanks for looking at it @maartenbrock
That's what I've been doing so far -- hack it in the device lib Makefile.in as compiler option. Doesn't "scale well" though.
Real world example: I've got two different 8051 devices. The first one has a larger flash and is prone to stack overflows. So the checks help a lot during development and increased code size is not an issue. The second device has only a small code flash and has usually no issue with the stack, so would like to omit the stack frame checks to not waste code flash.
Recompiling & reinstalling the device libs in SDCC is really cumbersome.
So what I'd like to add is something like GCC's multilib for --stack-probe option. It seems something similar is already done for stack-auto?
Yes, that's what I've been doing for myself. The stack_probe_err is a special function that is implemented by the application / "board support library". It uses board specific ways to print an abort message somewhere -- usually the uart.
A default implementation could be added, which could use printf / puts. But it should be overrideable by the user application. Not sure how to implement that within SDCC.
Good point! Didn't think about it and/or didn't bother to check how to obtain the stack frame size at that time, as I needed "something working quickly".
The default could be as simple as looping forever. If you have in-circuit-debugging you can break and inspect things.
Any function in the library can be overridden by supplying your own. The linker uses all objects you give to it and then looks for the missing pieces in the libraries.
Something like the default impl. of
device/lib/__assert.clooks useful. It printfs something and infinite-loops after.Alright. I'll give a another shot. Thanks for the pointer.
What about the multi-lib issue? Do the same as for stack-auto? Looks like it's going to be a bigger changeset ...
I've integrated the stack probing into the stack adjustment code. Default impl. of library function 'stack_probe_err' is still not there, will add it later.
@maartenbrock could you please have a look whether this is reasonable like that? In particualr, I'm not sure how reliable 'pop sp' is across all the different 8051 core implementations?
There are some issues with this patch.
* There is no 'push a' and 'push acc' is 2 bytes
* stack_probe_restore_acc_code_line is still unused
* Internal stack can also be probed when xstack is specified as it is still used
* 'i + 1' vs (i - 1)
I've tested POP SP on a SiLabs EFM8 Universal Bee and it does what you expect: SP contains the popped value afterwards. This case is also explicitly documented in the MCS-51 Family of Single Chip Microcomputers User's Manual.
Although contradictory to the following definition of operation:
On the EFM8 PUSH SP stores SP as it was before the push. This seems contradictory to the original User's Manual which has this definition of operation:
It has no special example for PUSH SP.
The ucsim s51 simulator stores the incremented value.
But further down in the User's Manual in the 8051 Instruction Set Simulator code it says:
The stack pointer is incremented inside the PUSH$STACK procedure.
This means that PUSH SP should store the non-incremented stack pointer value.
Last edit: Maarten Brock 2023-07-19
@maartenbrock thanks for checking it!
It assembles and links just fine here it seems. The assembler seems to take care of it. But I in other places SDCC emits 'acc', so perhaps it's good to be consistent. I know that that insn is 2 bytes in size.
Oops. I forgot to remove this leftover piece.
Oh, OK. I've never used xstack myself, so didn't pay attention to that.
That was basically my main uncertainty regarding using 'pop sp'. 'push sp' implementation is irrelevant as it's not used. If EFM8's CIP-51 core implements the special case of 'pop sp' as per original manual, then perhaps that's what can be expected there? I will try it out on the DW8051 (synopsys) core to double check.
Your change of original patch:
to
I was counting from the first insn after the 'jc' insn onwards. Not the whole code emitted in the if block, but only the code to do the actual adjustment:
... because the 'add' and 'jc' will be always there, in all cases.
So the 14 bytes are worth 7 'inc sp' insns. However, that case includes also a 'pop acc', so it's worth only effectively 6 'insn sp' insns. Hence I put the threshold to 'i > 5'.
If I count all the code that is emitted in that else block, then I arrive at 20 for i = 7, which is the max. i that this else case can hit as per your change to 'if (i > 7)' above. Or not?
Last edit: Oleg Endo 2023-07-21
What I meant was that the comment states you need i+1 and the code actually uses i-1 !
And now we know that you don't need either.
Let's count all bytes. And my second piece was wrong in that regard. Your threshold of 5 is correct.
The non-probing case for i>9 is not right. Not only is there no post-decrement on the popped SP value, but you have already pushed twice when SP is read into A. Also SP can be read before saving R0 which saves the need for DEC R0.
And similarly in the stack probing case, when A is pushed because there is no free reg, the offset changes.
Come to think of it, all R0-R7 registers should always be free at function entry, unless it's switching register bank.But why not make the stack probing a support routine that takes the stack offset as parameter in A and returns the new stack value in A? E.g.
The stack probing routine can check underflow and overflow, has the caller automatically on stack and needs no lcall before the function.
Last edit: Maarten Brock 2023-07-21
OK, perhaps easier to follow.
Good point. I've refined it. Code can be shortened a bit.
Should be fixed.
All my software never hits those cases when
accIsFree == false || freereg == NULL. I also thought it's a bit odd thatfreereg == NULLis possible at all. But then, this path was already there so I didn't dig further...Yeah, I've seen DS390 uses this approach. However, one of my main devices here has über-slow jumps -- ~48 cycles or so. So having the
lcall ___sdcc_stack_probein there costs each function call an additional 100 cycles at least ('lcall' + 'ret' from the probe function). Hence I implemented it originally inlined in the epilogue.Devices like EFM8 (which I also use) don't suffer from that so much, but still an 'lcall' is up to 7 cycles and a 'ret' is up to '8' cycles, while the not taken 'jc' conditional branches cost only 2 cycles. So I'd prefer to have inlined stack probing.
Attached updated patch now also includes the default implementation for the
__stack_probe_errfunction and documentation update.Last edit: Oleg Endo 2023-07-21
The epilogue is what happens at the end (of the function). The prologue is what happens at the beginning. Your code is in the prologue.
The _stack_probe_err can be marked _Noreturn or [[ noreturn ]] to make sure it will not return.
I would definitely not use printf() in this error function. It is a very complex function that is very likely to fail when the stack is corrupted.
Instead of
__start__stackyou can also uses_SSEG(they are equal). It is accompanied byl_SSEGwhich holds the length of the stack segment. And since the linker doesn't support adding two relocatable symbols either, I consider addinge_SSEG.The underflow check does not yet take the PUSH ACC into account and neither does the i<=5 case. With the underflow check fixed the overflow no longer needs i-1.
There are very few 8051 derivatives where jumps are so relatively expensive. How many cycles do your checks cost in the accIsFree case? Does the device have a prefetch buffer and/or cache? I would prefer to use a support routine and save bytes.
To me, it looks like there is a trade-off between code size and speed where different users have different preferences.
Usually, in such cases, we go with the solution that yields lower code size by default, but use the faster one if
--opt-code-sizeis specified by the user.-opt-code-speedvs.--opt-code-sizetypo?Like I wrote in my other reply, I would suggest to add an alternative space optimizing implementation of stack probing afterwards.
I'm not going to propose to enable stack probing by default, so there will be no code size regression/increase at all, unless stack probing is enabled by the user specifically.
Yes, I meant
--opt-code-speed.I know. Just a typo. Fixed.
Yes, that's what I have in my app code. I wasn't sure if that's OK for the device libs. I've copied the code from assert.c which doesn't specify noreturn.
Since the call site always uses 'lcall' I think specifying noreturn will do effectively nothing in this case. But I've added it anyway.
Ttrue. I didn't consider that (In my app implementation it resets the stack before doing any printing). Removed.
I'm not sure how to use the area/section/segment names? What would that improve? Can you give an example?
By underflow you mean the 'subb' check? If so ...
Why does it have to? At that point, all it cares about is whether the stack pointer has rolled over and is somewhere in the range 0x00 - start_stack. Note that earlier in the prologue, there might be other stack pushes, e.g. for the frame pointer. But I don't see how that would be relevant at that point.
Sorry, I don't follow. Please point out in the code.
IMO, even on the EFM8UB I wouldn't want to have those extra 15 cycles for each function call. The thing is already slow as-is.
The EFM8UB does have a prefetch buffer, and it makes taken jumps more expensive than if the prefetch buffer is disabled. The DW8051 device doesn't have any buffer/cache.
For the cases like freereg == null, it might be good to outsource the stack adjustment + probing code into a support routine, when optimizing for size. For the other cases, I can't agree with it. Can we leave that as a potential future improvement? (If anybody cares at all, since I won't enable stack probing by default for everything).
OK, I've tried it out, I can replace
__start__stackwiths_SSEG. But as such, it doesn't change much for the current implementation.Support for stack end != 0xFF could be added of course, but is there any practical use case for that? I'm not sure ....
I am also not sure about checking stack end != 0xFF. It can happen, but is unlikely.
But at least by using
l_SSEGit is possible. There is no__end__stackor__len__stackor similar and there probably never will be.However, if the stack probing was done by an external assembly routine, it would be easier for any user to modify it to add this length check. It could even be implemented behind an .if clause.
Example:
__sdcc_stack_probe_1is for the case that A needs to pushed before the call.Specifying noreturn should prevent the user from creating a function that does return since it has no proper return address. It should probably also be added to the intrinsic functions prototypes in SDCC so it will always protest when noreturn is left out.
You can at least take into account that you pushed ACC here. If more pushes were done in between, it would be best if they are taken into account as well. Who knows, it might have overflown back into valid stack space!
After this, you don't need to use 'i - 1' any more in the (i > 5) case and neither in the else where you missed it (i<=5).
The DW8051 has no buffer/cache but making a jump still takes approx. 50 cycles where other instructions take approx. 2. I'm very surprised.
@maartenbrock I've tried out the following on an DW8051 core:
void test (void)
{
it prints '32'.
Looks like it also follows the 'pop sp' special case of the original mcs-51 like the EFM8. So can we assume that this works? I've tried to look at some open source 8051 soft core implementations but their operation in this regard is not straight forward to see through for me...
@maartenbrock replying in new message (it's getting too cluttered here)
OK, makes sense to me.
If we set iram size = 128, then stack end is implicitly 0x7F. That's what I was primarily thinking of. But I'm not sure how widely devices with iram size = 128 are still used. Most have 256 these days, I think. There is a
--stack-sizeoption. Theoretically it can be used to place user variables after the stack space (up to end of iram). But also not sure what's the advantage of that over placing those variables before the stack ...Perhaps easiest way for now is to assume that iram size = 256 and stack space sits at the end of iram. If that's not good enough for somebody, they are free to report :)
Yes, sure, no doubt about it. The external function would allow for a more flexible stack probing implementation and it's easy to add complex sophisticated detailed checks, whereas my goal for this one was to have fast-best-effort checks so that stack probing can be left in the production build of the software.
We could make this available via options. E.g.
--stack-probewould do the function call approach while--stack-probe-inlinewould do what I'm proposing here.Ah! OK, I get it now. Yes, it's more obvious that way. I'll change it. Thanks!
Yes, it's not perfect. Every push onto the stack in the prologue code could overflow it. In fact, the call-insn itself might overflow it, too. Of course we can't check after every single push insn. The 'subb' check does basically cumulative checking of all those, in the hope that the number of bytes pushed onto the stack so far has not exceeded 'stack_start - 0x00'. If that were the case, it would wrongly judge that the 'sp' is OK.
We can figure out how many bytes would need to be pushed by the prologue. I'd just count all the emitted push insns and then insert the following snippet at the very top of the function:
It would take this whole thing back to almost the original version of my patch, where I was doing the 'add' check first, followed by 'subb'.
The current version of the patch produces overall smaller code size for me, as the 'add' check is integrated into the actual stack adjustment code and the number of pushed bytes in the prologue is usually zero (I'm always using --fomit-frame-pointer).
So if there are any number of pushed bytes in the prologue, it should emit the 'add' check at the top of the function and just add the 'stackAdjust' number to the number of prologue bytes. That would be more robust. I could add this if you insist. Although I don't have any need for that myself at the moment.
@maartenbrock Sorry, got worked up with another thing.
I've revised the patch, to address your point:
The patch is a bit shorter now.
@maartenbrock While rebasing the patch onto the current SVN trunk, I've noticed that there is a problem with the
--all-callee-savesoption.The following test code:
Compiled with
sdcc -c --model-large --std-sdcc2x -mmcs51 --stack-auto --nogcse --fverbose-asm --stack-probe --all-callee-savesThe code at the end of
genEndFunctionwrongly eliminates the pop insns. This happens when any one of the stack adjustment or stack probe insns is inserted.This happens because those stack modification insns will have the iCode set to
FUNCTION. And the code ingenEndFunctionuses that as a terminating condition for scanning for push insns when it checks if a pop insn can be eliminated. And thus it wrongly eliminates the pop insns.This looks like a latent bug, because it would also happen if any of the previous stack / xstack adjustment insns are emitted. It seems to work OK if the iCode of the emitted insns is set to NULL.
Attached is the updated patch which should now apply cleanly onto trunk.
@maartenbrock any updates on this one?