Attached is a patch with a new setjmp.s code for all CPU-s which have Z80-like implementations.
As far as I understand, in all cases the new implementation could result in less CPU cycles executed and in less bytes used by the machine code (but I haven't explicitly verified that "less cycles" claim with the simulator, except for z80 and z80n). This should be possible because HL can be directly used instead of using IY, which is on Z80 slower and also needs bigger instructions.
An additional benefit is that, as they don't use IY, they could be directly linked and used with the code compiled with --reserve-regs-iy
, i.e. they are more universal than up to now.
There are actually 3 variants.
pop af
at the start of longjmp (I can't deduce the reason for this from the calling conventions section of the manual, maybe I'm missing something, but I see that there in the existing code, so I do the same here too).ex de, hl
and instead of jp (hl)
they use pop bc
ret
to jump to the destination. Conveniently, the code already has the destination in BC at that moment.pop af
there's one pop de
too, in longjmp (again, I saw that in the existing code and I miss the reason for this in the manual, but it can be that I haven't read it carefully enough).In the case of Z80 assembled setjmp.s uses 12 bytes less, and both routines use 78 T states less than before. Measuring only longjmp it uses 4 bytes less and uses 19 T states less than before. That should mean that it's worth replacing both setjmp and longjmp routines.
I there's anything I should correct before this could be accepted I can do that too.
Did you run the regression tests ("make test-z80" in support/regression) with your setjmp/longjmp in the libraries?
No, I'm just learning what one has to do here. In some projects regtests take days or weeks on many cpus, so I haven't thought that it's realistic to try this on a notebook. How long does that take on a few cores? Thanks!
A Raspi 4 running FreeBSD from an SD card, building SDCC + full regression test (i.e. all ports, not just the z80-related ones) takes 20 hours.
I'll quickly run a time nice make -j 18 test-z80 on my laptop now, and post how long it took.
P.S.: It took 48 min.
Last edit: Philipp Klaus Krause 2024-06-12
Is this a typo "-j 18"? Or "laptop"? Or do I live in the past? I can't understand both in the same sentence :)
It's an old laptop with a Ryzen 4800H CPU from more than 4 years ago, the Ryzen 4800H has 8 cores with SMT2, so 16 hardware threads. For -j, I add two so the CPU has something to do when a software threads is waiting for disk I/O, and end up with -j 18.
Immensely different from my "computing" resources, this will take a while: "Ryzen 7 4800H outperforms Celeron N3150 by 1459% in Passmark." My notebook is faster but unsuitable for longer "burn" periods, already needed service after that. (That's why I target Z80 as a hobby :) )
I agree that "make test-z80" can still take substantial time. It runs the regression tests for 13 z80/z80-related configurations.
Often, to get a rough idea quickly after having made a change locally, I just run one or two of these (and later do a more extensive test before committing changes). E.g. the "make test-ucz80" target runs the tests for the z80 port with default options. "make test-ucz80-resiy" does so for the z80 port with --reserve-regs-iy, "make test-ucr3ka" runs the tests for the r3ka port with default options, etc.
For the Ryzen 4800H, "nice make -j 18 test-ucz80" runs in less than 4 min.
I've had a quick look at the patch, and the current implementations, and noticed the missing
pop de
for the Rabbits in the current implementations, which could result in wrong return values when the second argument to longjmp was neither 0 nor 1. Fixed in [r14883].Related
Commit: [r14883]
I think when that when the default calling convention for the z80-related ports changed, the setjmp/longjmp asm code was adapted to it, but no one bothered to ensure that the code was optimal for the new convention (after all it was already more efficient than with the old one, so there was an improvement).
Thanks for looking into this.
Regarding that
pop de
: we currently use two different calling conventions for z80-related ports (see section 4.3, "The Z80, Z180, Rabbit 2000, Rabbit 2000A, Rabbit 3000A, SM83(GameBoy), eZ80, TLCS-90 and R800 ports" of the manual). For some ports, a second 16-bit argument is on the stack, and can thus be fetched into de by
pop af
,pop de
.That is exactly what I have thought: the new calling convention rocks.
As I assume you are a designer of
__sdcccall(1)
, thanks for that change! I've got my first retro computer only start of this year, and was really glad once I saw that possibility for a better code than what existed longer, and now I'm making a new "development kit" just for the "dot commands" on Next that will use SDCC with__sdcccall(1)
, which is still a different approach from the existing solutions which keep depending on the__sdcccall(0)
.I started with a single application which would use
__sdcccall(1)
and now I'm repacking the functions to a few lib files to make them more generally useful. For what I do, I want to keep the interrupts enabled, so I have to use "reserve IY" consistently. If setjmp is changed, then, for what I do, only two .s functions (not counting that for jump) left in SDCC which use IY are memcpy and memmove (not counting jump iy function which won't be used).Maybe you know, just curios: is there a platform where the sequence
pop af``pop bc``push af
(to fetch the parameter which is on the stack) would destroy the return address which is to be used later? As far as I know, at least on Next that's safe, e.g. I saw that in__ultobcd.s
which I have also called with o problem on Next from my code, and I've also seen that there's a custom memcpy variant which uses AF as 16-bit register through which copying via pop and pushes is done.Regarding Rabbit & c conventions: after your answer (and [r14883]) I see that I wasn't careful enough when reading the manual about the calling conventions of these other ports, many thanks!
Related
Commit: [r14883]
The
pop af pop bc push af
sequence will destroy the return address for sm83 (AFAIK the unused flags on the Gameboy are hardwired to 0). For tlcs90, it will not destroy the return value, but mess up interrupt handling (the TLCS-90 has the interrupt flags in the flag register - this also affected the tlcs90 longjmp, which I just fixed in [r14885] - unfortunatly the tlcs90 port has so few users that it is easy to break something unnoticed).Related
Commit: [r14885]
Now, if I see correctly, that means that the patches can have only two versions of the file, as there are only two variants if
pop de pop de
is used everywhere where that DE has to be filled with the parameters.I've run the regtests with your changes, now it's in progress with my updated patch, if it works, I think I'll post here the new version tomorrow.
Last edit: Janko Stamenović 2024-06-16
Is there somewhere a reference output of the reg tests? If not, can you, please, for my reference, when you're in the directory where "regression" is a subdirectory, post your current output of:
sometime before you delete the individual log files? (I've deleted these from my "before" run by starting the new run, before I knew that these are the most convenient for comparison as the matching paths could not be reconstructed in the redirected output of the run).
What I'm getting up to now with my changes and that command:
Last edit: Janko Stamenović 2024-06-17
The daily regression test log can be found on the snapshot page behind the red or green diamond in the right column under RT.
https://sdcc.sourceforge.net/snap.php
The new version of the patch (edit: current (>=r14885) --> my changes) attached.
The regression tests results for
tst_*jmp
:Last edit: Janko Stamenović 2024-06-18
Thanks. I'll have a closer look next week. For now just two small comments:
* I prefer the more verbose comment style from the current implementations.
* You mentioned above "the patches can have only two versions of the file". It might still be possible that they can be made more efficient using instructions only available on some of the z80-related architectures or by considering timing differences. So if you do need top performance on these functions for a particular architecture, it might make sense to look into that. For details on the instruction sets one would have to look at manuals, but we have an (incomplete) summary at https://sourceforge.net/p/sdcc/wiki/Z80%20instruction%20timing/
Regarding 1 (comments) I have no problem in rewording the comments however you want, I've already commented every necessary point, but I was myself unsure of how to name what. Correct me if I'm wrong in any of the statements that follow:
Maybe "IX" is more correct than the "frame pointer"? If I understand, even if SDCC uses IX at the moment only for the frame pointer, it doesn't have to: from the point of view of setmp/longjmp IX is only the register that has to be preserved, but SDCC could be well enhanced to use IX for some other purposes during the life of the function which can be jumped to, while still depending on the convention that it will be preserved. It can already be used so in the hand written assembly, correct?
Re: "fetch spval" -- there writing "stack pointer" would suggest "the content of SP" when what is read is "the content of the second word of the three word structure used by setjmp and longjmp." It would be good if the structure has the description with their names, and then to refer across the setjmp / longjmp sources using these names?
E.g.
Once the names of the elements of the structure exist, it's easier to follow what is being done when. It wasn't necessary as the code was more direct with
but now "spval" (from that instance of that structure, I mean) is actually read to DE pushed as DE popped as HL and then loaded to SP. I think documenting where the value at which moment is helps following what's going on -- the intention behind some push of one register pair and pop to another I think should be written in not immediately obvious hand written assembly?
That was an intention behind my comments using "spval" here:
But as you are maintainer you should decide what would be for you ideal content of the comments and I can't and won't object to that at all!
Regarding 2 (instruction selection):
I think that code is already faster than before on every "port", and if I understand the timings for those for which I've changed the routines, none had made "worse" tradeoffs for the existing instructions, just improved the slower ones or added some additional ones, which I don't think could be used in these specific examples, but are certainly interesting in different scenarios.
Tangentially to that instruction selection, my goal was to have these two routines in a form usable for "reserve IY". Before that they could not have been used, but I see that there is
ucz80-resiy
target for regtests, are these tracked somewhere? Grepping in the z80 I find 4 routines still using IY, doesucz80-resiy
avoid them?Even more importantly, the compiler allows for
--reserve-regs-iy
but I think the libraries packed in the distribution aren't "reserve iy" friendly, that is, they are compiled to use IY. I'm compiling my own library from the sources, I guess that is an expected way to use such option. But maybe, alternatively, automatic library selection code could be a part of SDCC code, just like now the "port" library is selected (e.g. z80.lib or z80n.lib) shouldn't it be "built in" in SDCC to also search for "z80-resiy.lib" or "z80n-resiy.lib" when the "resiy" option is active in the linking phase? Maybe it already exists and I just don't know about that?(I would also like to use "resiy" instead of "--reserve-regs-iy" as I write btw, and I also note that I more often have to refer to the "code compiled when the --reserve-regs-iy option is not used" for which I'd also like some short and obvious way to refer to, maybe "no-resiy"?)
Once that selection of library is automatic, it would be good to have different implementations for resiy and no-resiy of these functions where using IY is actually advantageous for no-resiy target (I think in setjmp / longjmp it wasn't, but it seems to me (I haven't thoroughly checked, just an impression) that these memcpy.s and memmove.s are still a few cycles better with IY, so it would not be good to have only resiy version of them).
What would be a good approach for improvements regarding resiy libraries?
Yes, we do not by default build libraries for all possible configurations, only for the most common ones.
So the first question about a --reserve-regs-iy library is: for which ports is there sufficient user demand?
AFAIK, historically by far most users of --reserve-regs-iy did not use sdcc directly, but use the sdcc fork in z88dk. AFAIK z88dk only uses the compiler from SDCC, combined with a different preprocessor, assembler, linker and library.
My question was about the "automatic" selection of the library when resiy is a command line option, not about "building by default". I saw that there are tests in sdcc. My question was: if it is reasonable for a "standard" library selection to be somehow made with the awareness of that '--reserve-regs-iy' option in any way?
I'm producing my own "dk" to compile for some z80n targets which avoids that "big" dk mentioned by you (And that "big", if I understand correctly, has a license under which I can't use any sources from it in my own projects unless I use that whole dk? So I'm doing it completely independently anyway, using pure sdcc -- another decision I don't like in that big dk is them keeping
__stdccall(0
) a default).For sdcc: I know I can say "nostdlib" and explicitly specify my alternative build of z80n built with
--reserve-regs-iy
, that's what I'm doing locally already. And I can distribute that alternative build if I'd like, I'm aware of that too, with all GPL licenses included. But I wanted to know if there is a way or if it would be reasonable for that resiy information to be in any way present outside of the compilation phase, for a linker to be able to select the "right" library somehow similarly like there is already that -z80n-resiy target during that make.I fully understand if the best solution right now is to not change sdcc in any way.
The automatic selection should be doable: e.g. for mcs51 and pdk15 we doe have a separate --stack-auto library, which is used if that option is specified at link time.
While --reserve-regs-iy is a less serious difference (--stack-auto changes the ABI, --reserve-regs-iy does not), we could still do the same here.
Against adding that library selection mechanism is: if it's there, but if SDCC doesn't distribute e.g. "z80n-reserve-regs-iy/z80n.lib" then the linking in this case would fail. But at the moment even if the linking doesn't fail, the resulting binary would crash on the real systems which expect the IY to remain unchanged, so from the users' perspective, if the binary package didn't work immediately after unpacking before, due to the lack of the pre-built "reserve" lib, after adding the feature it can remain the same: they'd need to build such a library themselves. From that perspective, it frees SDCC from an obligation to distribute new binary precompiled libraries.
The official releases do have lib/src directory and the possibility for the user to build the libraries locally. Having the separate names would just allow the user to have both versions on his system and for the "right one" to be selected automatically, which is maybe better than now? So that could be good enough: not distributing the binary libs but having the better selection infrastructure there?
But, one more argument against the change: adding that change could mean that the existing SDCC build scripts and make-related infrastructure would have to be changed to be aware of such new naming conventions. Maybe is that trickier than adding the
get_model
functions for z80 ports for which, it seems, everything that should use them in the rest of SDCC*.c files is "already there" (as far as I see) because it is already used for other ports?I also don't know which other ports, except for z80 and z80n, would ever need
--reserve-regs-IY
. Specifically: ZX Spectrum, at least some of its clones and ZX Spectrum Next do need to not let the user's code change IY if it should run when the default interrupts are enabled and/or the ROM routines are being called. Maybe there would be never a practical need for addingget_model
and the -reserve-regs-iy lib search for more than these two ports? I don't know if other platforms need the same convention.Those writing "pure" games also often decide in advance that they don't want to use anything from the system and will block the interrupts, so they can use IY on that even on these platforms, but I'm targeting for the programs that work with the system. So that is maybe a specific use-case where a single user could use both versions: it's not just a processor type but the target environment of the user's code.
Personally, I don't make any games, and at the moment I do nostdlib and rebuild the whole library myself, but as I've started my development, using and linking of strlen and some other "default" function simple enough "worked" just because exactly these don't use IY. :) Only when I started to use more I saw that I do have to build the lib myself and enforce nostdlib.
Regarding the number of users: just checked: that "no crash" version of sdasz80.exe for W64 was downloaded "whole" two times since I've made it over the weekend. :)
Last edit: Janko Stamenović 2024-06-24
If IY is used so little in the libs, we might as well choose to always build them with --reserve-regs-IY.
We can also choose to place the text "--reserve-regs-IY" or something similar in the .optsdcc to enforce compatibility. This will prevent linking objects with mixed reserve-regs settings.
In all currently support calling conventions, IY is never used to pass arguments or return values. So code built with --reserve-regs-iy is compatible with code built without it.
IMO, choosing the lib depending on --reserve-regs-iy being specified at link time would be fine. IMO, the question is if it is worth having an extra library, and if yes, for which ports. I guess z80 is the most obvious candidate.
For which ports would you use a library built with --reserve-regs-iy, if it was available now? z80? z80n? Any other?
This week I've published my "--reserve-regs-iy," rebuild of the SDCC's z80n.lib inside of my "development kit" for NextZXOS dot files (that is z80n target):
https://gitlab.com/jankos-projects/
and there: Jankos_DNCDK and Jankos_DNCDK_sources
"Dot commands" are executables which would be equivalent to .COM programs on the CP/M or DOS, but they have more limitations: they get only one 8KB page initially at 0x2000, the command line and stack could be anywhere, but then there are additional preconditions for using the API calls, as these cause different banks to be switched etc.
I don't know what's the state of other systems having Z80-like CPUs, regarding the "reserve" registers: I know only about classic ZX Spectrum and all the clones (which would be the z80 target) and ZX Spectrum Next (z80n target). These would use
--reserve-regs-iy
for the programs compiled with the expectation to use the system calls and to keep the default interrupt routine. But when the program is written with an intention to take over the machine completely and never return to the system, like game writers typically did, then IY would be treated as "free to be used".In this rebuild of mine I also resurfaced the
printf_small
with some small changes (it didn't compile for z80)https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_small_mod.c?ref_type=heads
https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_small_compat.c?ref_type=heads
and I've also changed the
printf_large
https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_large_mod.c?ref_type=heads
to use byte sized widths instead of the integers, reducing the size of it for around 1KB, which is a lot, as the goal of that development kit are "dot" files for NextZXOS, where, even if the system can page a lot of newly allocated pages in, a "page" is 8 K and the dot files get only one page initially, so efficient use of the initial 8K page makes a difference. The kit is currently designed for exactly one code and one data page (each 8K), everything else is the responsibility of the user.
I've also let
printf_large
compile with much higher--max-allocs-per-node
than what is I think current default. I've posted about my experiments withprintf_large
some graph in some other message, 2 - 3 millions gets better result but compilation takes really a lot of time, so at the moment I also keep the .s file for faster library rebuilds:https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_large_mod-2048000.s?ref_type=heads
Last edit: Janko Stamenović 2024-07-10