Menu

#479 new setjmp.s for all Z80-similar CPUs

None
closed-accepted
None
5
2024-07-10
2024-06-12
No

Attached is a patch with a new setjmp.s code for all CPU-s which have Z80-like implementations.

As far as I understand, in all cases the new implementation could result in less CPU cycles executed and in less bytes used by the machine code (but I haven't explicitly verified that "less cycles" claim with the simulator, except for z80 and z80n). This should be possible because HL can be directly used instead of using IY, which is on Z80 slower and also needs bigger instructions.

An additional benefit is that, as they don't use IY, they could be directly linked and used with the code compiled with --reserve-regs-iy, i.e. they are more universal than up to now.

There are actually 3 variants.

  • z80 z80n z180 r800
    (The main variant)
  • r2k r2ka r3ka
    (Variant B) these
    • don't return the value in DE but in HL
    • have one pop af at the start of longjmp (I can't deduce the reason for this from the calling conventions section of the manual, maybe I'm missing something, but I see that there in the existing code, so I do the same here too).
    • because they return value in HL then longjmp before returning, compared to the main variant have to do one ex de, hl and instead of jp (hl) they use pop bc ret to jump to the destination. Conveniently, the code already has the destination in BC at that moment.
  • ez80_z80 tlcs90
    (Variant C) These have everything Variant B has, just after that pop af there's one pop de too, in longjmp (again, I saw that in the existing code and I miss the reason for this in the manual, but it can be that I haven't read it carefully enough).

In the case of Z80 assembled setjmp.s uses 12 bytes less, and both routines use 78 T states less than before. Measuring only longjmp it uses 4 bytes less and uses 19 T states less than before. That should mean that it's worth replacing both setjmp and longjmp routines.

I there's anything I should correct before this could be accepted I can do that too.

1 Attachments

Discussion

1 2 > >> (Page 1 of 2)
  • Philipp Klaus Krause

    Did you run the regression tests ("make test-z80" in support/regression) with your setjmp/longjmp in the libraries?

     
    • Janko Stamenović

      No, I'm just learning what one has to do here. In some projects regtests take days or weeks on many cpus, so I haven't thought that it's realistic to try this on a notebook. How long does that take on a few cores? Thanks!

       
      • Philipp Klaus Krause

        A Raspi 4 running FreeBSD from an SD card, building SDCC + full regression test (i.e. all ports, not just the z80-related ones) takes 20 hours.

        I'll quickly run a time nice make -j 18 test-z80 on my laptop now, and post how long it took.

        P.S.: It took 48 min.

         
        👍
        1

        Last edit: Philipp Klaus Krause 2024-06-12
        • Janko Stamenović

          Is this a typo "-j 18"? Or "laptop"? Or do I live in the past? I can't understand both in the same sentence :)

           
          • Philipp Klaus Krause

            It's an old laptop with a Ryzen 4800H CPU from more than 4 years ago, the Ryzen 4800H has 8 cores with SMT2, so 16 hardware threads. For -j, I add two so the CPU has something to do when a software threads is waiting for disk I/O, and end up with -j 18.

             
            👍
            1
            • Janko Stamenović

              Immensely different from my "computing" resources, this will take a while: "Ryzen 7 4800H outperforms Celeron N3150 by 1459% in Passmark." My notebook is faster but unsuitable for longer "burn" periods, already needed service after that. (That's why I target Z80 as a hobby :) )

               
              • Philipp Klaus Krause

                I agree that "make test-z80" can still take substantial time. It runs the regression tests for 13 z80/z80-related configurations.

                Often, to get a rough idea quickly after having made a change locally, I just run one or two of these (and later do a more extensive test before committing changes). E.g. the "make test-ucz80" target runs the tests for the z80 port with default options. "make test-ucz80-resiy" does so for the z80 port with --reserve-regs-iy, "make test-ucr3ka" runs the tests for the r3ka port with default options, etc.

                For the Ryzen 4800H, "nice make -j 18 test-ucz80" runs in less than 4 min.

                 
                👍
                1
  • Philipp Klaus Krause

    I've had a quick look at the patch, and the current implementations, and noticed the missing pop de for the Rabbits in the current implementations, which could result in wrong return values when the second argument to longjmp was neither 0 nor 1. Fixed in [r14883].

     
    👍
    1

    Related

    Commit: [r14883]

  • Philipp Klaus Krause

    I think when that when the default calling convention for the z80-related ports changed, the setjmp/longjmp asm code was adapted to it, but no one bothered to ensure that the code was optimal for the new convention (after all it was already more efficient than with the old one, so there was an improvement).
    Thanks for looking into this.
    Regarding that pop de: we currently use two different calling conventions for z80-related ports (see section 4.3, "The Z80, Z180, Rabbit 2000, Rabbit 2000A, Rabbit 3000A, SM83
    (GameBoy), eZ80, TLCS-90 and R800 ports" of the manual). For some ports, a second 16-bit argument is on the stack, and can thus be fetched into de by pop af, pop de.

     
    👍
    1
    • Janko Stamenović

      That is exactly what I have thought: the new calling convention rocks.
      As I assume you are a designer of __sdcccall(1), thanks for that change! I've got my first retro computer only start of this year, and was really glad once I saw that possibility for a better code than what existed longer, and now I'm making a new "development kit" just for the "dot commands" on Next that will use SDCC with __sdcccall(1), which is still a different approach from the existing solutions which keep depending on the __sdcccall(0).
      I started with a single application which would use __sdcccall(1) and now I'm repacking the functions to a few lib files to make them more generally useful. For what I do, I want to keep the interrupts enabled, so I have to use "reserve IY" consistently. If setjmp is changed, then, for what I do, only two .s functions (not counting that for jump) left in SDCC which use IY are memcpy and memmove (not counting jump iy function which won't be used).
      Maybe you know, just curios: is there a platform where the sequence pop af``pop bc``push af(to fetch the parameter which is on the stack) would destroy the return address which is to be used later? As far as I know, at least on Next that's safe, e.g. I saw that in __ultobcd.s which I have also called with o problem on Next from my code, and I've also seen that there's a custom memcpy variant which uses AF as 16-bit register through which copying via pop and pushes is done.
      Regarding Rabbit & c conventions: after your answer (and [r14883]) I see that I wasn't careful enough when reading the manual about the calling conventions of these other ports, many thanks!

       

      Related

      Commit: [r14883]

      • Philipp Klaus Krause

        The pop af pop bc push af sequence will destroy the return address for sm83 (AFAIK the unused flags on the Gameboy are hardwired to 0). For tlcs90, it will not destroy the return value, but mess up interrupt handling (the TLCS-90 has the interrupt flags in the flag register - this also affected the tlcs90 longjmp, which I just fixed in [r14885] - unfortunatly the tlcs90 port has so few users that it is easy to break something unnoticed).

         
        👍
        1

        Related

        Commit: [r14885]

        • Janko Stamenović

          Now, if I see correctly, that means that the patches can have only two versions of the file, as there are only two variants if pop de pop deis used everywhere where that DE has to be filled with the parameters.
          I've run the regtests with your changes, now it's in progress with my updated patch, if it works, I think I'll post here the new version tomorrow.

           

          Last edit: Janko Stamenović 2024-06-16
          • Janko Stamenović

            Is there somewhere a reference output of the reg tests? If not, can you, please, for my reference, when you're in the directory where "regression" is a subdirectory, post your current output of:

            find . -name '*.log' | grep jmp | xargs -I {} grep -Hn . {} | sort | sed "s|./regression/results/||"
            

            sometime before you delete the individual log files? (I've deleted these from my "before" run by starting the new run, before I knew that these are the most convenient for comparison as the matching paths could not be reconstructed in the redirected output of the run).

            What I'm getting up to now with my changes and that command:

            ucz180-resiy/tst_longjmp.log:1:tst_longjmp                         (f:  0, t:   4, c:  1, b:   1060, T:    24958)
            ucz180-resiy/tst_setjmp.log:1:tst_setjmp                          (f:  0, t:   3, c:  1, b:   1075, T:    24227)
            ucz180/tst_longjmp.log:1:tst_longjmp                         (f:  0, t:   4, c:  1, b:   1096, T:    24926)
            ucz180/tst_setjmp.log:1:tst_setjmp                          (f:  0, t:   3, c:  1, b:   1110, T:    24187)
            ucz80-resiy/tst_longjmp.log:1:tst_longjmp                         (f:  0, t:   4, c:  1, b:   1058, T:    24958)
            ucz80-resiy/tst_setjmp.log:1:tst_setjmp                          (f:  0, t:   3, c:  1, b:   1073, T:    24227)
            ucz80-undoc/tst_longjmp.log:1:tst_longjmp                         (f:  0, t:   4, c:  1, b:   1092, T:    24926)
            ucz80-undoc/tst_setjmp.log:1:tst_setjmp                          (f:  0, t:   3, c:  1, b:   1106, T:    24187)
            ucz80/tst_longjmp.log:1:tst_longjmp                         (f:  0, t:   4, c:  1, b:   1092, T:    24926)
            ucz80/tst_setjmp.log:1:tst_setjmp                          (f:  0, t:   3, c:  1, b:   1106, T:    24187)
            
             

            Last edit: Janko Stamenović 2024-06-17
            • Maarten Brock

              Maarten Brock - 2024-06-17

              The daily regression test log can be found on the snapshot page behind the red or green diamond in the right column under RT.
              https://sdcc.sourceforge.net/snap.php

               
              👍
              1
              • Janko Stamenović

                The new version of the patch (edit: current (>=r14885) --> my changes) attached.

                The regression tests results for tst_*jmp:

                ez80-z80/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1090, T:    25558)
                ez80-z80/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1110, T:    24699)
                tlcs90/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1293, T:    19896)
                tlcs90/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1306, T:    19096)
                ucr2k/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1123, T:    17122)
                ucr2k/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1138, T:    16568)
                ucr2ka/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1125, T:    17122)
                ucr2ka/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1140, T:    16568)
                ucr3ka/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1119, T:    16904)
                ucr3ka/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1134, T:    16290)
                ucr800/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1096, T:     6356)
                ucr800/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1110, T:     6111)
                ucz180-resiy/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1060, T:    24958)
                ucz180-resiy/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1075, T:    24227)
                ucz180/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1096, T:    24926)
                ucz180/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1110, T:    24187)
                ucz80-resiy/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1058, T:    24958)
                ucz80-resiy/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1073, T:    24227)
                ucz80-undoc/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1092, T:    24926)
                ucz80-undoc/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1106, T:    24187)
                ucz80/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1092, T:    24926)
                ucz80/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1106, T:    24187)
                ucz80n/tst_longjmp                         (f:  0, t:   4, c:  1, b:   1092, T:    24926)
                ucz80n/tst_setjmp                          (f:  0, t:   3, c:  1, b:   1106, T:    24187)
                
                 

                Last edit: Janko Stamenović 2024-06-18
                • Philipp Klaus Krause

                  Thanks. I'll have a closer look next week. For now just two small comments:
                  * I prefer the more verbose comment style from the current implementations.
                  * You mentioned above "the patches can have only two versions of the file". It might still be possible that they can be made more efficient using instructions only available on some of the z80-related architectures or by considering timing differences. So if you do need top performance on these functions for a particular architecture, it might make sense to look into that. For details on the instruction sets one would have to look at manuals, but we have an (incomplete) summary at https://sourceforge.net/p/sdcc/wiki/Z80%20instruction%20timing/

                   
                  👍
                  1
                  • Janko Stamenović

                    Regarding 1 (comments) I have no problem in rewording the comments however you want, I've already commented every necessary point, but I was myself unsure of how to name what. Correct me if I'm wrong in any of the statements that follow:

                    • Maybe "IX" is more correct than the "frame pointer"? If I understand, even if SDCC uses IX at the moment only for the frame pointer, it doesn't have to: from the point of view of setmp/longjmp IX is only the register that has to be preserved, but SDCC could be well enhanced to use IX for some other purposes during the life of the function which can be jumped to, while still depending on the convention that it will be preserved. It can already be used so in the hand written assembly, correct?

                    • Re: "fetch spval" -- there writing "stack pointer" would suggest "the content of SP" when what is read is "the content of the second word of the three word structure used by setjmp and longjmp." It would be good if the structure has the description with their names, and then to refer across the setjmp / longjmp sources using these names?

                    E.g.

                    struct {
                         uint16_t retaddr;
                         uint16_t ix_val;
                         uint16_t spval;
                    }
                    

                    Once the names of the elements of the structure exist, it's easier to follow what is being done when. It wasn't necessary as the code was more direct with

                        ld  2(iy), l
                        ld  3(iy), h
                    

                    but now "spval" (from that instance of that structure, I mean) is actually read to DE pushed as DE popped as HL and then loaded to SP. I think documenting where the value at which moment is helps following what's going on -- the intention behind some push of one register pair and pop to another I think should be written in not immediately obvious hand written assembly?

                    That was an intention behind my comments using "spval" here:

                    ; fetch  spval
                    ld  e, (hl)
                    inc hl
                    ld  d, (hl)
                    inc hl
                    ; save  spval
                    push    de
                    
                    ...
                    
                    ; restore spval
                    pop hl
                    ; restore retval
                    pop de
                    ; adjust the stack
                    ld  sp, hl
                    pop hl
                    

                    But as you are maintainer you should decide what would be for you ideal content of the comments and I can't and won't object to that at all!

                    Regarding 2 (instruction selection):

                    I think that code is already faster than before on every "port", and if I understand the timings for those for which I've changed the routines, none had made "worse" tradeoffs for the existing instructions, just improved the slower ones or added some additional ones, which I don't think could be used in these specific examples, but are certainly interesting in different scenarios.

                    Tangentially to that instruction selection, my goal was to have these two routines in a form usable for "reserve IY". Before that they could not have been used, but I see that there is ucz80-resiy target for regtests, are these tracked somewhere? Grepping in the z80 I find 4 routines still using IY, does ucz80-resiyavoid them?

                    __mulsint2slong.s
                    __sdcc_call_iy.s
                    memcpy.s:39
                    memmove.s:36
                    

                    Even more importantly, the compiler allows for --reserve-regs-iy but I think the libraries packed in the distribution aren't "reserve iy" friendly, that is, they are compiled to use IY. I'm compiling my own library from the sources, I guess that is an expected way to use such option. But maybe, alternatively, automatic library selection code could be a part of SDCC code, just like now the "port" library is selected (e.g. z80.lib or z80n.lib) shouldn't it be "built in" in SDCC to also search for "z80-resiy.lib" or "z80n-resiy.lib" when the "resiy" option is active in the linking phase? Maybe it already exists and I just don't know about that?

                    (I would also like to use "resiy" instead of "--reserve-regs-iy" as I write btw, and I also note that I more often have to refer to the "code compiled when the --reserve-regs-iy option is not used" for which I'd also like some short and obvious way to refer to, maybe "no-resiy"?)

                    Once that selection of library is automatic, it would be good to have different implementations for resiy and no-resiy of these functions where using IY is actually advantageous for no-resiy target (I think in setjmp / longjmp it wasn't, but it seems to me (I haven't thoroughly checked, just an impression) that these memcpy.s and memmove.s are still a few cycles better with IY, so it would not be good to have only resiy version of them).

                    What would be a good approach for improvements regarding resiy libraries?

                     
                    • Philipp Klaus Krause

                      Yes, we do not by default build libraries for all possible configurations, only for the most common ones.
                      So the first question about a --reserve-regs-iy library is: for which ports is there sufficient user demand?
                      AFAIK, historically by far most users of --reserve-regs-iy did not use sdcc directly, but use the sdcc fork in z88dk. AFAIK z88dk only uses the compiler from SDCC, combined with a different preprocessor, assembler, linker and library.

                       
                      • Janko Stamenović

                        My question was about the "automatic" selection of the library when resiy is a command line option, not about "building by default". I saw that there are tests in sdcc. My question was: if it is reasonable for a "standard" library selection to be somehow made with the awareness of that '--reserve-regs-iy' option in any way?

                        I'm producing my own "dk" to compile for some z80n targets which avoids that "big" dk mentioned by you (And that "big", if I understand correctly, has a license under which I can't use any sources from it in my own projects unless I use that whole dk? So I'm doing it completely independently anyway, using pure sdcc -- another decision I don't like in that big dk is them keeping __stdccall(0) a default).

                        For sdcc: I know I can say "nostdlib" and explicitly specify my alternative build of z80n built with --reserve-regs-iy, that's what I'm doing locally already. And I can distribute that alternative build if I'd like, I'm aware of that too, with all GPL licenses included. But I wanted to know if there is a way or if it would be reasonable for that resiy information to be in any way present outside of the compilation phase, for a linker to be able to select the "right" library somehow similarly like there is already that -z80n-resiy target during that make.

                        I fully understand if the best solution right now is to not change sdcc in any way.

                         
                        • Philipp Klaus Krause

                          The automatic selection should be doable: e.g. for mcs51 and pdk15 we doe have a separate --stack-auto library, which is used if that option is specified at link time.
                          While --reserve-regs-iy is a less serious difference (--stack-auto changes the ABI, --reserve-regs-iy does not), we could still do the same here.

                           
                          👍
                          1
                          • Janko Stamenović

                            Against adding that library selection mechanism is: if it's there, but if SDCC doesn't distribute e.g. "z80n-reserve-regs-iy/z80n.lib" then the linking in this case would fail. But at the moment even if the linking doesn't fail, the resulting binary would crash on the real systems which expect the IY to remain unchanged, so from the users' perspective, if the binary package didn't work immediately after unpacking before, due to the lack of the pre-built "reserve" lib, after adding the feature it can remain the same: they'd need to build such a library themselves. From that perspective, it frees SDCC from an obligation to distribute new binary precompiled libraries.
                            The official releases do have lib/src directory and the possibility for the user to build the libraries locally. Having the separate names would just allow the user to have both versions on his system and for the "right one" to be selected automatically, which is maybe better than now? So that could be good enough: not distributing the binary libs but having the better selection infrastructure there?
                            But, one more argument against the change: adding that change could mean that the existing SDCC build scripts and make-related infrastructure would have to be changed to be aware of such new naming conventions. Maybe is that trickier than adding the get_model functions for z80 ports for which, it seems, everything that should use them in the rest of SDCC*.c files is "already there" (as far as I see) because it is already used for other ports?
                            I also don't know which other ports, except for z80 and z80n, would ever need --reserve-regs-IY. Specifically: ZX Spectrum, at least some of its clones and ZX Spectrum Next do need to not let the user's code change IY if it should run when the default interrupts are enabled and/or the ROM routines are being called. Maybe there would be never a practical need for adding get_model and the -reserve-regs-iy lib search for more than these two ports? I don't know if other platforms need the same convention.
                            Those writing "pure" games also often decide in advance that they don't want to use anything from the system and will block the interrupts, so they can use IY on that even on these platforms, but I'm targeting for the programs that work with the system. So that is maybe a specific use-case where a single user could use both versions: it's not just a processor type but the target environment of the user's code.
                            Personally, I don't make any games, and at the moment I do nostdlib and rebuild the whole library myself, but as I've started my development, using and linking of strlen and some other "default" function simple enough "worked" just because exactly these don't use IY. :) Only when I started to use more I saw that I do have to build the lib myself and enforce nostdlib.
                            Regarding the number of users: just checked: that "no crash" version of sdasz80.exe for W64 was downloaded "whole" two times since I've made it over the weekend. :)

                             

                            Last edit: Janko Stamenović 2024-06-24
                            • Maarten Brock

                              Maarten Brock - 2024-06-25

                              If IY is used so little in the libs, we might as well choose to always build them with --reserve-regs-IY.

                              We can also choose to place the text "--reserve-regs-IY" or something similar in the .optsdcc to enforce compatibility. This will prevent linking objects with mixed reserve-regs settings.

                               
                              • Philipp Klaus Krause

                                In all currently support calling conventions, IY is never used to pass arguments or return values. So code built with --reserve-regs-iy is compatible with code built without it.
                                IMO, choosing the lib depending on --reserve-regs-iy being specified at link time would be fine. IMO, the question is if it is worth having an extra library, and if yes, for which ports. I guess z80 is the most obvious candidate.

                                 
                    • Philipp Klaus Krause

                      For which ports would you use a library built with --reserve-regs-iy, if it was available now? z80? z80n? Any other?

                       
                      • Janko Stamenović

                        This week I've published my "--reserve-regs-iy," rebuild of the SDCC's z80n.lib inside of my "development kit" for NextZXOS dot files (that is z80n target):

                        https://gitlab.com/jankos-projects/

                        and there: Jankos_DNCDK and Jankos_DNCDK_sources

                        "Dot commands" are executables which would be equivalent to .COM programs on the CP/M or DOS, but they have more limitations: they get only one 8KB page initially at 0x2000, the command line and stack could be anywhere, but then there are additional preconditions for using the API calls, as these cause different banks to be switched etc.

                        I don't know what's the state of other systems having Z80-like CPUs, regarding the "reserve" registers: I know only about classic ZX Spectrum and all the clones (which would be the z80 target) and ZX Spectrum Next (z80n target). These would use --reserve-regs-iy for the programs compiled with the expectation to use the system calls and to keep the default interrupt routine. But when the program is written with an intention to take over the machine completely and never return to the system, like game writers typically did, then IY would be treated as "free to be used".

                        In this rebuild of mine I also resurfaced the printf_small with some small changes (it didn't compile for z80)

                        https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_small_mod.c?ref_type=heads

                        https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_small_compat.c?ref_type=heads

                        and I've also changed the printf_large

                        https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_large_mod.c?ref_type=heads

                        to use byte sized widths instead of the integers, reducing the size of it for around 1KB, which is a lot, as the goal of that development kit are "dot" files for NextZXOS, where, even if the system can page a lot of newly allocated pages in, a "page" is 8 K and the dot files get only one page initially, so efficient use of the initial 8K page makes a difference. The kit is currently designed for exactly one code and one data page (each 8K), everything else is the responsibility of the user.

                        I've also let printf_large compile with much higher --max-allocs-per-node than what is I think current default. I've posted about my experiments with printf_large some graph in some other message, 2 - 3 millions gets better result but compilation takes really a lot of time, so at the moment I also keep the .s file for faster library rebuilds:

                        https://gitlab.com/jankos-projects/Jankos_DNCDK_sources/-/blob/main/jj_z80n/printf_large_mod-2048000.s?ref_type=heads

                         

                        Last edit: Janko Stamenović 2024-07-10
1 2 > >> (Page 1 of 2)

Log in to post a comment.