Menu

#2498 wrong code for inline function with __z88dk_fastcall

closed-fixed
Ben Shi
None
Front-end
5
2016-04-29
2016-04-27
Kio
No

passing arguments to and storing the return value of a fastcall inline function is completely not handled.
see attached file.

1 Attachments

Discussion

  • alvin

    alvin - 2016-04-27

    The generated code looks correct to me?

    Maybe point out version, compile line and where it goes wrong.

        call    _fii
        ld  hl,6
        add hl,sp
        ld  sp,hl
    ;; call to fii cleaned up above
    
    ;; inline swap_words() here
        ;; (void)n;  ignored
        ex  hl,de   ;; incorrect asm inlined by swap_words()
    
    ;; return dehl=n
        ld  l,(ix-4)
        ld  h,(ix-3)
        ld  e,(ix-2)
        ld  d,(ix-1)
        ld  sp, ix
        pop ix
        ret
    

    Edit: Ah, I think you want to swap the endianness of the long. You can't reliably do that with inlined asm because the compiler generates code around inlined asm. In this example it's decided it doesn't need to load "n" into registers prior to your inlined ex. Also inlined asm interferes with sdcc's peephole optimizer (it can't look into the asm block to decide what registers are live) unless you tell sdcc to look at inlined asm. Even when you do that sdcc expects registers holding values prior to the asm block still retain their values after the inlined asm block. Ie, you may not know what registers sdcc has in play when your inlined asm runs and you're not allowed to change them. Obviously for short bits of code like "ex de,hl" or inlined asm at the beginning of the function this doesn't matter too much.

    We solved this by introducing intrinsics for sdcc in z88dk. Instrinsics are C-like functions that inline individual z80 instructions. We don't have one for "ex de,hl" although I may add it now since it does seem useful.

    A couple of examples:

    In "instrinsic.h" function intrinsic_di() is prototyped:

    extern void intrinsic_di(void) __preserves_regs(a,b,c,d,e,h,l);
    

    This tells the C compiler this function returns nothing, takes no arguments and changes no register values. This almost makes a call to intrinsic_di() invisible to the code generator.

    When used the compiler will output a "call _ intrinsic_di" (extra space to foil formatting) and in the peephole rules we add this rule:

    replace restart {
        call    _intrinsic_di
    } by {
        di
        ; peephole z88dk-intrinsic-di
    }
    

    which replaces the call with a di first thing. This means the peephole optimizer can operate per normal, seeing the di instruction as it is.

    For swapping endianness of a long, I would probably introduce a function that takes an argument and returns a value.

    extern unsigned long intrinsic_swap_long(unsigned long n) __z88dk_fastcall __preserves_regs(a,b,c,d,e,h,l);
    

    The compiler will produce a call "call _ intrinsic_swap_long" with dehl = n. In the peephole rules this should be added at the front:

    replace restart {
        call    _intrinsic_swap_long
    } by {
        ex de,hl
        ; peephole z88dk-intrinsic-swap_long
    }
    

    Then in your example code, instead of this:

    unsigned long foo(void* f)
    {
        unsigned long n;
        fii(f,&n,4);
        return swap_words(n);
    }
    

    You could do this:

    unsigned long foo(void* f)
    {
        unsigned long n;
        fii(f,&n,4);
        return intrinsic_swap_long(n);
    }
    

    Actually this (intrinsics) is probably something that should be adopted by sdcc but I'm not sure how far sdcc wants to go in making one port's library different from the others.

     

    Last edit: alvin 2016-04-27
    • Kio

      Kio - 2016-04-28

      Sorry that i left out some of the standard information, i just thought it's so obvious…

      I tested it with a recently downloaded beta of 3.6:
      SDCC : mcs51/z80/z180/r2k/r3ka/gbz80/tlcs90/ds390/pic16/pic14/TININative/ds400/hc08/s08/stm8 3.5.5 #9574 (Mac OS X i386)

      command invoked:
      sdcc -mz80 foo.c

      the called inlined function:

      inline 
      unsigned long swap_words(unsigned long n) __z88dk_fastcall
      {
          (void)n;
          __asm__(" ex hl,de\n"); 
      }
      

      the calling function:

      unsigned long foo(void* f)
      {
          unsigned long n;
          fii(f,&n,4);
          return swap_words(n);
      }
      

      when i compile this without inline, the relevant calling code is:

      ;foo.c:20: return swap_words(n);
          ld  iy,#0
          add iy,sp
          ld  l,0 (iy)
          ld  h,1 (iy)
          ld  e,2 (iy)
          ld  d,3 (iy)
          call _swap_words
          pop af
          pop af
          ret
      

      This loads the value into dehl, calls swap_words which accepts it's argument in dehl and returns it's return value in dehl and directly returns this register contents itself. (omitting a big chance for optimization to pop hl : pop de : jp _ swap_words instead of this all.)

      With inline i expect it to behave mostly identical: provide argument in dehl and accept return value in dehl. Just replace the call with the inlined code. The code should be similar to this:

      ;foo.c:20: return swap_words(n);
          ld  iy,#0
          add iy,sp
          ld  l,0 (iy)
          ld  h,1 (iy)
          ld  e,2 (iy)
          ld  d,3 (iy)
          ex  de,hl
          pop af
          pop af
          ret
      

      The compiler instead actually generates:

      ;foo.c:6: __asm__(" ex hl,de\n"); 
          ex  hl,de
      ;foo.c:20: return swap_words(n);
          ld  iy,#4
          add iy,sp
          ld  l,0 (iy)
          ld  h,1 (iy)
          ld  e,2 (iy)
          ld  d,3 (iy)
          pop af
          pop af
          pop af
          pop af
          ret
      

      It does not put the argument in dehl and it does not use the returned dehl. Instead it returns the uninitialized contents of the temp cell. (i believe it's the temp cell.)

      Ah, I think you want to swap the endianness of the long. You can't reliably do that with inlined asm because the compiler generates code around inlined asm. In this example it's decided it doesn't need to load "n" into registers prior to your inlined ex. Also inlined asm interferes with sdcc's peephole optimizer (it can't look into the asm block to decide what registers are live) unless you tell sdcc to look at inlined asm. Even when you do that sdcc expects registers holding values prior to the asm block still retain their values after the inlined asm block. Ie, you may not know what registers sdcc has in play when your inlined asm runs and you're not allowed to change them. Obviously for short bits of code like "ex de,hl" or inlined asm at the beginning of the function this doesn't matter too much.

      I could not find an explanation of restrictions and requirements for assembler in sdcc.

      I think you cannot require that inline assembly doesn't touch any register, "because you don't know what the compiler currently uses them for." The same could hold true to access to local variables: You don't know for sure how to access them: might be in register, might be on stack. might be additional data pushed between. After that there is not much left what you can do in assembly.

      So instead you look what the compiler does and try not to interfere.

      On the other hand i'd actually expect that inline assembly breaks some optimization by the compiler. I don't expect the compiler to understand what arbitrary assembly does. Better it doesn't try to be too smart. Primary goal is correct code.

      ... Kio !
      
       
      • alvin

        alvin - 2016-04-28

        (when not inlined)

        This loads the value into dehl, calls swap_words which accepts it's argument in dehl and returns it's return value in dehl and directly returns this register contents itself.

        (when inlined)

        With inline i expect it to behave mostly identical: provide argument in dehl and accept return value in dehl. Just replace the call with the inlined code. The code should be similar to this:

        ;foo.c:20: return swap_words(n);
            ld  iy,#0
            add iy,sp
            ld  l,0 (iy)
            ld  h,1 (iy)
            ld  e,2 (iy)
            ld  d,3 (iy)
            ex  de,hl
            pop af
            pop af
            ret
        

        The compiler instead actually generates:

        ;foo.c:6: __asm__(" ex hl,de\n"); 
            ex  hl,de
        ;foo.c:20: return swap_words(n);
            ld  iy,#4
            add iy,sp
            ld  l,0 (iy)
            ld  h,1 (iy)
            ld  e,2 (iy)
            ld  d,3 (iy)
            pop af
            pop af
            pop af
            pop af
            ret
        

        It does not put the argument in dehl and it does not use the returned dehl. Instead it returns the uninitialized contents of the temp cell. (i believe it's the temp cell.)

        Inlined code is effectively copied and pasted to the location of its invocation. This allows the compiler to optimize with the inlined code in place and that's what happens here. The compiler decides no code needs to be generated before your asm "ex hl,de" so your inlined asm does not get an opportunity to swap words. I think this is desired behaviour.

        I could not find an explanation of restrictions and requirements for assembler in sdcc.

        This comes from experience after staring at many thousands of lines of compiler output on the z80 :) When I have some time I will add some examples here that demonstrate these behaviours.

        It's understandable why these restrictions occur.

        The code generator knows nothing about the inlined asm (or any assembly language since it's operating on intermediate code) so it must regard it as a black box and code as per normal around it.

        The peepholer is not normally allowed to look into inlined asm so it must assume all registers are live and this prevents it from doing some optimizations. If you allow the peepholer to look into the inlined asm (--peep-asm) the inlined asm must be formatted correctly (correct whitespace included) and the peepholer will be allowed to transform the inlined asm, which is quite often not what is wanted.

        I think you cannot require that inline assembly doesn't touch any register, "because you don't know what the compiler currently uses them for." The same could hold true to access to local variables: You don't know for sure how to access them: might be in register, might be on stack. might be additional data pushed between. After that there is not much left what you can do in assembly.

        Yes that's right you can't know these things. Inlined asm should only be used to insert a small amount of code in places where you know these problems won't occur. If you're doing something non-trivial in asm, the function should be written in asm not in C.

        You can insert code that has no side effects to registers anywhere ("di" and "ei" would be very common examples), you can insert code at specific locations in the C that you know won't suffer these problems or you can be careful with your C such as using volatile where appropriate to make sure variables are written or read from memory before the inlined asm is encountered.

        Having said all that, the intrinsics I mentioned allows you to do exactly what you want and I think this is something that sdcc should probably adopt.

        Here's how I solved the swap words problem:

        In your C code:

        // prototype below hidden in "intrinsic.h"
        
        extern unsigned long intrinsic_swap_word_32(unsigned long n) __preserves_regs(a,b,c,d,e,h,l) __z88dk_fastcall;
        
        ....
        
        unsigned long foo(void* f)
        {
            unsigned long n;
            fii(f,&n,4);
            return intrinsic_swap_word_32(n);
        }
        

        In the peephole file a rule to deal with intrinsic swap:

        replace restart {
            call    _intrinsic_swap_word_32
        } by {
            ex  de,hl
            ; peephole z88dk-intrinsic-swap-word-32
        }
        

        And the generated output:

        _foo:
            push    ix
            ld  ix,0
            add ix,sp
            push    af
            push    af
            ld  hl,0x0000
            add hl,sp
            ld  bc,0x0004
            push    bc
            push    hl
            ld  l,(ix+4)
            ld  h,(ix+5)
            push    hl
            call    _fii
            ld  hl,6
            add hl,sp
            ld  sp,hl
        ;; return intrinsic_swap_word_32(n);
            ld  e,(ix-4)
            ld  d,(ix-3)
            ld  l,(ix-2)
            ld  h,(ix-1)
        ;; "ex de,hl" was here but was propagated upward by the
        ;; peepholer changing the indexed register loads.
        ;; Eg, "ld l,(ix-4)" was changed to "ld e,(ix-4)" above
            ld  sp, ix
            pop ix
            ret
        

        The generated code before peephole optimization so you can see the "ex de,hl" was placed correctly as advertised:

        _foo:
            push    ix
            ld  ix,0
            add ix,sp
            push    af
            push    af
            ld  hl,0x0000
            add hl,sp
            ld  c,l
            ld  b,h
            ld  hl,0x0004
            push    hl
            push    bc
            ld  l,(ix+4)
            ld  h,(ix+5)
            push    hl
            call    _fii
            ld  hl,6
            add hl,sp
            ld  sp,hl
            ld  l,(ix-4)
            ld  h,(ix-3)
            ld  e,(ix-2)
            ld  d,(ix-1)
            ex  de,hl
            ld  sp, ix
            pop ix
            ret
        
         
  • Ben Shi

    Ben Shi - 2016-04-28

    Yes. Please explain your expected result, or check if alvin's explaination help.

     
  • Ben Shi

    Ben Shi - 2016-04-28

    I thought you might be confusing with the __z88dk_fastcall.

    __z88dk_fastcall is a kind of calling convention which means it only makes sense while the calling really happen.

    If a function is inlined, then the calling does not happen, and talking about calling convention make no sense.

    Maybe sdcc should rise a warning about inlined __z88dk_fastcall functions and then omit the __z88dk_fastcall suffix.

     

    Last edit: Ben Shi 2016-04-28
    • alvin

      alvin - 2016-04-28

      I think it actually works Ben. We don't see an example of actually using the parameter passed via inlined fastcall here but I will try it in a bit and see if sdcc does things properly.

       
  • Ben Shi

    Ben Shi - 2016-04-28

    sdcc processes the __z88dk_fastcall in the back end while does the inlining work in the arch-independant front end, where it omit the __z88dk_fastcall attribute.

    So a warning should rise.

     
  • Ben Shi

    Ben Shi - 2016-04-28

    I have added a warning for that in revision #9581.

     
  • Kio

    Kio - 2016-04-28

    So it doesn't work because it doesn't work. It's an unsupported combination of features and sdcc could please raise an error in such a case.

    I think this is desired behaviour.

    not really... B-)

    You can insert code that has no side effects to registers anywhere ("di" and "ei" would be very common examples),

    the sdcc manual shows how to replace the entire body of a function with inline assembly.
    ei and di are supported by _ _ critical in c (though this is broken sometimes).

    If you're doing something non-trivial in asm, the function should be written in asm not in C.

    Unluckily assembler functions can't be inlined. But the options with z88dk_fastcall and z88dk_callee are already quite usable for assembler.

    So then i'll have to call swap_words().

    ... Kio !
    
     

    Last edit: Kio 2016-04-28
    • Philipp Klaus Krause

      z88dk_fastcall is a calling convention. inlined functions do not have a calling convention. Using z88dk_fastcall for a function declared inline only makes sense in one case: inline without static. The one where the programmer is required to provide a separate non-inline definition of the function for cases where the compiler cannot inline the function. Then when a call is not inlined, z88dk_fastcall woulöd apply.

      Philipp

       
    • alvin

      alvin - 2016-04-28
      You can insert code that has no side effects to registers anywhere ("di" and "ei" would be very common examples),
      

      the sdcc manual shows how to replace the entire body of a function with inline assembly.
      ei and di are supported by _ _ critical in c (though this is broken sometimes).

      Yes the code is incorrect for nmos z80s so it's not really usable. This can be fixed but sdcc does not currently have a way to distinguish whether it is generating code for a cmos z80 or an nmos z80. The nmos fix is bulkier and I'm not sure if it's a good idea to inline that everywhere. Our solution is to look for the critical code pattern in the peepholer and replace it with a call to a library function; the library knows whether it was built for a cmos z80 or an nmos z80. The cost is call overhead but the code is correct.

      The semantics of _ _ critical is not always what is wanted either. critical statements open with an effective "di" and always terminate with an effective "ei". Sometimes you don't want both. In real code, users quite often don't care what the current interrupt enable state is so they only want a bare "di" or "ei" and not the extra state code critical supplies.

      Anyway if you're going to have a construct like _ _ critical, what sdcc does is probably the most appropriate choice.

      The problem with inlining "di" and "ei" is that inlined asm foils the peephole optimizer so you end up with less optimal code around them. That's what led to the idea of intrinsics and the first one tried was "intrinsic_di()" which did the same inlining job but without any drawbacks.

      There is also a downside for sdcc's recommended method for defining asm functions. You lose control of placement of asm code and data in specific areas. The compiler is opening and closing areas as it generates code so you can't arbitrarily assign code and data to areas within these C-wrapper asm functions. This doesn't affect most people but it does have an effect on programmers who want to place code and data in specific memory banks or at specific memory addresses. The best way to define an asm function is to write it in a separate asm file and supply the required prototype in a header file included by the C program. This is less convenient though because sdcc's tools can only compile / assemble one file at a time to a rel and then you have to link it altogether into a single executable. It's not a huge deal but it's a lot easier to tell people to inline your asm with a C wrapper in the manual :)

      Unluckily assembler functions can't be inlined. But the options with z88dk_fastcall and z88dk_callee are already quite usable for assembler.

      The mechanism was shown so it's at least possible.

       
  • Ben Shi

    Ben Shi - 2016-04-29
    • status: open --> closed-fixed
    • assigned_to: Ben Shi
    • Category: Z80 --> Front-end
     
  • Ben Shi

    Ben Shi - 2016-04-29

    The conclusion is a function can be either inlined or called with a particular convention. And a warning for a combination of them is added in revision #9581.

     
  • Ben Shi

    Ben Shi - 2016-04-29

    The new warning become more generic in revision #9582.

     

Log in to post a comment.