Menu

#624 Atomics

None
open
nobody
None
5
2024-07-17
2019-04-11
No

SDCC currently does not have any official support for atomics. Some operations on some operands have traditionally been atomic, and we see bugs reports when that changes.

It might make sense to (partially) support atomics as in the C standard:
1) For communication between the main program and interrupt handlers
2) For inter-thread communication on multicore devices (SDCC does not yet support such devices).
3) For inter-thread communication when using some timesharing OS.

The two most basic types are sig_atomic_t and atomic_flag.

To me it seems, we could (as I have done in the post-3.9 branhc) just make sig_atomic_t a typedef to unsigned char.
atomic_flag, on the other hand will require a test_and_set instruction. On many targets, we could implement using an inverted 0/1 value using a right shift with indirect addressing mode. Or we could just use plain exchange, where available.

This would give us atomic_flag on stm8, z80 and related, hc08 and s08. For mcs51, we could have atomic_flag in iram (which seems okay to me, we'd just place every atomic_flag in iram, even if the memory model is different, and emit an error if someone e.g. tries to put it explicitly into xram).
Unfortunately, pdk does not have indirect adressing mode on their right shift and exchange instructions.

The C standard requires atomic_flag to be lock-free. I don't see a way to do this on pdk. On the other hand, being lock-free only has real menaing in interaction with signal handlers. And we could just declare interrupt handlers on pdk to be something other than signal handlers. Though this will greatly reduce the usefulness of atomic_flag on pdk13, pdk14, pdk15.

atomic_flag, in turn, can be used to implement spinlocks, which give us all atomic types (though not in a lock-free version). Atomics that are not lock-free seem only really useful for multithreading, though.

Discussion

1 2 > >> (Page 1 of 2)
  • Philipp Klaus Krause

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,13 +1,13 @@
     SDCC currently does not have any official support for atomics. Some operations on some operands have traditionally been atomic, and we see bugs reports when that changes.
    
    -It might make snese to (partially) support atomics as in the C standard:
    +It might make sense to (partially) support atomics as in the C standard:
     1) For communication between the main program and interrupt handlers
     2) For inter-thread communication on multicore devices (SDCC does not yet support such devices).
     3) For inter-thread communication when using some timesharing OS.
    
     The two most basic types are sig_atomic_t and atomic_flag.
    
    -To me it seems, we coould (as I have done in the post-3.9 branhc) just make sig_atomic_t a typedef to unsigned char.
    +To me it seems, we could (as I have done in the post-3.9 branhc) just make sig_atomic_t a typedef to unsigned char.
     atomic_flag, on the other hand will require a test_and_set instruction. On many targets, we could implement using an inverted 0/1 value using a right shift with indirect addressing mode. Or we could just use plain exchange, where available.
    
     This would give us atomic_flag on stm8, z80 and related, hc08 and s08. For mcs51, we could have atomic_flag in iram (which seems okay to me, we'd just place every atomic_flag in iram, even if the memory model is different, and emit an error if someone e.g. tries to put it explicitly into xram).
    
    • Group: -->
     
  • Philipp Klaus Krause

    We currently support volatile sig_atomic_t for all our ports. It is useful, and should be sufficient for typical use when there is only one hardware thread, and the normal program is being interrupted by an interrupt.

    However, once it gets more complex (i.e. multiple hardware threads or interrupt handlers that can interrupt each other), more atomics are needed for some uses. And even for the simple case of just one interrupt handler, they can be useful to more naturally implement some algorithms.

    In C, an atomic type if "lock-free", if it can be used in signal handlers. For SDCC, I usually consider interrupt handlers to be the closest match to C's signal handlers, and try to make interrupt handlers work in the way C mandates for signal handlers.

    SDCC does not currently support multiple hardware threads, though we might want to do so in the future (in particular for the Padauk ports). But many of our targets have interrupt priorities where interrupt handlers can themselves be interrupted (e.g. mcs51 and z80).

    C mandates that atomic_flag is lock-free. We currently support it well for most ports, and for some have limited support (for mcs51, atomic_flag needs to be in __idata, which is a slight non-compliance if the heap is in __xdata, as then memory from the heap cannot be used as atomic_flag; there are C2Y proposals to allow other character arrays to be used as atomic_flag, which would add to that problem ).

    It would be possible to improve this situation: I reply to my earlier concerns about the implementability of atomic_flag on mcs51 and Padauk for future C version, Hans Boehm proposed a restartable sequence implementation for atomic_flag for the case of a single hardware thread. I've since further refined the approach to eliminate its interrupt latency burden, and generalized it to support more lock-free atomics:

    Still, this works only for a single hardware thread. Let's look at mcs51 for now, where we would need these 3 asm functions

    atomic_exchange_xdata_impl:
    0:  movx a, @dptr
    1a: xch  a, r0
    2a: movx @dptr, a
    3:  ret
    
    atomic_compare_exchange_xdata_impl:
    0:  movx a, @dptr
    1a: cjne a, ar1, 4
    2a: xch  a, r0
    3a: movx @dptr, a
    4:  ret
    
    atomic_compare_exchange_idata_impl:
    0:  mov  a, @r0
    1a: cjne a, ar1, 4
    2a: xch  a, r2
    3a: movx @r0, a
    4:  ret
    

    These would be used to implement atomic_exchange and atomic_compare_and_exchange primitives that in turn would be used to implement atomic_flag, and lock-free atomics of size up to 8, i.e. we'd get lock-free atomic types for char, _BitInt(7), bool, etc. Once we have atomic_exchange and atomic_compare_and_exchange, we can implement all other functionality for atomics.

    At the end of an interrupt handler, we take the return address from the stack, and compare it against all labels marked with labels that end in a above. If we are at one of these, we need to get registers values from the stack, undo the effect of the xch there, and set the stored pc back to corresponding label 0.

    This is clearly an extra overhead, but:
    1) It happens at the end of the interrupt handler, which is less critical than it would be at the beginning.
    2) static analysis could prove that a given interrupt handler does not use some of these, in which case the extra code at the end of the interrupt handler could be left out.

    P.S.: The asm functions only require an atomic store instruction of suitable size. For most ports we only have these up to 8 bits, but on some ports we have 16-bit ones, and use 16-bit pointers. I.e. we could have lock-free atomic pointer types, which would allow the use of a lot of non-blocking data structures. On the other hand, we'd then have to potentially check for even more pc values at the end of signal handlers.

    P.P.S.: It would be good, if we could guarantee that those asm functions are all within a single 256B-aligned block of 256B, so the pc comparison can be done cheaper (i.e compare the upper byte once, then if it matches compare the lower byte against multiple values).

     
    • Philipp Klaus Krause

      Looks like my idea is essentially what NetBSD already did 20 years ago:
      https://www.usenix.org/legacy/events/usenix2003/tech/freenix03/full_papers/mcgarry/mcgarry_html/

      P.S.: The first proposal to use restartable sequences to implement atomics is AFAIK "Fast mutual exclusion for uniprocessors" by Bershad et alii from 1992: https://dl.acm.org/doi/10.1145/143365.143523

       

      Last edit: Philipp Klaus Krause 2024-06-10
    • Maarten Brock

      Maarten Brock - 2024-06-10

      Maybe we should try to keep those restartable functions in a single area and at power-of-two offsets (8?). The return address check can then be against the lower and upper bound and a simple AND operation to restart. Also what to roll back can be easier if the same registers are used at the same offsets.

      For mcs51 things can be more difficult. The pointer can be a generic pointer which first requires finding out the memory space. This can be done in a wrapper that calls the appropriate restartable functions. And the pointer can also point to __pdata which is the default for the medium memory model.

      atomic_exchange_gptr_impl:
          jnb B.6, atomic_exchange_xdata_impl
          mov r0, dpl
          jb  B.5, atomic_exchange_pdata_impl
      ;    sjmp atomic_exchange_idata_impl
      atomic_exchange_idata_impl:
          mov a, r2
          xch a, @r0
          ret
      
      atomic_compare_exchange_gptr_impl:
          jnb B.6, atomic_compare_exchange_xdata_impl
          mov r0, dpl
          jb  B.5, atomic_compare_exchange_pdata_impl
      ;    sjmp atomic_compare_exchange_idata_impl
      
      .bndry 8                                      ; does not work yet!
      atomic_compare_exchange_idata_impl:
      0:  mov  a, @r0                                   ; 1 B
      1a: cjne a, ar1, 4                                ; 3 B
      2a: xch  a, r2                                    ; 1 B
      3a: mov @r0, a                                    ; 1 B
      4:  ret                                           ; 1 B
          nop                                           ; 1 B
                                                        ;=====
                                                        ; 8 B
      atomic_compare_exchange_pdata_impl:
      0:  movx a, @r0                                   ; 1 B
      1a: cjne a, ar1, 4                                ; 3 B
      2a: xch  a, r2                                    ; 1 B
      3a: movx @r0, a                                   ; 1 B
      4:  ret                                           ; 1 B
          nop                                           ; 1 B
                                                        ;=====
                                                        ; 8 B
      atomic_compare_exchange_xdata_impl:
      0:  movx a, @dptr                                 ; 1 B
      1a: cjne a, ar1, 4                                ; 3 B
      2a: xch  a, r2                                    ; 1 B
      3a: movx @dptr, a                                 ; 1 B
      4:  ret                                           ; 1 B
          nop                                           ; 1 B
                                                        ;=====
                                                        ; 8 B
      atomic_exchange_pdata_impl:
      0:  movx a, @r0                                   ; 1 B
      1a: nop nop nop                                   ; 3 B
      2a: xch  a, r2                                    ; 1 B
      3a: movx @r0, a                                   ; 1 B
      4:  ret                                           ; 1 B
                                                        ;=====
                                                        ; 8 B
      atomic_exchange_xdata_impl:
      0:  movx a, @dptr                                 ; 1 B
      1a: nop nop nop                                   ; 3 B
      2a: xch  a, r2                                    ; 1 B
      3a: movx @dptr, a                                 ; 1 B
      4:  ret                                           ; 1 B
                                                        ;=====
                                                        ; 8 B
      

      The mcs51 also has __bit variables which can be accessed with JBC to both test and clear the flag. Unfortunately they cannot be accessed indirectly.

       

      Last edit: Maarten Brock 2024-07-10
      • Philipp Klaus Krause

        Isn't __pdata just a subspace of __xdata, so we don't need separate functions for it?

         
        • Maarten Brock

          Maarten Brock - 2024-06-10

          Well, yes and no. __pdata does lie somewhere in __xdata space. But a (single byte) pdata pointer does not hold the page. And a (3 bytes) generic pointer does not either.

          It is expected instead that the page selection is active with the high byte of s_PSEG in __XPAGE (see crtxclear.asm) for MOVX @Ri to work.

           

          Last edit: Maarten Brock 2024-07-10
      • Philipp Klaus Krause

        Actually, having thought about it a little bit more, it gets much easier, if we never have to roll back, and can just set the pc back. We do that by having a spare copy that we can use to restore register values, e.g.:

        atomic_exchange_xdata_impl:
        0:  mov  r0, ar2
        1:  movx a, @dptr
        2a: xch  a, r0
        3a: movx @dptr, a
        4:  ret
        

        If the pc is at 2a or 3a, we set it back to 0, and the instruction at 0 will do all the work.

         
        • Maarten Brock

          Maarten Brock - 2024-06-10

          R0 should be R3 or higher IMHO. See the pdata and _compare_ variants.

          This will require 16 bytes aligned routines and maybe that would finally be the first opportunity for SDCC to actually emit the XCHD A,@Ri instruction.

              ; roll back PC
              push ar1
              push acc
              mov r1, SP
              dec r1
              dec r1
              dec r1
          ;   mov a,#0xF0
          ;   anl a,@r1
          ;   mov @r1,a
              clr a
              xchd a, @r1    ; clear low nibble of return address
              pop acc
              pop ar1
          
           
          • Philipp Klaus Krause

            To enforce the alignment, I guess we could just emit this at the end of the IVT (prepending nops as necessary to avoid the routines crossing a 256B-boundary)?

             
  • Philipp Klaus Krause

    Based on Maarten's code sample above, I now have this proposal, which should work and be reasonably efficient:

    ; Align the start of these 5 functions to an 8B boundary. Possibly by
    ; putting the code at the end of the IVT. Ensure that no 256B boundary
    ; is crossed.
    
    atomic_exchange_rollback_impl:
    
    atomic_exchange_pdata_impl:
    0:  movx a, @r0                                   ; 1 B
    1a: mov  r3, a                                    ; 1 B
    2a: nop nop                                       ; 2 B
    3a: mov  a, r2                                    ; 1 B
    4a: movx @r0, a                                   ; 1 B
    5:  mov  a, r3                                    ; 1 B
    6:  ret                                           ; 1 B
                                                      ;=====
                                                      ; 8 B
    
    atomic_exchange_xdata_impl:
    
    0:  movx a, @dptr                                 ; 1 B
    1a: mov  r3, a                                    ; 1 B
    2a: nop nop                                       ; 2 B
    3a: mov  a, r2                                    ; 1 B
    4a: movx @dptr, a                                 ; 1 B
    5:  mov  a, r3                                    ; 1 B
    6:  ret                                           ; 1 B
                                                      ;=====
                                                      ; 8 B                                                
    
    atomic_compare_exchange_idata_impl:
    0:  mov  a, @r0                                   ; 1 B
    1a: cjne a, ar2, 4                                ; 3 B
    2a: mov  a, r3                                    ; 1 B
    3a: mov @r0, a                                    ; 1 B
    4:  ret                                           ; 1 B
    5:  nop                                           ; 1 B
                                                      ;=====
                                                      ; 8 B
    
    atomic_compare_exchange_pdata_impl:
    0:  movx a, @r0                                   ; 1 B
    1a: cjne a, ar2, 4                                ; 3 B
    2a: mov  a, r3                                    ; 1 B
    3a: movx @r0, a                                   ; 1 B
    4:  ret                                           ; 1 B
    5:  nop                                           ; 1 B
                                                      ;=====
                                                      ; 8 B
    
    atomic_compare_exchange_xdata_impl:
    0:  movx a, @dptr                                 ; 1 B
    1a: cjne a, ar2, 4                                ; 3 B
    2a: mov  a, r3                                    ; 1 B
    3a: movx @dptr, a                                 ; 1 B
    4:  ret                                           ; 1 B
                                                      ;=====
                                                      ; 7 B
    
    ; These are normal library functions, to be placed by the linker:
    
    ; Store value in r2 into byte at b:dptr, return previous byte at b:dptr in a.
    ; Overwrites r0, r2, r3.
    atomic_exchange_gptr_impl:
        jnb B.6, atomic_exchange_xdata_impl
        mov r0, dpl
        jb  B.5, atomic_exchange_pdata_impl
    atomic_exchange_idata_impl:
        mov a, r2
        xch a, @r0
        ret
    
    ; If the value of the byte at b:dptr is the value of r2, store the value
    ; of r3 into that byte. Return the new value of that byte in a.
    ; Overwrites r0, r2, r3.
    atomic_compare_exchange_gptr_impl:
        jnb B.6, atomic_compare_exchange_xdata_impl
        mov r0, dpl
        jb  B.5, atomic_compare_exchange_pdata_impl
        sjmp atomic_compare_exchange_idata_impl
    
    ; Rollback restartable sequence at end of interrupt handler (version for
    ; non-critical handler and stack in __idata).
    maybe_rollback:
        push ar0
        mov  r0, SP
        dec  r0
        cjne @r0, #>atomic_exchange_rollback_impl, outer_skip
        dec  r0
        cjne @r0, #40, here0
    here0:
        jc   outer_skip
        ; we now know we are somewhere among the restartable implementations of
        ; atomic functions.
        push acc
        mov  a, @r0
        and  a, #0x07
        dec  a
        cjne @r0, #6, here1
    here1:
        jc   inner_skip
        ; we actually need to restart.
        mov  a, @r0
        and  a, #0xf8
        mov  @r0, a
    inner_skip:
        pop acc
    outer_skip:
        pop ar0
        reti
    

    We'll still need a version of the rollback for xstack, and see how we'll actually use those atomic_exchange_gptr_impl and atomic_compare_exchange_gptr_impl from libary functions and codegen. We also need actual compiler support for atomic types.

    IMO, we can put the above into SDCC soon, and start using it for atomic_flag before we introduce support for more atomic types.

     

    Last edit: Maarten Brock 2024-06-12
  • Maarten Brock

    Maarten Brock - 2024-06-12

    For mcs51 the call stack is always in __idata. Only parameters and local variables can go to __pdata for xstack.

    Although atomic_exchange_gptr_impl and atomic_exchange_idata_impl need not be placed at 8 byte offsets, they do need to be close to atomic_exchange_pdata_impl and atomic_exchange_xdata_impl since the jump on bit instructions use relative addressing. I suggest to keep them in the same asm file. The same goes for the _compare_ variants.

    The compare against #40 assumes that this code is 256 bytes aligned, not just 8 bytes aligned.

        cjne @r0, #6, here1
    here1:
        jc   inner_skip
    

    Needs to be:

        cjne a, #6, here1
    here1:
        jnc   inner_skip
    
     
    • Philipp Klaus Krause

      I see. Let's make that #40 into #<atomic_exchange_rollback_impl+40, and put those atomic_exchange_gptr_impl and atomic_exchange_idata_impl before or after the rest (whichever makes sense to keep code mem usage low - depends on the number of interrupt vectors).

       
  • Philipp Klaus Krause

    This first implementation passes mcs51 regression tests.

    Further steps:
    0) Someone else having a look.
    1) Port it to ds390.
    2) Implement basic check for interrupt handlers not using any atomics (on those handlers we then omit the rollback check).
    3) Port it to the pdk ports.
    4) Implement support for the _Atomicqualifier in the front-end, and support more atomic types throughout the compiler (to be done much later).

    P.S.: I think this passed regressions tests for the huge memory model of mcs51 by chance only.

     

    Last edit: Philipp Klaus Krause 2024-07-10
    • Philipp Klaus Krause

      Updated patch. Also contains an attempted port to ds390. Still fails for ds390 and for the huge memory model of mcs51.

       

      Last edit: Philipp Klaus Krause 2024-07-10
      • Philipp Klaus Krause

        ds390 should be fixed now, and the check for handlers that don't use atomics is implemented, too. Still fails regression tests for mcs51 huge memory model.

         
  • Maarten Brock

    Maarten Brock - 2024-07-10

    First review:
    Comment atomic_flag_test_and_set.c - C run-time: C11 atomic flag is wrong for the rollback.
    Leaving out the default areas is dangerous though less so for libraries.
    here0, etc are not proper local labels.
    There is no check for the return address lsb to be >= __sdcc_atomic_exchange_rollback_impl
    jc outer_skip should be jnc I think.
    cjne @r0, #6, here1 should be cjne a, #5, here1
    jc inner_skip should be jnc I think.
    beign is not a word.
    There seems no need for mov a,r3 at the end of each asm block.

    Maarten

     
    • Philipp Klaus Krause

      Thanks for the feedback. Here's a new version of the patch.
      It fixes the mcs51 huge memory model failures, and the issues in your comment, except for these two:

      • I did not change the #6 to #5. My understanding: if we are at start-of-routine+#5, the "pc" points to the instruction that writes the new value, but that instruction has not yet been executed, so no value was written, and we still need to roll back. So we want to check for < 6.
      • I did not remove the mov a, r3. The sdcc_atomic_exchange routines need to return the original value of the atomic object, and the comments say that they do so in a (and atomic_flag_test_and_set currently assumes this). I have no idea if it would be better to specify that the original value is returned in r3 instead.
       

      Last edit: Philipp Klaus Krause 2024-07-11
      • Maarten Brock

        Maarten Brock - 2024-07-11
        • What is the dec a for then? Is it not for skipping start-of-routine+#0 ?
        • If you specify the return to be in r3, you can load dpl in the caller directly from r3. This only applies to the non-comparing case.
         
        • Philipp Klaus Krause

          • I see. I suggest we remove that dec a instead of changing the constant. The dec a allows us to skip a restart in rare cases, but adds one extra instruction, so its probably not worth it.
          • Via a is faster: we currently do mov a, r3 + mov dpl, a, which is 24 cycles. the alternative would be nop + mov dpl, r3, which is 36 cycles.
           
          • Maarten Brock

            Maarten Brock - 2024-07-12
            • Removing dec a is fine with me. It's not really restarting there either because it hasn't done anything yet. But this gives room for more optimizations as the 2 nops can be moved out of the routine.
            • Further, we do not need to execute an extra nop when r3 contains the return value. Instead we can return earlier, placing the nop after the ret. Or even better, don´t use a call at all saving precious stack space.
            • And we can drop another call if we rename the gptr entry point to _atomic_flag_test_and_set
            0:  nop                                           ; 1 B
            1:  nop                                           ; 1 B
            sdcc_atomic_exchange_pdata_impl:
            2:  movx a, @r0                                   ; 1 B
            3a: mov  r3, a                                    ; 1 B
            4a: mov  a, r2                                    ; 1 B
            5a: movx @r0, a                                   ; 1 B
            6:  sjmp sdcc_atomic_exchange_exit                ; 2 B
                                                              ;=====
                                                              ; 8 B
            
            0:  nop                                           ; 1 B
            1:  nop                                           ; 1 B
            sdcc_atomic_exchange_xdata_impl:
            2:  movx a, @dptr                                 ; 1 B
            3a: mov  r3, a                                    ; 1 B
            4a: mov  a, r2                                    ; 1 B
            5a: movx @dptr, a                                 ; 1 B
            6:  sjmp sdcc_atomic_exchange_exit                ; 2 B
                                                              ;=====
                                                              ; 8 B
            
            sdcc_atomic_exchange_exit:
                mov  dpl, r3
                ret
            _atomic_flag_test_and_set::
                mov  r2, #1
                jnb B.6, sdcc_atomic_exchange_xdata_impl
                mov r0, dpl
                jb  B.5, sdcc_atomic_exchange_pdata_impl
            ;    sjmp sdcc_atomic_exchange_idata_impl
            sdcc_atomic_exchange_idata_impl:
                mov a, r2
                xch a, @r0
                mov  dpl, a
                ret
            
             
            • Maarten Brock

              Maarten Brock - 2024-07-12

              Here is the atomic3.patch with many comments marked with >>>. It is not complete.

               

              Last edit: Maarten Brock 2024-07-12
            • Philipp Klaus Krause

              Thanks. I moved those nop in [r14918]. The dec a never made it to trunk anyway.
              Regarding the other two changes I'm not sure yet: later, these support functions will also be used to implement atomic_exchangefor 8-bit types, so we don't want to overspecialize them towards atomic_flag_test_and_set now.

              That would be the function family:

              C atomic_exchange(volatile _Atomic C *object, C desired);
              

              For every unqualified 8-bit type C (see section 7.17.7.3 of the ISO C23 standard).

               

              Related

              Commit: [r14918]

              • Maarten Brock

                Maarten Brock - 2024-07-16
                • The dec r0 bug is still present. I will fix this.
                • I will also rename __sdcc_atomic_exchange_rollback_impl to sdcc_atomic_exchange_rollback_start. I see absolutely no reason why this implementation must be visible in the C namespace.
                • I will add sdcc_atomic_exchange_rollback_end.
                • I will add sdcc_atomic_exchange_exit for returning in DPL.
                sdcc_atomic_exchange_exit:
                    mov  dpl, r3
                    ret
                
                • I will modify _atomic_flag_test_and_set to
                    mov  r2, #0x01
                    ljmp ___sdcc_atomic_exchange_gptr_impl
                
                 
                • Maarten Brock

                  Maarten Brock - 2024-07-17

                  The additional dec r0 is not needed if push psw is moved down.
                  Along with the above implemented in [r14923].

                   

                  Related

                  Commit: [r14923]

                  • Philipp Klaus Krause

                    Thanks. I don't understand the point of

                        .area HOME    (CODE)
                    +   .area GSINIT  (CODE)
                    +   .area GSFINAL (CODE)
                    +   .area HOME    (CODE)
                    

                    in __sdcc_atomic_maybe_rollback.c, though. After all this is a C source file, not an assembler source file, so the compiler already places the ordering at the beginning of the file, before that inline assembler code.

                     
1 2 > >> (Page 1 of 2)

Log in to post a comment.