Menu

#799 For loops

open
nobody
5
2022-05-06
2022-03-22
No

Inspecting the asm code generated it seems that the for loops starts always with a test on the exit conditions even if its result on the first iteration is known.
In those cases it is possible to move the test on the exit condition to the bottom of the loop and merge it with the operations needed for increasing/decreasing the counter
The resulting code is more efficient and shorter

Discussion

  • Ragozini Arturo

    Ragozini Arturo - 2022-03-22

    As an example, look at the for() at lines 84-86 in intro.c

        for (char j=0;j<16*3;j++) {
            VramWrite((unsigned int)gradient_tiles,8*16);
        }
    

    and to its asm equivalent

    ;src\intro.c:84: for (char j=0;j<16*3;j++) {
        ld  c, #0x00
    00121$:
        ld  a, c
        sub a, #0x30
        jr  NC, 00103$
    ;src\intro.c:85: VramWrite((unsigned int)gradient_tiles,8*16);
        push    bc
        ld  de, #0x0080
        ld  hl, #_gradient_tiles
        call    _VramWrite
        pop bc
    ;src\intro.c:84: for (char j=0;j<16*3;j++) {
        inc c
        jp  00121$
    00103$:
    

    As we know that at the first iteration the condition i>0 is true, the loop can be arranged moving the test at the bottom saving a jump

    e.g.

    ld  c, #0x00
    00121$:
    push    bc
    ld  de, #0x0080
    ld  hl, #_gradient_tiles
    call    _VramWrite
    pop bc
    inc c
    ld a,#0x2F
    cp a,c
    jp  nc,00121$
    
     

    Last edit: Ragozini Arturo 2022-03-22
  • Philipp Klaus Krause

    Shouldn't the loop reverse optimization take care of this? Someone should check why it doesn't apply to your example.

     
    • Sebastian Riedel

      1. No function calls within the loop.

        No, it shouldn’t.

      2. To my knowledge it only works for for (j=0;j<16*3;j++), see [bugs:#3162]

       

      Related

      Bugs: #3162

  • Ragozini Arturo

    Ragozini Arturo - 2022-03-22

    All the loops in the intro.c code where the counter variable is defined within the for() do not have the loop reverse optimization. It has to be related to the bug you mention.

     
  • Sebastian Riedel

    I don’t know where/how to fix the loop optimization and generation, but I also didn’t look into it yet.
    Here is some demonstration of how it works currently in [r13284]

    func is not optimized, because the optimization rule does not support this new loop style.
    func2 is not optimized because it has a function call, which is not supported and documented. But at least it did the comparison at the end of the loop.
    func3 does the optimization. It’s fine for sm83, but not for z80. Because it decrements a, it can’t use djnz, which might only be applied through peep hole rules.
    func4 does the optimization and is not fine.
    On sm83 you would completely change the whole loop. Don’t know if z80 also has a special instruction for that or if it’s not worth to write it that way due to cycles:

    _func4::
        ld  c, a
        ld  de, #0x0030+0x100
        ; inc d ; if it's not a constant
    00104$:
        inc c
        dec e
        jr  NZ, 00104$
        dec d
        jr  NZ, 00104$
        ld  a, c
        ret
    
    unsigned char func(unsigned char a){
        for(char j=0;j<16*3;j++)++a;
        return a;
    }
    unsigned char func2(unsigned char a){
        char j;
        for(j=0;j<16*3;j++)a=func(a);
        return a;
    }
    unsigned char func3(unsigned char a){
        char j;
        for(j=0;j<16*3;j++)++a;
        return a;
    }
    unsigned char func4(unsigned char a){
        int j;
        for(j=0;j<16*3;j++)++a;
        return a;
    }
    void main(void){
        func(1);
        func2(1);
        func3(1);
        func4(1);
    }
    

    $ sdcc -mz80 -S test.c

    _func::
        ld  c, a
        xor a, a
    00103$:
        cp  a, #0x30
        jr  NC, 00101$
        inc c
        inc a
        jr  00103$
    00101$:
        ld  a, c
        ret
    _func2::
        ld  c, a
        ld  b, #0x00
    00102$:
        push    bc
        ld  a, c
        call    _func
        pop bc
        ld  c, a
        inc b
        ld  a, b
        sub a, #0x30
        jr  C, 00102$
        ld  a, c
        ret
    _func3::
        ld  c, a
        ld  a, #0x30
    00104$:
        inc c
        dec a
        jr  NZ, 00104$
        ld  a, c
        ret
    _func4::
        ld  c, a
        ld  de, #0x0030
    00104$:
        inc c
        dec de
        ld  a, d
        or  a, e
        jr  NZ, 00104$
        ld  a, c
        ret
    

    $ sdcc -msm83 -S test.c

    _func::
        ld  c, a
        xor a, a
    00103$:
        cp  a, #0x30
        jr  NC, 00101$
        inc c
        inc a
        jr  00103$
    00101$:
        ld  a, c
        ret
    _func2::
        ld  c, a
        ld  b, #0x00
    00102$:
        push    bc
        ld  a, c
        call    _func
        pop bc
        ld  c, a
        inc b
        ld  a, b
        sub a, #0x30
        jr  C, 00102$
        ld  a, c
        ret
    _func3::
        ld  c, a
        ld  a, #0x30
    00104$:
        inc c
        dec a
        jr  NZ, 00104$
        ld  a, c
        ret
    _func4::
        ld  c, a
        ld  de, #0x0030
    00104$:
        inc c
        dec de
        ld  a, d
        or  a, e
        jr  NZ, 00104$
        ld  a, c
        ret
    

    (disclaimer: might contain off by one errors, I’m not fully awake yet)
    (edit: fix snippet)

     

    Related

    Commit: [r13284]


    Last edit: Sebastian Riedel 2022-03-23
  • Ragozini Arturo

    Ragozini Arturo - 2022-03-22

    Suggestion about loops. On z80 the cpi instruction can be used to control 16 bit loops on BC.
    This loop:

    label:
        dec bc
        ld  a, b
        or  a, c
        jp  NZ, label
    

    is equivalent to this other loop

    label:    
        dec hl
        cpi
        jp po,label
    

    where the dec hl compensates the side effects of CPI and can be omitted if HL is not used or holds useless values.

    The trick can work only on BC so it has to go in the code generation during the register assignment phase

     

    Last edit: Maarten Brock 2022-12-15
    • Sebastian Riedel

      I don’t see the benefit if hl can’t be trashed. In my table it’s just slower.
      cpi 2B 16C
      ld a,b 1B 4C
      or a,c 1B 4C

      Even when transhing hl it’s 2B 16C vs 3B 14C, so it would need to depend on the code optimization settings (speed/size).

      It might be a problem to have memory at random places be accessed? Since it compares a with (hl)

       

      Last edit: Sebastian Riedel 2022-03-22
      • Philipp Klaus Krause

        For the read access to random memory locations: SDCC may generate them only if the option --allow-unsafe-read is used.

         
        👍
        1
  • Ragozini Arturo

    Ragozini Arturo - 2022-03-22

    You are right, if hl cannot be trashed it is just slower, otherwise it is 1 byte shorter (but 2 cycles slower). About reading a byte at hl normally it should be safe, at least on the msx and on the colecovision.

     
    • Sebastian Riedel

      About reading a byte at hl normally it should be safe

      Due to in and out there is probably no z80 who uses memory mapping.

      It looks like it could be worth it to swap b and c and use a similar strategy like we use on GameBoy. That allows to use jrnz in the inner loop, therefore speed that part up significantly. For constants it’s the same size as cpi if hl can’t be trashed. If b and c have to be swapped first and hl is free, it gets bigger in size.

      _func4::
          ld  c, a
          ld  de, #0x0030
      00104$:
          inc c
          dec de ; 1 6
          ld  a, d ; 1 4
          or  a, e ; 1 4
          jr  NZ, 00104$ ; 2 12
          ld  a, c
          ret
      
      _func4::
          ld  bc, #0x0030+0x100
          ; inc b ; 1 4
      00104$:
          inc a
          dec c ; 1 4
          jp  NZ, 00104$ ; 3 10
          dec b ; 1 4
          jp  NZ, 00104$ ; 3 10
          ret
      
      _func4::
          ld  bc, #0x3000+0x1
          ; inc c ; 1 4
      00104$:
          inc a
          jrnz 00104$ ; 2 13
          dec c ; 1 4
          jp  NZ, 00104$ ; 3 10
          ret
      
      _func4::
          ld  bc, #0x0030
      00104$:
          inc a
          dec hl ; 1 6
          cpi ; 2 16
          jp  PO, 00104$ ; 3 10
          ret
      

      That jrnz addr strategy could also be used on rabbit, which lacks cpi. TLCS even has jrnz bc, addr o.O

      (edit: fix snippet)

       

      Last edit: Sebastian Riedel 2022-03-23
      • Philipp Klaus Krause

        There are.

        Consider e.g. the ColecoVision. By itself, it doesn't use memory-mapped I/O. The designers treated the cartridge port as a way to map ROM into 32K of the address space. For that you don't need the IO signals, you don't even need read / write signals.
        So all the system designers put there were data lines, address lines, power and an (actually 4, but that doesn't matter here) active signal.
        So when game programmers wanted to use more than 32K in a game cartridge, they had do do the mappers via memory-mapped I/O. And since the cartridge port doesn't have a read / write signal, a read from an address used by the mapper will trigger a write, changing the memory mapping / banking registers.

         
        • Ragozini Arturo

          Ragozini Arturo - 2022-03-22

          I think you are right. Some home brew cartridges for colecovision have a rom mapper whose pages are activated by reading specific rom addresses. Anyway, if you allow the cpi optimisation only if --allow-unsafe-read is set and you need to optimise for size, it could work.

          BTW there are cases where the HL cannot be trashed but it is actively pointing to an array sliding one step at each iteration (up or down).
          In those cases the cpi trick (and its cpd equivalent) can be used to include the inc hl / dec hl step, resulting in shorter and faster code (because you would have has inc/dec hl in the other way).
          (edit: added hint about inc hl/dec hl and cpi/cpd)

           

          Last edit: Ragozini Arturo 2022-03-23
  • Ragozini Arturo

    Ragozini Arturo - 2022-03-22

    on z80 it is

    djnz addr
    

    it decrements B and jumps if it is not zero
    you should do

    ld bc,0x3001
    loop:
    inc a
    djnz loop
    dec c
    jp nz,loop
    
     
  • Ragozini Arturo

    Ragozini Arturo - 2022-05-01

    Any news on adding this fix to the next release ?

     
    • Philipp Klaus Krause

      No. Maybe some SDCC developer will work on it, maybe not. Maybe some user will post a patch that gets applied, maybe not.
      But if something happens, you will notice by this issue being closed.

       

Log in to post a comment.

MongoDB Logo MongoDB