Small Device C Compiler (SDCC) / Feature Requests / #799 For loops

As an example, look at the for() at lines 84-86 in intro.c

    for (char j=0;j<16*3;j++) {
        VramWrite((unsigned int)gradient_tiles,8*16);
    }

and to its asm equivalent

;src\intro.c:84: for (char j=0;j<16*3;j++) {
    ld  c, #0x00
00121$:
    ld  a, c
    sub a, #0x30
    jr  NC, 00103$
;src\intro.c:85: VramWrite((unsigned int)gradient_tiles,8*16);
    push    bc
    ld  de, #0x0080
    ld  hl, #_gradient_tiles
    call    _VramWrite
    pop bc
;src\intro.c:84: for (char j=0;j<16*3;j++) {
    inc c
    jp  00121$
00103$:

As we know that at the first iteration the condition i>0 is true, the loop can be arranged moving the test at the bottom saving a jump

e.g.

ld  c, #0x00
00121$:
push    bc
ld  de, #0x0080
ld  hl, #_gradient_tiles
call    _VramWrite
pop bc
inc c
ld a,#0x2F
cp a,c
jp  nc,00121$

Last edit: Ragozini Arturo 2022-03-22

intro.asm

intro.c

Philipp Klaus Krause - 2022-03-22

Shouldn't the loop reverse optimization take care of this? Someone should check why it doesn't apply to your example.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Sebastian Riedel - 2022-03-22
  
  No function calls within the loop.
  
  No, it shouldn’t.
  
  To my knowledge it only works for for (j=0;j<16*3;j++), see [bugs:#3162]
  
  Related
  
  Bugs: #3162
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2022-03-22

All the loops in the intro.c code where the counter variable is defined within the for() do not have the loop reverse optimization. It has to be related to the bug you mention.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

I don’t know where/how to fix the loop optimization and generation, but I also didn’t look into it yet.
Here is some demonstration of how it works currently in [r13284]

func is not optimized, because the optimization rule does not support this new loop style.
func2 is not optimized because it has a function call, which is not supported and documented. But at least it did the comparison at the end of the loop.
func3 does the optimization. It’s fine for sm83, but not for z80. Because it decrements a, it can’t use djnz, which might only be applied through peep hole rules.
func4 does the optimization and is not fine.
On sm83 you would completely change the whole loop. Don’t know if z80 also has a special instruction for that or if it’s not worth to write it that way due to cycles:

_func4::
    ld  c, a
    ld  de, #0x0030+0x100
    ; inc d ; if it's not a constant
00104$:
    inc c
    dec e
    jr  NZ, 00104$
    dec d
    jr  NZ, 00104$
    ld  a, c
    ret

unsigned char func(unsigned char a){
    for(char j=0;j<16*3;j++)++a;
    return a;
}
unsigned char func2(unsigned char a){
    char j;
    for(j=0;j<16*3;j++)a=func(a);
    return a;
}
unsigned char func3(unsigned char a){
    char j;
    for(j=0;j<16*3;j++)++a;
    return a;
}
unsigned char func4(unsigned char a){
    int j;
    for(j=0;j<16*3;j++)++a;
    return a;
}
void main(void){
    func(1);
    func2(1);
    func3(1);
    func4(1);
}

$ sdcc -mz80 -S test.c

_func::
    ld  c, a
    xor a, a
00103$:
    cp  a, #0x30
    jr  NC, 00101$
    inc c
    inc a
    jr  00103$
00101$:
    ld  a, c
    ret
_func2::
    ld  c, a
    ld  b, #0x00
00102$:
    push    bc
    ld  a, c
    call    _func
    pop bc
    ld  c, a
    inc b
    ld  a, b
    sub a, #0x30
    jr  C, 00102$
    ld  a, c
    ret
_func3::
    ld  c, a
    ld  a, #0x30
00104$:
    inc c
    dec a
    jr  NZ, 00104$
    ld  a, c
    ret
_func4::
    ld  c, a
    ld  de, #0x0030
00104$:
    inc c
    dec de
    ld  a, d
    or  a, e
    jr  NZ, 00104$
    ld  a, c
    ret

$ sdcc -msm83 -S test.c

_func::
    ld  c, a
    xor a, a
00103$:
    cp  a, #0x30
    jr  NC, 00101$
    inc c
    inc a
    jr  00103$
00101$:
    ld  a, c
    ret
_func2::
    ld  c, a
    ld  b, #0x00
00102$:
    push    bc
    ld  a, c
    call    _func
    pop bc
    ld  c, a
    inc b
    ld  a, b
    sub a, #0x30
    jr  C, 00102$
    ld  a, c
    ret
_func3::
    ld  c, a
    ld  a, #0x30
00104$:
    inc c
    dec a
    jr  NZ, 00104$
    ld  a, c
    ret
_func4::
    ld  c, a
    ld  de, #0x0030
00104$:
    inc c
    dec de
    ld  a, d
    or  a, e
    jr  NZ, 00104$
    ld  a, c
    ret

(disclaimer: might contain off by one errors, I’m not fully awake yet)
(edit: fix snippet)

Commit: [r13284]

Last edit: Sebastian Riedel 2022-03-23

Ragozini Arturo - 2022-03-22

Suggestion about loops. On z80 the cpi instruction can be used to control 16 bit loops on BC.
This loop:

label: dec bc ld a, b or a, c jp NZ, label

is equivalent to this other loop

label: dec hl cpi jp po,label

where the dec hl compensates the side effects of CPI and can be omitted if HL is not used or holds useless values.

The trick can work only on BC so it has to go in the code generation during the register assignment phase

Last edit: Maarten Brock 2022-12-15
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Sebastian Riedel - 2022-03-22
  
  I don’t see the benefit if hl can’t be trashed. In my table it’s just slower.
  cpi 2B 16C
  ld a,b 1B 4C
  or a,c 1B 4C
  
  Even when transhing hl it’s 2B 16C vs 3B 14C, so it would need to depend on the code optimization settings (speed/size).
  
  It might be a problem to have memory at random places be accessed? Since it compares a with (hl)
  
  Last edit: Sebastian Riedel 2022-03-22
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Philipp Klaus Krause - 2022-03-22
    
    For the read access to random memory locations: SDCC may generate them only if the option --allow-unsafe-read is used.
    
    👍
    1
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2022-03-22

You are right, if hl cannot be trashed it is just slower, otherwise it is 1 byte shorter (but 2 cycles slower). About reading a byte at hl normally it should be safe, at least on the msx and on the colecovision.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Sebastian Riedel - 2022-03-22
  
  About reading a byte at hl normally it should be safe
  
  Due to in and out there is probably no z80 who uses memory mapping.
  
  It looks like it could be worth it to swap b and c and use a similar strategy like we use on GameBoy. That allows to use jrnz in the inner loop, therefore speed that part up significantly. For constants it’s the same size as cpi if hl can’t be trashed. If b and c have to be swapped first and hl is free, it gets bigger in size.
  
  _func4:: ld c, a ld de, #0x0030 00104$: inc c dec de ; 1 6 ld a, d ; 1 4 or a, e ; 1 4 jr NZ, 00104$ ; 2 12 ld a, c ret _func4:: ld bc, #0x0030+0x100 ; inc b ; 1 4 00104$: inc a dec c ; 1 4 jp NZ, 00104$ ; 3 10 dec b ; 1 4 jp NZ, 00104$ ; 3 10 ret _func4:: ld bc, #0x3000+0x1 ; inc c ; 1 4 00104$: inc a jrnz 00104$ ; 2 13 dec c ; 1 4 jp NZ, 00104$ ; 3 10 ret _func4:: ld bc, #0x0030 00104$: inc a dec hl ; 1 6 cpi ; 2 16 jp PO, 00104$ ; 3 10 ret
  
  That jrnz addr strategy could also be used on rabbit, which lacks cpi. TLCS even has jrnz bc, addr o.O
  
  (edit: fix snippet)
  
  Last edit: Sebastian Riedel 2022-03-23
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Philipp Klaus Krause - 2022-03-22
    
    There are.
    
    Consider e.g. the ColecoVision. By itself, it doesn't use memory-mapped I/O. The designers treated the cartridge port as a way to map ROM into 32K of the address space. For that you don't need the IO signals, you don't even need read / write signals.
    So all the system designers put there were data lines, address lines, power and an (actually 4, but that doesn't matter here) active signal.
    So when game programmers wanted to use more than 32K in a game cartridge, they had do do the mappers via memory-mapped I/O. And since the cartridge port doesn't have a read / write signal, a read from an address used by the mapper will trigger a write, changing the memory mapping / banking registers.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Ragozini Arturo - 2022-03-22
      
      I think you are right. Some home brew cartridges for colecovision have a rom mapper whose pages are activated by reading specific rom addresses. Anyway, if you allow the cpi optimisation only if --allow-unsafe-read is set and you need to optimise for size, it could work.
      
      BTW there are cases where the HL cannot be trashed but it is actively pointing to an array sliding one step at each iteration (up or down).
      In those cases the cpi trick (and its cpd equivalent) can be used to include the inc hl / dec hl step, resulting in shorter and faster code (because you would have has inc/dec hl in the other way).
      (edit: added hint about inc hl/dec hl and cpi/cpd)
      
      Last edit: Ragozini Arturo 2022-03-23
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2022-03-22

on z80 it is

djnz addr

it decrements B and jumps if it is not zero
you should do

ld bc,0x3001 loop: inc a djnz loop dec c jp nz,loop
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragozini Arturo - 2022-05-01

Any news on adding this fix to the next release ?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2022-05-06
  
  No. Maybe some SDCC developer will work on it, maybe not. Maybe some user will post a patch that gets applied, maybe not.
  But if something happens, you will notice by this issue being closed.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

For loops

The Small Device C Compiler (SDCC), targeting 8-bit architectures

Group

Searches

Help

#799 For loops

Discussion

Related

Related