The following function could use only HL, DE, BC and A throughout:
int memcmp (const void * vbuf1, const void * vbuf2, unsigned count)
{
    unsigned char* buf1 = (unsigned char*)vbuf1;
    unsigned char* buf2 = (unsigned char*)vbuf2;
    if ( count ) {
        do {
            if ( *buf1 != *buf2 )
                break;
            buf2++; buf1++;
        } while ( --count );
        return( *buf1 - *buf2 );
    } else
        return 0;
}
annotated:
int memcmp (const void * vbuf1, const void * vbuf2, unsigned count)
{
    unsigned char* buf1 = (unsigned char*)vbuf1; // HL
    unsigned char* buf2 = (unsigned char*)vbuf2; // DE
    if ( count ) { // BC
        do {
            if ( *buf1 != *buf2 ) // ld A,(DE) : cp (HL) : jr NZ, break
                break;
            buf2++; buf1++; // inc DE : inc HL
        } while ( --count ); // dec BC : ld A, B : or C : jr NZ, loop
        return( *buf1 - *buf2 );
    } else
        return 0;
}
but currently (#14815, Linux) what I get is:
sdcc -mz80 --opt-code-speed --max-allocs-per-node 20000 --Werror --peep-return -c
_memcmp::
push ix
ld ix,#0
add ix,sp
push af
ld c, l
ld b, h
;check.c:3: unsigned char* buf1 = (unsigned char*)vbuf1;
;check.c:4: unsigned char* buf2 = (unsigned char*)vbuf2;
;check.c:5: if ( count ) {
ld a, 5 (ix)
or a, 4 (ix)
jr Z, 00107$
;check.c:6: do {
ld l, 4 (ix)
ld h, 5 (ix)
00103$:
;check.c:7: if ( *buf1 != *buf2 )
ld a, (bc)
ld -2 (ix), a
ld a, (de)
ld -1 (ix), a
ld a, -2 (ix)
sub a, -1 (ix)
jr NZ, 00105$
;check.c:9: buf2++; buf1++;
;check.c:10: } while ( --count );
dec hl
inc de
inc bc
ld a, h
or a, l
jr NZ, 00103$
00105$:
;check.c:12: return( *buf1 - *buf2 );
ld a, (bc)
ld l, a
ld h, #0x00
ld a, (de)
ld c, a
ld b, #0x00
cp a, a
sbc hl, bc
ex de, hl
jp 00109$
00107$:
;check.c:14: return 0;
ld de, #0x0000
00109$:
;check.c:15: }
ld sp, ix
pop ix
pop hl
pop af
jp (hl)
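For comparison with the listing above, the register assignment from the annotated version can be modelled in plain C, with one variable per register pair. This is only a sketch to check that the loop needs nothing beyond HL, DE, BC and A; `mem` and `memcmp_regs` are made-up names for illustration, not anything SDCC provides.

```c
#include <stdint.h>

/* Hypothetical model of the 64 KiB Z80 address space. */
static uint8_t mem[65536];

/* hl = buf1, de = buf2, bc = count, a = scratch byte.
   Like the C function as posted, the final read uses the pointers
   where the loop stopped (one past the range if every byte matched). */
int memcmp_regs(uint16_t hl, uint16_t de, uint16_t bc)
{
    uint8_t a;

    if (bc == 0)
        return 0;
    do {
        a = mem[de];            /* ld a,(de)                  */
        if (a != mem[hl])       /* cp (hl) : jr NZ, break     */
            break;
        de++;                   /* inc de                     */
        hl++;                   /* inc hl                     */
        bc--;                   /* dec bc                     */
    } while (bc != 0);          /* ld a,b : or c : jr NZ,loop */

    return (int)mem[hl] - (int)mem[de];   /* *buf1 - *buf2 */
}
```

No fifth variable and no stack slot is needed anywhere in the loop, which is the point of the annotation.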
The biggest problem is in the loop:
;check.c:7: if ( *buf1 != *buf2 )
ld a, (bc)
ld -2 (ix), a
ld a, (de)
ld -1 (ix), a
ld a, -2 (ix)
sub a, -1 (ix)
jr NZ, 00105$
Ideally, that could be
1A ld a,(de) [7]
BE cp (hl) [7]
20 .. jr NZ, ..
Then the register allocator would "know" that HL is the one used in CP, and the code generator would simply build on that? I am aware that the iCode has a lot of variables at this point, and that this may be too complicated to achieve if the stages can't plan ahead like that, but maybe it could be considered as a goal.
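To put a rough number on the difference, the T-states of the two compare sequences can be added up. A small sketch, using the standard Z80 timings (7 for a register-indirect load or compare, 19 for anything going through `(ix+d)`, 7 for a `jr` that is not taken); the array and function names are just for illustration.

```c
#include <stddef.h>

/* T-states for the compare as currently generated.
   jr NZ is counted as not taken, i.e. the bytes matched. */
static const unsigned current_loop[] = {
    7,   /* ld a, (bc)      */
    19,  /* ld -2 (ix), a   */
    7,   /* ld a, (de)      */
    19,  /* ld -1 (ix), a   */
    19,  /* ld a, -2 (ix)   */
    19,  /* sub a, -1 (ix)  */
    7    /* jr NZ, 00105$   */
};

/* T-states for the hoped-for sequence. */
static const unsigned ideal_loop[] = {
    7,   /* ld a, (de)      */
    7,   /* cp (hl)         */
    7    /* jr NZ, ..       */
};

static unsigned sum_tstates(const unsigned *t, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += t[i];
    return total;
}

unsigned current_compare_tstates(void)
{
    return sum_tstates(current_loop, sizeof current_loop / sizeof current_loop[0]);
}

unsigned ideal_compare_tstates(void)
{
    return sum_tstates(ideal_loop, sizeof ideal_loop / sizeof ideal_loop[0]);
}
```

Under these assumptions the compare alone costs 97 T-states per iteration as generated, against 21 for the short form.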
I am also aware that some modifications of how that compare is written produce different code.
E.g.
unsigned char c2 = *buf2;
if ( *buf1 != c2 )
break;
gives:
;check.c:7: unsigned char c2 = *buf2;
ld a, (de)
ld -2 (ix), a
;check.c:8: if ( *buf1 != c2 )
ld a, (bc)
ld -1 (ix), a
ld a, -2 (ix)
sub a, -1 (ix)
jr NZ, 00105$
and this almost manages to produce something better:
unsigned char c2 = *buf2;
if ( (unsigned char)(*buf1 - c2) )
break;
(Note: without the cast it would use a 16-bit sub there, which would also be unnecessary and is, I'd think, a symptom of some issue.)
;check.c:7: unsigned char c2 = *buf2;
ld a, (de)
ld -1 (ix), a
;check.c:8: if ( (unsigned char)(*buf1 - c2) )
ld a, (bc)
sub a, -1 (ix)
or a, a
jr NZ, 00105$
but it still uses stack for one of the two.
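On the 16-bit subtraction mentioned in the note: in C both operands of `-` are promoted to int first, so `*buf1 - c2` is formally an int subtraction, and the compiler has to notice that only zero versus non-zero matters before it can narrow it. A small sketch (the function name is made up) checking that for unsigned char operands the truth value is the same either way, so doing it in 8 bits would be a valid optimization:

```c
/* Returns 1 if, for every pair of byte values, the int-wide difference,
   the difference cast back to unsigned char, and a plain != all agree
   on zero versus non-zero. Returns 0 on the first disagreement. */
int cast_and_plain_tests_agree(void)
{
    for (unsigned a = 0; a < 256; a++) {
        for (unsigned b = 0; b < 256; b++) {
            unsigned char x = (unsigned char)a;
            unsigned char y = (unsigned char)b;
            int wide = x - y;                              /* promoted: -255..255 */
            unsigned char narrow = (unsigned char)(x - y); /* reduced modulo 256  */

            if ((wide != 0) != (narrow != 0))
                return 0;
            if ((wide != 0) != (x != y))
                return 0;
        }
    }
    return 1;
}
```

The difference lies between -255 and 255, so it can only be a multiple of 256 when it is exactly zero; that is why the narrow and the wide test never disagree.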
I think that ideally all the forms would generate
1A ld a,(de) [7]
BE cp (hl) [7]
20 .. jr NZ, ..
but I don't know how possible that would be.
sub a, -1 (ix)
or a, a
jr NZ
is also unexpected in that last case, since sub already sets the Z flag and the or a, a is redundant.

The problem with temporary variables, demonstrated in the minimal function:
produces:
verbose:
Last edit: Janko Stamenović 2024-04-23
Should the optimal solution not use this?
I still haven't learned enough about what is realistic to expect from the existing infrastructure, and I understand that compilers have their limitations on how far some optimizations can go, so I'm hoping more for some relatively general improvement to be possible than for some "absolute best". Sooner or later I will spend more time on this, and even if these entries remain unsolved I'll try to understand from them where the limits are. Thanks for that example.
I am starting to think that the code generator of sdcc is too targeted to generic 8 bit processors to produce good z80 code.
I don't think there's an easy answer to this.
On one hand, the structure of SDCC makes it hard to make good use of indirect and indexed indirect addressing modes (outside of plain loads), as well as complex instructions such as ldi, ldd and cpi. In particular for z80, there are clear weaknesses in SDCC's use of (hl), (de), (bc), d (iy): When there is a pointer or array access in the C code, SDCC typically loads the result into a temporary (in registers or on the stack), then does the operations on these temporaries. GCC's or LLVM's instruction selection can do much better here.
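To illustrate at the C level (a sketch only; the function names are hypothetical and this is not SDCC's actual intermediate code): the first function is what the programmer writes, the second spells out the "load into a temporary, then operate" shape described above.

```c
/* As written in the source: the comparison takes its operands from memory. */
int bytes_differ_direct(const unsigned char *p, const unsigned char *q)
{
    return *p != *q;
}

/* Roughly the shape after the pointer reads are split off:
   each read lands in its own temporary, and the comparison
   only ever sees the temporaries. */
int bytes_differ_via_temps(const unsigned char *p, const unsigned char *q)
{
    unsigned char t1 = *p;   /* temporary 1: a register or a stack slot */
    unsigned char t2 = *q;   /* temporary 2: a register or a stack slot */
    return t1 != t2;
}
```

An instruction selector that can fold a memory operand into the compare would only need one of the two temporaries (in a), with the other operand read directly through (hl).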
But SDCC has better register allocation. And I don't think combining SDCC's register allocator with a GCC-like instruction selection would be easy (though it isn't impossible either).
There are still many cases where hand-written asm can do better than SDCC.
But AFAIK, SDCC is still the best compiler for the Z80 (and for STM8, Rabbit, etc, which also have complex addressing modes that SDCC won't fully use) so far.
Let's start slow here. For the comparison, I get this code (with --no-peep --fverbose-asm --icode-in-asm --max-allocs-per-node 2000000):
Here, we have the pointer read results in iTemp7 and iTemp9. bc, de and hl are all in use. So iTemp7 has to go on the stack. But it should be possible to put iTemp9 in a, i.e. get something like this after register allocation and code generation:
Stack use of the function would be down to a single byte.
P.S.: This now works in [r14835], which also made some peephole optimizer rule improvements that can optimize two_bytes_equal above even a bit further (since there the other temporary is in a register, not on the stack).
Related
Commit: [r14835]
Last edit: Philipp Klaus Krause 2024-04-30
Shouldn't rule 115 in that commit say "if notUsed %1 "?
Yes. Thanks. Fixed in [r14836].
Related
Commit: [r14836]
Many thanks to you!
A more generic solution would be [feature-requests:#914].
Related
Feature Requests: #914