I wrote a tiny, fast 8x8->16 multiplication routine that does not use any SLOCs.
unsigned int muluchar (unsigned char x, unsigned char y) __naked {
x;
y;
#if defined(__SDCC_pdk13) || defined(__SDCC_pdk14) || defined(__SDCC_pdk15) || defined(__SDCC_pdk16)
__asm ; loop 8x of 9 instruction cycles, data inplace, no SLOCs
mov a, #0x00
clear p
#if !defined(__SDCC_pdk13) // mulint/mullong/.. are likely to emit multiplications by 0
cneqsn a,_test_muluchar_PARM_1 ;x==0 ?
ret
1$:
cneqsn a,_test_muluchar_PARM_2 ;y==0 ?
ret
2$:
#endif
inc p ; {p,a} = 0x0100
0$:
sl a
slc p
slc _test_muluchar_PARM_1 ; x <<= 1;
t0sn.io f,c
add a, _test_muluchar_PARM_2 ; result += y;
3$:
addc p
t1sn _test_muluchar_PARM_1, #0
goto 0$
4$:
ret
__endasm;
#endif
}
It was tested against the following code:
unsigned int muluchar (unsigned char x, unsigned char y)
{
unsigned int result = 0;
unsigned char i = 8;
if (x|y) // mulint/mullong/.. are likely to emit multiplications by 0
do {
result <<= 1;
if (x & 0x80)
result += y;
x <<= 1;
} while (--i);
return result;
}
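To make the termination trick in the assembly easier to follow, here is a minimal C model of the loop. It is a sketch, not the code that went into SDCC: the names mul_ref, mul_model, count_mismatches and self_check are made up for illustration, and the carry handling is my reading of the instructions named in the comments. The point it shows is that the 1 placed in p by "inc p" acts as a sentinel: after eight shifts it leaves p through the carry flag and lands in bit 0 of x, which is what the final bit test looks at, so no separate loop counter is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Reference shift-and-add multiply, same algorithm as the C test code. */
uint16_t mul_ref(uint8_t x, uint8_t y)
{
    uint16_t result = 0;
    uint8_t i = 8;
    do {
        result <<= 1;
        if (x & 0x80)
            result += y;
        x <<= 1;
    } while (--i);
    return result;
}

/* Hypothetical C model of the assembly loop; a and p stand for the
   accumulator and the p pseudo-register, carry for the carry flag. */
uint16_t mul_model(uint8_t x, uint8_t y)
{
    uint8_t a = 0;
    uint8_t p = 1;                          /* {p,a} = 0x0100, bit 8 is the sentinel */
    unsigned carry;
    unsigned out;

    do {
        carry = a >> 7;                     /* sl a */
        a = (uint8_t)(a << 1);

        out = p >> 7;                       /* slc p */
        p = (uint8_t)((p << 1) | carry);
        carry = out;

        out = x >> 7;                       /* slc x: top bit of x goes to carry, */
        x = (uint8_t)((x << 1) | carry);    /* bit shifted out of p enters bit 0  */
        carry = out;

        if (carry) {                        /* t0sn.io f,c skips the add if carry is 0 */
            unsigned sum = (unsigned)a + y; /* add a, y */
            a = (uint8_t)sum;
            carry = sum >> 8;
        }
        p = (uint8_t)(p + carry);           /* addc p */
    } while (!(x & 1));                     /* t1sn x, #0 / goto 0$ */

    return (uint16_t)((p << 8) | a);
}

/* Compare both routines with plain multiplication for all 65536 operand pairs. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned x = 0; x < 256; x++) {
        for (unsigned y = 0; y < 256; y++) {
            uint16_t want = (uint16_t)(x * y);
            if (mul_ref((uint8_t)x, (uint8_t)y) != want ||
                mul_model((uint8_t)x, (uint8_t)y) != want)
                bad++;
        }
    }
    return bad;
}

void self_check(void)
{
    assert(count_mismatches() == 0);
}
```

Calling self_check() from a host-side test program exercises every operand pair. The model only checks the arithmetic; it says nothing about cycle counts or about which RAM addresses the bit-test instruction can reach.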
Thanks; applied to next branch in [r14584].
P.S.: I am unsure if the zero check is worth it. While I agree that 0 may be a common operand, doing the checks takes 4 extra cycles for multiplications where no operand is 0. For now, I left the check out.
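One rough way to reason about this trade-off is to ask how often an operand has to be zero before the check pays for itself. The sketch below is a back-of-the-envelope model, not a measurement: zero_check_break_even is a made-up name, the 72 cycles are just the 8 x 9 loop cycles from the comment in the routine, the 4 cycles are the figure quoted above, and the cost of the early-exit path is an assumed parameter.

```c
#include <assert.h>

/* With the check, a call costs early_exit_cycles when an operand is zero and
   full_cycles + check_cycles otherwise; without it, every call costs full_cycles.
   The check wins once the fraction f of zero-operand calls satisfies
   f > check_cycles / (full_cycles + check_cycles - early_exit_cycles). */
double zero_check_break_even(double full_cycles, double check_cycles,
                             double early_exit_cycles)
{
    assert(full_cycles + check_cycles > early_exit_cycles);
    return check_cycles / (full_cycles + check_cycles - early_exit_cycles);
}
```

For example, zero_check_break_even(72.0, 4.0, 6.0), with 6 cycles as an assumed early-exit cost, gives 4/70, so under these assumptions the check would only pay off if somewhat more than 5% of multiplications had a zero operand.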
Related
Commit: [r14584]
Last edit: Philipp Klaus Krause 2024-01-03
Nice point to start discussion ;)
Sorry for mixing topics. Please also have a look at the fast 16-bit multiplication implementation.
This routine uses a non-zero p+1. Current SDCC relies on p+1 always being zero.
I had understood (wrongly, it seems) that P is a 16-bit scratchpad ;(
For now, p holds the lower 8 bits of data, with the upper 8 bits fixed at zero. It seemed like a good start when the pdk ports were created initially.
Changing to full 16 bits would have advantages and disadvantages:
* Obvious advantage of having 8 more bits for a pseudo-register.
* Increased interrupt latency due to having to save and restore the upper bits in the interrupt handler.
* Less efficient RAM access (via pointers), since we'd have to zero p+1 before RAM accesses via idxm.
* Substantial rewrite of code generation required when changing this.
IMO, it probably isn't worth the effort to look into changing it now. But once we get a pdk16 port, we should revisit this decision.
The interrupt handler should only care about P if it is used.
Another way is to assign the handler its own register frame. Which is better depends on the target's trade-off between speed and size.
In the meantime, we could adapt the fast mulint to use a non-stack local variable instead of P+1.
Another issue is labels inside a .rept macro.
Do I really need to open another separate patch case for 16 bit multiplication?
Most work on SDCC is volunteer work, so it often happens that other stuff takes priority, and we SDCC developers don't find much time for SDCC. Recently, for me that meant that I decided to use the time I could spend on SDCC on the SDCC 4.4.0 release; dealing with patch tickets got postponed.
No discussion needed; life is too short. I really appreciate what you do.
Today I found that apparently the new mulchar wasn't actually used, except for pdk13. I noticed when I saw pdk13 tests fail to link, due to the parameter name being wrong. I'll look into fixing it, and enabling it for all pdk ports today.
The optimized muluchar is now finally working in [r15578]. I had to change it a bit to avoid the t1sn, which would result in link failures depending on how the linker arranges the parameters in memory (the t1sn pdk instruction is not available for the full pdk RAM address range).
Related
Commit: [r15578]