This patch splits the division/modulus routine into two cases: divisors up to 7 bits long, and divisors with more than 7 bits. It may sound strange that the threshold is 7 and not 8... but look at the code and you will see why! Both cases are computed faster than with the current code.
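To illustrate the small-divisor case, here is a minimal C sketch (not the patch's Z80 assembly). The name `divu16_small` is hypothetical, and it assumes the divisor is in the range 1..127. One plausible reason for the 7-bit threshold shows up here: the running remainder is always smaller than the divisor, so after shifting in the next dividend bit it is at most 2*126+1 = 253 and still fits in a single 8-bit register with no carry to handle.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Unsigned 16-bit restoring division by a divisor in 1..127.
 * Returns the quotient and stores the remainder in *remainder. */
static uint16_t divu16_small(uint16_t dividend, uint8_t divisor,
                             uint8_t *remainder)
{
    uint8_t rem = 0;

    for (uint8_t i = 0; i < 16; i++) {
        /* Shift the top dividend bit into the remainder.
         * rem <= 126 before the shift, so the result is <= 253
         * and cannot overflow 8 bits. */
        rem = (uint8_t)((rem << 1) | (dividend >> 15));
        dividend = (uint16_t)(dividend << 1);

        /* The quotient bits are collected in the low end of the
         * dividend as it is shifted out. */
        if (rem >= divisor) {
            rem = (uint8_t)(rem - divisor);
            dividend |= 1;
        }
    }

    *remainder = rem;
    return dividend;
}
```

With an 8-bit divisor the remainder could reach 254, and the shifted value would need a ninth bit, which is where a separate code path becomes necessary.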
Drawback of the patch: code size increases! The _overall_ test-ucz80 result is 7,000 more bytes used, 150,000 ticks saved.
The "atmost7bits" version can easily be adapted to compute both the quotient and the remainder of a 32-bit long divided by a positive signed char (7 bits :-), storing the dividend (and quotient) in HLDE and the divisor in C (A holds the remainder). This code should then be used to speed up ltoa (radix must be < 37 with the current implementation).
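A rough C sketch of that idea, under stated assumptions: `divmod32_small` and `u32_to_str` are hypothetical names, not existing library routines, the divisor/radix is assumed to be in the valid range (1..127 for the division, 2..36 for the conversion), and sign handling as ltoa would need is left out. The 32-bit value plays the role of HLDE and the 8-bit remainder plays the role of A.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Divides *value in place by a divisor in 1..127 and returns
 * the remainder, which always fits in 8 bits. */
static uint8_t divmod32_small(uint32_t *value, uint8_t divisor)
{
    uint32_t v = *value;
    uint8_t rem = 0;

    for (uint8_t i = 0; i < 32; i++) {
        /* rem <= 126 here, so the shifted value is at most 253. */
        rem = (uint8_t)((rem << 1) | (uint8_t)(v >> 31));
        v <<= 1;
        if (rem >= divisor) {
            rem = (uint8_t)(rem - divisor);
            v |= 1;             /* quotient bit enters from the right */
        }
    }

    *value = v;
    return rem;
}

/* Hypothetical unsigned conversion built on the helper above.
 * radix must be 2..36; buf must hold at least 33 bytes. */
static char *u32_to_str(uint32_t value, char *buf, uint8_t radix)
{
    char tmp[32];               /* 32 binary digits is the worst case */
    uint8_t n = 0;
    uint8_t i = 0;

    /* One division per digit yields both the digit and the
     * remaining value; digits come out least significant first. */
    do {
        uint8_t digit = divmod32_small(&value, radix);
        tmp[n++] = (char)(digit < 10 ? '0' + digit : 'a' + (digit - 10));
    } while (value != 0);

    /* Reverse into the caller's buffer. */
    while (n > 0) {
        buf[i++] = tmp[--n];
    }
    buf[i] = '\0';
    return buf;
}
```

Since every radix up to 36 fits in 7 bits, each digit costs a single pass of the small-divisor loop rather than a full 32-bit by 32-bit division.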
IMO the speedup is worth the increase in code size. I suppose we can make up for the code size increase by further modularization of the asm routines.
Applied in revision #5447.