## Re: [Sdcc-user] Even faster multiplication on Z80?

 Re: [Sdcc-user] Even faster multiplication on Z80? From: - 2009-03-30 14:03:43 ```Dear Philipp, > thanks for your help with Z80 multiplication. Thank you for your answer! > study, but it has been shown, that small integer values are far more > common than bigger ones. AFAIR values 0,1,2 combined are about as likely > to occur as all other values combined in an 8 bit integer. I'm sure you are right, because "char" is often used to store "boolean" values or for small cycles. But the real question is: are those values used as operands in multiplications? If you use the byteXbyte product as a building block for some more complex mathematical function, you usually have "random" bytes to handle. If you use it to access elements of a bi-dimensional array A[x+ROWLEN*y], then the result must be small... and so must be the two operands. A far faster code for small operands is: ld hl, #0 ld d, h or a jp 3 1: add hl,de 2: rl e rl d 3: rra jp C, 1 jp NZ, 2 It assumes the two operands are in A and E; put the result in HL... It is very fast when A<32, but code size is too big to be inlined... >> Drawbacks: >> - overwrites the accumulator A > For builtin code generation this is bad, since A is used to backup B > (when B is in use). If both DE and B are used, wrap into push bc/pop pc, (saving ld a,b/ld b,a). If only B is, save it to D or E (same as current). If only DE is in use, do nothing (save push/pop). > You might want to have a look at the 16 bit multiplication in mul.s or There is a faster way for 16x16->16, but it would be slower for 16x8->16... I have to think about it a little bit longer... anyway I can suggest some very small corrections: ---------8<---------8<---------8<---------8<---------8<---------8<----- --- sdcc/device/lib/z80/mul.s 2009-01-05 11:20:47.000000000 +0100 +++ ./mul.s 2009-03-30 15:30:08.000000000 +0200 @@ -29,11 +29,11 @@ ;; DE = multiplier ;; ;; Exit conditions - ;; DE = less significant word of product + ;; HL = less significant word of product ;; ;; Register used: AF,BC,DE,HL __mul16:: - ld hl,#0 + ld l,#0 ld a,b ; ld c,c ld b,#16 @@ -41,7 +41,7 @@ ;; Optimise for the case when this side has 8 bits of data or ;; less. This is often the case with support address calls. or a - jr NZ,1\$ + jr NZ,3\$ ld b,#8 ld a,c @@ -49,6 +49,7 @@ ;; Taken from z88dk, which originally borrowed from the ;; Spectrum rom. add hl,hl +3\$: rl c rla ;DLE 27/11/98 jr NC,2\$ ---------8<---------8<---------8<---------8<---------8<---------8<----- > I wonder how these would compare at device/lib/_mulllong.c This is 32x32->32, isn't it? It should be nice to write an ad-hoc function for this, if one can use BC,DE,HL,BC',DE',HL',IX,IY,A,A'... there as a lot of register-space to work on... Maybe also 32x32->64 is possible. Regards, Marco -- http://bodrato.it/ ```

 [Sdcc-user] Even faster multiplication on Z80? From: - 2009-03-27 11:37:49 ```Dear users and developers, This is my first mail on this mailing list, and it is also the first time I come back to Z80-assembly, after a pause of a decade and a half... I know I'm late, because 8x8-bits multiplication on Z80 is "Now replaced by a built-in for code generation, but"... I have to start from somewhere... I'm trying to substitute the code for "__muluchar_rrx_s::" in device/lib/z80/mulchar.s with some faster code. If I'm not wrong: current code uses (342+b*6) T-states (call and ret excluded), where (b) is the number of bit set to "1" in the second operand. Mine uses (356-b), this means that the current code is faster than mine only when the second operand is zero or a power of two, and slower every time the second operand has at least three bits set to "1" (85% of the values in 0-255). Advantages of the new code: - faster (most of the times), saving 14 T-states on average - does not mess with DE - can be easily modified to give its result on DE (or BC) Drawbacks: - overwrites the accumulator A Parity: - same memory footprint (2 bytes can be saved, but costs 30 T-states) The patch follows (it applies to sdcc 2.9.0): ---------8<---------8<---------8<---------8<---------8<---------8<----- --- mulchar.s 2009-03-27 11:31:35.000000000 +0100 +++ sdcc/device/lib/z80/mulchar.s 2009-01-05 11:20:47.000000000 +0100 @@ -1,38 +1,25 @@ .area _CODE -;; Multiply two 8-bits operands, giving a 16-bits result -;; by Marco Bodrato, March 27, 2009, licensed GPLv2+ -;; Before: -;; On-stack: return address, operands -;; After: -;; B=0 -;; HL=result -;; A=H, F=[H=N=Cy=0,(P,S,Z, depend on L)] -;; -;; Timings: -;; Total cycles needed depend on z=the number of "0" bits in L; -;; returns after (348+z). -;; Notes: HL can be replaced by DE (or, a bit trickier, BC) - +; This multiplication routine is similar to the one +; from Rodnay Zaks, "Programming the Z80". + ; Now replaced by a builtin for code generation, but ; still called from some asm files in this directory. __muluchar_rrx_s:: - pop af - pop hl ; Load operands - push hl ; and recover stack - push af - ;; registers H and L now store the two operands - xor a + ld hl, #2+1 + add hl, sp + ld e, (hl) + dec hl + ld h, (hl) + ld l, #0 + ld d, l ld b, #8 muluchar_rrx_s_loop: - rr l + add hl, hl jr nc, muluchar_rrx_s_noadd - add a, h + add hl, de muluchar_rrx_s_noadd: - rra djnz muluchar_rrx_s_loop - rr l ; result is in AL, now... - ld h, a ret ; operands have different sign ---------8<---------8<---------8<---------8<---------8<---------8<----- Let me know if you find it interesting! If you do, I can try to optimise also the "code generation" (but I'll need some hints) and maybe other multiplication routines... Regards, Marco -- http://bodrato.it/software/ ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: Philipp Klaus Krause - 2009-03-29 14:10:16 ```-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Dear Marco, thanks for your help with Z80 multiplication. bodrato@... schrieb: > [...] > > Advantages of the new code: > - faster (most of the times), saving 14 T-states on average I don't think so. Unfortuantely I currently don't find a link to the study, but it has been shown, that small integer values are far more common than bigger ones. AFAIR values 0,1,2 combined are about as likely to occour as all other values combined in an 8 bit integer. > - does not mess with DE This could be an important advantage for builtin code generation. Currently the code often has to be wrapped into push de / pop de. > - can be easily modified to give its result on DE (or BC) Another aspect which could be good for builtin code generation. > Drawbacks: > - overwrites the accumulator A For builtin code generation this is bad, since A is used to backup B (when B is in use). > > Let me know if you find it interesting! If you do, I can try to optimise > also the "code generation" (but I'll need some hints) and maybe other > multiplication routines... > You might want to have a look at the 16 bit multiplication in mul.s or the division routines in div.s and divsigned.s. AFAIK these have not been optimized as much as the 8x8 bit multiplication yet. The code generation is in src/z80/gen.c in the genMultOneChar function. It currently generates the following code: ld l, #0 ld b, #8 1: add hl, hl jp NC, 2 add hl, de 2: djnz 1 Code size is 11 bytes, it takes 285+11b cycles. Your multiplication would be xor a ld b, #8 1: rr l jp NC, 2 add a, h 2: rra djnz 1 rr l ld h, a Code size is 15 bytes, it takes 298+4b̄ cycles. However depending on the operands and weather de and bc are in use your code could still be better. I wonder how these would compare at device/lib/_mulllong.c Philipp -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAknPgUEACgkQbtUV+xsoLpoKygCgjnEDDl03fIhuTIX9AkyyqtFd YEgAn0KwvorqMCktqZ9WhHtT6th1/3so =gXUj -----END PGP SIGNATURE----- ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: - 2009-03-30 14:03:43 ```Dear Philipp, > thanks for your help with Z80 multiplication. Thank you for your answer! > study, but it has been shown, that small integer values are far more > common than bigger ones. AFAIR values 0,1,2 combined are about as likely > to occur as all other values combined in an 8 bit integer. I'm sure you are right, because "char" is often used to store "boolean" values or for small cycles. But the real question is: are those values used as operands in multiplications? If you use the byteXbyte product as a building block for some more complex mathematical function, you usually have "random" bytes to handle. If you use it to access elements of a bi-dimensional array A[x+ROWLEN*y], then the result must be small... and so must be the two operands. A far faster code for small operands is: ld hl, #0 ld d, h or a jp 3 1: add hl,de 2: rl e rl d 3: rra jp C, 1 jp NZ, 2 It assumes the two operands are in A and E; put the result in HL... It is very fast when A<32, but code size is too big to be inlined... >> Drawbacks: >> - overwrites the accumulator A > For builtin code generation this is bad, since A is used to backup B > (when B is in use). If both DE and B are used, wrap into push bc/pop pc, (saving ld a,b/ld b,a). If only B is, save it to D or E (same as current). If only DE is in use, do nothing (save push/pop). > You might want to have a look at the 16 bit multiplication in mul.s or There is a faster way for 16x16->16, but it would be slower for 16x8->16... I have to think about it a little bit longer... anyway I can suggest some very small corrections: ---------8<---------8<---------8<---------8<---------8<---------8<----- --- sdcc/device/lib/z80/mul.s 2009-01-05 11:20:47.000000000 +0100 +++ ./mul.s 2009-03-30 15:30:08.000000000 +0200 @@ -29,11 +29,11 @@ ;; DE = multiplier ;; ;; Exit conditions - ;; DE = less significant word of product + ;; HL = less significant word of product ;; ;; Register used: AF,BC,DE,HL __mul16:: - ld hl,#0 + ld l,#0 ld a,b ; ld c,c ld b,#16 @@ -41,7 +41,7 @@ ;; Optimise for the case when this side has 8 bits of data or ;; less. This is often the case with support address calls. or a - jr NZ,1\$ + jr NZ,3\$ ld b,#8 ld a,c @@ -49,6 +49,7 @@ ;; Taken from z88dk, which originally borrowed from the ;; Spectrum rom. add hl,hl +3\$: rl c rla ;DLE 27/11/98 jr NC,2\$ ---------8<---------8<---------8<---------8<---------8<---------8<----- > I wonder how these would compare at device/lib/_mulllong.c This is 32x32->32, isn't it? It should be nice to write an ad-hoc function for this, if one can use BC,DE,HL,BC',DE',HL',IX,IY,A,A'... there as a lot of register-space to work on... Maybe also 32x32->64 is possible. Regards, Marco -- http://bodrato.it/ ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: Philipp Klaus Krause - 2009-04-01 19:15:05 ```-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 bodrato@... schrieb: > [...] anyway I can > suggest some very small corrections: Implemented in sdcc 2.9.1 #5422. Philipp -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAknTvSoACgkQbtUV+xsoLpoEHwCfXhCV/6yYMykTwizIxSlQImPg o3kAni7rvbA0qnHYguLu6b0TWlG3C0UW =utwx -----END PGP SIGNATURE----- ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: - 2009-04-02 13:46:10 ```Dear Philipp, >> [...] anyway I can >> suggest some very small corrections: > > Implemented in sdcc 2.9.1 #5422. Great, but...I did not expect such a hurried inclusion... - the comment wrongly saying "DE = less significant word..." wasn't changed in "HL =..."; - one more byte could be saved by replacing: 36 ld l,#0 37 ld a,b 38 ld b,#16 39 42 or a 43 jr NZ,2\$ with 36 xor a 37 ld l,a 38 xor b 39 ld b,#16 42 43 jr NZ,2\$ Regards, Marco PS: I've cleaned __divu16 form div.s, it's 30% faster now... I do only have to test it a little bit more... Should I send the next patch here? or to another ML? -- http://bodrato.it/ ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: Philipp Klaus Krause - 2009-04-02 21:07:16 ```-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 bodrato@... schrieb: > Dear Philipp, > > Great, but...I did not expect such a hurried inclusion... Well, it was a very simple change, now drawbacks, I understood how it works, the regression tests passed. > > PS: I've cleaned __divu16 form div.s, it's 30% faster now... I do only > have to test it a little bit more... Should I send the next patch here? or > to another ML? No. Please either create a feature request entry in the tracker for the improvement and attach the patch to it or create a patch entry in the tracker. http://sourceforge.net/tracker/?group_id=599 Philipp -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAknVKPgACgkQbtUV+xsoLpoiPwCfRdp1UnzZlRlnMPtS+SdJdvi4 aCsAoLY+uGfe5NAlgpSY1HZgPgFPa8P1 =jCAg -----END PGP SIGNATURE----- ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: - 2009-04-03 08:15:48 ```Dear Philipp, > improvement and attach the patch to it or create a patch entry in the Done: http://sourceforge.net/support/tracker.php?aid=2727917 I did not rewrite the code, I only cleaned the existing one, please let me know. Regrads, Marco -- http://bodrato.it/ ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: Philipp Klaus Krause - 2009-04-09 12:03:59 ```-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 bodrato@... schrieb: > Dear Philipp, > >> thanks for your help with Z80 multiplication. > > Thank you for your answer! > >> study, but it has been shown, that small integer values are far more >> common than bigger ones. AFAIR values 0,1,2 combined are about as likely >> to occur as all other values combined in an 8 bit integer. > > I'm sure you are right, because "char" is often used to store "boolean" > values or for small cycles. But the real question is: are those values > used as operands in multiplications? > If you use the byteXbyte product as a building block for some more complex > mathematical function, you usually have "random" bytes to handle. If you > use it to access elements of a bi-dimensional array A[x+ROWLEN*y], then > the result must be small... and so must be the two operands. I finally found some useful statistics: In "The New C Standard - An Economical and Cultural Commentary", on page 1144, see figures 1143.2. It has info on how common char values are when used as operands to * and /. Unfortunately it's not much, but it might be insightful. Philipp -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAknd5CMACgkQbtUV+xsoLppFXgCgg4LXzEmJLixT1tJ6sBA3ySfn IAMAn3kNOtDeAGj+L0Sx9YjDq7CPdhzs =rCA4 -----END PGP SIGNATURE----- ```
 Re: [Sdcc-user] Even faster multiplication on Z80? From: - 2009-04-15 07:57:16 ```Dear Philipp, >> I'm sure you are right, because "char" is often used to store "boolean" >> values or for small cycles. But the real question is: are those values >> used as operands in multiplications? >> If you use the byteXbyte product as a building block for some more >> complex >> mathematical function, you usually have "random" bytes to handle. If you >> use it to access elements of a bi-dimensional array A[x+ROWLEN*y], then >> the result must be small... and so must be the two operands. I wrote a small patch for 32x32->32 multiplication using both 8x8->16 and 16x16->16 product, this should speed up the Z80 implementation a lot... https://sourceforge.net/support/tracker.php?aid=2762462 > I finally found some useful statistics: In "The New C Standard - An > Economical and Cultural Commentary", on page 1144, see figures 1143.2. > It has info on how common char values are when used as operands to * > and /. Unfortunately it's not much, but it might be insightful. Unfortunately this is mainly about constants used in programs, and this should influence mainly the compiler part and marginally the library. The compiler should optimise r=a*0+b*1/8+2*c/1 with r=(b>>3)+(c<<1) at comile-time. The library can not check for too many special cases at run-time... I tried to add such a special handling to div/mod code, because I think that most of the times modulus is called to convert integers to strings (base 10 or 16...) with a small divisor. The result obviously is that code size increase! https://sourceforge.net/support/tracker.php?aid=2764475 One question, is it possible to have some functions with a different calling convention? Using registers to pass parameters, should obviously be very interesting for small library functions, but also the simple "stack clean-up by the callee" can (sometime) avoid a lot of coding both on the caller and on the callee side. Regards, Marco -- http://bodrato.it/ ```