Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo
Close
From: Schmitt Michael <mschmitt@ba...>  20020320 14:50:01

okay my fault ... my brain still things sdcc is a 8051 (only compiler, = still where sdcc startet as sdc a couple of years ago ....) i mean 8051 ... the problem is ... every machine cycle counts so it has = to be a realy very fast 18/15 bit div function. so the fastest function i = have takes about 70usec (assuming 24mhz or 0,5usec per machine cycle =3D 140 machine cycles) .. anybody out there that can beat that ? isn't that a lot of fun ? spending hours and hours just to get rid of = just 1 machine cycle to get into the needed specs ? Dipl.Ing. (FH) Michael Schmitt Baumer Ident GmbH Entwicklung / Development Department Hertzstr. 10 D69469 Weinheim Deutschland / Germany Tel. +49 (0) 6201 9957  30 Fax. +49 (0) 6201 9957  99 EMail : mschmitt@... Web: http://www.baumerident.com > Urspr=FCngliche Nachricht > Von: Scott Dattalo [mailto:scott@...] > Gesendet: Mittwoch, 20. M=E4rz 2002 15:20 > Cc: SDCCUSER (EMail) > Betreff: Re: [Sdccuser] need a very fast 18bit / 15bit div >=20 >=20 > On Wed, 20 Mar 2002, Bernhard Held wrote: >=20 > > > Does anybody know where to find source of a very fast=20 > division of 18bit / > > > 15bit ? > > What processor? For the 8051you'll find assembler functions=20 > for 16/16 and > > 32/32 bit divisions in the library. It should be easy to=20 > adjust the source > > for 18/15 bit. Additionally you can eliminate the loop.=20 > AFAIK there is no > > faster algorithm for the 8051, if there's no further=20 > infomation about the > > distribution of the values. >=20 > Yeah, which processor? >=20 > There are always tricks one can play. A generic solution to=20 > this problem > would be to unroll the division loop. This saves decrementing a loop > counter and a branch. >=20 > If you know something about the form of the numbers, there=20 > are additional > tricks. For example, if your divisor is actually a constant, then = it's > often faster to multiply by it's reciprocal. Or if you know that the > upper bits are constant, then you can try this formula: >=20 > x/y =3D x / (y_up + y_lo) >=20 > =3D x / y_up * (1/(1 + y_lo/y_up)) >=20 > A series expansion of >=20 > 1 >  =3D 1  A + A^2  A^3 > 1 + A >=20 > May be then invoked. This is especially useful for the case=20 > when y_up is > way greater than y_lo because then only a few terms of the=20 > expansion are > needed. Intelligent partitioning can turn some of the=20 > division operations > into shift operations. >=20 > There are many, many tricks. But each trick depends on the specific > problem to be solved. >=20 > Scott >=20 >=20 > _______________________________________________ > Sdccuser mailing list > Sdccuser@... > https://lists.sourceforge.net/lists/listinfo/sdccuser >=20 
From: Schmitt Michael <mschmitt@ba...>  20020321 07:38:45

Good morning all, ... > latter variant. If i search a bit, i am sure i can find this algorithm > somewhere in my archives (probably in VHDL only). this is no problem at all, as my daily business is verilog ... And to Scott ... i know it's been now 2 years that i did my last update to the cvs and the OKR from Thorsten due to some privat issues ... (currently quitting my marriage and a very long time being unable to use my right thumb) but if my cvs und cygnus is up again ... hope sandeep has not quit my write rights to cvs ... Dipl.Ing. (FH) Michael Schmitt Baumer Ident GmbH Entwicklung / Development Department Hertzstr. 10 D69469 Weinheim Deutschland / Germany Tel. +49 (0) 6201 9957  30 Fax. +49 (0) 6201 9957  99 EMail : mschmitt@... Web: http://www.baumerident.com 
From: Scott Dattalo <scott@da...>  20020320 14:54:11

On Wed, 20 Mar 2002, Schmitt Michael wrote: > okay my fault ... my brain still things sdcc is a 8051 (only compiler, still > where sdcc startet as sdc a couple of years ago ....) > > i mean 8051 ... the problem is ... every machine cycle counts so it has to > be a realy very fast 18/15 bit div function. so the fastest function i have > takes about 70usec (assuming 24mhz or 0,5usec per machine cycle = 140 > machine cycles) .. anybody out there that can beat that ? > > isn't that a lot of fun ? spending hours and hours just to get rid of just 1 > machine cycle to get into the needed specs ? It's really, really sad, but some of us just thrive on this kind of fun. Now that we know the processor, what about the other part? Is there something special about the data? What's the application? And to keep things on topic, will this create something useful that we can put into the SDCC libraries? Scott 
From: <MSoegtrop@MichaelSoegtrop.de>  20020320 17:04:32

Another trick: if you unroll the loop, you can make use of the fact, that the various variables in the algorithm have different length in different parts of the algorithm. If you have a/b=c, a starts with 18 bits, but soon gets down to 16 bits and later to 8 bits. The result starts with 0 bits and increases to 18 bits on the run. Besides this, many division algorithms i have seen are not optimal: they first compare and then make a subtraction or not. It is a bit tricky, but more efficient to leave away the comparison and switch between subtracting and adding. This way there is only one multi byte operation per result bit. I don't know whether the SDCC library implementation is of the latter variant. If i search a bit, i am sure i can find this algorithm somewhere in my archives (probably in VHDL only). Michael 
From: Bernhard Held <bernhard@be...>  20020321 08:32:37

> Besides this, > many division algorithms i have seen are not optimal: they first compare > and then make a subtraction or not. It is a bit tricky, but more > efficient to leave away the comparison and switch between subtracting > and adding. This way there is only one multi byte operation per result > bit. I don't know whether the SDCC library implementation is of the > latter variant. I had a look at the "optimized" algorithm, but it has the disadvantage, that you have to store an intermediate result. There are too less registers to store this additional result in registers, so I had to use expensive moves on the stack. The result was, that the "optimized" algorithm was a little bit slower with an increased memory consumption. Bernhard 
From: Scott Dattalo <scott@da...>  20020321 13:37:29

On Thu, 21 Mar 2002, Bernhard Held wrote: > > Besides this, > > many division algorithms i have seen are not optimal: they first compare > > and then make a subtraction or not. It is a bit tricky, but more > > efficient to leave away the comparison and switch between subtracting > > and adding. This way there is only one multi byte operation per result > > bit. I don't know whether the SDCC library implementation is of the > > latter variant. > I had a look at the "optimized" algorithm, but it has the disadvantage, that > you have to store an intermediate result. There are too less registers to > store this additional result in registers, so I had to use expensive moves > on the stack. > > The result was, that the "optimized" algorithm was a little bit slower with > an increased memory consumption. Just a comment... The difference between compare/subtract versus subtract/restore falls into the category of restoring versus nonrestoring algorithms. Depending on the processor, algorithm, and the data, one implementation can favor the other. Almost all division algorithms I've seen are of the nonrestoring type (i.e. compare first and then subtract if the compare says you need to). This makes sense if the whole operation fits into the processor's registers. But for multibyte operations, a restoring algorithm can be more efficient. Nikolai Golovchenko has implemented a restoring division algorithm for the PIC that is faster than anything out there (except perhaps an unrolled loop). Here's his 24bit divide by 16bit: http://www.piclist.com/techref/microchip/math/div/24by16.htm Scott 
From: <MSoegtrop@MichaelSoegtrop.de>  20020321 20:42:38
Attachments:
=?usascii?Q?Michael_Sogtrop.vcf?=
Div.cpp

Dear Scott and Michael, there is one more option besides compare/subtract, subtract and restore by copy, subtract and restore by add, and this is subtract OR add. You make one add or subtract in every cycle and no restore or compare. This algorithm looks like this: int udivtst(int a, int b) { int c,q=0 b<<=(nBitsQ1); for(c=0; c<nBitsQ; c++) { if(a<0) a+=b; else a=b; q<<=1; q+=a>=0; b>>=1; } return q; } The compares a>=0 and a<0 can be extracted from the flags of the add or subtract in the previous cycle (a compare with 0 is not really a compare!). You need the same number and size of registers as for the compare/subtract algorithm, but you save the comparison. In the appended file i have a C test program, that demonstrates the algorithm. The implementation is intended for signal processing. If you make, say a 16Bit / 8Bit => 8 Bit division, the result will be saturated to the maximum value, if it doesn't fit in the result. A division by zero results in the maximum value as well. The modulus sign of the signed version is designed in such a way, that (a+b/2)/b gives int(round(double(a)/double(b))), regardless of the sign of a and b (i hope, it does what it is designed for). If the loop is unrolled, and the result is known to fit in q (for this reason, i don't recommend this), b and q can share 1 register, and the a>=0 and a<0 is extracted from the sign bit at a location in a, which is not messed up by the q bits in q:b. This is a different bit in every cycle, but this does not hurt on a processor with bit access. If you don't need the remainder, you can leave away the extra remainder fixing code in the file. As far as i know, this is the fastest general, integer exact "one bit at a time" division algorithm on most machines. Here an extract of a paper, in which i explained why this algorithm works: If the result is positive, the corresponding resultbit is a one, and a zero otherwise. Now, if the intermediate divisionrest once became negative, we have subtracted to much, but it turns out, that if the divisor is added rather than subtracted in the following bitposition, the effect is the same, as if the subtraction, which made the intermidiate divisionrest negative would not have been executed, and the divisor would have been subtracted in the next bit position. This is cause of the following equivalence: r  2^n d + 2^{n1} d = r  2^{n1} d This trickery does only work with binary numbers. If the 2 in the above equation is replaced by a different number, the equation no longer holds. The advantage of this trickery is, that we need not figure out, whether a subtraction is possible or not. In every step, either a subtract or an add is executed. I hope this helps Best regards, Michael Original Message From: sdccuseradmin@... [mailto:sdccuseradmin@...]On Behalf Of Scott Dattalo Sent: Thursday, March 21, 2002 2:37 PM Cc: Sdccuser@... Subject: Re: [Sdccuser] need a very fast 18bit / 15bit div On Thu, 21 Mar 2002, Bernhard Held wrote: > > Besides this, > > many division algorithms i have seen are not optimal: they first compare > > and then make a subtraction or not. It is a bit tricky, but more > > efficient to leave away the comparison and switch between subtracting > > and adding. This way there is only one multi byte operation per result > > bit. I don't know whether the SDCC library implementation is of the > > latter variant. > I had a look at the "optimized" algorithm, but it has the disadvantage, that > you have to store an intermediate result. There are too less registers to > store this additional result in registers, so I had to use expensive moves > on the stack. > > The result was, that the "optimized" algorithm was a little bit slower with > an increased memory consumption. Just a comment... The difference between compare/subtract versus subtract/restore falls into the category of restoring versus nonrestoring algorithms. Depending on the processor, algorithm, and the data, one implementation can favor the other. Almost all division algorithms I've seen are of the nonrestoring type (i.e. compare first and then subtract if the compare says you need to). This makes sense if the whole operation fits into the processor's registers. But for multibyte operations, a restoring algorithm can be more efficient. Nikolai Golovchenko has implemented a restoring division algorithm for the PIC that is faster than anything out there (except perhaps an unrolled loop). Here's his 24bit divide by 16bit: http://www.piclist.com/techref/microchip/math/div/24by16.htm Scott _______________________________________________ Sdccuser mailing list Sdccuser@... https://lists.sourceforge.net/lists/listinfo/sdccuser 
From: Scott Dattalo <scott@da...>  20020321 21:17:48

On Thu, 21 Mar 2002, Michael Sogtrop wrote: > there is one more option besides compare/subtract, subtract and restore Excellent! Scott 
From: <MSoegtrop@MichaelSoegtrop.de>  20020321 21:55:44
Attachments:
=?usascii?Q?Michael_Sogtrop.vcf?=

Thank you, Scott! It took me quite a while to figure this out, some years ago. If anybody tries this on a 8051, i would like to know, how fast it is compared to the "normal" algorithm. Best regards, Michael > Original Message > From: sdccuseradmin@... > Scott Dattalo > Sent: Thursday, March 21, 2002 10:18 PM > > > there is one more option besides compare/subtract, subtract > and restore > > Excellent! > > Scott > > > > > _______________________________________________ > Sdccuser mailing list > Sdccuser@... > https://lists.sourceforge.net/lists/listinfo/sdccuser 
From: Bernhard Held <bernhard@be...>  20020322 09:13:38

> there is one more option besides compare/subtract, subtract and restore > by copy, subtract and restore by add, and this is subtract OR add. You > make one add or subtract in every cycle and no restore or compare. This > algorithm looks like this: > > int udivtst(int a, int b) { > int c,q=0 > > b<<=(nBitsQ1); > for(c=0; c<nBitsQ; c++) { > if(a<0) a+=b; > else a=b; > q<<=1; > q+=a>=0; > b>>=1; > } > > return q; > } Sounds very interesting. But I stopped implementing a 16 / 16 bit = 16 bit division for SDCC at this line: b<<=(nBitsQ1); This shifts my 16 bit divisor by 15 bits to the left. To avoid loss of MSBs I would have to implement 31 (or 32) bit variables and operations (!) for b and a. This would be very ineffective on a 8051. Is this correct or am I wrong? Bernhard 
From: <MSoegtrop@MichaelSoegtrop.de>  20020322 10:00:33
Attachments:
=?iso88591?Q?Michael_S=F6gtrop.vcf?=

Dear Bernard, A 16/16>16 is a bit different from a 32/16>16 algorithm like the algorithm shown. I just forgot how the shifting in a 16/16>16 bit algorithm works (i did this about ten years ago). If i remember right, you start with a and b in the same position, and shift a left on the run, while successive digits of a or cancelled out. The add OR subtract is in general similar to the usual compare/subtract algorithm regarding shifting of the operands. You just replace the compare by the sign of a, and add instead of subtract, if a is negative. The algorithms are really the same, except that the subtract is just done without asking any questions, and if it was wrong to do so (a gets negative) the mess is fixed by adding instead of subtracting in the next cycle. E.g. if you make a wrong 2b in cycle 1, you fix this by making +b instead of b in cycle 2. 2b+b=b so this has the same effect as not doing the 2b in cycle 1 and doing the b in cycle 2 as you should (if this is wrong again, you fix it in the same way in cycle 3 ...). Otherwise the algorithm is identical to any kind of compare/subtract algorithm. The only problem is, that you must keep track, if a is negative or not, while shifting a to the left. It should work to shift the top bit of a into the carry bit, and then use the carry bit for the descision, if a has become negative. I tried this, but i just cannot remember all the details of the shifting in 16/16>16 division. Could you send me PC Ccode for the normal 16/16>16 algorithm. I am sure, i can convert this quickly to the sub or add algorithm Best regards, Michael > Original Message > From: Bernhard Held [mailto:bernhard@...] > Sent: Friday, March 22, 2002 10:12 AM > To: Michael Sogtrop; Sdccuser@... > Subject: Re: [Sdccuser] need a very fast 18bit / 15bit div > > > > there is one more option besides compare/subtract, subtract > and restore > > by copy, subtract and restore by add, and this is subtract > OR add. You > > make one add or subtract in every cycle and no restore or > compare. This > > algorithm looks like this: > > > > int udivtst(int a, int b) { > > int c,q=0 > > > > b<<=(nBitsQ1); > > for(c=0; c<nBitsQ; c++) { > > if(a<0) a+=b; > > else a=b; > > q<<=1; > > q+=a>=0; > > b>>=1; > > } > > > > return q; > > } > > Sounds very interesting. But I stopped implementing a 16 / 16 > bit = 16 bit > division for SDCC at this line: > b<<=(nBitsQ1); > This shifts my 16 bit divisor by 15 bits to the left. To > avoid loss of MSBs > I would have to implement 31 (or 32) bit variables and > operations (!) for b > and a. This would be very ineffective on a 8051. Is this > correct or am I > wrong? > > Bernhard > 