Thread: [Algorithms] fast pow() for limited inputs
From: Jeff R. <je...@8m...> - 2010-08-19 01:16:52
So I need to speed up the CRT pow() function a bit, but I have some restrictions on the input which hopefully give me some room to optimize: I need to compute pow(x, y), where x is on [0,1], and y is guaranteed positive with an upper bound in the neighborhood of 1000.

I need "reasonable" accuracy (it could be a little looser than the standard pow() implementation). I've searched online and found some bit-twiddling approaches that claim to be very fast, but they seem to be too inaccurate for my purposes. I've tried implementing pow() as exp( log(x) * y ), with my own cheap Taylor series in place of the natural log function. It did produce good output but wasn't very fast (slightly slower than the CRT pow()). It is probably worth mentioning, before anyone asks, that yes, I have confirmed pow() as the bottleneck with a profiling tool ;-)

I would also love to just see a sample implementation of pow(), log(), and exp() somewhere; even that might be helpful.

Thanks,
Jeff

--
Jeff Russell
Engineer, 8monkey Labs
www.8monkeylabs.com
From: Simon F. <sim...@po...> - 2010-08-19 05:52:19
> with my own cheap Taylor series in place of the natural log function.

This may be a silly question, but why did you use a natural log instead of a log base 2?

Cheers
Simon
From: Fabian G. <ry...@gm...> - 2010-08-19 06:36:04
On 18.08.2010 18:16, Jeff Russell wrote:
> So I need to speed up the CRT pow() function a bit, but I have some
> restrictions on the input which hopefully should give me some room to
> optimize: I need to compute pow(x,y), where x is on [0,1], and y is
> guaranteed positive and could have an upper bound in the neighborhood
> of 1000.

Unfortunately, x in [0,1] isn't a big restriction for pow; it covers roughly a quarter of all possible 32-bit floating-point numbers over a wide range of exponents. pow on [0,1] is, for all practical purposes, just as expensive as a full-range one.

What's a lot more interesting is the distribution of your x inside [0,1], and what type of error bound you need. Some background first: a full-precision pow tries to return a floating-point number that is as close as possible to the result you would get if you were doing the computation exactly and then rounding to the nearest representable FP number. The error of numerical approximations is usually specified in ulps (units in the last place). With IEEE floating point, addition, subtraction, multiplication and division are guaranteed to return results accurate to within 0.5 ulps (which is equivalent to saying that the results are the same as if they had been computed exactly and then rounded). For a variety of reasons, you can't get the same error bound on pow and a lot of other transcendental functions, but library implementations are usually on the order of 1 ulp or less, so they're pretty accurate.

Anyway, if you don't care about relative error, but rather need an absolute error bound, you can play fast and loose with small x (as long as y > 1) and the like.

> I need "reasonable" accuracy (could be a little looser than the
> standard pow() implementation). I've searched online and found some
> bit twiddling approaches that claim to be very fast, but they seem to
> be too inaccurate for my purposes. I've tried implementing pow() as
> exp( log(x) * y ), with my own cheap Taylor series in place of the
> natural log function. It did produce good output but wasn't very fast
> (slightly slower than the CRT pow()).

exp(log(x) * y) is already significantly worse than a "true" pow(x, y) in terms of relative error, and it has some extra roundoff steps that aren't directly visible: exp and log are usually implemented in terms of exp2 and log2 (since that matches better with the internal FP representation), so there are some invisible conversion factors in there.

Taylor series approximations are quite useless in practice. You can get a lot more precision for the same amount of work (or, equivalently, the same precision with a lot less work) by using optimized approximation polynomials (minimax polynomials are usually the weapon of choice, but you really need a computer algebra package if you want to build your own approximation polynomials). log is still a fairly tough nut to crack, however, since it's hard to approximate properly with polynomials.

Robin Green did a great GDC 2003 presentation on the subject; it's available here (transcendental functions are in the 2nd part, but you should still start at the beginning):

http://www.research.scea.com/gdc2003/fast-math-functions.html

Big spoiler: there's no really great general solution for pow that's significantly faster than what the CRT does.

> It is probably worth mentioning before anyone asks that yes I have
> confirmed pow() as the bottleneck with a profiling tool ;-)

A bit more context would help. What are you pow'ing, and what do you do with the results? General pow is hard, but for applications like rendering, you can usually get by with rather rough approximations.

> I would also love to just see a sample implementation of pow(), log(),
> and exp() somewhere, even that might be helpful.

A handy reference is... the CRT source code. VC++ doesn't come with source code for the math functions, though -- you can either just disassemble (e.g. with objdump) or look at another CRT implementation. glibc is a good candidate.

glibc math implementations are in sysdeps/ieee754 for generic IEEE-754-compliant platforms, with optimized versions for all relevant architectures in sysdeps/<arch>. If you really want to know how it's implemented :)

-Fabian
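The polynomial approach Fabian describes -- split off the exponent, then approximate log2 of the mantissa with a fitted low-degree polynomial -- can be sketched as follows. The coefficients below merely interpolate log2 at m = 1, 1.5 and 2 (absolute error under about 0.01 on [1,2)); they are a stand-in for illustration, NOT a tuned minimax fit, which would do better at the same degree:

```c
#include <math.h>

/* Rough log2 for x > 0: exact exponent from the FP representation,
   quadratic approximation of log2(m) for the mantissa m in [1,2).
   Coefficients interpolate log2 at m = 1, 1.5, 2 (not minimax). */
static float fast_log2(float x)
{
    int e;
    float m = 2.0f * frexpf(x, &e);  /* x = m * 2^(e-1), m in [1,2) */
    float p = (-0.33985f * m + 2.01955f) * m - 1.67970f;
    return (float)(e - 1) + p;
}
```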
From: Jeff R. <je...@8m...> - 2010-08-19 15:32:16
Thanks Fabian, good info there. I'll check out minimax polynomials, and as both you and Simon pointed out, base-2 log/exp would probably make more sense.

Jeff

--
Jeff Russell
Engineer, 8monkey Labs
www.8monkeylabs.com
From: Simon F. <sim...@po...> - 2010-08-19 15:45:10
Jeff,

I don't know how much accuracy you are really after, but Jim Blinn had a little article called "Floating-Point Tricks" in IEEE CG&A, Jul/Aug 1997, which has some log2 and exp2 approximations. It's also reprinted in his "Notation, Notation, Notation" book.

Cheers
Simon
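The family of tricks Blinn's article covers looks roughly like this: for positive IEEE floats, the raw bit pattern is approximately (log2(x) + 127) * 2^23, i.e. a piecewise-linear log2 for free. Scaling in that domain gives a very cheap, very rough pow. This is a reconstruction of the general idea, not Blinn's exact code, and errors run to several percent:

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Bit-trick pow for positive finite x: treat the float's bits as an
   approximate (log2(x) + 127) * 2^23, scale by y, convert back.
   Expect errors of a few percent -- "fast and wrong on purpose". */
static float blinn_pow(float x, float y)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    float logx = (float)bits * (1.0f / 8388608.0f) - 127.0f; /* ~log2(x) */
    float e = logx * y + 127.0f;
    if (e < 0.0f)
        e = 0.0f;               /* exponent underflow: flush to zero */
    bits = (uint32_t)(e * 8388608.0f);
    memcpy(&x, &bits, sizeof bits);
    return x;
}
```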
From: Robin G. <rob...@gm...> - 2010-08-19 17:57:13
On Wed, Aug 18, 2010 at 11:35 PM, Fabian Giesen <ry...@gm...> wrote:
>> I would also love to just see a sample implementation of pow(), log(),
>> and exp() somewhere, even that might be helpful.
>
> glibc math implementations are in sysdeps/ieee754 for generic IEEE-754
> compliant platforms, with optimized versions for all relevant
> architectures in sysdeps/<arch>. If you really want to know how it's
> implemented :)

What he said.

Also, take a look at the CEPHES library for platform-agnostic reference implementations of the C math functions, plus some extras like cotangent, cube root and integer powers:

http://www.netlib.org/cephes/

And here's an x86-specific implementation of powf() that claims to be faster (than what, it doesn't say):

http://www.xyzw.de/c190.html

- Robin Green.
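One of the CEPHES extras Robin mentions, integer powers, is worth singling out here: if y is (or can be rounded to) an integer -- and with y bounded near 1000 it may well be -- exponentiation by squaring needs at most about 10 multiplies and is exact up to FP rounding. A minimal sketch:

```c
/* Integer power by squaring: O(log n) multiplies.
   For n up to ~1000 that's at most ~10 squarings. */
static double powi(double x, unsigned int n)
{
    double r = 1.0;
    while (n) {
        if (n & 1u)
            r *= x;     /* fold in the current bit of n */
        x *= x;         /* x, x^2, x^4, x^8, ... */
        n >>= 1;
    }
    return r;
}
```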
From: Steve L. <sm...@go...> - 2010-08-19 16:35:09
Another option, even if you can't find a much faster approximation, is to unroll your loop 4 times and calculate 4 pows at the same time using SSE/VMX.
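Steve's 4-wide idea, sketched with SSE2 intrinsics on top of the rough bit-trick pow described elsewhere in this thread. Assumptions: positive finite inputs, a single shared exponent y, and the same few-percent accuracy as the scalar trick; pow4_approx is a made-up name for illustration, not an existing API:

```c
#include <emmintrin.h>
#include <math.h>

/* Four approximate pows at once: reinterpret bits as ~(log2(x)+127)*2^23,
   scale by y in log space, clamp underflow, convert back to float bits. */
static __m128 pow4_approx(__m128 x, float y)
{
    const __m128 scale   = _mm_set1_ps(1.0f / 8388608.0f);
    const __m128 unscale = _mm_set1_ps(8388608.0f);
    const __m128 bias    = _mm_set1_ps(127.0f);

    /* bits -> approximate log2(x) */
    __m128 logx = _mm_sub_ps(
        _mm_mul_ps(_mm_cvtepi32_ps(_mm_castps_si128(x)), scale), bias);

    /* y * log2(x), re-bias, flush exponent underflow to zero */
    __m128 e = _mm_add_ps(_mm_mul_ps(logx, _mm_set1_ps(y)), bias);
    e = _mm_max_ps(e, _mm_setzero_ps());

    /* back from log space to float bits */
    return _mm_castsi128_ps(_mm_cvtps_epi32(_mm_mul_ps(e, unscale)));
}
```

The same structure works with a better log2/exp2 approximation substituted for the bit trick, at proportionally higher cost.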
From: Fabian G. <ry...@gm...> - 2010-08-19 18:17:42
On 19.08.2010 10:57, Robin Green wrote:
> And here's an X86 specific implementation of powf() that claims to be
> faster (than what, it doesn't say):
>
> http://www.xyzw.de/c190.html

Now that's interesting :). I wrote most of that header file, around 2000 or so. It's faster than what used to be the standard pow() implementation on x86 (as in the VC++ 6.0 runtime library), which used fscale (that method is still used for sFExp below). This is all code for 64k intros, so it was optimized for size originally, but pow was a bottleneck during texture generation, and Agner Fog's version was 20-30% faster if I recall correctly. (This was back when P3s were the norm, though; no idea how it looks now.) The main change is to replace the fscale (which used to be very slow on some processors) with a longer code sequence that's faster.

The original code sequence used to be commented out before the "// faster pow" comment, but I guess that got removed at some point :).

Since VS2002 or 2003, the C library has contained a much better pow() implementation (using SSE on processors that support it) that should be faster than this code. It's also a lot bigger, though.

-Fabian
From: Jon W. <jw...@gm...> - 2010-08-21 04:44:10
Depending on accuracy needs, a look-up table with interpolation (say, cubic interpolation) can be just fine and dandy. Consider the 256x256 texture typically used to approximate pow() for specular lighting back in the day; it would be 65 kB in size and thus fit in L2 on a modern CPU (if you use bytes to represent that 0..1 range).

Another, even faster, and even worse, approximation is simply to make a line from (1,1) that intersects y=0 somewhere between x=0 and x=1, and move it farther to the right for higher exponents. It all depends on what kind of precision you need this for.

Sincerely,
jw
From: <chr...@pl...> - 2010-08-22 23:42:52
Jeff Russell <je...@8m...> wrote on 08/18/2010 06:16:17 PM:
> I need to compute pow(x,y), where x is on [0,1], and y is guaranteed
> positive and could have an upper bound in the neighborhood of 1000.

I've never used it, so I have no idea if Schlick's approximation fits that description, but FWIW:

http://ompf.org/forum/viewtopic.php?f=11&t=1402

Christer Ericson, Director of Tools and Technology
Sony Computer Entertainment, Santa Monica
http://realtimecollisiondetection.net/blog/
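For reference, the commonly cited form of Schlick's approximation replaces the power with a single rational function: x^n is approximated by x / (n - n*x + x). It is exact at x = 0 and x = 1, has the right general shape in between, and costs only one divide -- useful mainly for specular-style falloff where absolute accuracy matters little. A sketch:

```c
#include <math.h>

/* Schlick's rational approximation to x^n on [0,1]:
   exact at the endpoints, monotonic, one divide, no transcendentals. */
static float schlick_pow(float x, float n)
{
    return x / (n - n * x + x);
}
```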