Re: [Algorithms] fast pow() for limited inputs

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

On 18.08.2010 18:16, Jeff Russell wrote:
> So I need to speed up the CRT pow() function a bit, but I have some
> restrictions on the input which hopefully should give me some room to
> optimize:
>
> I need to compute pow(x,y), where x is on [0,1], and y is guaranteed
> positive and could have an upper bound in the neighborhood of 1000.

Unfortunately, x in [0,1] isn't a big restriction for pow; it covers 
roughly a quarter of all possible 32-bit floating point numbers over a 
wide range of exponents. pow on [0,1] is, for all practical purposes, 
just as expensive as a full-range one.

What's a lot more interesting is the distribution of your x inside 
[0,1], and what type of error bound you need. Some background first: A 
full-precision pow tries to return a floating-point number that is as 
close as possible to the result you would get if you were doing the 
computation exactly then rounding to the nearest representable FP 
number. The error of numerical approximations is usually specified in 
ulps (units in the last place). With IEEE floating point, addition, 
subtraction, multiplication and division are guaranteed to return 
results accurate to within 0.5ulps (which is equivalent to saying that 
the results are the same as if they had been performed exactly then 
rounded). For a variety of reasons, you can't get the same error bound 
on pow and a lot of other transcendental functions, but library 
implementations are usually on the order of 1ulp or less, so they're 
pretty accurate.

Anyway, if you don't care about relative error, but rather need an 
absolute error bound, you can play fast and loose with small x (as long 
as y > 1) and the like.

> I need "reasonable" accuracy (could be a little looser than the standard
> pow() implementation). I've searched online and found some bit twiddling
> approaches that claim to be very fast, but they seem to be too
> inaccurate for my purposes. I've tried implementing pow() as exp(
> log(x), y ), with my own cheap Taylor series in place of the natural log
> function. It did produce good output but wasn't very fast (slightly

exp(log(x) * y) is already significantly worse than a "true" pow(x, y) 
in terms of relative error, and it has some extra roundoff steps that 
aren't directly visible: exp and log are usually implemented in terms of 
exp2 and log2 (since that matches better with the internal FP 
representation), so there's some invisible conversion factors in there.

Taylor series approximations are quite useless in practice. You can get 
a lot more precision for the same amount of work (or, equivalently, the 
same precision with a lot less work) by using optimized approximation 
polynomials (Minimax polynomials are usually the weapon of choice, but 
you really need a Computer Algebra package if you want to build your own 
approximation polynomials). log is still a fairly tough nut to crack 
however, since it's hard to approximate properly with polynomials.

Robin Green did a great GDC 2003 presentation on the subject, it's 
available here: (transcendental functions are in the 2nd part, but you 
should still start at the beginning)

   http://www.research.scea.com/gdc2003/fast-math-functions.html

Big spoiler: There's no really great general solution for pow that's 
significantly faster than what the CRT does.

> slower than the CRT pow()). It is probably worth mentioning before
> anyone asks that yes I have confirmed pow() as the bottleneck with a
> profiling tool ;-)

A bit more context would help. What are you pow'ing, and what do you do 
with the results? General pow is hard, but for applications like 
rendering, you can usually get by with rather rough approximations.

> I would also love to just see a sample implementation of pow(), log(),
> and exp() somewhere, even that might be helpful.

A handy reference is... the CRT source code. VC++ doesn't come with 
source code to the math functions though - you can either just 
disassemble (e.g. with objdump) or look at another CRT implementation. 
glibc is a good candidate.

glibc math implementations are in sysdeps/ieee754 for generic IEEE-754 
compliant platforms, with optimized versions for all relevant 
architectures in sysdeps/<arch>. If you really want to know how it's 
implemented :)

-Fabian