Thanks Fabian, good info there.

I'll check out minimax polynomials, and as both you and Simon pointed out base 2 log/exp would probably make more sense.

Jeff

On Thu, Aug 19, 2010 at 1:35 AM, Fabian Giesen wrote:
On 18.08.2010 18:16, Jeff Russell wrote:
> So I need to speed up the CRT pow() function a bit, but I have some
> restrictions on the input which hopefully should give me some room to
> optimize:
>
> I need to compute pow(x,y), where x is on [0,1], and y is guaranteed
> positive and could have an upper bound in the neighborhood of 1000.

Unfortunately, x in [0,1] isn't a big restriction for pow; it covers
roughly a quarter of all possible 32-bit floating point numbers over a
wide range of exponents. pow on [0,1] is, for all practical purposes,
just as expensive as a full-range one.

What's a lot more interesting is the distribution of your x inside
[0,1], and what type of error bound you need. Some background first: A
full-precision pow tries to return a floating-point number that is as
close as possible to the result you would get if you were doing the
computation exactly then rounding to the nearest representable FP
number. The error of numerical approximations is usually specified in
ulps (units in the last place). With IEEE floating point, addition,
subtraction, multiplication and division are guaranteed to return
results accurate to within 0.5ulps (which is equivalent to saying that
the results are the same as if they had been performed exactly then
rounded). For a variety of reasons, you can't get the same error bound
on pow and a lot of other transcendental functions, but library
implementations are usually on the order of 1ulp or less, so they're
pretty accurate.

Anyway, if you don't care about relative error, but rather need an
absolute error bound, you can play fast and loose with small x (as long
as y > 1) and the like.

> I need "reasonable" accuracy (could be a little looser than the standard
> pow() implementation). I've searched online and found some bit twiddling
> approaches that claim to be very fast, but they seem to be too
> inaccurate for my purposes. I've tried implementing pow() as exp(
> log(x), y ), with my own cheap Taylor series in place of the natural log
> function. It did produce good output but wasn't very fast (slightly

exp(log(x) * y) is already significantly worse than a "true" pow(x, y)
in terms of relative error, and it has some extra roundoff steps that
aren't directly visible: exp and log are usually implemented in terms of
exp2 and log2 (since that matches better with the internal FP
representation), so there's some invisible conversion factors in there.

Taylor series approximations are quite useless in practice. You can get
a lot more precision for the same amount of work (or, equivalently, the
same precision with a lot less work) by using optimized approximation
polynomials (Minimax polynomials are usually the weapon of choice, but
you really need a Computer Algebra package if you want to build your own
approximation polynomials). log is still a fairly tough nut to crack
however, since it's hard to approximate properly with polynomials.

Robin Green did a great GDC 2003 presentation on the subject, it's
available here: (transcendental functions are in the 2nd part, but you
should still start at the beginning)

http://www.research.scea.com/gdc2003/fast-math-functions.html

Big spoiler: There's no really great general solution for pow that's
significantly faster than what the CRT does.

> slower than the CRT pow()). It is probably worth mentioning before
> anyone asks that yes I have confirmed pow() as the bottleneck with a
> profiling tool ;-)

A bit more context would help. What are you pow'ing, and what do you do
with the results? General pow is hard, but for applications like
rendering, you can usually get by with rather rough approximations.

> I would also love to just see a sample implementation of pow(), log(),
> and exp() somewhere, even that might be helpful.

A handy reference is... the CRT source code. VC++ doesn't come with
source code to the math functions though - you can either just
disassemble (e.g. with objdump) or look at another CRT implementation.
glibc is a good candidate.

glibc math implementations are in sysdeps/ieee754 for generic IEEE-754
compliant platforms, with optimized versions for all relevant
architectures in sysdeps/<arch>. If you really want to know how it's
implemented :)

-Fabian

------------------------------------------------------------------------------
This SF.net email is sponsored by

Make an app they can't live without
Enter the BlackBerry Developer Challenge
http://p.sf.net/sfu/RIM-dev2dev
_______________________________________________
GDAlgorithms-list mailing list
GDAlgorithms-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
Archives:
http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list

--
Jeff Russell
Engineer, 8monkey Labs
www.8monkeylabs.com