From: Fabian Giesen <ryg@gm...>  20100819 06:36:04

On 18.08.2010 18:16, Jeff Russell wrote: > So I need to speed up the CRT pow() function a bit, but I have some > restrictions on the input which hopefully should give me some room to > optimize: > > I need to compute pow(x,y), where x is on [0,1], and y is guaranteed > positive and could have an upper bound in the neighborhood of 1000. Unfortunately, x in [0,1] isn't a big restriction for pow; it covers roughly a quarter of all possible 32bit floating point numbers over a wide range of exponents. pow on [0,1] is, for all practical purposes, just as expensive as a fullrange one. What's a lot more interesting is the distribution of your x inside [0,1], and what type of error bound you need. Some background first: A fullprecision pow tries to return a floatingpoint number that is as close as possible to the result you would get if you were doing the computation exactly then rounding to the nearest representable FP number. The error of numerical approximations is usually specified in ulps (units in the last place). With IEEE floating point, addition, subtraction, multiplication and division are guaranteed to return results accurate to within 0.5ulps (which is equivalent to saying that the results are the same as if they had been performed exactly then rounded). For a variety of reasons, you can't get the same error bound on pow and a lot of other transcendental functions, but library implementations are usually on the order of 1ulp or less, so they're pretty accurate. Anyway, if you don't care about relative error, but rather need an absolute error bound, you can play fast and loose with small x (as long as y > 1) and the like. > I need "reasonable" accuracy (could be a little looser than the standard > pow() implementation). I've searched online and found some bit twiddling > approaches that claim to be very fast, but they seem to be too > inaccurate for my purposes. I've tried implementing pow() as exp( > log(x), y ), with my own cheap Taylor series in place of the natural log > function. It did produce good output but wasn't very fast (slightly exp(log(x) * y) is already significantly worse than a "true" pow(x, y) in terms of relative error, and it has some extra roundoff steps that aren't directly visible: exp and log are usually implemented in terms of exp2 and log2 (since that matches better with the internal FP representation), so there's some invisible conversion factors in there. Taylor series approximations are quite useless in practice. You can get a lot more precision for the same amount of work (or, equivalently, the same precision with a lot less work) by using optimized approximation polynomials (Minimax polynomials are usually the weapon of choice, but you really need a Computer Algebra package if you want to build your own approximation polynomials). log is still a fairly tough nut to crack however, since it's hard to approximate properly with polynomials. Robin Green did a great GDC 2003 presentation on the subject, it's available here: (transcendental functions are in the 2nd part, but you should still start at the beginning) http://www.research.scea.com/gdc2003/fastmathfunctions.html Big spoiler: There's no really great general solution for pow that's significantly faster than what the CRT does. > slower than the CRT pow()). It is probably worth mentioning before > anyone asks that yes I have confirmed pow() as the bottleneck with a > profiling tool ;) A bit more context would help. What are you pow'ing, and what do you do with the results? General pow is hard, but for applications like rendering, you can usually get by with rather rough approximations. > I would also love to just see a sample implementation of pow(), log(), > and exp() somewhere, even that might be helpful. A handy reference is... the CRT source code. VC++ doesn't come with source code to the math functions though  you can either just disassemble (e.g. with objdump) or look at another CRT implementation. glibc is a good candidate. glibc math implementations are in sysdeps/ieee754 for generic IEEE754 compliant platforms, with optimized versions for all relevant architectures in sysdeps/<arch>. If you really want to know how it's implemented :) Fabian 