|
From: <co...@ph...> - 2006-02-14 23:12:30
|
[changed subject to reflect this better]
Tim Hochberg <tim...@co...> writes:
> David M. Cooke wrote:
>>Tim Hochberg <tim...@co...> writes:
>>>David M. Cooke wrote:
>>> 2, Don't distinguish between scalars and arrays -- that just makes
>>>things harder to explain.
>>Makes the optimizations better, though.
>>
>>
> Ah, Because you can hoist all the checks for what type of optimization
> to do, if any, out of the core loop, right? That's a good point. Still
> I'm not keen on a**b having different performance *and* different
> results depending on whether b is a scalar or matrix. The first thing
> to do is to measure how much overhead doing the optimization element
> by element is going to add. Assuming that it's signifigant that leaves
> us with the familiar dilema: fast, simple or general purpose; pick any
> two.
>
> 1. Do what I've proposed: optimize things at the c_pow level. This is
> general purpose and relatively simple to implement (since we can steal
> most of the code from complexobject.c). It may have a signifigant
> speed penalty versus 2 though:
>
> 2. Do what you've proposed: optimize things at the ufunc level. This
> fast and relatively simple to implement. It's more limited in scope
> and a bit harder to explain than 2.
>
> 3. Do both. This is straightforward, but adds a bunch of extra code
> paths with all the attendant required testing and possibility for
> bugs. So, fast, general purpose, but not simple.
Start with #1, then try #2. The problem with #2 is that you still have
to include #1: if you're doing x**y when y is an array, then you have
do if (y==2) etc. checks in your inner loop anyways. In that case, you
might as well do it in nc_pow. At that point, it may be better to move
the #1 optimization to the level of x.__pow__ (see below).
>>The speed comes from the inlining of how to do the power _outside_ the
>>inner loop. The reason x**2, etc. are slower currently is there is a
>>function call in the inner loop. Your's and mine C library's pow()
>>function mostly likely does something like I have above, for a single
>>case: pow(x, 2.0) is calculated as x*x. However, each time through it
>>has do decide _how_ to do it.
>>
>>
> Part of our difference in perspective comes from the fact that I've
> just been staring at the guts of complex power. In this case you
> always have function calls at present, even for s*s. (At least I'm
> fairly certain that doesn't get inlined although I haven't checked).
> Since much of the work I do is with complex matrices, it's
> appropriate that I focus on this.
Ah, ok, now things are clicking. Complex power is going to be harder,
because making sure that going from x**2.001 to x**2 doesn't do some
funny complex branch cut stuff (I work in reals all the time :-)
For the real numbers, these type of optimizations *are* a big win, and
don't have the same type of continuity problems. I'll put them into numpy soon.
> Have you measured the effect of a function call on the speed here, or
> is that just an educated guess. If it's an educated guess, it's
> probably worth determining how much of speed hit the function call
> actually causes. I was going to try to get a handle on this by
> comparing multiplication of Complex numbers (which requires a function
> call plus more math), with multiplication of Floats which does not.
> Perversly, the Complex multiplication came out marginally faster,
> which is hard to explain whichever way you look at it.
>
>>>> timeit.Timer("a*b", "from numpy import arange; a =
> arange(10000)+0j; b = arange(10000)+0j").time
> it(10000)
> 3.2974959107959876
>>>> timeit.Timer("a*b", "from numpy import arange; a = arange(10000);
>>>> b
> = arange(10000)").timeit(100
> 00)
> 3.4541194481425919
You're not multiplying floats in the last one: you're multiplying
integers. You either need to use a = arange(10000.0), or a =
arange(10000.0, dtype=float) (to be more specific).
Your integer numbers are about 3x better than mine, though (difference
in architecture, maybe? I'm on an Athlon64).
>>That's why I limited the optimization to scalar exponents: array
>>exponents would mean it's about as slow as the pow() call, even if the
>>checks were inlined into the loop. It would probably be even slower
>>for the non-optimized case, as you'd check for the special exponents,
>>then call pow() if it fails (which would likely recheck the exponents).
>>
>>
> Again, here I'm thinking of the complex case. In that case at least, I
> don't think that the non-optimized case would take a noticeable speed
> hit. I would put it into pow itself, which already special cases a==0
> and b==0. For float pow it might, but that's already slow, so I doubt
> that it would make much difference.
It does make a bit of difference with float pow: the general case
slows down a bit.
>>Maybe a simple way to add this is to rewrite x.__pow__() as something
>>like the C equivalent of
>>
>>def __pow__(self, p):
>> if p is not a scalar:
>> return power(self, p)
>> elif p == 1:
>> return p
>> elif p == 2:
>> return square(self)
>> elif p == 3:
>> return cube(self)
>> elif p == 4:
>> return power_4(self)
>> elif p == 0:
>> return ones(self.shape, dtype=self.dtype)
>> elif p == -1:
>> return 1.0/self
>> elif p == 0.5:
>> return sqrt(self)
>>
>>and add ufuncs square, cube, power_4 (etc.).
>
> It sounds like we need to benchmark some stuff and see what we come up
> with. One approach would be for each of us to implement this for one
> time (say float) and see how the approaches compare speed wise. That's
> not entirely fair as my approach will do much better at complex than
> float I believe, but it's certainly easier.
The way the ufuncs are templated, we can split out the complex
routines easily enough.
Here's what I propose:
- add a square() ufunc, where square(x) == x*x (but faster of course)
- I'll fiddle with the floats
- you fiddle with the complex numbers :-)
I've created a new branch in svn, at
http://svn.scipy.org/svn/numpy/branches/power_optimization
to do this fiddling. The changes below I mention are all checked in as
revision 2104 (http://projects.scipy.org/scipy/numpy/changeset/2104).
I've added a square() ufunc to the power_optimization branch because
I'd argue that it's probably *the* most common use of **. I've
implemented it, and it's as fast as a*a for reals, and runs in 2/3 the
time as a*a for complex (which makes sense: squaring a complex numbers
has 3 real multiplications, while multiplying has 4 in the (simple)
scheme [1]).
At least with square(), there's no argument about continuity, as it
only squares :-).
The next step I'd suggest is special-casing x.__pow__, like I suggest
above. We could just test for integer scalar exponents (0, 1, 2), and
just special-case those (returning ones(), x.copy(), square(x)), and
leave all the rest to power().
I've also checked in code to the power_optimization branch that
special cases power(x, <scalar exponent>), or anytime the basic ufunc
gets called with a stride of 0 for the exponent. It doesn't do complex
x, so no problems on your side, but it's a good chunk faster for this
case than what we've got now. One reason I'm also looking at adding
square() is because my optimization of power() makes x**2 run (only)
1.5 slower than x*x (and I can't for the life of me see where that 0.5
is coming from! It should be 1.0 like square()!).
[1] which brings up another point. Would using the 3-multiplication
version for complex multiplication be good? There might be some
effects with cancellation errors due to the extra subtractions...
--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/
|co...@ph...
|