From: Kenny O. <kor...@id...> - 2006-10-03 20:02:33
|
Excuse my laziness for not looking this up; I googled it but could not find a solution. Matlab has an isreal(array) that returns 1 if the array is full of real numbers. I'm translating Matlab code and ran across: if ~isreal(array), array = abs(array). Is there a way to check whether a number is real or complex? And if so, is there a way to extract the (a + ib) parts, since the absolute value of a complex number is like the Pythagorean theorem on a and b? Thanks for your help, Kenny |
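For reference, a minimal sketch of one way to express that Matlab idiom with NumPy's type-checking helpers (an illustration for context, not an answer from the thread; the array name is made up):

import numpy as N

arr = N.array([3.0, 4.0 + 2.0j])

# Rough analogue of Matlab's isreal(): isrealobj() is True only when the
# array's dtype has no complex part; iscomplexobj() is its negation.
if not N.isrealobj(arr):
    a = arr.real      # the "a" in a + ib
    b = arr.imag      # the "b" in a + ib
    # abs() of a complex number is sqrt(a**2 + b**2), i.e. the
    # Pythagorean theorem applied to the real and imaginary parts.
    arr = N.abs(arr)  # same result as N.sqrt(a**2 + b**2)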
From: Travis O. <oli...@ee...> - 2006-10-03 17:47:55
|
Tim Hochberg wrote:
> Francesc Altet wrote:
>> Hi,
>> I thought that numpy.isscalar was a good way of distinguishing a numpy scalar from a python scalar, but it seems not:
>> >>> numpy.isscalar(numpy.string_('3'))
>> True
>> >>> numpy.isscalar('3')
>> True
>> Is there an easy (and fast, if possible) way to check whether an object is a numpy scalar or a python one?
>
> It looks like isinstance(x, numpy.generic) works, but I didn't test it extensively.

That should definitely work. All the array scalars are in a hierarchy inheriting from numpy.generic. There are also sub-levels (numpy.integer, numpy.inexact, etc...)

-Travis |
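A small sketch of what that check looks like in practice (the helper name here is made up for illustration):

import numpy

def is_numpy_scalar(x):
    # Array scalars (numpy.float64, numpy.int32, numpy.string_, ...) all
    # inherit from numpy.generic; plain Python scalars do not.
    return isinstance(x, numpy.generic)

print is_numpy_scalar(numpy.string_('3'))   # True
print is_numpy_scalar('3')                  # False
print is_numpy_scalar(numpy.float64(1.0))   # True
print is_numpy_scalar(1.0)                  # False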
From: Travis O. <oli...@ee...> - 2006-10-03 17:47:09
|
Sebastian Haase wrote:
> Hi,
> I noticed the underscore in "numpy.string_" ...
> I thought the underscores were removed in favour of handling the "from numpy import *" separately via the __all__ variable.
> Is this done only for *some* members of numpy ?

Yeah. The underscores are still on the conflicting type names.

-Travis |
From: Russell E. O. <ro...@ce...> - 2006-10-03 17:35:49
|
In article <452...@st...>, Todd Miller <jm...@st...> wrote:
> Russell E. Owen wrote:
>> Has anyone tried to build numarray or Numeric with python 2.5? Since the
>
> numarray builds and runs fine for Python-2.5, but only as 32-bit.
>
>> type of array index was changed, I'm wondering if it works (and if so,
>
> NOTE: numarray does not exploit the new ssize_t index in Python-2.5 so numarray is not 64-bit enabled.

Thank you very much. That's great. It means my current code is compatible with Python 2.5 (assuming Numeric works the same way, which seems likely). I am really looking forward to switching to numpy, but now is just a bit too early.

-- Russell |
From: Tim H. <tim...@ie...> - 2006-10-03 17:27:27
|
Francesc Altet wrote:
> Hi,
> I thought that numpy.isscalar was a good way of distinguishing a numpy scalar from a python scalar, but it seems not:
> >>> numpy.isscalar(numpy.string_('3'))
> True
> >>> numpy.isscalar('3')
> True
> Is there an easy (and fast, if possible) way to check whether an object is a numpy scalar or a python one?

It looks like isinstance(x, numpy.generic) works, but I didn't test it extensively.

-tim |
From: Sebastian H. <ha...@ms...> - 2006-10-03 17:24:31
|
Hi,
I noticed the underscore in "numpy.string_" ... I thought the underscores were removed in favour of handling the "from numpy import *" separately via the __all__ variable. Is this done only for *some* members of numpy ?
(sorry for hijacking your thread ...)

-Sebastian Haase

On Tuesday 03 October 2006 09:55, Francesc Altet wrote:
> Hi,
> I thought that numpy.isscalar was a good way of distinguishing a numpy scalar from a python scalar, but it seems not:
> >>> numpy.isscalar(numpy.string_('3'))
> True
> >>> numpy.isscalar('3')
> True
> Is there an easy (and fast, if possible) way to check whether an object is a numpy scalar or a python one?
> Thanks, |
From: Kenny O. <kor...@id...> - 2006-10-03 17:03:17
|
Yeah, thanks guys. I printed out the sizes and realized what it was doing, then used a for loop to put it into the 1D array. But thanks Travis, the way you said makes it easier than the for loop. -Kenny |
From: Francesc A. <fa...@ca...> - 2006-10-03 16:55:40
|
Hi,

I thought that numpy.isscalar was a good way of distinguishing a numpy scalar from a python scalar, but it seems not:

>>> numpy.isscalar(numpy.string_('3'))
True
>>> numpy.isscalar('3')
True

Is there an easy (and fast, if possible) way to check whether an object is a numpy scalar or a python one?

Thanks,

--
>0,0<  Francesc Altet     http://www.carabos.com/
 V V   Cárabos Coop. V.   Enjoy Data
  "-" |
From: Tim H. <tim...@ie...> - 2006-10-03 16:24:26
|
David Cournapeau wrote: > Hi, > > I was wondering if there was any way to speed up the following code: > > y = N.zeros((n, K)) > for i in range(K): > y[:, i] = gauss_den(data, mu[i, :], va[i, :]) > > Where K is of order 1e1, n of order 1e5. Normally, gauss_den is a quite > expensive function, but the profiler tells me that the indexing y[:,i] > takes almost as much time as the gauss_den computation (which computes n > exp !). To see if the profiler is "right", i replaces with the (non > valid) following function: > > y = N.zeros((n, K)) > for i in range(K): > yt = gauss_den(data, mu[i, :], va[i, :]) > return y > > Where more than 99% of the code is spent inside gauss_den. > > I guess the problem is coming from the fact that y being C order, y[:, > i] needs accessing data in a non 'linear' way. Is there a way to speed > this up ? I did something like this: > > y = N.zeros((K, n)) > for i in range(K): > y[i] = gauss_den(data, mu[i, :], va[i, :]) > return y.T > > which works, but I don't like it very much. Why not? > Isn't there any other way That depends on the details of gauss_den. A general note: for this kind of microoptimization puzzle, it's much easier to help if you can post a self contained example, preferably something fairly simple that still illustrates the speed issue, that we can experiment with. -tim |
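In that spirit, a self-contained sketch of the indexing question (gauss_den is replaced by a cheap stand-in here, so only the assignment cost is being compared; treat the numbers as illustrative):

import timeit
import numpy as N

n, K = 100000, 10

def fake_den(data, i):
    # Stand-in for gauss_den: deliberately cheap so that the y[:, i] vs.
    # y[i] assignment cost dominates the timing.
    return data * i

def fill_columns(data):
    y = N.zeros((n, K))
    for i in range(K):
        y[:, i] = fake_den(data, i)   # strided (non-contiguous) writes
    return y

def fill_rows(data):
    y = N.zeros((K, n))
    for i in range(K):
        y[i] = fake_den(data, i)      # contiguous writes
    return y.T                        # the transpose is just a view

data = N.random.rand(n)
print timeit.Timer('fill_columns(data)', 'from __main__ import fill_columns, data').timeit(10)
print timeit.Timer('fill_rows(data)', 'from __main__ import fill_rows, data').timeit(10)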
From: Tim H. <tim...@ie...> - 2006-10-03 16:16:12
|
kor...@id... wrote: > Microsoft Windows XP [Version 5.1.2600] > (C) Copyright 1985-2001 Microsoft Corp. > > C:\Documents and Settings\kenny>cd c:\lameness > > C:\lameness>c:\Python24\python.exe templatewindow.py > a = [[ 1.00013175e+00 2.63483019e-05 1.00006740e+00 5.22246363e-05 > 1.00008735e+00 -2.77694969e-07 -1.30756273e-07 1.03202730e-06] > [ 2.63483019e-05 6.95644927e-05 -7.15426839e-07 1.99534228e-05 > 5.29400631e-05 -3.07638369e-09 -5.52729618e-06 -7.61431767e-06] > [ 1.00006740e+00 -7.15426839e-07 1.00011917e+00 2.50407751e-05 > 1.00006219e+00 -2.77757947e-07 -1.30856101e-07 1.23058301e-07] > [ 5.22246363e-05 1.99534228e-05 2.50407751e-05 8.21505582e-05 > 5.26966037e-05 -6.13563429e-09 -2.76420755e-06 -3.80791858e-06] > [ 1.00008735e+00 5.29400631e-05 1.00006219e+00 5.26966037e-05 > 1.00020132e+00 5.01389982e-05 3.45959412e-05 3.17129503e-05] > [ -2.77694969e-07 -3.07638369e-09 -2.77757947e-07 -6.13563429e-09 > 5.01389982e-05 3.45959412e-05 3.17129503e-05 3.27035490e-05] > [ -1.30756273e-07 -5.52729618e-06 -1.30856101e-07 -2.76420755e-06 > 3.45959412e-05 3.17129503e-05 3.27035490e-05 3.59732704e-05] > [ 1.03202730e-06 -7.61431767e-06 1.23058301e-07 -3.80791858e-06 > 3.17129503e-05 3.27035490e-05 3.59732704e-05 4.12184645e-05]] > F = [[ -1.44231014e+03] > [ -7.54006917e-02] > [ -1.44227222e+03] > [ -7.49199956e-02] > [ -1.44242446e+03] > [ -4.24684780e-02] > [ -2.49566072e-02] > [ -2.16978637e-02]] > Traceback (most recent call last): > File "C:\lameness\PlotPanel.py", line 476, in OnNo > self.parent.peaktime() > File "C:\lameness\PlotPanel.py", line 64, in peaktime > self._replot() > File "C:\lameness\PlotPanel.py", line 159, in _replot > self.drawpeak() > File "C:\lameness\PlotPanel.py", line 233, in drawpeak > COE[:,j1-1] = linalg.solve(A,F) > ValueError: object too deep for desired array > > Anyone know what this error could be coming from? Is it because the > floating point numbers are too large, memory wise?\ > No, this error has to do with the number of dimensions being wrong somewhere. Try printing out the shapes of COE[:,j-1] and linalg.solve(A,F); you'll probably see that they don't match. -tim |
From: Travis O. <oli...@ee...> - 2006-10-03 16:14:33
|
kor...@id... wrote: >Microsoft Windows XP [Version 5.1.2600] >(C) Copyright 1985-2001 Microsoft Corp. > >C:\Documents and Settings\kenny>cd c:\lameness > >C:\lameness>c:\Python24\python.exe templatewindow.py >a = [[ 1.00013175e+00 2.63483019e-05 1.00006740e+00 5.22246363e-05 > 1.00008735e+00 -2.77694969e-07 -1.30756273e-07 1.03202730e-06] > [ 2.63483019e-05 6.95644927e-05 -7.15426839e-07 1.99534228e-05 > 5.29400631e-05 -3.07638369e-09 -5.52729618e-06 -7.61431767e-06] > [ 1.00006740e+00 -7.15426839e-07 1.00011917e+00 2.50407751e-05 > 1.00006219e+00 -2.77757947e-07 -1.30856101e-07 1.23058301e-07] > [ 5.22246363e-05 1.99534228e-05 2.50407751e-05 8.21505582e-05 > 5.26966037e-05 -6.13563429e-09 -2.76420755e-06 -3.80791858e-06] > [ 1.00008735e+00 5.29400631e-05 1.00006219e+00 5.26966037e-05 > 1.00020132e+00 5.01389982e-05 3.45959412e-05 3.17129503e-05] > [ -2.77694969e-07 -3.07638369e-09 -2.77757947e-07 -6.13563429e-09 > 5.01389982e-05 3.45959412e-05 3.17129503e-05 3.27035490e-05] > [ -1.30756273e-07 -5.52729618e-06 -1.30856101e-07 -2.76420755e-06 > 3.45959412e-05 3.17129503e-05 3.27035490e-05 3.59732704e-05] > [ 1.03202730e-06 -7.61431767e-06 1.23058301e-07 -3.80791858e-06 > 3.17129503e-05 3.27035490e-05 3.59732704e-05 4.12184645e-05]] >F = [[ -1.44231014e+03] > [ -7.54006917e-02] > [ -1.44227222e+03] > [ -7.49199956e-02] > [ -1.44242446e+03] > [ -4.24684780e-02] > [ -2.49566072e-02] > [ -2.16978637e-02]] >Traceback (most recent call last): > File "C:\lameness\PlotPanel.py", line 476, in OnNo > self.parent.peaktime() > File "C:\lameness\PlotPanel.py", line 64, in peaktime > self._replot() > File "C:\lameness\PlotPanel.py", line 159, in _replot > self.drawpeak() > File "C:\lameness\PlotPanel.py", line 233, in drawpeak > COE[:,j1-1] = linalg.solve(A,F) >ValueError: object too deep for desired array > >Anyone know what this error could be coming from? Is it because the >floating point numbers are too large, memory wise? > > No, it's because you are trying to fit a 2-d array into a 1-d position. linalg.solve(A,F) produces a 2-d array in this instance (because F is a 2-d array). COE[:,j1-1:] =linalg.solve(A,F) should work or squeezing the result of the solution to a 1-d array would also work. -Travis |
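A minimal reproduction of that shape mismatch and the two fixes (the array names and sizes here are made up):

import numpy as N
from numpy import linalg

A = N.random.rand(8, 8)
F = N.random.rand(8, 1)      # 2-d right-hand side, shape (8, 1)
COE = N.zeros((8, 5))

x = linalg.solve(A, F)       # because F is 2-d, x is 2-d too: shape (8, 1)

# COE[:, 0] = x              # fails: an (8, 1) array doesn't fit an (8,) slot
COE[:, 0:1] = x              # fix 1: assign into a 2-d column slice
COE[:, 1] = x.squeeze()      # fix 2: squeeze the solution down to 1-d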
From: Tim H. <tim...@ie...> - 2006-10-03 16:12:54
|
Travis Oliphant wrote: > Albert Strasheim wrote: > >>>> [1] 12.97% of function time >>>> [2] 8.65% of functiont ime >>>> [3] 62.14% of function time >>>> >>>> If statistics from elsewhere in the code would be helpful, let me >>>> know, >>>> >>>> >>> and >>> >>> >>>> I'll see if I can convince Quantify to cough it up. >>>> >>>> >>>> >>> Please run the same test but using >>> >>> x1 = N.random.rand(39,2000) >>> x2 = N.random.rand(39,64,1) >>> >>> z1 = x1[:,N.newaxis,:] - x2 >>> >>> >> Very similar results to what I had previously: >> >> [1] 10.88% >> [2] 7.25% >> [3] 68.25% >> >> >> > Thanks, > > I've got some ideas about how to speed this up by eliminating some of > the unnecessary calculations going on outside of the function loop, but > there will still be some speed issues depending on how the array is > traversed once you get above a certain size. I'm not sure there anyway > around that, ultimately, due to memory access being slow on most hardware. > > If anyone has any ideas, I'd love to hear them. I won't be able to > get to implementing my ideas until at least Friday (also when rc2 will > be released). > > I had an idea regarding which axis to operate on first. Rather than operate on strictly the longest axis or strictly the innermost axis, a hybrid approach could be used. We would operate on the longest axis that would not result in the inner loop overflowing the cache. The idea is minimize the loop overhead as we do now by choosing the largest axis, while at the same time attempting to maintain cache friendliness. Let's suppose that it's safe to stuff 2000 array elements into the cache[1], and lets consider two arrays where the lengths of the axes[2], from innermost to outermost are [10,100,1000] and [100,10,1000]. Under this scheme, we would loop over the length 100 axis in both cases. Looping over the length-1000 axis pull 1e6 elements into the cache. If the innermost axis alone exceeded our cache imposed limit, we would of course just loop over that axis first. -tim [1] This would really be specified in bytes and thus the number of allowable elements would vary by data type. It might also need to be tuned based on architecture. [2] I believe that there's some consolidation of adjacent axes that goes on where possible, let's assume that's already happened at this point. |
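A rough Python sketch of that selection rule (the 2000-element budget is just the number from the example; a real version would work in bytes inside the C ufunc machinery, so this is only an illustration):

def pick_loop_axis(shape_inner_to_outer, cache_elements=2000):
    # The block of memory swept by looping along axis k covers that axis
    # times every axis inside it, so track that running product and keep
    # the longest axis whose block still fits in the cache budget.
    best_axis, best_len = 0, shape_inner_to_outer[0]
    span = 1
    for k, length in enumerate(shape_inner_to_outer):
        span *= length
        if span > cache_elements and k > 0:
            break                      # looping here would overflow the cache
        if length > best_len:
            best_axis, best_len = k, length
    return best_axis

print pick_loop_axis([10, 100, 1000])   # -> 1, the length-100 axis
print pick_loop_axis([100, 10, 1000])   # -> 0, the length-100 axis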
From: <kor...@id...> - 2006-10-03 16:03:36
|
Microsoft Windows XP [Version 5.1.2600] (C) Copyright 1985-2001 Microsoft Corp. C:\Documents and Settings\kenny>cd c:\lameness C:\lameness>c:\Python24\python.exe templatewindow.py a = [[ 1.00013175e+00 2.63483019e-05 1.00006740e+00 5.22246363e-05 1.00008735e+00 -2.77694969e-07 -1.30756273e-07 1.03202730e-06] [ 2.63483019e-05 6.95644927e-05 -7.15426839e-07 1.99534228e-05 5.29400631e-05 -3.07638369e-09 -5.52729618e-06 -7.61431767e-06] [ 1.00006740e+00 -7.15426839e-07 1.00011917e+00 2.50407751e-05 1.00006219e+00 -2.77757947e-07 -1.30856101e-07 1.23058301e-07] [ 5.22246363e-05 1.99534228e-05 2.50407751e-05 8.21505582e-05 5.26966037e-05 -6.13563429e-09 -2.76420755e-06 -3.80791858e-06] [ 1.00008735e+00 5.29400631e-05 1.00006219e+00 5.26966037e-05 1.00020132e+00 5.01389982e-05 3.45959412e-05 3.17129503e-05] [ -2.77694969e-07 -3.07638369e-09 -2.77757947e-07 -6.13563429e-09 5.01389982e-05 3.45959412e-05 3.17129503e-05 3.27035490e-05] [ -1.30756273e-07 -5.52729618e-06 -1.30856101e-07 -2.76420755e-06 3.45959412e-05 3.17129503e-05 3.27035490e-05 3.59732704e-05] [ 1.03202730e-06 -7.61431767e-06 1.23058301e-07 -3.80791858e-06 3.17129503e-05 3.27035490e-05 3.59732704e-05 4.12184645e-05]] F = [[ -1.44231014e+03] [ -7.54006917e-02] [ -1.44227222e+03] [ -7.49199956e-02] [ -1.44242446e+03] [ -4.24684780e-02] [ -2.49566072e-02] [ -2.16978637e-02]] Traceback (most recent call last): File "C:\lameness\PlotPanel.py", line 476, in OnNo self.parent.peaktime() File "C:\lameness\PlotPanel.py", line 64, in peaktime self._replot() File "C:\lameness\PlotPanel.py", line 159, in _replot self.drawpeak() File "C:\lameness\PlotPanel.py", line 233, in drawpeak COE[:,j1-1] = linalg.solve(A,F) ValueError: object too deep for desired array Anyone know what this error could be coming from? Is it because the floating point numbers are too large, memory wise? -Kenny |
From: Travis O. <oli...@ie...> - 2006-10-03 15:03:49
|
I'm going to have to put off release of rc2 for Friday. I'm just too busy right now. That might help us get some speed-ups into the NOBUFFER_UFUNCLOOP code as well. My speed-up ideas are: 1) Only keep track of 1 set of coordinates instead of self->nargs sets (1 for each iterator). 2) Keep track of the dataptr for each iterator in the bufptr array (so a copy isn't necessary) 3) Not increment the index on each iterator separately. All of these changes will be made directly in the NOBUFFER_UFUNCLOOP code. More generally, it would be nice to take these ideas and push them into other areas of the code --- perhaps through the multi-iterator that is already present. Probably, this will have to wait until 1.0.1 though. -Travis |
From: Travis O. <oli...@ie...> - 2006-10-03 14:58:35
|
Albert Strasheim wrote: >>> [1] 12.97% of function time >>> [2] 8.65% of functiont ime >>> [3] 62.14% of function time >>> >>> If statistics from elsewhere in the code would be helpful, let me >>> know, >>> >> and >> >>> I'll see if I can convince Quantify to cough it up. >>> >>> >> Please run the same test but using >> >> x1 = N.random.rand(39,2000) >> x2 = N.random.rand(39,64,1) >> >> z1 = x1[:,N.newaxis,:] - x2 >> > > Very similar results to what I had previously: > > [1] 10.88% > [2] 7.25% > [3] 68.25% > > Thanks, I've got some ideas about how to speed this up by eliminating some of the unnecessary calculations going on outside of the function loop, but there will still be some speed issues depending on how the array is traversed once you get above a certain size. I'm not sure there anyway around that, ultimately, due to memory access being slow on most hardware. If anyone has any ideas, I'd love to hear them. I won't be able to get to implementing my ideas until at least Friday (also when rc2 will be released). -Travis |
From: David C. <da...@ar...> - 2006-10-03 10:54:16
|
Hi,

I was wondering if there was any way to speed up the following code:

y = N.zeros((n, K))
for i in range(K):
    y[:, i] = gauss_den(data, mu[i, :], va[i, :])

where K is of order 1e1 and n of order 1e5. Normally, gauss_den is a quite expensive function, but the profiler tells me that the indexing y[:,i] takes almost as much time as the gauss_den computation (which computes n exp !). To see if the profiler is "right", I replaced it with the following (non-valid) function:

y = N.zeros((n, K))
for i in range(K):
    yt = gauss_den(data, mu[i, :], va[i, :])
return y

where more than 99% of the time is spent inside gauss_den. I guess the problem is coming from the fact that, y being C order, y[:, i] needs to access data in a non-'linear' way. Is there a way to speed this up? I did something like this:

y = N.zeros((K, n))
for i in range(K):
    y[i] = gauss_den(data, mu[i, :], va[i, :])
return y.T

which works, but I don't like it very much. Isn't there any other way?

David |
From: Albert S. <136...@su...> - 2006-10-03 09:12:38
|
Hello all > -----Original Message----- > From: Travis Oliphant [mailto:oli...@ee...] > Sent: 03 October 2006 03:51 > To: Discussion of Numerical Python; Albert Strasheim > Subject: Re: ***[Possible UCE]*** Re: [Numpy-discussion] Vectorizing > code, for loops, and all that > > <snip> > >Meanwhile, I can confirm that the NOBUFFER_UFUNCLOOP case in > >PyUFunc_GenericFunction is getting exercised in the slower case. > >Here's > some > >info on what's happening, courtesy of Rational Quantify: > > > >case NOBUFFER_UFUNCLOOP: > >while (loop->index < loop->size) { > > for (i=0; i<self->nargs; i++) > > loop->bufptr[i] = loop->iters[i]->dataptr; [1] > > > > loop->function((char **)loop->bufptr, &(loop->bufcnt), > > loop->steps, loop->funcdata); [2] > > UFUNC_CHECK_ERROR(loop); > > > > for (i=0; i<self->nargs; i++) { > > PyArray_ITER_NEXT(loop->iters[i]); [3] > > } > > loop->index++; > >} > >break; > > > >[1] 12.97% of function time > >[2] 8.65% of functiont ime > >[3] 62.14% of function time > > > >If statistics from elsewhere in the code would be helpful, let me > >know, > and > >I'll see if I can convince Quantify to cough it up. > > > Please run the same test but using > > x1 = N.random.rand(39,2000) > x2 = N.random.rand(39,64,1) > > z1 = x1[:,N.newaxis,:] - x2 Very similar results to what I had previously: [1] 10.88% [2] 7.25% [3] 68.25% Cheers, Albert |
From: David C. <da...@ar...> - 2006-10-03 07:31:47
|
Albert Strasheim wrote: > Hello all > > I recently started looking at David Cournapeau's PyEM package, specifically > his implementation of the K-Means algorithm. He implemented part of this > algorithm with in pure Python version and also provided a Pyrex alternative > that is significantly faster (about 10 times with the data I tested with). I > tried to figure out why this is so. > > For what it worths, PyEM has been quietly updated in the past weeks: most responses I got initially were people reporting errors almost always related to change in numpy API changes, so I decided to keep quiet for a while, waiting for an API freeze on numpy side (last versions use vq from scipy.cluster, for example, the plotting have been much improved, a fast Ctype version for diagonal gaussian kernel enables to run EM on hundred of thousand frames of several dozens dimensions with several dozens of gaussian in the mixture in reasonable times) > The job of this part of the algorithm is pretty simple: from a bunch of > cluster means (the codebook) find the nearest cluster mean for each data > point. The first attempt at implementing this algorithm might use two for > loops, one over the data points and one over the cluster means, computing a > Euclidean distance at each step. Slightly better is to use one loop and some > broadcasting. > > By randomly trying some code, I arrived at the following one-liner: > > N.sum((data[...,N.newaxis,...]-code)**2, 2).argmin(axis=1) > > where > > data = N.random.randn(2000, 39) > nclusters = 64 > code = data[0:nclusters,:] > > This code was still slow, so I mangled it a bit so that I could profile it > using cProfile under Python 2.5 (use profile if you have an older version of > Python): > > def kmean(): > data = N.random.randn(2000, 39) > nclusters = 64 > code = data[0:nclusters,:] > for i in xrange(10): > def a(): return data[...,N.newaxis,...] > z = a() > def b(): return z-code > z = b() > def c(): return z**2 > z = c() > def d(): return N.sum(z, 2) > z = d() > def e(): return z.argmin(axis=1) > z = e() > > if __name__ == '__main__': > import cProfile > cProfile.run("kmean()") > > Profiler output: > I got exactly the same kind of weird behaviour in other places of PyEM (apparently simple substractions taking a lot of time compared to other parts although it should theoratically be negligeable: eg X - mu where X is (n, d) and mu (d, 1) takeing almost as much time as exp( X**2) !). It would be great to at least know the origin of this non-intuitive result, David |
From: Travis O. <oli...@ee...> - 2006-10-03 01:50:36
|
Albert Strasheim wrote: >Hello all > > > >>-----Original Message----- >>From: num...@li... [mailto:numpy- >>dis...@li...] On Behalf Of Travis Oliphant >>Sent: 03 October 2006 02:32 >>To: Discussion of Numerical Python >>Subject: Re: [Numpy-discussion] Vectorizing code, for loops, and all that >> >>Travis Oliphant wrote: >> >> >> >>>I suspect I know why, although the difference seems rather large. >>> >>> >>> >>[snip] >> >> >> >>>I'm surprised the overhead of adjusting pointers is so high, but then >>>again you are probably getting a lot of cache misses in the first case >>>so there is more to it than that, the loops may run more slowly too. >>> >>> >>> >>> >>I'm personally bothered that this example runs so much more slowly. I >>don't think it should. Perhaps it is unavoidable because of the >>memory-layout issues. It is just hard to believe that the overhead for >>calling into the loop and adjusting the pointers is so much higher. >> >> > >Firstly, thanks to Tim... I'll try his functions tomorrow. > >Meanwhile, I can confirm that the NOBUFFER_UFUNCLOOP case in >PyUFunc_GenericFunction is getting exercised in the slower case. Here's some >info on what's happening, courtesy of Rational Quantify: > >case NOBUFFER_UFUNCLOOP: >while (loop->index < loop->size) { > for (i=0; i<self->nargs; i++) > loop->bufptr[i] = loop->iters[i]->dataptr; [1] > > loop->function((char **)loop->bufptr, &(loop->bufcnt), > loop->steps, loop->funcdata); [2] > UFUNC_CHECK_ERROR(loop); > > for (i=0; i<self->nargs; i++) { > PyArray_ITER_NEXT(loop->iters[i]); [3] > } > loop->index++; >} >break; > >[1] 12.97% of function time >[2] 8.65% of functiont ime >[3] 62.14% of function time > >If statistics from elsewhere in the code would be helpful, let me know, and >I'll see if I can convince Quantify to cough it up. > > > Please run the same test but using x1 = N.random.rand(39,2000) x2 = N.random.rand(39,64,1) z1 = x1[:,N.newaxis,:] - x2 Thanks, -Travis |
From: Albert S. <136...@su...> - 2006-10-03 00:42:21
|
Hello all > -----Original Message----- > From: num...@li... [mailto:numpy- > dis...@li...] On Behalf Of Travis Oliphant > Sent: 03 October 2006 02:32 > To: Discussion of Numerical Python > Subject: Re: [Numpy-discussion] Vectorizing code, for loops, and all that > > Travis Oliphant wrote: > > > > >I suspect I know why, although the difference seems rather large. > > > [snip] > > >I'm surprised the overhead of adjusting pointers is so high, but then > >again you are probably getting a lot of cache misses in the first case > >so there is more to it than that, the loops may run more slowly too. > > > > > > I'm personally bothered that this example runs so much more slowly. I > don't think it should. Perhaps it is unavoidable because of the > memory-layout issues. It is just hard to believe that the overhead for > calling into the loop and adjusting the pointers is so much higher. Firstly, thanks to Tim... I'll try his functions tomorrow. Meanwhile, I can confirm that the NOBUFFER_UFUNCLOOP case in PyUFunc_GenericFunction is getting exercised in the slower case. Here's some info on what's happening, courtesy of Rational Quantify: case NOBUFFER_UFUNCLOOP: while (loop->index < loop->size) { for (i=0; i<self->nargs; i++) loop->bufptr[i] = loop->iters[i]->dataptr; [1] loop->function((char **)loop->bufptr, &(loop->bufcnt), loop->steps, loop->funcdata); [2] UFUNC_CHECK_ERROR(loop); for (i=0; i<self->nargs; i++) { PyArray_ITER_NEXT(loop->iters[i]); [3] } loop->index++; } break; [1] 12.97% of function time [2] 8.65% of functiont ime [3] 62.14% of function time If statistics from elsewhere in the code would be helpful, let me know, and I'll see if I can convince Quantify to cough it up. > <snip> Thanks for all code and suggestions. Cheers, Albert |
From: Travis O. <oli...@ee...> - 2006-10-03 00:32:26
|
Travis Oliphant wrote: > >I suspect I know why, although the difference seems rather large. > [snip] >I'm surprised the overhead of adjusting pointers is so high, but then >again you are probably getting a lot of cache misses in the first case >so there is more to it than that, the loops may run more slowly too. > > I'm personally bothered that this example runs so much more slowly. I don't think it should. Perhaps it is unavoidable because of the memory-layout issues. It is just hard to believe that the overhead for calling into the loop and adjusting the pointers is so much higher. But, that isn't the problem, here. Notice the following: x3 = N.random.rand(39,2000) x4 = N.random.rand(39,64,1) %timeit z3 = x3[:,None,:] - x4 10 loops, best of 3: 76.4 ms per loop Hmm... It looks like cache misses are a lot more important than making sure the inner loop is taken over the largest number of variables (that's the current way ufuncs decide which axis ought to be used as the 1-d loop). Perhaps those inner 1-d loops could be optimized (using prefetch or something) to reduce the number of cache misses on the inner computation, and the concept of looping over the largest dimension (instead of the last dimension) should be re-considered. Ideas, -Travis |
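For anyone who wants to reproduce that comparison outside of ipython, a small sketch of the two layouts side by side (timings will of course vary by machine):

import timeit
import numpy as N

setup = '''
import numpy as N
x1 = N.random.rand(2000, 39)
x2 = N.random.rand(64, 39)
x3 = N.random.rand(39, 2000)
x4 = N.random.rand(39, 64, 1)
'''

# Long axis last (result shape (2000, 64, 39)) vs. long axis first
# (result shape (39, 64, 2000)); same number of subtractions either way.
print timeit.Timer('x1[:, N.newaxis, :] - x2', setup).timeit(10)
print timeit.Timer('x3[:, N.newaxis, :] - x4', setup).timeit(10)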
From: Tim H. <tim...@ie...> - 2006-10-03 00:17:58
|
Tim Hochberg wrote:
> Travis Oliphant wrote:
>> Albert Strasheim wrote:
>>> In [571]: x1 = N.random.randn(2000,39)
>>> In [572]: y1 = N.random.randn(64,39)
>>> In [574]: %timeit z1=x1[...,N.newaxis,...]-y1
>>> 10 loops, best of 3: 703 ms per loop
>>> In [575]: z1.shape
>>> Out[575]: (2000, 64, 39)
>>> As far as I can figure, this operation is doing 2000*64*39 subtractions. Doing this straight up yields the following:
>>> In [576]: x2 = N.random.randn(2000,64,39)
>>> In [577]: y2 = N.random.randn(2000,64,39)
>>> In [578]: %timeit z2 = x2-y2
>>> 10 loops, best of 3: 108 ms per loop
>>> Does anybody have any ideas on why this is so much faster? Hopefully I didn't mess up somewhere...
> [SNIP]
> I just spent a while playing with this, and assuming I've correctly translated your original intent I've come up with two alternative, looping versions that run, respectively, 2 and 3 times faster. I've a feeling that kmean3, the fastest one, still has a little more room to be sped up, but I'm out of time now. Code is below
>
> -tim

One more iteration, this time a little algorithmic improvement, and we're up to 4x as fast as the original code. Here we take advantage of the fact that the term di**2 is constant across the axis we are minimizing on, so the computation of ci**2 can be hoisted out of the loop and the -2*di*ci term can be rejiggered to di*ci by appropriate rescaling. This reduces the computation in the inner loop from a subtract and a multiply to just a multiply.

def kmean4(data):
    nclusters = 64
    naxes = data.shape[-1]
    code = data[:nclusters]
    transdata = data.transpose().copy()
    totals = N.empty([nclusters, len(data)], float)
    code2 = (code**2).sum(-1)
    code2 *= -0.5
    totals[:] = code2[:, N.newaxis]
    for cluster, tot in zip(code, totals):
        for di, ci in zip(transdata, cluster):
            tot += di*ci
    return totals.argmax(axis=0)

[SNIP CODE] |
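For completeness, the identity that rescaling relies on, written as one vectorized expression (an added illustration, not code from the thread): for each data point d and each cluster mean c, the argmin over c of ||d - c||**2 equals the argmax over c of (d.c - 0.5*||c||**2), because the ||d||**2 term is the same for every candidate c.

import numpy as N

def kmean_dot(data, nclusters=64):
    code = data[:nclusters]
    # scores[i, j] = dot(data[i], code[j]) - 0.5*||code[j]||**2
    scores = N.dot(data, code.transpose()) - 0.5 * (code**2).sum(-1)
    return scores.argmax(axis=1)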
From: Tim H. <tim...@ie...> - 2006-10-02 23:50:06
|
Travis Oliphant wrote: > Albert Strasheim wrote: > > >> In [571]: x1 = N.random.randn(2000,39) >> >> In [572]: y1 = N.random.randn(64,39) >> >> In [574]: %timeit z1=x1[...,N.newaxis,...]-y1 10 loops, best of 3: 703 ms >> per loop >> >> In [575]: z1.shape >> Out[575]: (2000, 64, 39) >> >> As far as I can figure, this operation is doing 2000*64*39 subtractions. >> Doing this straight up yields the following: >> >> In [576]: x2 = N.random.randn(2000,64,39) >> >> In [577]: y2 = N.random.randn(2000,64,39) >> >> In [578]: %timeit z2 = x2-y2 >> 10 loops, best of 3: 108 ms per loop >> >> Does anybody have any ideas on why this is so much faster? Hopefully I >> didn't mess up somewhere... >> >> >> > > I suspect I know why, although the difference seems rather large. There > is code optimization that is being taken advantage of in the second > case. If you have contiguous arrays (no broadcasting needed), then 1 > C-loop is used for the subtraction (your second case). > > In the first case you are using broadcasting to generate the larger > array. This requires more complicated looping constructs under the > covers which causes your overhead. Bascially, you will have 64*39 1-d > loops of 2000 elements each in the first example with a bit of > calculation over-head to reset the pointers before each loop. > > In the ufunc code, compare the ONE_UFUNCLOOP case with the > NOBUFFER_UFUNCLOOP case. If you want to be sure what is running > un-comment the fprintf statements so you can tell. > > I'm surprised the overhead of adjusting pointers is so high, but then > again you are probably getting a lot of cache misses in the first case > so there is more to it than that, the loops may run more slowly too. > > I suspect that Travis is basically right about why your simple subtraction runs much faster than your test case. However, that doesn't mean you can't do better than at present. When dealing with large, multidimensional arrays my experience has been that unrolling all of the for loops is frequently counterproductive. I chalk this up to two factors: first that you tend to end up generating large temporary arrays and this in turn leads to cache misses. Second, you loose flexibility in how you perform the calculation, which in turns limits other possible explanations. I just spent a while playing with this, and assuming I've correctly translated your original intent I've come up with two alternative, looping versions that run, respectively 2 and 3 times faster. I've a feeling that kmean3, the fastest one, still has a little more room to be sped up, but I'm out of time now. 
Code is below

-tim

import numpy as N

data = N.random.randn(2000, 39)

def kmean0(data):
    nclusters = 64
    code = data[0:nclusters,:]
    return N.sum((data[...,N.newaxis,...]-code)**2, 2).argmin(axis=1)

def kmean1(data):
    nclusters = 64
    code = data[0:nclusters,:]
    z = data[:,N.newaxis,:]
    z = z-code
    z = z**2
    z = N.sum(z, 2)
    return z.argmin(axis=1)

def kmean2(data):
    nclusters = 64
    naxes = data.shape[-1]
    code = data[0:nclusters,:]
    data = data[:, N.newaxis]
    allz = N.zeros([len(data)])
    for i, x in enumerate(data):
        z = (x - code)
        z **= 2
        allz[i] = z.sum(-1).argmin(0)
    return allz

def kmean3(data):
    nclusters = 64
    naxes = data.shape[-1]
    code = data[0:nclusters]
    totals = N.zeros([nclusters, len(data)], float)
    transdata = data.transpose().copy()
    for cluster, tot in zip(code, totals):
        for di, ci in zip(transdata, cluster):
            delta = di - ci
            delta **= 2
            tot += delta
    return totals.argmin(axis=0)

if __name__ == '__main__':
    assert N.alltrue(kmean0(data) == kmean1(data))
    assert N.alltrue(kmean0(data) == kmean2(data))
    assert N.alltrue(kmean0(data) == kmean3(data))

    from timeit import Timer
    print Timer('kmean0(data)', 'from scratch import kmean0, data').timeit(3)
    print Timer('kmean1(data)', 'from scratch import kmean1, data').timeit(3)
    print Timer('kmean2(data)', 'from scratch import kmean2, data').timeit(3)
    print Timer('kmean3(data)', 'from scratch import kmean3, data').timeit(3) |
From: Travis O. <oli...@ee...> - 2006-10-02 23:04:02
|
Albert Strasheim wrote: >In [571]: x1 = N.random.randn(2000,39) > >In [572]: y1 = N.random.randn(64,39) > >In [574]: %timeit z1=x1[...,N.newaxis,...]-y1 10 loops, best of 3: 703 ms >per loop > >In [575]: z1.shape >Out[575]: (2000, 64, 39) > >As far as I can figure, this operation is doing 2000*64*39 subtractions. >Doing this straight up yields the following: > >In [576]: x2 = N.random.randn(2000,64,39) > >In [577]: y2 = N.random.randn(2000,64,39) > >In [578]: %timeit z2 = x2-y2 >10 loops, best of 3: 108 ms per loop > >Does anybody have any ideas on why this is so much faster? Hopefully I >didn't mess up somewhere... > > I suspect I know why, although the difference seems rather large. There is code optimization that is being taken advantage of in the second case. If you have contiguous arrays (no broadcasting needed), then 1 C-loop is used for the subtraction (your second case). In the first case you are using broadcasting to generate the larger array. This requires more complicated looping constructs under the covers which causes your overhead. Bascially, you will have 64*39 1-d loops of 2000 elements each in the first example with a bit of calculation over-head to reset the pointers before each loop. In the ufunc code, compare the ONE_UFUNCLOOP case with the NOBUFFER_UFUNCLOOP case. If you want to be sure what is running un-comment the fprintf statements so you can tell. I'm surprised the overhead of adjusting pointers is so high, but then again you are probably getting a lot of cache misses in the first case so there is more to it than that, the loops may run more slowly too. -Travis |
From: Albert S. <136...@su...> - 2006-10-02 22:10:39
|
Hello all I recently started looking at David Cournapeau's PyEM package, specifically his implementation of the K-Means algorithm. He implemented part of this algorithm with in pure Python version and also provided a Pyrex alternative that is significantly faster (about 10 times with the data I tested with). I tried to figure out why this is so. The job of this part of the algorithm is pretty simple: from a bunch of cluster means (the codebook) find the nearest cluster mean for each data point. The first attempt at implementing this algorithm might use two for loops, one over the data points and one over the cluster means, computing a Euclidean distance at each step. Slightly better is to use one loop and some broadcasting. By randomly trying some code, I arrived at the following one-liner: N.sum((data[...,N.newaxis,...]-code)**2, 2).argmin(axis=1) where data = N.random.randn(2000, 39) nclusters = 64 code = data[0:nclusters,:] This code was still slow, so I mangled it a bit so that I could profile it using cProfile under Python 2.5 (use profile if you have an older version of Python): def kmean(): data = N.random.randn(2000, 39) nclusters = 64 code = data[0:nclusters,:] for i in xrange(10): def a(): return data[...,N.newaxis,...] z = a() def b(): return z-code z = b() def c(): return z**2 z = c() def d(): return N.sum(z, 2) z = d() def e(): return z.argmin(axis=1) z = e() if __name__ == '__main__': import cProfile cProfile.run("kmean()") Profiler output: In [570]: %run -t -e kmean2.py 84 function calls in 8.567 CPU seconds Ordered by: standard name ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 8.567 8.567 <string>:1(<module>) 10 0.000 0.000 0.835 0.084 fromnumeric.py:375(sum) 10 0.000 0.000 0.000 0.000 kmean2.py:10(a) 10 6.837 0.684 6.837 0.684 kmean2.py:12(b) 10 0.623 0.062 0.623 0.062 kmean2.py:14(c) 10 0.000 0.000 0.835 0.084 kmean2.py:16(d) 10 0.080 0.008 0.080 0.008 kmean2.py:18(e) 1 0.180 0.180 8.567 8.567 kmean2.py:5(kmean2) 10 0.000 0.000 0.000 0.000 {isinstance} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects} 1 0.012 0.012 0.012 0.012 {method 'randn' of 'mtrand.RandomState' objects} 10 0.835 0.083 0.835 0.083 {method 'sum' of 'numpy.ndarray' objects} It seems the b part of the computation is taking most of the time. I tried to simulate this separately: In [571]: x1 = N.random.randn(2000,39) In [572]: y1 = N.random.randn(64,39) In [574]: %timeit z1=x1[...,N.newaxis,...]-y1 10 loops, best of 3: 703 ms per loop In [575]: z1.shape Out[575]: (2000, 64, 39) As far as I can figure, this operation is doing 2000*64*39 subtractions. Doing this straight up yields the following: In [576]: x2 = N.random.randn(2000,64,39) In [577]: y2 = N.random.randn(2000,64,39) In [578]: %timeit z2 = x2-y2 10 loops, best of 3: 108 ms per loop Does anybody have any ideas on why this is so much faster? Hopefully I didn't mess up somewhere... Thanks! Regards, Albert P.S. I'm used NumPy 1.0rc1 with Python 2.5 on Windows for these tests. I get similar results with 1.0.dev3239 with Python 2.4.3 on Linux. P.P.S. SourceForge and GMail *really* don't like each other right now. |