From: Albert S. <136...@su...> - 2006-10-02 22:10:39
Hello all

I recently started looking at David Cournapeau's PyEM package,
specifically his implementation of the K-Means algorithm. He
implemented part of this algorithm in a pure Python version and also
provided a Pyrex alternative that is significantly faster (about 10
times with the data I tested with). I tried to figure out why this is
so.

The job of this part of the algorithm is pretty simple: from a bunch
of cluster means (the codebook), find the nearest cluster mean for
each data point. The first attempt at implementing this algorithm
might use two for loops, one over the data points and one over the
cluster means, computing a Euclidean distance at each step. Slightly
better is to use one loop and some broadcasting.

By randomly trying some code, I arrived at the following one-liner:

N.sum((data[...,N.newaxis,...]-code)**2, 2).argmin(axis=1)

where

data = N.random.randn(2000, 39)
nclusters = 64
code = data[0:nclusters,:]

This code was still slow, so I mangled it a bit so that I could
profile it using cProfile under Python 2.5 (use profile if you have an
older version of Python):

def kmean():
    data = N.random.randn(2000, 39)
    nclusters = 64
    code = data[0:nclusters,:]
    for i in xrange(10):
        def a(): return data[...,N.newaxis,...]
        z = a()
        def b(): return z-code
        z = b()
        def c(): return z**2
        z = c()
        def d(): return N.sum(z, 2)
        z = d()
        def e(): return z.argmin(axis=1)
        z = e()

if __name__ == '__main__':
    import cProfile
    cProfile.run("kmean()")

Profiler output:

In [570]: %run -t -e kmean2.py

84 function calls in 8.567 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    8.567    8.567 <string>:1(<module>)
    10    0.000    0.000    0.835    0.084 fromnumeric.py:375(sum)
    10    0.000    0.000    0.000    0.000 kmean2.py:10(a)
    10    6.837    0.684    6.837    0.684 kmean2.py:12(b)
    10    0.623    0.062    0.623    0.062 kmean2.py:14(c)
    10    0.000    0.000    0.835    0.084 kmean2.py:16(d)
    10    0.080    0.008    0.080    0.008 kmean2.py:18(e)
     1    0.180    0.180    8.567    8.567 kmean2.py:5(kmean2)
    10    0.000    0.000    0.000    0.000 {isinstance}
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1    0.012    0.012    0.012    0.012 {method 'randn' of 'mtrand.RandomState' objects}
    10    0.835    0.083    0.835    0.083 {method 'sum' of 'numpy.ndarray' objects}

It seems the b part of the computation is taking most of the time. I
tried to simulate this separately:

In [571]: x1 = N.random.randn(2000,39)

In [572]: y1 = N.random.randn(64,39)

In [574]: %timeit z1=x1[...,N.newaxis,...]-y1
10 loops, best of 3: 703 ms per loop

In [575]: z1.shape
Out[575]: (2000, 64, 39)

As far as I can figure, this operation is doing 2000*64*39
subtractions. Doing this straight up yields the following:

In [576]: x2 = N.random.randn(2000,64,39)

In [577]: y2 = N.random.randn(2000,64,39)

In [578]: %timeit z2 = x2-y2
10 loops, best of 3: 108 ms per loop

Does anybody have any ideas on why this is so much faster? Hopefully I
didn't mess up somewhere...

Thanks!

Regards,

Albert

P.S. I used NumPy 1.0rc1 with Python 2.5 on Windows for these tests. I
get similar results with 1.0.dev3239 with Python 2.4.3 on Linux.

P.P.S. SourceForge and GMail *really* don't like each other right now.
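For concreteness, the naive two-loop version described above might
look something like this (a minimal sketch of the idea, not PyEM's
actual code; the function name kmean_loops is made up here):

import numpy as N

def kmean_loops(data, nclusters=64):
    # Naive version: one loop over points, one over cluster means,
    # computing a squared Euclidean distance at each step.
    code = data[0:nclusters, :]
    nearest = N.zeros(len(data), dtype=int)
    for i in xrange(len(data)):
        best, bestdist = 0, N.inf
        for j in xrange(nclusters):
            d = ((data[i] - code[j])**2).sum()
            if d < bestdist:
                best, bestdist = j, d
        nearest[i] = best
    return nearest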
From: Bruce S. <bso...@gm...> - 2006-10-04 13:26:58
Hi,

On 10/2/06, Albert Strasheim <fu...@gm...> wrote:
> I recently started looking at David Cournapeau's PyEM package,
> specifically his implementation of the K-Means algorithm. He
> implemented part of this algorithm in a pure Python version and
> also provided a Pyrex alternative that is significantly faster
> (about 10 times with the data I tested with). I tried to figure
> out why this is so.

Just out of curiosity, have you seen the implementation at BioPython
(http://biopython.org)? Michiel de Hoon and others wrote 'The C
Clustering Library'; see the following document for more details
(linked from the documentation page -
http://biopython.org/wiki/Documentation):

http://biopython.org/DIST/docs/cluster/cluster.pdf

Regards

Bruce
From: Travis O. <oli...@ee...> - 2006-10-02 23:04:02
Albert Strasheim wrote:

>In [571]: x1 = N.random.randn(2000,39)
>
>In [572]: y1 = N.random.randn(64,39)
>
>In [574]: %timeit z1=x1[...,N.newaxis,...]-y1
>10 loops, best of 3: 703 ms per loop
>
>In [575]: z1.shape
>Out[575]: (2000, 64, 39)
>
>As far as I can figure, this operation is doing 2000*64*39
>subtractions. Doing this straight up yields the following:
>
>In [576]: x2 = N.random.randn(2000,64,39)
>
>In [577]: y2 = N.random.randn(2000,64,39)
>
>In [578]: %timeit z2 = x2-y2
>10 loops, best of 3: 108 ms per loop
>
>Does anybody have any ideas on why this is so much faster? Hopefully I
>didn't mess up somewhere...
>

I suspect I know why, although the difference seems rather large.
There is a code optimization that is being taken advantage of in the
second case. If you have contiguous arrays (no broadcasting needed),
then one C loop is used for the subtraction (your second case).

In the first case you are using broadcasting to generate the larger
array. This requires more complicated looping constructs under the
covers, which causes your overhead. Basically, you will have 64*39 1-d
loops of 2000 elements each in the first example, with a bit of
calculation overhead to reset the pointers before each loop.

In the ufunc code, compare the ONE_UFUNCLOOP case with the
NOBUFFER_UFUNCLOOP case. If you want to be sure what is running,
un-comment the fprintf statements so you can tell.

I'm surprised the overhead of adjusting pointers is so high, but then
again you are probably getting a lot of cache misses in the first
case, so there is more to it than that; the loops may run more slowly
too.

-Travis
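To make the traversal described above concrete, here is a rough
pure-NumPy emulation of it (the real loops run in C inside the ufunc
machinery; this is only an illustration of the access pattern, using
the shapes from Albert's example):

import numpy as N

x1 = N.random.randn(2000, 39)
y1 = N.random.randn(64, 39)

# The broadcast subtraction, done in one expression:
z_fast = x1[:, N.newaxis, :] - y1

# Roughly what that decomposes into internally: 64*39 separate 1-d
# passes, each walking 2000 elements with a large stride.
z_slow = N.empty((2000, 64, 39))
for j in xrange(64):
    for k in xrange(39):
        z_slow[:, j, k] = x1[:, k] - y1[j, k]

assert N.alltrue(z_fast == z_slow)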
From: Tim H. <tim...@ie...> - 2006-10-02 23:50:06
Travis Oliphant wrote:
> Albert Strasheim wrote:
>
>> In [571]: x1 = N.random.randn(2000,39)
>>
>> In [572]: y1 = N.random.randn(64,39)
>>
>> In [574]: %timeit z1=x1[...,N.newaxis,...]-y1
>> 10 loops, best of 3: 703 ms per loop
>>
>> In [575]: z1.shape
>> Out[575]: (2000, 64, 39)
>>
>> As far as I can figure, this operation is doing 2000*64*39
>> subtractions. Doing this straight up yields the following:
>>
>> In [576]: x2 = N.random.randn(2000,64,39)
>>
>> In [577]: y2 = N.random.randn(2000,64,39)
>>
>> In [578]: %timeit z2 = x2-y2
>> 10 loops, best of 3: 108 ms per loop
>>
>> Does anybody have any ideas on why this is so much faster? Hopefully
>> I didn't mess up somewhere...
>
> I suspect I know why, although the difference seems rather large.
> There is a code optimization that is being taken advantage of in the
> second case. If you have contiguous arrays (no broadcasting needed),
> then one C loop is used for the subtraction (your second case).
>
> In the first case you are using broadcasting to generate the larger
> array. This requires more complicated looping constructs under the
> covers, which causes your overhead. Basically, you will have 64*39
> 1-d loops of 2000 elements each in the first example, with a bit of
> calculation overhead to reset the pointers before each loop.
>
> In the ufunc code, compare the ONE_UFUNCLOOP case with the
> NOBUFFER_UFUNCLOOP case. If you want to be sure what is running,
> un-comment the fprintf statements so you can tell.
>
> I'm surprised the overhead of adjusting pointers is so high, but then
> again you are probably getting a lot of cache misses in the first
> case, so there is more to it than that; the loops may run more slowly
> too.

I suspect that Travis is basically right about why your simple
subtraction runs much faster than your test case. However, that
doesn't mean you can't do better than at present. When dealing with
large, multidimensional arrays, my experience has been that unrolling
all of the for loops is frequently counterproductive. I chalk this up
to two factors: first, you tend to end up generating large temporary
arrays, and this in turn leads to cache misses. Second, you lose
flexibility in how you perform the calculation, which in turn limits
other possible optimizations.

I just spent a while playing with this, and assuming I've correctly
translated your original intent, I've come up with two alternative,
looping versions that run, respectively, 2 and 3 times faster. I've a
feeling that kmean3, the fastest one, still has a little more room to
be sped up, but I'm out of time now.
Code is below.

-tim

import numpy as N

data = N.random.randn(2000, 39)

def kmean0(data):
    nclusters = 64
    code = data[0:nclusters,:]
    return N.sum((data[...,N.newaxis,...]-code)**2, 2).argmin(axis=1)

def kmean1(data):
    nclusters = 64
    code = data[0:nclusters,:]
    z = data[:,N.newaxis,:]
    z = z-code
    z = z**2
    z = N.sum(z, 2)
    return z.argmin(axis=1)

def kmean2(data):
    nclusters = 64
    naxes = data.shape[-1]
    code = data[0:nclusters,:]
    data = data[:, N.newaxis]
    allz = N.zeros([len(data)])
    for i, x in enumerate(data):
        z = (x - code)
        z **= 2
        allz[i] = z.sum(-1).argmin(0)
    return allz

def kmean3(data):
    nclusters = 64
    naxes = data.shape[-1]
    code = data[0:nclusters]
    totals = N.zeros([nclusters, len(data)], float)
    transdata = data.transpose().copy()
    for cluster, tot in zip(code, totals):
        for di, ci in zip(transdata, cluster):
            delta = di - ci
            delta **= 2
            tot += delta
    return totals.argmin(axis=0)

if __name__ == '__main__':
    assert N.alltrue(kmean0(data) == kmean1(data))
    assert N.alltrue(kmean0(data) == kmean2(data))
    assert N.alltrue(kmean0(data) == kmean3(data))
    from timeit import Timer
    print Timer('kmean0(data)', 'from scratch import kmean0, data').timeit(3)
    print Timer('kmean1(data)', 'from scratch import kmean1, data').timeit(3)
    print Timer('kmean2(data)', 'from scratch import kmean2, data').timeit(3)
    print Timer('kmean3(data)', 'from scratch import kmean3, data').timeit(3)
From: Tim H. <tim...@ie...> - 2006-10-03 00:17:58
Tim Hochberg wrote:
> Travis Oliphant wrote:
>
>> Albert Strasheim wrote:
>>
>>> In [574]: %timeit z1=x1[...,N.newaxis,...]-y1
>>> 10 loops, best of 3: 703 ms per loop
>>>
>>> In [578]: %timeit z2 = x2-y2
>>> 10 loops, best of 3: 108 ms per loop
>>>
>>> Does anybody have any ideas on why this is so much faster?
>>> Hopefully I didn't mess up somewhere...
>>>
[SNIP]
>
> I just spent a while playing with this, and assuming I've correctly
> translated your original intent, I've come up with two alternative,
> looping versions that run, respectively, 2 and 3 times faster. I've a
> feeling that kmean3, the fastest one, still has a little more room to
> be sped up, but I'm out of time now. Code is below.
>
> -tim

One more iteration, this time with a little algorithmic improvement,
and we're up to 4x as fast as the original code. Here we take
advantage of the fact that the term di**2 is constant across the axis
we are minimizing on, so the computation of ci**2 can be hoisted out
of the loop, and the -2*di*ci term can be rejiggered to di*ci by
appropriate rescaling. This reduces the computation in the inner loop
from a subtract and a multiply to just a multiply.

def kmean4(data):
    nclusters = 64
    naxes = data.shape[-1]
    code = data[:nclusters]
    transdata = data.transpose().copy()
    totals = N.empty([nclusters, len(data)], float)
    code2 = (code**2).sum(-1)
    code2 *= -0.5
    totals[:] = code2[:, N.newaxis]
    for cluster, tot in zip(code, totals):
        for di, ci in zip(transdata, cluster):
            tot += di*ci
    return totals.argmax(axis=0)

[SNIP CODE]
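Spelled out, the identity behind kmean4 is

    ||d - c||**2 = ||d||**2 - 2*(d . c) + ||c||**2

and since ||d||**2 is the same for every cluster mean c, for a fixed
data point

    argmin_c ||d - c||**2 = argmax_c ( d . c - ||c||**2 / 2 )

which is what kmean4 computes: totals holds d.c - ||c||**2/2 and
argmax replaces argmin. A minimal numerical check of the equivalence
(a sketch, not part of the timing code above):

import numpy as N

d = N.random.randn(39)           # one data point
code = N.random.randn(64, 39)    # cluster means

dist2 = ((d - code)**2).sum(-1)                  # squared distances
score = N.dot(code, d) - 0.5*(code**2).sum(-1)   # kmean4's quantity

assert dist2.argmin() == score.argmax()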
From: Travis O. <oli...@ee...> - 2006-10-03 00:32:26
Travis Oliphant wrote:
>
> I suspect I know why, although the difference seems rather large.
>
[snip]
>
> I'm surprised the overhead of adjusting pointers is so high, but then
> again you are probably getting a lot of cache misses in the first
> case, so there is more to it than that; the loops may run more slowly
> too.

I'm personally bothered that this example runs so much more slowly. I
don't think it should. Perhaps it is unavoidable because of the
memory-layout issues. It is just hard to believe that the overhead for
calling into the loop and adjusting the pointers is so much higher.

But that isn't the problem here. Notice the following:

x3 = N.random.rand(39,2000)
x4 = N.random.rand(39,64,1)

%timeit z3 = x3[:,None,:] - x4
10 loops, best of 3: 76.4 ms per loop

Hmm... It looks like cache misses are a lot more important than making
sure the inner loop is taken over the largest number of variables
(that's the current way ufuncs decide which axis ought to be used as
the 1-d loop).

Perhaps those inner 1-d loops could be optimized (using prefetch or
something) to reduce the number of cache misses on the inner
computation, and the concept of looping over the largest dimension
(instead of the last dimension) should be re-considered.

Ideas,

-Travis
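One way to see why the transposed layout wins (a small illustration
using the shapes from the two experiments):

import numpy as N

x1 = N.random.randn(2000, 39)   # original layout
x3 = N.random.rand(39, 2000)    # transposed layout

# Strides in elements along each axis:
print [s // x1.itemsize for s in x1.strides]   # [39, 1]
print [s // x3.itemsize for s in x3.strides]   # [2000, 1]

# In both cases the ufunc picks the length-2000 axis (the longest) for
# its inner 1-d loop. For x1 that axis jumps 39 elements per
# iteration, touching a new cache line almost every step; for x3 the
# same axis has unit stride, so the loop walks consecutive memory.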
From: Albert S. <136...@su...> - 2006-10-03 00:42:21
Hello all

> -----Original Message-----
> From: num...@li... [mailto:numpy-
> dis...@li...] On Behalf Of Travis Oliphant
> Sent: 03 October 2006 02:32
> To: Discussion of Numerical Python
> Subject: Re: [Numpy-discussion] Vectorizing code, for loops, and all
> that
>
> Travis Oliphant wrote:
>
> > I suspect I know why, although the difference seems rather large.
>
> [snip]
>
> > I'm surprised the overhead of adjusting pointers is so high, but
> > then again you are probably getting a lot of cache misses in the
> > first case, so there is more to it than that; the loops may run
> > more slowly too.
>
> I'm personally bothered that this example runs so much more slowly.
> I don't think it should. Perhaps it is unavoidable because of the
> memory-layout issues. It is just hard to believe that the overhead
> for calling into the loop and adjusting the pointers is so much
> higher.

Firstly, thanks to Tim... I'll try his functions tomorrow.

Meanwhile, I can confirm that the NOBUFFER_UFUNCLOOP case in
PyUFunc_GenericFunction is getting exercised in the slower case.
Here's some info on what's happening, courtesy of Rational Quantify:

case NOBUFFER_UFUNCLOOP:
    while (loop->index < loop->size) {
        /* reset the data pointers for this pass */
        for (i = 0; i < self->nargs; i++)
            loop->bufptr[i] = loop->iters[i]->dataptr;      /* [1] */

        /* run the inner 1-d loop */
        loop->function((char **)loop->bufptr, &(loop->bufcnt),
                       loop->steps, loop->funcdata);        /* [2] */
        UFUNC_CHECK_ERROR(loop);

        /* advance each iterator to the start of the next 1-d loop */
        for (i = 0; i < self->nargs; i++) {
            PyArray_ITER_NEXT(loop->iters[i]);              /* [3] */
        }
        loop->index++;
    }
    break;

[1] 12.97% of function time
[2]  8.65% of function time
[3] 62.14% of function time

If statistics from elsewhere in the code would be helpful, let me
know, and I'll see if I can convince Quantify to cough it up.

> <snip>

Thanks for all code and suggestions.

Cheers,

Albert
From: Travis O. <oli...@ee...> - 2006-10-03 01:50:36
Albert Strasheim wrote:

> Hello all
>
> <snip>
>
> Meanwhile, I can confirm that the NOBUFFER_UFUNCLOOP case in
> PyUFunc_GenericFunction is getting exercised in the slower case.
> Here's some info on what's happening, courtesy of Rational Quantify:
>
> case NOBUFFER_UFUNCLOOP:
>     while (loop->index < loop->size) {
>         for (i=0; i<self->nargs; i++)
>             loop->bufptr[i] = loop->iters[i]->dataptr;      [1]
>
>         loop->function((char **)loop->bufptr, &(loop->bufcnt),
>                        loop->steps, loop->funcdata);        [2]
>         UFUNC_CHECK_ERROR(loop);
>
>         for (i=0; i<self->nargs; i++) {
>             PyArray_ITER_NEXT(loop->iters[i]);              [3]
>         }
>         loop->index++;
>     }
>     break;
>
> [1] 12.97% of function time
> [2]  8.65% of function time
> [3] 62.14% of function time
>
> If statistics from elsewhere in the code would be helpful, let me
> know, and I'll see if I can convince Quantify to cough it up.

Please run the same test but using

x1 = N.random.rand(39,2000)
x2 = N.random.rand(39,64,1)

z1 = x1[:,N.newaxis,:] - x2

Thanks,

-Travis
From: Albert S. <136...@su...> - 2006-10-03 09:12:38
Hello all

> -----Original Message-----
> From: Travis Oliphant [mailto:oli...@ee...]
> Sent: 03 October 2006 03:51
> To: Discussion of Numerical Python; Albert Strasheim
> Subject: Re: ***[Possible UCE]*** Re: [Numpy-discussion] Vectorizing
> code, for loops, and all that
>
> <snip>
>
> Please run the same test but using
>
> x1 = N.random.rand(39,2000)
> x2 = N.random.rand(39,64,1)
>
> z1 = x1[:,N.newaxis,:] - x2

Very similar results to what I had previously:

[1] 10.88%
[2]  7.25%
[3] 68.25%

Cheers,

Albert
From: Travis O. <oli...@ie...> - 2006-10-03 14:58:35
Albert Strasheim wrote:

>>> [1] 12.97% of function time
>>> [2]  8.65% of function time
>>> [3] 62.14% of function time
>>>
>>> If statistics from elsewhere in the code would be helpful, let me
>>> know, and I'll see if I can convince Quantify to cough it up.
>>
>> Please run the same test but using
>>
>> x1 = N.random.rand(39,2000)
>> x2 = N.random.rand(39,64,1)
>>
>> z1 = x1[:,N.newaxis,:] - x2
>
> Very similar results to what I had previously:
>
> [1] 10.88%
> [2]  7.25%
> [3] 68.25%

Thanks,

I've got some ideas about how to speed this up by eliminating some of
the unnecessary calculations going on outside of the function loop,
but there will still be some speed issues depending on how the array
is traversed once you get above a certain size. I'm not sure there's
any way around that, ultimately, due to memory access being slow on
most hardware.

If anyone has any ideas, I'd love to hear them. I won't be able to get
to implementing my ideas until at least Friday (also when rc2 will be
released).

-Travis
From: Tim H. <tim...@ie...> - 2006-10-03 16:12:54
Travis Oliphant wrote:
> Albert Strasheim wrote:
>
>> Very similar results to what I had previously:
>>
>> [1] 10.88%
>> [2]  7.25%
>> [3] 68.25%
>
> Thanks,
>
> I've got some ideas about how to speed this up by eliminating some of
> the unnecessary calculations going on outside of the function loop,
> but there will still be some speed issues depending on how the array
> is traversed once you get above a certain size. I'm not sure there's
> any way around that, ultimately, due to memory access being slow on
> most hardware.
>
> If anyone has any ideas, I'd love to hear them. I won't be able to
> get to implementing my ideas until at least Friday (also when rc2
> will be released).

I had an idea regarding which axis to operate on first. Rather than
operate on strictly the longest axis or strictly the innermost axis, a
hybrid approach could be used: we would operate on the longest axis
that would not result in the inner loop overflowing the cache. The
idea is to minimize the loop overhead as we do now by choosing the
largest axis, while at the same time attempting to maintain cache
friendliness.

Let's suppose that it's safe to stuff 2000 array elements into the
cache [1], and let's consider two arrays where the lengths of the
axes [2], from innermost to outermost, are [10,100,1000] and
[100,10,1000]. Under this scheme, we would loop over the length-100
axis in both cases; looping over the length-1000 axis would pull 1e6
elements into the cache. If the innermost axis alone exceeded our
cache-imposed limit, we would of course just loop over that axis
first. A rough sketch of this selection rule follows the footnotes.

-tim

[1] This would really be specified in bytes, and thus the number of
allowable elements would vary by data type. It might also need to be
tuned based on architecture.

[2] I believe that there's some consolidation of adjacent axes that
goes on where possible; let's assume that's already happened at this
point.
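A minimal sketch of such a heuristic (the function name, the fixed
cache budget, and the "span = length * stride" model of the inner
loop's memory footprint are all assumptions for illustration, not
NumPy's actual selection code):

import numpy as N

def choose_loop_axis(shape, strides, cache_bytes=16*1024):
    # Among the axes whose one-pass memory span (length * stride)
    # fits in the cache budget, pick the longest; if none fits,
    # fall back to the smallest-stride axis.
    best = None
    for axis in range(len(shape)):
        span = shape[axis] * abs(strides[axis])
        if span <= cache_bytes:
            if best is None or shape[axis] > shape[best]:
                best = axis
    if best is None:
        best = min(range(len(shape)), key=lambda a: abs(strides[a]))
    return best

# Tim's first example: axes [10,100,1000] from innermost to outermost.
a = N.zeros((1000, 100, 10))   # C order: last axis is innermost
print choose_loop_axis(a.shape, a.strides)   # 1, the length-100 axis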
From: A. M. A. <per...@gm...> - 2006-10-04 03:16:18
On 03/10/06, Tim Hochberg <tim...@ie...> wrote:
> I had an idea regarding which axis to operate on first. Rather than
> operate on strictly the longest axis or strictly the innermost axis,
> a hybrid approach could be used: we would operate on the longest axis
> that would not result in the inner loop overflowing the cache. The
> idea is to minimize the loop overhead as we do now by choosing the
> largest axis, while at the same time attempting to maintain cache
> friendliness.

If elements are smaller than cache lines (usually at least eight
bytes, I think), we might end up pulling many times as many bytes into
the cache as we actually need if we don't loop along axes with small
strides first.

Can BLAS be used for some of these operations?

A. M. Archibald
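On the BLAS question: for this particular k-means step, most of the
work can indeed be pushed into a matrix product, using the same
expansion as Tim's kmean4 (a sketch; N.dot dispatches to BLAS dgemm
when numpy is linked against a BLAS, and the function name kmean_blas
is made up here):

import numpy as N

def kmean_blas(data, nclusters=64):
    # ||d - c||^2 = ||d||^2 - 2*(d.c) + ||c||^2; the cross term over
    # all points and clusters is one matrix product.
    code = data[0:nclusters]
    cross = N.dot(data, code.transpose())    # (ndata, nclusters)
    code2 = (code**2).sum(-1)                # ||c||^2 per cluster
    # ||d||^2 is constant along the cluster axis, so it drops out:
    return (code2 - 2*cross).argmin(axis=1)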
From: Travis O. <oli...@ee...> - 2006-10-05 13:34:48
Travis Oliphant wrote:

> Albert Strasheim wrote:
>
>> Very similar results to what I had previously:
>>
>> [1] 10.88%
>> [2]  7.25%
>> [3] 68.25%
>
> Thanks,
>
> I've got some ideas about how to speed this up by eliminating some of
> the unnecessary calculations going on outside of the function loop,
> but there will still be some speed issues depending on how the array
> is traversed once you get above a certain size. I'm not sure there's
> any way around that, ultimately, due to memory access being slow on
> most hardware.

Well, I tried out my ideas and didn't get much improvement (8-10%).
Then, I finally realized more fully that the slowness was due to the
loop taking place over an axis which had a very large stride, so that
the memory access was taking a long time.

Thus, instead of picking the loop axis to correspond to the axis with
the longest dimension, I've picked the loop axis to be the one with
the smallest sum of strides. In this particular example, the speed-up
is about 6-7 times...

-Travis
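A rough illustration of that selection rule (a sketch only, not the
actual ufunc code; it uses the later N.broadcast_arrays helper, which
did not exist in the 1.0-era NumPy discussed here):

import numpy as N

def loop_axis_by_stride_sum(*arrays):
    # Broadcast the operands to a common shape, then pick the axis
    # whose strides, summed across all operands, are smallest -- the
    # inner 1-d loop then takes the smallest steps through memory.
    ops = N.broadcast_arrays(*arrays)
    ndim = ops[0].ndim
    sums = [sum(abs(op.strides[ax]) for op in ops)
            for ax in range(ndim)]
    return sums.index(min(sums))

x1 = N.random.randn(2000, 39)
y1 = N.random.randn(64, 39)
# Prints 2: the unit-stride length-39 axis (stride sum 16), rather
# than the length-2000 axis (stride sum 312) that the old
# longest-dimension rule picked.
print loop_axis_by_stride_sum(x1[:, N.newaxis, :], y1)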
From: A. M. A. <per...@gm...> - 2006-10-04 01:46:22
On 02/10/06, Travis Oliphant <oli...@ee...> wrote:
> Perhaps those inner 1-d loops could be optimized (using prefetch or
> something) to reduce the number of cache misses on the inner
> computation, and the concept of looping over the largest dimension
> (instead of the last dimension) should be re-considered.

Cache control seems to be the main factor deciding the speed of many
algorithms. Prefetching could make a huge difference, particularly on
NUMA machines (like a dual Opteron). I think GCC has a moderately
portable way to request it (though it may be only in beta versions as
yet).

More generally, all the tricks that ATLAS uses to accelerate BLAS
routines would (in principle) be applicable here. The implementation
would be extremely difficult, though, even if all the basic loops
could be expressed in a few primitives.

A. M. Archibald
From: David C. <da...@ar...> - 2006-10-03 07:31:47
Albert Strasheim wrote:
> Hello all
>
> I recently started looking at David Cournapeau's PyEM package,
> specifically his implementation of the K-Means algorithm. He
> implemented part of this algorithm in a pure Python version and also
> provided a Pyrex alternative that is significantly faster (about 10
> times with the data I tested with). I tried to figure out why this
> is so.

For what it's worth, PyEM has been quietly updated in the past weeks:
most responses I got initially were people reporting errors almost
always related to changes in the numpy API, so I decided to keep quiet
for a while, waiting for an API freeze on the numpy side. (Recent
versions use vq from scipy.cluster, for example; the plotting has been
much improved; and a fast ctypes version of the diagonal Gaussian
kernel makes it possible to run EM on hundreds of thousands of frames
of several dozen dimensions, with several dozen Gaussians in the
mixture, in reasonable time.)

> The job of this part of the algorithm is pretty simple: from a bunch
> of cluster means (the codebook), find the nearest cluster mean for
> each data point. The first attempt at implementing this algorithm
> might use two for loops, one over the data points and one over the
> cluster means, computing a Euclidean distance at each step. Slightly
> better is to use one loop and some broadcasting.
>
> By randomly trying some code, I arrived at the following one-liner:
>
> N.sum((data[...,N.newaxis,...]-code)**2, 2).argmin(axis=1)
>
> <snip>

I got exactly the same kind of weird behaviour in other places of
PyEM: apparently simple subtractions taking a lot of time compared to
other parts, although they should theoretically be negligible (e.g.
X - mu, where X is (n, d) and mu is (d, 1), taking almost as much time
as exp(X**2)!). It would be great to at least know the origin of this
non-intuitive result.

David
From: Bill B. <wb...@gm...> - 2006-10-04 05:21:00
Er, is this the PyEM in question? It doesn't have much of a web
presence...

http://www.ar.media.kyoto-u.ac.jp/members/david/

--bb

On 10/3/06, David Cournapeau <da...@ar...> wrote:
> Albert Strasheim wrote:
> > Hello all
> >
> > I recently started looking at David Cournapeau's PyEM package,
> > specifically his implementation of the K-Means algorithm. He
> > implemented part of this algorithm in a pure Python version and
> > also provided a Pyrex alternative that is significantly faster
> > (about 10 times with the data I tested with). I tried to figure
> > out why this is so.