From: Martin S. <nu...@ms...> - 2006-08-26 10:06:57
Hello,

I'm a bit ignorant of optimization in numpy. I have a movie with 65535
32x32 frames stored in a 3D array of uint8 with shape (65535, 32, 32).
I load it from an open file f like this:

>>> import numpy as np
>>> data = np.fromfile(f, np.uint8, count=65535*32*32)
>>> data = data.reshape(65535, 32, 32)

I'm picking several thousand frames more or less randomly from
throughout the movie and finding the mean frame over those frames:

>>> meanframe = data[frameis].mean(axis=0)

frameis is a 1D array of frame indices with no repeated values in it.
If it has, say, 4000 indices in it, then the above line takes about
0.5 sec to complete on my system. I'm doing this for a large number of
different frameis, some of which can have many more indices in them.
All this takes many minutes to complete, so I'm looking for ways to
speed it up.

If I divide it into 2 steps:

>>> temp = data[frameis]
>>> meanframe = temp.mean(axis=0)

and time it, I find the first step takes about 0.2 sec and the second
takes about 0.3 sec. So it's not just the mean() step but also the
indexing step that's taking some time.

If I flatten with ravel:

>>> temp = data[frameis].ravel()
>>> meanframe = temp.mean(axis=0)

then the first step still takes about 0.2 sec, but the mean() step
drops to about 0.1 sec. But of course this is taking a flat average
across all pixels in the movie, which isn't what I want to do.

I have a feeling that the culprit is the non-contiguity of the movie
frames being averaged, but I don't know how to proceed.

Any ideas? Could reshaping the data somehow speed things up? Would
weave.blitz or weave.inline or Pyrex help?

I'm running numpy 0.9.8.

Thanks,

Martin
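[Editor's note: a minimal standalone sketch of the timing experiment
described above, for readers who want to reproduce it. The synthetic
data and the 4000-element frameis are stand-ins for Martin's movie
file; absolute times will differ by machine and numpy version.]

# Time the indexing step and the full indexing + mean pipeline.
import timeit
import numpy as np

data = np.random.randint(0, 256, size=(65535, 32, 32)).astype(np.uint8)
frameis = np.random.permutation(65535)[:4000]  # unique random indices

t_index = timeit.timeit(lambda: data[frameis], number=10) / 10
t_full = timeit.timeit(lambda: data[frameis].mean(axis=0), number=10) / 10
print("indexing: %.3f s   indexing + mean: %.3f s" % (t_index, t_full))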
From: Travis O. <oli...@ie...> - 2006-08-26 10:26:35
Martin Spacek wrote:
> I'm picking several thousand frames more or less randomly from
> throughout the movie and finding the mean frame over those frames:
>
> >>> meanframe = data[frameis].mean(axis=0)
> --snip--
> and time it, I find the first step takes about 0.2 sec and the second
> takes about 0.3 sec. So it's not just the mean() step but also the
> indexing step that's taking some time.

If frameis is 1-D, then you should be able to use

temp = data.take(frameis, axis=0)

for the first step. This can be quite a bit faster (and is a big reason
why take is still around). There are several reasons for this (one of
which is that index checking is done over the entire list when using
fancy indexing).

-Travis
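[Editor's note: a quick way to check Travis's suggestion on your own
machine is to time the two gathering forms side by side. This sketch
reuses the synthetic data shape from the first message.]

# Fancy indexing vs. take() for the row-gathering step only.
import timeit
import numpy as np

data = np.random.randint(0, 256, size=(65535, 32, 32)).astype(np.uint8)
frameis = np.random.permutation(65535)[:4000]

t_fancy = timeit.timeit(lambda: data[frameis], number=10) / 10
t_take = timeit.timeit(lambda: data.take(frameis, axis=0), number=10) / 10
print("fancy indexing: %.3f s   take: %.3f s" % (t_fancy, t_take))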
From: Francesc A. <fa...@ca...> - 2006-08-26 21:22:42
On Saturday 26 August 2006 12:26, Travis Oliphant wrote:
> If frameis is 1-D, then you should be able to use
>
> temp = data.take(frameis, axis=0)
>
> for the first step. This can be quite a bit faster (and is a big
> reason why take is still around). There are several reasons for this
> (one of which is that index checking is done over the entire list when
> using indexing).

Well, I stumbled on this some days ago as well. The NumPy manual says
that .take() is usually faster than fancy indexing, but my timings show
that this is no longer true in recent versions of NumPy:

In [56]: Timer("b.take(a)", "import numpy; a=numpy.arange(999,-1,-1, dtype='l'); b=a[:]").repeat(3, 1000)
Out[56]: [0.28740906715393066, 0.20345211029052734, 0.20371079444885254]

In [57]: Timer("b[a]", "import numpy; a=numpy.arange(999,-1,-1, dtype='l'); b=a[:]").repeat(3, 1000)
Out[57]: [0.20807695388793945, 0.11684703826904297, 0.11686491966247559]

I've done some profiling on this, and it seems that take uses a C
memmove call to copy the data, which is *very* slow, at least on my
platform (Linux on Intel). Fancy indexing, for its part, seems to use
an iterator, and copying the elements one by one seems faster. I'd say
that replacing memmove with memcpy would make .take() much faster.

Regards,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"
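[Editor's note: for anyone without IPython handy, the same comparison
can be run as a plain script; this is just Francesc's two Timer calls
transcribed, nothing more.]

# Standalone version of the In[56]/In[57] timings above.
from timeit import Timer

setup = "import numpy; a=numpy.arange(999,-1,-1, dtype='l'); b=a[:]"
print(min(Timer("b.take(a)", setup).repeat(3, 1000)))  # .take()
print(min(Timer("b[a]", setup).repeat(3, 1000)))       # fancy indexing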
From: Martin S. <nu...@ms...> - 2006-08-27 12:05:33
Travis Oliphant wrote:
> If frameis is 1-D, then you should be able to use
>
> temp = data.take(frameis, axis=0)
>
> for the first step. This can be quite a bit faster (and is a big
> reason why take is still around). There are several reasons for this
> (one of which is that index checking is done over the entire list when
> using indexing).

Yup, that dropped the indexing step down to essentially 0 seconds.
Thanks Travis!

Martin
From: Tim H. <tim...@ie...> - 2006-08-26 18:00:11
Martin Spacek wrote:
> I'm picking several thousand frames more or less randomly from
> throughout the movie and finding the mean frame over those frames:
>
> >>> meanframe = data[frameis].mean(axis=0)
> --snip--
> I have a feeling that the culprit is the non-contiguity of the movie
> frames being averaged, but I don't know how to proceed.

Martin,

Here's an approach (mean_accumulate) that avoids making any copies of
the data. It runs almost 4x as fast as your approach (called baseline
here) on my box. Perhaps this will be useful:

import random
import numpy as np

# Save this module as scratch.py so the timeit setup below can import it.
frames = 65535
samples = 4000
data = (256 * np.random.random((frames, 32, 32))).astype(np.uint8)
indices = np.arange(frames)
random.shuffle(indices)
indices = indices[:samples]

def mean_baseline(data, indices):
    return data[indices].mean(axis=0)

def mean_accumulate(data, indices):
    # Accumulate frame by frame into one float buffer; no big temporary.
    result = np.zeros([32, 32], float)
    for i in indices:
        result += data[i]
    result /= len(indices)
    return result

if __name__ == "__main__":
    import timeit
    print mean_baseline(data, indices)[0, :8]
    print timeit.Timer("s.mean_baseline(s.data, s.indices)",
                       "import scratch as s").timeit(10)
    print mean_accumulate(data, indices)[0, :8]
    print timeit.Timer("s.mean_accumulate(s.data, s.indices)",
                       "import scratch as s").timeit(10)

This gives:

[ 126.947    127.39175  128.03725  129.83425  127.98925  126.866
  128.5352   127.6205 ]
3.95907664242
[ 126.947    127.39175  128.03725  129.83425  127.98925  126.866
  128.53525  127.6205 ]
1.06913644053

I also wondered if sorting indices would help, since it would improve
locality of reference, but when I measured that, it appeared not to
help at all.

regards,

-tim
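[Editor's note: a sketch of the locality experiment Tim mentions in his
last paragraph -- timing mean_accumulate with the shuffled indices
versus a sorted copy. It assumes the code above has been saved as
scratch.py; the sorted_indices attribute is set up here for the test.]

# Shuffled vs. sorted indices for mean_accumulate (locality check).
import timeit

shuffled = timeit.Timer("s.mean_accumulate(s.data, s.indices)",
                        "import scratch as s").timeit(10)
in_order = timeit.Timer("s.mean_accumulate(s.data, s.sorted_indices)",
                        "import scratch as s; import numpy as np; "
                        "s.sorted_indices = np.sort(s.indices)").timeit(10)
print("shuffled: %.3f s   sorted: %.3f s" % (shuffled, in_order))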
From: Martin S. <nu...@ms...> - 2006-08-27 12:28:14
Tim Hochberg wrote:
> Here's an approach (mean_accumulate) that avoids making any copies of
> the data. It runs almost 4x as fast as your approach (called baseline
> here) on my box. Perhaps this will be useful:
> --snip--
> def mean_accumulate(data, indices):
>     result = np.zeros([32, 32], float)
>     for i in indices:
>         result += data[i]
>     result /= len(indices)
>     return result

Great! I got a roughly 9x speed improvement using take() in combination
with this approach. Thanks Tim!

Here's what my code looks like now:

def mean_accum(data):
    result = np.zeros(data[0].shape, np.float64)
    for dataslice in data:
        result += dataslice
    result /= len(data)
    return result

# frameis are int64
frames = data.take(frameis.astype(np.int32), axis=0)
meanframe = mean_accum(frames)

I'm surprised that using a Python for loop is faster than the built-in
mean method. I suppose mean() can't perform the same in-place
operations because in certain cases doing so would fail?

Martin
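[Editor's note: putting the pieces together, the whole pipeline can be
benchmarked against the original one-liner. A sketch with synthetic
data; the ~9x figure above is from Martin's machine and will vary.]

# Built-in mean over fancy-indexed frames vs. take() + accumulation loop.
import timeit
import numpy as np

data = np.random.randint(0, 256, size=(65535, 32, 32)).astype(np.uint8)
frameis = np.random.permutation(65535)[:4000].astype(np.int64)

def mean_builtin():
    return data[frameis].mean(axis=0)

def mean_take_accum():
    frames = data.take(frameis.astype(np.int32), axis=0)
    result = np.zeros(frames[0].shape, np.float64)
    for frame in frames:
        result += frame
    return result / len(frames)

for f in (mean_builtin, mean_take_accum):
    print("%s: %.3f s" % (f.__name__, timeit.timeit(f, number=10) / 10))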
From: Tim H. <tim...@ie...> - 2006-08-27 15:37:27
Martin Spacek wrote:
> Great! I got a roughly 9x speed improvement using take() in combination
> with this approach. Thanks Tim!
> --snip--
> I'm surprised that using a Python for loop is faster than the built-in
> mean method. I suppose mean() can't perform the same in-place
> operations because in certain cases doing so would fail?

I'm not sure why mean is slow here, although possibly it's a locality
issue -- mean likely computes along axis zero each time, which means
it's killing the cache -- while the accumulate version is cache
friendly.

One thing to keep in mind about Python for loops is that they are slow
if you are doing a simple computation inside (a single add, for
instance). IIRC, they are tens of times slower. Here, however, each
pass through the inner loop adds a whole 32x32 (~1000-element) frame,
so the loop overhead is minimal.

(What would be perfect here is something just like take, but that
returned an iterator instead of a new array, as that could be done with
no copying -- unfortunately, such a beast does not exist as far as I
know.)

I'm actually surprised that the take version is faster than my original
version, since it makes a big ol' copy. I guess this is an indication
that indexing is more expensive than I realize. That's why nothing
beats measuring!

I also ran an experiment that reshaped your data to be friendly to mean
(assuming it really does operate on axis zero). However, this turned
out to be a huge pessimization, mostly because take + transpose is
pretty slow.

-tim
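[Editor's note: the "take that returns an iterator" idea Tim mentions
can at least be sketched in pure Python -- a generator yielding one
frame view at a time, so no large copy is ever made. This is just
mean_accumulate restructured around a generator, not an existing numpy
API, and a pure-Python version cannot match a C implementation.]

import numpy as np

def iter_take(data, indices):
    # Yield one frame at a time; data[i] is a view, so nothing is copied.
    for i in indices:
        yield data[i]

def mean_iter(data, indices):
    result = np.zeros(data.shape[1:], np.float64)
    for frame in iter_take(data, indices):
        result += frame
    return result / len(indices)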
From: Martin S. <nu...@ms...> - 2006-08-28 07:02:07
Tim Hochberg wrote:
> I'm actually surprised that the take version is faster than my original
> version, since it makes a big ol' copy. I guess this is an indication
> that indexing is more expensive than I realize. That's why nothing
> beats measuring!

Actually, your original version is just as fast as the take() version.
Both are about 9x faster than numpy.mean() on my system. I prefer the
take() version because you only have to pass a single argument to
mean_accum().

Martin
From: Martin S. <nu...@ms...> - 2006-08-28 07:13:18
Martin Spacek wrote:
> Actually, your original version is just as fast as the take() version.
> Both are about 9x faster than numpy.mean() on my system. I prefer the
> take() version because you only have to pass a single argument to
> mean_accum().

I forgot to mention that all my indices are, for now, sorted. I just
tried shuffling them (as you did), but I still get the same 9x
improvement in speed, so I don't know why you only get a 4x improvement
on your system.

Martin