From: Pierre T. <thi...@ph...> - 2006-09-12 13:52:09
Hi,

I would like to have information on the best techniques to do in-place
calculations and to minimize temporary array creation. To me this seems to be
very important whenever the arrays become very large.

I already know about the ufunc in-place functionality (which is great).

More specifically, here are examples that occurred in my code:

1) FFTs: Let A and B be two large arrays, already allocated. I want the fft
of A to be stored in B. If I just type B = fft(A), there is a temporary array
creation, right? Is it possible to avoid that?

2) Function output: In general, I think the same thing happens with functions
like

    def f1(array_in):
        array_out = # something using array_in
        return array_out

Then, if B is already allocated, writing B = f1(A) again involves a temporary
array creation. I thought of instead doing something like

    def f2(array_in, array_out):
        array_out[:] = # something
        # Is this good practice?

and calling f2(A, B). If I understand well, this still requires a temporary
array creation. Is there another way of doing that (apart from actually
looping through the indices of A and B)?

I know that the use of f2 has one clear advantage: it makes sure that whatever
was in B is discarded. With f1, the following could happen:

    A          # contains something
    B          # contains something
    C = B
    B = f1(A)

C still contains whatever was in B. This could be what you wanted, but if you
consider C just as a reference to B, this is not good.

I guess these considerations are not standard Python problems because you
expect Python to take care of memory issues. With big arrays in scientific
computations, I feel the question is more relevant. I might be wrong...

Pierre

--
Pierre Thibault
616 Clark Hall, Cornell University
(607) 255-5522
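A minimal sketch (not part of the original message) of the f2 pattern in
question, with 2*array_in + 1 as a purely hypothetical stand-in for the elided
"# something": slice assignment first builds full-size temporaries and then
copies the result, while ufuncs with an explicit `out` argument write straight
into the preallocated array.

    import numpy as np

    def f2_with_copy(array_in, array_out):
        # Builds full-size temporaries for 2*array_in and 2*array_in + 1,
        # then copies the final result into array_out.
        array_out[:] = 2 * array_in + 1
        return array_out

    def f2_in_place(array_in, array_out):
        # Same result, but each ufunc writes directly into array_out,
        # so no array-sized temporary is created.
        np.multiply(array_in, 2, out=array_out)
        np.add(array_out, 1, out=array_out)
        return array_out

    A = np.random.rand(2048, 2048)
    B = np.empty_like(A)
    f2_in_place(A, B)   # B is filled in place; any alias of B sees the result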
From: Francesc A. <fa...@ca...> - 2006-09-12 14:51:16
Hello Pierre,

El dt 12 de 09 del 2006 a les 09:52 -0400, en/na Pierre Thibault va escriure:
> Hi,
>
> I would like to have information on the best techniques to do in-place
> calculations and to minimize temporary array creations. To me this
> seems to be very important whenever the arrays become very large.
>
> I already know about the ufunc in-place functionality (which is great).
>
> More specifically, here are examples that occurred in my code
>
> 1) FFTs: Let A and B be two large arrays, already allocated. I want
> the fft of A to be stored in B. If I just type B = fft(A), there is a
> temporary array creation, right? Is it possible to avoid that?

Well, in some way, there is a temporary array creation that is immediately
bound to B, so in the end, the temporary is not so temporary, but a new
(bound) object. Obviously, the object that was referenced by B is freed
(unless there is another reference to it).

> 2) Function output: In general, I think the same thing happens with
> functions like
>
> def f1(array_in):
>     array_out = # something using array_in
>     return array_out
>
> Then, if B is already allocated, writing B = f1(A) involves again a
> temporary array creation

Again, it depends on what you understand by 'temporary'.

> I thought instead of doing something like
>
> def f2(array_in, array_out):
>     array_out[:] = # something
>     # Is this good practice?
>
> and call f2(A,B).
>
> If I understand well, this still requires a temporary array creation.
> Is there another way of doing that (apart from actually looping
> through the indices of A and B)?

Here I'd say that yes, you are creating a truly temporary object and then
assigning it element by element to B.

> I know that the use of f2 has one clear advantage: it makes sure that
> whatever was in B is discarded. With f1, the following could happen:
>
> A          # contains something
> B          # contains something
> C = B
> B = f1(A)
>
> C still contains whatever was in B. This could be what you wanted, but
> if you consider C just as a reference to B, this is not good.

This is not good if you want to get rid of the object pointed to by B, but in
general, this is considered a nice feature of Python.

> I guess these considerations are not standard python problems because
> you expect python to take care of memory issues. With big arrays in
> scientific computations, I feel the question is more relevant. I might
> be wrong...

If you are worried about wasting memory, just get familiar with this rule: an
object only exists in memory while it is referenced (bound) by a variable.
When the object is no longer referenced, its memory is freed and becomes
available to the system for later reuse.

With this, B = fft(A) will create a new object (the FFT of A), the object
pointed to by B will be freed (if there is no other reference to it) and the
new object will be bound to B. If what you want is to avoid having the three
objects (namely A, old B and new B) in memory at the same time, you can do
something like:

    del B        # deletes the reference to the object pointed to by B
    B = fft(A)   # B gets bound to the new FFT object

HTH,

--
>0,0<   Francesc Altet     http://www.carabos.com/
 V V    Cárabos Coop. V.   Enjoy Data
  "-"
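A small illustration (my own sketch, using numpy.fft.fft on an arbitrary size)
of the binding rule described above: plain assignment rebinds the name B to a
new object, whereas slice assignment keeps B (and any alias such as C)
pointing at the same buffer — although fft() still allocates its own result
array in both cases.

    import numpy as np

    A = np.random.rand(1024) + 1j * np.random.rand(1024)

    B = np.empty(1024, dtype=np.complex128)
    C = B                  # C is a second reference to the same array object
    B = np.fft.fft(A)      # rebinds B to a brand-new array
    print(B is C)          # False: C still refers to the old, untouched buffer

    B = np.empty(1024, dtype=np.complex128)
    C = B
    B[:] = np.fft.fft(A)   # fft still allocates its result, but it is then
    print(B is C)          # copied into the existing buffer -> True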
From: Pierre T. <thi...@ph...> - 2006-09-12 17:17:26
Hello again,

On 9/12/06, Francesc Altet <fa...@ca...> wrote:
> Hello Pierre,
> [...]
>
> Well, in some way, there is a temporary array creation that is
> immediately bound to B, so in the end, the temporary is not so
> temporary, but a new (bound) object. Obviously, the object that was
> referenced by B is freed (unless there is another reference to it).

OK, I guess I was aware of all that. My worries are related to two cases:

1) When the mere creation of a new array is prohibitive.

2) When what I really want to change is the _content_ of an array. Using
assignment (=) disconnects B from the array it refers to, so that what used to
be true (C = B) is not anymore.

I understand that B[:] = ... solves problem 2), though I don't know if this
notation is recommended; but I would like to know if there is any way to solve
1), in the way ufuncs can do.

I had fun using KDE's "performance monitor" (ksysguard) to see the difference
between

    a = numpy.random.rand(2048,2048) + 1j * numpy.random.rand(2048,2048)
    a = numpy.exp(a)

and

    a = numpy.random.rand(2048,2048) + 1j * numpy.random.rand(2048,2048)
    numpy.exp(a, out=a)

Not only is the latter faster, but I could see a large glitch in the memory
usage for the former.

Pierre

--
Pierre Thibault
616 Clark Hall, Cornell University
(607) 255-5522
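A rough way to quantify that memory "glitch" from Python itself rather than
from ksysguard — a sketch of mine that assumes a Unix-like system (it uses
resource.getrusage) and should be run once per variant in a fresh interpreter,
since peak RSS only ever grows within a process.

    import resource
    import numpy as np

    def peak_mb():
        # Peak resident set size so far; ru_maxrss is reported in KB on Linux.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    IN_PLACE = False   # flip to True and re-run in a fresh interpreter

    a = np.random.rand(2048, 2048) + 1j * np.random.rand(2048, 2048)  # ~64 MB
    if IN_PLACE:
        np.exp(a, out=a)   # result written back into a's own 64 MB buffer
    else:
        a = np.exp(a)      # a second 64 MB array is built, then bound to a

    print("peak memory: %.0f MB" % peak_mb())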
From: Francesc A. <fa...@ca...> - 2006-09-12 17:49:36
El dt 12 de 09 del 2006 a les 13:17 -0400, en/na Pierre Thibault va escriure:
> Hello again,
>
> On 9/12/06, Francesc Altet <fa...@ca...> wrote:
> > Hello Pierre,
> > [...]
> >
> > Well, in some way, there is a temporary array creation that is
> > immediately bound to B, so in the end, the temporary is not so
> > temporary, but a new (bound) object. Obviously, the object that was
> > referenced by B is freed (unless there is another reference to it).
>
> OK, I guess I was aware of all that. My worries are related to two cases:
> 1) When the mere creation of a new array is prohibitive.

As Archibald said in another message, creation of a big array is not an issue
(malloc is very fast) and independent of the size:

    >>> Timer("a=numpy.array(100,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.19819307327270508, 0.14915895462036133, 0.14999985694885254]
    >>> Timer("a=numpy.array(10000,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.15171599388122559, 0.14998698234558105, 0.14901280403137207]

that is, 15 us (on my old machine) regardless of the size.

[BTW, numpy.empty seems twice as slow on my machine. Why?

    >>> Timer("a=numpy.empty(10000,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.37033700942993164, 0.31780219078063965, 0.31607294082641602]
]

> 2) When what I really want to change is the _content_ of an array.
> Using assignment (=) disconnects B from the array it refers to, so
> that what used to be true (C=B) is not anymore.
>
> I understand that B[:] = ... solves problem 2), though I don't
> know if this notation is recommended; but I would like to know if
> there is any way to solve 1), in the way ufuncs can do.
>
> I had fun using KDE's "performance monitor" (ksysguard) to see the
> difference between
>
> a = numpy.random.rand(2048,2048) + 1j * numpy.random.rand(2048,2048)
> a = numpy.exp(a)
>
> and
>
> a = numpy.random.rand(2048,2048) + 1j * numpy.random.rand(2048,2048)
> numpy.exp(a,out=a)
>
> Not only is the latter faster, but I could see a large glitch in the
> memory usage for the former.

Yes, it seems that some ufuncs have an additional "out" parameter that I was
not aware of. In that case, an analogous parameter in the fft function would
solve your needs, although I don't know how complicated that would be.

Cheers,

--
>0,0<   Francesc Altet     http://www.carabos.com/
 V V    Cárabos Coop. V.   Enjoy Data
  "-"
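The Timer snippets above assume `from timeit import Timer`; for reference, a
self-contained version of the same kind of measurement could look like this
sketch (the sizes are arbitrary and the numbers will differ per machine):

    from timeit import Timer

    # Allocation time of numpy.empty for a few sizes.
    for n in (100, 10000, 1000000):
        t = Timer("a = numpy.empty(%d, dtype=numpy.complex128)" % n,
                  "import numpy")
        best = min(t.repeat(3, 10000)) / 10000
        print("numpy.empty(%d): %.1f us per call" % (n, best * 1e6))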
From: A. M. A. <per...@gm...> - 2006-09-12 17:19:07
On 12/09/06, Pierre Thibault <thi...@ph...> wrote:
> I would like to have information on the best techniques to do in-place
> calculations and to minimize temporary array creations. To me this
> seems to be very important whenever the arrays become very large.

The first rule of optimization: don't do it yet. You can usually go through
and banish temporary arrays (using ufuncs and so on) at the cost of
readability, code encapsulation, and thread-safety. But it may not do what you
want.

I had an image-processing code that was taking longer than I thought it should
and using two hundred megabytes or so of RAM. So I rewrote it, with a certain
amount of pain, in a way that used the fewest possible temporary arrays. It
didn't run any faster, and it then took five hundred megabytes. Because all
the arrays ended up being in memory at once, the memory footprint increased
drastically.

malloc() is fast, typically just a handful of instructions; if you're
allocating a giant array, it's almost certainly being allocated using mmap(),
and it can be released back to the OS on deallocation.

But you probably still want to avoid temporary arrays. So:

> More specifically, here are examples that occurred in my code
>
> 1) FFTs: Let A and B be two large arrays, already allocated. I want
> the fft of A to be stored in B. If I just type B = fft(A), there is a
> temporary array creation, right? Is it possible to avoid that?

Doing an FFT in-place is a major challenge, and involves its own slowdowns, so
generally high-level toolkits don't bother. But fft seems to be like many
functions (those generated by interp1d, for example) that insist on
malloc()ing their own arrays to return. Short of rooting around in the
numpy/scipy code, there's no real way around this for such functions. The best
you can do is make actual use of the allocated array (rather than copy its
contents to *another* array and discard it).

> 2) Function output: In general, I think the same thing happens with
> functions like
>
> def f1(array_in):
>     array_out = # something using array_in
>     return array_out
>
> Then, if B is already allocated, writing B = f1(A) involves again a
> temporary array creation

Uh, no, not really. The way you have written f1, it probably malloc()s space
for array_out. The address of that space (roughly) is saved in the array_out
variable. If you write B = f1(A), you are just storing the address in B. The
memory is not copied. Even if you do B = f1(A)[::10] you don't copy the
memory.

> I thought instead of doing something like
>
> def f2(array_in, array_out):
>     array_out[:] = # something
>     # Is this good practice?
>
> and call f2(A,B).

This is a ufunc-like solution; you could even make array_out an optional
argument, and return it.

> If I understand well, this still requires a temporary array creation.
> Is there another way of doing that (apart from actually looping
> through the indices of A and B)?

It depends on what # something is. If, say, it is 2*array_in, you can simply
do multiply(array_in, 2, array_out) to avoid any dynamic allocation.

> I guess these considerations are not standard python problems because
> you expect python to take care of memory issues. With big arrays in
> scientific computations, I feel the question is more relevant. I might
> be wrong...

Some of these issues come up when dealing with mutable objects (lists,
dictionaries, and so on). Some of them (the fact that Python variables contain
simply references) are discussed in various Python FAQs.

A. M. Archibald
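A sketch of the "optional output argument" pattern suggested above (my own
example; 2*array_in again stands in for the elided "# something"): the caller
can pass a preallocated array or let the function allocate one, mirroring how
ufuncs behave.

    import numpy as np

    def f2(array_in, array_out=None):
        # Ufunc-style interface: allocate only if the caller did not supply
        # an output array, and always return the output.
        if array_out is None:
            array_out = np.empty_like(array_in)
        np.multiply(array_in, 2, out=array_out)   # writes result directly
        return array_out

    A = np.random.rand(2048, 2048)
    B = np.empty_like(A)
    f2(A, B)      # fills the preallocated B, no new array
    D = f2(A)     # or lets f2 allocate the result for you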
From: Travis O. <oli...@ee...> - 2006-09-12 19:28:23
Francesc Altet wrote:

> >>> Timer("a=numpy.array(100,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.19819307327270508, 0.14915895462036133, 0.14999985694885254]
> >>> Timer("a=numpy.array(10000,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.15171599388122559, 0.14998698234558105, 0.14901280403137207]
>
> that is, 15 us (on my old machine) regardless of the size.

Ummm... You are not creating empty arrays here. You are creating a 0-d array
with a single entry.

> [BTW, numpy.empty seems twice as slow on my machine. Why?
>
> >>> Timer("a=numpy.empty(10000,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.37033700942993164, 0.31780219078063965, 0.31607294082641602]
> ]

Now, you are creating an empty array with 10000 elements in it.

Best,

-Travis
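The distinction Travis points out, spelled out in a quick sketch of mine:
numpy.array(10000, ...) builds a 0-d array holding the single value 10000,
while numpy.empty(10000, ...) allocates an uninitialized array of 10000
elements.

    import numpy as np

    a = np.array(10000, dtype=np.complex128)   # 0-d array, one entry
    b = np.empty(10000, dtype=np.complex128)   # uninitialized, 10000 entries

    print(a.shape, a.size)   # () 1
    print(b.shape, b.size)   # (10000,) 10000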
From: Francesc A. <fa...@ca...> - 2006-09-13 07:16:00
El dt 12 de 09 del 2006 a les 13:28 -0600, en/na Travis Oliphant va escriure:
> > [BTW, numpy.empty seems twice as slow on my machine. Why?
> >
> > >>> Timer("a=numpy.empty(10000,dtype=numpy.complex128)", "import
> > numpy").repeat(3,10000)
> > [0.37033700942993164, 0.31780219078063965, 0.31607294082641602]
> > ]
>
> Now, you are creating an empty array with 10000 elements in it.

Oops, my bad. So, here are the correct times for array creation:

    >>> Timer("a=numpy.empty(10,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.083303928375244141, 0.080381870269775391, 0.077172040939331055]
    >>> Timer("a=numpy.empty(100,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.086454868316650391, 0.084085941314697266, 0.083555936813354492]
    >>> Timer("a=numpy.empty(1000,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.084996223449707031, 0.082299947738647461, 0.081347942352294922]
    >>> Timer("a=numpy.empty(10000,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.31068897247314453, 0.30376386642456055, 0.30176281929016113]
    >>> Timer("a=numpy.empty(100000,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.42552995681762695, 0.36864185333251953, 0.36870002746582031]
    >>> Timer("a=numpy.empty(1000000,dtype=numpy.complex128)", "import numpy").repeat(3,10000)
    [0.48045611381530762, 0.41251182556152344, 0.40645909309387207]

So, it seems that there is a certain time dependency on size:

    array of 10 elements      --> 7.7 us
    array of 100 elements     --> 8.4 us
    array of 1000 elements    --> 8.1 us
    array of 10000 elements   --> 30.2 us
    array of 100000 elements  --> 36.9 us
    array of 1000000 elements --> 40.6 us

Well, it seems that malloc actually takes more time when asked for more space.
However, this can't be the reason why Pierre is seeing that

    a = numpy.exp(a)        [1]

is slower than

    numpy.exp(a, out=a)     [2]

as I'd say this increment in time is negligible compared with the processing
time for those big arrays. In fact, here are my times:

    >>> Timer("a = numpy.exp(a)", "import numpy;a = numpy.random.rand(2048,2048) + 1j * numpy.random.rand(2048,2048)").repeat(3,1)
    [2.5527338981628418, 2.5427830219268799, 2.5074479579925537]
    >>> Timer("numpy.exp(a,out=a)", "import numpy;a = numpy.random.rand(2048,2048) + 1j * numpy.random.rand(2048,2048)").repeat(3,1)
    [2.5298278331756592, 2.5082788467407227, 2.5222280025482178]

So, both times are comparable.

Perhaps what Pierre is seeing is that he is approaching the memory limits of
his system, and because [1] takes more memory than [2] (two objects in memory
instead of one), the former may be causing the OS to start swapping. However,
a quick look with top at the processes says that both [1] and [2] take similar
amounts of memory (~170 MB peak) and, as the arrays take 64 MB each, in both
cases the used memory seems higher than required at first sight. Mmmm, the
only explanation is that the exp() ufunc does require temporaries, although
this is a bit strange as exp() works element-wise. I recognize that I'm a bit
lost here...

--
>0,0<   Francesc Altet     http://www.carabos.com/
 V V    Cárabos Coop. V.   Enjoy Data
  "-"
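One way to check directly whether a call reuses the array's own buffer or
builds a new one is to compare the data pointer before and after; a sketch of
mine (arbitrary array size), using the address exposed by __array_interface__:

    import numpy as np

    a = np.random.rand(512, 512) + 1j * np.random.rand(512, 512)

    addr = a.__array_interface__['data'][0]           # address of a's buffer
    np.exp(a, out=a)                                  # written back in place
    print(a.__array_interface__['data'][0] == addr)   # True: same buffer

    addr = a.__array_interface__['data'][0]
    a = np.exp(a)                                     # result built elsewhere
    print(a.__array_interface__['data'][0] == addr)   # False: new buffer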
From: Charles R H. <cha...@gm...> - 2006-09-14 01:24:37
On 9/13/06, Francesc Altet <fa...@ca...> wrote:
>
> El dt 12 de 09 del 2006 a les 13:28 -0600, en/na Travis Oliphant va
> escriure:
> > > [BTW, numpy.empty seems twice as slow on my machine. Why?
> > >
> > > >>> Timer("a=numpy.empty(10000,dtype=numpy.complex128)", "import
> > > numpy").repeat(3,10000)
> > > [0.37033700942993164, 0.31780219078063965, 0.31607294082641602]
> > > ]
> >
> > Now, you are creating an empty array with 10000 elements in it.
>
> Oops, my bad. So, here are the correct times for array creation:
>
> >>> Timer("a=numpy.empty(10,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.083303928375244141, 0.080381870269775391, 0.077172040939331055]
> >>> Timer("a=numpy.empty(100,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.086454868316650391, 0.084085941314697266, 0.083555936813354492]
> >>> Timer("a=numpy.empty(1000,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.084996223449707031, 0.082299947738647461, 0.081347942352294922]
> >>> Timer("a=numpy.empty(10000,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.31068897247314453, 0.30376386642456055, 0.30176281929016113]
> >>> Timer("a=numpy.empty(100000,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.42552995681762695, 0.36864185333251953, 0.36870002746582031]
> >>> Timer("a=numpy.empty(1000000,dtype=numpy.complex128)", "import
> numpy").repeat(3,10000)
> [0.48045611381530762, 0.41251182556152344, 0.40645909309387207]
>
> So, it seems that there is a certain time dependency on size:
>
> array of 10 elements      --> 7.7 us
> array of 100 elements     --> 8.4 us
> array of 1000 elements    --> 8.1 us
> array of 10000 elements   --> 30.2 us
> array of 100000 elements  --> 36.9 us
> array of 1000000 elements --> 40.6 us

The transition looks a bit like a cache effect, although I don't see why the
cache should enter in. But all the allocations look pretty fast to me.

Chuck
From: Pierre T. <thi...@ph...> - 2006-09-14 13:01:18
On 9/13/06, Francesc Altet <fa...@ca...> wrote:
> Well, it seems that malloc actually takes more time when asked for more
> space. However, this can't be the reason why Pierre is seeing that
>
> a = numpy.exp(a)        [1]
>
> is slower than
>
> numpy.exp(a, out=a)     [2]
>
> as I'd say this increment in time is negligible compared with the
> processing time for those big arrays. In fact, here are my times:
>
> >>> Timer("a = numpy.exp(a)", "import numpy;a =
> numpy.random.rand(2048,2048) + 1j *
> numpy.random.rand(2048,2048)").repeat(3,1)
> [2.5527338981628418, 2.5427830219268799, 2.5074479579925537]
> >>> Timer("numpy.exp(a,out=a)", "import numpy;a =
> numpy.random.rand(2048,2048) + 1j *
> numpy.random.rand(2048,2048)").repeat(3,1)
> [2.5298278331756592, 2.5082788467407227, 2.5222280025482178]
>
> So, both times are comparable.

Yeah, sorry about that: I had not checked the timing carefully. It seemed
slower to me, but you're right, this is not a problem as long as there is
enough free RAM.

OK, I'll go back to my coding and do what I should always do: care about
optimization later.

Thanks for all the comments and explanations!

Pierre

--
Pierre Thibault
616 Clark Hall, Cornell University
(607) 255-5522
From: Johannes L. <a.u...@gm...> - 2006-09-13 08:24:04
Hi,

One word in advance: instead of optimizing, it is advisable to look for a way
to refactor the algorithm to use smaller arrays, since this kind of
optimization almost certainly reduces readability. If you do it, comment
well. ;-)

If you have very large arrays and want to do some arithmetic on them, say

    B = 2*B + C

you can use in-place operators to avoid memory overhead:

    B *= 2
    B += C

Another trick which works in most situations is to do the outermost loop in
Python:

    for i in xrange(len(B)):
        B[i] = 2*B[i] + C[i]

This reduces the temporary array size to 1/len(B) while still being fast (if
the other dimensions are large enough). For very large 1-d arrays, you could
split them into chunks of a certain size (see the sketch below).

However, you have to be careful that your calculation does not access
already-calculated elements of B. Consider the following example:

    In [2]: B = arange(10)
    In [3]: B + B[::-1]
    Out[3]: array([9, 9, 9, 9, 9, 9, 9, 9, 9, 9])
    In [4]: B += B[::-1]
    In [5]: B
    Out[5]: array([ 9,  9,  9,  9,  9, 14, 15, 16, 17, 18])

Johannes
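As a concrete version of the chunking idea mentioned above, here is a sketch
(my own, with an arbitrary chunk size) for the B = 2*B + C example: the
temporaries created by the expression are only ever chunk-sized rather than
len(B)-sized.

    import numpy as np

    def update_chunked(B, C, chunk=65536):
        # Computes B = 2*B + C in place, one chunk at a time, so the temporary
        # built by "2*b + c" holds at most `chunk` elements.
        # Assumes 1-d arrays of equal length that do not overlap in memory.
        for start in range(0, len(B), chunk):
            b = B[start:start + chunk]
            c = C[start:start + chunk]
            b[:] = 2 * b + c
        return B

    B = np.random.rand(10**7)
    C = np.random.rand(10**7)
    update_chunked(B, C)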