From: Tim H. <tim...@co...> - 2006-06-03 03:18:20
|
Some time ago some people, myself included, were making some noise about having 'array' iterate over iterable objects, producing ndarrays in a manner analogous to the way sequences are treated. I finally got around to looking at it seriously and came to the following three conclusions:

   1. All I really care about is the 1D case where dtype is specified.
      This case should be relatively easy to implement so that it's
      fast. Most other cases are not likely to be particularly faster
      than converting the iterators to lists at the Python level and
      then passing those lists to array.
   2. 'array' already has plenty of special cases. I'm reluctant to add
      more.
   3. Adding this to 'array' would be non-trivial. The more cases we
      tried to make fast, the more likely that some of the paths would
      be buggy. Regardless of how we did it, though, some cases would be
      much slower than others, which would probably be surprising.

So, with that in mind, I retreated a little and implemented the simplest thing that did the stuff that I cared about:

    fromiter(iterable, dtype, count) => ndarray of type dtype and length count

This is essentially the same interface as fromstring, except that the values of dtype and count are always required. Some primitive benchmarking indicates that 'fromiter(generator, dtype, count)' is about twice as fast as 'array(list(generator))' for medium to large arrays. When producing very large arrays, the advantage of fromiter is larger, presumably because 'list(generator)' causes things to start swapping.

Anyway, I'm about to bail out of town till the middle of next week, so it'll be a while till I can get it clean enough to submit in some form or another. Plenty of time for people to think of why it's a terrible idea ;-)

-tim
|
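[Editor's note: the interface described above can be exercised like this. A minimal sketch; the data and dtype are just illustrative examples, not from the original post.]

```python
import numpy as np

# Build the same data two ways: via fromiter with an explicit count,
# and via the list-based route it is meant to outperform.
gen = (x * x for x in range(10))
a = np.fromiter(gen, dtype=np.float64, count=10)   # typed, preallocated 1-d array
b = np.array([x * x for x in range(10)], dtype=np.float64)

print(a)  # the squares 0.0 .. 81.0 as a float64 array of length 10
```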
From: Tim H. <tim...@co...> - 2006-06-10 20:20:31
|
I finally got around to cleaning up and checking in fromiter. As Travis suggested, this version does not require that you specify count. From the docstring:

    fromiter(...)
    fromiter(iterable, dtype, count=-1) returns a new 1d array
    initialized from iterable. If count is nonnegative, the new array
    will have count elements; otherwise its size is determined by the
    generator.

If count is specified, it allocates the full array ahead of time. If it is not, it periodically reallocates space for the array, allocating 50% extra space each time and reallocating back to the final size at the end (to give realloc a chance to reclaim any extra space).

Speedwise, "fromiter(iterable, dtype, count)" is about twice as fast as "array(list(iterable), dtype=dtype)". Omitting count slows things down by about 15%; still much faster than using "array(list(...))". It also is going to chew up more memory than if you include count, at least temporarily, but should typically still use much less than the "array(list(...))" approach.

-tim
|
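[Editor's note: the count=-1 reallocation strategy described above can be modeled in pure Python. This is an illustrative sketch of the behavior, not the actual C implementation; the initial buffer size is an assumption.]

```python
import numpy as np

def fromiter_growing(iterable, dtype):
    """Model of the count=-1 path: grow the buffer by 50% each time it
    fills, then trim back to the exact final size at the end."""
    buf = np.empty(4, dtype=dtype)  # small initial allocation (assumed size)
    n = 0
    for item in iterable:
        if n >= len(buf):
            # allocate 50% extra space and copy the filled portion over
            bigger = np.empty(int(len(buf) * 1.5) + 1, dtype=dtype)
            bigger[:n] = buf
            buf = bigger
        buf[n] = item
        n += 1
    return buf[:n].copy()  # "realloc back" to the final size

print(fromiter_growing((x for x in range(100)), np.int64))
```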
From: David M. C. <co...@ph...> - 2006-06-10 21:42:58
|
On Sat, Jun 10, 2006 at 01:18:05PM -0700, Tim Hochberg wrote:
> I finally got around to cleaning up and checking in fromiter. As Travis
> suggested, this version does not require that you specify count. From
> the docstring:
>
>     fromiter(...)
>     fromiter(iterable, dtype, count=-1) returns a new 1d array
>     initialized from iterable. If count is nonnegative, the new array
>     will have count elements; otherwise its size is determined by the
>     generator.
>
> If count is specified, it allocates the full array ahead of time. If it
> is not, it periodically reallocates space for the array, allocating 50%
> extra space each time and reallocating back to the final size at the end
> (to give realloc a chance to reclaim any extra space).
>
> Speedwise, "fromiter(iterable, dtype, count)" is about twice as fast as
> "array(list(iterable), dtype=dtype)". Omitting count slows things down by
> about 15%; still much faster than using "array(list(...))". It also is
> going to chew up more memory than if you include count, at least
> temporarily, but should typically still use much less than the
> "array(list(...))" approach.

Can this be integrated into array() so that array(iterable, dtype=dtype) does the expected thing?

Can you try to find the length of the iterable, with PySequence_Size() on the original object? This gets a bit iffy, as that might not be correct (but it could be used as a hint).

What about iterables that return, say, tuples? Maybe add a shape argument, so that fromiter(iterable, dtype, count, shape=(None, 3)) expects elements from iterable that can be turned into arrays of shape (3,)? That could replace count, too.

--
David M. Cooke    http://arbutus.physics.mcmaster.ca/dmc/    co...@ph...
|
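[Editor's note: the proposed shape argument might behave roughly like the following pure-Python sketch. This is a hypothetical illustration of the suggestion, not an existing numpy API; the function name and semantics are assumptions.]

```python
import numpy as np

def fromiter_shaped(iterable, dtype, shape):
    """Hypothetical fromiter variant: shape=(None, 3) means each element
    of the iterable is a length-3 item, with the leading dimension
    determined by how many items the iterable yields."""
    # Flatten each yielded item, then restore the trailing dimensions.
    flat = np.fromiter((v for item in iterable for v in item), dtype=dtype)
    return flat.reshape((-1,) + tuple(shape[1:]))

rows = fromiter_shaped(((x, x + 1, x + 2) for x in range(4)), np.int64, (None, 3))
print(rows.shape)  # (4, 3)
```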
From: Robert K. <rob...@gm...> - 2006-06-10 22:05:45
|
David M. Cooke wrote:
> Can this be integrated into array() so that array(iterable, dtype=dtype)
> does the expected thing?

That was rejected early on because array() is so incredibly overloaded as it is.

http://article.gmane.org/gmane.comp.python.numeric.general/5756

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
|
From: Tim H. <tim...@co...> - 2006-06-10 22:31:12
|
David M. Cooke wrote:
> On Sat, Jun 10, 2006 at 01:18:05PM -0700, Tim Hochberg wrote:
>> I finally got around to cleaning up and checking in fromiter. As Travis
>> suggested, this version does not require that you specify count. From
>> the docstring:
>>
>>     fromiter(...)
>>     fromiter(iterable, dtype, count=-1) returns a new 1d array
>>     initialized from iterable. If count is nonnegative, the new array
>>     will have count elements; otherwise its size is determined by the
>>     generator.
>>
>> If count is specified, it allocates the full array ahead of time. If it
>> is not, it periodically reallocates space for the array, allocating 50%
>> extra space each time and reallocating back to the final size at the end
>> (to give realloc a chance to reclaim any extra space).
>>
>> Speedwise, "fromiter(iterable, dtype, count)" is about twice as fast as
>> "array(list(iterable), dtype=dtype)". Omitting count slows things down by
>> about 15%; still much faster than using "array(list(...))". It also is
>> going to chew up more memory than if you include count, at least
>> temporarily, but should typically still use much less than the
>> "array(list(...))" approach.
>
> Can this be integrated into array() so that array(iterable, dtype=dtype)
> does the expected thing?

It gets a little sticky, since the expected thing is probably that array([iterable, iterable, iterable], dtype=dtype) work and produce an array of shape [3, N]. That looks like it would be hard to do efficiently.

> Can you try to find the length of the iterable, with PySequence_Size() on
> the original object? This gets a bit iffy, as that might not be correct
> (but it could be used as a hint).

The way the code is set up, a hint could be made use of with little additional complexity. Allegedly, some objects in 2.5 will grow __length_hint__, which could be made use of as well. I'm not very motivated to mess with this at the moment, though, as the benefit is relatively small.

> What about iterables that return, say, tuples? Maybe add a shape argument,
> so that fromiter(iterable, dtype, count, shape=(None, 3)) expects elements
> from iterable that can be turned into arrays of shape (3,)? That could
> replace count, too.

I expect that this would double (or more) the complexity of the current code (which is nice and simple at present). I'm inclined to leave it as it is and advocate solutions of this type:

    >>> import numpy
    >>> tupleiter = ((x, x+1, x+2) for x in range(10))  # Just for example
    >>> def flatten(x):
    ...     for y in x:
    ...         for z in y:
    ...             yield z
    >>> numpy.fromiter(flatten(tupleiter), int).reshape(-1, 3)
    array([[ 0,  1,  2],
           [ 1,  2,  3],
           [ 2,  3,  4],
           [ 3,  4,  5],
           [ 4,  5,  6],
           [ 5,  6,  7],
           [ 6,  7,  8],
           [ 7,  8,  9],
           [ 8,  9, 10],
           [ 9, 10, 11]])

[As a side note, I'm quite surprised that there isn't a way to flatten stuff already in itertools, but if there is, I can't find it.]

-tim
|
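[Editor's note: on the flattening side note, itertools did grow exactly this in later Python versions: itertools.chain.from_iterable lazily flattens one level of nesting, playing the role of the hand-written flatten() generator. A minimal sketch:]

```python
import itertools
import numpy as np

tupleiter = ((x, x + 1, x + 2) for x in range(10))
# chain.from_iterable yields 0, 1, 2, 1, 2, 3, ... lazily, one level deep,
# so fromiter can consume it without materializing any intermediate list.
arr = np.fromiter(itertools.chain.from_iterable(tupleiter), int).reshape(-1, 3)
print(arr)  # rows (0, 1, 2) through (9, 10, 11)
```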
From: Travis O. <oli...@ie...> - 2006-06-03 07:25:56
|
Tim Hochberg wrote:
> Some time ago some people, myself included, were making some noise
> about having 'array' iterate over iterable objects, producing ndarrays in
> a manner analogous to the way sequences are treated. I finally got
> around to looking at it seriously and came to the following three
> conclusions:
>
>    1. All I really care about is the 1D case where dtype is specified.
>       This case should be relatively easy to implement so that it's
>       fast. Most other cases are not likely to be particularly faster
>       than converting the iterators to lists at the Python level and
>       then passing those lists to array.
>    2. 'array' already has plenty of special cases. I'm reluctant to add
>       more.
>    3. Adding this to 'array' would be non-trivial. The more cases we
>       tried to make fast, the more likely that some of the paths would
>       be buggy. Regardless of how we did it, though, some cases would be
>       much slower than others, which would probably be surprising.

Good job. I just added a function called fromiter for this very purpose. Right now, it's just a stub that calls list(obj) first and then array. Your code would be a perfect fit for it. I think count could be optional, though, to handle cases where the count can be determined from the object.

We'll look forward to your check-in.

-Travis
|
From: Tim H. <tim...@co...> - 2006-06-03 14:31:33
|
Travis Oliphant wrote:
> Tim Hochberg wrote:
>> Some time ago some people, myself included, were making some noise
>> about having 'array' iterate over iterable objects, producing ndarrays
>> in a manner analogous to the way sequences are treated. [...]
>
> Good job. I just added a function called fromiter for this very purpose.
> Right now, it's just a stub that calls list(obj) first and then array.
> Your code would be a perfect fit for it. I think count could be optional,
> though, to handle cases where the count can be determined from the object.

I'll look at that when I get back. There are two ways to approach this: one is to only allow count to be optional in those cases where the original object supports either __len__ or __length_hint__. The advantage there is that it's easy, and there's no chance of locking up the interpreter by passing an unbounded generator. The other way is to figure out the length based on the generator itself. The "natural" way to do this is to steal stuff from array.array. However, that doesn't export a C-level interface that I can tell (everything is declared static), so you'd be going through the interpreter, which would potentially be slow.

I guess another approach would be to hijack PyArray_Resize and steal the resizing pattern from array.array. I'm not sure how well that would work, though. I'll look into it...

-tim

> We'll look forward to your check-in.
>
> -Travis
|
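[Editor's note: the first approach Tim describes, making count optional only when the object can report its own length, can be sketched as follows. This is an illustrative helper, not numpy code; the name probe_length is an assumption. The __length_hint__ protocol mentioned here was later formalized in PEP 424.]

```python
def probe_length(obj):
    """Best guess at how many items obj will yield: a real length if it
    has one, then the __length_hint__ protocol, else -1 for 'unknown'
    (matching fromiter's count=-1 convention)."""
    try:
        return len(obj)
    except TypeError:
        pass
    hint = getattr(type(obj), "__length_hint__", None)
    if hint is not None:
        return hint(obj)
    return -1  # unknown; the caller must grow the array as it goes

print(probe_length([1, 2, 3]))            # 3, via len()
print(probe_length(iter([1, 2, 3])))      # 3, via __length_hint__
print(probe_length(x for x in range(5)))  # -1: generators give no hint
```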