From: Peter V. <ve...@em...> - 2003-09-01 09:34:41
|
Hi All,

I noticed that the sum() and mean() methods of numarrays use the precision of the given array in their calculations. That leads to results like this:

>>> array([255, 255], Int8).sum()
-2
>>> array([255, 255], Int8).mean()
-1.0

Would it not be better to use double precision internally and return the correct result?

Cheers, Peter

--
Dr. Peter J. Verveer
Cell Biology and Cell Biophysics Programme
EMBL
Meyerhofstrasse 1
D-69117 Heidelberg
Germany

Tel. : +49 6221 387245
Fax  : +49 6221 387242
Email: Pet...@em...
|
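A minimal interactive sketch of the work-around this implies today: up-cast by hand before summing. The values below are kept inside the Int8 range so that the only wrap-around comes from same-type accumulation; the exact outputs are illustrative and assume numarray's astype(), sum(), and mean() behave as described in this thread.

>>> from numarray import array, Int8, Float64
>>> a = array([100, 100], Int8)        # both values fit in Int8
>>> a.sum()                            # accumulated in Int8: 200 wraps around
-56
>>> a.astype(Float64).sum()            # manual up-cast: correct, but copies the array
200.0
>>> a.astype(Float64).mean()
100.0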
From: Todd M. <jm...@st...> - 2003-09-02 18:33:19
|
On Mon, 2003-09-01 at 05:34, Peter Verveer wrote:
> Hi All,
>
> I noticed that the sum() and mean() methods of numarrays use the precision of the given array in their calculations. That leads to results like this:
>
> >>> array([255, 255], Int8).sum()
> -2
> >>> array([255, 255], Int8).mean()
> -1.0
>
> Would it not be better to use double precision internally and return the correct result?
>
> Cheers, Peter

Hi Peter,

I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.

(1) The first "solution" is to require users to do their own up-casting prior to calling mean() or sum(). This gives the end user fine control over storage cost but leaves the C-like pitfall/bug you discovered. I mention this because this is how the numarray/Numeric reductions are designed. Is there a reason why the numarray/Numeric reductions don't implicitly up-cast?

(2) The second way is what you proposed: use double precision within mean and sum. This has great simplicity but gives no control over storage usage, and as implemented, the storage would be much higher than one might think, potentially 8x.

(3) Lastly, Perry suggested a more radical approach: rather than changing the mean and sum methods themselves, we could alter the universal function accumulate and reduce methods to implicitly use additional precision. Perry's idea was to make all accumulations and reductions up-cast their results to the largest type of the current family, either Bool, Int64, Float64, or Complex64. By doing this, we can improve the utility of the reductions and accumulations as well as fixing the problem with sum and mean.

--
Todd Miller
jm...@st...
STSCI / ESS / SSB
|
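To see what option (3) would change at the user level, here is a hedged sketch done by hand with today's spellings: the numeric effect of Perry's proposal on a reduction is roughly that of up-casting the input to the widest type of its family first, except that the proposal would do this inside the reduction without the full-size temporary. The session output is illustrative, not a transcript from a real numarray build.

>>> from numarray import array, add, Int8, Int64
>>> img = array([[100, 100, 100],
...              [100, 100, 100]], Int8)
>>> narrow = add.reduce(img)               # today: accumulates and returns Int8
>>> wide = add.reduce(img.astype(Int64))   # roughly what the proposal would return
>>> narrow.type(), wide.type()
(Int8, Int64)
>>> narrow[0], wide[0]                     # 100 + 100: wrapped vs. correct
(-56, 200)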
From: <ve...@em...> - 2003-09-02 20:32:43
|
Hi Todd,

> I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.

It was just an innocent question, I did not think it would have such ramifications :-) Here are my thoughts:

If I understand you well, the sum() and mean() array methods are based on the reduce method of the universal functions, and these do their calculations in the precision of the array, is that correct?

I also gave this some thought, and I would like to make a distinction between a reduction and the calculation of a statistical value such as the mean or the sum. To me, a reduction means the projection of a multi-dimensional array to an array with a rank that is one less than the input. The result is still an array, and often I want the result to have the same precision as the input. A statistical calculation like a sum or a mean is different: the result should be correct and the same irrespective of the type of the input, and that mandates using sufficient precision in the calculation. Note, however, that such a statistic is a scalar result and does not require temporary storage at high precision for the whole array.

So keeping this in mind, my comments on your solutions are:

> (1) The first "solution" is to require users to do their own up-casting prior to calling mean() or sum(). This gives the end user fine control over storage cost but leaves the C-like pitfall/bug you discovered. I mention this because this is how the numarray/Numeric reductions are designed. Is there a reason why the numarray/Numeric reductions don't implicitly up-cast?

For reductions this behaviour suits me, precisely because it allows control over storage, which is one of the strengths of numarray. For calculating the mean or the sum of an array this is, however, an expensive solution for a very common operation. I do use this solution, but sometimes I prefer an optimized C routine instead.

> (2) The second way is what you proposed: use double precision within mean and sum. This has great simplicity but gives no control over storage usage, and as implemented, the storage would be much higher than one might think, potentially 8x.

I did not mean to suggest storing a casted version of the array before calculating the mean or the sum. The calculation can be done in double precision without converting the whole array in memory. I think we can all agree that storing a converted copy would not be a good idea.

> (3) Lastly, Perry suggested a more radical approach: rather than changing the mean and sum methods themselves, we could alter the universal function accumulate and reduce methods to implicitly use additional precision. Perry's idea was to make all accumulations and reductions up-cast their results to the largest type of the current family, either Bool, Int64, Float64, or Complex64. By doing this, we can improve the utility of the reductions and accumulations as well as fixing the problem with sum and mean.

I think that is a great idea in principle, but I think you should consider this carefully: First of all, control of the storage cost is lost when you do a reduction. I would not always find that desirable. Thus, I would like the old behaviour for reductions to remain accessible, either as a different method or by setting an optional argument. Additionally, it would not work well for some operations. For instance, precise calculation of the mean requires floating point precision. Maybe this can be solved, but it would require different casting behaviour for different operations. That might be too much trouble...

I would like to propose a fourth option:

(4) Provide separate implementations for array methods like sum() and mean() that only calculate the scalar result. No additional storage would be necessary and the calculation can be done in double precision. I guess that the disadvantage is that one cannot leverage the existing code in the ufuncs so easily. I also assume that it would not be as general a solution as changing the reduce method is.

I do have some experience writing this sort of code for multidimensional arrays in C and would be happy to contribute code. However, I am not too familiar with the internals of the numarray library and I don't know how well my code would fit in there (although I interface all my code to numarray). But I am happy to help if I can. numarray is great stuff, it has become the main tool for my numerical work.

Peter

--
Dr. Peter J. Verveer
Cell Biology and Cell Biophysics Programme
EMBL
Meyerhofstrasse 1
D-69117 Heidelberg
Germany

Tel. : +49 6221 387245
Fax  : +49 6221 387242
|
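A hedged Python sketch of what option (4) amounts to, written at user level: accumulate into a C-double scalar while up-casting only one small block at a time, so the whole array is never converted in memory. The names scalar_sum, scalar_mean, and blocksize are illustrative, it assumes Numeric/numarray-style ravel() and astype(), and a real version would presumably be written in C against numarray's internals.

from numarray import ravel, Float64

def scalar_sum(arr, blocksize=4096):
    """Sum all elements of arr, accumulating in double precision.
    Only one block of at most blocksize elements is up-cast at a
    time, so the temporary storage stays small and fixed."""
    flat = ravel(arr)
    total = 0.0                              # Python float == C double
    for start in range(0, len(flat), blocksize):
        block = flat[start:start + blocksize].astype(Float64)
        total = total + block.sum()          # per-block sum in double precision
    return total

def scalar_mean(arr, blocksize=4096):
    """Mean of all elements, computed from the double-precision sum."""
    flat = ravel(arr)
    return scalar_sum(flat, blocksize) / len(flat)

On an Int8 array this gives the double-precision result without the 8x temporary that a full astype(Float64) copy would need.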
From: Fernando P. <fp...@co...> - 2003-09-02 19:20:27
|
Todd Miller wrote:
> I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.

[snip]

Just a thought: why not make the upcasting an optional parameter?

I've found that python's arguments with default values provide a very convenient way of giving the user fine control with minimal conceptual overhead.

I'd rather write:

arr = array([255, 255], Int8)
... later
arr.sum(use_double=1)  # or some similar way of tuning sum()

than

arr = array([255, 255], Int8)
... later
array(arr,typecode='d').sum()

Numarray/numpy are trying to tackle an inherently hard problem: matching the high-level comfort of python with the low-level performance of C. This situation is an excellent example of what I've seen described as the 'law of leaky abstractions': in most cases where you encapsulate low level details in a high level abstraction, there end up being situations where the details poke through the abstraction and cause you grief. This is an inherently tricky problem, with no easy, universal solution (that I'm aware of).

Cheers,

f.
|
From: Robert K. <ke...@ca...> - 2003-09-02 21:31:14
|
On Tue, Sep 02, 2003 at 01:20:17PM -0600, Fernando Perez wrote:
> Todd Miller wrote:
> > I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.
>
> [snip]
>
> Just a thought: why not make the upcasting an optional parameter?
>
> I've found that python's arguments with default values provide a very convenient way of giving the user fine control with minimal conceptual overhead.
>
> I'd rather write:
>
> arr = array([255, 255], Int8)
> ... later
> arr.sum(use_double=1)  # or some similar way of tuning sum()

+1, but arr.sum(typecode=Float64) would be my choice of spelling. Not sure what the default typecode should be, though. Probably Perry's suggestion: the largest type of the family.

--
Robert Kern
ke...@ca...

"In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die."
  -- Richard Harter
|
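To make the keyword-argument spelling concrete, here is a hedged user-level sketch of the semantics under discussion; the wrapper name sum_as and the choice of Float64 as the default are purely illustrative. Note that this wrapper still makes a temporary up-cast copy, whereas a built-in sum(typecode=...) could accumulate in the wider type internally.

from numarray import Float64

def sum_as(arr, typecode=Float64):
    """Sum arr in the requested precision; typecode=None keeps the
    current same-type behaviour."""
    if typecode is None:
        return arr.sum()
    return arr.astype(typecode).sum()

The open question is only what the default should be: always Float64, the widest type of the array's own family (Perry's suggestion), or None for backward compatibility.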
From: Todd M. <jm...@st...> - 2003-09-03 12:43:36
|
On Tue, 2003-09-02 at 15:20, Fernando Perez wrote:
> Todd Miller wrote:
> > I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.
>
> [snip]
>
> Just a thought: why not make the upcasting an optional parameter?
<snip>

That sounds like a great idea. Simple, but doesn't throw out all storage control.

> in most cases where you encapsulate low level details in a high level abstraction, there end up being situations where the details poke through the abstraction and cause you grief.

Thanks for these kind words.

--
Todd Miller <jm...@st...>
|
From: Peter V. <ve...@em...> - 2003-09-03 15:40:01
|
I also believe that the current behavior of the numarray/Numeric reduce method (not to cast) is the right one. It is fine to leave the user with the responsibility to be careful in the case of the reduce operation.

But to correctly calculate a mean or a sum with the array methods that are provided, you have to convert the array first to a more precise type and then do the calculation. That wastes space and is slow, and seems not very elegant considering that these are very common statistical operations. A separate implementation for the mean() and sum() methods that uses double precision in the calculation without first converting the array would be straightforward. Since calculating the mean or the sum of a complete array is such a common case, I think this would be useful.

That leaves the same problem for the reduce method, which in some cases would first require a conversion, but this is much less of a problem (at least for me). Having to convert before the operation can be wasteful, though. I do like the idea that was also proposed on the list to supply an optional argument to specify the output type. Then the user has full control of the output type (nice if you want high precision in the result without converting the input), and the code can easily be used to implement the mean() and sum() methods. The default behavior of the reduce method can then remain unchanged, so this would not be an obtrusive change. But I imagine that this may complicate the implementation.

Cheers, Peter

On Wednesday 03 September 2003 17:13, Paul Dubois wrote:
> So after you get the result in a higher precision, then what?
> a. Cast it down blindly?
> b. Test every element and throw an exception if casting would lose precision?
> c. Test every element and return the smallest kind that "holds" the answer?
> d. Always return the highest precision?
>
> a. is close to equivalent to the present behavior
> b. and c. are expensive.
> c. makes the type of the result unpredictable, which has its own problems.
> d. uses space
>
> It was the original design of Numeric to be fast rather than careful, user beware. There is now another considerable portion of the community that is for very careful, and another that is for keeping it small. You can't satisfy all those goals at once.
>
> If you make it slow or big in order to be careful, it will always be slow or big, while the opposite is not true. If you make it fast, the user can be careful.
>
> Todd Miller wrote:
> > On Mon, 2003-09-01 at 05:34, Peter Verveer wrote:
> > > Hi All,
> > >
> > > I noticed that the sum() and mean() methods of numarrays use the precision of the given array in their calculations. That leads to results like this:
> > >
> > > >>> array([255, 255], Int8).sum()
> > > -2
> > > >>> array([255, 255], Int8).mean()
> > > -1.0
> > >
> > > Would it not be better to use double precision internally and return the correct result?
> > >
> > > Cheers, Peter
> >
> > Hi Peter,
> >
> > I thought about this a lot yesterday and today talked it over with Perry. There are several ways to fix the problem with mean() and sum(), and I'm hoping that you and the rest of the community will help sort them out.
> >
> > (1) The first "solution" is to require users to do their own up-casting prior to calling mean() or sum(). This gives the end user fine control over storage cost but leaves the C-like pitfall/bug you discovered. I mention this because this is how the numarray/Numeric reductions are designed. Is there a reason why the numarray/Numeric reductions don't implicitly up-cast?
> >
> > (2) The second way is what you proposed: use double precision within mean and sum. This has great simplicity but gives no control over storage usage, and as implemented, the storage would be much higher than one might think, potentially 8x.
> >
> > (3) Lastly, Perry suggested a more radical approach: rather than changing the mean and sum methods themselves, we could alter the universal function accumulate and reduce methods to implicitly use additional precision. Perry's idea was to make all accumulations and reductions up-cast their results to the largest type of the current family, either Bool, Int64, Float64, or Complex64. By doing this, we can improve the utility of the reductions and accumulations as well as fixing the problem with sum and mean.
|
From: Todd M. <jm...@st...> - 2003-09-03 21:46:14
|
I want to thank everyone who participated in this discussion: Peter, Fernando, Robert, Paul, Perry, and Tim. Tim's post has IMO a completely synthesized solution:

1. Add a type parameter to sum() which defaults to the widest type.

2. Add a type parameter to the reductions (and fix output type handling). The default is same-type, as it is now. No major changes to the C code.

3. Add a WidestType(array) function:

   Bool                           --> Bool
   Int8, Int16, Int32, Int64      --> Int64
   UInt8, UInt16, UInt32, UInt64  --> UInt64 (Int64 on win32)
   Float32, Float64               --> Float64
   Complex32, Complex64           --> Complex64

The only thing this really leaves out is a higher-performance implementation of sum/mean, which Peter referred to a few times. Peter, if you want to write a specialized module, we'd be happy to put it in the add-ons package.

Thanks again everybody,
Todd

--
Todd Miller <jm...@st...>
|
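A hedged sketch of the WidestType(array) mapping described above, written as a plain dictionary lookup; the real function would live inside numarray, the exact importable type names are assumed from this thread, and the win32 special case is noted in a comment rather than handled in code.

from numarray import Bool, Int8, Int16, Int32, Int64, \
     UInt8, UInt16, UInt32, UInt64, \
     Float32, Float64, Complex32, Complex64

# Widest member of each type family; on win32, where UInt64 is not
# available, the unsigned family would map to Int64 instead.
_WIDEST = {
    Bool:      Bool,
    Int8:      Int64,  Int16:  Int64,  Int32:  Int64,  Int64:  Int64,
    UInt8:     UInt64, UInt16: UInt64, UInt32: UInt64, UInt64: UInt64,
    Float32:   Float64, Float64: Float64,
    Complex32: Complex64, Complex64: Complex64,
}

def WidestType(arr):
    """Return the widest type in the same family as arr's type."""
    return _WIDEST[arr.type()]

sum() could then default its type parameter to WidestType(self), while the reductions accept the same parameter but default to the current same-type behaviour.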