From: Sebastian H. <ha...@ms...> - 2006-09-19 22:41:24
|
Hello all,
I just had someone from my lab come to my desk saying: "My god - SciPy is really stupid ... An array with only positive numbers claims to have a negative mean!!"
I was asking about this before ... the reason was of course that her array was of dtype int32 and had enough large values to cause an overflow (wrap-around).
Now that the default for axis is None (for all functions having an axis argument), can we please change dtype to default to float64!?
It is really a very confusing and shocking result to get a negative mean on all-positive values. It has been stated here before that numpy should target the "scientist" rather than the programmer ... I would argue that mean() almost always requires the precision of "double" (float64) to produce usable results.
Please consider this change before the 1.0 release ...
Thanks,
Sebastian Haase |
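For reference, a minimal sketch of the failure mode Sebastian describes (assuming a current NumPy, where the int32 accumulator has to be requested explicitly to reproduce it):

    import numpy as np

    # 50000 * 65536 = 3,276,800,000 exceeds 2**31 - 1, so an int32
    # accumulator wraps around to a negative total.
    a = np.full(50000, 2**16, dtype=np.int32)   # all positive values
    total = a.sum(dtype=np.int32)               # wraps to -1018167296
    print(total, total / len(a))                # negative sum, negative "mean"
    print(a.mean())                             # float64 accumulation: 65536.0

The array itself is all positive; only the accumulator overflows.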
From: Travis O. <oli...@ee...> - 2006-09-19 22:48:06
|
Sebastian Haase wrote: >Hello all, >I just had someone from my lab coming to my desk saying: >"My god - SciPy is really stupid .... >An array with only positive numbers claims to have a negative mean !! "? > > > >I was asking about this before ... the reason was of course that her array was >of dtype int32 and had many large values to cause an overflow (wrap >around) . > >Now that the default for axis is None (for all functions having an axis >argument), >can we please change dtype to default to float64 !? > > The default is float64 now (as long as you are not using numpy.oldnumeric). I suppose more appropriately, we could reduce over float for integer data-types when calculating the mean as well (since a floating point is returned anyway). -Travis |
From: Charles R H. <cha...@gm...> - 2006-09-20 00:37:04
|
On 9/19/06, Travis Oliphant <oli...@ee...> wrote:
>
> Sebastian Haase wrote:
> >On Tuesday 19 September 2006 15:48, Travis Oliphant wrote:
> >>Sebastian Haase wrote:
> ><snip>
> >>>can we please change dtype to default to float64 !?
> >>
> >>The default is float64 now (as long as you are not using
> >>numpy.oldnumeric).
> >>
> >>I suppose more appropriately, we could reduce over float for integer
> >>data-types when calculating the mean as well (since a floating point is
> >>returned anyway).
> >
> >Is now mean() always "reducing over" float64 ?
> >The svn note """Log:
> >Fix mean, std, and var methods so that they reduce over double data-type with
> >integer inputs.
> >"""
> >makes it sound that a float32 input stays float32 ?
>
> Yes, that is true. Only integer inputs are changed because you are
> going to get a floating point output anyway.
>
> >For mean calculation this might introduce large errors - I usually would
> >require double-precision for *any* input type ...
>
> Of course. The system is not fool-proof. I hesitate to arbitrarily
> change this. The advantage of using single-precision calculation is
> that it is faster. We do rely on the user who expressly requests these
> things to be aware of the difficulties.

Speed depends on the architecture. Float is a trifle slower than double
on my Athlon64, but faster on PPC750. I don't know about other machines.
I think there is a good argument for accumulating in double and
converting to float for output if required.

Chuck |
From: Travis O. <oli...@ie...> - 2006-09-20 02:30:52
|
Charles R Harris wrote: > > Speed depends on the achitecture. Float is a trifle slower than double > on my Athlon64, but faster on PPC750. I don't know about other > machines. I think there is a good argument for accumlating in double > and converting to float for output if required. Yes there is. It's just not what NumPy ever does so it would be an exception in this case and would need a more convincing argument in my opinion. You can always specify the accumulation type yourself with the dtype argument. We are only talking about what the default should be. -Travis |
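As a minimal illustration of the dtype argument Travis refers to (the array contents here are only an example):

    import numpy as np

    a = np.random.rand(10**6).astype(np.float32)
    print(a.mean())                  # reduces in the array's own precision
    print(a.mean(dtype=np.float64))  # request a float64 accumulator instead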
From: Sebastian H. <ha...@ms...> - 2006-09-20 00:04:03
|
On Tuesday 19 September 2006 15:48, Travis Oliphant wrote:
> Sebastian Haase wrote:
<snip>
> >can we please change dtype to default to float64 !?
>
> The default is float64 now (as long as you are not using
> numpy.oldnumeric).
>
> I suppose more appropriately, we could reduce over float for integer
> data-types when calculating the mean as well (since a floating point is
> returned anyway).
>
Is mean() now always "reducing over" float64 ?
The svn note
"""Log:
Fix mean, std, and var methods so that they reduce over double data-type with
integer inputs.
"""
makes it sound that a float32 input stays float32 ?
For mean calculation this might introduce large errors - I usually would
require double-precision for *any* input type ...
(I don't know how to say this for complex types !? Are real and imag
treated separately / independently here ?)
Thanks,
Sebastian Haase |
From: Travis O. <oli...@ee...> - 2006-09-20 00:17:51
|
Sebastian Haase wrote: >On Tuesday 19 September 2006 15:48, Travis Oliphant wrote: > > >>Sebastian Haase wrote: >> >> ><snip> > > >>>can we please change dtype to default to float64 !? >>> >>> >>The default is float64 now (as long as you are not using >>numpy.oldnumeric). >> >>I suppose more appropriately, we could reduce over float for integer >>data-types when calculating the mean as well (since a floating point is >>returned anyway). >> >> >> > >Is now mean() always "reducing over" float64 ? >The svn note """Log: >Fix mean, std, and var methods so that they reduce over double data-type with >integer inputs. >""" >makes it sound that a float32 input is stays float32 ? > > Yes, that is true. Only integer inputs are changed because you are going to get a floating point output anyway. >For mean calculation this might introduce large errors - I usually would >require double-precision for *any* input type ... > > Of course. The system is not fool-proof. I hesitate to arbitrarily change this. The advantage of using single-precision calculation is that it is faster. We do rely on the user who expressly requests these things to be aware of the difficulties. >(don't know how to say this for complex types !? Are here real and imag >treated separately / independently ?) > > There is a complex add performed at a low-level as two separate adds. The addition is performed in the precision requested. -Travis |
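A small sketch of why single-precision accumulation worries people here: once a float32 running sum reaches 2**24, adding values of order one no longer changes it at all.

    import numpy as np

    big = np.float32(2**24)        # 16777216.0
    print(big + np.float32(1.0))   # still 16777216.0 - the increment is lost
    # A naive float32 running sum of ~1.0-sized values therefore stops
    # growing past this point, and the resulting mean drifts low.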
From: Sebastian H. <ha...@ms...> - 2006-09-20 00:51:50
|
On Tuesday 19 September 2006 17:17, Travis Oliphant wrote:
> Sebastian Haase wrote:
> >On Tuesday 19 September 2006 15:48, Travis Oliphant wrote:
> >>Sebastian Haase wrote:
> ><snip>
> >>>can we please change dtype to default to float64 !?
> >>
> >>The default is float64 now (as long as you are not using
> >>numpy.oldnumeric).
> >>
> >>I suppose more appropriately, we could reduce over float for integer
> >>data-types when calculating the mean as well (since a floating point is
> >>returned anyway).
> >
> >Is now mean() always "reducing over" float64 ?
> >The svn note """Log:
> >Fix mean, std, and var methods so that they reduce over double data-type
> >with integer inputs.
> >"""
> >makes it sound that a float32 input stays float32 ?
>
> Yes, that is true. Only integer inputs are changed because you are
> going to get a floating point output anyway.
>
> >For mean calculation this might introduce large errors - I usually would
> >require double-precision for *any* input type ...
>
> Of course. The system is not fool-proof. I hesitate to arbitrarily
> change this. The advantage of using single-precision calculation is
> that it is faster. We do rely on the user who expressly requests these
> things to be aware of the difficulties.

I still would argue that getting a "good" (smaller rounding errors) answer
should be the default -- if speed is wanted, then *that* could still be
specified by explicitly using dtype=float32 (which would also be a possible
choice for int32 input).

In image processing we always want means to be calculated in float64 even
though the input data is always float32 (if not uint16).

Also it is simpler to say "float64 is the default" (full stop) - instead of
"float64 is the default unless you have float32".

Thanks,
Sebastian Haase |
From: Travis O. <oli...@ie...> - 2006-09-20 02:33:35
|
Sebastian Haase wrote: > I still would argue that getting a "good" (smaller rounding errors) answer > should be the default -- if speed is wanted, then *that* could be still > specified by explicitly using dtype=float32 (which would also be a possible > choice for int32 input) . > So you are arguing for using long double then.... ;-) > In image processing we always want means to be calculated in float64 even > though input data is always float32 (if not uint16). > > Also it is simpler to say "float64 is the default" (full stop.) - instead > > "float64 is the default unless you have float32" > "the type you have is the default except for integers". Do you really want float64 to be the default for float96? Unless we are going to use long double as the default, then I'm not convinced that we should special-case the "double" type. -Travis |
From: Sebastian H. <ha...@ms...> - 2006-09-20 03:07:41
|
Travis Oliphant wrote: > Sebastian Haase wrote: >> I still would argue that getting a "good" (smaller rounding errors) answer >> should be the default -- if speed is wanted, then *that* could be still >> specified by explicitly using dtype=float32 (which would also be a possible >> choice for int32 input) . >> > So you are arguing for using long double then.... ;-) > >> In image processing we always want means to be calculated in float64 even >> though input data is always float32 (if not uint16). >> >> Also it is simpler to say "float64 is the default" (full stop.) - instead >> >> "float64 is the default unless you have float32" >> > "the type you have is the default except for integers". Do you really > want float64 to be the default for float96? > > Unless we are going to use long double as the default, then I'm not > convinced that we should special-case the "double" type. > I guess I'm not really aware of the float96 type ... Is that a "machine type" on any system ? I always thought that -- e.g. coming from C -- double is "as good as it gets"... Who uses float96 ? I heard somewhere that (some) CPUs use 80bits internally when calculating 64bit double-precision... Is this not going into some academic argument !? For all I know, calculating mean()s (and sum()s, ...) is always done in double precision -- never in single precision, even when the data is in float32. Having float32 be the default for float32 data is just requiring more typing, and more explaining ... it would compromise numpy usability as a day-to-day replacement for other systems. Sorry, if I'm being ignorant ... - Sebastian |
From: Charles R H. <cha...@gm...> - 2006-09-20 03:15:33
|
On 9/19/06, Sebastian Haase <ha...@ms...> wrote: > > Travis Oliphant wrote: > > Sebastian Haase wrote: > >> I still would argue that getting a "good" (smaller rounding errors) > answer > >> should be the default -- if speed is wanted, then *that* could be still > >> specified by explicitly using dtype=float32 (which would also be a > possible > >> choice for int32 input) . > >> > > So you are arguing for using long double then.... ;-) > > > >> In image processing we always want means to be calculated in float64 > even > >> though input data is always float32 (if not uint16). > >> > >> Also it is simpler to say "float64 is the default" (full stop.) - > instead > >> > >> "float64 is the default unless you have float32" > >> > > "the type you have is the default except for integers". Do you really > > want float64 to be the default for float96? > > > > Unless we are going to use long double as the default, then I'm not > > convinced that we should special-case the "double" type. > > > I guess I'm not really aware of the float96 type ... > Is that a "machine type" on any system ? I always thought that -- e.g. > coming from C -- double is "as good as it gets"... > Who uses float96 ? I heard somewhere that (some) CPUs use 80bits > internally when calculating 64bit double-precision... > > Is this not going into some academic argument !? > For all I know, calculating mean()s (and sum()s, ...) is always done in > double precision -- never in single precision, even when the data is in > float32. > > Having float32 be the default for float32 data is just requiring more > typing, and more explaining ... it would compromise numpy usability as > a day-to-day replacement for other systems. > > Sorry, if I'm being ignorant ... I'm going to side with Travis here. It is only a default and easily overridden. And yes, there are precisions greater than double. I was using quad precision back in the eighties on a VAX for some inherently ill conditioned problems. And on my machine long double is 12 bytes. Chuck |
From: Bill B. <wb...@gm...> - 2006-09-20 03:26:26
|
On 9/20/06, Charles R Harris <cha...@gm...> wrote: > > I guess I'm not really aware of the float96 type ... > > Is that a "machine type" on any system ? I always thought that -- e.g . > > coming from C -- double is "as good as it gets"... > > Who uses float96 ? I heard somewhere that (some) CPUs use 80bits > > internally when calculating 64bit double-precision... > > > I'm going to side with Travis here. It is only a default and easily > overridden. And yes, there are precisions greater than double. I was using > quad precision back in the eighties on a VAX for some inherently ill > conditioned problems. And on my machine long double is 12 bytes. > And on Intel chips the internal fp precision is 80bits. The D programming language even exposes this 80-bit floating point type to the user. http://www.digitalmars.com/d/type.html http://www.digitalmars.com/d/faq.html#real --bb |
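For anyone curious what "long double" means on their own machine, a quick (platform-dependent) check is:

    import numpy as np

    # On x86 Linux this is typically an 80-bit extended type padded to 12 or
    # 16 bytes (float96/float128); on some platforms it is just a 64-bit double.
    print(np.dtype(np.longdouble).itemsize)
    print(np.finfo(np.longdouble))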
From: Sebastian H. <ha...@ms...> - 2006-09-20 03:39:11
|
Charles R Harris wrote: > > > On 9/19/06, *Sebastian Haase* <ha...@ms... > <mailto:ha...@ms...>> wrote: > > Travis Oliphant wrote: > > Sebastian Haase wrote: > >> I still would argue that getting a "good" (smaller rounding > errors) answer > >> should be the default -- if speed is wanted, then *that* could > be still > >> specified by explicitly using dtype=float32 (which would also > be a possible > >> choice for int32 input) . > >> > > So you are arguing for using long double then.... ;-) > > > >> In image processing we always want means to be calculated in > float64 even > >> though input data is always float32 (if not uint16). > >> > >> Also it is simpler to say "float64 is the default" (full stop.) > - instead > >> > >> "float64 is the default unless you have float32" > >> > > "the type you have is the default except for integers". Do you > really > > want float64 to be the default for float96? > > > > Unless we are going to use long double as the default, then I'm not > > convinced that we should special-case the "double" type. > > > I guess I'm not really aware of the float96 type ... > Is that a "machine type" on any system ? I always thought that -- e.g . > coming from C -- double is "as good as it gets"... > Who uses float96 ? I heard somewhere that (some) CPUs use 80bits > internally when calculating 64bit double-precision... > > Is this not going into some academic argument !? > For all I know, calculating mean()s (and sum()s, ...) is always done in > double precision -- never in single precision, even when the data is in > float32. > > Having float32 be the default for float32 data is just requiring more > typing, and more explaining ... it would compromise numpy usability as > a day-to-day replacement for other systems. > > Sorry, if I'm being ignorant ... > > > I'm going to side with Travis here. It is only a default and easily > overridden. And yes, there are precisions greater than double. I was > using quad precision back in the eighties on a VAX for some inherently > ill conditioned problems. And on my machine long double is 12 bytes. > > Chuck > I just did a web search for "long double" http://www.google.com/search?q=%22long+double%22 and it does not look like there is much agreement on what that is - see also http://en.wikipedia.org/wiki/Long_double I really think that float96 *is* a special case - but computing mean()s and var()s in float32 would be "bad science". I hope I'm not alone in seeing numpy a great "interactive platform" to do evaluate data... I know that having too much knowledge of the details often makes one forget what the "newcomers" will do and expect. We are only talking about people that will a) work with single-precision data (e.g. large scale-image analysis) and who b) will tend to "just use the default" (dtype) --- How else can I say this: these people will just assume that arr.mean() *is* the mean of arr. -Sebastian |
From: Robert K. <rob...@gm...> - 2006-09-20 04:10:37
|
Sebastian Haase wrote: > I know that having too much knowledge of the details often makes one > forget what the "newcomers" will do and expect. Please be more careful with such accusations. Repeated frequently, they can become quite insulting. > We are only talking > about people that will a) work with single-precision data (e.g. large > scale-image analysis) and who b) will tend to "just use the default" > (dtype) --- How else can I say this: these people will just assume that > arr.mean() *is* the mean of arr. I don't understand what you mean, here. arr.mean() is almost never *the* mean of arr. Double precision can't fix that. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
From: Sebastian H. <ha...@ms...> - 2006-09-20 04:53:18
|
Robert Kern wrote:
> Sebastian Haase wrote:
>> I know that having too much knowledge of the details often makes one
>> forget what the "newcomers" will do and expect.
>
> Please be more careful with such accusations. Repeated frequently, they can
> become quite insulting.
>
I did not mean to insult anyone - what I meant was that I'm for numpy
becoming an easy platform to use. I have spent and enjoyed part of the last
four years developing and evangelizing Python as an alternative to Matlab-
and C/Fortran-based image analysis environments. I often find myself arguing
for good support of the single-precision data format. So I find it actually
somewhat ironic to see myself arguing now for wanting float64 over float32 ;-)

>> We are only talking
>> about people that will a) work with single-precision data (e.g. large
>> scale-image analysis) and who b) will tend to "just use the default"
>> (dtype) --- How else can I say this: these people will just assume that
>> arr.mean() *is* the mean of arr.
>
> I don't understand what you mean, here. arr.mean() is almost never *the* mean of
> arr. Double precision can't fix that.
>
This was not supposed to be a scientific statement -- I'm (again) thinking
of our students, who do not always appreciate the full complexity of
computational numerics and data types and such. The best I can hope for is a
"sound" default for most (practical) cases... I still think that 80-bit vs.
96-bit vs. 128-bit is rather academic for most people ... most people seem to
only use float64, and then there are some that use float32 (like us) ...

Cheers,
Sebastian |
From: Robert K. <rob...@gm...> - 2006-09-20 08:01:35
|
Sebastian Haase wrote:
> Robert Kern wrote:
>> Sebastian Haase wrote:
>>> I know that having too much knowledge of the details often makes one
>>> forget what the "newcomers" will do and expect.
>> Please be more careful with such accusations. Repeated frequently, they can
>> become quite insulting.
>>
> I did not mean to insult anyone - what I meant was that I'm for numpy
> becoming an easy platform to use. I have spent and enjoyed part of the last
> four years developing and evangelizing Python as an alternative to Matlab-
> and C/Fortran-based image analysis environments. I often find myself arguing
> for good support of the single-precision data format. So I find it actually
> somewhat ironic to see myself arguing now for wanting float64 over float32 ;-)

No one is doubting that you want numpy to be easy to use. Please don't assume
that the rest of us want otherwise. However, the fact that you *want* numpy to
be easy to use does not mean that your suggestions *will* make numpy easy to use.

We haven't forgotten what newcomers will do; to the contrary, we are quite aware
that new users need consistent behavior in order to learn how to use a system.
Adding another special case in how dtypes implicitly convert to one another will
impede new users being able to understand the whole system. See A. M.
Archibald's question in the thread "ufunc.reduce and conversion" for an example.
In our judgement this is a worse outcome than notational convenience for float32
users, who already need to be aware of the effects of their precision choice.
Each of us can come to different conclusions in good faith without one of us
forgetting the new user experience.

Let me offer a third path: the algorithms used for .mean() and .var() are
substandard. There are much better incremental algorithms that entirely avoid
the need to accumulate such large (and therefore precision-losing) intermediate
values. The algorithms look like the following for 1D arrays in Python:

def mean(a):
    m = a[0]
    for i in range(1, len(a)):
        m += (a[i] - m) / (i + 1)
    return m

def var(a):
    m = a[0]
    t = a.dtype.type(0)
    for i in range(1, len(a)):
        q = a[i] - m
        r = q / (i+1)
        m += r
        t += i * q * r
    t /= len(a)
    return t

Alternatively, from Knuth:

def var_knuth(a):
    m = a.dtype.type(0)
    variance = a.dtype.type(0)
    for i in range(len(a)):
        delta = a[i] - m
        m += delta / (i+1)
        variance += delta * (a[i] - m)
    variance /= len(a)
    return variance

If you will code up implementations of these for ndarray.mean() and
ndarray.var(), I will check them in and then float32 arrays will avoid most of
the catastrophes that the current implementations run into.

>>> We are only talking
>>> about people that will a) work with single-precision data (e.g. large
>>> scale-image analysis) and who b) will tend to "just use the default"
>>> (dtype) --- How else can I say this: these people will just assume that
>>> arr.mean() *is* the mean of arr.
>> I don't understand what you mean, here. arr.mean() is almost never *the* mean of
>> arr. Double precision can't fix that.
>>
> This was not supposed to be a scientific statement -- I'm (again) thinking
> of our students, who do not always appreciate the full complexity of
> computational numerics and data types and such.

They need to appreciate the complexity of computational numerics if they are
going to do numerical computation. Double precision does not make it any simpler.
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
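A rough sketch of how one might check such incremental updates against a float64 reference on float32 data (the test data here are only illustrative):

    import numpy as np

    def mean_incremental(a):
        # Robert's running-mean update, written for a 1-D array.
        m = a[0]
        for i in range(1, len(a)):
            m += (a[i] - m) / (i + 1)
        return m

    a = (1000 * np.random.rand(100_000)).astype(np.float32)
    ref = a.mean(dtype=np.float64)              # double-precision reference
    print(abs(a.mean(dtype=np.float32) - ref))  # library single-precision mean
    print(abs(mean_incremental(a) - ref))       # incremental running mean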
From: Sebastian H. <ha...@ms...> - 2006-09-20 15:50:33
|
Robert Kern wrote: >> This was not supposed to be a scientific statement -- I'm (again) >> thinking of our students that not always appreciate the full >> complexity >> of computational numerics and data types and such. > > They need to appreciate the complexity of computational numerics if > they are going to do numerical computation. Double precision does not > make it any simpler. This is were we differ. > We haven't forgotten what newcomers will do; to the contrary, we are > quite aware > that new users need consistent behavior in order to learn how to use a > system. > Adding another special case in how dtypes implicitly convert to one > another will > impede new users being able to understand the whole system. All I'm proposing could be summarized in: mean(), sum(), var() ... produce output of dtype float64 (except for input float96 which produces float96) A comment on this is also that for these operations the input type/precision is almost not related to the resulting output precision -- the int case makes that already clear. (This is different for e.g. min() or max() ) The proposed alternative implementations seem to have one or more multiplication (or division) for each value -- this might be noticeably slower ... Regards, Sebastian |
From: Tim H. <tim...@ie...> - 2006-09-20 16:08:38
|
Robert Kern wrote: > Sebastian Haase wrote: > >> Robert Kern wrote: >> >>> Sebastian Haase wrote: >>> >>>> I know that having too much knowledge of the details often makes one >>>> forget what the "newcomers" will do and expect. >>>> >>> Please be more careful with such accusations. Repeated frequently, they can >>> become quite insulting. >>> >>> >> I did not mean to insult anyone - what I meant was, that I'm for numpy >> becoming an easy platform to use. I have spend and enjoyed part the last >> four years developing and evangelizing Python as an alternative to >> Matlab and C/Fortran based image analysis environment. I often find >> myself arguing for good support of the single precision data format. So >> I find it actually somewhat ironic to see myself arguing now for wanting >> float64 over float32 ;-) >> > > No one is doubting that you want numpy to be easy to use. Please don't doubt > that the rest of us want otherwise. However, the fact that you *want* numpy to > be easy to use does not mean that your suggestions *will* make numpy easy to use. > > We haven't forgotten what newcomers will do; to the contrary, we are quite aware > that new users need consistent behavior in order to learn how to use a system. > Adding another special case in how dtypes implicitly convert to one another will > impede new users being able to understand the whole system. See A. M. > Archibald's question in the thread "ufunc.reduce and conversion" for an example. > In our judgement this is a worse outcome than notational convenience for float32 > users, who already need to be aware of the effects of their precision choice. > Each of us can come to different conclusions in good faith without one of us > forgetting the new user experience. > > Let me offer a third path: the algorithms used for .mean() and .var() are > substandard. There are much better incremental algorithms that entirely avoid > the need to accumulate such large (and therefore precision-losing) intermediate > values. The algorithms look like the following for 1D arrays in Python: > > def mean(a): > m = a[0] > for i in range(1, len(a)): > m += (a[i] - m) / (i + 1) > return m > > def var(a): > m = a[0] > t = a.dtype.type(0) > for i in range(1, len(a)): > q = a[i] - m > r = q / (i+1) > m += r > t += i * q * r > t /= len(a) > return t > > Alternatively, from Knuth: > > def var_knuth(a): > m = a.dtype.type(0) > variance = a.dtype.type(0) > for i in range(len(a)): > delta = a[i] - m > m += delta / (i+1) > variance += delta * (a[i] - m) > variance /= len(a) > return variance > > If you will code up implementations of these for ndarray.mean() and > ndarray.var(), I will check them in and then float32 arrays will avoid most of > the catastrophes that the current implementations run into. > +1 > >>>> We are only talking >>>> about people that will a) work with single-precision data (e.g. large >>>> scale-image analysis) and who b) will tend to "just use the default" >>>> (dtype) --- How else can I say this: these people will just assume that >>>> arr.mean() *is* the mean of arr. >>>> >>> I don't understand what you mean, here. arr.mean() is almost never *the* mean of >>> arr. Double precision can't fix that. >>> >>> >> This was not supposed to be a scientific statement -- I'm (again) >> thinking of our students that not always appreciate the full complexity >> of computational numerics and data types and such. >> > > They need to appreciate the complexity of computational numerics if they are > going to do numerical computation. 
Double precision does not make it any simpler. > > |
From: David M. C. <co...@ph...> - 2006-09-20 16:35:14
|
On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> Let me offer a third path: the algorithms used for .mean() and .var() are
> substandard. There are much better incremental algorithms that entirely avoid
> the need to accumulate such large (and therefore precision-losing) intermediate
> values. The algorithms look like the following for 1D arrays in Python:
>
> def mean(a):
>     m = a[0]
>     for i in range(1, len(a)):
>         m += (a[i] - m) / (i + 1)
>     return m

This isn't really going to be any better than using a simple sum.
It'll also be slower (a division per iteration). You do avoid
accumulating large sums, but then doing the division a[i]/len(a) and
adding that will do the same.

Now, if you want to avoid losing precision, you want to use a better
summation technique, like compensated (or Kahan) summation:

def mean(a):
    s = e = a.dtype.type(0)
    for i in range(0, len(a)):
        temp = s
        y = a[i] + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)

Some numerical experiments in Maple using 5-digit precision show that
your mean is maybe a bit better than sum(a)/len(a) in some cases, but can
also be much worse; both are quite poor in comparison to the Kahan
summation.

(We could probably use a fast implementation of Kahan summation in
addition to a.sum())

> def var(a):
>     m = a[0]
>     t = a.dtype.type(0)
>     for i in range(1, len(a)):
>         q = a[i] - m
>         r = q / (i+1)
>         m += r
>         t += i * q * r
>     t /= len(a)
>     return t
>
> Alternatively, from Knuth:
>
> def var_knuth(a):
>     m = a.dtype.type(0)
>     variance = a.dtype.type(0)
>     for i in range(len(a)):
>         delta = a[i] - m
>         m += delta / (i+1)
>         variance += delta * (a[i] - m)
>     variance /= len(a)
>     return variance

These formulas are good when you can only do one pass over the data
(like in a calculator where you don't store all the data points), but
are slightly worse than doing two passes. Kahan summation would probably
be good here too.

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/
|co...@ph... |
From: Charles R H. <cha...@gm...> - 2006-09-20 16:59:10
|
On 9/20/06, David M. Cooke <co...@ph...> wrote: > > On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: > > Let me offer a third path: the algorithms used for .mean() and .var() > are > > substandard. There are much better incremental algorithms that entirely > avoid > > the need to accumulate such large (and therefore precision-losing) > intermediate > > values. The algorithms look like the following for 1D arrays in Python: > > > > def mean(a): > > m = a[0] > > for i in range(1, len(a)): > > m += (a[i] - m) / (i + 1) > > return m > > This isn't really going to be any better than using a simple sum. > It'll also be slower (a division per iteration). You do avoid > accumulating large sums, but then doing the division a[i]/len(a) and > adding that will do the same. > > Now, if you want to avoid losing precision, you want to use a better > summation technique, like compensated (or Kahan) summation: > > def mean(a): > s = e = a.dtype.type(0) > for i in range(0, len(a)): > temp = s > y = a[i] + e > s = temp + y > e = (temp - s) + y > return s / len(a) A variant of this can also be used to generate the sin/cos for fourier transforms using repeated complex multiplication. It works amazingly well. But I suspect accumulating in higher precision would be just as good and faster if the hardware supports it. Divide and conquer, like an fft where only the constant coefficient is computed, is also a good bet but requires lg(n) passes over the data and extra storage. Some numerical experiments in Maple using 5-digit precision show that > your mean is maybe a bit better in some cases, but can also be much > worse, than sum(a)/len(a), but both are quite poor in comparision to the > Kahan summation. > > (We could probably use a fast implementation of Kahan summation in > addition to a.sum()) > > > def var(a): > > m = a[0] > > t = a.dtype.type(0) > > for i in range(1, len(a)): > > q = a[i] - m > > r = q / (i+1) > > m += r > > t += i * q * r > > t /= len(a) > > return t > > > > Alternatively, from Knuth: > > > > def var_knuth(a): > > m = a.dtype.type(0) > > variance = a.dtype.type(0) > > for i in range(len(a)): > > delta = a[i] - m > > m += delta / (i+1) > > variance += delta * (a[i] - m) > > variance /= len(a) > > return variance > > These formulas are good when you can only do one pass over the data > (like in a calculator where you don't store all the data points), but > are slightly worse than doing two passes. Kahan summation would probably > also be good here too. I think we should leave things as they are. Choosing the right precision for accumulating sums is absolutely fundamental, I always think about it when programming such loops because it is always a potential gotcha. Making the defaults higher precision just kicks the can down the road where the unwary are still likely to trip over it at some point. Perhaps these special tricks for special cases could be included in scipy somewhere. Chuck |
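A minimal recursive sketch of the divide-and-conquer summation Chuck alludes to (block size and test data are arbitrary):

    import numpy as np

    def pairwise_sum(a):
        # Split the array in half, sum each half, then add the partial sums.
        # Rounding error grows roughly like log2(n) rather than n.
        if len(a) <= 8:
            s = a.dtype.type(0)
            for x in a:
                s += x
            return s
        mid = len(a) // 2
        return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

    a = np.random.rand(2**16).astype(np.float32)
    print(abs(pairwise_sum(a) / len(a) - a.mean(dtype=np.float64)))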
From: Robert K. <rob...@gm...> - 2006-09-20 17:23:49
|
David M. Cooke wrote: > On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >> Let me offer a third path: the algorithms used for .mean() and .var() are >> substandard. There are much better incremental algorithms that entirely avoid >> the need to accumulate such large (and therefore precision-losing) intermediate >> values. The algorithms look like the following for 1D arrays in Python: >> >> def mean(a): >> m = a[0] >> for i in range(1, len(a)): >> m += (a[i] - m) / (i + 1) >> return m > > This isn't really going to be any better than using a simple sum. > It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). > You do avoid > accumulating large sums, but then doing the division a[i]/len(a) and > adding that will do the same. Okay, this is true. > Now, if you want to avoid losing precision, you want to use a better > summation technique, like compensated (or Kahan) summation: > > def mean(a): > s = e = a.dtype.type(0) > for i in range(0, len(a)): > temp = s > y = a[i] + e > s = temp + y > e = (temp - s) + y > return s / len(a) > > Some numerical experiments in Maple using 5-digit precision show that > your mean is maybe a bit better in some cases, but can also be much > worse, than sum(a)/len(a), but both are quite poor in comparision to the > Kahan summation. > > (We could probably use a fast implementation of Kahan summation in > addition to a.sum()) +1 >> def var(a): >> m = a[0] >> t = a.dtype.type(0) >> for i in range(1, len(a)): >> q = a[i] - m >> r = q / (i+1) >> m += r >> t += i * q * r >> t /= len(a) >> return t >> >> Alternatively, from Knuth: >> >> def var_knuth(a): >> m = a.dtype.type(0) >> variance = a.dtype.type(0) >> for i in range(len(a)): >> delta = a[i] - m >> m += delta / (i+1) >> variance += delta * (a[i] - m) >> variance /= len(a) >> return variance > > These formulas are good when you can only do one pass over the data > (like in a calculator where you don't store all the data points), but > are slightly worse than doing two passes. Kahan summation would probably > also be good here too. Again, my tests show otherwise for float32. I'll condense my ipython log into a module for everyone's perusal. It's possible that the Kahan summation of the squared residuals will work better than the current two-pass algorithm and the implementations I give above. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
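A sketch of the variant Robert floats at the end - the usual two-pass variance, but with compensated (Kahan) accumulation of the squared residuals (this is only one way to write it, not anyone's checked-in implementation):

    import numpy as np

    def var_twopass_kahan(a):
        m = a.mean(dtype=a.dtype)      # first pass: mean in working precision
        s = c = a.dtype.type(0)
        for x in a:                    # second pass: compensated sum of residuals
            y = (x - m) * (x - m) - c
            t = s + y
            c = (t - s) - y
            s = t
        return s / a.dtype.type(len(a))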
From: Tim H. <tim...@ie...> - 2006-09-20 18:54:04
|
Robert Kern wrote:
> David M. Cooke wrote:
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>> Let me offer a third path: the algorithms used for .mean() and .var() are
>>> substandard. There are much better incremental algorithms that entirely avoid
>>> the need to accumulate such large (and therefore precision-losing) intermediate
>>> values. The algorithms look like the following for 1D arrays in Python:
>>>
>>> def mean(a):
>>>     m = a[0]
>>>     for i in range(1, len(a)):
>>>         m += (a[i] - m) / (i + 1)
>>>     return m
>>>
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>>
> With one exception, every test that I've thrown at it shows that it's better for
> float32. That exception is uniformly spaced arrays, like linspace().
>
Here's a version of mean based on the Kahan sum, including the compensation
term, that seems to be consistently much better than the builtin mean. It
seems to be typically 4 orders of magnitude closer to the "correct" answer
than the builtin mean and 2 orders of magnitude better than just naively
using the Kahan sum. I'm using the mean performed with dtype=float64 as the
"correct" value.

def rawKahanSum(values):  # Stolen from mathworld
    if not len(values):
        return 0.0, 0.0
    total = values[0]
    c = type(total)(0.0)     # A running compensation for lost low-order bits.
    for x in values[1:]:
        y = x - c            # So far, so good: c is zero.
        t = total + y        # Alas, total is big, y small, so low-order digits of y are lost.
        c = (t - total) - y  # (t - total) recovers the high-order part of y;
                             # subtracting y recovers -(low part of y).
        total = t            # Algebraically, c should always be zero.
                             # Beware eagerly optimising compilers!
        # Next time around, the lost low part will be added to y in a fresh attempt.
    return total, c

def kahanSum(values):
    total, c = rawKahanSum(values)
    return total

def mean(values):
    total, c = rawKahanSum(values)
    n = float(len(values))
    return total / n - c / n

for i in range(100):
    data = random.random([10000]).astype(float32)
    baseline = data.mean(dtype=float64)
    delta_builtin_mean = baseline - data.mean()
    delta_compensated_mean = baseline - mean(data)
    delta_kahan_mean = baseline - (kahanSum(data) / len(data))
    if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean):
        print ">>>",
    print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean

-tim

> > You do avoid
> > accumulating large sums, but then doing the division a[i]/len(a) and
> > adding that will do the same.
>
> Okay, this is true.
>
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>>     s = e = a.dtype.type(0)
>>     for i in range(0, len(a)):
>>         temp = s
>>         y = a[i] + e
>>         s = temp + y
>>         e = (temp - s) + y
>>     return s / len(a)
>>
>> Some numerical experiments in Maple using 5-digit precision show that
>> your mean is maybe a bit better in some cases, but can also be much
>> worse, than sum(a)/len(a), but both are quite poor in comparison to the
>> Kahan summation.
>> >> (We could probably use a fast implementation of Kahan summation in >> addition to a.sum()) >> > > +1 > > >>> def var(a): >>> m = a[0] >>> t = a.dtype.type(0) >>> for i in range(1, len(a)): >>> q = a[i] - m >>> r = q / (i+1) >>> m += r >>> t += i * q * r >>> t /= len(a) >>> return t >>> >>> Alternatively, from Knuth: >>> >>> def var_knuth(a): >>> m = a.dtype.type(0) >>> variance = a.dtype.type(0) >>> for i in range(len(a)): >>> delta = a[i] - m >>> m += delta / (i+1) >>> variance += delta * (a[i] - m) >>> variance /= len(a) >>> return variance >>> >> These formulas are good when you can only do one pass over the data >> (like in a calculator where you don't store all the data points), but >> are slightly worse than doing two passes. Kahan summation would probably >> also be good here too. >> > > Again, my tests show otherwise for float32. I'll condense my ipython log into a > module for everyone's perusal. It's possible that the Kahan summation of the > squared residuals will work better than the current two-pass algorithm and the > implementations I give above. > > |
From: Tim H. <tim...@ie...> - 2006-09-20 19:03:03
|
[Sorry, this version should have less munged formatting since I clipped the comments. Oh, and the Kahan sum algorithm was grabbed from wikipedia, not mathworld] Tim Hochberg wrote: > Robert Kern wrote: > >> David M. Cooke wrote: >> >> >>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >>> >>> >>>> Let me offer a third path: the algorithms used for .mean() and .var() are >>>> substandard. There are much better incremental algorithms that entirely avoid >>>> the need to accumulate such large (and therefore precision-losing) intermediate >>>> values. The algorithms look like the following for 1D arrays in Python: >>>> >>>> def mean(a): >>>> m = a[0] >>>> for i in range(1, len(a)): >>>> m += (a[i] - m) / (i + 1) >>>> return m >>>> >>>> >>> This isn't really going to be any better than using a simple sum. >>> It'll also be slower (a division per iteration). >>> >>> >> With one exception, every test that I've thrown at it shows that it's better for >> float32. That exception is uniformly spaced arrays, like linspace(). >> >> > Here's version of mean based on the Kahan sum, including the > compensation term that seems to be consistently much better than builtin > mean It seems to be typically 4 orders or magnitude closer to the > "correct" answer than the builtin mean and 2 orders of magnitude better > than just naively using the Kahan sum. I'm using the sum performed with > dtype=int64 as the "correct" value. > > > def rawKahanSum(values): if not input: return 0.0 total = values[0] c = type(total)(0.0) for x in values[1:]: y = x - c t = total + y c = (t - total) - y total = t return total, c def kahanSum(values): total, c = rawKahanSum(values) return total def mean(values): total, c = rawKahanSum(values) n = float(len(values)) return total / n - c / n for i in range(100): data = random.random([10000]).astype(float32) baseline = data.mean(dtype=float64) delta_builtin_mean = baseline - data.mean() delta_compensated_mean = baseline - mean(data) delta_kahan_mean = baseline - (kahanSum(data) / len(data)) if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean): print ">>>", print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean > -tim > >> > You do avoid >> > accumulating large sums, but then doing the division a[i]/len(a) and >> > adding that will do the same. >> >> Okay, this is true. >> >> >> >>> Now, if you want to avoid losing precision, you want to use a better >>> summation technique, like compensated (or Kahan) summation: >>> >>> def mean(a): >>> s = e = a.dtype.type(0) >>> for i in range(0, len(a)): >>> temp = s >>> y = a[i] + e >>> s = temp + y >>> e = (temp - s) + y >>> return s / len(a) >>> >>> Some numerical experiments in Maple using 5-digit precision show that >>> your mean is maybe a bit better in some cases, but can also be much >>> worse, than sum(a)/len(a), but both are quite poor in comparision to the >>> Kahan summation. 
>>> >>> (We could probably use a fast implementation of Kahan summation in >>> addition to a.sum()) >>> >>> >> +1 >> >> >> >>>> def var(a): >>>> m = a[0] >>>> t = a.dtype.type(0) >>>> for i in range(1, len(a)): >>>> q = a[i] - m >>>> r = q / (i+1) >>>> m += r >>>> t += i * q * r >>>> t /= len(a) >>>> return t >>>> >>>> Alternatively, from Knuth: >>>> >>>> def var_knuth(a): >>>> m = a.dtype.type(0) >>>> variance = a.dtype.type(0) >>>> for i in range(len(a)): >>>> delta = a[i] - m >>>> m += delta / (i+1) >>>> variance += delta * (a[i] - m) >>>> variance /= len(a) >>>> return variance >>>> >>>> >>> These formulas are good when you can only do one pass over the data >>> (like in a calculator where you don't store all the data points), but >>> are slightly worse than doing two passes. Kahan summation would probably >>> also be good here too. >>> >>> >> Again, my tests show otherwise for float32. I'll condense my ipython log into a >> module for everyone's perusal. It's possible that the Kahan summation of the >> squared residuals will work better than the current two-pass algorithm and the >> implementations I give above. >> >> >> > > > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys -- and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > _______________________________________________ > Numpy-discussion mailing list > Num...@li... > https://lists.sourceforge.net/lists/listinfo/numpy-discussion > > > |
From: Tim H. <tim...@ie...> - 2006-09-20 19:41:43
|
Robert Kern wrote: > David M. Cooke wrote: > >> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >> >>> Let me offer a third path: the algorithms used for .mean() and .var() are >>> substandard. There are much better incremental algorithms that entirely avoid >>> the need to accumulate such large (and therefore precision-losing) intermediate >>> values. The algorithms look like the following for 1D arrays in Python: >>> >>> def mean(a): >>> m = a[0] >>> for i in range(1, len(a)): >>> m += (a[i] - m) / (i + 1) >>> return m >>> >> This isn't really going to be any better than using a simple sum. >> It'll also be slower (a division per iteration). >> > > With one exception, every test that I've thrown at it shows that it's better for > float32. That exception is uniformly spaced arrays, like linspace(). > > > You do avoid > > accumulating large sums, but then doing the division a[i]/len(a) and > > adding that will do the same. > > Okay, this is true. > > >> Now, if you want to avoid losing precision, you want to use a better >> summation technique, like compensated (or Kahan) summation: >> >> def mean(a): >> s = e = a.dtype.type(0) >> for i in range(0, len(a)): >> temp = s >> y = a[i] + e >> s = temp + y >> e = (temp - s) + y >> return s / len(a) >> >> Some numerical experiments in Maple using 5-digit precision show that >> your mean is maybe a bit better in some cases, but can also be much >> worse, than sum(a)/len(a), but both are quite poor in comparision to the >> Kahan summation. >> >> (We could probably use a fast implementation of Kahan summation in >> addition to a.sum()) >> > > +1 > > >>> def var(a): >>> m = a[0] >>> t = a.dtype.type(0) >>> for i in range(1, len(a)): >>> q = a[i] - m >>> r = q / (i+1) >>> m += r >>> t += i * q * r >>> t /= len(a) >>> return t >>> >>> Alternatively, from Knuth: >>> >>> def var_knuth(a): >>> m = a.dtype.type(0) >>> variance = a.dtype.type(0) >>> for i in range(len(a)): >>> delta = a[i] - m >>> m += delta / (i+1) >>> variance += delta * (a[i] - m) >>> variance /= len(a) >>> return variance >>> >> These formulas are good when you can only do one pass over the data >> (like in a calculator where you don't store all the data points), but >> are slightly worse than doing two passes. Kahan summation would probably >> also be good here too. >> On the flip side, it doesn't take a very large array (somewhere in the vicinity of 10,000 values in my experience) before memory bandwidth becomes a limiting factor. In that region a two pass algorithm could well be twice as slow as a single pass algorithm even if the former is somewhat simpler in terms of numeric operations. -tim > > Again, my tests show otherwise for float32. I'll condense my ipython log into a > module for everyone's perusal. It's possible that the Kahan summation of the > squared residuals will work better than the current two-pass algorithm and the > implementations I give above. > |
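A crude way to see the one-pass vs. two-pass cost Tim mentions (numbers are machine-dependent; this only sketches how one might measure it):

    import numpy as np
    import timeit

    a = np.random.rand(10**7)   # much larger than cache
    one_pass = timeit.timeit(lambda: a.sum(), number=20)
    two_passes = timeit.timeit(lambda: (a.sum(), a.sum()), number=20)
    # Once the array no longer fits in cache, two full reads of the data
    # cost roughly twice as much as one.
    print(one_pass, two_passes)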
From: Tim H. <tim...@ie...> - 2006-09-21 17:57:05
|
Robert Kern wrote: > David M. Cooke wrote: > >> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >> >>> Let me offer a third path: the algorithms used for .mean() and .var() are >>> substandard. There are much better incremental algorithms that entirely avoid >>> the need to accumulate such large (and therefore precision-losing) intermediate >>> values. The algorithms look like the following for 1D arrays in Python: >>> >>> def mean(a): >>> m = a[0] >>> for i in range(1, len(a)): >>> m += (a[i] - m) / (i + 1) >>> return m >>> >> This isn't really going to be any better than using a simple sum. >> It'll also be slower (a division per iteration). >> > > With one exception, every test that I've thrown at it shows that it's better for > float32. That exception is uniformly spaced arrays, like linspace(). > > > You do avoid > > accumulating large sums, but then doing the division a[i]/len(a) and > > adding that will do the same. > > Okay, this is true. > > >> Now, if you want to avoid losing precision, you want to use a better >> summation technique, like compensated (or Kahan) summation: >> >> def mean(a): >> s = e = a.dtype.type(0) >> for i in range(0, len(a)): >> temp = s >> y = a[i] + e >> s = temp + y >> e = (temp - s) + y >> return s / len(a) >> >> Some numerical experiments in Maple using 5-digit precision show that >> your mean is maybe a bit better in some cases, but can also be much >> worse, than sum(a)/len(a), but both are quite poor in comparision to the >> Kahan summation. >> >> (We could probably use a fast implementation of Kahan summation in >> addition to a.sum()) >> > > +1 > > >>> def var(a): >>> m = a[0] >>> t = a.dtype.type(0) >>> for i in range(1, len(a)): >>> q = a[i] - m >>> r = q / (i+1) >>> m += r >>> t += i * q * r >>> t /= len(a) >>> return t >>> >>> Alternatively, from Knuth: >>> >>> def var_knuth(a): >>> m = a.dtype.type(0) >>> variance = a.dtype.type(0) >>> for i in range(len(a)): >>> delta = a[i] - m >>> m += delta / (i+1) >>> variance += delta * (a[i] - m) >>> variance /= len(a) >>> return variance >>> >> These formulas are good when you can only do one pass over the data >> (like in a calculator where you don't store all the data points), but >> are slightly worse than doing two passes. Kahan summation would probably >> also be good here too. >> > > Again, my tests show otherwise for float32. I'll condense my ipython log into a > module for everyone's perusal. It's possible that the Kahan summation of the > squared residuals will work better than the current two-pass algorithm and the > implementations I give above. > This is what my tests show as well var_knuth outperformed any simple two pass algorithm I could come up with, even ones using Kahan sums. Interestingly, for 1D arrays the built in float32 variance performs better than it should. After a bit of twiddling around I discovered that it actually does most of it's calculations in float64. It uses a two pass calculation, the result of mean is a scalar, and in the process of converting that back to an array we end up with float64 values. Or something like that; I was mostly reverse engineering the sequence of events from the results. -tim |
From: Tim H. <tim...@ie...> - 2006-09-21 18:35:06
Attachments:
numpy_statistics.py
|
Tim Hochberg wrote: > Robert Kern wrote: > >> David M. Cooke wrote: >> >> >>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >>> >>> >>>> Let me offer a third path: the algorithms used for .mean() and .var() are >>>> substandard. There are much better incremental algorithms that entirely avoid >>>> the need to accumulate such large (and therefore precision-losing) intermediate >>>> values. The algorithms look like the following for 1D arrays in Python: >>>> >>>> def mean(a): >>>> m = a[0] >>>> for i in range(1, len(a)): >>>> m += (a[i] - m) / (i + 1) >>>> return m >>>> >>>> >>> This isn't really going to be any better than using a simple sum. >>> It'll also be slower (a division per iteration). >>> >>> >> With one exception, every test that I've thrown at it shows that it's better for >> float32. That exception is uniformly spaced arrays, like linspace(). >> >> > You do avoid >> > accumulating large sums, but then doing the division a[i]/len(a) and >> > adding that will do the same. >> >> Okay, this is true. >> >> >> >>> Now, if you want to avoid losing precision, you want to use a better >>> summation technique, like compensated (or Kahan) summation: >>> >>> def mean(a): >>> s = e = a.dtype.type(0) >>> for i in range(0, len(a)): >>> temp = s >>> y = a[i] + e >>> s = temp + y >>> e = (temp - s) + y >>> return s / len(a) >>> >>> Some numerical experiments in Maple using 5-digit precision show that >>> your mean is maybe a bit better in some cases, but can also be much >>> worse, than sum(a)/len(a), but both are quite poor in comparision to the >>> Kahan summation. >>> >>> (We could probably use a fast implementation of Kahan summation in >>> addition to a.sum()) >>> >>> >> +1 >> >> >> >>>> def var(a): >>>> m = a[0] >>>> t = a.dtype.type(0) >>>> for i in range(1, len(a)): >>>> q = a[i] - m >>>> r = q / (i+1) >>>> m += r >>>> t += i * q * r >>>> t /= len(a) >>>> return t >>>> >>>> Alternatively, from Knuth: >>>> >>>> def var_knuth(a): >>>> m = a.dtype.type(0) >>>> variance = a.dtype.type(0) >>>> for i in range(len(a)): >>>> delta = a[i] - m >>>> m += delta / (i+1) >>>> variance += delta * (a[i] - m) >>>> variance /= len(a) >>>> return variance >>>> >>>> >>> These formulas are good when you can only do one pass over the data >>> (like in a calculator where you don't store all the data points), but >>> are slightly worse than doing two passes. Kahan summation would probably >>> also be good here too. >>> >>> >> Again, my tests show otherwise for float32. I'll condense my ipython log into a >> module for everyone's perusal. It's possible that the Kahan summation of the >> squared residuals will work better than the current two-pass algorithm and the >> implementations I give above. >> >> > This is what my tests show as well var_knuth outperformed any simple two > pass algorithm I could come up with, even ones using Kahan sums. > Interestingly, for 1D arrays the built in float32 variance performs > better than it should. After a bit of twiddling around I discovered that > it actually does most of it's calculations in float64. It uses a two > pass calculation, the result of mean is a scalar, and in the process of > converting that back to an array we end up with float64 values. Or > something like that; I was mostly reverse engineering the sequence of > events from the results. > Here's a simple of example of how var is a little wacky. A shape-[N] array will give you a different result than a shape-[1,N] array. 
The reason is clear -- in the second case the mean is not a scalar, so there
isn't the inadvertent promotion to float64, but it's still odd.

>>> data = (1000*(random.random([10000]) - 0.1)).astype(float32)
>>> print data.var() - data.reshape([1, -1]).var(-1)
[ 0.1171875]

I'm going to go ahead and attach a module containing the versions of mean,
var, etc. that I've been playing with, in case someone wants to mess with
them. Some were stolen from traffic on this list; for others I grabbed the
algorithms from wikipedia or equivalent.

-tim

> -tim
|