From: Sebastian H. <ha...@ms...> - 2006-09-19 22:41:24
|
Hello all,
I just had someone from my lab come to my desk saying: "My god - SciPy is really stupid ... An array with only positive numbers claims to have a negative mean!!"
I was asking about this before ... the reason was of course that her array was of dtype int32 and had enough large values to cause an overflow (wrap-around).
Now that the default for axis is None (for all functions having an axis argument), can we please change dtype to default to float64!?
It is really a very confusing and shocking result to get a negative mean on all-positive values. It has been stated here before that numpy should target the "scientist" rather than the programmer ... I would argue that mean() almost always requires the precision of "double" (float64) to produce usable results.
Please consider this change before the 1.0 release ...
Thanks,
Sebastian Haase |
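For reference, a minimal sketch of the failure mode Sebastian describes (assuming a current NumPy, where the int32 accumulator has to be requested explicitly to reproduce it):

    import numpy as np

    # 50000 * 65536 = 3,276,800,000 exceeds 2**31 - 1, so an int32
    # accumulator wraps around to a negative total.
    a = np.full(50000, 2**16, dtype=np.int32)   # all positive values
    total = a.sum(dtype=np.int32)               # wraps to -1018167296
    print(total, total / len(a))                # negative sum, negative "mean"
    print(a.mean())                             # float64 accumulation: 65536.0

The array itself is all positive; only the accumulator overflows.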
From: Travis O. <oli...@ee...> - 2006-09-19 22:48:06
|
Sebastian Haase wrote: >Hello all, >I just had someone from my lab coming to my desk saying: >"My god - SciPy is really stupid .... >An array with only positive numbers claims to have a negative mean !! "? > > > >I was asking about this before ... the reason was of course that her array was >of dtype int32 and had many large values to cause an overflow (wrap >around) . > >Now that the default for axis is None (for all functions having an axis >argument), >can we please change dtype to default to float64 !? > > The default is float64 now (as long as you are not using numpy.oldnumeric). I suppose more appropriately, we could reduce over float for integer data-types when calculating the mean as well (since a floating point is returned anyway). -Travis |
From: Charles R H. <cha...@gm...> - 2006-09-20 00:37:04
|
On 9/19/06, Travis Oliphant <oli...@ee...> wrote:
>
> Sebastian Haase wrote:
> >On Tuesday 19 September 2006 15:48, Travis Oliphant wrote:
> >>Sebastian Haase wrote:
> ><snip>
> >>>can we please change dtype to default to float64 !?
> >>
> >>The default is float64 now (as long as you are not using
> >>numpy.oldnumeric).
> >>
> >>I suppose more appropriately, we could reduce over float for integer
> >>data-types when calculating the mean as well (since a floating point is
> >>returned anyway).
> >
> >Is now mean() always "reducing over" float64 ?
> >The svn note """Log:
> >Fix mean, std, and var methods so that they reduce over double data-type with
> >integer inputs.
> >"""
> >makes it sound that a float32 input stays float32 ?
>
> Yes, that is true. Only integer inputs are changed because you are
> going to get a floating point output anyway.
>
> >For mean calculation this might introduce large errors - I usually would
> >require double-precision for *any* input type ...
>
> Of course. The system is not fool-proof. I hesitate to arbitrarily
> change this. The advantage of using single-precision calculation is
> that it is faster. We do rely on the user who expressly requests these
> things to be aware of the difficulties.

Speed depends on the architecture. Float is a trifle slower than double
on my Athlon64, but faster on PPC750. I don't know about other machines.
I think there is a good argument for accumulating in double and
converting to float for output if required.

Chuck |
From: Travis O. <oli...@ie...> - 2006-09-20 02:30:52
|
Charles R Harris wrote: > > Speed depends on the achitecture. Float is a trifle slower than double > on my Athlon64, but faster on PPC750. I don't know about other > machines. I think there is a good argument for accumlating in double > and converting to float for output if required. Yes there is. It's just not what NumPy ever does so it would be an exception in this case and would need a more convincing argument in my opinion. You can always specify the accumulation type yourself with the dtype argument. We are only talking about what the default should be. -Travis |
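As a minimal illustration of the dtype argument Travis refers to (the array contents here are only an example):

    import numpy as np

    a = np.random.rand(10**6).astype(np.float32)
    print(a.mean())                  # reduces in the array's own precision
    print(a.mean(dtype=np.float64))  # request a float64 accumulator instead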
From: Sebastian H. <ha...@ms...> - 2006-09-20 00:04:03
|
On Tuesday 19 September 2006 15:48, Travis Oliphant wrote:
> Sebastian Haase wrote:
<snip>
> >can we please change dtype to default to float64 !?
>
> The default is float64 now (as long as you are not using
> numpy.oldnumeric).
>
> I suppose more appropriately, we could reduce over float for integer
> data-types when calculating the mean as well (since a floating point is
> returned anyway).
>
Is mean() now always "reducing over" float64 ?
The svn note
"""Log:
Fix mean, std, and var methods so that they reduce over double data-type with
integer inputs.
"""
makes it sound that a float32 input stays float32 ?
For mean calculation this might introduce large errors - I usually would
require double-precision for *any* input type ...
(I don't know how to say this for complex types !? Are real and imag
treated separately / independently here ?)
Thanks,
Sebastian Haase |
From: Travis O. <oli...@ee...> - 2006-09-20 00:17:51
|
Sebastian Haase wrote: >On Tuesday 19 September 2006 15:48, Travis Oliphant wrote: > > >>Sebastian Haase wrote: >> >> ><snip> > > >>>can we please change dtype to default to float64 !? >>> >>> >>The default is float64 now (as long as you are not using >>numpy.oldnumeric). >> >>I suppose more appropriately, we could reduce over float for integer >>data-types when calculating the mean as well (since a floating point is >>returned anyway). >> >> >> > >Is now mean() always "reducing over" float64 ? >The svn note """Log: >Fix mean, std, and var methods so that they reduce over double data-type with >integer inputs. >""" >makes it sound that a float32 input is stays float32 ? > > Yes, that is true. Only integer inputs are changed because you are going to get a floating point output anyway. >For mean calculation this might introduce large errors - I usually would >require double-precision for *any* input type ... > > Of course. The system is not fool-proof. I hesitate to arbitrarily change this. The advantage of using single-precision calculation is that it is faster. We do rely on the user who expressly requests these things to be aware of the difficulties. >(don't know how to say this for complex types !? Are here real and imag >treated separately / independently ?) > > There is a complex add performed at a low-level as two separate adds. The addition is performed in the precision requested. -Travis |
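A small sketch of why single-precision accumulation worries people here: once a float32 running sum reaches 2**24, adding values of order one no longer changes it at all.

    import numpy as np

    big = np.float32(2**24)        # 16777216.0
    print(big + np.float32(1.0))   # still 16777216.0 - the increment is lost
    # A naive float32 running sum of ~1.0-sized values therefore stops
    # growing past this point, and the resulting mean drifts low.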
From: Sebastian H. <ha...@ms...> - 2006-09-20 00:51:50
|
On Tuesday 19 September 2006 17:17, Travis Oliphant wrote:
> Sebastian Haase wrote:
> >On Tuesday 19 September 2006 15:48, Travis Oliphant wrote:
> >>Sebastian Haase wrote:
> ><snip>
> >>>can we please change dtype to default to float64 !?
> >>
> >>The default is float64 now (as long as you are not using
> >>numpy.oldnumeric).
> >>
> >>I suppose more appropriately, we could reduce over float for integer
> >>data-types when calculating the mean as well (since a floating point is
> >>returned anyway).
> >
> >Is now mean() always "reducing over" float64 ?
> >The svn note """Log:
> >Fix mean, std, and var methods so that they reduce over double data-type
> >with integer inputs.
> >"""
> >makes it sound that a float32 input stays float32 ?
>
> Yes, that is true. Only integer inputs are changed because you are
> going to get a floating point output anyway.
>
> >For mean calculation this might introduce large errors - I usually would
> >require double-precision for *any* input type ...
>
> Of course. The system is not fool-proof. I hesitate to arbitrarily
> change this. The advantage of using single-precision calculation is
> that it is faster. We do rely on the user who expressly requests these
> things to be aware of the difficulties.

I still would argue that getting a "good" (smaller rounding errors) answer
should be the default -- if speed is wanted, then *that* could still be
specified by explicitly using dtype=float32 (which would also be a possible
choice for int32 input).

In image processing we always want means to be calculated in float64 even
though the input data is always float32 (if not uint16).

Also it is simpler to say "float64 is the default" (full stop) - instead of
"float64 is the default unless you have float32".

Thanks,
Sebastian Haase |
From: Travis O. <oli...@ie...> - 2006-09-20 02:33:35
|
Sebastian Haase wrote: > I still would argue that getting a "good" (smaller rounding errors) answer > should be the default -- if speed is wanted, then *that* could be still > specified by explicitly using dtype=float32 (which would also be a possible > choice for int32 input) . > So you are arguing for using long double then.... ;-) > In image processing we always want means to be calculated in float64 even > though input data is always float32 (if not uint16). > > Also it is simpler to say "float64 is the default" (full stop.) - instead > > "float64 is the default unless you have float32" > "the type you have is the default except for integers". Do you really want float64 to be the default for float96? Unless we are going to use long double as the default, then I'm not convinced that we should special-case the "double" type. -Travis |
From: Sebastian H. <ha...@ms...> - 2006-09-20 03:07:41
|
Travis Oliphant wrote: > Sebastian Haase wrote: >> I still would argue that getting a "good" (smaller rounding errors) answer >> should be the default -- if speed is wanted, then *that* could be still >> specified by explicitly using dtype=float32 (which would also be a possible >> choice for int32 input) . >> > So you are arguing for using long double then.... ;-) > >> In image processing we always want means to be calculated in float64 even >> though input data is always float32 (if not uint16). >> >> Also it is simpler to say "float64 is the default" (full stop.) - instead >> >> "float64 is the default unless you have float32" >> > "the type you have is the default except for integers". Do you really > want float64 to be the default for float96? > > Unless we are going to use long double as the default, then I'm not > convinced that we should special-case the "double" type. > I guess I'm not really aware of the float96 type ... Is that a "machine type" on any system ? I always thought that -- e.g. coming from C -- double is "as good as it gets"... Who uses float96 ? I heard somewhere that (some) CPUs use 80bits internally when calculating 64bit double-precision... Is this not going into some academic argument !? For all I know, calculating mean()s (and sum()s, ...) is always done in double precision -- never in single precision, even when the data is in float32. Having float32 be the default for float32 data is just requiring more typing, and more explaining ... it would compromise numpy usability as a day-to-day replacement for other systems. Sorry, if I'm being ignorant ... - Sebastian |
From: Charles R H. <cha...@gm...> - 2006-09-20 03:15:33
|
On 9/19/06, Sebastian Haase <ha...@ms...> wrote: > > Travis Oliphant wrote: > > Sebastian Haase wrote: > >> I still would argue that getting a "good" (smaller rounding errors) > answer > >> should be the default -- if speed is wanted, then *that* could be still > >> specified by explicitly using dtype=float32 (which would also be a > possible > >> choice for int32 input) . > >> > > So you are arguing for using long double then.... ;-) > > > >> In image processing we always want means to be calculated in float64 > even > >> though input data is always float32 (if not uint16). > >> > >> Also it is simpler to say "float64 is the default" (full stop.) - > instead > >> > >> "float64 is the default unless you have float32" > >> > > "the type you have is the default except for integers". Do you really > > want float64 to be the default for float96? > > > > Unless we are going to use long double as the default, then I'm not > > convinced that we should special-case the "double" type. > > > I guess I'm not really aware of the float96 type ... > Is that a "machine type" on any system ? I always thought that -- e.g. > coming from C -- double is "as good as it gets"... > Who uses float96 ? I heard somewhere that (some) CPUs use 80bits > internally when calculating 64bit double-precision... > > Is this not going into some academic argument !? > For all I know, calculating mean()s (and sum()s, ...) is always done in > double precision -- never in single precision, even when the data is in > float32. > > Having float32 be the default for float32 data is just requiring more > typing, and more explaining ... it would compromise numpy usability as > a day-to-day replacement for other systems. > > Sorry, if I'm being ignorant ... I'm going to side with Travis here. It is only a default and easily overridden. And yes, there are precisions greater than double. I was using quad precision back in the eighties on a VAX for some inherently ill conditioned problems. And on my machine long double is 12 bytes. Chuck |
From: Bill B. <wb...@gm...> - 2006-09-20 03:26:26
|
On 9/20/06, Charles R Harris <cha...@gm...> wrote: > > I guess I'm not really aware of the float96 type ... > > Is that a "machine type" on any system ? I always thought that -- e.g . > > coming from C -- double is "as good as it gets"... > > Who uses float96 ? I heard somewhere that (some) CPUs use 80bits > > internally when calculating 64bit double-precision... > > > I'm going to side with Travis here. It is only a default and easily > overridden. And yes, there are precisions greater than double. I was using > quad precision back in the eighties on a VAX for some inherently ill > conditioned problems. And on my machine long double is 12 bytes. > And on Intel chips the internal fp precision is 80bits. The D programming language even exposes this 80-bit floating point type to the user. http://www.digitalmars.com/d/type.html http://www.digitalmars.com/d/faq.html#real --bb |
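For anyone curious what "long double" means on their own machine, a quick (platform-dependent) check is:

    import numpy as np

    # On x86 Linux this is typically an 80-bit extended type padded to 12 or
    # 16 bytes (float96/float128); on some platforms it is just a 64-bit double.
    print(np.dtype(np.longdouble).itemsize)
    print(np.finfo(np.longdouble))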
From: Sebastian H. <ha...@ms...> - 2006-09-20 03:39:11
|
Charles R Harris wrote: > > > On 9/19/06, *Sebastian Haase* <ha...@ms... > <mailto:ha...@ms...>> wrote: > > Travis Oliphant wrote: > > Sebastian Haase wrote: > >> I still would argue that getting a "good" (smaller rounding > errors) answer > >> should be the default -- if speed is wanted, then *that* could > be still > >> specified by explicitly using dtype=float32 (which would also > be a possible > >> choice for int32 input) . > >> > > So you are arguing for using long double then.... ;-) > > > >> In image processing we always want means to be calculated in > float64 even > >> though input data is always float32 (if not uint16). > >> > >> Also it is simpler to say "float64 is the default" (full stop.) > - instead > >> > >> "float64 is the default unless you have float32" > >> > > "the type you have is the default except for integers". Do you > really > > want float64 to be the default for float96? > > > > Unless we are going to use long double as the default, then I'm not > > convinced that we should special-case the "double" type. > > > I guess I'm not really aware of the float96 type ... > Is that a "machine type" on any system ? I always thought that -- e.g . > coming from C -- double is "as good as it gets"... > Who uses float96 ? I heard somewhere that (some) CPUs use 80bits > internally when calculating 64bit double-precision... > > Is this not going into some academic argument !? > For all I know, calculating mean()s (and sum()s, ...) is always done in > double precision -- never in single precision, even when the data is in > float32. > > Having float32 be the default for float32 data is just requiring more > typing, and more explaining ... it would compromise numpy usability as > a day-to-day replacement for other systems. > > Sorry, if I'm being ignorant ... > > > I'm going to side with Travis here. It is only a default and easily > overridden. And yes, there are precisions greater than double. I was > using quad precision back in the eighties on a VAX for some inherently > ill conditioned problems. And on my machine long double is 12 bytes. > > Chuck > I just did a web search for "long double" http://www.google.com/search?q=%22long+double%22 and it does not look like there is much agreement on what that is - see also http://en.wikipedia.org/wiki/Long_double I really think that float96 *is* a special case - but computing mean()s and var()s in float32 would be "bad science". I hope I'm not alone in seeing numpy a great "interactive platform" to do evaluate data... I know that having too much knowledge of the details often makes one forget what the "newcomers" will do and expect. We are only talking about people that will a) work with single-precision data (e.g. large scale-image analysis) and who b) will tend to "just use the default" (dtype) --- How else can I say this: these people will just assume that arr.mean() *is* the mean of arr. -Sebastian |
From: Robert K. <rob...@gm...> - 2006-09-20 04:10:37
|
Sebastian Haase wrote: > I know that having too much knowledge of the details often makes one > forget what the "newcomers" will do and expect. Please be more careful with such accusations. Repeated frequently, they can become quite insulting. > We are only talking > about people that will a) work with single-precision data (e.g. large > scale-image analysis) and who b) will tend to "just use the default" > (dtype) --- How else can I say this: these people will just assume that > arr.mean() *is* the mean of arr. I don't understand what you mean, here. arr.mean() is almost never *the* mean of arr. Double precision can't fix that. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
From: Sebastian H. <ha...@ms...> - 2006-09-20 04:53:18
|
Robert Kern wrote:
> Sebastian Haase wrote:
>> I know that having too much knowledge of the details often makes one
>> forget what the "newcomers" will do and expect.
>
> Please be more careful with such accusations. Repeated frequently, they can
> become quite insulting.
>
I did not mean to insult anyone - what I meant was that I'm for numpy
becoming an easy platform to use. I have spent and enjoyed part of the last
four years developing and evangelizing Python as an alternative to Matlab-
and C/Fortran-based image analysis environments. I often find myself arguing
for good support of the single-precision data format. So I find it actually
somewhat ironic to see myself arguing now for wanting float64 over float32 ;-)

>> We are only talking
>> about people that will a) work with single-precision data (e.g. large
>> scale-image analysis) and who b) will tend to "just use the default"
>> (dtype) --- How else can I say this: these people will just assume that
>> arr.mean() *is* the mean of arr.
>
> I don't understand what you mean, here. arr.mean() is almost never *the* mean of
> arr. Double precision can't fix that.
>
This was not supposed to be a scientific statement -- I'm (again) thinking
of our students, who do not always appreciate the full complexity of
computational numerics and data types and such. The best I can hope for is a
"sound" default for most (practical) cases... I still think that 80-bit vs.
96-bit vs. 128-bit is rather academic for most people ... most people seem to
only use float64, and then there are some that use float32 (like us) ...

Cheers,
Sebastian |
From: Robert K. <rob...@gm...> - 2006-09-20 08:01:35
|
Sebastian Haase wrote:
> Robert Kern wrote:
>> Sebastian Haase wrote:
>>> I know that having too much knowledge of the details often makes one
>>> forget what the "newcomers" will do and expect.
>> Please be more careful with such accusations. Repeated frequently, they can
>> become quite insulting.
>>
> I did not mean to insult anyone - what I meant was that I'm for numpy
> becoming an easy platform to use. I have spent and enjoyed part of the last
> four years developing and evangelizing Python as an alternative to Matlab-
> and C/Fortran-based image analysis environments. I often find myself arguing
> for good support of the single-precision data format. So I find it actually
> somewhat ironic to see myself arguing now for wanting float64 over float32 ;-)

No one is doubting that you want numpy to be easy to use. Please don't assume
that the rest of us want otherwise. However, the fact that you *want* numpy to
be easy to use does not mean that your suggestions *will* make numpy easy to use.

We haven't forgotten what newcomers will do; to the contrary, we are quite aware
that new users need consistent behavior in order to learn how to use a system.
Adding another special case in how dtypes implicitly convert to one another will
impede new users being able to understand the whole system. See A. M.
Archibald's question in the thread "ufunc.reduce and conversion" for an example.
In our judgement this is a worse outcome than notational convenience for float32
users, who already need to be aware of the effects of their precision choice.
Each of us can come to different conclusions in good faith without one of us
forgetting the new user experience.

Let me offer a third path: the algorithms used for .mean() and .var() are
substandard. There are much better incremental algorithms that entirely avoid
the need to accumulate such large (and therefore precision-losing) intermediate
values. The algorithms look like the following for 1D arrays in Python:

def mean(a):
    m = a[0]
    for i in range(1, len(a)):
        m += (a[i] - m) / (i + 1)
    return m

def var(a):
    m = a[0]
    t = a.dtype.type(0)
    for i in range(1, len(a)):
        q = a[i] - m
        r = q / (i+1)
        m += r
        t += i * q * r
    t /= len(a)
    return t

Alternatively, from Knuth:

def var_knuth(a):
    m = a.dtype.type(0)
    variance = a.dtype.type(0)
    for i in range(len(a)):
        delta = a[i] - m
        m += delta / (i+1)
        variance += delta * (a[i] - m)
    variance /= len(a)
    return variance

If you will code up implementations of these for ndarray.mean() and
ndarray.var(), I will check them in and then float32 arrays will avoid most of
the catastrophes that the current implementations run into.

>>> We are only talking
>>> about people that will a) work with single-precision data (e.g. large
>>> scale-image analysis) and who b) will tend to "just use the default"
>>> (dtype) --- How else can I say this: these people will just assume that
>>> arr.mean() *is* the mean of arr.
>> I don't understand what you mean, here. arr.mean() is almost never *the* mean of
>> arr. Double precision can't fix that.
>>
> This was not supposed to be a scientific statement -- I'm (again) thinking
> of our students, who do not always appreciate the full complexity of
> computational numerics and data types and such.

They need to appreciate the complexity of computational numerics if they are
going to do numerical computation. Double precision does not make it any simpler.
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
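A rough sketch of how one might check such incremental updates against a float64 reference on float32 data (the test data here are only illustrative):

    import numpy as np

    def mean_incremental(a):
        # Robert's running-mean update, written for a 1-D array.
        m = a[0]
        for i in range(1, len(a)):
            m += (a[i] - m) / (i + 1)
        return m

    a = (1000 * np.random.rand(100_000)).astype(np.float32)
    ref = a.mean(dtype=np.float64)              # double-precision reference
    print(abs(a.mean(dtype=np.float32) - ref))  # library single-precision mean
    print(abs(mean_incremental(a) - ref))       # incremental running mean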
From: Sebastian H. <ha...@ms...> - 2006-09-20 15:50:33
|
Robert Kern wrote: >> This was not supposed to be a scientific statement -- I'm (again) >> thinking of our students that not always appreciate the full >> complexity >> of computational numerics and data types and such. > > They need to appreciate the complexity of computational numerics if > they are going to do numerical computation. Double precision does not > make it any simpler. This is were we differ. > We haven't forgotten what newcomers will do; to the contrary, we are > quite aware > that new users need consistent behavior in order to learn how to use a > system. > Adding another special case in how dtypes implicitly convert to one > another will > impede new users being able to understand the whole system. All I'm proposing could be summarized in: mean(), sum(), var() ... produce output of dtype float64 (except for input float96 which produces float96) A comment on this is also that for these operations the input type/precision is almost not related to the resulting output precision -- the int case makes that already clear. (This is different for e.g. min() or max() ) The proposed alternative implementations seem to have one or more multiplication (or division) for each value -- this might be noticeably slower ... Regards, Sebastian |
From: Tim H. <tim...@ie...> - 2006-09-20 16:08:38
|
Robert Kern wrote: > Sebastian Haase wrote: > >> Robert Kern wrote: >> >>> Sebastian Haase wrote: >>> >>>> I know that having too much knowledge of the details often makes one >>>> forget what the "newcomers" will do and expect. >>>> >>> Please be more careful with such accusations. Repeated frequently, they can >>> become quite insulting. >>> >>> >> I did not mean to insult anyone - what I meant was, that I'm for numpy >> becoming an easy platform to use. I have spend and enjoyed part the last >> four years developing and evangelizing Python as an alternative to >> Matlab and C/Fortran based image analysis environment. I often find >> myself arguing for good support of the single precision data format. So >> I find it actually somewhat ironic to see myself arguing now for wanting >> float64 over float32 ;-) >> > > No one is doubting that you want numpy to be easy to use. Please don't doubt > that the rest of us want otherwise. However, the fact that you *want* numpy to > be easy to use does not mean that your suggestions *will* make numpy easy to use. > > We haven't forgotten what newcomers will do; to the contrary, we are quite aware > that new users need consistent behavior in order to learn how to use a system. > Adding another special case in how dtypes implicitly convert to one another will > impede new users being able to understand the whole system. See A. M. > Archibald's question in the thread "ufunc.reduce and conversion" for an example. > In our judgement this is a worse outcome than notational convenience for float32 > users, who already need to be aware of the effects of their precision choice. > Each of us can come to different conclusions in good faith without one of us > forgetting the new user experience. > > Let me offer a third path: the algorithms used for .mean() and .var() are > substandard. There are much better incremental algorithms that entirely avoid > the need to accumulate such large (and therefore precision-losing) intermediate > values. The algorithms look like the following for 1D arrays in Python: > > def mean(a): > m = a[0] > for i in range(1, len(a)): > m += (a[i] - m) / (i + 1) > return m > > def var(a): > m = a[0] > t = a.dtype.type(0) > for i in range(1, len(a)): > q = a[i] - m > r = q / (i+1) > m += r > t += i * q * r > t /= len(a) > return t > > Alternatively, from Knuth: > > def var_knuth(a): > m = a.dtype.type(0) > variance = a.dtype.type(0) > for i in range(len(a)): > delta = a[i] - m > m += delta / (i+1) > variance += delta * (a[i] - m) > variance /= len(a) > return variance > > If you will code up implementations of these for ndarray.mean() and > ndarray.var(), I will check them in and then float32 arrays will avoid most of > the catastrophes that the current implementations run into. > +1 > >>>> We are only talking >>>> about people that will a) work with single-precision data (e.g. large >>>> scale-image analysis) and who b) will tend to "just use the default" >>>> (dtype) --- How else can I say this: these people will just assume that >>>> arr.mean() *is* the mean of arr. >>>> >>> I don't understand what you mean, here. arr.mean() is almost never *the* mean of >>> arr. Double precision can't fix that. >>> >>> >> This was not supposed to be a scientific statement -- I'm (again) >> thinking of our students that not always appreciate the full complexity >> of computational numerics and data types and such. >> > > They need to appreciate the complexity of computational numerics if they are > going to do numerical computation. 
Double precision does not make it any simpler. > > |
From: David M. C. <co...@ph...> - 2006-09-20 16:35:14
|
On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
> Let me offer a third path: the algorithms used for .mean() and .var() are
> substandard. There are much better incremental algorithms that entirely avoid
> the need to accumulate such large (and therefore precision-losing) intermediate
> values. The algorithms look like the following for 1D arrays in Python:
>
> def mean(a):
>     m = a[0]
>     for i in range(1, len(a)):
>         m += (a[i] - m) / (i + 1)
>     return m

This isn't really going to be any better than using a simple sum.
It'll also be slower (a division per iteration). You do avoid
accumulating large sums, but then doing the division a[i]/len(a) and
adding that will do the same.

Now, if you want to avoid losing precision, you want to use a better
summation technique, like compensated (or Kahan) summation:

def mean(a):
    s = e = a.dtype.type(0)
    for i in range(0, len(a)):
        temp = s
        y = a[i] + e
        s = temp + y
        e = (temp - s) + y
    return s / len(a)

Some numerical experiments in Maple using 5-digit precision show that
your mean is maybe a bit better than sum(a)/len(a) in some cases, but can
also be much worse; both are quite poor in comparison to the Kahan
summation.

(We could probably use a fast implementation of Kahan summation in
addition to a.sum())

> def var(a):
>     m = a[0]
>     t = a.dtype.type(0)
>     for i in range(1, len(a)):
>         q = a[i] - m
>         r = q / (i+1)
>         m += r
>         t += i * q * r
>     t /= len(a)
>     return t
>
> Alternatively, from Knuth:
>
> def var_knuth(a):
>     m = a.dtype.type(0)
>     variance = a.dtype.type(0)
>     for i in range(len(a)):
>         delta = a[i] - m
>         m += delta / (i+1)
>         variance += delta * (a[i] - m)
>     variance /= len(a)
>     return variance

These formulas are good when you can only do one pass over the data
(like in a calculator where you don't store all the data points), but
are slightly worse than doing two passes. Kahan summation would probably
be good here too.

--
|>|\/|<
/--------------------------------------------------------------------------\
|David M. Cooke http://arbutus.physics.mcmaster.ca/dmc/
|co...@ph... |
From: Charles R H. <cha...@gm...> - 2006-09-20 16:59:10
|
On 9/20/06, David M. Cooke <co...@ph...> wrote: > > On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: > > Let me offer a third path: the algorithms used for .mean() and .var() > are > > substandard. There are much better incremental algorithms that entirely > avoid > > the need to accumulate such large (and therefore precision-losing) > intermediate > > values. The algorithms look like the following for 1D arrays in Python: > > > > def mean(a): > > m = a[0] > > for i in range(1, len(a)): > > m += (a[i] - m) / (i + 1) > > return m > > This isn't really going to be any better than using a simple sum. > It'll also be slower (a division per iteration). You do avoid > accumulating large sums, but then doing the division a[i]/len(a) and > adding that will do the same. > > Now, if you want to avoid losing precision, you want to use a better > summation technique, like compensated (or Kahan) summation: > > def mean(a): > s = e = a.dtype.type(0) > for i in range(0, len(a)): > temp = s > y = a[i] + e > s = temp + y > e = (temp - s) + y > return s / len(a) A variant of this can also be used to generate the sin/cos for fourier transforms using repeated complex multiplication. It works amazingly well. But I suspect accumulating in higher precision would be just as good and faster if the hardware supports it. Divide and conquer, like an fft where only the constant coefficient is computed, is also a good bet but requires lg(n) passes over the data and extra storage. Some numerical experiments in Maple using 5-digit precision show that > your mean is maybe a bit better in some cases, but can also be much > worse, than sum(a)/len(a), but both are quite poor in comparision to the > Kahan summation. > > (We could probably use a fast implementation of Kahan summation in > addition to a.sum()) > > > def var(a): > > m = a[0] > > t = a.dtype.type(0) > > for i in range(1, len(a)): > > q = a[i] - m > > r = q / (i+1) > > m += r > > t += i * q * r > > t /= len(a) > > return t > > > > Alternatively, from Knuth: > > > > def var_knuth(a): > > m = a.dtype.type(0) > > variance = a.dtype.type(0) > > for i in range(len(a)): > > delta = a[i] - m > > m += delta / (i+1) > > variance += delta * (a[i] - m) > > variance /= len(a) > > return variance > > These formulas are good when you can only do one pass over the data > (like in a calculator where you don't store all the data points), but > are slightly worse than doing two passes. Kahan summation would probably > also be good here too. I think we should leave things as they are. Choosing the right precision for accumulating sums is absolutely fundamental, I always think about it when programming such loops because it is always a potential gotcha. Making the defaults higher precision just kicks the can down the road where the unwary are still likely to trip over it at some point. Perhaps these special tricks for special cases could be included in scipy somewhere. Chuck |
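A minimal recursive sketch of the divide-and-conquer summation Chuck alludes to (block size and test data are arbitrary):

    import numpy as np

    def pairwise_sum(a):
        # Split the array in half, sum each half, then add the partial sums.
        # Rounding error grows roughly like log2(n) rather than n.
        if len(a) <= 8:
            s = a.dtype.type(0)
            for x in a:
                s += x
            return s
        mid = len(a) // 2
        return pairwise_sum(a[:mid]) + pairwise_sum(a[mid:])

    a = np.random.rand(2**16).astype(np.float32)
    print(abs(pairwise_sum(a) / len(a) - a.mean(dtype=np.float64)))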
From: Robert K. <rob...@gm...> - 2006-09-20 17:23:49
|
David M. Cooke wrote: > On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >> Let me offer a third path: the algorithms used for .mean() and .var() are >> substandard. There are much better incremental algorithms that entirely avoid >> the need to accumulate such large (and therefore precision-losing) intermediate >> values. The algorithms look like the following for 1D arrays in Python: >> >> def mean(a): >> m = a[0] >> for i in range(1, len(a)): >> m += (a[i] - m) / (i + 1) >> return m > > This isn't really going to be any better than using a simple sum. > It'll also be slower (a division per iteration). With one exception, every test that I've thrown at it shows that it's better for float32. That exception is uniformly spaced arrays, like linspace(). > You do avoid > accumulating large sums, but then doing the division a[i]/len(a) and > adding that will do the same. Okay, this is true. > Now, if you want to avoid losing precision, you want to use a better > summation technique, like compensated (or Kahan) summation: > > def mean(a): > s = e = a.dtype.type(0) > for i in range(0, len(a)): > temp = s > y = a[i] + e > s = temp + y > e = (temp - s) + y > return s / len(a) > > Some numerical experiments in Maple using 5-digit precision show that > your mean is maybe a bit better in some cases, but can also be much > worse, than sum(a)/len(a), but both are quite poor in comparision to the > Kahan summation. > > (We could probably use a fast implementation of Kahan summation in > addition to a.sum()) +1 >> def var(a): >> m = a[0] >> t = a.dtype.type(0) >> for i in range(1, len(a)): >> q = a[i] - m >> r = q / (i+1) >> m += r >> t += i * q * r >> t /= len(a) >> return t >> >> Alternatively, from Knuth: >> >> def var_knuth(a): >> m = a.dtype.type(0) >> variance = a.dtype.type(0) >> for i in range(len(a)): >> delta = a[i] - m >> m += delta / (i+1) >> variance += delta * (a[i] - m) >> variance /= len(a) >> return variance > > These formulas are good when you can only do one pass over the data > (like in a calculator where you don't store all the data points), but > are slightly worse than doing two passes. Kahan summation would probably > also be good here too. Again, my tests show otherwise for float32. I'll condense my ipython log into a module for everyone's perusal. It's possible that the Kahan summation of the squared residuals will work better than the current two-pass algorithm and the implementations I give above. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
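A sketch of the variant Robert floats at the end - the usual two-pass variance, but with compensated (Kahan) accumulation of the squared residuals (this is only one way to write it, not anyone's checked-in implementation):

    import numpy as np

    def var_twopass_kahan(a):
        m = a.mean(dtype=a.dtype)      # first pass: mean in working precision
        s = c = a.dtype.type(0)
        for x in a:                    # second pass: compensated sum of residuals
            y = (x - m) * (x - m) - c
            t = s + y
            c = (t - s) - y
            s = t
        return s / a.dtype.type(len(a))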
From: Tim H. <tim...@ie...> - 2006-09-20 18:54:04
|
Robert Kern wrote:
> David M. Cooke wrote:
>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote:
>>> Let me offer a third path: the algorithms used for .mean() and .var() are
>>> substandard. There are much better incremental algorithms that entirely avoid
>>> the need to accumulate such large (and therefore precision-losing) intermediate
>>> values. The algorithms look like the following for 1D arrays in Python:
>>>
>>> def mean(a):
>>>     m = a[0]
>>>     for i in range(1, len(a)):
>>>         m += (a[i] - m) / (i + 1)
>>>     return m
>>>
>> This isn't really going to be any better than using a simple sum.
>> It'll also be slower (a division per iteration).
>>
> With one exception, every test that I've thrown at it shows that it's better for
> float32. That exception is uniformly spaced arrays, like linspace().
>
Here's a version of mean based on the Kahan sum, including the compensation
term, that seems to be consistently much better than the builtin mean. It
seems to be typically 4 orders of magnitude closer to the "correct" answer
than the builtin mean and 2 orders of magnitude better than just naively
using the Kahan sum. I'm using the mean performed with dtype=float64 as the
"correct" value.

def rawKahanSum(values):  # Stolen from mathworld
    if not len(values):
        return 0.0, 0.0
    total = values[0]
    c = type(total)(0.0)     # A running compensation for lost low-order bits.
    for x in values[1:]:
        y = x - c            # So far, so good: c is zero.
        t = total + y        # Alas, total is big, y small, so low-order digits of y are lost.
        c = (t - total) - y  # (t - total) recovers the high-order part of y;
                             # subtracting y recovers -(low part of y).
        total = t            # Algebraically, c should always be zero.
                             # Beware eagerly optimising compilers!
        # Next time around, the lost low part will be added to y in a fresh attempt.
    return total, c

def kahanSum(values):
    total, c = rawKahanSum(values)
    return total

def mean(values):
    total, c = rawKahanSum(values)
    n = float(len(values))
    return total / n - c / n

for i in range(100):
    data = random.random([10000]).astype(float32)
    baseline = data.mean(dtype=float64)
    delta_builtin_mean = baseline - data.mean()
    delta_compensated_mean = baseline - mean(data)
    delta_kahan_mean = baseline - (kahanSum(data) / len(data))
    if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean):
        print ">>>",
    print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean

-tim

> > You do avoid
> > accumulating large sums, but then doing the division a[i]/len(a) and
> > adding that will do the same.
>
> Okay, this is true.
>
>> Now, if you want to avoid losing precision, you want to use a better
>> summation technique, like compensated (or Kahan) summation:
>>
>> def mean(a):
>>     s = e = a.dtype.type(0)
>>     for i in range(0, len(a)):
>>         temp = s
>>         y = a[i] + e
>>         s = temp + y
>>         e = (temp - s) + y
>>     return s / len(a)
>>
>> Some numerical experiments in Maple using 5-digit precision show that
>> your mean is maybe a bit better in some cases, but can also be much
>> worse, than sum(a)/len(a), but both are quite poor in comparison to the
>> Kahan summation.
>> >> (We could probably use a fast implementation of Kahan summation in >> addition to a.sum()) >> > > +1 > > >>> def var(a): >>> m = a[0] >>> t = a.dtype.type(0) >>> for i in range(1, len(a)): >>> q = a[i] - m >>> r = q / (i+1) >>> m += r >>> t += i * q * r >>> t /= len(a) >>> return t >>> >>> Alternatively, from Knuth: >>> >>> def var_knuth(a): >>> m = a.dtype.type(0) >>> variance = a.dtype.type(0) >>> for i in range(len(a)): >>> delta = a[i] - m >>> m += delta / (i+1) >>> variance += delta * (a[i] - m) >>> variance /= len(a) >>> return variance >>> >> These formulas are good when you can only do one pass over the data >> (like in a calculator where you don't store all the data points), but >> are slightly worse than doing two passes. Kahan summation would probably >> also be good here too. >> > > Again, my tests show otherwise for float32. I'll condense my ipython log into a > module for everyone's perusal. It's possible that the Kahan summation of the > squared residuals will work better than the current two-pass algorithm and the > implementations I give above. > > |
From: Tim H. <tim...@ie...> - 2006-09-20 19:03:03
|
[Sorry, this version should have less munged formatting since I clipped the comments. Oh, and the Kahan sum algorithm was grabbed from wikipedia, not mathworld] Tim Hochberg wrote: > Robert Kern wrote: > >> David M. Cooke wrote: >> >> >>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >>> >>> >>>> Let me offer a third path: the algorithms used for .mean() and .var() are >>>> substandard. There are much better incremental algorithms that entirely avoid >>>> the need to accumulate such large (and therefore precision-losing) intermediate >>>> values. The algorithms look like the following for 1D arrays in Python: >>>> >>>> def mean(a): >>>> m = a[0] >>>> for i in range(1, len(a)): >>>> m += (a[i] - m) / (i + 1) >>>> return m >>>> >>>> >>> This isn't really going to be any better than using a simple sum. >>> It'll also be slower (a division per iteration). >>> >>> >> With one exception, every test that I've thrown at it shows that it's better for >> float32. That exception is uniformly spaced arrays, like linspace(). >> >> > Here's version of mean based on the Kahan sum, including the > compensation term that seems to be consistently much better than builtin > mean It seems to be typically 4 orders or magnitude closer to the > "correct" answer than the builtin mean and 2 orders of magnitude better > than just naively using the Kahan sum. I'm using the sum performed with > dtype=int64 as the "correct" value. > > > def rawKahanSum(values): if not input: return 0.0 total = values[0] c = type(total)(0.0) for x in values[1:]: y = x - c t = total + y c = (t - total) - y total = t return total, c def kahanSum(values): total, c = rawKahanSum(values) return total def mean(values): total, c = rawKahanSum(values) n = float(len(values)) return total / n - c / n for i in range(100): data = random.random([10000]).astype(float32) baseline = data.mean(dtype=float64) delta_builtin_mean = baseline - data.mean() delta_compensated_mean = baseline - mean(data) delta_kahan_mean = baseline - (kahanSum(data) / len(data)) if not abs(delta_builtin_mean) >= abs(delta_kahan_mean) >= abs(delta_compensated_mean): print ">>>", print delta_builtin_mean, delta_kahan_mean, delta_compensated_mean > -tim > >> > You do avoid >> > accumulating large sums, but then doing the division a[i]/len(a) and >> > adding that will do the same. >> >> Okay, this is true. >> >> >> >>> Now, if you want to avoid losing precision, you want to use a better >>> summation technique, like compensated (or Kahan) summation: >>> >>> def mean(a): >>> s = e = a.dtype.type(0) >>> for i in range(0, len(a)): >>> temp = s >>> y = a[i] + e >>> s = temp + y >>> e = (temp - s) + y >>> return s / len(a) >>> >>> Some numerical experiments in Maple using 5-digit precision show that >>> your mean is maybe a bit better in some cases, but can also be much >>> worse, than sum(a)/len(a), but both are quite poor in comparision to the >>> Kahan summation. 
>>> >>> (We could probably use a fast implementation of Kahan summation in >>> addition to a.sum()) >>> >>> >> +1 >> >> >> >>>> def var(a): >>>> m = a[0] >>>> t = a.dtype.type(0) >>>> for i in range(1, len(a)): >>>> q = a[i] - m >>>> r = q / (i+1) >>>> m += r >>>> t += i * q * r >>>> t /= len(a) >>>> return t >>>> >>>> Alternatively, from Knuth: >>>> >>>> def var_knuth(a): >>>> m = a.dtype.type(0) >>>> variance = a.dtype.type(0) >>>> for i in range(len(a)): >>>> delta = a[i] - m >>>> m += delta / (i+1) >>>> variance += delta * (a[i] - m) >>>> variance /= len(a) >>>> return variance >>>> >>>> >>> These formulas are good when you can only do one pass over the data >>> (like in a calculator where you don't store all the data points), but >>> are slightly worse than doing two passes. Kahan summation would probably >>> also be good here too. >>> >>> >> Again, my tests show otherwise for float32. I'll condense my ipython log into a >> module for everyone's perusal. It's possible that the Kahan summation of the >> squared residuals will work better than the current two-pass algorithm and the >> implementations I give above. >> >> >> > > > > ------------------------------------------------------------------------- > Take Surveys. Earn Cash. Influence the Future of IT > Join SourceForge.net's Techsay panel and you'll get the chance to share your > opinions on IT & business topics through brief surveys -- and earn cash > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV > _______________________________________________ > Numpy-discussion mailing list > Num...@li... > https://lists.sourceforge.net/lists/listinfo/numpy-discussion > > > |
From: Tim H. <tim...@ie...> - 2006-09-20 19:41:43
|
Robert Kern wrote: > David M. Cooke wrote: > >> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >> >>> Let me offer a third path: the algorithms used for .mean() and .var() are >>> substandard. There are much better incremental algorithms that entirely avoid >>> the need to accumulate such large (and therefore precision-losing) intermediate >>> values. The algorithms look like the following for 1D arrays in Python: >>> >>> def mean(a): >>> m = a[0] >>> for i in range(1, len(a)): >>> m += (a[i] - m) / (i + 1) >>> return m >>> >> This isn't really going to be any better than using a simple sum. >> It'll also be slower (a division per iteration). >> > > With one exception, every test that I've thrown at it shows that it's better for > float32. That exception is uniformly spaced arrays, like linspace(). > > > You do avoid > > accumulating large sums, but then doing the division a[i]/len(a) and > > adding that will do the same. > > Okay, this is true. > > >> Now, if you want to avoid losing precision, you want to use a better >> summation technique, like compensated (or Kahan) summation: >> >> def mean(a): >> s = e = a.dtype.type(0) >> for i in range(0, len(a)): >> temp = s >> y = a[i] + e >> s = temp + y >> e = (temp - s) + y >> return s / len(a) >> >> Some numerical experiments in Maple using 5-digit precision show that >> your mean is maybe a bit better in some cases, but can also be much >> worse, than sum(a)/len(a), but both are quite poor in comparision to the >> Kahan summation. >> >> (We could probably use a fast implementation of Kahan summation in >> addition to a.sum()) >> > > +1 > > >>> def var(a): >>> m = a[0] >>> t = a.dtype.type(0) >>> for i in range(1, len(a)): >>> q = a[i] - m >>> r = q / (i+1) >>> m += r >>> t += i * q * r >>> t /= len(a) >>> return t >>> >>> Alternatively, from Knuth: >>> >>> def var_knuth(a): >>> m = a.dtype.type(0) >>> variance = a.dtype.type(0) >>> for i in range(len(a)): >>> delta = a[i] - m >>> m += delta / (i+1) >>> variance += delta * (a[i] - m) >>> variance /= len(a) >>> return variance >>> >> These formulas are good when you can only do one pass over the data >> (like in a calculator where you don't store all the data points), but >> are slightly worse than doing two passes. Kahan summation would probably >> also be good here too. >> On the flip side, it doesn't take a very large array (somewhere in the vicinity of 10,000 values in my experience) before memory bandwidth becomes a limiting factor. In that region a two pass algorithm could well be twice as slow as a single pass algorithm even if the former is somewhat simpler in terms of numeric operations. -tim > > Again, my tests show otherwise for float32. I'll condense my ipython log into a > module for everyone's perusal. It's possible that the Kahan summation of the > squared residuals will work better than the current two-pass algorithm and the > implementations I give above. > |
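A crude way to see the one-pass vs. two-pass cost Tim mentions (numbers are machine-dependent; this only sketches how one might measure it):

    import numpy as np
    import timeit

    a = np.random.rand(10**7)   # much larger than cache
    one_pass = timeit.timeit(lambda: a.sum(), number=20)
    two_passes = timeit.timeit(lambda: (a.sum(), a.sum()), number=20)
    # Once the array no longer fits in cache, two full reads of the data
    # cost roughly twice as much as one.
    print(one_pass, two_passes)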
From: Tim H. <tim...@ie...> - 2006-09-21 17:57:05
|
Robert Kern wrote: > David M. Cooke wrote: > >> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >> >>> Let me offer a third path: the algorithms used for .mean() and .var() are >>> substandard. There are much better incremental algorithms that entirely avoid >>> the need to accumulate such large (and therefore precision-losing) intermediate >>> values. The algorithms look like the following for 1D arrays in Python: >>> >>> def mean(a): >>> m = a[0] >>> for i in range(1, len(a)): >>> m += (a[i] - m) / (i + 1) >>> return m >>> >> This isn't really going to be any better than using a simple sum. >> It'll also be slower (a division per iteration). >> > > With one exception, every test that I've thrown at it shows that it's better for > float32. That exception is uniformly spaced arrays, like linspace(). > > > You do avoid > > accumulating large sums, but then doing the division a[i]/len(a) and > > adding that will do the same. > > Okay, this is true. > > >> Now, if you want to avoid losing precision, you want to use a better >> summation technique, like compensated (or Kahan) summation: >> >> def mean(a): >> s = e = a.dtype.type(0) >> for i in range(0, len(a)): >> temp = s >> y = a[i] + e >> s = temp + y >> e = (temp - s) + y >> return s / len(a) >> >> Some numerical experiments in Maple using 5-digit precision show that >> your mean is maybe a bit better in some cases, but can also be much >> worse, than sum(a)/len(a), but both are quite poor in comparision to the >> Kahan summation. >> >> (We could probably use a fast implementation of Kahan summation in >> addition to a.sum()) >> > > +1 > > >>> def var(a): >>> m = a[0] >>> t = a.dtype.type(0) >>> for i in range(1, len(a)): >>> q = a[i] - m >>> r = q / (i+1) >>> m += r >>> t += i * q * r >>> t /= len(a) >>> return t >>> >>> Alternatively, from Knuth: >>> >>> def var_knuth(a): >>> m = a.dtype.type(0) >>> variance = a.dtype.type(0) >>> for i in range(len(a)): >>> delta = a[i] - m >>> m += delta / (i+1) >>> variance += delta * (a[i] - m) >>> variance /= len(a) >>> return variance >>> >> These formulas are good when you can only do one pass over the data >> (like in a calculator where you don't store all the data points), but >> are slightly worse than doing two passes. Kahan summation would probably >> also be good here too. >> > > Again, my tests show otherwise for float32. I'll condense my ipython log into a > module for everyone's perusal. It's possible that the Kahan summation of the > squared residuals will work better than the current two-pass algorithm and the > implementations I give above. > This is what my tests show as well var_knuth outperformed any simple two pass algorithm I could come up with, even ones using Kahan sums. Interestingly, for 1D arrays the built in float32 variance performs better than it should. After a bit of twiddling around I discovered that it actually does most of it's calculations in float64. It uses a two pass calculation, the result of mean is a scalar, and in the process of converting that back to an array we end up with float64 values. Or something like that; I was mostly reverse engineering the sequence of events from the results. -tim |
From: Tim H. <tim...@ie...> - 2006-09-21 18:35:06
Attachments:
numpy_statistics.py
|
Tim Hochberg wrote: > Robert Kern wrote: > >> David M. Cooke wrote: >> >> >>> On Wed, Sep 20, 2006 at 03:01:18AM -0500, Robert Kern wrote: >>> >>> >>>> Let me offer a third path: the algorithms used for .mean() and .var() are >>>> substandard. There are much better incremental algorithms that entirely avoid >>>> the need to accumulate such large (and therefore precision-losing) intermediate >>>> values. The algorithms look like the following for 1D arrays in Python: >>>> >>>> def mean(a): >>>> m = a[0] >>>> for i in range(1, len(a)): >>>> m += (a[i] - m) / (i + 1) >>>> return m >>>> >>>> >>> This isn't really going to be any better than using a simple sum. >>> It'll also be slower (a division per iteration). >>> >>> >> With one exception, every test that I've thrown at it shows that it's better for >> float32. That exception is uniformly spaced arrays, like linspace(). >> >> > You do avoid >> > accumulating large sums, but then doing the division a[i]/len(a) and >> > adding that will do the same. >> >> Okay, this is true. >> >> >> >>> Now, if you want to avoid losing precision, you want to use a better >>> summation technique, like compensated (or Kahan) summation: >>> >>> def mean(a): >>> s = e = a.dtype.type(0) >>> for i in range(0, len(a)): >>> temp = s >>> y = a[i] + e >>> s = temp + y >>> e = (temp - s) + y >>> return s / len(a) >>> >>> Some numerical experiments in Maple using 5-digit precision show that >>> your mean is maybe a bit better in some cases, but can also be much >>> worse, than sum(a)/len(a), but both are quite poor in comparision to the >>> Kahan summation. >>> >>> (We could probably use a fast implementation of Kahan summation in >>> addition to a.sum()) >>> >>> >> +1 >> >> >> >>>> def var(a): >>>> m = a[0] >>>> t = a.dtype.type(0) >>>> for i in range(1, len(a)): >>>> q = a[i] - m >>>> r = q / (i+1) >>>> m += r >>>> t += i * q * r >>>> t /= len(a) >>>> return t >>>> >>>> Alternatively, from Knuth: >>>> >>>> def var_knuth(a): >>>> m = a.dtype.type(0) >>>> variance = a.dtype.type(0) >>>> for i in range(len(a)): >>>> delta = a[i] - m >>>> m += delta / (i+1) >>>> variance += delta * (a[i] - m) >>>> variance /= len(a) >>>> return variance >>>> >>>> >>> These formulas are good when you can only do one pass over the data >>> (like in a calculator where you don't store all the data points), but >>> are slightly worse than doing two passes. Kahan summation would probably >>> also be good here too. >>> >>> >> Again, my tests show otherwise for float32. I'll condense my ipython log into a >> module for everyone's perusal. It's possible that the Kahan summation of the >> squared residuals will work better than the current two-pass algorithm and the >> implementations I give above. >> >> > This is what my tests show as well var_knuth outperformed any simple two > pass algorithm I could come up with, even ones using Kahan sums. > Interestingly, for 1D arrays the built in float32 variance performs > better than it should. After a bit of twiddling around I discovered that > it actually does most of it's calculations in float64. It uses a two > pass calculation, the result of mean is a scalar, and in the process of > converting that back to an array we end up with float64 values. Or > something like that; I was mostly reverse engineering the sequence of > events from the results. > Here's a simple of example of how var is a little wacky. A shape-[N] array will give you a different result than a shape-[1,N] array. 
The reason is clear -- in the second case the mean is not a scalar, so there
isn't the inadvertent promotion to float64, but it's still odd.

>>> data = (1000*(random.random([10000]) - 0.1)).astype(float32)
>>> print data.var() - data.reshape([1, -1]).var(-1)
[ 0.1171875]

I'm going to go ahead and attach a module containing the versions of mean,
var, etc. that I've been playing with, in case someone wants to mess with
them. Some were stolen from traffic on this list; for others I grabbed the
algorithms from wikipedia or equivalent.

-tim

> -tim
|