From: Pierre GM <pgm...@ma...> - 2006-10-16 06:35:05
Folks,

I just posted on the scipy/developers zone wiki
(http://projects.scipy.org/scipy/numpy/wiki/MaskedArray) a reimplementation of
the masked_array module, motivated by some problems I ran into while
subclassing MaskedArray.

The main differences from the initial numpy.core.ma package are that
MaskedArray is now a subclass of ndarray and that the _data section can now be
any subclass of ndarray (it should work in most cases, though some tweaking
might be required here and there). Apart from a couple of issues listed below,
the behavior of the new MaskedArray class reproduces the old one. It is quite
likely to be significantly slower, though: I was more interested in a clear
organization than in performance, so I tended to use wrappers liberally. I'm
sure we can improve that rather easily.

The new module, along with a test suite and some utilities, is available here:
http://projects.scipy.org/scipy/numpy/attachment/wiki/MaskedArray/maskedarray.py
http://projects.scipy.org/scipy/numpy/attachment/wiki/MaskedArray/masked_testutils.py
http://projects.scipy.org/scipy/numpy/attachment/wiki/MaskedArray/test_maskedarray.py

Please note that it's still a work in progress (even if it seems to work quite
OK when I use it). Suggestions, comments, improvements and general feedback
are more than welcome!
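[Editorial note: for readers unfamiliar with the masked-array semantics being reimplemented here, a minimal sketch using the numpy.ma interface — names as in current NumPy; the wiki module's API may differ slightly:]

```python
import numpy as np

# Create a masked array: the second element is flagged as missing.
a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# The modern MaskedArray is an ndarray subclass, as Pierre proposes.
assert isinstance(a, np.ndarray)

# Arithmetic propagates the mask rather than the (bogus) masked value.
b = a + 10.0
assert b.mask.tolist() == [False, True, False]

# filled() substitutes an explicit value for the masked slots.
assert a.filled(0.0).tolist() == [1.0, 0.0, 3.0]
```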
From: Michael S. <mic...@gm...> - 2006-10-17 02:08:53
Does this new MA class allow masking of recarray-like arrays? The numpy
(1.0b5) version does not seem to. e.g.

from numpy import *
desc = [('name','S30'),('age',int8),('weight',float32)]
a = array([('Bill',31,260.0),('Fred', 15, 145.0)], dtype=desc)
print a[0]
print a['name']
a2 = ma.array([('Bill',31,260.0),('Fred', 15, 145.0)], dtype=desc,
              mask=ma.nomask)
print a2[0]
print a2['name']
a3 = ma.array([('Bill',31,260.0),('Fred', 15, 145.0)], dtype=desc,
              mask=[[True,False,False],[False,False,False]])
print a3[0]
print a3['name']

-- output --
('Bill', 31, 260.0)
[Bill Fred]
('Bill', 31, 260.0)
[Bill Fred]
Traceback (most recent call last):
  File "D:\eclipse\Table\scripts\testrecarray.py", line 12, in ?
    a3 = ma.array([('Bill',31,260.0),('Fred', 15, 145.0)], dtype=desc, mask=[[True,False,False],[False,False,False]])
  File "C:\Python23\Lib\site-packages\numpy\core\ma.py", line 592, in __init__
    raise MAError, "Mask and data not compatible."
numpy.core.ma.MAError: Mask and data not compatible.

On 10/16/06, Pierre GM <pgm...@ma...> wrote:
> Folks,
> I just posted on the scipy/developers zone wiki
> (http://projects.scipy.org/scipy/numpy/wiki/MaskedArray) a reimplementation
> of the masked_array module, motivated by some problems I ran into while
> subclassing MaskedArray.
> [snip]
>
> -------------------------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> _______________________________________________
> Numpy-discussion mailing list
> Num...@li...
> https://lists.sourceforge.net/lists/listinfo/numpy-discussion
From: Pierre GM <pgm...@ma...> - 2006-10-17 02:43:11
On Monday 16 October 2006 22:08, Michael Sorich wrote:
> Does this new MA class allow masking of recarray-like arrays?

Excellent question! Which is to say, I have no idea... I don't use record
arrays, so I didn't think about testing them.

A first test indicates that it doesn't work either: the mask for a3 is
understood as having size 6, while the data is seen as having size 2 only
(exactly the same problem as in the original ma module). I'm going to poke
around and check whether I can come up with a workaround.

Just to make it clear: you want to be able to mask some fields in some
records, right? Not mask all the fields of a record?

Thanks for the feedback, I'll keep you posted.
From: Michael S. <mic...@gm...> - 2006-10-18 00:19:31
On 10/17/06, Pierre GM <pgm...@ma...> wrote:
> On Monday 16 October 2006 22:08, Michael Sorich wrote:
> > Does this new MA class allow masking of recarray-like arrays?
>
> Excellent question! Which is to say, I have no idea... I don't use record
> arrays, so I didn't think about testing them.
>
> A first test indicates that it doesn't work either: the mask for a3 is
> understood as having size 6, while the data is seen as having size 2 only
> (exactly the same problem as in the original ma module). I'm going to poke
> around and check whether I can come up with a workaround.
> Just to make it clear: you want to be able to mask some fields in some
> records, right? Not mask all the fields of a record?

Yes, that is correct. It would be preferable to have more control over
masking than simply masking an entire record. I view the heterogeneous
arrays as 2D arrays in which it should be possible to mask individual
values.
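[Editorial note: per-field masks of exactly this kind were eventually supported by numpy.ma for structured dtypes, so the behavior Michael asks for can be sketched against a modern NumPy:]

```python
import numpy as np

desc = [('name', 'S30'), ('age', np.int8), ('weight', np.float32)]
a3 = np.ma.array([(b'Bill', 31, 260.0), (b'Fred', 15, 145.0)], dtype=desc,
                 mask=[(True, False, False), (False, False, False)])

# Only Bill's name is masked; his other fields stay usable.
assert np.ma.is_masked(a3['name'][0])
assert a3['name'][1] == b'Fred'
assert a3['age'][0] == 31
```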
From: Pierre GM <pgm...@ma...> - 2006-10-23 12:52:28
Folks,

I updated the alternative implementation of MaskedArray on the wiki, mainly
to correct a couple of bugs.
(http://projects.scipy.org/scipy/numpy/wiki/MaskedArray)

In addition, I attached another file, maskedrecordarray, which introduces a
new class, MaskedRecord, as a subclass of recarray and MaskedArray. An
instance of this class accepts a recarray as data, and uses two masks: the
'recordmask' has as many entries as records in the array, each entry with
the same fields as a record, but of boolean type, indicating whether a field
is masked or not; an entry is flagged as masked in the 'mask' array if at
least one of its fields is masked. The 'mask' object is introduced mostly
for compatibility with MaskedArray; only 'recordmask' is really useful.

A few examples in the file should give you an idea of what can be done. In
particular, you can define a new masked record array as simply as:

a = masked_record([('Alan',29,200.), ('Bill',31,260.0)],
                  dtype=[('name','S30'),('age',int_),('weight',float_)],
                  mask=[(1,0,0), (0,0,0)])

Note that maskedrecordarray is still quite experimental. As I'm not a
regular user of records, I don't really know what should be implemented...

The file can be accessed at
http://projects.scipy.org/scipy/numpy/attachment/wiki/MaskedArray/maskedrecordarray.py

Once again, I need your comments and suggestions! Thanks.

Pierre
From: Michael S. <mic...@gm...> - 2006-10-24 06:50:21
I am currently running numpy rc2 (I haven't tried your reimplementation yet
as I am still using python 2.3). I am wondering whether the new maskedarray
is able to handle construction of arrays from masked scalar values (not sure
if this is the correct term). I ran across a situation recently when I was
picking individual values from a masked array, collecting them in a list and
then subsequently constructing an array with these values. This does not
work if any of the values chosen are masked. See the example below.

On a more general note, I am interested to find out whether there are any
other languages that handle masked/missing data well and, if so, how this is
done. My only experience is with R, which I have found to be quite good
(there is a special value NA that signifies a masked value - this can be
mixed in with non-masked values when defining an array).

from numpy import *
a = ma.array([1,2,3], mask=[True, False, False])
print a[0], type(a[0])
print a[1], type(a[1])
print list(a)
a = ma.array(list(a))

-- output --
-- <class 'numpy.core.ma.MaskedArray'>
2 <type 'numpy.int32'>
[array(data = 999999, mask = True, fill_value=999999) , 2, 3]
C:\Python23\Lib\site-packages\numpy\core\ma.py:604: UserWarning: Cannot
automatically convert masked array to numeric because data is masked in one
or more locations.
  warnings.warn("Cannot automatically convert masked array to "\
Traceback (most recent call last):
  File "D:\eclipse\Table\scripts\testrecarray.py", line 23, in ?
    a = ma.array(list(a))
  File "C:\Python23\Lib\site-packages\numpy\core\ma.py", line 562, in __init__
    c = numeric.array(data, dtype=tc, copy=True, order=order)
TypeError: an integer is required

On 10/16/06, Pierre GM <pgm...@ma...> wrote:
> Folks,
> I just posted on the scipy/developers zone wiki
> (http://projects.scipy.org/scipy/numpy/wiki/MaskedArray) a reimplementation
> of the masked_array module, motivated by some problems I ran into while
> subclassing MaskedArray.
> [snip]
From: Pierre GM <pgm...@gm...> - 2006-10-24 18:09:18
On Tuesday 24 October 2006 02:50, Michael Sorich wrote:
> I am currently running numpy rc2 (I haven't tried your
> reimplementation yet as I am still using python 2.3). I am wondering
> whether the new maskedarray is able to handle construction of arrays
> from masked scalar values (not sure if this is the correct term).

The answer is no, unfortunately: both the new and old implementations fail
at the same point, raising a TypeError. Lists are processed through
numpy.core.numeric.array, which doesn't handle that.

But there's a workaround: create an empty masked array and fill it by hand:

b = list(a)
A = MA.array(N.empty(len(b)))
for (k,v) in enumerate(b):
    A[k] = v

should work in most cases. I guess I could plug something along those lines
into the new implementation...

But there's a problem anyway: if you don't specify a type at the creation of
the empty array, the type will be determined automatically, which may be a
problem if you mix numbers and strings: the masked array is detected as a
string array in that case.
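[Editorial note: the fill-by-hand workaround above generalizes to a small helper. The function name and the sentinel convention below are invented for illustration, and the sketch uses the modern numpy.ma interface:]

```python
import numpy as np

def array_from_list(values, sentinel=None, dtype=float):
    # Hypothetical helper: build a masked array from a list in which
    # `sentinel` entries stand for missing values.
    data = [0 if v is sentinel else v for v in values]
    mask = [v is sentinel for v in values]
    return np.ma.array(data, mask=mask, dtype=dtype)

a = array_from_list([1.0, None, 3.0])
assert a.mask.tolist() == [False, True, False]
assert a.filled(-1.0).tolist() == [1.0, -1.0, 3.0]
```

Passing an explicit dtype sidesteps the pitfall Pierre mentions, where automatic type detection on a mixed list would yield a string or object array.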
From: Michael S. <mic...@gm...> - 2006-11-08 01:11:31
On 10/25/06, Pierre GM <pgm...@gm...> wrote:
> On Tuesday 24 October 2006 02:50, Michael Sorich wrote:
> > I am currently running numpy rc2 (I haven't tried your
> > reimplementation yet as I am still using python 2.3). I am wondering
> > whether the new maskedarray is able to handle construction of arrays
> > from masked scalar values (not sure if this is the correct term).
>
> The answer is no, unfortunately: both the new and old implementations fail
> at the same point, raising a TypeError. Lists are processed through
> numpy.core.numeric.array, which doesn't handle that.

I have finally gotten around to upgrading to python 2.4 and have had a
chance to play with your new version of the MaskedArray. It is great to see
that someone is working on this. I have a few thoughts on masked arrays that
may or may not warrant discussion.

1. It would be nice if the masked_singleton could be passed into a ndarray,
as this would allow it to be passed into the MaskedArray e.g.

import numpy as N
import ma.maskedarray as MA
test = N.array([1,2,MA.masked])
>> ValueError: setting an array element with a sequence

If the masked_singleton was implemented as an object that is not a
MaskedArray (which is a sequence that numpy.array chokes on), then a valid
numpy array with an object dtype could be produced. e.g.

class MaskedScalar:
    def __str__(self):
        return 'masked'

masked = MaskedScalar()
test = N.array([1,2,masked])
print test.dtype, test
>> object [1 2 masked]
print test == masked
>> [False False True]
print test[2] == masked
>> True
print test[2] is masked
>> True

Then it would be possible to alternatively define a masked array as
MA.array([1,2,masked]) or MA.array(N.array([1,2,masked])). In the __init__
of the MaskedArray, if the ndarray has an object dtype, simply calculate the
mask from a==masked.

2. What happens if a masked array is passed into a ndarray or passed into a
MaskedArray with another mask?

test_ma1 = MA.array([1,2,3], mask=[False, False, True])
print test_ma1
>> [1 2 --]
print N.array(test_ma1)
>> [1 2 3]
test_ma2 = MA.array(test_ma1, mask=[True, False, False])
print test_ma2
>> [-- 2 3]

I suppose it depends on whether you are masking useful data, or the masks
represent missing data. In the former case it may make sense to change or
remove the mask. However, in the latter case the original data entered is a
bogus value which should never be unmasked. In that case, when converting to
a ndarray, I think it makes more sense to make an object ndarray with the
missing values containing the masked singleton. Additionally, if the
MaskedArray is holding missing data, it does not make much sense to be able
to pass into the MA constructor both an existing ma and a mask.
From: Pierre GM <pgm...@gm...> - 2006-11-08 10:10:21
Michael,

First of all, thanks for your interest in this new implementation of
MaskedArray, which is basically nothing but an exercise in style.

On Tuesday 07 November 2006 20:11, Michael Sorich wrote:
> 1. It would be nice if the masked_singleton could be passed into a
> ndarray, as this would allow it to be passed into the MaskedArray e.g.
>
> import numpy as N
> import ma.maskedarray as MA
> test = N.array([1,2,MA.masked])
> >> ValueError: setting an array element with a sequence

I like your idea, but not its implementation. If MA.masked_singleton is
defined as an object, as you suggest, then the dtype of the ndarray it is
passed to becomes 'object', as you pointed out, and that is not something
one would naturally expect, as basic numerical functions don't work well
with the 'object' dtype (just try N.sqrt(N.array([1],dtype=N.object)) to see
what I mean). Even if we can construct a mask rather easily at the creation
of the masked array, following your 'a==masked' suggestion, we still need to
get the dtype of the non-masked section, and that doesn't seem trivial...

I guess that a simple solution is to use MA.masked_values. Make sure to use
a numerical value for the masked data, else you'll end up with yet another
object array.

> 2. What happens if a masked array is passed into a ndarray or passed
> into a MaskedArray with another mask?

>>> test_ma1 = MA.array([1,2,3], mask=[False, False, True])
>>> print test_ma1, N.array(test_ma1)
[1 2 --] [1 2 3]
>>> MA.array(test_ma1, mask=[True, False, False])
[-- 2 3]

Let me be clear that my objective was to get an implementation as close to
the original numpy.core.ma as possible, for 'backward compatibility'. I'm
not sure it'd be wise to change it at this point, but that could be
discussed. As you've noticed, when creating a new masked array from an
existing one, the 'mask' argument supersedes the initial mask. That's ideal
when you want to focus on a fraction of the initial data: you just mask what
you don't need, and are still able to retrieve it when you need it.

I agree that this default behavior is a bit strange when you have missing
data: in that case, one would expect the new mask to be a combination of the
'mask' argument and the old mask. A possibility would then be to add a
'keep_mask' flag: a default of False would give the current behavior, a
value of True would force the new mask to be a combination. I think that
feelings are mixed on this list about extra flags, but the users of
maskedarray are only a minority anyway (hopefully, only for the moment).

About the conversion to ndarray: by default, the result should have the same
dtype as the _data section. For this reason, I disagree with your idea of
"(returning) an object ndarray with the missing value containing the masked
singleton". If you really want an object ndarray, you can use the filled
method or the filled function, with your own definition of the filling value
(such as your MaskedScalar).
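[Editorial note: the proposed 'keep_mask' semantics can be sketched as a small helper; the function name is invented for illustration. Incidentally, numpy.ma.array did eventually grow a keep_mask flag, with a default of True:]

```python
import numpy as np

def remask(data, mask, keep_mask=False):
    # Illustration of the proposed flag: with keep_mask=True the new
    # mask is OR-ed with the existing one instead of replacing it.
    old = np.ma.getmaskarray(data)
    new = np.asarray(mask, dtype=bool)
    return np.ma.array(np.ma.getdata(data),
                       mask=(new | old) if keep_mask else new)

test_ma1 = np.ma.array([1, 2, 3], mask=[False, False, True])

# keep_mask=False (the behavior described above): new mask supersedes.
assert remask(test_ma1, [True, False, False]).mask.tolist() == \
    [True, False, False]

# keep_mask=True: old and new masks are combined.
assert remask(test_ma1, [True, False, False], keep_mask=True).mask.tolist() == \
    [True, False, True]
```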
From: A. M. A. <per...@gm...> - 2006-11-08 18:42:09
On 08/11/06, Pierre GM <pgm...@gm...> wrote:
> I like your idea, but not its implementation. If MA.masked_singleton is
> defined as an object, as you suggest, then the dtype of the ndarray it is
> passed to becomes 'object', as you pointed out, and that is not something
> one would naturally expect, as basic numerical functions don't work well
> with the 'object' dtype (just try N.sqrt(N.array([1],dtype=N.object)) to
> see what I mean).
> Even if we can construct a mask rather easily at the creation of the
> masked array, following your 'a==masked' suggestion, we still need to get
> the dtype of the non-masked section, and that doesn't seem trivial...

A good candidate for a "should be masked" marker is NaN. It is supposed to
mean, more or less, "no sensible value". Unfortunately, integer types do not
have such a special value. It's also conceivable that some user might want
to keep NaNs in their array separate from the mask. Finally, on some
hardware, operations with NaN are very slow (so leaving them in the array,
even masked, might not be a good idea).

The reason I suggest this is that in the last major application I had for
numpy, one stage of the problem would occasionally result in NaNs for
certain values, but the best thing I could do was leave them in place to
represent "no data". Switching to a MaskedArray might have been a better
idea, but the NaNs were a rare occurrence.

> About the conversion to ndarray:
> By default, the result should have the same dtype as the _data section.
> For this reason, I disagree with your idea of "(returning) an object
> ndarray with the missing value containing the masked singleton". If you
> really want an object ndarray, you can use the filled method or the filled
> function, with your own definition of the filling value (such as your
> MaskedScalar).

If you've got floating point, you can again fill in NaNs, but you have a
good point about wanting to extract the original values that were masked
out. Depending on what one is doing, one might want one or the other.

A. M. Archibald
From: Tim H. <tim...@ie...> - 2006-11-09 05:20:53
A. M. Archibald wrote:
> A good candidate for a "should be masked" marker is NaN. It is supposed
> to mean, more or less, "no sensible value". Unfortunately, integer types
> do not have such a special value. It's also conceivable that some user
> might want to keep NaNs in their array separate from the mask. Finally,
> on some hardware, operations with NaN are very slow (so leaving them in
> the array, even masked, might not be a good idea).

It has always been my experience (on various flavors of Pentium) that
operating on NaNs is extremely slow. Does anyone know on what hardware NaNs
are *not* slow? Of course it's always possible I just never notice NaNs on
hardware where they aren't slow.

[SNIP]

-tim
From: A. M. A. <per...@gm...> - 2006-11-09 07:27:49
On 08/11/06, Tim Hochberg <tim...@ie...> wrote:
> It has always been my experience (on various flavors of Pentium) that
> operating on NaNs is extremely slow. Does anyone know on what hardware
> NaNs are *not* slow? Of course it's always possible I just never notice
> NaNs on hardware where they aren't slow.

On an Opteron machine I have access to, they appear to be no slower (and
even faster for some transcendental functions) than ordinary floats:

In [13]: a=zeros(1000000)

In [14]: %time for i in xrange(1000): a += 1.1
CPU times: user 6.87 s, sys: 0.00 s, total: 6.87 s
Wall time: 6.87

In [15]: a *= NaN

In [16]: %time for i in xrange(1000): a += 1.1
CPU times: user 6.86 s, sys: 0.00 s, total: 6.86 s
Wall time: 6.87

On my Pentium M, they are indeed significantly slower (three times? I didn't
really do enough testing to say how much). I am actually rather offended by
this unfair discrimination against a citizen in good standing of the IEEE
floating point community.

A. M. Archibald
From: Tim H. <tim...@ie...> - 2006-11-09 18:05:35
A. M. Archibald wrote:
> On 08/11/06, Tim Hochberg <tim...@ie...> wrote:
> > It has always been my experience (on various flavors of Pentium) that
> > operating on NaNs is extremely slow. Does anyone know on what hardware
> > NaNs are *not* slow? Of course it's always possible I just never notice
> > NaNs on hardware where they aren't slow.
>
> On an Opteron machine I have access to, they appear to be no slower
> (and even faster for some transcendental functions) than ordinary
> floats:
> [snip]
> On my Pentium M, they are indeed significantly slower (three times? I
> didn't really do enough testing to say how much). I am actually rather
> offended by this unfair discrimination against a citizen in good
> standing of the IEEE floating point community.

If they're only 3x slower you're doing better than I am. On my Core Duo box
they appear to be nearly 20x slower for both addition and multiplication.
This more or less matches what I recall from earlier boxes.

>>> Timer("a.prod()", "import numpy as np; a = np.ones(4096); a[0] = np.nan").timeit(10000)
5.6495738983585397
>>> Timer("a.prod()", "import numpy as np; a = np.ones(4096)").timeit(10000)
0.34439041833525152
>>> Timer("a.sum()", "import numpy as np; a = np.ones(4096); a[0] = np.nan").timeit(10000)
6.0985655998179027
>>> Timer("a.sum()", "import numpy as np; a = np.ones(4096)").timeit(10000)
0.32354363473564263

I've been told that operations on NaNs are slow because they aren't always
implemented in the FPU hardware. Instead they are trapped and implemented in
software or firmware or something or other. That may well be bogus 42nd-hand
information though.

-tim
From: Pierre GM <pgm...@gm...> - 2006-11-08 21:08:35
> A good candidate for a "should be masked" marker is NaN. It is supposed
> to mean, more or less, "no sensible value".

Which might turn out to be the best indeed. Michael's application would then
look like:

>>> import numpy as N
>>> import maskedarray as MA
>>> maskit = N.nan
>>> test = N.array([1,2,maskit])
>>> test_ma1 = MA.array(test, mask=N.isnan(test))

> Switching to a MaskedArray might have been a better idea, but the NaNs
> were a rare occurrence.

Once again, that's a situation where one would use masked arrays.

> If you've got floating point, you can again fill in NaNs, but you have
> a good point about wanting to extract the original values that were
> masked out. Depending on what one is doing, one might want one or the
> other.

In any case, I think we should stick to the numpy.core.ma default behavior
for backward compatibility. If you really want to distinguish between
several kinds of masks (one for missing data, one for data to discard
temporarily), that could be done by defining a special subclass. But is it
really needed? A smart use of filled and masked_values should do the trick
in most cases.
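[Editorial note: both tricks mentioned here carry over to the current numpy.ma interface; a short sketch:]

```python
import numpy as np

# Masking by sentinel value: entries equal to -999.0 become masked.
x = np.array([1.0, -999.0, 3.0])
mx = np.ma.masked_values(x, -999.0)
assert mx.mask.tolist() == [False, True, False]

# Masking by NaN: build the mask with isnan, as suggested above.
y = np.array([1.0, np.nan, 3.0])
my = np.ma.array(y, mask=np.isnan(y))
assert my.sum() == 4.0  # masked entries are skipped by reductions
```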
From: Paul D. <pfd...@gm...> - 2006-11-09 04:02:02
2 cents from the author of the first folio:

The intent was to allow creation of masked arrays with copy=no, so that the
original data could be retrieved from it if desired. But I was quite, quite
rigorous about NEVER assuming the data in a masked slot made any sense
whatsoever.

The intention was that there are two ways to get a numeric array out of a
masked one:

1. Get the data field
2. m.filled()

(1) is strictly at your own risk.
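[Editorial note: in modern numpy.ma terms the two escape hatches look like this; `.data` plays the role of "the data field":]

```python
import numpy as np

m = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# (1) The raw data buffer: masked slots still hold whatever was there,
#     and you must NEVER assume those values make sense.
raw = m.data
assert raw.shape == (3,)

# (2) filled(): masked slots are replaced by an explicit fill value.
safe = m.filled(0.0)
assert safe.tolist() == [1.0, 0.0, 3.0]
```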
From: Eric F. <ef...@ha...> - 2006-11-09 21:50:14
It looks like on my Pentium M multiplication with NaNs is slow, but using a
masked array ranges from slightly faster (with only one value masked) to
twice as slow (with all values masked):

In [15]: Timer("a.prod()", "import numpy as np; aa = np.ones(4096); a = np.ma.masked_greater(aa,0)").timeit(10000)
Out[15]: 9.4012830257415771

In [16]: Timer("a.prod()", "import numpy as np; a = np.ones(4096); a[0] = np.nan").timeit(10000)
Out[16]: 5.5737850666046143

In [17]: Timer("a.prod()", "import numpy as np; a = np.ones(4096)").timeit(10000)
Out[17]: 0.40796804428100586

In [18]: Timer("a.prod()", "import numpy as np; aa = np.ones(4096); aa[0] = 2; a = np.ma.masked_greater(aa,1)").timeit(10000)
Out[18]: 4.1544749736785889

In [19]: Timer("a.prod()", "import numpy as np; a = np.ones(4096); a[:] = np.nan").timeit(10000)
Out[19]: 5.8589630126953125

For transcendentals, NaNs or masks don't make much difference, although
masks are slightly faster than NaNs:

In [20]: Timer("np.sin(a)", "import numpy as np; a = np.ones(4096); a[:] = np.nan").timeit(10000)
Out[20]: 4.5575671195983887

In [21]: Timer("np.ma.sin(a)", "import numpy as np; aa = np.ones(4096); a = np.ma.masked_greater(aa,0)").timeit(10000)
Out[21]: 4.4125270843505859

In [22]: Timer("b=np.sin(a)", "import numpy as np; a = np.ones(4096)").timeit(10000)
Out[22]: 3.5793929100036621

Eric

Tim Hochberg wrote:
> If they're only 3x slower you're doing better than I am. On my Core Duo
> box they appear to be nearly 20x slower for both addition and
> multiplication. This more or less matches what I recall from earlier
> boxes.
> [snip]
> I've been told that operations on NaNs are slow because they aren't
> always implemented in the FPU hardware. Instead they are trapped and
> implemented in software or firmware or something or other. That may well
> be bogus 42nd-hand information though.
From: Christopher B. <Chr...@no...> - 2006-11-09 21:58:34
Tim Hochberg wrote:
> I've been told that operations on NaNs are slow because they aren't
> always implemented in the FPU hardware. Instead they are trapped and
> implemented in software or firmware or something or other.

which still doesn't make sense -- doesn't ANY operation with a NaN return
NaN? How hard could that be?

I share in the disappointment that NaNs are not first class citizens in
common hardware...

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

NOAA/OR&R/HAZMAT           (206) 526-6959 voice
7600 Sand Point Way NE     (206) 526-6329 fax
Seattle, WA 98115          (206) 526-6317 main reception

Chr...@no...
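[Editorial note: the propagation rule Chris alludes to is indeed simple at the semantic level - any arithmetic involving a NaN yields NaN, and NaN compares unequal even to itself:]

```python
import math

x = float('nan')

# Arithmetic with NaN propagates NaN.
assert math.isnan(x + 1.0)
assert math.isnan(x * 0.0)

# NaN is the only float unequal to itself - a common portable test.
assert x != x
assert not (x == x)
```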
From: Tim H. <tim...@ie...> - 2006-11-09 22:14:28
Christopher Barker wrote:
> Tim Hochberg wrote:
> > I've been told that operations on NaNs are slow because they aren't
> > always implemented in the FPU hardware. Instead they are trapped and
> > implemented in software or firmware or something or other.
>
> which still doesn't make sense -- doesn't ANY operation with a NaN
> return NaN? How hard could that be?

Well, it's not free. You still have to recognize a particular bit pattern
corresponding to a QNaN and then copy the result in that case. All that
probably burns more silicon for a case that most people don't care about. Do
NaNs make Half-Life run any faster? No, I didn't think so. I suspect that
instead they just look for the case where the exponent is all 1s, in which
case the result is either +-Inf, SNaN or QNaN, and then drop into software
to handle those infrequent cases.

> I share in the disappointment that NaNs are not first class citizens in
> common hardware...

It would be nice to see IEEE 754 supported more faithfully, but there seems
to be more of a problem on the software side than on the hardware side in
practice. I still can't help feeling that using NaNs for mask values would
be abuse even if it weren't slow, and eventually it would reach around and
bite you someplace tender.

-tim
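[Editorial note: Tim's "exponent is all 1s" observation is easy to verify in software. For an IEEE-754 double, the 11 exponent bits are all ones exactly for +-Inf and NaNs (quiet or signaling):]

```python
import struct

def exponent_all_ones(x):
    # Reinterpret the double as its 64-bit pattern and test the
    # 11-bit exponent field (bits 52-62).
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return (bits >> 52) & 0x7FF == 0x7FF

assert exponent_all_ones(float('nan'))
assert exponent_all_ones(float('inf'))
assert exponent_all_ones(float('-inf'))
assert not exponent_all_ones(1.0)
assert not exponent_all_ones(0.0)
```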
From: Paul D. <pfd...@gm...> - 2006-11-09 23:00:32
|
Disappointed in NaN land? Since the function of old retired persons is to tell youngsters stories around the campfire:

A supercomputer hardware designer told me that when forced to add IEEE arithmetic to his designs, it decreased performance substantially, maybe 25-30%; it wasn't that doing the operations took so much longer, it was that the increased physical space needed for that circuitry pushed the memory farther away. Doubtless this inspired doing some of it in software instead.

No standard for controlling the behaviors exists, either, so you can find out the hard way that underflow-to-zero is being done in software by default, and that you are doing a lot of it. Or that your code doesn't have the same behavior on different platforms.

To my mind, all that was really accomplished was to convince said youngsters that somehow this NaN stuff was the solution to some problem. In reality, computing for 3 days and having it print out 1000 NaNs is not exactly satisfying. I think this stuff was essentially a mistake, in the sense that it is a nice ivory-tower idea that costs more in practice than it is worth. I do not think a properly thought-out and coded algorithm ought to do anything that this stuff is supposed to 'help' with, and if it does do it, the code should stop executing.

Anyway, if I thought it would do the job I wouldn't have written MA in the first place.

Rant off. Nap. Grumble. Stupid kids. (:->

-- Paul
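Paul's "the code should stop executing" behavior is in fact available in numpy: np.seterr lets you turn silent NaN production into a hard FloatingPointError. A sketch (assuming numpy is importable; the exact error message text is implementation detail, not shown):

```python
import numpy as np

# By default, 0/0 on arrays silently yields NaN with a warning.
# Switching the 'invalid' handler to 'raise' makes it abort instead.
old_settings = np.seterr(invalid="raise")
try:
    np.zeros(3) / np.zeros(3)   # 0/0 is an IEEE 'invalid' operation
except FloatingPointError:
    print("trapped: invalid operation raised instead of producing NaN")
finally:
    np.seterr(**old_settings)   # restore the previous error state
```

This is per-process, not a portable IEEE-flags API, but it does give the "stop at the first NaN" discipline Paul asks for.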
From: A. M. A. <per...@gm...> - 2006-11-10 06:43:12
|
On 09/11/06, Paul Dubois <pfd...@gm...> wrote:
> Since the function of old retired persons is to tell youngsters stories
> around the campfire:

I'll pull up a log. But since I'm uppity, and not especially young, I hope you won't mind if I heckle.

> A supercomputer hardware designer told me that when forced to add IEEE
> arithmetic to his designs that it decreased performance substantially, maybe
> 25-30%; it wasn't that doing the operations took so much longer, it was that
> the increased physical space needed for that circuitry pushed the memory
> farther away. Doubtless this inspired doing some of it in software instead.

The goal of IEEE floats is not to be faster but to make correct numerical programming easier. (Numerical analysts can write robust numerical code for almost any kind of float, but the vast majority of numerical code is written by scientists who know the bare minimum or less about numerical analysis, so a good system makes such code work as well as possible.)

The other reason IEEE floats are good is, no disrespect to your hardware designer friend intended, that they were designed with some care. Reimplementations are liable to get some of the finicky details wrong (round-to-even, denormalized numbers, what have you...). I found Kahan's "How Java's Floating-Point Hurts Everyone Everywhere" ( http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf ) very educational on what can go wrong with platform-dependent floating point.

> No standard for controlling the behaviors exists, either, so you can find
> out the hard way that underflow-to-zero is being done in software by
> default, and that you are doing a lot of it. Or that your code doesn't have
> the same behavior on different platforms.

Well, if the platform is IEEE-compliant, you can trust it to behave the same way logically, even if some applications are slower on some systems than others. Or did you mean that there's no standard way to set the various IEEE flags?
That seems like a language/compiler issue to me (which is not to minimize the pain it causes!).

> To my mind, all that was really accomplished was to convince said youngsters
> that somehow this NaN stuff was the solution to some problem. In reality,
> computing for 3 days and having it print out 1000 NaNs is not exactly
> satisfying. I think this stuff was essentially a mistake in the sense that
> it is a nice ivory-tower idea that costs more in practice than it is worth.
> I do not think a properly thought-out and coded algorithm ought to do
> anything that this stuff is supposed to 'help' with, and if it does do it,
> the code should stop executing.

Well, then turn on floating-point exceptions. I find a few NaNs in my calculations relatively benign. If I'm doing a calculation where they'll propagate and destroy everything, I trap them. But it happens just as often that I launch a job, a NaN appears early on, and I come back in the morning to find that instead of aborting, the job has completed except for a few bad data points.

If you want an algorithm that takes advantage of NaNs, I have one. I was simulating light rays around a black hole, generating a map of the bending for various view angles, so I naturally did a lot of exploring of grazing incidence, where the calculations would often yield NaNs. I used the NaNs to fill in the gap between the internal bending of light and the external bending of the light; no extra programming effort required, since that was what came out of the ray tracer anyway. The rendering step just returned NaNs for the image pixels that came from interpolating in the NaN-filled regions, so they came out black; just what I wanted. I wrote the program ignoring what would happen if a NaN appeared, and it came out right. Which is, I think, the goal - to save the programmer having to think too much about numerical-analysis headaches.

> Anyway, if I thought it would do the job I wouldn't have written MA in the
> first place.
It's certainly not always a good solution; in particular, it's hard to control the behaviour of things like where(). There's also no NaN for integer types, which makes MaskedArray more useful there.

Anyway, I am pleased that numpy has both controllable IEEE floats and MaskedArray, and I apologize for a basically off-topic post.

A. M. Archibald |
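The integer-types point is the clearest case for MaskedArray over NaN-as-mask: float arrays can carry NaN as an implicit missing value, but integer arrays cannot. A short comparison sketch (assuming numpy; the example data is made up):

```python
import numpy as np
import numpy.ma as ma

# Integers have no NaN, so a missing value needs an explicit mask.
counts = ma.array([3, 99, 7, 12],
                  mask=[False, True, False, False],   # 99 is "missing"
                  dtype=np.int32)
print(counts.sum())     # 22 -- the masked element is simply ignored

# The float-only alternative: NaN as an implicit mask, with the
# nan-aware reductions doing the ignoring.
vals = np.array([3.0, np.nan, 7.0, 12.0])
print(np.nansum(vals))  # 22.0
```

Both give the same reduction here, but only the masked version works for integer dtypes, and only the masked version distinguishes "missing" from a legitimately computed NaN.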