From: Benjamin R. <ben...@ou...> - 2011-11-07 19:50:44
On Sun, Nov 6, 2011 at 4:43 PM, Nathaniel Smith <nj...@po...> wrote:

> Hi matplotters,
>
> As any of you subscribed to the numpy-discussion list will have
> probably noticed, there's intense debate going on about how numpy can
> do a better job of handling missing data and masked arrays. Part of
> the problem is that we aren't actually sure what users need these
> features to do. There's one group who just wants R-style "missing
> data", and their needs are pretty straightforward -- they just want a
> magic value that indicates some data point doesn't actually exist. But
> it seems like there's also demand for a more "masked array"-like
> feature, similar to the current numpy.ma, where the mask is
> non-destructive and easily manipulable. No-one seems clear on how
> exactly this should work, though, and there's a lot of disagreement
> about what semantics make sense. (If you want more details, there's a
> wiki page summarizing some of this[1]).
>
> Since you seem to be the biggest users of numpy.ma, it would be really
> helpful if you could explain how you actually use it, so we can make
> sure that whatever we do in numpy-land is actually useful to you!
>
> What does matplotlib use masked arrays for? Is it just a convenient
> way to keep an array and a boolean mask together in one object, or do
> you take advantage of more numpy.ma features? For example, do you
> ever:
> - unmask values?
> - create multiple arrays that share the same storage for their data,
> but have different masks? (i.e., creating a new array with new
> elements masked, but without actually allocating the memory for a full
> array copy)
> - use reduction operations on masked arrays? (e.g., np.sum(masked_arr))
> - use binary operations on masked arrays? (e.g., masked_arr1 +
> masked_arr2)
>
> And while we're at it, any complaints about how numpy.ma works now,
> that a new version might do better?
>
> Thanks in advance,
> -- Nathaniel
>
> [1] https://github.com/njsmith/numpy/wiki/NA-discussion-status
>

Hi Nathaniel,

Unfortunately, I can't spend much more time on this topic due to my
dissertation work. I will let others elaborate further, if they wish.
But I think I can summarize it a bit.

First, we try our best to respect the multiple ways users can specify
missing data as input to our main plotting functions. The most common
are NaNs and np.ma masks. Given that we try to maintain compatibility
with older versions of NumPy, we are going to have to build some sort of
compatibility mechanism to unify any representation (NaNs, np.ma, NA (or
whatever it will be called)) under a single abstraction to be used
internally. This will probably be np.ma at first, until we can depend on
the existence of np.NA.

Second, with functions that take multiple input arrays (pretty much all
of them), a single mask has to be applied to all of the data (typically
a logical_or of the individual masks). Some other functions, such as the
pcolor family, have slightly more complicated mask merging.

The most important thing is that we do not modify the user's data, and
that we keep copies to a minimum. np.ma works great because we can
convert the arrays into masked arrays without a copy, and the
mask-merging process does not modify the user's input data.

I don't think we are using any of the more advanced features of np.ma,
but I can't be sure of that. I guess the tricky case (which probably
should be tested for) is when the input array is already a masked array:
we need to make sure that we aren't changing the user's pre-existing
mask.

Ben Root
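
For illustration, the mask merging described above could be sketched roughly
as follows. This is not matplotlib's actual code: the helper names
`combined_mask` and `apply_common_mask` are made up for this example, and it
assumes that all inputs have the same shape and that NaN is the only
"invalid" value of interest.

```python
import numpy as np
import numpy.ma as ma


def combined_mask(*arrays):
    """Hypothetical helper: logical OR of every input's mask and NaN positions."""
    mask = np.zeros(np.shape(arrays[0]), dtype=bool)
    for a in arrays:
        # All-False for plain ndarrays, the existing mask for masked arrays.
        # We only read from it; the in-place OR writes to our own array.
        mask |= ma.getmaskarray(a)
        data = ma.getdata(a)
        # Treat NaNs in floating-point data as missing as well.
        if np.issubdtype(data.dtype, np.floating):
            mask |= np.isnan(data)
    return mask


def apply_common_mask(*arrays):
    """Hypothetical helper: wrap each input in a masked array using the merged mask.

    copy=False wraps the existing data buffers instead of duplicating them,
    and the inputs' own masks are left untouched.
    """
    mask = combined_mask(*arrays)
    return [ma.array(ma.getdata(a), mask=mask, copy=False) for a in arrays]


x = np.array([1.0, np.nan, 3.0, 4.0])
y = ma.array([10.0, 20.0, 30.0, 40.0], mask=[False, False, True, False])
xm, ym = apply_common_mask(x, y)
```

Here `xm` and `ym` are masked wherever either input was NaN or masked, while
`x` and `y` themselves, including `y`'s original mask, stay as the user
supplied them.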