From: Scott G. <xs...@ya...> - 2002-04-11 11:31:53
Hello All. I'm interested in this project, and am curious about what level of outside contribution you are willing to accept. I just tried to subscribe to the developers list, but I didn't realize that required admin approval. Hopefully it doesn't look like I was shaking the door without knocking first. Is this list active? Is this the correct place to talk about Numarray?

A little about me: My name is Scott Gilbert, and I work as a software developer for a company called Rincon Research in Tucson, Arizona. We do a lot of digital signal processing/analysis, among other things. In the last year or so, we've started to use Python in various capacities, and we're hoping to use it for more. We need a good array module for various things. Some are similar to what it looks like Numarray is targeted at (fft, convolutions, etc...), and others are pretty different (providing buffers for reading data from specialized hardware, etc...).

About a week ago, I noticed that Guido over in Python developer land was willing to accept patches to the standard array module. As such, I thought I would take that opportunity to try and wedge some desirements and requirements I have into that baseline. Bummer for me, but they weren't exactly excited about bloating out arraymodule.c to meet my needs, and in retrospect that does make good sense. A number of people suggested that this might be a better place to try and get what I need. So here I am, poking around and wondering if I can play in your sandbox. If you're willing to let me contribute, my specific itches that I need to scratch are below. Otherwise - bummer, and I hope you all catch crabs... :-)

-----------------------------------

It's taken me a couple of days to understand what's going on in the source. I've read through the design docs, and the PEP, but it wasn't until I tried to re-implement it that it really clicked. My re-implementation of the array portion of what you're doing is attached. There are still some holes to fill in, but it's fairly complete and supports a whole bunch of things which yours does not (some of which you might even find useful: pickling, a Bit type). I'm pretty proud of it for only 400 lines of Python (most of which is the bazillion type declarations). It's probably riddled with bugs as it's less than a day old...

After initially thinking that you guys were getting too clever, I've come to realize it's a pretty good design overall. Still, I have some changes I would like to make if you'll let me (both to the design and the implementation).

-------------------------

Following your design for the Array stuff, I've been able to implement a pretty usable array class that supports the bazillion array types I need (Bit, Complex Integer, etc...). This gets me past my core requirements without polluting your world, but unfortunately my new XArray type doesn't play so well with your UFuncs. I think my users will definitely want to use your UFuncs when the time comes, so I want to remedy this situation.

The first change I would like to make is to rework your code that verifies that an object is a "usable" array. I think NumArray should only check for the interface required, not the actual type hierarchy. By this I mean that the minimum required to be a supported array type is that it support the correct attributes, not that it actually inherit from NDArray. Quoting from your paper, something like:

    _data
    _shape
    _strides
    _byteoffset
    _aligned
    _contiguous
    _type
    _byteswap

Most of these are just integer fields, or tuples of integers.
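A rough sketch of the kind of check I mean is below. The function name and attribute tuple are just illustrative - I'm not claiming this is what inputarray() looks like today, only that a test of this shape would be enough:

    # Hypothetical duck-typing check: accept any object that exposes the
    # documented attributes, instead of requiring isinstance(a, NDArray).
    _REQUIRED_ATTRS = ('_data', '_shape', '_strides', '_byteoffset',
                       '_aligned', '_contiguous', '_type', '_byteswap')

    def looks_like_ndarray(obj):
        """Return 1 if obj exposes the minimal NDArray attribute interface."""
        for name in _REQUIRED_ATTRS:
            if not hasattr(obj, name):
                return 0
        return 1

With something along those lines in place, my XArray class (or anybody else's extension type) could be handed to the UFunc machinery without inheriting from anything in particular.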
Ignoring _type for the moment, it appears that the interface required to be a NumArray is much less strict than actually requiring it to derive from NumArray. If you allow me to change a few functions (inputarray() in numarray.py is one small example), I could use my independent XArray class almost as is, and moreover I could implement new array objects (possibly as extension types) for crazy things like working with page-aligned memory, memory mapping, etc...

Well, that's almost enough. The _type field poses a small problem of sorts. It looks like you don't require a _type to be derived from NumericType, and this is a good thing since it allows me (and others) to implement NumArray-compatible arrays without actually requiring NumArray to be present. However, it would be nice if you declared a more comprehensive list of type names - even if they aren't all implemented in NumArray proper. Who knows, maybe the SciPy guys have a use for complex integers or bit arrays. If you make a reasonable canonical list, our data could be passed back and forth even if NumArray doesn't know what to do with it. See my attached module for the kinds of things I'm thinking of. I'm not so concerned about the "Native Types" that are in there, but I do think committing to a list of named standard types is worthwhile. (I suspect there are others who are interested in standard C types even if the size changes between machines...)

If you were to specify a minimal interface like this in the short term, I could begin propagating my array module to my users. I could get my work done now, knowing that I'll be compatible with NumArray proper once it matures. I'd be willing to participate in making these changes if necessary.

Looking at the big picture, I think it's desirable that there really only be one official standard for ND arrays in the Python world. That way, the various independent groups can all share their independent work. You guys are the heir apparent, so to speak, from the Python guys' point of view. I don't know if you're trying to get all of NumArray into the Python distribution or not, but I suspect a good interim step would be a PEP that specifies what it means to be a NumArray or NDArray in minimal terms, perhaps along with an array-only module in Python that implements this interface. Again, I'd be willing to help with all of this.

-------------------------

Ok, other suggestions... Here is the list of things that your design document indicates are required to be a NumArray:

    _data
    _shape
    _strides
    _byteoffset
    _aligned
    _contiguous
    _type
    _byteswap

I believe that one could calculate the values for _aligned and _contiguous from the other fields, so they shouldn't really be part of the required interface. I suspect it is useful for the C implementation of UFuncs to have this information in the NDInfo struct though, so while I would drop them from the attribute interface, I would delegate the task of calculating these values to getNDInfo() and/or getNumInfo(). I also notice that you chose _byteswap to indicate byteswapping is needed. I think a better choice would be to specify the endianness of the data (with an _endian attr), and have getNDInfo() and getNumInfo() calculate the _byteswap value for the NDInfo struct.
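To show what I mean, here is a rough sketch of the kind of calculation getNDInfo() could do. The helper names are mine, and I've left out the alignment test since that also needs the base pointer, which only the C side has handy:

    import sys

    def calc_contiguous(shape, strides, itemsize):
        # C-contiguous when each stride equals the item size times the
        # number of elements in all of the dimensions to its right.
        expected = itemsize
        for i in range(len(shape) - 1, -1, -1):
            if strides[i] != expected:
                return 0
            expected = expected * shape[i]
        return 1

    def calc_byteswap(endian):
        # endian is 'big' or 'little'; swap when it differs from the host.
        return endian != sys.byteorder

The point is just that _contiguous and _byteswap are derived quantities, so the array object shouldn't have to keep them consistent by hand.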
In my implementation, I came up with a slightly different list:

    self._endian
    self._offset
    self._shape
    self._stride
    self._itemtype
    self._itemsize
    self._itemformat
    self._buffer

The main differences are that _itemsize allows me to work with arrays of bytes without having any clue what the underlying type is (in some cases, _itemtype is "Unknown"). Secondly, I implemented a "Struct" _itemtype, and _itemformat is useful for this case. (It's the same format string that the struct module in Python uses.) Also, I specified 0 for _itemsize when the actual items aren't byte addressable. In my module, this only occurred with the Bit type. I figured specifying 0 like this could keep a UFunc that isn't Bit-aware from stepping on memory that it isn't allowed to.

-------------------------

Next thought: Memory Mapping

I really like the idea of having Python objects that map huge files a piece at a time without using all of available memory. I've seen this in NumArray's charter as part of the reason for breaking away from Numeric, and I'm curious how you intend to address it. Right now, the only requirement for _data seems to be that it implement the PyBufferProcs. For memory mapping, something else is needed...

I haven't implemented this, so take it as just my rambling thoughts: with the addition of 3 new, optional, attributes to the NumArray object interface, I think this could be efficiently accomplished:

    _mapproc
    _mapmin
    _mapmax

If _mapproc is present and not None, then it points to a function whose responsibility it is to set _mapmin and _mapmax appropriately. _mapproc takes one argument, which is the desired byte offset into the virtual array. This is probably easier to describe with code:

    def _mapproc(self, offset):
        # Placeholder operations: release the currently mapped window, then
        # map a new window of the file that includes the requested offset.
        unmap_the_old_range()
        mmap_a_new_range_that_includes(offset)
        self._mapmin = minimum_of_new_range()
        self._mapmax = maximum_of_new_range()

In this way, when the delta between _mapmin and _mapmax is large enough, the UFuncs could act over a large contiguous portion of the _data array at a time before another remapping is necessary. If the byte offset that a UFunc needs to work with is outside of _mapmin and _mapmax, it must call _mapproc to remedy the situation. This puts a lot of work into UFuncs that choose to support this. I suppose that is tough to avoid though.

Also, there are threading issues to think about here. I don't know if UFuncs are going to release the Global Interpreter Lock, but if they do, it's possible that multiple threads could have the same PyObject and try to _mapproc different offsets at different times. It is possible to implement a mutex for the NumArray without requiring anything special from the PyObject that implements it...

-----------------------------

Ok. That's probably way too much content for an introductory email. I do have more thoughts on this stuff though. They'll just have to wait for another time.

Nice to meet you all,

-Scott Gilbert