From: Scott G. <xs...@ya...> - 2002-04-11 11:31:53
Attachments:
XArray.py
|
Hello All.

I'm interested in this project, and am curious as to what level you are willing to accept outside contribution. I just tried to subscribe to the developers list, but I didn't realize that required admin approval. Hopefully it doesn't look like I was shaking the door without knocking first.

Is this list active? Is this the correct place to talk about Numarray?

A little about me: My name is Scott Gilbert, and I work as a software developer for a company called Rincon Research in Tucson, Arizona. We do a lot of digital signal processing/analysis among other things. In the last year or so, we've started to use Python in various capacities, and we're hoping to use it for more things. We need a good array module for various things. Some are similar to what it looks like Numarray is targeted at (fft, convolutions, etc...), and others are pretty different (providing buffers for reading data from specialized hardware etc...)

About a week ago, I noticed that Guido over in Python developer land was willing to accept patches to the standard array module. As such, I thought I would take that opportunity to try and wedge some desirements and requirements I have into that baseline. Bummer for me, but they weren't exactly excited about bloating out arraymodule.c to meet my needs, and in retrospect that does make good sense.

A number of people suggested that this might be a better place to try and get what I need. So here I am, poking around and wondering if I can play in your sandbox. If you're willing to let me contribute, my specific itches that I need to scratch are below. Otherwise - bummer, and I hope you all catch crabs... :-)

-----------------------------------

It's taken me a couple of days to understand what's going on in the source. I've read through the design docs, and the PEP, but it wasn't until I tried to re-implement it that it really clicked. My re-implementation of the array portion of what you're doing is attached.
There are still some holes to fill in, but it's fairly complete and supports a whole bunch of things which yours does not (Some of which you might even find useful: Pickling, Bit type). I'm pretty proud of it for only 400 lines of Python (Most of which is the bazillion type declarations). It's probably riddled with bugs as it's less than a day old...

After initially thinking that you guys were getting too clever, I've come to realize it's a pretty good design overall. Still I have some changes I would like to make if you'll let me. (Both to the design and the implementation)

-------------------------

Following your design for the Array stuff, I've been able to implement a pretty usable array class that supports the bazillion array types I need (Bit, Complex Integer, etc...). This gets me past my core requirements without polluting your world, but unfortunately my new XArray type doesn't play so well with your UFuncs. I think my users will definitely want to use your UFuncs when the time comes, so I want to remedy this situation.

The first change I would like to make is to rework your code that verifies that an object is a "usable" array. I think NumArray should only check for the interface required, not the actual type hierarchy. By this I mean that the minimum required to be a supported array type is that it support the correct attributes, not that it actually inherit from NDArray:

(quoting from your paper) something like:

    _data
    _shape
    _strides
    _byteoffset
    _aligned
    _contiguous
    _type
    _byteswap

Most of these are just integer fields, or tuples of integers. Ignoring _type for the moment, it appears that the interface required to be a NumArray is much less strict than actually requiring it to derive from NumArray.
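The attribute-based check argued for above could be sketched roughly as follows. This is only an illustration in present-day Python, not numarray code: the attribute names are the ones quoted from the design paper, while `missing_array_attributes`, `is_usable_array` and `IndependentArray` are hypothetical names invented for the example.

```python
# Attribute names as quoted from the design paper in the message above.
REQUIRED_ATTRIBUTES = (
    "_data", "_shape", "_strides", "_byteoffset",
    "_aligned", "_contiguous", "_type", "_byteswap",
)


def missing_array_attributes(obj):
    """Return the required attribute names that obj does not provide."""
    return [name for name in REQUIRED_ATTRIBUTES if not hasattr(obj, name)]


def is_usable_array(obj):
    """Duck-typed check: any object with every required attribute passes."""
    return not missing_array_attributes(obj)


class IndependentArray:
    """Hypothetical stand-in for a class like XArray; no special base class."""

    def __init__(self):
        self._data = bytearray(8 * 4)   # room for four 8-byte items
        self._shape = (2, 2)
        self._strides = (16, 8)
        self._byteoffset = 0
        self._aligned = 1
        self._contiguous = 1
        self._type = "Float64"
        self._byteswap = 0


print(is_usable_array(IndependentArray()))
print(missing_array_attributes([1, 2, 3]))
```

A check like this only confirms that the attributes exist; whether their values are sensible would still have to be validated by whatever consumes them.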
If you allow me to change a few functions (inputarray() in numarray.py is one small example), I could use my independent XArray class almost as is, and moreover I can implement new array objects (possibly as extension types) for crazy things like working with page aligned memory, memory mapping etc...

Well, that's almost enough. The _type field poses a small problem of sorts. It looks like you don't require a _type to be derived from NumericType, and this is a good thing since it allows me (and others) to implement NumArray compatible arrays without actually requiring NumArray to be present. However, it would be nice if you declared a more comprehensive list of typenames - even if they aren't all implemented in NumArray proper. Who knows, maybe the SciPy guys have a use for complex integers or bit arrays. If you make a reasonable canonical list, our data could be passed back and forth even if NumArray doesn't know what to do with it.

See my attached module for the types of things I'm thinking of. I'm not so concerned about the "Native Types" that are in there, but I think committing to a list of named standard types would be worthwhile. (I suspect there are others that are interested in standard C types even if the size changes between machines...)

If you were to specify a minimal interface like this in the short term, I could begin propagating my array module to my users. I could get my work done now, knowing that I'll be compatible with NumArray proper once it matures. I'd be willing to participate in making these changes if necessary.

Looking at the big picture, I think it's desirable that there really only be one official standard for ND arrays in the Python world. That way, the various independent groups can all share their independent work. You guys are the heir-apparent, so to speak, from the Python guys' point of view.
I don't know if you're trying to get all of NumArray into the Python distribution or not, but I suspect a good interim step would be to have a PEP that specifies what it means to be a NumArray or NDArray in minimal terms. Perhaps supplying an Array only module in Python that implements this interface. Again, I'd be willing to help with all of this.

-------------------------

Ok, other suggestions...

Here is the list of things that your design document indicates are required to be a NumArray:

    _data
    _shape
    _strides
    _byteoffset
    _aligned
    _contiguous
    _type
    _byteswap

I believe that one could calculate the values for _aligned and _contiguous from the other fields. So they shouldn't really be part of the interface required. I suspect it is useful for the C implementation of UFuncs to have this information in the NDInfo struct though, so while I would drop them from the attribute interface, I would delegate the task of calculating these values to getNDInfo() and/or getNumInfo().

I also notice that you chose _byteswap to indicate byteswapping is needed. I think a better choice would be to specify the endian-ness of the data (with an _endian attr), and have getNDInfo() and getNumInfo() calculate the _byteswap value for the NDInfo struct.

In my implementation, I came up with a slightly different list:

    self._endian
    self._offset
    self._shape
    self._stride
    self._itemtype
    self._itemsize
    self._itemformat
    self._buffer

The only minimal differences are that _itemsize allows me to work with arrays of bytes without having any clue what the underlying type is (in some cases, _itemtype is "Unknown".) Secondly, I implemented a "Struct" _itemtype, and _itemformat is useful for this case. (It's the same format string that the struct module in Python uses.)

Also, I specified 0 for _itemsize when the actual items aren't byte addressable. In my module, this only occurred with the Bit type.
I figured specifying 0 like this could keep a UFunc that isn't Bit aware from stepping on memory that it isn't allowed to.

-------------------------

Next thought: Memory Mapping

I really like the idea of having Python objects that map huge files a piece at a time without using all of available memory. I've seen this in NumArray's charter as part of the reason for breaking away from Numeric, and I'm curious how you intend to address it.

Right now, the only requirement for _data seems to be that it implement the PyBufferProcs. For memory mapping something else is needed...

I haven't implemented this, so take it as just my rambling thoughts:

With the addition of 3 new, optional, attributes to the NumArray object interface, I think this could be efficiently accomplished:

    _mapproc
    _mapmin
    _mapmax

If _mapproc is present and not None, then it points to a function whose responsibility it is to set _mapmin and _mapmax appropriately. _mapproc takes one argument which is the desired byte offset into the virtual array. This is probably easier to describe with code:

    def _mapproc(self, offset):
        unmap_the_old_range()
        mmap_a_new_range_that_includes_byteoffset()
        self._mapmin = minimum_of_new_range()
        self._mapmax = maximum_of_new_range()

In this way, when the delta between _mapmin and _mapmax is large enough, the UFuncs could act over a large contiguous portion of the _data array at a time before another remapping is necessary. If the byteoffset that a UFunc needs to work with is outside of _mapmin and _mapmax, it must call _mapproc to remedy the situation.

This puts a lot of work into UFuncs that choose to support this. I suppose that is tough to avoid though.

Also, there are threading issues to think about here. I don't know if UFuncs are going to release the Global Interpreter Lock, but if they do it's possible that multiple threads could have the same PyObject and try to _mapproc different offsets at different times.
It is possible to implement a mutex for the NumArray without requiring anything special from the PyObject that implements it...

-----------------------------

Ok. That's probably way too much content for an Introductory email. I do have more thoughts on this stuff though. They'll just have to wait for another time.

Nice to meet you all,

-Scott Gilbert |
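The _mapproc / _mapmin / _mapmax idea from the message above, together with the mutex it mentions, could look something like the sketch below. Everything here is an assumption made for illustration: `WindowedBuffer` and `read_double` are hypothetical names, the file is assumed to hold little-endian 8-byte floats at offsets that are multiples of 8, and the standard `mmap` module stands in for whatever mapping mechanism a real implementation would use.

```python
import mmap
import os
import struct
import tempfile
import threading


class WindowedBuffer:
    """Hypothetical windowed view of a file, following the _mapproc idea."""

    def __init__(self, path, window_bytes):
        self._file = open(path, "rb")
        self._size = os.path.getsize(path)
        # mmap offsets must be multiples of the allocation granularity.
        gran = mmap.ALLOCATIONGRANULARITY
        self._window = max(gran, (window_bytes // gran) * gran)
        self._map = None
        self._mapmin = 0    # first mapped byte offset (inclusive)
        self._mapmax = 0    # end of the mapped range (exclusive)
        self._lock = threading.Lock()

    def _mapproc(self, offset):
        """Remap so that offset falls inside [_mapmin, _mapmax)."""
        if not 0 <= offset < self._size:
            raise IndexError(offset)
        if self._map is not None:
            self._map.close()                       # unmap the old range
        start = (offset // self._window) * self._window
        length = min(self._window, self._size - start)
        self._map = mmap.mmap(self._file.fileno(), length,
                              access=mmap.ACCESS_READ, offset=start)
        self._mapmin, self._mapmax = start, start + length

    def read_double(self, offset):
        """Read one little-endian double, remapping if it is out of window."""
        with self._lock:    # one thread at a time may move the window
            if not (self._mapmin <= offset and offset + 8 <= self._mapmax):
                self._mapproc(offset)
            return struct.unpack_from("<d", self._map,
                                      offset - self._mapmin)[0]

    def close(self):
        if self._map is not None:
            self._map.close()
        self._file.close()


# Demo: a file three windows long, read at both ends.
gran = mmap.ALLOCATIONGRANULARITY
count = 3 * gran // 8
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<%dd" % count, *range(count)))
    path = f.name

buf = WindowedBuffer(path, gran)
first = buf.read_double(0)
last = buf.read_double((count - 1) * 8)    # forces a remap
buf.close()
os.remove(path)
print(first, last)
```

Holding the lock around both the bounds test and the read is the simplest way to keep two threads from moving the window under each other; a real UFunc would more likely lock once per window-sized chunk than once per element.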
From: Perry G. <pe...@st...> - 2002-04-11 16:01:32
|
Hi Scott, I've printed out your message and will try to read and understand it today. It may be a couple days before we can respond, so don't take a lack of an immediate response as disinterest. Thanks, Perry |
From: Perry G. <pe...@st...> - 2002-04-11 21:56:49
|
> [mailto:num...@li...]On Behalf Of Scott > Gilbert > Subject: [Numpy-discussion] Introduction > > > Hello All. > > I'm interested in this project, and am curious to what level you are > willing to accept outside contribution. I just tried to subscribe to > the developers list, but I didn't realize that required admin approval. > Hopefully it doesn't look like I was shaking the door without knocking > first. > > Is this list active? Is this the correct place to talk about Numarray? Sure. > > Following your design for the Array stuff, I've been able to implement > a pretty usable array class that supports the bazillion array types I > need (Bit, Complex Integer, etc...). This gets me past my core > requirements without polluting your world, but unfortunately my new > XArray type doesn't play so well with your UFuncs. I think my users > will definitely want to use your UFuncs when the time comes, so I want > to remedy this situation. > > The first change I would like to make is to rework your code that > verifies that an object is a "usable" array. I think NumArray should > only check for the interface required, not the actual type hierarchy. > By this I mean that the minimum required to be a supported array type > is that it support the correct attributes, not that it actually inherit > from NDArray: > > (quoting from your paper) something like: > > _data > _shape > _strides > _byteoffset > _aligned > _contiguous > _type > _byteswap > > Most of these are just integer fields, or tuples of integers. Ignoring > _type for the moment, it appears that the interface required to be a > NumArray is much less strict than actually requiring it to derive from > NumArray. If you allow me to change a few functions (inputarray() in > numarray.py is one small example), I could use my independant XArray > class almost as is, and moreover I can implement new array objects > (possibly as extension types) for crazy things like working with page > aligned memory, memory mapping etc... 
> I guess we are not sure we understand what you mean by interface. In particular, we don't understand why sharing the same object attributes (the private ones you list above) is a benefit to the code you are writing if you aren't also using the low level implementation. The above attributes are private and nothing external to the Class should depend on or even know about them. Could you elaborate on what you mean by interface and the relationship between your arrays and numarrays? > > Well, that's almost enough. The _type field poses a small problem of > sorts. It looks like you don't require a _type to be derived from > NumericType, and this is a good thing since it allows me (and others) > to implement NumArray compatible arrays without actually requiring > NumArray to be present. > What do you mean by NumArray compatible? [some issues snipped since we need to understand the interface issue first] > I don't know if you're trying to get all of NumArray into the Python > distribution or not, but I suspect a good interim step would be to have > a PEP that specifies what it means to be a NumArray or NDArray in > minimal terms. Perhaps supplying an Array only module in Python that > implements this interface. Again, I'd be willing to help with all of > this. > We are hoping to get numarray into the distribution [it won't be the end of the world for us if it doesn't happen]. I'll warn you that the PEP is out of date. We are likely to update it only after we feel we are close to having the implementation ready for consideration for including into the standard distribution. I would refer to the actual implementation and the design notes for the time being. > > ------------------------- > > Ok, other suggestions... 
> > Here is the list of things that your design document indicates are > required to be a NumArray: > > _data > _shape > _strides > _byteoffset > _aligned > _contiguous > _type > _byteswap > > I believe that one could calculate the values for _aligned and > _contiguous from the other fields. So they shouldn't really be part of > the interface required. I suspect it is useful for the C > implementation of UFuncs to have this information in the NDINfo struct > though, so while I would drop them from attribute interface, I would > delegate the task of calculating these values to getNDInfo() and/or > getNumInfo(). > > I also notice that you chose _byteswap to indicate byteswapping is > needed. I think a better choice would be to specify the endian-ness of > the data (with an _endian attr), and have getNDInfo() and getNumInfo() > calculte the _byteswap value for the NDInfo struct. > > In my implementation, I came up with a slightly different list: > > self._endian > self._offset > self._shape > self._stride > self._itemtype > self._itemsize > self._itemformat > self._buffer > Some of the name changes are worth considering (like replacing ._byteswap with an endian indicator, though I find _endian completely opaque as to what it would mean--1 means what? little or big?). (BTW, we already have _itemsize). _contiguous and _aligned are things we have been considering changing, but I would have to think about it carefully to determine if they really are redundant. > The only minimal differences are that _itemsize allows me to work with > arrays of bytes without having any clue what the underlying type is (in > some cases, _itemtype is "Unknown".) Secondly, I implemented a > "Struct" _itemtype, and _itemformat is useful for for this case. (It's > the same format string that the struct module in Python uses.) > It looks like you are trying to deal with records with these "structs". We deal with records (efficiently) in a completely different way. 
Take a look at the recarray module. > Also, I specified 0 for _itemsize when the actual items aren't byte > addressable. In my module, this only occurred with the Bit type. I > figured specifying 0 like this could keep a UFunc that isn't Bit aware > from stepping on memory that it isn't allowed to. > Again, we aren't sure how this works with numarray. > ------------------------- > > Next thought: Memory Mapping > > I really like the idea of having Python objects that map huge files a > piece at time without using all of available memory. I've seen this in > NumArray's charter as part of the reason for breaking away from > Numeric, and I'm curious how you intend to address it. > > Right now, the only requirement for _data seems to be that it implement > the PyBufferProcs. For memory mapping something else is needed... > > I haven't implemented this, so take it as just my rambling thoughts: > > With the addition of 3 new, optional, attributes to the NumArray object > interface, I think this could be efficiently accomplished: > > _mapproc > _mapmin > _mapmax > > If _mapproc is present and not None, then it points to a function who's > responsibility it is to set _mapmin and _mapmax appropriately. > _mapproc takes one argument which is the desired byte offset into the > virtual array. This is probably easier to describe with code: > > def _mapproc(self, offset): > unmap_the_old_range() > mmap_a_new_range_that_includes_byteoffset() > self._mapmin = minimum_of_new_range() > self._mapmax = maximum_of_new_range() > > In this way, when the delta between _mapmin and _mapmax is large > enough, the UFuncs could act over a large contiguous portion of the > _data array at a time before another remapping is necessary. If the > byteoffset that a UFunc needs to work with is outside of _mapmin and > _mapmax, it must call _mapproc to remedy the situation. > > This puts a lot of work into UFuncs that choose to support this. I > suppose that is tough to avoid though. 
>

We deal with memory mapping in a completely different way. It's a bit late for me to go into it in great detail, but we wrap the standard library mmap module with a module that lets us manage memory mapped files. This module basically memory maps an entire file and then in effect mallocs segments of that file as buffer objects. This allocation of subsets is needed to ensure that overlapping memory-mapped buffers don't happen. One can basically reserve part of the memory mapped file as a buffer. Once that is done, nothing else can use that part of the file for another buffer. We do not intend to handle memory maps as a way of sequentially mapping parts of the file to provide windowed views as your code segment above suggests. If you want a buffer that is the whole (large) file, you just get a mapped buffer to the whole thing. (Why wouldn't you?)

The above scheme is needed for our purposes because many of our data files contain multiple data arrays and we need a means of creating a numarray object for each one. Most of this machinery has already been implemented, but we haven't released it since our I/O package (for astronomical FITS files) is not yet at the point of being able to use it.

> Also, there are threading issues to think about here. I don't know if UFuncs are going to release the Global Interpreter Lock, but if they do it's possible that multiple threads could have the same PyObject and try to _mapproc different offsets at different times.
>

To tell you the truth, we haven't dealt with the threading issue much. We think about it occasionally, but have deferred dealing with it until we have finished other aspects first. We do want to make it thread safe though.

Perry Greenfield |
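The scheme described in the reply above (map the whole file once, then hand out non-overlapping pieces of it as buffers) could be sketched as below. This is not the unreleased numarray module: `MappedRegion` and `reserve` are hypothetical names, and an anonymous mapping is used in place of a real data file so the example runs on its own.

```python
import mmap


class MappedRegion:
    """Hypothetical manager handing out non-overlapping slices of one mapping."""

    def __init__(self, size):
        self._map = mmap.mmap(-1, size)     # anonymous map stands in for a file
        self._view = memoryview(self._map)
        self._reserved = []                 # (start, stop) byte ranges in use

    def reserve(self, start, nbytes):
        """Return a buffer over [start, start + nbytes), or raise if taken."""
        stop = start + nbytes
        if start < 0 or nbytes <= 0 or stop > len(self._view):
            raise ValueError("range outside the mapping")
        for used_start, used_stop in self._reserved:
            if start < used_stop and used_start < stop:
                raise ValueError("range overlaps an existing buffer")
        self._reserved.append((start, stop))
        return self._view[start:stop]


region = MappedRegion(1024)
a = region.reserve(0, 256)       # e.g. the first array stored in the file
b = region.reserve(256, 256)     # a second array right after it

try:
    region.reserve(200, 100)     # straddles both a and b
    overlap_rejected = False
except ValueError:
    overlap_rejected = True

a[0] = 7                         # writes go straight through to the mapping
print(overlap_rejected, region._view[0], len(b))
```

Each reserved slice supports the buffer protocol, so it could serve as the data buffer of a separate array object, which matches the stated goal of one array per data block in a multi-array file.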
From: Scott G. <xs...@ya...> - 2002-04-12 04:45:56
|
--- Perry Greenfield <pe...@st...> wrote:
>
> I guess we are not sure we understand what you mean by interface. In particular, we don't understand why sharing the same object attributes (the private ones you list above) is a benefit to the code you are writing if you aren't also using the low level implementation. The above attributes are private and nothing external to the Class should depend on or even know about them. Could you elaborate on what you mean by interface and the relationship between your arrays and numarrays?
>

There are several places in your code that check to see if you are working with a valid type for NDArrays. Currently this check consists of asking the following questions:

    'Is it a tuple or list?'
    'Is it a scalar of some sort?'
    'Does it derive from our NDArray class?'

If any of these questions answer true, it does the right thing and moves on. If none of these is true, it raises an exception.

I suppose this is fine if you are only concerned about working with your own implementation of an array type, but I hope you'll consider the following as a minor change that opens up the possibility for other compatible array implementations to work interoperably.

Instead have the code ask the following questions:

    'Is it a tuple or list?'
    'Is it a scalar of some sort?'
    'Does it support the attributes necessary to be like an NDArray object?'

This change is very similar to how you can pass in any Python object to the "pickle.dump()" function, and if it supports the "write()" method it will be called:

    >>> class WhoKnows:
    ...     def write(self, x):
    ...         print x
    >>>
    >>> import pickle
    >>>
    >>> w = WhoKnows()
    >>>
    >>> pickle.dump('some data', w)
    S'some data'
    p1
    .

Until reading your response above, I didn't realize that you consider your single underscore attributes to be totally private. In general, I try to use a single underscore to mean protected (meaning you can use them if you REALLY know what you are doing), hence my confusion.
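The interactive session above is Python 2. For readers on Python 3, roughly the same duck-typing demonstration might look like the sketch below; the only real API used is `pickle.dump` / `pickle.loads`, and the `chunks` attribute is an addition made for this example so the written bytes can be inspected afterwards.

```python
import pickle


class WhoKnows:
    """Not a file object, but it has write(), which is all pickle.dump() needs."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        # Copy the data: depending on the pickler implementation it may be
        # a bytes object or a view over a buffer that is reused later.
        self.chunks.append(bytes(data))


w = WhoKnows()
pickle.dump("some data", w)

# Reassemble what was written and confirm it round-trips.
restored = pickle.loads(b"".join(w.chunks))
print(restored)
```

In Python 3 the `write()` method receives binary data rather than text, so the output is collected instead of printed directly.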
With that in mind, pretend that I suggested the following instead:

The specification of an NDArray is that it has the following attributes:

    ndarray_buffer   - a PyObject which has PyBufferProcs
    ndarray_shape    - a tuple specifying the shape of the array
    ndarray_stride   - a tuple specifying the index multipliers
    ndarray_itemsize - an int/long stating the size of items
    ndarray_itemtype - some representation of type

This would be a very minor change to your functions like inputarray(), getNDInfo(), getNDArray(), but it would allow your UFuncs to work with other implementations of arrays. As an example similar to the pickle example above:

    import array
    class ScottArray:
        def __init__(self):
            self.ndarray_buffer = array.array('d', [0]*100)
            self.ndarray_shape = (10, 10)
            self.ndarray_stride = (80, 8)
            self.ndarray_itemsize = 8
            self.ndarray_itemtype = 'Float64'

    import numarray

    n = numarray.numarray((10, 10), type='Float64')
    s = ScottArray()

    very_cool = numarray.add(n, s)

This example is kind of silly. I mean, why wouldn't I just use numarray for all of my array needs? Well, that's where my world is a little different than yours I think. Instead of using 'array.array()' above, there are times where I'll need to use 'whizbang.array()' to get a different PyBufferProcs supporting object. Or where I'll need to work with a crazy type in one part of the code, but I'd like to pass it to an extension that combines your types and mine.

In these cases where I need "special memory" or "special types" I could try and get you guys to accept a patch, but this would just pollute your project and probably annoy you in general. A better solution is to create a general standard mechanism for implementing NDArray types, and let me make my own.

In the above example, we could have completely different NDArray implementations working interoperably inside of one UFunc. It seems to me that all it really takes to be an NDArray can be specified by a list of attributes like the one above.
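To make the role of those attributes concrete, here is a minimal sketch of how a consumer might locate one element using nothing but the proposed ndarray_* attributes. It is an illustration in present-day Python rather than anything from numarray: `get_item` and `STRUCT_CODES` are hypothetical, and the buffer is filled with 0..99 instead of zeros so that the result is recognisable.

```python
import array
import struct


class ScottArray:
    """The example class from the message, filled with 0..99 for visibility."""

    def __init__(self):
        self.ndarray_buffer = array.array('d', range(100))
        self.ndarray_shape = (10, 10)
        self.ndarray_stride = (80, 8)     # bytes to step per index, per axis
        self.ndarray_itemsize = 8
        self.ndarray_itemtype = 'Float64'


# Hypothetical mapping from type names to struct format codes.
STRUCT_CODES = {'Float64': 'd'}


def get_item(arr, index):
    """Fetch one element using only the ndarray_* attributes."""
    if len(index) != len(arr.ndarray_shape):
        raise IndexError("wrong number of indices")
    offset = 0
    for i, extent, stride in zip(index, arr.ndarray_shape, arr.ndarray_stride):
        if not 0 <= i < extent:
            raise IndexError("index out of range")
        offset += i * stride              # byte offset = sum(index * stride)
    code = STRUCT_CODES[arr.ndarray_itemtype]
    return struct.unpack_from(code, arr.ndarray_buffer, offset)[0]


s = ScottArray()
print(get_item(s, (3, 4)))    # row 3, column 4 of a 10 x 10 layout
```

Nothing in `get_item` depends on the class of the object it is given, which is the point being argued: the shape, strides and buffer are enough to address the data.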
(Probably need a few more attributes to be really general: 'ndarray_endian', etc...) In the end, NDArrays are just pointers to a buffer, and descriptors for indexing.

I don't believe this would have any significant effect on the performance of numarray. (The efficient fast C code still gets a pointer to work with.) Moreover, I'd be very willing to contribute patches to make this happen.

If you agree, and we can flesh out what this "attribute interface" should be, then I can start distributing my own array module to the engineers where I work without too much fear that they'll be screwed once numarray is stable and they want to mix and match.

Code always lives a lot longer than I want it to, and if I give them something now which doesn't work with your end product, I'll have done them a disservice.

BTW: Allowing other types to fill in as NDArrays also allows other types to implement things like slicing as they see fit (slice and copy contiguous, slice and copy on write, slice and copy by reference, etc...).

>
> We are hoping to get numarray into the distribution [it won't be the end of the world for us if it doesn't happen]. I'll warn you that the PEP is out of date. We are likely to update it only after we feel we are close to having the implementation ready for consideration for including into the standard distribution. I would refer to the actual implementation and the design notes for the time being.
>

Yeah, I recognize that the PEP is gathering dust at the moment. I'm not having too much trouble following through the source and design docs. It took me a few days to "get it", but that's probably because I'm slower than your average bear.
:-)

Regarding the PEP, what I would like to see happen is that if we agree that the "attribute interface" stuff above is the right way to go about things, I would (or we would) submit a milder interim PEP specifying what those attributes are, how they are to be interpreted, and a simple Python module implementing a general NDArray class for consumption. Hopefully this PEP would specify a canonical list of type names as well. Then we could make updates to the other PEP if necessary.

>
> Some of the name changes are worth considering (like replacing ._byteswap with an endian indicator, though I find _endian completely opaque as to what it would mean--1 means what? little or big?). (BTW, we already have _itemsize). _contiguous and _aligned are things we have been considering changing, but I would have to think about it carefully to determine if they really are redundant.
>

It's all open for discussion, but I would propose that ndarray_endian be one of:

    '>' - big endian
    '<' - little endian

This is how the standard Python struct module specifies endian, and I've been trying to stay consistent with the baseline when possible.

>
> It looks like you are trying to deal with records with these "structs". We deal with records (efficiently) in a completely different way. Take a look at the recarray module.
>

Will definitely do. I've called them structs simply because they borrow their format string from the struct module that ships with Python. I'm not hung up on the name, and I wouldn't object to an alias. Too early for me to tell if there is even a difference in the underlying memory, but maybe we'll end up with 'structs' for my notion of things, and 'records' for yours.

>
> We deal with memory mapping a completely different way. It's a bit late for me to go into it in great detail, but we wrap the standard library mmap module with a module that lets us manage memory mapped files.
> This module basically memory maps an entire file and then in effect mallocs segments of that file as buffer objects. This allocation of subsets is needed to ensure that overlapping memory maps buffers don't happen. One can basically reserve part of the memory mapped file as a buffer. Once that is done, nothing else can use that part of the file for another buffer. We do not intend to handle memory maps as a way of sequentially mapping parts of the file to provide windowed views as your code segment above suggests. If you want a buffer that is the whole (large) file, you just get a mapped buffer to the whole thing. (Why wouldn't you?)
>

I think the idea of taking a 500 megabyte (or 5 gigabyte) file, and windowing 1 meg of actual memory at a time is pretty attractive. Sometimes we do very large correlations, and there just isn't enough memory to mmap the whole file (much less two files for correlation). Any library that doesn't want to support this business could just raise a NotImplemented error on encountering them.

Maybe I shouldn't be calling this "memory mapping". Even though it could be implemented on top of mmap, truthfully I just want to support a "windowing" interface. If we could specify the windowing attributes and indicate the standard usage that would be great. Maybe:

    ndarray_window(self, offset)
    ndarray_winmin
    ndarray_winmax

>
> The above scheme is needed for our purposes because many of our data files contain multiple data arrays and we need a means of creating a numarray object for each one. Most of this machinery has already been implemented, but we haven't released it since our I/O package (for astronomical FITS files) is not yet at the point of being able to use it.
>

There is a group at my company that is using FITS for some stuff. I don't know enough about it to comment though...

Cheers,

-Scott |
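Two points from the exchange above lend themselves to a short sketch: turning a struct-style '>' / '<' endian code into a byteswap decision, and deriving contiguity from shape, strides and item size rather than storing it. The function names are hypothetical and the contiguity test assumes C (row-major) ordering; numarray's own definitions may differ. Alignment is left out because it depends on the address of the underlying buffer, which these attributes alone do not provide.

```python
import struct
import sys


def needs_byteswap(endian):
    """True if data stored with this endian code differs from the host order."""
    if endian not in ('<', '>'):
        raise ValueError("endian must be '<' or '>'")
    native = '<' if sys.byteorder == 'little' else '>'
    return endian != native


def is_contiguous(shape, strides, itemsize):
    """True if the strides describe a gap-free C-ordered (row-major) layout."""
    if 0 in shape:
        return True                  # no elements, nothing to be non-contiguous
    expected = itemsize
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent == 1:
            continue                 # stride of a length-1 axis is never used
        if stride != expected:
            return False
        expected *= extent
    return True


# The two endian codes describe the same value with the bytes reversed.
big = struct.pack('>d', 1.5)
little = struct.pack('<d', 1.5)

print(is_contiguous((10, 10), (80, 8), 8))   # full 10 x 10 array of doubles
print(is_contiguous((10, 5), (80, 8), 8))    # first 5 columns of each row
print(big[::-1] == little)
```

Under these assumptions the contiguous flag is redundant with the shape and strides, and the byteswap flag can be computed from the endian code and the host byte order.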
From: Perry G. <pe...@st...> - 2002-04-13 00:43:32
|
Scott Gilbert writes:

> import array
> class ScottArray:
>     def __init__(self):
>         self.ndarray_buffer = array.array('d', [0]*100)
>         self.ndarray_shape = (10, 10)
>         self.ndarray_stride = (80, 8)
>         self.ndarray_itemsize = 8
>         self.ndarray_itemtype = 'Float64'
>
> import numarray
>
> n = numarray.numarray((10, 10), type='Float64')
> s = ScottArray()
>
> very_cool = numarray.add(n, s)
>

But why not (I may have some details wrong, I'm doing this from memory, and I haven't worked on it myself in a bit):

    import array
    import numarray
    import memory # comes with numarray

    class ScottArray(numarray.NumArray):
        def __init__(self):
            # create necessary buffer obj
            buf = memory.writeable_buffer(array.array('d', [0]*100))
            numarray.NumArray.__init__(self, shape=(10, 10),
                                       type=numarray.Float64,
                                       buffer=buf)
            # _strides not settable from constructor yet, but currently
            # if you needed to set it:
            # self._strides = (80, 8)
            # But for this case it would be computed automatically from
            # the supplied shape

    n = numarray.numarray((10, 10), type='Float64')
    s = ScottArray()

    maybe_not_quite_so_cool_but_just_as_functional = n + s

> This example is kind of silly. I mean, why wouldn't I just use numarray for all of my array needs? Well, that's where my world is a little different than yours I think. Instead of using 'array.array()' above, there are times where I'll need to use 'whizbang.array()' to get a different PyBufferProcs supporting object. Or where I'll need to work with a crazy type in one part of the code, but I'd like to pass it to an extension that combines your types and mine.
>
> In these cases where I need "special memory" or "special types" I could try and get you guys to accept a patch, but this would just pollute your project and probably annoy you in general. A better solution is to create a general standard mechanism for implementing NDArray types, and let me make my own.
>

From everything I've seen so far, I don't see why you can't just create a NumArray object directly.
You can subclass it (and use multiple inheritance if you need to subclass a different object as well) and add whatever customized behavior you want. You can create new kinds of objects as buffers just so long as you satisfy the buffer interface. > > In the above example, we could have completely different NDArray > implementations working interoperably inside of one UFunc. It > seems to me that > all it really takes to be an NDArray can be specified by a list > of attributes > like the one above. (Probably need a few more attributes to be > really general: > 'ndarray_endian', etc...) In the end, NDArrays are just pointers > to a buffer, > and descriptors for indexing. > Again, why not just create an NDArray object with the appropriate buffer object and attributes (subclassing if necessary). > > I don't believe this would have any significant affect on the > performance of > numarray. (The efficient fast C code still gets a pointer to > work with.) More > over, I'd be very willing to contribute patches to make this happen. > > > If you agree, and we can flesh out what this "attribute > interface" should be, > then I can start distributing my own array module to the > engineers where I work > without too much fear that they'll be screwed once numarray is > stable and they > want to mix and match. > > Code always lives a lot longer than I want it to, and if I give > them something > now which doesn't work with your end product, I'll have done them > a disservice. > All good in principle, but I haven't yet seen a reason to change numarray. As far as I can tell, it provides all you need exactly as it is. If you could give an example that demonstrated otherwise... > > It's all open for discussion, but I would propose that > ndarray_endian be one > of: > > '>' - big endian > '<' - little endian > > This is how the standard Python struct module specifies endian, > and I've been > trying to stay consistant with the baseline when possible. 
>

To tell you the truth, I'm not crazy about how the struct module handles types or attributes. It's generally far too cryptic for my tastes. Other than providing backward compatibility, we aren't interested in having it emulate struct.

>
> >
> > The above scheme is needed for our purposes because many of our data
> > files contain multiple data arrays and we need a means of creating a
> > numarray object for each one. Most of this machinery has already
> > been implemented, but we haven't released it since our I/O package
> > (for astronomical FITS files) is not yet at the point of being able
> > to use it.
> >

I could well misunderstand, but I thought that if you mmap a file in unix in write mode, you do not use up the virtual memory as limited by the physical memory and the paging file. Your only limit becomes the virtual address space available to the processor. If the 32 bit address is your problem, you are far, far better off using a 64-bit processor and operating system than trying to kludge up a windowing memory mechanism. I could see a way of doing it for ufuncs, but the numeric world (and I would think the DSP world as well) needs far more than element-by-element array functionality. Providing a usable C-api for that kind of memory model would be a nightmare. But I'm not sure if this or the page file is your limitation.

Perry
|
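For readers following the mmap point above, here is a minimal sketch of mapping only part of a file rather than the whole thing, using the `offset` argument of Python's standard `mmap` module (written for current Python 3, not the Python of this thread). The helper name `map_window` and the scratch-file demo are made up for illustration; nothing here is numarray code. The one real constraint shown is that the mapping offset has to be a multiple of `mmap.ALLOCATIONGRANULARITY`, so the helper rounds down and reports how many bytes to skip.

```python
import mmap
import os
import struct
import tempfile


def map_window(fileobj, byte_offset, byte_length):
    """Map roughly [byte_offset, byte_offset + byte_length) of an open file.

    Hypothetical helper. Returns (mapping, skip); the first requested
    byte is mapping[skip], because the real offset is rounded down to
    the platform's allocation granularity.
    """
    gran = mmap.ALLOCATIONGRANULARITY
    start = (byte_offset // gran) * gran
    skip = byte_offset - start
    mapping = mmap.mmap(fileobj.fileno(), skip + byte_length,
                        access=mmap.ACCESS_READ, offset=start)
    return mapping, skip


def demo():
    """Write 100000 doubles to a scratch file, then map only ten of them."""
    n = 100000
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack("<%dd" % n, *[float(i) for i in range(n)]))
        with open(path, "rb") as f:
            # Samples 90000..90009 live at byte offset 90000 * 8.
            mapping, skip = map_window(f, 90000 * 8, 10 * 8)
            try:
                values = struct.unpack_from("<10d", mapping, skip)
            finally:
                mapping.close()
    finally:
        os.remove(path)
    return values
```

Only the mapped window counts against the process address space, which is the limit being discussed; whether that is enough for a given workload depends on the access pattern.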
From: Scott G. <xs...@ya...> - 2002-04-13 10:08:25
|
--- Perry Greenfield <pe...@st...> wrote:
> Scott Gilbert writes:
[...]
> >
> > very_cool = numarray.add(n, s)
> >
> But why not (I may have some details wrong, I'm doing this
> from memory, and I haven't worked on it myself in a bit):
>
[...]
>
> maybe_not_quite_so_cool_but_just_as_functional = n + s
>
[...]
>
> From everything I've seen so far, I don't see why you can't
> just create a NumArray object directly. You can subclass it
> (and use multiple inheritance if you need to subclass a different
> object as well) and add whatever customized behavior you want.
> You can create new kinds of objects as buffers just so long
> as you satisfy the buffer interface.
>

Your point about the optional buffer parameter to the NumArray is well taken. I had seen that when looking through the code, but it slipped my mind for that example. I could very well be wrong about some of these other reasons too...

I have a number of reasons listed below for wanting the standard that Python adopts to specify only the interface and not the implementation. You may not find all of these persuasive, and I apologize in advance if any looks like a criticism. (In my limited years as a professional software developer, I've found that the majority of people can be very defensive and protective of their code. I've been trying to tread lightly, but I don't know if I'm succeeding.)

However if any of these reasons is persuasive, keep in mind that the actual changes I'm proposing are pretty minimal in scope. And that I'd be willing to submit patches so as to reduce any inconvenience to you. (Not that you have any reason to believe I can code my way out of a box... :-)

Ok, here's my list:

Philosophical

You have a proposal in to the Python guys to make Numarray into the standard _implementation_. I think standards like this should specify an _interface_, not an implementation.
Simplicity I can give my users a single XArray.py file, and they can be off and running with something that works right then and there, and it could in many ways be compatible with Numarray (with some slight modifications) when they decide they want the extra functionality of extension modules that you or anyone else who follows your standard provides. But they don't have to compile anything until they really need to. Your implementation leaves me with all or nothing. I'll have to build and use numarray, or I've got an in house only solution. Expediency I want to see a usable standard arise quickly. If you maintain the stance that we should all use the Numarray implementation, instead of just defining a good Numarray interface, everyone has to wait for you to finish things enough to get them accepted by the Python group. Your implementation is complicated, and I suspect they will have many things that they will want you to change before they accept it into their baseline. (If you think my list of suggestions is annoying, wait until you see theirs!) If a simple interface protocol is presented, and a simple pure Python module that implements it. The PEP acceptance process might move along quickly, but you could take your time with implementing your code. Pragmatic You guys aren't finished yet, and I need to give my users an array module ASAP. As such a new project, there are likely to be many bugs floating around in there. I think that when you are done, you will probably have a very good library. Moreover, I'm grateful that you are making it open source. That's very generous of you, and the fact that you are tolerating this discussion is definitely appreciated. Still, I can't put off my projects, and I can't task you to work faster. However, I do think we could agree in a very short term that your design for the interface is a good one. I also think that we (or just me if you like) could make a much smaller PEP that would be more readily accepted. 
Then everyone in this community could proceed at their own pace - knowing that if we followed the simple standard we would have interoperability with each other.

Social

Normally I wouldn't expect you to care about any of my special issues. You have your own problems to solve. As I said above, it's generous of you to even offer your source code.

However, you are (or at least were) trying to push for this to become a standard. As such, considering how to be more general and apply to a wider class of problems should be on your agenda. If it's not, then you shouldn't be creating the standard.

If you don't care about numarray becoming standard, I would like to try my hand at submitting the slightly modified version of your design. I won't be compatible with your stuff, but hopefully others will follow suit.

Functionality

Data Types

I have needs for other types of data that you probably have little use for. If I can't coerce you to make a minor change in specification, I really don't think I could coerce you to support brand new data types (complex ints is the one I've beaten to death, because I could use that one in the short term). What happens when someone at my company wants quaternions? I suspect that you won't have direct support for those. I know that numarray is supposed to be extensible, but the following raises an exception:

    from numarray import *

    class QuaternionType(NumericType):
        def __init__(self):
            NumericType.__init__(self, "Quaternion", 4*8, 0)

    Quaternion = QuaternionType() # BOOM!

    q = array(shape=(10, 10), type=Quaternion)

Maybe I'm just doing something wrong, but it looks like your code wants "Quaternion" to be in your (private?) typeConverters dictionary.

Ok, try two:

    from numarray import *

    q = NDArray(shape=(10, 10), itemsize=4*8)

    if q[5][5] is None:
        print("No boom, but what can I do with it?")

Maybe this is just a documentation problem.
On the other hand, I can do the following pretty readily:

    import array
    class Quat2D:
        def __init__(self, *shape):
            assert len(shape) == 2
            self._buffer = array.array('d', [0])*shape[0]*shape[1]*4
            # strides counted in doubles: one row holds shape[1] quaternions
            self._shape, self._stride = tuple(shape), (4*shape[1], 4)
            self._itemsize = 4*8

        def __getitem__(self, sub):
            assert isinstance(sub, tuple) and len(sub) == 2
            offset = sub[0]*self._stride[0] + sub[1]*self._stride[1]
            return tuple([self._buffer[offset + i] for i in range(4)])

        def __setitem__(self, sub, val):
            assert isinstance(sub, tuple) and len(sub) == 2
            offset = sub[0]*self._stride[0] + sub[1]*self._stride[1]
            for i in range(4): self._buffer[offset + i] = val[i]
            return val

    q = Quat2D(10, 10)
    q[5, 5] = (1, 2, 3, 4)
    print(q[5, 5])

This isn't very general, but it is short, and it makes a good example.

If they get half of their data from calculations using Numarray, and half from whatever I provide them, and then try to mix the results in an extension module that has to know about separate implementations, life is more complicated than it should be.

Operations

I'm going to have to write my own C extension modules for some high performance operations. All I need to get this done is a void* pointer, the shape, stride, itemsize, itemtype, and maybe some other things to get off and running. You have a growing framework, and you have already indicated that you think of your hidden variables as private. I don't think I or my users should have to understand the whole UFunc framework and API just to create an extension that manipulates a pointer to an array of doubles.

Arrays are simpler than UFuncs. I consider them to be pretty separable parts of your design. If you keep it this way, and it becomes the standard, it seems that I and everyone else will have to understand both parts in order to create an extension module.

Flexibility

Numarray is going to make a choice of how to implement slicing. My guess is that it will be one of "copy contiguous", "copy on write", "copy by reference".
I don't know what the correct choice is, but I know that someone else will need something different based on context. Things like UFuncs and other extension modules that do fast C level calculations typically don't need to concern themselves with slicing behaviour.

Design

Your implementation would be similar to having the 'pickle' module require you to derive from a 'Pickleable' base class - instead of simply providing __getstate__ and __setstate__ methods.

It's an artificial constraint, and those are usually bad.

>
> All good in principle, but I haven't yet seen a reason to change
> numarray. As far as I can tell, it provides all you need exactly
> as it is. If you could give an example that demonstrated otherwise...
>

Maybe you're right. I suspect you as the author will come up with the quick example that shows how to implement my bizarre quaternion example above. I'm not sure if this makes either of us right or wrong, but if you're not buying any of this, then it's probably time for me to chalk this up to a difference in opinion and move on.

Truthfully this is taking me pretty far from my original tack. Originally I had simply hoped to hack a couple of things into arraymodule.c, and here I am now trying to get a simpler standard in place. I'll try one last time to convince you with the following two statements:

  - Changing such that you only require the interface is a subtle, but noticeable, improvement to your otherwise very good design.

  - It's not a difficult change.

If that doesn't compel you, at least I can walk away knowing I tried. For the volumes I've written, this will probably be my last pesky message if you really don't want to budge on this issue.

>
> To tell you the truth, I'm not crazy about how the struct module
> handles types or attributes. It's generally far too cryptic for
> my tastes. Other than providing backward compatibility, we aren't
> interested in it emulating struct.
>

I consider it a lot like regular expressions.
I cringe when I see someone else's, but I don't have much difficulty putting them together.

The alternative of coming up with a different specifier for records/structs is probably a mistake now that the struct module already has its (terse) format specification. Once that is taken into consideration, following all the leads of the struct module makes sense to me.

>
> I could well misunderstand, but I thought that if you mmap a file
> in unix in write mode, you do not use up the virtual memory as
> limited by the physical memory and the paging file. Your only
> limit becomes the virtual address space available to the processor.
>

Regarding efficiency, it depends on the implementations, which vary greatly, and there are other subtleties. I've already written a book above, so I won't tire you with details. I will say that closing a large memory mapped file on top of NFS can be dreadful. It probably takes the same amount of total time or less, but from an interactive analysis point of view it's pretty unpleasant on Tru64 at least.

Also, just mmaping the whole file puts all of the memory use at the discretion of the OS. I might have a gig or two to work with, but if mmap takes them all, other threads will have to contend for memory. The system (application) as a whole might very well run better if I can retain some control over this.

I'm not married to the windowing suggestion. I think it's something to consider, but it might not be a common enough case to try and make a standard mechanism for. If there isn't a way to do it without a kluge, then I'll drop it. Likewise if a simple strategy can't meet anyone's real needs.

>
> If the 32 bit address is your problem, you are far, far better off
> using a 64-bit processor and operating system than trying to kludge up
> a windowing memory mechanism.
>

We don't always get to specify what platform we want to run on. Our customer has other needs, and sometimes hardware support for exotic devices dictates what we'll be using.
Frequently it is on 64 bit Alphas, but sometimes the requirement is x86 Linux, or 32 bit Solaris. Finally, our most frustrating piece of legacy software was written in Fortran assuming you could stuff a pointer into an INT*4 and now requires the -taso flag to the compiler for all new code (which turns a sexy 64 bit Alpha into a 32 bit kluge...). Also, much of our data comes on tapes. It's not easy to memory map those. > > I could see a way of doing it for > ufuncs, but the numeric world (and I would think the DSP world > as well) needs far more than element-by-element array functionality. > providing a usable C-api for that kind of memory model would be > a nightmare. But I'm not sure if this or the page file is your > limitation. > I would suggest that any extension module which is not interested in this feature simply raise a NotImplemented exception of some sort. UFuncs could fall into this camp without any criticism from me. All it would have to do is check if the 'window_get' attribute is a callable, and punt an exception. My proposal wasn't necessarily to map in a single element at a time. If the C extension was willing to work these beasts at all, it would check to see if the offset it wanted was between window_min and window_max. If it wasn't, then it would call ob.window_get(offset), and the Python object could update window_min and window_max however it sees fit. For instance by remapping 10 or 20 megabytes on both sides. This particular implementation would allow us to do correlations of a small (mega sample) chunk of data against a HUGE (giga sample) file. This might be the wrong interface, and I'm willing to listen to a better suggestion. It might also be too special of a need to detract from a simpler overall design. Also, there are other uses for things like this. It could possibly be used to implement sparse arrays. 
It's probably not the best implementation of that, but it could hide a dict of set data points, and present it to an extension module as a complete array. Cheers, -Scott Gilbert |
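The windowing idea in this message can be sketched in pure Python. This is only an illustration of the protocol as described (the consumer checks `window_min`/`window_max` and calls `window_get(offset)` when the offset it wants is outside the window); the class name `WindowedSource`, the `window` attribute, the `_read` stand-in, and the choice to treat `window_max` as exclusive are all assumptions, not part of any proposal text. Written for current Python 3.

```python
import array


class WindowedSource:
    """Hypothetical object that exposes one window of a large data set."""

    def __init__(self, total, window_size):
        self.total = total              # samples in the full data set
        self.window_size = window_size
        self.window_min = 0             # first offset currently available
        self.window_max = 0             # one past the last available offset
        self.window = array.array('d')  # the samples currently in memory
        self.loads = 0                  # how often the window was moved

    def _read(self, start, stop):
        # Stand-in for a tape or file read: sample i simply has value i.
        return array.array('d', [float(i) for i in range(start, stop)])

    def window_get(self, offset):
        """Move the window so that `offset` falls inside it."""
        if not 0 <= offset < self.total:
            raise IndexError(offset)
        half = self.window_size // 2
        start = max(0, min(offset - half, self.total - self.window_size))
        stop = min(self.total, start + self.window_size)
        self.window = self._read(start, stop)
        self.window_min, self.window_max = start, stop
        self.loads += 1


def windowed_sum(ob, start, stop):
    """Consumer side: ask for a new window only when stepping outside it."""
    total = 0.0
    for offset in range(start, stop):
        if not ob.window_min <= offset < ob.window_max:
            ob.window_get(offset)
        total += ob.window[offset - ob.window_min]
    return total
```

A consumer that does not want to support such objects could check whether `window_get` is callable and raise an exception, as suggested above for UFuncs.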
From: Perry G. <pe...@st...> - 2002-04-14 01:42:06
|
> Ok, here's my list: > > Philosophical > > You have a proposal in to the Python guys to make Numarray into the > standard _implementation_. I think standards like this should specify > an _interface_, not an implementation. > Sure (though there is often more to a standard than just an interface, but certainly an implementation is generally not the standard). I'm not sure why you think we imply the implementation is the standard. We are waiting to rewrite the PEP when we are closer to having the implementation ready, but we've been very open about the design and have asked for input on it for a long time now. > Simplicity > > I can give my users a single XArray.py file, and they can be off and > running with something that works right then and there, and it could in > many ways be compatible with Numarray (with some slight modifications) > when they decide they want the extra functionality of extension modules > that you or anyone else who follows your standard provides. But they > don't have to compile anything until they really need to. > > Your implementation leaves me with all or nothing. I'll have to build > and use numarray, or I've got an in house only solution. > Hard to comment on this. > Expediency > > I want to see a usable standard arise quickly. If you maintain the > stance that we should all use the Numarray implementation, instead of > just defining a good Numarray interface, everyone has to wait for you > to finish things enough to get them accepted by the Python group. Your > implementation is complicated, and I suspect they will have many things > that they will want you to change before they accept it into their > baseline. (If you think my list of suggestions is annoying, wait until > you see theirs!) > I have the strong sense you misunderstand how the process works. Guido will be driven in large part by the acceptance or non-acceptance of the Numeric community. If they don't buy into it. It won't be part of the standard. 
If it won't be used by many, it won't be part of the standard. Yes, he will review the design and interface to see if there should be a long term commitment by the Python maintainers to have it in the standard library. We have sent him the design documents, and we do keep him informed. He has given us feedback about it. But for the most part, the judgement is going to be by the Numeric community. > If a simple interface protocol is presented, and a simple pure Python > module that implements it. The PEP acceptance process might move along > quickly, but you could take your time with implementing your code. > > Pragmatic > > You guys aren't finished yet, and I need to give my users an array > module ASAP. As such a new project, there are likely to be many bugs > floating around in there. I think that when you are done, you will > probably have a very good library. Moreover, I'm grateful that you are > making it open source. That's very generous of you, and the fact that > you are tolerating this discussion is definitely appreciated. > > Still, I can't put off my projects, and I can't task you to > work faster. > > > However, I do think we could agree in a very short term that your design > for the interface is a good one. I also think that we (or just > me if you > like) could make a much smaller PEP that would be more readily accepted. > Then everyone in this community could proceed at their own pace > - knowing > that if we followed the simple standard we would have inter operability > with each other. > I think we still don't understand what you need yet. More elaboration on that later. > Social > > Normally I wouldn't expect you to care about any of my special issues. > You have your own problems to solve. As I said above, it's generous of > you to even offer your source code. > > However, you are (or at least were) trying to push for this to become a > standard. 
As such, considering how to be more general and apply to a > wider class of problems should be on your agenda. If it's not, then you > shouldn't be creating the standard. > Pleeease. Just because a library developer doesn't happen to meet your needs doesn't mean it can't be part of the standard library. There are plenty of modules in the standard library that could have been made more general in some way, but there they are. The criteria is whether it solves problems for a large community of users, not that it is infinitely extensible or so on. Software development is full of trade-offs and that includes limits to generalization. Sure we can discuss whether things could be made more general or not. But because you want it more general doesn't mean we just say "Sure, you define everything!" > If you don't care about numarray becoming standard, I would like to try > my hand at submitting the slightly modified version of your design. I > won't be compatible with your stuff, but hopefully others will follow > suit. > You are free to propose your own standard at any time. No one will stop you from doing so. > Functionality > > Data Types > > I have needs for other types of data that you probably have little use > for. If I can't coerce you to make a minor change in specification, I > really don't think I could coerce you to support brand new data types > (complex ints is the one I've beaten to death, because I > could use that > You are right on complex ints (that we won't consider them). One could take numarray and add them if one wanted and have a more extended version. But we won't do it, and we wouldn't support as being in what we maintain. It's one of those trade offs. > one in the short term). What happens when someone at my company wants > quaternions? I suspect that you won't have direct support for those. 
> I know that numarray is supposed to be extensible, but the following
> raises an exception:
>
>     from numarray import *
>
>     class QuaternionType(NumericType):
>         def __init__(self):
>             NumericType.__init__(self, "Quaternion", 4*8, 0)
>
>     Quaternion = QuaternionType() # BOOM!
>
>     q = array(shape=(10, 10), type=Quaternion)
>
> Maybe I'm just doing something wrong, but it looks like your code
> wants "Quaternion" to be in your (private?) typeConverters dictionary.
>

Yep, and there's a good reason for that. Just spend a few minutes thinking about the role types play with array packages and how they have traditionally been implemented. Generally speaking, it is presumed that any two numeric types may be used in a binary operator. So you, Scott, define your special type, Quaternions. You will need to provide the module all the machinery for knowing what to do with all the other numeric types available. You may not care, but it is a requirement that numarray (and Numeric) know what to do. If that doesn't fit in with your needs, then you shouldn't be trying to use it.

The problem is worse than that. You supply a Quaternion type extension to numarray, and Bob supplies a super long int type (64 bytes!) also. Both of you have gone to the trouble of giving numarray the means of handling all other default numarray types. But you don't know how to handle each other. How do you solve that problem? I don't know. If you do, let us know. Given the requirements, adding new numeric types is not going to allow independent extensions to work with each other. That's fairly limiting, but that's the price that is paid for the feature.

> Ok, try two:
>
>     from numarray import *
>
>     q = NDArray(shape=(10, 10), itemsize=4*8)
>
>     if q[5][5] is None:
>         print("No boom, but what can I do with it?")
>
> Maybe this is just a documentation problem.
> On the other hand, I can do the following pretty readily:
>
>     import array
>     class Quat2D:
>         def __init__(self, *shape):
>             assert len(shape) == 2
>             self._buffer = array.array('d', [0])*shape[0]*shape[1]*4
>             # strides counted in doubles: one row holds shape[1] quaternions
>             self._shape, self._stride = tuple(shape), (4*shape[1], 4)
>             self._itemsize = 4*8
>
>         def __getitem__(self, sub):
>             assert isinstance(sub, tuple) and len(sub) == 2
>             offset = sub[0]*self._stride[0] + sub[1]*self._stride[1]
>             return tuple([self._buffer[offset + i] for i in range(4)])
>
>         def __setitem__(self, sub, val):
>             assert isinstance(sub, tuple) and len(sub) == 2
>             offset = sub[0]*self._stride[0] + sub[1]*self._stride[1]
>             for i in range(4): self._buffer[offset + i] = val[i]
>             return val
>
>     q = Quat2D(10, 10)
>     q[5, 5] = (1, 2, 3, 4)
>     print(q[5, 5])
>
> This isn't very general, but it is short, and it makes a good example.
>

I'm not sure what it proves. If all you need is an array to store some kind of type, be able to index and slice it, and not provide numeric operations, by all means use the existing array module, it does that fine. It's more work to subclass NDArray, but it can do it too, and gives you more capabilities (you won't be able to use index arrays or broadcasting in the array module for example). The extra functionality comes at some price. Sure, it isn't as simple to extend. It's your choice if it is worth it or not. If you want to add your large quaternion array efficiently, then the array module is worthless. Your example shows nothing about what your real needs for the object are.

> If they get half of their data from calculations using Numarray, and
> half from whatever I provide them, and then try to mix the results in
> an extension module that has to know about separate implementations,
> life is more complicated than it should be.
>

It's how you intend to 'mix' these that I have no clue about.

> Operations
>
> I'm going to have to write my own C extension modules for some high
> performance operations.
> All I need to get this done is a void* pointer, the shape, stride,
> itemsize, itemtype, and maybe some other things to get off and
> running. You have a growing framework, and you have already indicated
> that you think of your hidden variables as private. I don't think I
> or my users should have to understand the whole UFunc framework and
> API just to create an extension that manipulates a pointer to an
> array of doubles.
>

Sigh. No one said you had to understand the ufunc framework to do so. We are working on a C API that just gives you a simple pointer (it's actually available now, but we aren't going to tout it until we have better documentation).

> Arrays are simpler than UFuncs. I consider them to be pretty
> separable parts of your design. If you keep it this way, and it
> becomes the standard, it seems that I and everyone else will have to
> understand both parts in order to create an extension module.
>

Wrong.

> Flexibility
>
> Numarray is going to make a choice of how to implement slicing. My
> guess is that it will be one of "copy contiguous", "copy on write",
> "copy by reference". I don't know what the correct choice is, but I
> know that someone else will need something different based on
> context. Things like UFuncs and other extension modules that do fast
> C level calculations typically don't need to concern themselves with
> slicing behaviour.
>

And they don't.

> Design
>
> Your implementation would be similar to having the 'pickle' module
> require you to derive from a 'Pickleable' base class - instead of
> simply providing __getstate__ and __setstate__ methods.
>
> It's an artificial constraint, and those are usually bad.
>

You say. You are quite welcome to do your own implementation that doesn't have this 'artificial' constraint.

After all your text I *still* don't understand how you intend to use the 'interface' of the private attributes.
You haven't provided any example (let alone a compelling one) of why we should accept any object that provides those attributes. Shouldn't the object also provide all the public methods? Shouldn't it also provide indexing and so forth? All in all you are talking about checking quite a few attributes to make sure the object has the interface. And even if it does, *why* in the world would we presume that the C functions used by numarray would work properly with the object you provide? I really don't have a clue as to what you are getting at here, and without some real concrete example illustrating this point, I don't think there is any point to continuing this discussion.

> >
> > All good in principle, but I haven't yet seen a reason to change
> > numarray. As far as I can tell, it provides all you need exactly
> > as it is. If you could give an example that demonstrated otherwise...
> >
>
> Maybe you're right. I suspect you as the author will come up with the
> quick example that shows how to implement my bizarre quaternion example
> above. I'm not sure if this makes either of us right or wrong, but if
> you're not buying any of this, then it's probably time for me to chalk
> this up to a difference in opinion and move on.
>
> Truthfully this is taking me pretty far from my original tack.
> Originally I had simply hoped to hack a couple of things into
> arraymodule.c, and here I am now trying to get a simpler standard in
> place. I'll try one last time to convince you with the following two
> statements:
>
>   - Changing such that you only require the interface is a subtle,
>     but noticeable, improvement to your otherwise very good design.
>
>   - It's not a difficult change.
>
>
> If that doesn't compel you, at least I can walk away knowing I tried.
> For the volumes I've written, this will probably be my last pesky
> message if you really don't want to budge on this issue.
>

We're not going to budge until you show us what the hell you are talking about.
> > The alternative of coming up with a different specifier for > records/structs > is probably a mistake now that the struct module already has it's (terse) > format specification. Once that is taken into consideration, > following all > the leads of the struct module makes sense to me. > Again, you are free to do your own, or fork our numarray and do it the way you want. Or do your own from scratch. Or whatever. > [...] > Also, just mmaping the whole file puts all of the memory use at the > discretion of the OS. I might have a gig or two to work with, but if mmap > takes them all, other threads will have to contend for memory. The system > (application) as a whole might very well run better if I can retain some > control over this. > > > I'm not married to the windowing suggestion. I think it's something to > consider, but it might not be a common enough case to try and make a > standard mechanism for. If there isn't a way to do it without a kluge, > then I'll drop it. Likewise if a simple strategy can't meet anyone's real > needs. > You can forget our doing it. It's out of the question for us. > > > > If the 32 bit address is your problem, you are far, far better off > > using a 64-bit processor and operating system than trying to kludge up > > a windowing memory mechanism. > > > > We don't always get to specify what platform we want to run on. Our > customer has other needs, and sometimes hardware support for > exotic devices > dictate what we'll be using. Frequently it is on 64 bit Alphas, but > sometimes the requirement is x86 Linux, or 32 bit Solaris. > > Finally, our most frustrating piece of legacy software was written in > Fortran assuming you could stuff a pointer into an INT*4 and now requires > the -taso flag to the compiler for all new code (which turns a sexy 64 bit > Alpha into a 32 bit kluge...). > You may have customers with unreasonable demands. We don't have to let them cause an incredible complication in the underlying machinery. 
(And we won't). And we won't make it work on Windows 3.1 either. We have to draw the line somewhere. Your customers will pay dearly (and you will benefit :-). > Also, much of our data comes on tapes. It's not easy to memory map those. > Your point being? > > > [...] This doesn't seem to be going anywhere. If you can give us a better idea of how your interface needs would be used, at least we could respond to the specific issues. But we don't understand and although we are considering some changes, I'm not going to fold in your requests until we do understand. You may not be happy with the progress we are making either. Sorry, I can't help that. If you need something sooner, you'll need to do something else. Come up with your own system and try to get it into Python. Take numarray and do it the way you think it ought to be done and at the rate you think it should be done. You're welcome to. Take the array module and use that as a basis. We'd like numarray to be part of the standard. We'd like it to be the standard package in the Numeric community. But if neither happened, we'd still be working on it. We need it for our own work. Numeric doesn't give us the capabilities that we need. We are using it for our software development and it is being used to reduce HST data now. We are continuing on this regardless. Perry |
From: Paul F D. <pa...@pf...> - 2002-04-14 02:34:19
|
I haven't been following this discussion (I have a product release on Monday). But I am getting a lot of mail stacking up for numpy-developers which will not go through unless you are one of the registered developers mailing from your registered mail account. All others, please do not use numpy-developers. This is a private channel for the official developers only. I gather from my brief reading that someone is looking for a standard to use now. That standard is Numeric. If you go with that now, then when the time comes to switch to Numarray, you'll be in the same boat as the whole community and therefore likely to profit from any conversion tools required. You can reduce your problems to a minimum by sticking with the Python interface where possible. If you have some special need that Numeric is not meeting, please realize that what exists is a consensus product after a long evolution and it is not likely to change much to meet your particular needs. There are some areas where what is right for one set of people is wrong for the others. |
From: Scott G. <xs...@ya...> - 2002-04-14 11:19:12
|
Perry, I've been trying to be persuasive, but I think all I've managed to do is to be verbose and annoy you. Please accept my apologies. I really am sorry this is going as poorly as it is. I'm doing a lousy job of getting my point across, and I'd like to turn around the tone this has taken. Email always comes off as more antagonistic than intended. Finally, my appeal to the fact that you are proposing a standard was heavy handed. I guess I was trying to use that to force you to consider my position. It clearly backfired... I'll try to be more to the point. Here's what I'm proposing, and it's only a suggestion. *** I think the requirements for being a general purpose "NDArray" can be specified with only the following attributes: __array_buffer__ - as buffer object __array_shape__ - as tuple of long __array_itemsize__ - as int Optionally __array_stride__ - as tuple of long (get from shape if None) __array_offset__ - as int (would default to 0 if not present) Then anyone who implemented these could work with the same C API for getting the pointer to memory, shape array, stride array, and item size. The set of operations on a pure "NDArray" is probably pretty minimal (reshape, transpose/rotate, index arrays?). So in order to create a full featured "NumArray", a few more attributes are required: __array_itemtype__ - as string? Optionally __array_endian__ - as 1 char string? (default to the native endian) This brings the total up to 4 required attributes, and 3 optional ones for a very general purpose array data structure. (I can think of other optional ones, but skip that for now.) > > All in all you are talking about checking quite a few attributes > to make sure the object has the interface. And even if it does, > *why* in the world would we presume that the C functions used by > numarray would work properly with the object you provide. > Because truthfully arrays are little more than a pointer to memory. 
That's like asking "why in the world would we presume memcpy() or qsort() would know what to do with your memory?" > > You haven't provided any example (let > alone a compelling one) of why we should accept any object that > provides those attributes. > Well, the UFuncs certainly should reject any object that they don't know how to handle. I'm currently only addressing what it takes to be an NDArray/NumArray object. OTOH, if I can present something to the UFuncs that looks like a known array type, why wouldn't UFuncs want to work with it? Ok, so what does this buy you? Well, it probably doesn't buy you personally very much. Your needs are already being met by the current implementation. Ok, so what does this cost you? A few translations: _data -> __array_buffer__ _shape -> __array_shape__ _strides -> __array_stride__ _itemsize -> __array_itemsize__ _offset -> __array_offset__ _type -> __array_type__ _byteswap -> __array_endian__ This isn't a style criticism. I'm not just asking you to change your names, I'm asking to promote the names to be a "standard interface" much like these things are in many places in Python. Also requires some small changes to getNDInfo() and getNumInfo() so that they can calculate the derived fields (contiguous, aligned, etc...). Also requires some changes to your scripts so that it checks for the interface rather than the inheritance. What are the benefits to anyone else? - Describes how anyone could implement something that looks and acts like NDArrays or NumArrays. There are probably a lot of reasons to want to do this. I have some reasons that I don't think you value too much. I think others would have reasons which I can't imagine too. - Allows one standard API for getting at the basics of NDArrays/NumArrays - Allows anyone to easily implement other data types for NumArrays. The typecode won't match any of your builtin types, but maybe other third parties could agree on other typecodes for their crazy needs and share modules. 
- Allows me personally to distribute a separate (and simpler) implementation of NDArrays/NumArrays right now and have the same data objects work with yours when you're all done. If I give the UFuncs a pointer to memory, and the attributes above, why shouldn't it work correctly? > > We're not going to budge until you show us what the hell you are talking > about. > Am I doing any better? I am trying. > > You are right on complex ints (that we won't consider them). One > could take numarray and add them if one wanted and have a more > extended version. But we won't do it, and we wouldn't support as > being in what we maintain. It's one of those trade offs. > Is there a way, today, without modifying numarray, for me to use numarray as a holder for these esoteric data types? Is that way difficult? Could it be easier? I'm not asking numarray to know about my types in it's core baseline. I'm wondering what it takes to implement new types at all. > > Your example shows nothing about what your > real needs for the object are. > My real needs are all over the place. Some of which you've shown me are solvable with the current implementation of numarray. Some of which you've not addressed or said you won't address. To be explicit: Here are (at least most of) my _needs_ for array objects: - support a wide variety of data types (user defined) - have efficient storage - support the pickle interface for serialization - allow alternate sources of underlying memory - have an easy interface for accessing the pieces necessary to create C extensions (buffer, shape, stride, ...) - completed and reliable in the near term Here are (at least some of) my _wants_ for array objects: - cooperate on some level with other standard array modules (once the standard is set) - have same API for accessing the pieces (buffer, shape, stride, ...) as all standard array modules will. 
- implementation in pure Python so that building extension modules is not required until the fast operations present in those modules is required. - implemented from a standard that is as good as it can be Here are (at least some of) my _whims_ for array objects: - has "windowing" functionality to work efficiently with really large files (on any modern platform). - alternate implementations for things such as "slicing behaviour" (copy on write, reference). Loosely following your design, I've already written a module that meets my "needs", I was hoping that we could cooperate towards filling in some of my "wants" (cooperating array modules), and I've brought up my "whims" because I thought they were interesting possibilities for discussion. I was going to respond to some of your other remarks, but I've probably wasted enough of your time. If you don't respond to this message, I'll take that as a sign that we just aren't going to see eye to eye on any of this, and I won't bother you any more. (I'll be half surprised if you even get this message. From the tone of your last one, I wouldn't be shocked to find out you've already added me to your killfile. :-) No hard feelings, -Scott Gilbert __________________________________________________ Do You Yahoo!? Yahoo! Tax Center - online filing with TurboTax http://taxes.yahoo.com/ |
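[Editor's sketch] The attribute protocol proposed in the message above can be sketched in pure Python. The attribute names (__array_buffer__, __array_shape__, __array_itemsize__, __array_stride__, __array_offset__) are the ones from the proposal. Everything else is an assumption made for illustration: the helper names default_strides and get_nd_info, the class ScottArray, and the choice of row-major (C-contiguous) byte strides as the default derived from the shape. This is not numarray's getNDInfo().

```python
REQUIRED = ("__array_buffer__", "__array_shape__", "__array_itemsize__")


def default_strides(shape, itemsize):
    """Byte strides for a C-contiguous (row-major) layout (assumed default)."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def get_nd_info(obj):
    """Check for the interface (not the inheritance) and fill in defaults."""
    missing = [name for name in REQUIRED if not hasattr(obj, name)]
    if missing:
        raise TypeError("not an NDArray-like object, missing: " + ", ".join(missing))
    shape = tuple(obj.__array_shape__)
    itemsize = obj.__array_itemsize__
    strides = getattr(obj, "__array_stride__", None)
    if strides is None:
        strides = default_strides(shape, itemsize)  # "get from shape if None"
    offset = getattr(obj, "__array_offset__", 0)    # "default to 0 if not present"
    return memoryview(obj.__array_buffer__), shape, tuple(strides), itemsize, offset


class ScottArray:
    """Hypothetical third-party array exposing only the required attributes."""

    def __init__(self, buffer, shape, itemsize):
        self.__array_buffer__ = buffer
        self.__array_shape__ = shape
        self.__array_itemsize__ = itemsize
```

Any object that carries the three required attributes passes the check, regardless of its class, which is the point of checking the interface rather than the inheritance.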
From: Perry G. <pe...@st...> - 2002-04-14 18:54:05
|
Hi Scott, Just to be to the point, I'm still missing what I've been asking for, to wit a concrete example that illustrates your point. I'll try to address a few of your points that appear to try to answer that and clarify what I mean by concrete example. > > Here's what I'm proposing, and it's only a suggestion. > > > *** I think the requirements for being a general purpose "NDArray" > can be specified with only the following attributes: > > __array_buffer__ - as buffer object > __array_shape__ - as tuple of long > __array_itemsize__ - as int > > Optionally > __array_stride__ - as tuple of long (get from shape if None) > __array_offset__ - as int (would default to 0 if not present) > > Then anyone who implemented these could work with the same C API for > getting the pointer to memory, shape array, stride array, and item size. > Then you are talking about standardizing a C-API. But I'm still confused. If you write a class that implements these attributes, is it your C-API that uses them, or do you mean our C-API uses them? If you have your own C-API, then the attributes are not relevant as an interface. If you intend to use our C-API to access your objects, then they are. But if you want to use our C-API, that still doesn't explain why the alternatives aren't acceptable (namely subclassing). > > Because truthfully arrays are little more than a pointer to memory. > > That's like asking "why in the world would we presume memcpy() or > qsort() would know what to do with your memory?" > Then you misunderstand Numarray. Numarrays are far more than just a pointer to memory. You can get a pointer to memory from them, but they entail much more than that. Numarray presumes that certain things are possible with NumArray objects (like standard math operations). If you want something that doesn't make such an assumption, you should be using NDArray instead. NDArray makes no presumptions about the contents of the memory other than they are arranged in memory in array fashion. 
> > > > > You haven't provided any example (let > > alone a compelling one) of why we should accept any object that > > provides those attributes. > > > > Well, the UFuncs certainly should reject any object that they don't > know how to handle. I'm currently only addressing what it takes to be > an NDArray/NumArray object. OTOH, if I can present something to the > UFuncs that looks like a known array type, why wouldn't UFuncs > want to work with it? > If you are presenting numarray with a type it already knows about, why aren't you subclassing it? If you present numarray an object with a type it doesn't know about, then that is pointless. Types and numarray are inextricably intertwined, and shall remain so. > > - Allows me personally to distribute a separate (and simpler) > implementation of NDArrays/NumArrays right now and have the same data > objects work with yours when you're all done. If I give the UFuncs a > pointer to memory, and the attributes above, why shouldn't it work > correctly? > > > Am I doing any better? I am trying. > Not really. More on that later. > > > Is there a way, today, without modifying numarray, for me to use > numarray as a holder for these esoteric data types? Is that way > difficult? > Could it be easier? > No to the first, it isn't intended to serve that purpose. If you just need something to blindly hold values without doing anything with them use NDArray (and you can add whatever customization you wish regarding what methods or operators are available). > I'm not asking numarray to know about my types in it's core baseline. I'm > wondering what it takes to implement new types at all. > It's possible to extend (but not in any way that makes it automatically usable with anyone else's extension). Currently that sort of extension would not be hard for someone who knows how things work. We haven't documented how to do so, and won't for a while. It's not a high priority for us now. 
********************************************************** What I want to see is a specific example. I'm not going to pay much attention to generalities because I'm still unclear about how you intend to do what you say you will do. Perhaps I'm slow, but I still don't get it. On the one hand, you ask us to have numarray accept objects with the same 'interface'. Well, if they are not of an existing supported type, that's pointless since numarray won't work properly with them. If it is an existing type, you haven't explained why you can't use numarray directly (or alternatively, create a numarray object that uses the same buffer yours does). I still haven't seen a specific example that illustrates why you cannot use subclassing or an instance of a numarray object instead. If you need to add a new type that's possible but you'll have to spend some time figuring out how to do that for your own extended version. If you just want to use arrays to hold values (of new types), then use NDArray. It doesn't care about types. But please give a specific case. E.g., "I want complex ints and I will develop a class that will use this to do the following things [it doesn't have to be exhaustive or complete, but include just enough to illustrate the point]. If the attributes were standardized then I would do this and that, and use it with your stuff like this showing you the code (and the behavior I expect)." Given this I can either show you an alternate solution or I can realize why you are right and we can discuss where to go from there. Otherwise you are wasting your time. Perry |
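[Editor's sketch] The alternative mentioned above, creating a second array object that uses the same buffer as an existing one, can be illustrated with nothing but the standard library. This is not numarray's API: the class TypedView is a hypothetical stand-in, and memoryview.cast is used only to show that two objects can share one block of memory without copying.

```python
class TypedView:
    """Hypothetical typed wrapper around a buffer owned by someone else."""

    def __init__(self, buffer, typecode):
        # cast() reinterprets the same bytes as items of the given native
        # struct typecode (for example "d" for double); nothing is copied.
        self._view = memoryview(buffer).cast(typecode)

    def __len__(self):
        return len(self._view)

    def __getitem__(self, index):
        return self._view[index]

    def __setitem__(self, index, value):
        self._view[index] = value
```

Two TypedView objects built over the same bytearray see each other's writes, which is the behaviour one would want from an "upcast" that wraps rather than converts.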
From: Scott G. <xs...@ya...> - 2002-04-15 04:09:24
|
--- Perry Greenfield <pe...@st...> wrote: *** Just skim through my first few responses. About half way through writing this letter, a few things hit me. I still want to propose some changes, but I don't think you'll find them as intrusive... > > > > > Then anyone who implemented these could work with the same C API for > > getting the pointer to memory, shape array, stride array, and item > > size. > > > Then you are talking about standardizing a C-API. But I'm still > confused. If you write a class that implements these attributes, > is it your C-API that uses them, or do you mean our C-API uses > them? > I'm not really talking about standardizing a C-API. I'm talking about standardizing what that C-API would have to do. You would have your C-API as part of numarray proper. And, for the short term, I would have my own C-API as part of what I need to get done. Both C-API's would use the same attributes. Why do I want my own C-API today? Because numarray isn't done yet, and I can't create arrays of the types I need. I'll need a C-API to get at my types. It would be great if the same C-API could get at yours too. > > If you have your own C-API, then the attributes are not > relevant as an interface. If you intend to use our C-API to access > your objects, then they are. > Either C-API could access anything that looks like an NDArray. > > > > > Because truthfully arrays are little more than a pointer to memory. > > > > That's like asking "why in the world would we presume memcpy() or > > qsort() would know what to do with your memory?" > > > > Then you misunderstand Numarray. Numarrays are far more than just > a pointer to memory. You can get a pointer to memory from them, > but they entail much more than that. Numarray presumes that certain > things are possible with NumArray objects (like standard math > operations). If you want something that doesn't make such an > assumption, you should be using NDArray instead. 
NDArray makes > no presumptions about the contents of the memory other than > they are arranged in memory in array fashion. > I think I understand where you're coming from now. (BTW, I think some of our confusion comes from when I'm talking about "Numarray" or "numarray" the package versus "NumArray" and "NDArray" the classes.) *** Ok, I think there is light at the end of this tunnel... I guess what I've been arguing for all along is something a lot like an NDArray where I can specify the typecode (and possibly other things like 'endian' etc...), and that only NDArrays have a minimal set of standardized attributes. With this I can create extensions that will work with anything that looks like an NDArray. Your NDArrays from the numarray package, and my NDArrays of crazy types. I'm still left in the position of having to upcast an NDArray to a full blown NumArray if I ever want to use my NDArrays in a routine meant solely for NumArrays. However this conversion isn't difficult, and I think can do that when needed. Important Question: If an NDArray had a typecode (and it was a known string), is it possible to promote it to one of the standard NumArray types? Lesser Question: If an NDArray had a known typecode, is it desirable for numarray routines to promote the NDArray to a NumArray in the same way that the routines promote a Python list or tuple to a NumArray on the fly? Ok, my new proposal (again, treat it like a suggestion): - Do you think it would be possible to standardize the set of attributes that it requires to be an NDArray? NDArrays are simple and unlikely to change. I think _those_ really are just pointers to memory with array accounting information. We could agree on what exactly constitutes an NDArray. - Could this standard set of attributes optionally include the names for the typecode, endian, (and maybe some other) attributes? That doesn't mean that your NDArrays would have to have the typecode, endian or whatever information. 
It just means that when any class does add a typecode, it adds it as a specially named attribute. I realize that a large part of what I want is interoperability between separate implementations of NDArrays. Anything that has (_data, _shape, _itemsize, _type) is something I could work with in an extension. Some other fields are optional (_strides, _byteoffset) because they have sensible defaults that can be calculated from above in the common case. So the only difference between what you currently have and most of what I'm proposing is that the names of NDArray attributes become standardized. > > If you are presenting numarray with a type it already knows about, > why aren't you subclassing it? > Since I know I'll have to create types that numarray doesn't know about, I know I'm going to have to write a new array class (it's already written). It would be silly of my new array class to not implement the standard types just because numarray _does_ know about them. I now realize that I don't have to give my class to numarray directly. That didn't hit me before. I could promote/upcast it when necessary. The upcast-in and downcast-out thing will add up to extra work and messier code, but it is a workaround. > > If you present numarray an object > with a type it doesn't know about, then that is pointless. > Types and numarray are inextricably intertwined, and shall > remain so. > Understood. I don't want to ruin your NumArrays. > > ********************************************************** > > What I want to see is a specific example. I'm not going to > pay much attention to generalities because I'm still unclear > about how you intend to do what you say you will do. Perhaps > I'm slow, but I still don't get it. > Nope, clearly it was me that was being slow. There is still that bit about NDArrays that I'm trying to justify, so my example is below. > > (or alternatively, > create a numarray object that uses the same buffer yours does). > You're right. 
This hadn't occurred to me until just a little bit ago. > > E.g., "I want > complex ints and I will develop a class that will use this to > do the following things [it doesn't have to be exhaustive or > complete, but include just enough to illustrate the point]. > If the attributes were standardized then I would do this and that, > and use it with your stuff like this showing you the code > (and the behavior I expect)." > Here goes (somewhat hypothetical, but close to the boat I'm currently in): Jon is our FPGA guy who makes screaming fast core files, but our FPGAs don't do floating point. So I have to provide his driver with ComplexInt16 data. Jon and I write an extension module that calls his driver and reads data. We also write a C routine (call it "munge") that takes both ComplexInt16 data and ComplexFloat64 data. We try it out for testing, and pass in my arrays in both places. We could have used Numarray for the ComplexFloat64, but that meant we had to use two array packages, and use two C-APIs in our extension. All we needed was a pointer to an array of doubles, so we stuck with mine. Ok, that part of development is done. Now we present it to the application developers. They're happy and we're rolling. Successful application. Another group finds out about this and they want to use it. They're using numarray for a large part of their application. In fact, they're calculating the ComplexFloat64 half of the data that they want to pass to my "munge" routine using numarray, and they still need to use my ComplexInt16 data to read the FPGA. They're going to be disappointed to find out my extension can't read numarray data, and that they have to convert back and forth between the two. And as the list of routines grows, they have to keep track of whether it is a numarray-routine, or a scottarray-routine. It's not so bad for one simple "munge" function, but there are going to be hundreds of functions... 
I don't expect you to have much sympathy for my having to convert data back and forth between my array types and yours, but it is an avoidable problem. For the most part, we both agree on what parts an NDArray should have. If we could only agree what to name them, and that we'd stick to those names, that would be a large part of it for me. > > Given this I can either show you an alternate solution or > I can realize why you are right and we can discuss where > to go from there. Otherwise you are wasting your time. > Cheers, -Scott |
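[Editor's sketch] The "munge" scenario above can be sketched with the standard struct module, whose format strings the thread has already mentioned. The sketch rests on several assumptions that are not in the original messages: the typecode strings "ComplexInt16" and "ComplexFloat64" stored in __array_itemtype__, the use of struct byte-order characters for __array_endian__, contiguous data only, and an element-wise product standing in for whatever the real "munge" routine does. The class SimpleArray is hypothetical.

```python
import struct

# Assumed mapping from typecode string to a struct format for one (re, im) item.
FORMATS = {"ComplexInt16": "hh", "ComplexFloat64": "dd"}


class SimpleArray:
    """Hypothetical array from any package that follows the attribute names."""

    def __init__(self, buffer, shape, itemtype, endian="="):
        self.__array_buffer__ = buffer
        self.__array_shape__ = shape
        self.__array_itemtype__ = itemtype
        self.__array_endian__ = endian
        self.__array_itemsize__ = struct.calcsize(endian + FORMATS[itemtype])


def iter_complex(arr):
    """Yield each item of a contiguous array as a Python complex."""
    fmt = getattr(arr, "__array_endian__", "=") + FORMATS[arr.__array_itemtype__]
    start = getattr(arr, "__array_offset__", 0)
    count = 1
    for dim in arr.__array_shape__:
        count *= dim
    stop = start + count * arr.__array_itemsize__
    data = memoryview(arr.__array_buffer__)[start:stop]
    for re, im in struct.iter_unpack(fmt, data):
        yield complex(re, im)


def munge(samples, weights):
    """Stand-in for the real routine: element-wise complex product."""
    return [s * w for s, w in zip(iter_complex(samples), iter_complex(weights))]
```

Because munge looks only at the agreed attribute names, it does not need to know which package produced either argument.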