From: Scott G. <xs...@ya...> - 2002-04-13 10:08:25
--- Perry Greenfield <pe...@st...> wrote:
> Scott Gilbert writes:
[...]
> >
> > very_cool = numarray.add(n, s)
> >
> But why not (I may have some details wrong, I'm doing this
> from memory, and I haven't worked on it myself in a bit):
>
[...]
>
> maybe_not_quite_so_cool_but_just_as_functional = n + s
>
[...]
>
> From everything I've seen so far, I don't see why you can't
> just create a NumArray object directly. You can subclass it
> (and use multiple inheritance if you need to subclass a different
> object as well) and add whatever customized behavior you want.
> You can create new kinds of objects as buffers just so long
> as you satisfy the buffer interface.
>

Your point about the optional buffer parameter to the NumArray is well taken. I had seen that when looking through the code, but it slipped my mind for that example. I could very well be wrong about some of these other reasons too...

I have a number of reasons listed below for wanting the standard that Python adopts to specify only the interface and not the implementation. You may not find all of these persuasive, and I apologize in advance if any looks like a criticism. (In my limited years as a professional software developer, I've found that the majority of people can be very defensive and protective of their code. I've been trying to tread lightly, but I don't know if I'm succeeding.)

However, if any of these reasons is persuasive, keep in mind that the actual changes I'm proposing are pretty minimal in scope, and that I'd be willing to submit patches so as to reduce any inconvenience to you. (Not that you have any reason to believe I can code my way out of a box... :-)

Ok, here's my list:

Philosophical

You have a proposal in to the Python guys to make Numarray into the standard _implementation_. I think standards like this should specify an _interface_, not an implementation.

Simplicity

I can give my users a single XArray.py file, and they can be off and running with something that works right then and there. It could in many ways be compatible with Numarray (with some slight modifications) when they decide they want the extra functionality of extension modules that you or anyone else who follows your standard provides. But they don't have to compile anything until they really need to.

Your implementation leaves me with all or nothing. Either I build and use numarray, or I've got an in-house-only solution.

Expediency

I want to see a usable standard arise quickly. If you maintain the stance that we should all use the Numarray implementation, instead of just defining a good Numarray interface, everyone has to wait for you to finish things enough to get them accepted by the Python group. Your implementation is complicated, and I suspect they will have many things that they will want you to change before they accept it into their baseline. (If you think my list of suggestions is annoying, wait until you see theirs!)

If a simple interface protocol is presented, along with a simple pure Python module that implements it, the PEP acceptance process might move along quickly, and you could still take your time implementing your code.

Pragmatic

You guys aren't finished yet, and I need to give my users an array module ASAP. In such a new project, there are likely to be many bugs floating around. I think that when you are done, you will probably have a very good library. Moreover, I'm grateful that you are making it open source. That's very generous of you, and the fact that you are tolerating this discussion is definitely appreciated.

Still, I can't put off my projects, and I can't task you to work faster.

However, I do think we could agree in very short order that your design for the interface is a good one. I also think that we (or just me if you like) could make a much smaller PEP that would be more readily accepted.
Then everyone in this community could proceed at their own pace, knowing that if we followed the simple standard we would have interoperability with each other.

Social

Normally I wouldn't expect you to care about any of my special issues. You have your own problems to solve. As I said above, it's generous of you to even offer your source code. However, you are (or at least were) trying to push for this to become a standard. As such, considering how to be more general and apply to a wider class of problems should be on your agenda. If it's not, then you shouldn't be creating the standard.

If you don't care about numarray becoming standard, I would like to try my hand at submitting the slightly modified version of your design. I won't be compatible with your stuff, but hopefully others will follow suit.

Functionality

Data Types

I have needs for other types of data that you probably have little use for. If I can't coerce you to make a minor change in specification, I really don't think I could coerce you to support brand new data types (complex ints is the one I've beaten to death, because I could use that one in the short term).

What happens when someone at my company wants quaternions? I suspect that you won't have direct support for those. I know that numarray is supposed to be extensible, but the following raises an exception:

    from numarray import *

    class QuaternionType(NumericType):
        def __init__(self):
            NumericType.__init__(self, "Quaternion", 4*8, 0)

    Quaternion = QuaternionType()
    # BOOM!
    q = array(shape=(10, 10), type=Quaternion)

Maybe I'm just doing something wrong, but it looks like your code wants "Quaternion" to be in your (private?) typeConverters dictionary.

Ok, try two:

    from numarray import *

    q = NDArray(shape=(10, 10), itemsize=4*8)
    if q[5][5] is None:
        print("No boom, but what can I do with it?")

Maybe this is just a documentation problem.
On the other hand, I can do the following pretty readily:

    import array

    class Quat2D:
        def __init__(self, *shape):
            assert len(shape) == 2
            # One flat buffer of doubles, 4 per quaternion.
            self._buffer = array.array('d', [0])*shape[0]*shape[1]*4
            # Row-major strides in units of doubles: one row holds
            # shape[1] quaternions of 4 doubles each.
            self._shape, self._stride = tuple(shape), (4*shape[1], 4)
            self._itemsize = 4*8

        def __getitem__(self, sub):
            assert isinstance(sub, tuple) and len(sub) == 2
            offset = sub[0]*self._stride[0] + sub[1]*self._stride[1]
            return tuple([self._buffer[offset + i] for i in range(4)])

        def __setitem__(self, sub, val):
            assert isinstance(sub, tuple) and len(sub) == 2
            offset = sub[0]*self._stride[0] + sub[1]*self._stride[1]
            for i in range(4):
                self._buffer[offset + i] = val[i]
            return val

    q = Quat2D(10, 10)
    q[5, 5] = (1, 2, 3, 4)
    print(q[5, 5])

This isn't very general, but it is short, and it makes a good example.

If my users get half of their data from calculations using Numarray, and half from whatever I provide them, and then try to mix the results in an extension module that has to know about separate implementations, life is more complicated than it should be.

Operations

I'm going to have to write my own C extension modules for some high performance operations. All I need to get this done is a void* pointer, the shape, stride, itemsize, itemtype, and maybe some other things to get off and running. You have a growing framework, and you have already indicated that you think of your hidden variables as private. I don't think I or my users should have to understand the whole UFunc framework and API just to create an extension that manipulates a pointer to an array of doubles.

Arrays are simpler than UFuncs. I consider them to be pretty separable parts of your design. If you keep it this way, and it becomes the standard, it seems that I and everyone else will have to understand both parts in order to create an extension module.

Flexibility

Numarray is going to make a choice of how to implement slicing. My guess is that it will be one of "copy contiguous", "copy on write", "copy by reference".
I don't know what the correct choice is, but I know that someone else will need something different based on context. Things like UFuncs and other extension modules that do fast C level calculations typically don't need to concern themselves with slicing behaviour.

Design

Your implementation would be similar to having the 'pickle' module require you to derive from a 'Pickleable' base class, instead of simply providing __getstate__ and __setstate__ methods. It's an artificial constraint, and those are usually bad.

>
> All good in principle, but I haven't yet seen a reason to change
> numarray. As far as I can tell, it provides all you need exactly
> as it is. If you could give an example that demonstrated otherwise...
>

Maybe you're right. I suspect you as the author will come up with the quick example that shows how to implement my bizarre quaternion example above. I'm not sure if this makes either of us right or wrong, but if you're not buying any of this, then it's probably time for me to chalk this up to a difference in opinion and move on.

Truthfully this is taking me pretty far from my original tack. Originally I had simply hoped to hack a couple of things into arraymodule.c, and here I am now trying to get a simpler standard in place. I'll try one last time to convince you with the following two statements:

- Changing such that you only require the interface is a subtle, but noticeable, improvement to your otherwise very good design.
- It's not a difficult change.

If that doesn't compel you, at least I can walk away knowing I tried. For the volumes I've written, this will probably be my last pesky message if you really don't want to budge on this issue.

>
> To tell you the truth, I'm not crazy about how the struct module
> handles types or attributes. It's generally far too cryptic for
> my tastes. Other than providing backward compatibility, we aren't
> interested in it emulating struct.
>

I consider it a lot like regular expressions.
I cringe when I see someone else's, but I don't have much difficulty putting them together. The alternative of coming up with a different specifier for records/structs is probably a mistake now that the struct module already has its (terse) format specification. Once that is taken into consideration, following all the leads of the struct module makes sense to me.

>
> I could well misunderstand, but I thought that if you mmap a file
> in unix in write mode, you do not use up the virtual memory as
> limited by the physical memory and the paging file. Your only
> limit becomes the virtual address space available to the processor.
>

Regarding efficiency, it depends on the implementations, which vary greatly, and there are other subtleties. I've already written a book above, so I won't tire you with details. I will say that closing a large memory mapped file on top of NFS can be dreadful. It probably takes the same amount of total time or less, but from an interactive analysis point of view it's pretty unpleasant, on Tru64 at least.

Also, just mmaping the whole file puts all of the memory use at the discretion of the OS. I might have a gig or two to work with, but if mmap takes them all, other threads will have to contend for memory. The system (application) as a whole might very well run better if I can retain some control over this.

I'm not married to the windowing suggestion. I think it's something to consider, but it might not be a common enough case to try and make a standard mechanism for. If there isn't a way to do it without a kluge, then I'll drop it. Likewise if a simple strategy can't meet anyone's real needs.

>
> If the 32 bit address is your problem, you are far, far better off
> using a 64-bit processor and operating system than trying to kludge up
> a windowing memory mechanism.
>

We don't always get to specify what platform we want to run on. Our customer has other needs, and sometimes hardware support for exotic devices dictates what we'll be using.
Frequently it is on 64 bit Alphas, but sometimes the requirement is x86 Linux, or 32 bit Solaris. Finally, our most frustrating piece of legacy software was written in Fortran assuming you could stuff a pointer into an INT*4, and it now requires the -taso flag to the compiler for all new code (which turns a sexy 64 bit Alpha into a 32 bit kluge...).

Also, much of our data comes on tapes. It's not easy to memory map those.

>
> I could see a way of doing it for
> ufuncs, but the numeric world (and I would think the DSP world
> as well) needs far more than element-by-element array functionality.
> providing a usable C-api for that kind of memory model would be
> a nightmare. But I'm not sure if this or the page file is your
> limitation.
>

I would suggest that any extension module which is not interested in this feature simply raise a NotImplemented exception of some sort. UFuncs could fall into this camp without any criticism from me. All such a module would have to do is check whether the 'window_get' attribute is a callable, and if so, punt with an exception.

My proposal wasn't necessarily to map in a single element at a time. If the C extension was willing to work with these beasts at all, it would check to see if the offset it wanted was between window_min and window_max. If it wasn't, then it would call ob.window_get(offset), and the Python object could update window_min and window_max however it sees fit, for instance by remapping 10 or 20 megabytes on both sides.

This particular implementation would allow us to do correlations of a small (mega sample) chunk of data against a HUGE (giga sample) file. This might be the wrong interface, and I'm willing to listen to a better suggestion. It might also be too special of a need to detract from a simpler overall design.

Also, there are other uses for things like this. It could possibly be used to implement sparse arrays.
It's probably not the best implementation of that, but it could hide a dict of set data points, and present it to an extension module as a complete array.

Cheers,
    -Scott Gilbert
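
For concreteness, here is a rough sketch of the windowing idea from earlier in this message, written in pure Python rather than C to keep it short. Only the names window_min, window_max and window_get come from the proposal. The class WindowedBuffer, its data attribute, the half_window parameter, the half-open [window_min, window_max) convention, and the two helper functions are all made up for illustration, so treat this as one possible reading rather than a worked-out design.

```python
import io


class WindowedBuffer:
    """Hypothetical object that keeps only part of a large file in memory."""

    def __init__(self, fileobj, size, half_window=16):
        self._file = fileobj
        self._size = size
        self._half = half_window
        self.window_min = 0   # first offset currently held in memory
        self.window_max = 0   # one past the last offset held (assumed convention)
        self.data = b""       # the bytes for [window_min, window_max)

    def window_get(self, offset):
        """Re-centre the window on `offset`, keeping some data on both sides."""
        if not 0 <= offset < self._size:
            raise IndexError(offset)
        self.window_min = max(0, offset - self._half)
        self.window_max = min(self._size, offset + self._half)
        self._file.seek(self.window_min)
        self.data = self._file.read(self.window_max - self.window_min)


def reject_windowed(ob):
    """What a module that does not want to support windowing might do."""
    if callable(getattr(ob, "window_get", None)):
        raise NotImplementedError("windowed objects are not supported here")


def fetch(ob, offset):
    """What a willing module might do before touching `offset`."""
    if not (ob.window_min <= offset < ob.window_max):
        ob.window_get(offset)          # the object decides how to remap
    return ob.data[offset - ob.window_min]
```

A consumer would call fetch(buf, offset) for each element it needs. The window only moves when an offset falls outside the current range, so a scan over nearby offsets triggers very few remaps.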