From: Travis O. <oli...@ie...> - 2006-09-20 10:48:14
Francesc Altet wrote:
> Hi,
>
> I'm sending a message here because discussing this in the bug tracker is not very comfortable. This is my last try before giving up, so don't be afraid ;-)
>
> In bug #283 (http://projects.scipy.org/scipy/numpy/ticket/283) I complained about the fact that numpy.int32 is mapped in NumPy to the NPY_LONG enumerated type, and I think I failed to explain well why I think this is a bad thing. Now I'll try to present a (real-life) example, in the hope that it makes things clearer.
>
> Suppose you are writing a C extension that receives NumPy arrays and saves them on disk for later retrieval. Suppose also that a user runs your extension on a 32-bit platform. If she passes the extension an array of type 'int32', and the extension reads the enumerated type (using array.dtype.num), it will get NPY_LONG. So the extension uses this code (NPY_LONG) to save the type (together with the data) on disk. Now she sends the data file to a teammate who works on a 64-bit machine and tries to read it back with the same extension. The extension sees that the data is of NPY_LONG type and tries to deserialize it, interpreting the data elements as 64-bit integers (the size of NPY_LONG on 64-bit platforms), which is clearly wrong.

In my view, this "real-life" example points to a flaw in the coding design that will not be fixed by altering what numpy.int32 maps to under the covers. It is wrong to use a code for the platform C data-type (NPY_LONG) as a key to understand data written to disk. This is and always has been a bad idea. No matter what we do with numpy.int32, this can cause problems. Just because a lot of platforms think an int is 32 bits does not mean all of them do. C gives you no such guarantee.

Notice that pickling of NumPy arrays does not store the "enumerated type" as the code. Instead it stores the data-type object (which itself pickles using the kind and element size so that the correct data-type object can be reconstructed on the other end --- if it is available at all). Thus, you should not be storing the enumerated type but instead something like the kind and element size (see the first sketch below).

> Besides this, if your C extension uses a C library that is meant to save data in a platform-independent way (say, HDF5), then having NPY_LONG does not automatically tell you which library data-type it maps to, because that library only has data-types with a definite size on all platforms. So this is a second problem.

Making sure you get the correct data-type is why there are NPY_INT32 and NPY_INT64 enumerated types. You can't code using NPY_LONG and expect it to give you the same sizes when moving from 32-bit to 64-bit platforms. That's a problem that has already been fixed with the bitwidth types (see the second sketch below). I don't understand why you are using the enumerated types at all in this circumstance.

> Of course there are workarounds for this, but my impression is that they could be avoided with a more sensible mapping between NumPy Python types and NumPy enumerated types, like:
>
> numpy.int32 --> NPY_INT
> numpy.int64 --> NPY_LONGLONG
> numpy.int_  --> NPY_LONG
>
> on all platforms, avoiding the current situation of an ambiguous mapping between platforms.

The problem is that C gives us this ambiguous mapping. You are asking us to pretend it isn't there because it "simplifies" a hypothetical case so that a poor coding practice can be allowed to work in a special case. I'm not convinced.
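To make the first point concrete, here is a minimal sketch of recording the kind and element size instead of the enumerated type. The helper name and the one-line "format" are made up purely for illustration, and the read side and most error handling are omitted:

#include <Python.h>
#include <numpy/arrayobject.h>
#include <stdio.h>

/* Sketch only: write a platform-independent type tag for an array.
 * Assumes import_array() has been called during module init. */
static int
write_type_tag(FILE *fp, PyArrayObject *arr)
{
    PyArray_Descr *descr = PyArray_DESCR(arr);

    /* The kind character ('i', 'u', 'f', ...) plus the element size
     * identify the data-type unambiguously on every platform;
     * descr->type_num does not.  A real implementation would also
     * resolve '=' (native byte order) to an explicit '<' or '>'
     * before writing it out. */
    if (fprintf(fp, "%c%c%d\n",
                descr->byteorder,      /* '<', '>', '=' or '|' */
                descr->kind,           /* type kind character  */
                descr->elsize) < 0) {  /* bytes per element    */
        return -1;                     /* write error */
    }
    return 0;
}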
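And here is a sketch of the second point: code against the bitwidth enumerated types rather than against NPY_LONG or NPY_INT directly (same includes as above; the helper name is again made up):

/* Sketch only: ask "is this a 32-bit signed integer array?".  NPY_INT32 is
 * an alias for whichever platform type happens to be 32 bits (NPY_LONG on
 * a 32-bit platform, NPY_INT on a typical 64-bit one), and
 * PyArray_EquivTypenums() treats equivalent type numbers as equal, so this
 * answers the question you actually care about. */
static int
is_int32_array(PyArrayObject *arr)
{
    return (int)PyArray_EquivTypenums(PyArray_TYPE(arr), NPY_INT32);
}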
The proposed mapping also persists the myth that C data-types have a defined length. That is not guaranteed. The current system defines data-types with a guaranteed length. Yes, there is ambiguity as to which is "the" underlying C type on certain platforms, but if you are running into trouble with the difference, then you need to change how you are coding, because you would run into trouble on some combination of platforms even if we made the change.

Basically, you are asking for a major change, and at this point I'm very hesitant to make such a change without a clear and pressing need for it. Your hypothetical example does not rise to the level of "clear and pressing need." In fact, I see your proposal as a step backwards.

Now, it is true that we could change the default type that gets first grab at int32 to be int (instead of the current long) --- I could see arguments for that. But since the choice is ambiguous and the Python integer type is the C type long, I let long get first dibs on everything, as this seemed to work better for code I was wrapping in the past. I don't see any point in changing this choice now and risking code breakage, especially when your argument is essentially that it would let users think that a C int is always 32 bits.

Best regards,

-Travis