From: Scott G. <xs...@ya...> - 2002-04-11 11:31:53
Hello All. I'm interested in this project, and am curious about what level of outside contribution you are willing to accept. I just tried to subscribe to the developers list, but I didn't realize that required admin approval. Hopefully it doesn't look like I was shaking the door without knocking first. Is this list active? Is this the correct place to talk about Numarray?

A little about me: My name is Scott Gilbert, and I work as a software developer for a company called Rincon Research in Tucson, Arizona. We do a lot of digital signal processing/analysis, among other things. In the last year or so, we've started to use Python in various capacities, and we're hoping to use it for more. We need a good array module for various things. Some are similar to what it looks like Numarray is targeted at (fft, convolutions, etc...), and others are pretty different (providing buffers for reading data from specialized hardware, etc...).

About a week ago, I noticed that Guido over in Python developer land was willing to accept patches to the standard array module. As such, I thought I would take that opportunity to try and wedge some desirements and requirements I have into that baseline. Bummer for me, but they weren't exactly excited about bloating out arraymodule.c to meet my needs, and in retrospect that does make good sense. A number of people suggested that this might be a better place to try and get what I need. So here I am, poking around and wondering if I can play in your sandbox. If you're willing to let me contribute, my specific itches that I need to scratch are below. Otherwise - bummer, and I hope you all catch crabs... :-)

-----------------------------------

It's taken me a couple of days to understand what's going on in the source. I've read through the design docs, and the PEP, but it wasn't until I tried to re-implement it that it really clicked. My re-implementation of the array portion of what you're doing is attached. There are still some holes to fill in, but it's fairly complete and supports a whole bunch of things which yours does not (some of which you might even find useful: pickling, a Bit type). I'm pretty proud of it for only 400 lines of Python (most of which is the bazillion type declarations). It's probably riddled with bugs as it's less than a day old...

After initially thinking that you guys were getting too clever, I've come to realize it's a pretty good design overall. Still, I have some changes I would like to make if you'll let me (both to the design and the implementation).

-------------------------

Following your design for the Array stuff, I've been able to implement a pretty usable array class that supports the bazillion array types I need (Bit, Complex Integer, etc...). This gets me past my core requirements without polluting your world, but unfortunately my new XArray type doesn't play so well with your UFuncs. I think my users will definitely want to use your UFuncs when the time comes, so I want to remedy this situation.

The first change I would like to make is to rework your code that verifies that an object is a "usable" array. I think NumArray should only check for the interface required, not the actual type hierarchy. By this I mean that the minimum required to be a supported array type is that it support the correct attributes, not that it actually inherit from NDArray. Quoting from your paper, something like:

    _data
    _shape
    _strides
    _byteoffset
    _aligned
    _contiguous
    _type
    _byteswap

Most of these are just integer fields, or tuples of integers.
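A rough sketch of the kind of check I mean is below. The function name and attribute tuple are just illustrative - I'm not claiming this is what inputarray() looks like today, only that a test of this shape would be enough:

    # Hypothetical duck-typing check: accept any object that exposes the
    # documented attributes, instead of requiring isinstance(a, NDArray).
    _REQUIRED_ATTRS = ('_data', '_shape', '_strides', '_byteoffset',
                       '_aligned', '_contiguous', '_type', '_byteswap')

    def looks_like_ndarray(obj):
        """Return 1 if obj exposes the minimal NDArray attribute interface."""
        for name in _REQUIRED_ATTRS:
            if not hasattr(obj, name):
                return 0
        return 1

With something along those lines in place, my XArray class (or anybody else's extension type) could be handed to the UFunc machinery without inheriting from anything in particular.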
Ignoring _type for the moment, it appears that the interface required to be a NumArray is much less strict than actually requiring it to derive from NumArray. If you allow me to change a few functions (inputarray() in numarray.py is one small example), I could use my independent XArray class almost as is, and moreover I could implement new array objects (possibly as extension types) for crazy things like working with page-aligned memory, memory mapping, etc...

Well, that's almost enough. The _type field poses a small problem of sorts. It looks like you don't require a _type to be derived from NumericType, and this is a good thing since it allows me (and others) to implement NumArray-compatible arrays without actually requiring NumArray to be present. However, it would be nice if you declared a more comprehensive list of type names - even if they aren't all implemented in NumArray proper. Who knows, maybe the SciPy guys have a use for complex integers or bit arrays. If you make a reasonable canonical list, our data could be passed back and forth even if NumArray doesn't know what to do with it. See my attached module for the kinds of things I'm thinking of. I'm not so concerned about the "Native Types" that are in there, but I do think committing to a list of named standard types is worthwhile. (I suspect there are others who are interested in standard C types even if the size changes between machines...)

If you were to specify a minimal interface like this in the short term, I could begin propagating my array module to my users. I could get my work done now, knowing that I'll be compatible with NumArray proper once it matures. I'd be willing to participate in making these changes if necessary.

Looking at the big picture, I think it's desirable that there really only be one official standard for ND arrays in the Python world. That way, the various independent groups can all share their independent work. You guys are the heir apparent, so to speak, from the Python guys' point of view. I don't know if you're trying to get all of NumArray into the Python distribution or not, but I suspect a good interim step would be a PEP that specifies what it means to be a NumArray or NDArray in minimal terms, perhaps along with an array-only module in Python that implements this interface. Again, I'd be willing to help with all of this.

-------------------------

Ok, other suggestions... Here is the list of things that your design document indicates are required to be a NumArray:

    _data
    _shape
    _strides
    _byteoffset
    _aligned
    _contiguous
    _type
    _byteswap

I believe that one could calculate the values for _aligned and _contiguous from the other fields, so they shouldn't really be part of the required interface. I suspect it is useful for the C implementation of UFuncs to have this information in the NDInfo struct though, so while I would drop them from the attribute interface, I would delegate the task of calculating these values to getNDInfo() and/or getNumInfo(). I also notice that you chose _byteswap to indicate byteswapping is needed. I think a better choice would be to specify the endianness of the data (with an _endian attr), and have getNDInfo() and getNumInfo() calculate the _byteswap value for the NDInfo struct.
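To show what I mean, here is a rough sketch of the kind of calculation getNDInfo() could do. The helper names are mine, and I've left out the alignment test since that also needs the base pointer, which only the C side has handy:

    import sys

    def calc_contiguous(shape, strides, itemsize):
        # C-contiguous when each stride equals the item size times the
        # number of elements in all of the dimensions to its right.
        expected = itemsize
        for i in range(len(shape) - 1, -1, -1):
            if strides[i] != expected:
                return 0
            expected = expected * shape[i]
        return 1

    def calc_byteswap(endian):
        # endian is 'big' or 'little'; swap when it differs from the host.
        return endian != sys.byteorder

The point is just that _contiguous and _byteswap are derived quantities, so the array object shouldn't have to keep them consistent by hand.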
In my implementation, I came up with a slightly different list:

    self._endian
    self._offset
    self._shape
    self._stride
    self._itemtype
    self._itemsize
    self._itemformat
    self._buffer

The main differences are that _itemsize allows me to work with arrays of bytes without having any clue what the underlying type is (in some cases, _itemtype is "Unknown"). Secondly, I implemented a "Struct" _itemtype, and _itemformat is useful for this case. (It's the same format string that the struct module in Python uses.) Also, I specified 0 for _itemsize when the actual items aren't byte addressable. In my module, this only occurred with the Bit type. I figured specifying 0 like this could keep a UFunc that isn't Bit-aware from stepping on memory that it isn't allowed to.

-------------------------

Next thought: Memory Mapping

I really like the idea of having Python objects that map huge files a piece at a time without using all of available memory. I've seen this in NumArray's charter as part of the reason for breaking away from Numeric, and I'm curious how you intend to address it. Right now, the only requirement for _data seems to be that it implement the PyBufferProcs. For memory mapping, something else is needed...

I haven't implemented this, so take it as just my rambling thoughts: with the addition of 3 new, optional, attributes to the NumArray object interface, I think this could be efficiently accomplished:

    _mapproc
    _mapmin
    _mapmax

If _mapproc is present and not None, then it points to a function whose responsibility it is to set _mapmin and _mapmax appropriately. _mapproc takes one argument, which is the desired byte offset into the virtual array. This is probably easier to describe with code:

    def _mapproc(self, offset):
        # Placeholder operations: release the currently mapped window, then
        # map a new window of the file that includes the requested offset.
        unmap_the_old_range()
        mmap_a_new_range_that_includes(offset)
        self._mapmin = minimum_of_new_range()
        self._mapmax = maximum_of_new_range()

In this way, when the delta between _mapmin and _mapmax is large enough, the UFuncs could act over a large contiguous portion of the _data array at a time before another remapping is necessary. If the byte offset that a UFunc needs to work with is outside of _mapmin and _mapmax, it must call _mapproc to remedy the situation. This puts a lot of work into UFuncs that choose to support this. I suppose that is tough to avoid though.

Also, there are threading issues to think about here. I don't know if UFuncs are going to release the Global Interpreter Lock, but if they do, it's possible that multiple threads could have the same PyObject and try to _mapproc different offsets at different times. It is possible to implement a mutex for the NumArray without requiring anything special from the PyObject that implements it...

-----------------------------

Ok. That's probably way too much content for an introductory email. I do have more thoughts on this stuff though. They'll just have to wait for another time.

Nice to meet you all,

-Scott Gilbert