From: Paul F. D. <du...@ll...> - 2002-01-25 17:43:31
|
I have verified that this package seems to work on Windows. I say "seems" only because I didn't try enough to uncover anything subtle.

Unless or until we are convinced as a community that this is (a) the right way to do this and (b) that the package is portable, it would not be wise to put it in the main distribution.

I would like to hear from the community about this so that I will know whether or not to add this package as a separate SourceForge 'package' within the Numerical Python area. Meantime I will add a link to the web page.

-----Original Message-----
From: pyt...@py... [mailto:pyt...@py...] On Behalf Of Kragen Sitaker
Sent: Wednesday, January 23, 2002 9:40 PM
To: pyt...@py...
Subject: memory-mapped Numeric arrays: arrayfrombuffer version 2

The 'arrayfrombuffer' package features support for Numerical Python arrays whose contents are stored in buffer objects, including memory-mapped files. This has the following advantages:

- loading your array from a file is easy --- a module import and a single
  function call --- and doesn't use excessive amounts of memory.
- loading your array is quick; it doesn't need to be copied from one part
  of memory to another in order to be loaded.
- your array gets demand-loaded; parts you aren't using don't need to be
  in memory or in swap.
- under memory-pressure conditions, your array doesn't use up swap, and
  parts of it you haven't modified can be evicted from RAM without the
  need for a disk write.
- your arrays can be bigger than your physical memory.
- when you modify your array, only the parts you modify get written back
  out to disk.

This is something that's been requested on the Numpy list a few times a year since 1999.

arrayfrombuffer lives at http://pobox.com/~kragen/sw/arrayfrombuffer/
The current version is version 2; it is released under the X11 license (the BSD license without the advertising clause).

<kr...@po...>

<P><A HREF="http://pobox.com/~kragen/sw/arrayfrombuffer/">arrayfrombuffer 2</A> - creates Numeric arrays from memory-mapped files. (23-Jan-02)

--
http://mail.python.org/mailman/listinfo/python-announce-list
|
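For readers who want a concrete picture of "a module import and a single function call", here is a hypothetical usage sketch. The announcement does not show maparray()'s actual signature, so the call below (filename in, Numeric array out) is an assumption, not the package's documented API.

    # Hypothetical usage sketch -- the real maparray() signature may differ.
    import maparray

    # Assumed: pass a filename, get back a Numeric array whose storage is the
    # memory-mapped file itself, so nothing is copied into ordinary memory.
    a = maparray.maparray('big_data.bin')

    # Reads touch only the pages actually accessed...
    first = a[0]
    # ...and writes go back to the file only for the pages actually modified.
    a[0] = first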
From: <kr...@po...> - 2002-01-30 07:38:37
|
Paul Dubois writes:
> I have verified that this package seems to work on Windows. I say "seems"
> only because I didn't try enough to uncover anything subtle.

Thanks! If you run maparray.py from the command line, it runs a basic regression test suite, which doesn't really try enough to uncover anything subtle either.

> Unless or until we are convinced as a community that this is (a) the
> right way to do this and (b) that the package is portable, it would not
> be wise to put it in the main distribution.

Well, I'm not the community, but I'll state my arguments.

On (a): I don't know about the right way to do it; as I'm sure is obvious, I'm new to extending Numerical Python in C. But I doubt there's a simpler way to do it ("it" being seeing the contents of files as Numeric arrays), and I think it works about as well as it can without major hacking of either Numeric (to support read-only arrays) or the mmap module (to prevent closing an open object, and to allow read-only mmapping on Windows). The other things on the wishlist are basically features to add --- "start" and "size" arguments to maparray(), a "create-a-file" argument to maparray(), etc. --- and don't change the basic structure.

On (b): it depends on two-argument mmap.mmap(), open(), <file>.fileno(), and os.fstat(). The major portability hurdle there is probably mmap.mmap(), but that's OK.

> I would like to hear from the community about this so that I will know
> whether or not to add this package as a separate SourceForge 'package'
> within the Numerical Python area. Meantime I will add a link to the web
> page.

I would too. There have been downloads from something like 50 IP addresses, but I've only heard from three people.
|
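For reference, here is a minimal sketch of the core sequence listed under (b) --- open(), fileno(), os.fstat(), and two-argument mmap.mmap() --- reconstructed from the traceback quoted later in this thread. The last step, wrapping the mmap object as a Numeric array, is what the arrayfrombuffer C extension itself provides, so it appears only as a comment here.

    import mmap
    import os
    import stat

    # Open the file read-write; two-argument mmap.mmap() expects a writable
    # descriptor on most platforms.
    f = open('data.bin', 'r+b')
    fd = f.fileno()

    # Map the whole file, using fstat() to learn its size.
    size = os.fstat(fd)[stat.ST_SIZE]
    m = mmap.mmap(fd, size)

    # arrayfrombuffer's job begins here: expose m's bytes as a Numeric array
    # that shares storage with the mapped file instead of copying it.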
From: <kr...@po...> - 2002-02-18 20:07:26
|
(I thought I had sent this mail on January 30, but I guess I was mistaken.)

Eric Nodwell writes:
> Since I have a 2.4GB data file handy, I thought I'd try this
> package with it. (Normally I process this data file by reading
> it in a chunk at a time, which is perfectly adequate.) Not
> surprisingly, it chokes:

Yep, that's pretty much what I expected. I think that adding code to support mapping some arbitrary part of a file should be fairly straightforward --- do you want to run the tests if I write the code?

>   File "/home/eric/lib/python2.2/site-packages/maparray.py", line 15,
>     in maparray
>     m = mmap.mmap(fn, os.fstat(fn)[stat.ST_SIZE])
> OverflowError: memory mapped size is too large (limited by C int)

This error message's wording led me to something that was *not* what I expected. That's a sort of alarming message --- it suggests that it won't work on >2G files even on LP64 systems, where longs and pointers are 64 bits but ints are 32 bits. The comments in the mmap module say:

    The map size is restricted to [0, INT_MAX] because this is the current
    Python limitation on object sizes. Although the mmap object *could*
    handle a larger map size, there is no point because all the useful
    operations (len(), slicing(), sequence indexing) are limited by a C int.

Horrifyingly, this is true. Even the buffer interface function arrayfrombuffer uses to get the size of the buffer returns an int size, not a size_t. This is a serious bug in the buffer interface, IMO, and I doubt it will be fixed --- the buffer interface is apparently due for a revamp soon at any rate, so little changes won't be welcomed, especially if they break binary backwards compatibility, as this one would on LP64 platforms.

Fixing this, so that LP64 Pythons can mmap >2G files (their birthright!), is a bit of work --- probably a matter of writing a modified mmap module that supports a saner version of the buffer interface (with named methods instead of a type-object slot) and whose objects can't be close()d, to boot. Until then, this module only lets you memory-map files up to two gigs.

> (details: Python 2.2, numpy 20.3, Pentium III, Debian Woody, Linux
> kernel 2.4.13, gcc 2.95.4)

My kernel is 2.4.13 too, but I don't have any large files, and I don't know whether any of my kernel, my libc, or my Python even support them.

> I'm not a big C programmer, but I wonder if there is some way for
> this package to overcome the 2GB limit on 32-bit systems. That
> could be useful in some situations.

I don't know, but I think it would probably require extensive code changes throughout Numpy.

--
<kr...@po...> Kragen Sitaker <http://www.pobox.com/~kragen/>
The sages do not believe that making no mistakes is a blessing. They believe,
rather, that the great virtue of man lies in his ability to correct his
mistakes and continually make a new man of himself.
   -- Wang Yang-Ming
|
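Mapping "some arbitrary part of a file", as offered above, would have needed C-level work in 2002, because the mmap module of that era had no offset argument. As a rough sketch of the idea under a later-Python assumption (mmap.mmap() gained an offset keyword in Python 2.6), windowed mapping looks something like this; the function name and arguments are illustrative and not part of arrayfrombuffer.

    import mmap

    def map_window(path, start, length):
        """Map only bytes [start, start+length) of a large file, read-only.

        Sketch only: relies on mmap's offset keyword (Python 2.6+). The offset
        must be a multiple of mmap.ALLOCATIONGRANULARITY, so align downward
        and remember how far we overshot.
        """
        f = open(path, 'rb')
        gran = mmap.ALLOCATIONGRANULARITY
        aligned = (start // gran) * gran
        delta = start - aligned
        m = mmap.mmap(f.fileno(), length + delta,
                      access=mmap.ACCESS_READ, offset=aligned)
        # The requested bytes live at m[delta : delta + length].
        return m, delta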
From: Eric N. <no...@ph...> - 2002-01-25 18:40:16
|
Since I have a 2.4GB data file handy, I thought I'd try this package with it. (Normally I process this data file by reading it in a chunk at a time, which is perfectly adequate.) Not surprisingly, it chokes:

  File "/home/eric/lib/python2.2/site-packages/maparray.py", line 15, in maparray
    m = mmap.mmap(fn, os.fstat(fn)[stat.ST_SIZE])
OverflowError: memory mapped size is too large (limited by C int)

(details: Python 2.2, numpy 20.3, Pentium III, Debian Woody, Linux kernel 2.4.13, gcc 2.95.4)

I'm not a big C programmer, but I wonder if there is some way for this package to overcome the 2GB limit on 32-bit systems. That could be useful in some situations.

Eric

On Fri, Jan 25, 2002 at 09:40:21AM -0800, Paul F. Dubois wrote:
> I have verified that this package seems to work on Windows. I say "seems"
> only because I didn't try enough to uncover anything subtle.
>
> Unless or until we are convinced as a community that this is (a) the
> right way to do this and (b) that the package is portable, it would not
> be wise to put it in the main distribution.
>
> I would like to hear from the community about this so that I will know
> whether or not to add this package as a separate SourceForge 'package'
> within the Numerical Python area. Meantime I will add a link to the web
> page.
>
> [snip: the rest of the quote repeats the arrayfrombuffer announcement above]

--
********************************
Eric Nodwell
Ph.D. candidate
Department of Physics
University of British Columbia
tel: 604-822-5425
fax: 604-822-4750
no...@ph...
|
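The chunk-at-a-time processing mentioned at the top of this message sidesteps the 2GB mmap limit entirely, at the cost of copying each chunk into memory. A rough sketch, assuming Numeric's fromstring() behaves like its NumPy descendant; the typecode ('d' for doubles) and the chunk size are arbitrary choices for illustration.

    import Numeric

    CHUNK_BYTES = 64 * 1024 * 1024   # read 64 MB of raw bytes at a time

    def process_chunks(path, process):
        # Read the file in fixed-size pieces and hand each piece to `process`
        # as a Numeric array of doubles; only one chunk is in memory at once.
        f = open(path, 'rb')
        while 1:
            data = f.read(CHUNK_BYTES)
            if not data:
                break
            process(Numeric.fromstring(data, 'd'))
        f.close()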