From: Jonathan W. <jon...@gm...> - 2006-11-16 19:28:33
|
Hi all,

I've gotten to the point where Numpy recognizes the objects (represented as doubles), but I haven't figured out how to register ufunc loops on the custom type. It seems like Numpy should be able to check that the scalarkind variable in the numpy type descriptor is set to float and use the float ufuncs on the custom object. Barring that, does anyone know if the symbols for the ufuncs are publicly accessible (and where they are) so that I can register them with Numpy on the custom type?

As for sharing code, I've been working on this for a project at work. There is a possibility that it will be released to the Numpy community, but that's not clear yet.

Thanks,
Jonathan

On 11/16/06, Matt Knox <mat...@ho...> wrote:
> > On Thursday 16 November 2006 11:44, David Douard wrote:
> > > Hi, just to ask you: how is the work going on encapsulating mx.DateTime
> > > as a native numpy type?
> > > And most important: is the code available somewhere? I am also
> > > interested in using DateTime objects in numpy arrays. For now, I've
> > > always used arrays of floats (using gmticks values of dates).
> >
> > And I, as arrays of objects (well, I wrote a subclass to deal with dates,
> > where each element is a datetime object, with methods to translate to floats
> > or strings, but it's far from optimal...). I'd also be quite interested in
> > checking what has been done.
>
> I'm also very interested in the results of this. I need to do something
> very similar and am currently relying on an ugly hack to achieve the desired
> result.
>
> - Matt Knox
|
From: Matt K. <mat...@ho...> - 2006-11-16 18:37:41
|
> On Thursday 16 November 2006 11:44, David Douard wrote:
> > Hi, just to ask you: how is the work going on encapsulating mx.DateTime
> > as a native numpy type?
> > And most important: is the code available somewhere? I am also
> > interested in using DateTime objects in numpy arrays. For now, I've
> > always used arrays of floats (using gmticks values of dates).
> And I, as arrays of objects (well, I wrote a subclass to deal with dates,
> where each element is a datetime object, with methods to translate to floats
> or strings, but it's far from optimal...). I'd also be quite interested in
> checking what has been done.

I'm also very interested in the results of this. I need to do something very similar and am currently relying on an ugly hack to achieve the desired result.

- Matt Knox
|
From: Tim H. <tim...@ie...> - 2006-11-16 18:05:56
|
Francesc Altet wrote:
> A Dimarts 14 Novembre 2006 23:08, Erin Sheldon escrigué:
>> On 11/14/06, John Hunter <jdh...@ac...> wrote:
>>> Has anyone written any code to facilitate dumping mysql query results
>>> (mainly arrays of floats) into numpy arrays directly at the extension
>>> code layer. The query results->list->array conversion can be slow.
>>>
>>> Ideally, one could do this semi-automagically with record arrays and
>>> table introspection....
>>
>> I've been considering this as well. I use both postgres and Oracle
>> in my work, and I have been using the python interfaces (cx_Oracle
>> and pgdb) to get result lists and convert to numpy arrays.
>>
>> The question I have been asking myself is "what is the advantage
>> of such an approach?". It would be faster, but by how
>> much? Presumably the bottleneck for most applications will
>> be data retrieval rather than data copying in memory.
>
> Well, that largely depends on your pattern to access the data in your
> database. If you are accessing regions of your database that have a
> high degree of spatial locality (i.e. they are located in equal or
> very similar places), the data is most probably already in memory (in
> your filesystem cache or maybe in your database cache) and the
> bottleneck will become the memory access. Of course, if you don't have
> such a spatial locality in the access pattern, then the bottleneck
> will be the disk.
>
> Just to see how DB 2.0 could benefit from adopting record arrays as
> input buffers, I've done a comparison between SQLite3 and PyTables.
> PyTables doesn't support DB 2.0 as such, but it does use record arrays
> as buffers internally so as to read data in an efficient way (there
> should be other databases that feature this, but I know PyTables best
> ;)
>
> For this, I've used a modified version of a small benchmarking program
> posted by Tim Hochberg in this same thread (it is listed at the end
> of the message). Here are the results:
>
> setup SQLite took 23.5661110878 seconds
> retrieve SQLite took 3.26717996597 seconds
> setup PyTables took 0.139157056808 seconds
> retrieve PyTables took 0.13444685936 seconds
>
> [SQLite results were obtained using an in-memory database, while
> PyTables used an on-disk one. See the code.]
>
> So, yes, if your access pattern exhibits a high degree of locality,
> you can expect a huge difference in reading speed (more than 20x
> for this example, but as this depends on the dataset size, it can be
> even higher for larger datasets).

One weakness of this benchmark is that it doesn't break out how much of the sqlite3 overhead is inherent to the sqlite3 engine, which I expect is somewhat more complicated internally than PyTables, and how much is due to all the extra layers we go through to get the data into an array (native [in database] -> Python objects -> native [in record array]). To try to get at least a little handle on this, I added this test:

    def querySQLite(conn):
        c = conn.cursor()
        c.execute('select * from demo where x = 0.0')
        y = np.fromiter(c, dtype=dtype)
        return y

This returns very little data (in the cases I ran it actually returned no data). However, it still needs to loop over all the records and examine them.

Here's what the timings looked like:

    setup SQLite took 9.71799993515 seconds
    retrieve SQLite took 0.921999931335 seconds
    query SQLite took 0.313000202179 seconds

I'm reluctant to conclude that 1/3 of the time is spent in traversing the database and 2/3 of the time in creating the data, solely because databases are big voodoo to me. Still, we can probably conclude that traversing the data itself is pretty expensive and we would be unlikely to approach PyTables speed even if we didn't have the extra overhead. On the other hand, there's a factor of three or so improvement that could be realized by reducing overhead. Or maybe not. I think that the database has to return its data a row at a time, so there's intrinsically a lot of copying that's going to happen. So, I think it's unclear whether getting the data directly in native format would be significantly cheaper. I suppose that the way to definitively test it would be to rewrite one of these tests in C. Any volunteers?

I think it's probably safe to say that either way PyTables will cream sqlite3 in those fields where it's applicable. One of these days I really need to dig into PyTables. I'm sure I could use it for something.

[snip]

-tim
|
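A minimal sketch of the two retrieval paths being compared above, for a two-column demo table like the one in Francesc's benchmark; the function names are illustrative, and the structured dtype is assumed to match the table layout:

    import sqlite3
    import numpy as np

    dtype = np.dtype([('x', float), ('y', float)])   # assumed layout of the demo table

    def retrieve_via_list(conn):
        # fetchall() materializes every row as a tuple of Python floats
        # before conversion; this is the memory-hungry path.
        c = conn.cursor()
        c.execute('select * from demo')
        rows = c.fetchall()
        return np.fromiter(rows, dtype=dtype, count=len(rows))

    def retrieve_via_cursor(conn):
        # Feeding the cursor straight to fromiter consumes rows one at a
        # time, so only a couple of temporary Python objects are alive at once.
        c = conn.cursor()
        c.execute('select * from demo')
        return np.fromiter(c, dtype=dtype)

Most of the gap measured in this thread comes from the million temporary Python floats (and their eventual collection) that the list-building path creates.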
From: Colin J. W. <cj...@sy...> - 2006-11-16 17:40:29
|
David Douard wrote:
> On Thu, Oct 26, 2006 at 05:26:47PM -0500, Jonathan Wang wrote:
> > I'm trying to write a Numpy extension that will encapsulate mxDateTime as a
> > native Numpy type. I've decided to use a type inherited from Numpy's scalar
> > double. However, I'm running into all sorts of problems. I'm using numpy
> > 1.0b5; I realize this is somewhat out of date.
>
> Hi, just to ask you: how is the work going on encapsulating mx.DateTime
> as a native numpy type?
> And most important: is the code available somewhere? I am also
> interested in using DateTime objects in numpy arrays. For now, I've
> always used arrays of floats (using gmticks values of dates).
>
> Thank you,
> David

It would be nice if dtype were subclassable to handle this sort of thing.

Colin W.
|
From: Pierre GM <pgm...@gm...> - 2006-11-16 17:01:13
|
On Thursday 16 November 2006 11:44, David Douard wrote: > Hi, just to ask you: how is the work going on encapsulatinsg mx.DateTime > as a native numpy type? > And most important: is the code available somewhere? I am also > interested in using DateTime objects in numpy arrays. For now, I've > always used arrays of floats (using gmticks values of dates). And I, as arrays of objects (well, I wrote a subclass to deal with dates, where each element is a datetime object, with methods to translate to floats or strings , but it's far from optimal...). I'd also be quite interested in checking what has been done. |
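A rough sketch of the object-array approach Pierre and David describe (dates stored as Python objects, translated to floats on demand); the gmticks-style conversion below uses only the standard library and is an assumption about how such a translation might look, not Pierre's actual subclass:

    import datetime, calendar
    import numpy as np

    # An object array where each element is a datetime object.
    dates = np.array([datetime.datetime(2006, 11, d) for d in (14, 15, 16)],
                     dtype=object)

    def to_gmticks(arr):
        # Translate each date to seconds since the epoch (UTC), giving the
        # plain float array that David has been using so far.
        return np.fromiter((calendar.timegm(d.timetuple()) for d in arr),
                           dtype=float, count=len(arr))

    ticks = to_gmticks(dates)   # ordinary float64 array, usable with any ufunc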
From: David D. <dav...@lo...> - 2006-11-16 16:44:51
|
On Thu, Oct 26, 2006 at 05:26:47PM -0500, Jonathan Wang wrote:
> I'm trying to write a Numpy extension that will encapsulate mxDateTime as a
> native Numpy type. I've decided to use a type inherited from Numpy's scalar
> double. However, I'm running into all sorts of problems. I'm using numpy
> 1.0b5; I realize this is somewhat out of date.

Hi, just to ask you: how is the work going on encapsulating mx.DateTime as a native numpy type?
And most important: is the code available somewhere? I am also interested in using DateTime objects in numpy arrays. For now, I've always used arrays of floats (using gmticks values of dates).

Thank you,
David

--
David Douard                             LOGILAB, Paris (France)
Formations Python, Zope, Plone, Debian : http://www.logilab.fr/formations
Développement logiciel sur mesure :      http://www.logilab.fr/services
Informatique scientifique :              http://www.logilab.fr/science
|
From: Francesc A. <fa...@ca...> - 2006-11-16 13:13:47
|
A Dimarts 14 Novembre 2006 23:08, Erin Sheldon escrigué:
> On 11/14/06, John Hunter <jdh...@ac...> wrote:
> > Has anyone written any code to facilitate dumping mysql query results
> > (mainly arrays of floats) into numpy arrays directly at the extension
> > code layer. The query results->list->array conversion can be slow.
> >
> > Ideally, one could do this semi-automagically with record arrays and
> > table introspection....
>
> I've been considering this as well. I use both postgres and Oracle
> in my work, and I have been using the python interfaces (cx_Oracle
> and pgdb) to get result lists and convert to numpy arrays.
>
> The question I have been asking myself is "what is the advantage
> of such an approach?". It would be faster, but by how
> much? Presumably the bottleneck for most applications will
> be data retrieval rather than data copying in memory.

Well, that largely depends on your pattern to access the data in your
database. If you are accessing regions of your database that have a
high degree of spatial locality (i.e. they are located in equal or
very similar places), the data is most probably already in memory (in
your filesystem cache or maybe in your database cache) and the
bottleneck will become the memory access. Of course, if you don't have
such a spatial locality in the access pattern, then the bottleneck
will be the disk.

Just to see how DB 2.0 could benefit from adopting record arrays as
input buffers, I've done a comparison between SQLite3 and PyTables.
PyTables doesn't support DB 2.0 as such, but it does use record arrays
as buffers internally so as to read data in an efficient way (there
should be other databases that feature this, but I know PyTables best
;)

For this, I've used a modified version of a small benchmarking program
posted by Tim Hochberg in this same thread (it is listed at the end
of the message). Here are the results:

setup SQLite took 23.5661110878 seconds
retrieve SQLite took 3.26717996597 seconds
setup PyTables took 0.139157056808 seconds
retrieve PyTables took 0.13444685936 seconds

[SQLite results were obtained using an in-memory database, while
PyTables used an on-disk one. See the code.]

So, yes, if your access pattern exhibits a high degree of locality,
you can expect a huge difference in reading speed (more than 20x
for this example, but as this depends on the dataset size, it can be
even higher for larger datasets).

> On the other hand, the database access modules for all major
> databases, with DB 2.0 semicompliance, have already been written.
> This is not an insignificant amount of work. Writing our own
> interfaces for each of our favorite databases would require an
> equivalent amount of work.

That's true, but still feasible. However, before people start doing
this in a general way, it would help to first implement in Python
something like the numpy.ndarray object: this would standardize a
full-fledged heterogeneous buffer for doing intensive I/O tasks.

> I think a set of timing tests would be useful. I will try some
> using Oracle or postgres over the next few days. Perhaps
> you could do the same with mysql.

Well, here it is my own benchmark (admittedly trivial). Hope it helps
in your comparisons.
----------------------------------------------------------------------

import sqlite3, numpy as np, time, tables as pt, os, os.path

N = 500000
rndata = np.random.rand(2, N)
dtype = np.dtype([('x', float), ('y', float)])
data = np.empty(shape=N, dtype=dtype)
data['x'] = rndata[0]
data['y'] = rndata[1]

def setupSQLite(conn):
    c = conn.cursor()
    c.execute('''create table demo (x real, y real)''')
    c.executemany("""insert into demo values (?, ?)""", data)

def retrieveSQLite(conn):
    c = conn.cursor()
    c.execute('select * from demo')
    y = np.fromiter(c, dtype=dtype)
    return y

def setupPT(fileh):
    fileh.createTable('/', 'table', data)

def retrievePT(fileh):
    y = fileh.root.table[:]
    return y

# if os.path.exists('test.sql3'):
#     os.remove('test.sql3')
# conn = sqlite3.connect('test.sql3')
conn = sqlite3.connect(':memory:')

t0 = time.time()
setupSQLite(conn)
t1 = time.time()
print "setup SQLite took", t1-t0, "seconds"

t0 = time.time()
y1 = retrieveSQLite(conn)
t1 = time.time()
print "retrieve SQLite took", t1-t0, "seconds"
conn.close()

fileh = pt.openFile("test.h5", "w")

t0 = time.time()
setupPT(fileh)
t1 = time.time()
print "setup PyTables took", t1-t0, "seconds"

t0 = time.time()
y2 = retrievePT(fileh)
t1 = time.time()
print "retrieve PyTables took", t1-t0, "seconds"
fileh.close()

assert y1.shape == y2.shape
assert np.alltrue(y1 == y2)

--
>0,0<   Francesc Altet     http://www.carabos.com/
V  V    Cárabos Coop. V.   Enjoy Data
 "-"
|
From: David C. <da...@ar...> - 2006-11-16 03:55:17
|
David Cournapeau wrote: > Robert Kern wrote: > >> David Cournapeau wrote: >> >> >>> Hi, >>> >>> This is a bit OT, but I wasted quite some time on this time, when >>> using 64 bits integers and ctypes on ubuntu edgy. As I know other people >>> use ubuntu with numpy, this may save some headache to others. I found >>> this behaviour which looks like a bug in ctypes for python2.5 on edgy >>> ubuntu: >>> >>> python2.5 -c "from ctypes import sizeof, c_longlong; print >>> sizeof(c_longlong)" >>> >>> prints 4 instead of 8, which in my case is problematic for >>> structures alignement. This affects only python2.5, and does not affect >>> a python installed from sources. Can anybody else reproduce this ? >>> >>> >> Can you try a similar program in C compiled with the same C compiler that you >> used to build ctypes? sizeof(long long) does not have to be 8 bytes; it just has >> to be at least as large as sizeof(long). >> >> >> > I thought that ISO C99 required long long to be at least 64 bits, and > that gcc followed this by default: > Ok, I found the problem: python 2.5 is configured with the option --with-system-ffi. If I compile python2.5 original sources with this option, I have the same problem, so it looks like a ffi-related problem. I will investigate this, because this is really annoying, but this has nothing to do with python nor numpy anymore, cheers, David |
From: David C. <da...@ar...> - 2006-11-16 03:46:24
|
Robert Kern wrote: > David Cournapeau wrote: > >> Hi, >> >> This is a bit OT, but I wasted quite some time on this time, when >> using 64 bits integers and ctypes on ubuntu edgy. As I know other people >> use ubuntu with numpy, this may save some headache to others. I found >> this behaviour which looks like a bug in ctypes for python2.5 on edgy >> ubuntu: >> >> python2.5 -c "from ctypes import sizeof, c_longlong; print >> sizeof(c_longlong)" >> >> prints 4 instead of 8, which in my case is problematic for >> structures alignement. This affects only python2.5, and does not affect >> a python installed from sources. Can anybody else reproduce this ? >> > > Can you try a similar program in C compiled with the same C compiler that you > used to build ctypes? sizeof(long long) does not have to be 8 bytes; it just has > to be at least as large as sizeof(long). > > I thought that ISO C99 required long long to be at least 64 bits, and that gcc followed this by default: #include <stdio.h> int main(void) { printf("size of long long is %d\n", sizeof(long long)); return 0; } compiled by gcc with -W -Wall, returns 8. Gcc is edgy ubuntu, that is 4.1.2. I tried also with gcc 3.3 and 4.0, same result. Also, something I didn't say is that c_int64 is not available from ctypes; with python2.4, this is fine (I changed my code to use c_int64 instead: an import error is much easier to find than a structure alignement problem when using ctypes :) ). Also, I insist on this point, installing python from sources returns 8, as does python2.4 packaged by ubuntu (which is compiled by the exact same compiler according to python prompts). I am now trying to see if this is coming from configuration options (now, I know what a dual cpu is useful for: compiling python with make -j5 is really fast:) ), cheers, David |
From: Robert K. <rob...@gm...> - 2006-11-16 03:22:55
|
David Cournapeau wrote: > Hi, > > This is a bit OT, but I wasted quite some time on this time, when > using 64 bits integers and ctypes on ubuntu edgy. As I know other people > use ubuntu with numpy, this may save some headache to others. I found > this behaviour which looks like a bug in ctypes for python2.5 on edgy > ubuntu: > > python2.5 -c "from ctypes import sizeof, c_longlong; print > sizeof(c_longlong)" > > prints 4 instead of 8, which in my case is problematic for > structures alignement. This affects only python2.5, and does not affect > a python installed from sources. Can anybody else reproduce this ? Can you try a similar program in C compiled with the same C compiler that you used to build ctypes? sizeof(long long) does not have to be 8 bytes; it just has to be at least as large as sizeof(long). -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
From: David C. <da...@ar...> - 2006-11-16 02:40:14
|
Hi, This is a bit OT, but I wasted quite some time on this time, when using 64 bits integers and ctypes on ubuntu edgy. As I know other people use ubuntu with numpy, this may save some headache to others. I found this behaviour which looks like a bug in ctypes for python2.5 on edgy ubuntu: python2.5 -c "from ctypes import sizeof, c_longlong; print sizeof(c_longlong)" prints 4 instead of 8, which in my case is problematic for structures alignement. This affects only python2.5, and does not affect a python installed from sources. Can anybody else reproduce this ? cheers, David |
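A small sanity check along these lines, comparing the ctypes sizes with numpy's own 64-bit integer; the expected values in the comments assume a correctly built interpreter, whereas the buggy Ubuntu edgy python2.5 package reported 4 for c_longlong:

    import ctypes
    import numpy

    def int_sizes():
        # On a correct build, c_longlong is 8 bytes and matches numpy.int64.
        return {
            'ctypes c_long': ctypes.sizeof(ctypes.c_long),
            'ctypes c_longlong': ctypes.sizeof(ctypes.c_longlong),
            'numpy int64': numpy.dtype(numpy.int64).itemsize,
        }

    sizes = int_sizes()
    assert sizes['ctypes c_longlong'] >= sizes['ctypes c_long']
    assert sizes['ctypes c_longlong'] == 8   # fails on the broken package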
From: Pierre GM <pgm...@gm...> - 2006-11-16 02:01:21
|
On Wednesday 15 November 2006 12:55, Keith Goodman wrote: > I didn't know you could use masked arrays with matrices. I guess I > took the name literally. :) Please check the developer zone: http://projects.scipy.org/scipy/numpy/wiki/MaskedArray for an alternative implementation of masked arrays that support subclasses of ndarray. > I think an easier way to use masked arrays would be to introduce a new > thing called mis. > > I could make a regular matrix > > x = M.rand(3,3) > > and assign a missing value > > x[0,0] = M.mis > > x would then behave as a missing array matrix. .... > I think that would make missing arrays accessible to everyone. Well, there's already something like that, sort of: MA.masked, or MA.masked_singleton. The emphasis here is on "sort of". That works well if x is already a masked array. Else, a "ValueError: setting an array element with a sequence" is raised. I haven't tried to find where the problem comes from (ndarray.__setitem__ ? The masked_singleton larger than it seems ?), but I wonder whether it's an issue worth solving. If you want to get a masked_matrix from x, just type x=masked_array(x). You won't be able to access some specific matrix attributes (A, T), at least directly, but you can fill your masked_matrix and get a matrix back. And multiplication of two masked_matrices work as expected ! The main advantage of this approach is that we don't overload ndarray or matrices, the work is solely on the masked_array side. |
From: Mathew Y. <my...@jp...> - 2006-11-15 22:18:47
|
Robert Kern wrote: > Mathew Yeates wrote: > > >> def delta2day1(delta): >> return delta.days/365.0 >> deltas2days=numpy.frompyfunc(delta2day1,1,1) >> > > If I had to guess where the problem is, it's here. frompyfunc() and vectorize() > have always been tricky beasts to get right. > > It appears the problem is, in fact, with frompyfunc. I'm still running but I'm not seeing the immediate loss of memory as I was before. I replaced the frompyfunc with a simple loop. Thanks for all the help. Mathew |
From: Tim H. <tim...@ie...> - 2006-11-15 22:15:48
|
Robert Kern wrote:
> Mathew Yeates wrote:
> > def delta2day1(delta):
> >     return delta.days/365.0
> > deltas2days=numpy.frompyfunc(delta2day1,1,1)
>
> If I had to guess where the problem is, it's here. frompyfunc() and vectorize()
> have always been tricky beasts to get right.

<curmudgeon mode>
IMO, frompyfunc is an attractive nuisance. It doesn't magically make scalar Python functions fast, as people seem to assume, and it prevents people from figuring out how to write vectorized functions in the many cases where that's practicable. And it sounds like it may be buggy to boot.

In this case, I don't know that you can easily vectorize this by hand, but there are many ways that it could be rewritten to avoid frompyfunc. For example:

    def deltas2days(seq):
        return numpy.fromiter((x.days for x in seq), dtype=float, count=len(seq))

One line shorter, about equally opaque and less likely to have mysterious bugs.
</curmudgeon mode>

-tim
|
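For contrast, here are the two idioms side by side, using datetime.timedelta objects as a stand-in for the delta objects in Mathew's code (that substitution is an assumption; anything with a .days attribute behaves the same way):

    import datetime
    import numpy

    deltas = [datetime.timedelta(days=d) for d in range(10)]

    # The frompyfunc route from the original code: element-wise Python calls,
    # with the result coming back as an object array.
    delta2day1 = lambda delta: delta.days / 365.0
    deltas2days_ufunc = numpy.frompyfunc(delta2day1, 1, 1)
    years_obj = deltas2days_ufunc(deltas)            # dtype=object

    # The fromiter route suggested above: the same per-element work, but the
    # values land directly in a float64 array with no object intermediaries.
    years = numpy.fromiter((d.days / 365.0 for d in deltas),
                           dtype=float, count=len(deltas))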
From: Robert K. <rob...@gm...> - 2006-11-15 21:42:41
|
Mathew Yeates wrote: > def delta2day1(delta): > return delta.days/365.0 > deltas2days=numpy.frompyfunc(delta2day1,1,1) If I had to guess where the problem is, it's here. frompyfunc() and vectorize() have always been tricky beasts to get right. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
From: Mathew Y. <my...@jp...> - 2006-11-15 21:35:34
|
Stefan van der Walt wrote:
> On Wed, Nov 15, 2006 at 02:33:52PM -0600, Robert Kern wrote:
>> Mathew Yeates wrote:
>>> Hi
>>> I'm running a 64 bit Python 2.5 on an x86 with Solaris. I have a
>>> function I call over 2^32 times and eventually I run out of memory.
>>>
>>> The function is
>>> def make_B(deltadates):
>>>     numcols = deltadates.shape[0]
>>>     B = numpy.zeros((numcols, numcols))
>>>     for ind in range(0, numcols):  # comment out this loop and all is good
>>>         B[ind, 0:numcols] = deltadates[0:numcols]
>>>     return B
>>>
>>> If I comment out the loop lines, my memory is okay. I'm guessing that a
>>> reference is being added to "deltadates" and that the reference count is
>>> going above 2^32 and resetting. Anybody have any ideas about how I can
>>> cure this? Is Numpy increasing the reference count here?
>>
>> Can you give us a small but complete and self-contained script that demonstrates
>> the problem?
>
> I think this might be related to ticket #378:
>
> http://projects.scipy.org/scipy/numpy/ticket/378
>
> Cheers
> Stéfan

Okay. Attached is the smallest program I could make. Before running you
will need to create a file named biggie with 669009000 non-zero floats.

Mathew
|
From: Stefan v. d. W. <st...@su...> - 2006-11-15 20:52:20
|
On Wed, Nov 15, 2006 at 02:33:52PM -0600, Robert Kern wrote:
> Mathew Yeates wrote:
> > Hi
> > I'm running a 64 bit Python 2.5 on an x86 with Solaris. I have a
> > function I call over 2^32 times and eventually I run out of memory.
> >
> > The function is
> > def make_B(deltadates):
> >     numcols = deltadates.shape[0]
> >     B = numpy.zeros((numcols, numcols))
> >     for ind in range(0, numcols):  # comment out this loop and all is good
> >         B[ind, 0:numcols] = deltadates[0:numcols]
> >     return B
> >
> > If I comment out the loop lines, my memory is okay. I'm guessing that a
> > reference is being added to "deltadates" and that the reference count is
> > going above 2^32 and resetting. Anybody have any ideas about how I can
> > cure this? Is Numpy increasing the reference count here?
>
> Can you give us a small but complete and self-contained script that demonstrates
> the problem?

I think this might be related to ticket #378:

http://projects.scipy.org/scipy/numpy/ticket/378

Cheers
Stéfan
|
From: Mathew Y. <my...@jp...> - 2006-11-15 20:44:15
|
Robert Kern wrote: > Mathew Yeates wrote: > >> Hi >> I'm running a 64 bit Python 2.5 on an x86 with Solaris. I have a >> function I call over 2^32 times and eventually I run out of memory. >> >> The function is >> def make_B(deltadates): >> numcols=deltadates.shape[0] >> B=numpy.zeros((numcols,numcols)) >> for ind in range(0,numcols): #comment out this loop and all is good >> B[ind,0:numcols] = deltadates[0:numcols] >> return B >> >> >> If I comment out the loop lines, my memory is okay. I'm guessing that a >> reference is being added to "deltadates" and that the reference count is >> going above 2^32 and reseting. Anybody have any ideas about how I can >> cure this? Is Numpy increasing the reference count here? >> > > Can you give us a small but complete and self-contained script that demonstrates > the problem? > > I'll try. But its in a complex program. BTW - I tried B[ind,0:numcols] = deltadates[0:numcols].copy() but that didn't work either. Mathew |
From: Robert K. <rob...@gm...> - 2006-11-15 20:34:53
|
Mathew Yeates wrote: > Hi > I'm running a 64 bit Python 2.5 on an x86 with Solaris. I have a > function I call over 2^32 times and eventually I run out of memory. > > The function is > def make_B(deltadates): > numcols=deltadates.shape[0] > B=numpy.zeros((numcols,numcols)) > for ind in range(0,numcols): #comment out this loop and all is good > B[ind,0:numcols] = deltadates[0:numcols] > return B > > > If I comment out the loop lines, my memory is okay. I'm guessing that a > reference is being added to "deltadates" and that the reference count is > going above 2^32 and reseting. Anybody have any ideas about how I can > cure this? Is Numpy increasing the reference count here? Can you give us a small but complete and self-contained script that demonstrates the problem? -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco |
From: Mathew Y. <my...@jp...> - 2006-11-15 20:28:43
|
Hi,
I'm running a 64 bit Python 2.5 on an x86 with Solaris. I have a function I call over 2^32 times and eventually I run out of memory.

The function is

    def make_B(deltadates):
        numcols = deltadates.shape[0]
        B = numpy.zeros((numcols, numcols))
        for ind in range(0, numcols):  # comment out this loop and all is good
            B[ind, 0:numcols] = deltadates[0:numcols]
        return B

If I comment out the loop lines, my memory is okay. I'm guessing that a reference is being added to "deltadates" and that the reference count is going above 2^32 and resetting. Anybody have any ideas about how I can cure this? Is Numpy increasing the reference count here?

Mathew
|
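One possible way to avoid the per-iteration reference traffic entirely is to let broadcasting fill all the rows in a single assignment; this sketch assumes, as in the function above, that deltadates is a 1-D array of length numcols:

    import numpy

    def make_B(deltadates):
        # Every row of B is the same 1-D slice, so broadcasting can do the
        # whole fill in one C-level assignment with no Python-level loop.
        numcols = deltadates.shape[0]
        B = numpy.zeros((numcols, numcols))
        B[:] = deltadates[0:numcols]
        return B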
From: Keith G. <kwg...@gm...> - 2006-11-15 17:55:42
|
On 11/12/06, Pierre GM <pgm...@gm...> wrote: > On Sunday 12 November 2006 17:08, A. M. Archibald wrote: > > On 12/11/06, Keith Goodman <kwg...@gm...> wrote: > > > Is anybody interested in making x.max() and nanmax() behave the same > > > for matrices, except for the NaN part? That is, make > > > numpy.matlib.nanmax return a matrix instead of an array. > > Or, you could use masked arrays... In the new implementation, you can add a > mask to a subclassed array (such as matrix) to get a regular masked array. If > you fill this masked array, you get an array of the same subclass. > > >>> import numpy as N > >>> import numpy.matlib as M > >>> import maskedarray as MA > >>> x=M.rand(3,3) > >>> assert isinstance(x.max(0), M.matrix) > >>> assert isinstance(N.max(x,0), M.matrix) > >>> assert isinstance(MA.max(x,0).filled(0), M.matrix) > >>> assert isinstance(MA.max(x,0)._data, M.matrix) > > >>> x[-1,-1] = N.nan > >>> tmp = MA.max(MA.array(x,mask=N.isnan(x)), 0) > >>> assert (tmp == N.nanmax(x,0)).all() > >>> assert isinstance(tmp.filled(0), M.matrix) I didn't know you could use masked arrays with matrices. I guess I took the name literally. I think an easier way to use masked arrays would be to introduce a new thing called mis. I could make a regular matrix x = M.rand(3,3) and assign a missing value x[0,0] = M.mis x would then behave as a missing array matrix. I could also do x[M.isnan(x)] = M.mis or x[mask] = M.mis To get the mask from x: x.mask or M.ismis(x) I think that would make missing arrays accessible to everyone. |
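For reference, the closest existing spelling of the proposed "x[0,0] = M.mis" is to mask the NaN positions explicitly, as in Pierre's quoted snippet; the maskedarray import refers to the reimplementation he points to in the developer zone, so its availability is an assumption:

    import numpy as N
    import numpy.matlib as M
    import maskedarray as MA   # Pierre's reimplementation, per his example

    x = M.rand(3, 3)
    x[0, 0] = N.nan

    xm = MA.array(x, mask=N.isnan(x))    # explicit form of "x[0, 0] = mis"
    col_max = MA.max(xm, 0).filled(0)    # NaN ignored; filling returns a matrix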
From: Tim H. <tim...@ie...> - 2006-11-15 04:56:20
|
Tim Hochberg wrote: > [CHOP] > > The timings of these are pretty consistent with each other with the > previous runs except that the difference between retrieve1 and retrieve2 > has disappeared. In fact, all of the runs that produce lists have gotten > faster by about the same amount.. Odd! A little digging reveals that > timeit turns off garbage collection to make things more repeatable. > Turning gc back on yields the following numbers for repeat(3,1): > > retrieve1 [0.92517736192728406, 0.92109667569481601, > 0.92390960303614023] > retrieve2 [1.3018456256311914, 1.2277141368525903, 1.2929785768861706] > retrieve3 [1.5309831277438946, 1.4998853206203577, 1.5601200711263488] > retrieve4 [8.6400394463542227, 8.7022300320292061, 8.6807761880350682] > > So there we are, back to our original numbers. This also reveals that > the majority of the time difference between retrieve1 and retrieve2 *is* > memory related. However, it's the deallocation (or more precisely > garbage collection) of all those floats that is the killer. I just realized that this sounds sort of misleading. In both cases a million floats are allocated and deallocated. However, in retrieve1 only two of those million are alive at any one time, so Python will just keep reusing the same two chunks of memory for all 500,000 pairs (ditto for the 500,000 tuples that are created). In the other cases, all million floats will be alive at once, requiring much more memory and possibly swapping to disk. Unsurprisingly, the second case is slower, but the details aren't clear. In particular why is it the deallocation that is slow? Another mystery is why gc matters at all. None of the obvious actors are involved in cycles so they would normally go away due to reference counting even with gc turned off. My rather uninformed guess is that the cursor or the connection holds onto the list (caching it for later perhaps) and that cursor/connection is involved in some sort of cycle. This would keep the list alive until the gc ran. -tim [CHOP] |
From: Erin S. <eri...@gm...> - 2006-11-15 04:47:09
|
On 11/14/06, Tim Hochberg <tim...@ie...> wrote: SNIP > > Interesting results Tim. From Pierre's results > > we saw that fromiter is the fastest way to get data > > into arrays. With your results we see there is a > > difference between iterating over the cursor and > > doing a fetchall() as well. Surprisingly, running > > the cursor is faster. > > > > This must come not from the data retrieval rate but > > from creating the copies in memory. > I imagine that is correct. In particular, skipping the making of the > list avoids the creation of 1e6 Python floats, which is going to result > in a lot of memory allocation. > > > But just in case > > I think there is one more thing to check. > > I haven't used sqlite, but with other databases I have > > used there is often a large variance in times from > > one select to the next. Can you > > repeat these tests with a timeit().repeat and give the > > minimum? > > > Sure. Here's two sets of numbers. The first is for repeat(3,1) and the > second for repeat (3,3). > > retrieve1 [0.91198546183942375, 0.9042411814909439, 0.90411518782415001] > retrieve2 [0.98355349632425515, 0.95424502276127754, > 0.94714328217692412] > retrieve3 [1.2227562441595268, 1.2195848913758596, 1.2206193803961156] > retrieve4 [8.4344040932576547, 8.3556245276983532, 8.3568341786456131] > > retrieve1 [2.7317457945074026, 2.7274656415829384, 2.7250913174719109] > retrieve2 [2.8857103346933783, 2.8379299603720582, 2.8386803350705136] > retrieve3 [3.6870535221655203, 3.8980253076857565, 3.7002303365371887] > retrieve4 [25.138646950939304, 25.06737169109482, 25.052789390830412] > > The timings of these are pretty consistent with each other with the > previous runs except that the difference between retrieve1 and retrieve2 > has disappeared. In fact, all of the runs that produce lists have gotten > faster by about the same amount.. Odd! A little digging reveals that > timeit turns off garbage collection to make things more repeatable. > Turning gc back on yields the following numbers for repeat(3,1): > > retrieve1 [0.92517736192728406, 0.92109667569481601, > 0.92390960303614023] > retrieve2 [1.3018456256311914, 1.2277141368525903, 1.2929785768861706] > retrieve3 [1.5309831277438946, 1.4998853206203577, 1.5601200711263488] > retrieve4 [8.6400394463542227, 8.7022300320292061, 8.6807761880350682] > > So there we are, back to our original numbers. This also reveals that > the majority of the time difference between retrieve1 and retrieve2 *is* > memory related. However, it's the deallocation (or more precisely > garbage collection) of all those floats that is the killer. Here's what > the timeit routines looked like: > > if __name__ == "__main__": > for name in ['retrieve1', 'retrieve2', 'retrieve3', 'retrieve4']: > print name, timeit.Timer("%s(conn)" % name, "gc.enable(); > from scratch import sqlite3, %s, setup; conn = > sqlite3.connect(':memory:'); setup(conn)" % name).repeat(3, 1) > > > As an aside, your database is running on a local disk, right, so > > the overehead of retrieving data is minimized here? > > For my tests I think I am data retrieval limited because I > > get exactly the same time for the equivalent of retrieve1 > > and retrieve2. > > > As Keith pointed out, I'm keeping the database in memory (although > there's a very good chance some of it is actually swapped to disk) so > it's probably relatively fast. On the other hand, if you are using > timeit to make your measurements you could be running into the (lack of) > garbage collection issue I mention above. 
> I checked and for my real situation I am totally limited by the time to retrieve the data. From these tests I think this will probably be true even if the data is on a local disk. I think these experiments show that iterating over the cursor is the best approach. It is better from a memory point of view and is probably also the fastest. We should still resolve the slowness for the array() function however when converting lists of tuples. I will file a ticket if no one else has. Erin |
From: Tim H. <tim...@ie...> - 2006-11-15 04:14:33
|
Erin Sheldon wrote: > On 11/14/06, Tim Hochberg <tim...@ie...> wrote: > >> Tim Hochberg wrote: >> >>> [SNIP] >>> >>> I'm no database user, but a glance at the at the docs seems to indicate >>> that you can get your data via an iterator (by iterating over the cursor >>> or some such db mumbo jumbo) rather than slurping up the whole list up >>> at once. If so, then you'll save a lot of memory by passing the iterator >>> straight to fromiter. It may even be faster, who knows. >>> >>> Accessing the db via the iterator could be a performance killer, but >>> it's almost certainly worth trying as it could a few megabytes of >>> storage and that in turn might speed things up. >>> >> Assuming that I didn't mess this up too badly, it appears that using the >> iterator directly with fromiter is significantly faster than the next >> best solution (about 45%). The fromiter wrapping a list solution come in >> second, followed by numarray.array and finally way in the back, >> numpy.array. Here's the numbers: >> >> retrieve1 took 0.902922857514 seconds >> retrieve2 took 1.31245870634 seconds >> retrieve3 took 1.51207569677 seconds >> retrieve4 took 8.71539930354 seconds >> > > Interesting results Tim. From Pierre's results > we saw that fromiter is the fastest way to get data > into arrays. With your results we see there is a > difference between iterating over the cursor and > doing a fetchall() as well. Surprisingly, running > the cursor is faster. > > This must come not from the data retrieval rate but > from creating the copies in memory. I imagine that is correct. In particular, skipping the making of the list avoids the creation of 1e6 Python floats, which is going to result in a lot of memory allocation. > But just in case > I think there is one more thing to check. > I haven't used sqlite, but with other databases I have > used there is often a large variance in times from > one select to the next. Can you > repeat these tests with a timeit().repeat and give the > minimum? > Sure. Here's two sets of numbers. The first is for repeat(3,1) and the second for repeat (3,3). retrieve1 [0.91198546183942375, 0.9042411814909439, 0.90411518782415001] retrieve2 [0.98355349632425515, 0.95424502276127754, 0.94714328217692412] retrieve3 [1.2227562441595268, 1.2195848913758596, 1.2206193803961156] retrieve4 [8.4344040932576547, 8.3556245276983532, 8.3568341786456131] retrieve1 [2.7317457945074026, 2.7274656415829384, 2.7250913174719109] retrieve2 [2.8857103346933783, 2.8379299603720582, 2.8386803350705136] retrieve3 [3.6870535221655203, 3.8980253076857565, 3.7002303365371887] retrieve4 [25.138646950939304, 25.06737169109482, 25.052789390830412] The timings of these are pretty consistent with each other with the previous runs except that the difference between retrieve1 and retrieve2 has disappeared. In fact, all of the runs that produce lists have gotten faster by about the same amount.. Odd! A little digging reveals that timeit turns off garbage collection to make things more repeatable. Turning gc back on yields the following numbers for repeat(3,1): retrieve1 [0.92517736192728406, 0.92109667569481601, 0.92390960303614023] retrieve2 [1.3018456256311914, 1.2277141368525903, 1.2929785768861706] retrieve3 [1.5309831277438946, 1.4998853206203577, 1.5601200711263488] retrieve4 [8.6400394463542227, 8.7022300320292061, 8.6807761880350682] So there we are, back to our original numbers. This also reveals that the majority of the time difference between retrieve1 and retrieve2 *is* memory related. 
However, it's the deallocation (or more precisely garbage collection) of all those floats that is the killer. Here's what the timeit routines looked like: if __name__ == "__main__": for name in ['retrieve1', 'retrieve2', 'retrieve3', 'retrieve4']: print name, timeit.Timer("%s(conn)" % name, "gc.enable(); from scratch import sqlite3, %s, setup; conn = sqlite3.connect(':memory:'); setup(conn)" % name).repeat(3, 1) > As an aside, your database is running on a local disk, right, so > the overehead of retrieving data is minimized here? > For my tests I think I am data retrieval limited because I > get exactly the same time for the equivalent of retrieve1 > and retrieve2. > As Keith pointed out, I'm keeping the database in memory (although there's a very good chance some of it is actually swapped to disk) so it's probably relatively fast. On the other hand, if you are using timeit to make your measurements you could be running into the (lack of) garbage collection issue I mention above. -tim |
From: David C. <da...@ar...> - 2006-11-15 02:51:36
|
Kenny Ortmann wrote:
> I am sorry if this has come up before, I've found some stuff on it but just
> want to be clear.
> What are the latest versions of numpy, matplotlib and scipy that work
> together?
> I have seen that numpy 1.0rc2 works with the latest scipy until 0.5.2 comes
> out.
> Now with numpy and matplotlib? I saw that matplotlib 87.5 works with, I
> think, the 1.0rc2 also, but I read that mpl was going to release right after
> numpy 1.0 was released so that they would be compatible.
> I am just trying to upgrade these packages before I create an executable of
> the program I am working on, and I am running into problems with
> "from matplotlib._ns_nxutils import *
> ImportError: numpy.core.multiarray failed to import"

It looks like you have a problem with numpy. First, I would try importing numpy from a python prompt, to see what's wrong. I would also check that numpy is set as the array package in the matplotlibrc file.

Second, I would check that I installed everything "cleanly", that is:
- first, remove all packages (numpy, scipy and mpl) in the site-packages directory
- remove the build directory in each package
- then build from scratch and test each package: numpy first, then mpl, then scipy.

Concerning versions, I regularly rebuild scipy and numpy from SVN (but always test them with import package; package.test(100) to test everything), and use the latest release of mpl (0.87.7 as we speak), on linux, without any problem.

cheers,

David
|