From: Antonio V. <ant...@ti...> - 2013-04-24 19:25:48
Hi Matt,

On 24/04/2013 21:09, Matt Terry wrote:
> Hello,
>
> The source tarball for pytables 2.4 on sourceforge appears to be broken.
> The file size is suspiciously small (800 kB vs 8.5 MB on PyPI), the tarball
> doesn't untar, and the md5 doesn't match.
>
> -matt

Thanks for reporting. It should be fixed now.

ciao

--
Antonio Valentino
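Since the symptom above is an md5 mismatch, a minimal verification sketch may be useful; this is an editorial illustration rather than code from the thread, and the filename and expected digest are placeholders:

    import hashlib

    def md5sum(path, chunk=1 << 20):
        # Hash the file in 1 MB blocks so large tarballs need not fit in memory.
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for block in iter(lambda: f.read(chunk), b''):
                h.update(block)
        return h.hexdigest()

    expected = '<md5 published with the release>'  # placeholder value
    print(md5sum('tables-2.4.0.tar.gz') == expected)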
From: Anthony S. <sc...@gm...> - 2013-04-24 19:23:30
Hey Matt, is this related? https://github.com/PyTables/PyTables/issues/223

Be Well
Anthony

On Wed, Apr 24, 2013 at 3:09 PM, Matt Terry <mat...@gm...> wrote:
> Hello,
>
> The source tarball for pytables 2.4 on sourceforge appears to be broken.
> The file size is suspiciously small (800 kB vs 8.5 MB on PyPI), the tarball
> doesn't untar, and the md5 doesn't match.
>
> -matt
From: Matt T. <mat...@gm...> - 2013-04-24 19:09:37
Hello,

The source tarball for pytables 2.4 on sourceforge appears to be broken. The
file size is suspiciously small (800 kB vs 8.5 MB on PyPI), the tarball
doesn't untar, and the md5 doesn't match.

-matt
From: Anthony S. <sc...@gm...> - 2013-04-23 00:20:31
Hello Thadeus,

Thanks for posting this PR! Once it is fixed for Python 3, we'd love to see
it merged in.

Be Well
Anthony

On Mon, Apr 22, 2013 at 2:52 PM, Thadeus Burgess <tha...@th...> wrote:
> Hopefully this pull request can be included in the next version? It is
> keeping us from using the CSI functionality of PyTables.
>
> https://github.com/PyTables/PyTables/pull/238
>
> --
> Thadeus
>
> On Tue, Apr 16, 2013 at 4:10 PM, Anthony Scopatz <sc...@gm...> wrote:
>> Hello PyTables Users,
>>
>> To let you know, we are hoping to do a PyTables 3.0-beta release here in
>> the next week or two. This will include the long awaited Python 3 support,
>> thanks to the heroic efforts of Antonio Valentino, who did the lion's
>> share of the porting work for both PyTables AND one of our dependencies,
>> numexpr.
>>
>> However, to really make this release the best possible, we are asking for
>> your help in cleaning up and closing some of the remaining issues. You
>> can see our list of open issues for this release here [1]. You can also
>> see our todo list for this release here [2].
>>
>> *If you have a feature that you'd really love to see make it into the
>> code base, now is the time to implement it.* If you have always wanted to
>> contribute, but weren't sure how to get going, please fork the repo on
>> github and then issue a pull request. If you have any questions about
>> this process feel free to ask in this thread.
>>
>> Here is to a great next release!
>> The PyTables Developers
>>
>> 1. https://github.com/PyTables/PyTables/issues?milestone=4&state=open
>> 2. https://github.com/PyTables/PyTables/wiki/NextReleaseTodo
From: Thadeus B. <tha...@th...> - 2013-04-22 19:53:20
Hopefully this pull request can be included in the next version? It is
keeping us from using the CSI functionality of PyTables.

https://github.com/PyTables/PyTables/pull/238

--
Thadeus

On Tue, Apr 16, 2013 at 4:10 PM, Anthony Scopatz <sc...@gm...> wrote:
> Hello PyTables Users,
>
> To let you know, we are hoping to do a PyTables 3.0-beta release here in
> the next week or two. This will include the long awaited Python 3 support,
> thanks to the heroic efforts of Antonio Valentino, who did the lion's share
> of the porting work for both PyTables AND one of our dependencies, numexpr.
>
> However, to really make this release the best possible, we are asking for
> your help in cleaning up and closing some of the remaining issues. You can
> see our list of open issues for this release here [1]. You can also see
> our todo list for this release here [2].
>
> *If you have a feature that you'd really love to see make it into the code
> base, now is the time to implement it.* If you have always wanted to
> contribute, but weren't sure how to get going, please fork the repo on
> github and then issue a pull request. If you have any questions about this
> process feel free to ask in this thread.
>
> Here is to a great next release!
> The PyTables Developers
>
> 1. https://github.com/PyTables/PyTables/issues?milestone=4&state=open
> 2. https://github.com/PyTables/PyTables/wiki/NextReleaseTodo
From: Antonio V. <ant...@ti...> - 2013-04-22 18:01:57
Hi Gaëtan,

On 22/04/2013 13:09, Gaëtan de Menten wrote:
> Hello all,
>
> TL;DR: It would be nice to have online documentation for stable versions
> and have pytables.github.io point to the doc for the latest stable release
> by default.
> ====
>
> I just tried to use the new out= argument to table.read, only to find out
> it did not work in my version (2.3.1).

Please file a bug report for this. The out= parameter should be marked as
".. versionadded:: 3.0".

> Then I tried to update my version to 2.4 since I thought it was
> implemented in that version because of the "2.4.0+1.dev" name at the top
> of the page, which I thought meant "dev version leading to 2.4", or maybe
> to "2.4.1", but certainly not the next major release. I got even more
> confused because, after the initial failure with my 2.3.1 release, I
> checked the release notes... which I thought were for 2.4 because the
> title of the "release notes" page is "Release notes for PyTables 2.4
> series" when it is in fact for the next major version...

Gaëtan, we follow the guidelines described in [1] for managing the
development cycle. We choose a version number only when it is actually time
to make a release. VERSION+1.dev just means "the development version (.dev)
of the next stable release".

> Here are a couple of suggestions:
> * doc for stable releases (default to latest stable), bonus points for
>   being able to switch easily from one version to another, a la Python
>   stdlib.

I agree; in the future we will pay more attention to publishing docs for the
latest stable release on the main site.

> * change 2.4.0+1.dev to 3.0-dev or 3.0-pre, and all mentions of 2.4.x
> * have new arguments to functions documented in the docstring for the
>   functions (like in the Python stdlib): "new in pytables 3.0" in the
>   docstring for table.read() would have worked wonders.

Please see the comments above.

best regards

[1] http://nvie.com/posts/a-successful-git-branching-model

--
Antonio Valentino
From: Anthony S. <sc...@gm...> - 2013-04-22 14:37:12
Hello Gaëtan,

Thanks for bringing this up. I think that older versions of the docs are a
fairly important thing to have, so I have opened an issue for this on github
[1]. However, I doubt that I will have an opportunity to take care of this
in the short term. So if you want to take care of this issue for the benefit
of yourself and all, I would love to see a pull request ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/236

On Mon, Apr 22, 2013 at 6:09 AM, Gaëtan de Menten <gde...@gm...> wrote:
> Hello all,
>
> TL;DR: It would be nice to have online documentation for stable versions
> and have pytables.github.io point to the doc for the latest stable release
> by default.
> ====
>
> I just tried to use the new out= argument to table.read, only to find out
> it did not work in my version (2.3.1). Then I tried to update my version
> to 2.4 since I thought it was implemented in that version because of the
> "2.4.0+1.dev" name at the top of the page, which I thought meant "dev
> version leading to 2.4", or maybe to "2.4.1", but certainly not the next
> major release. I got even more confused because, after the initial failure
> with my 2.3.1 release, I checked the release notes... which I thought were
> for 2.4 because the title of the "release notes" page is "Release notes
> for PyTables 2.4 series" when it is in fact for the next major version...
>
> Here are a couple of suggestions:
> * doc for stable releases (default to latest stable), bonus points for
>   being able to switch easily from one version to another, a la Python
>   stdlib.
> * change 2.4.0+1.dev to 3.0-dev or 3.0-pre, and all mentions of 2.4.x
> * have new arguments to functions documented in the docstring for the
>   functions (like in the Python stdlib): "new in pytables 3.0" in the
>   docstring for table.read() would have worked wonders.
>
> Thanks in advance,
> --
> Gaëtan de Menten
From: Gaëtan de M. <gde...@gm...> - 2013-04-22 11:09:32
Hello all,

TL;DR: It would be nice to have online documentation for stable versions and
have pytables.github.io point to the doc for the latest stable release by
default.
====

I just tried to use the new out= argument to table.read, only to find out it
did not work in my version (2.3.1). Then I tried to update my version to 2.4
since I thought it was implemented in that version because of the
"2.4.0+1.dev" name at the top of the page, which I thought meant "dev
version leading to 2.4", or maybe to "2.4.1", but certainly not the next
major release. I got even more confused because, after the initial failure
with my 2.3.1 release, I checked the release notes... which I thought were
for 2.4 because the title of the "release notes" page is "Release notes for
PyTables 2.4 series" when it is in fact for the next major version...

Here are a couple of suggestions:
* doc for stable releases (default to latest stable), bonus points for being
  able to switch easily from one version to another, a la Python stdlib.
* change 2.4.0+1.dev to 3.0-dev or 3.0-pre, and all mentions of 2.4.x
* have new arguments to functions documented in the docstring for the
  functions (like in the Python stdlib): "new in pytables 3.0" in the
  docstring for table.read() would have worked wonders.

Thanks in advance,
--
Gaëtan de Menten
From: Gaëtan de M. <gde...@gm...> - 2013-04-22 10:39:31
I just managed to compile PyTables "dev" on Windows. I first tried with the
pre-built binaries of HDF5 1.8.10, and it failed to build because of a
missing "stdint". I only realized later that the pre-built binaries are
meant for VS10. Given that we are stuck with Visual Studio 9 (2008) for
Python 2.7 extensions, I might not be the only one stumbling on this issue,
and it might be worth directing users to either use 1.8.9 or build HDF5
1.8.10+ from source (I did not try myself, but VS9 seems to still be
supported through cmake).

I hope this helps someone,
--
Gaëtan de Menten
From: Francesc A. <fa...@gm...> - 2013-04-22 10:12:18
On 4/22/13 8:11 AM, Antonio Valentino wrote:
> Hi Francesc,
>
> On 21/04/2013 21:46, Francesc Alted wrote:
>> Hi,
>>
>> I'm happy to announce the availability of Blosc 1.2.0 RC1. It exists
>> currently just as a tag in the github repo
>> (https://github.com/FrancescAlted/blosc), so you can fetch it as:
>>
>> https://github.com/FrancescAlted/blosc/archive/v1.2.0-rc1.tar.gz
>>
>> If everything goes well, I plan to do an official release in a week or so.
>>
>> Enjoy data!
>>
>> Francesc
>
> thank you Francesc.
> I have just updated the blosc sources included in PyTables.

Excellent. I am having problems compiling PyTables on Windows on my new
laptop, and I am also getting some problems running the tests included in
Blosc 1.2 RC1 on Windows. Including Blosc in PyTables will allow Jenkins to
try it on Windows. Let me know how it goes, please.

Ciao,
--
Francesc Alted
From: Antonio V. <ant...@ti...> - 2013-04-22 06:37:59
Hi Francesc,

On 21/04/2013 21:46, Francesc Alted wrote:
> Hi,
>
> I'm happy to announce the availability of Blosc 1.2.0 RC1. It exists
> currently just as a tag in the github repo
> (https://github.com/FrancescAlted/blosc), so you can fetch it as:
>
> https://github.com/FrancescAlted/blosc/archive/v1.2.0-rc1.tar.gz
>
> If everything goes well, I plan to do an official release in a week or so.
>
> Enjoy data!
>
> Francesc

Thank you Francesc. I have just updated the blosc sources included in
PyTables.

ciao
--
Antonio Valentino
From: Francesc A. <fa...@gm...> - 2013-04-21 19:46:30
Hi,

I'm happy to announce the availability of Blosc 1.2.0 RC1. It exists
currently just as a tag in the github repo
(https://github.com/FrancescAlted/blosc), so you can fetch it as:

https://github.com/FrancescAlted/blosc/archive/v1.2.0-rc1.tar.gz

If everything goes well, I plan to do an official release in a week or so.

Enjoy data!

Francesc

===============================================================
 Announcing Blosc 1.2.0 RC1
 A blocking, shuffling and lossless compression library
===============================================================

What is new?
============

The most important features for this release are support for cmake (tested
on Linux, Mac OSX and Windows) and thread-safe calls of Blosc functions from
threaded apps.

Many thanks to those who contributed to this release: Thibault North,
Antonio Valentino, Mark Wiebe and Valentin Haenel.

For more info, please see the release notes in:

https://github.com/FrancescAlted/blosc/wiki/Release-notes

What is it?
===========

Blosc (http://www.blosc.org) is a high performance compressor optimized for
binary data. It has been designed to transmit data to the processor cache
faster than the traditional, non-compressed, direct memory fetch approach
via a memcpy() OS call. Blosc is the first compressor (that I'm aware of)
that is meant not only to reduce the size of large datasets on-disk or
in-memory, but also to accelerate object manipulations that are
memory-bound.

There is also a handy command line tool for Blosc called Bloscpack
(https://github.com/esc/bloscpack) that allows you to compress large binary
datafiles on-disk. Although the format for Bloscpack has not stabilized yet,
it allows you to effectively use Blosc from your favorite shell.

Download sources
================

For more details on what it is, please go to the main web site:

http://www.blosc.org/

The github repository is over here:

https://github.com/FrancescAlted/blosc

Blosc is distributed using the MIT license; see LICENSES/BLOSC.txt for
details.

Mailing list
============

There is an official Blosc mailing list at:

bl...@go...
http://groups.google.es/group/blosc

--
Francesc Alted
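As a usage note for PyTables readers: Blosc is exposed through the library's
Filters class, as in the snippet Shyam posts further down this page. A
minimal self-contained sketch follows (2.x camelCase API; file and node
names are illustrative, not from the announcement):

    import numpy as np
    import tables as tb

    h5 = tb.openFile('compressed.h5', mode='w')
    # complevel/complib mirror the Filters(5, 'blosc') settings used below.
    filters = tb.Filters(complevel=5, complib='blosc')
    carr = h5.createCArray(h5.root, 'data', tb.Float64Atom(), (1000, 1000),
                           filters=filters)
    carr[:] = np.random.random((1000, 1000))  # written compressed with Blosc
    h5.close()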
From: Anthony S. <sc...@gm...> - 2013-04-16 21:23:55
Hello PyTables Users,

To let you know, we are hoping to do a PyTables 3.0-beta release here in the
next week or two. This will include the long awaited Python 3 support,
thanks to the heroic efforts of Antonio Valentino, who did the lion's share
of the porting work for both PyTables AND one of our dependencies, numexpr.

However, to really make this release the best possible, we are asking for
your help in cleaning up and closing some of the remaining issues. You can
see our list of open issues for this release here [1]. You can also see our
todo list for this release here [2].

*If you have a feature that you'd really love to see make it into the code
base, now is the time to implement it.* If you have always wanted to
contribute, but weren't sure how to get going, please fork the repo on
github and then issue a pull request. If you have any questions about this
process feel free to ask in this thread.

Here is to a great next release!
The PyTables Developers

1. https://github.com/PyTables/PyTables/issues?milestone=4&state=open
2. https://github.com/PyTables/PyTables/wiki/NextReleaseTodo
From: Anthony S. <sc...@gm...> - 2013-04-16 03:58:11
Hello Shyam,

Can you please post the full traceback? In any event, I am fairly certain
that this error is coming from the np.fromiter step. The problem is that you
are trying to read your entire SQL query into a single numpy array in
memory. This is impossible because you don't have enough RAM. Therefore, you
are going to need to read and write in chunks. Something like the following:

    def getDataAndWriteHDF5(table):
        databaseConn = pyodbc.connect(<connection string>, <password>)
        cursor = databaseConn.cursor()
        cursor.execute("SQL Query")
        dt = np.dtype([('name', np.str_, 180), ('address', np.str_, 4200),
                       ('email', np.str_, 180), ('phone', np.str_, 256)])
        citer = iter(cursor)
        chunksize = 4096  # This is just a guess; other values might work better
        crange = range(chunksize)
        while True:
            resultSet = np.fromiter((tuple(row) for i, row in zip(crange, citer)),
                                    dtype=dt)
            table.append(resultSet)
            if len(resultSet) < chunksize:
                break

You may want to tweak some things, but that is the basic strategy.

Be Well
Anthony

On Mon, Apr 15, 2013 at 10:16 PM, Shyam Parimal Katti <sp...@ny...> wrote:
> Hello Anthony,
>
> Thank you for your suggestions. When I mentioned that I am reading the
> data from database, I meant a DB2 database, not a HDF5 database/file.
>
> I followed your suggestions, so the code looks as follows:
>
>     def createHDF5File():
>         h5File = tables.openFile(<file name>, mode="a")
>         table.createTable(h5File.root, "Contact", Contact, "Contact",
>                           expectedrows=7000000)
>         .....
>
>     def getDataAndWriteHDF5(table):
>         databaseConn = pyodbc.connect(<connection string>, <password>)
>         cursor = databaseConn.cursor()
>         cursor.execute("SQL Query")
>         resultSet = np.fromiter((tuple(row) for row in cursor),
>                                 dtype=[('name', numpy.str_, 180),
>                                        ('address', numpy.str_, 4200),
>                                        ('email', numpy.str_, 180),
>                                        ('phone', numpy.str_, 256)])
>         table.append(resultSet)
>
> Error message: MemoryError: cannot allocate array memory.
>
> I am setting the `expectedrows` parameter when creating the table in the
> HDF5 file, and yet encounter the error above. Looking forward to
> suggestions.
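A variant of the same chunking strategy, sketched here as an editorial
assumption rather than code from the thread, uses the standard DB-API
fetchmany() call (which pyodbc provides) so that each chunk arrives as a
plain list of rows. The helper name is hypothetical; the column names and
widths mirror the dtype above:

    import numpy as np

    def copy_query_to_table(cursor, table, chunksize=4096):
        # dtype mirroring the PyTables description; widths are illustrative.
        dt = np.dtype([('name', 'S180'), ('address', 'S4200'),
                       ('email', 'S180'), ('phone', 'S256')])
        while True:
            rows = cursor.fetchmany(chunksize)  # standard DB-API chunked fetch
            if not rows:
                break
            # pyodbc rows convert cleanly to tuples for a structured array.
            table.append(np.array([tuple(r) for r in rows], dtype=dt))
            table.flush()  # keep the in-memory footprint to one chunk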
From: Shyam P. K. <sp...@ny...> - 2013-04-16 03:16:34
Hello Anthony,

Thank you for your suggestions. When I mentioned that I am reading the data
from database, I meant a DB2 database, not a HDF5 database/file.

I followed your suggestions, so the code looks as follows:

    def createHDF5File():
        h5File = tables.openFile(<file name>, mode="a")
        table.createTable(h5File.root, "Contact", Contact, "Contact",
                          expectedrows=7000000)
        .....

    def getDataAndWriteHDF5(table):
        databaseConn = pyodbc.connect(<connection string>, <password>)
        cursor = databaseConn.cursor()
        cursor.execute("SQL Query")
        resultSet = np.fromiter((tuple(row) for row in cursor),
                                dtype=[('name', numpy.str_, 180),
                                       ('address', numpy.str_, 4200),
                                       ('email', numpy.str_, 180),
                                       ('phone', numpy.str_, 256)])
        table.append(resultSet)

Error message: MemoryError: cannot allocate array memory.

I am setting the `expectedrows` parameter when creating the table in the
HDF5 file, and yet encounter the error above. Looking forward to suggestions.

>> Hello Anthony,
>>
>> Thank you for replying back with suggestions.
>>
>> In response to your suggestions, I am *not reading the data from a file
>> in the first step, but instead a database*.
>
> Hello Shyam,
>
> To put too fine a point on it, hdf5 databases are files. And reading from
> any kind of file incurs the same disk read overhead.
>
>> I did try out your 1st suggestion of doing a table.append(list of
>> tuples), which took a little longer than the execution time I got with
>> the original code. Can you please guide me on how to chunk the data
>> (that I got from the database and stored as a list of tuples in Python)?
>
> Ahh, so you should not be using a list of tuples. These are Pythonic
> types, and conversion between HDF5 types and Python types is what is
> slowing you down. You should be passing a numpy structured array into
> append(). Numpy types are very similar to (and often exactly the same as)
> HDF5 types. For large, continuous, structured data you want to avoid the
> Python interpreter as much as possible. Use Python here as the glue code
> to compose a series of fast operations using the APIs exposed by numpy,
> pytables, etc.
>
> Be Well
> Anthony

> Hi Shyam,
>
> The pattern that you are using to write to a table is basically one for
> writing Python data to HDF5. However, your data is already in a machine /
> HDF5 native format. Thus what you are doing here is an excessive amount
> of work: read data from file -> convert to Python data structures ->
> convert back to HDF5 data structures -> write to file.
>
> When reading from a table you get back a numpy structured array (look
> them up on the numpy website).
>
> Then instead of using rows to write back the data, just use
> Table.append() [1], which lets you pass in a bunch of rows
> simultaneously. (Note: your data in this case is too large to fit into
> memory, so you may have to split it up into chunks or use the new
> iterators which are in the development branch.)
>
> Additionally, if all you are doing is copying a table wholesale, you
> should use Table.copy() [2]. Or if you only want to copy some subset
> based on a conditional you provide, use whereAppend() [3].
>
> Finally, if you want to do math or evaluate expressions on one table to
> create a new table, use the Expr class [4].
>
> All of these will be waaaaay faster than what you are doing right now.
>
> Be Well
> Anthony
>
> 1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
> 2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
> 3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
> 4. http://pytables.github.io/usersguide/libref/expr_class.html
>
> On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <sp...@ny...> wrote:
>> Hello,
>>
>> I am writing a lot of data (close to 122 GB) to a hdf5 file using
>> PyTables. The execution time for writing the query result to the file
>> is close to 10 hours, which includes querying the database and then
>> writing to the file. When I timed the entire execution, I found that it
>> takes as much time to get the data from the database as it takes to
>> write to the hdf5 file. Here is a small snippet (P.S.: the execution
>> time noted below is not for the 122 GB data, but a small subset close
>> to 10 GB):
>>
>>     class ContactClass(table.IsDescription):
>>         name = tb.StringCol(4200)
>>         address = tb.StringCol(4200)
>>         emailAddr = tb.StringCol(180)
>>         phone = tb.StringCol(256)
>>
>>     h5File = table.openFile(<file name>, mode="a", title="Contacts")
>>     t = h5File.createTable(h5File.root, 'ContactClass', ContactClass,
>>                            filters=table.Filters(5, 'blosc'),
>>                            expectedrows=77806938)
>>
>>     resultSet = <get data from database>
>>     currRow = t.row
>>     print("Before appending data: %s" % str(datetime.now()))
>>     for (attributes ...) in resultSet:
>>         currRow['name'] = attribute[0]
>>         currRow['address'] = attribute[1]
>>         currRow['emailAddr'] = attribute[2]
>>         currRow['phone'] = attribute[3]
>>         currRow.append()
>>     print("After done appending: %s" % str(datetime.now()))
>>     t.flush()
>>     print("After done flushing: %s" % str(datetime.now()))
>>
>> .. gives me:
>> *Before appending data: 2013-04-11 10:42:39.903713*
>> *After done appending: 2013-04-11 11:04:10.002712*
>> *After done flushing: 2013-04-11 11:05:50.059893*
>>
>> It seems like append() takes a lot of time. Any suggestions on how to
>> improve this?
>>
>> Thanks,
>> Shyam
From: Anthony S. <sc...@gm...> - 2013-04-16 03:03:50
On Mon, Apr 15, 2013 at 9:40 AM, Julio Trevisan <jul...@gm...> wrote:
> Hi Anthony
>
> Thanks for adding this issue.
>
> Is there a way to use CS indexes to get the row coordinates satisfying a
> simple condition, sorted by the column in the condition? I would like to
> avoid using numpy.sort() since the sorting order is probably already
> available within the index information.
>
> My condition is simply "(timestamp >= %d) & (timestamp <= %d)" % (ts1, ts2)
>
> If you could please give me some guidelines so that I could put together
> such a method, that would be great.
>
> Julio
>
> I would like to get the coordinates satisfying a condition (like using
> getWhereList()), but using an index so that the coordinate list comes
> ordered by the column. I noticed that getWhereList() has a *sort*
> parameter that uses numpy.sort() to do the work, and that readSorted()
> uses a full index to get the sorted sequence. I couldn't make complete
> sense yet of "chunkmaps" being passed to the numexpr evaluator inside
> _where().

Hi Julio,

Thanks for taking this on! You probably want to read [1] to figure out how
numexpr works and what chunkmaps is doing, if you haven't already.

However, probably the simplest implementation of this method would be
basically part of the read_where() method followed by the read_sorted()
body. It would look like this, but you'll have to try it:

    def read_where_sorted(self, ...):
        self._g_check_open()
        condcoords = set([p.nrow for p in self._where(condition, condvars,
                                                      start, stop, step)])
        self._where_condition = None  # reset the conditions
        index = self._check_sortby_csi(sortby, checkCSI)
        sortcoords = index[start:stop:step]
        coords = [c for c in sortcoords if c in condcoords]
        return self.read_coordinates(coords, field)

There may be faster, more elegant solutions, but I think that something like
this would work.

Be Well
Anthony

1. http://code.google.com/p/numexpr/
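For a user-side fallback that needs no changes to PyTables itself, the
existing getWhereList() and readCoordinates() calls can be combined with a
numpy sort. This sketch is an editorial illustration under the assumption of
a 'timestamp' column, as in Julio's condition, and it does pay the
numpy.sort() cost he was hoping to avoid:

    # Assumes `table` is an open tables.Table with a 'timestamp' column;
    # the bounds 1000/2000 are placeholders for ts1/ts2.
    coords = table.getWhereList('(timestamp >= 1000) & (timestamp <= 2000)')
    rows = table.readCoordinates(coords)                 # structured array, unsorted
    rows_sorted = rows[rows.argsort(order='timestamp')]  # numpy does the sorting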
From: Anthony S. <sc...@gm...> - 2013-04-15 23:15:45
And here is the issue: https://github.com/PyTables/PyTables/issues/230

On Mon, Apr 15, 2013 at 6:07 PM, Anthony Scopatz <sc...@gm...> wrote:
> Hi Charles,
>
> This is very likely a bug with respect to queries based on Time64Cols not
> being converted to Float64s for the query itself. Under the covers, HDF5
> and PyTables represent Time64 as posix times, which are structs of two
> 4-byte ints [1]. These obviously have a very different memory layout than
> your standard float64. This is why the comparison is failing.
>
> numexpr doesn't support the time64 datatype, nor does it support bit
> shift operators. This makes it difficult to impossible to use time64
> columns properly from within a query right now.
>
> I'll open a ticket for this, but if you want something working right now,
> using Float64Col is probably your best bet. This is what I have always
> done, and it works just fine. I think that the Time64 stuff is in there
> largely for C/HDF5 compliance. Sorry about the confusion.
>
> Be Well
> Anthony
>
> 1. http://pubs.opengroup.org/onlinepubs/000095399/basedefs/sys/time.h.html
From: Anthony S. <sc...@gm...> - 2013-04-15 23:08:05
Hi Charles,

This is very likely a bug with respect to queries based on Time64Cols not
being converted to Float64s for the query itself. Under the covers, HDF5 and
PyTables represent Time64 as posix times, which are structs of two 4-byte
ints [1]. These obviously have a very different memory layout than your
standard float64. This is why the comparison is failing.

numexpr doesn't support the time64 datatype, nor does it support bit shift
operators. This makes it difficult to impossible to use time64 columns
properly from within a query right now.

I'll open a ticket for this, but if you want something working right now,
using Float64Col is probably your best bet. This is what I have always done,
and it works just fine. I think that the Time64 stuff is in there largely
for C/HDF5 compliance. Sorry about the confusion.

Be Well
Anthony

1. http://pubs.opengroup.org/onlinepubs/000095399/basedefs/sys/time.h.html

On Mon, Apr 15, 2013 at 2:20 PM, Charles de Villiers <ch...@ya...> wrote:
> Hi Anthony,
>
> Thanks for your response.
>
> I had come across that discussion, but I don't think the floating-point
> precision thing really explains my results, because I'm querying for
> intervals, not instants. If I have a table containing, say, one-second
> samples between 500.0 and 1500.0, and I use a where clause like this:
>
>     '(update_seconds >= 1000.0) & (update_seconds <= 1060.0)'
>
> then I expect to get at least 58 samples, even with floating-point
> 'fuzziness' - but in fact I get none.
>
> However, I have now tried the approach of storing my epoch seconds in
> Float64Cols and that seems to be working just fine.
>
> The question I'm left with is - just what does a Time64Col represent?
> Since there's no standard Python Time class with a float representation,
> I just guessed I could assign it float seconds a la time.time(), but
> Float64 works just as well for that (and as it turns out, better). How
> could you use a Time64Col in practice?
>
> Thanks again,
>
> Charles de Villiers
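To make the Float64Col workaround concrete, here is a minimal self-contained
sketch (an editorial illustration using the 2.x camelCase API; file, node,
and column names are chosen to match Charles's example) that reproduces his
interval query against plain float64 epoch seconds:

    import tables as tb

    class Sample(tb.IsDescription):
        update_seconds = tb.Float64Col(pos=0)  # epoch seconds as plain float64
        value = tb.Float64Col(pos=1)

    h5 = tb.openFile('samples.h5', mode='w')
    t = h5.createTable(h5.root, 'wind', Sample)
    # One-second samples between 500.0 and 1500.0, as in Charles's scenario.
    t.append([(500.0 + i, float(i)) for i in range(1001)])
    t.flush()

    hits = t.readWhere('(update_seconds >= 1000.0) & (update_seconds <= 1060.0)')
    print(len(hits))  # 61 rows for this inclusive interval
    h5.close()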
From: Charles de V. <ch...@ya...> - 2013-04-15 19:20:45
Hi Anthony,

Thanks for your response.

I had come across that discussion, but I don't think the floating-point
precision thing really explains my results, because I'm querying for
intervals, not instants. If I have a table containing, say, one-second
samples between 500.0 and 1500.0, and I use a where clause like this:

    '(update_seconds >= 1000.0) & (update_seconds <= 1060.0)'

then I expect to get at least 58 samples, even with floating-point
'fuzziness' - but in fact I get none.

However, I have now tried the approach of storing my epoch seconds in
Float64Cols and that seems to be working just fine.

The question I'm left with is - just what does a Time64Col represent? Since
there's no standard Python Time class with a float representation, I just
guessed I could assign it float seconds a la time.time(), but Float64 works
just as well for that (and as it turns out, better). How could you use a
Time64Col in practice?

Thanks again,

Charles de Villiers

"They have computers, and they may have other weapons of mass destruction."
(Janet Reno)

From: Anthony Scopatz <sc...@gm...>
To: Charles de Villiers <ch...@ya...>; Discussion list for PyTables
<pyt...@li...>
Sent: Monday, April 15, 2013 5:13 PM
Subject: Re: [Pytables-users] PyTables in-kernel query using Time64Col
returns wrong results

Hi Charles,

We just discussed this last week and I am too lazy to retype it all, so here
is a link to the archive post [1].

Be Well
Anthony

1. http://sourceforge.net/mailarchive/message.php?msg_id=30708089
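For what it's worth, a Time64Col does round-trip float epoch seconds for
plain storage and reads; per this thread it is only the in-kernel queries on
it that misbehave. A small sketch of that storage pattern (editorial
illustration; 2.x camelCase API, names illustrative):

    import time
    import tables as tb

    class Event(tb.IsDescription):
        stamp = tb.Time64Col(pos=0)   # stored by HDF5 as a posix time struct
        value = tb.Float64Col(pos=1)

    h5 = tb.openFile('events.h5', mode='w')
    t = h5.createTable(h5.root, 'events', Event)
    row = t.row
    row['stamp'] = time.time()        # assign float epoch seconds, as in the question
    row['value'] = 1.0
    row.append()
    t.flush()
    print(t[0]['stamp'])              # reads back as a float64 epoch time
    h5.close()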
From: Anthony S. <sc...@gm...> - 2013-04-15 15:13:57
Hi Charles,

We just discussed this last week and I am too lazy to retype it all, so here
is a link to the archive post [1].

Be Well
Anthony

1. http://sourceforge.net/mailarchive/message.php?msg_id=30708089

On Mon, Apr 15, 2013 at 9:20 AM, Charles de Villiers <ch...@ya...> wrote:
> I'm using PyTables 2.4.0 and Python 2.7. I've got a database that contains
> the following typical table:
>
>     /anc/asc_wind_speed (Table(87591,), shuffle, blosc(3)) 'Wind speed'
>       description := {
>         "value_seconds": Time64Col(shape=(), dflt=0.0, pos=0),
>         "update_seconds": Time64Col(shape=(), dflt=0.0, pos=1),
>         "status": UInt8Col(shape=(), dflt=0, pos=2),
>         "value": Float64Col(shape=(), dflt=0.0, pos=3)}
>       byteorder := 'little'
>       chunkshape := (2621,)
>       autoIndex := True
>       colindexes := {
>         "update_seconds": Index(9, full, shuffle, zlib(1)).is_CSI=True,
>         "value": Index(9, full, shuffle, zlib(1)).is_CSI=True}
>
> I populate the timestamp columns using float seconds. The data looks OK
> in my IPython session:
>
>     array([(1343779432.2160001, 1343779431.8529999, 0, 5.2975000000000003),
>            (1343779433.2190001, 1343779432.9430001, 0, 5.7474999999999996),
>            (1343779434.217, 1343779433.9809999, 0, 5.8600000000000003), ...,
>            (1343866301.934, 1343866301.5139999, 0, 3.8424999999999998),
>            (1343866302.934, 1343866302.5799999, 0, 4.0599999999999996),
>            (1343866303.934, 1343866303.642, 0, 3.7825000000000002)],
>           dtype=[('value_seconds', '<f8'), ('update_seconds', '<f8'),
>                  ('status', '|u1'), ('value', '<f8')])
>
> .. but when I try to do an in-kernel search using the indexed column
> 'update_seconds', everything goes pear-shaped:
>
>     len(wstable.readWhere('(update_seconds <= 1343866303.642)'))
>     0
>
> ie I get 0 rows returned when I was expecting all 87591 of them.
> Occasionally I do manage to get some rows with a '>=' query, but the
> timestamp columns are then returned as huge floats (~10^79). It seems
> that there is some implicit type-conversion going on that causes the
> Time64Col values to be misinterpreted. Can someone spot my mistake, or
> should I forget about Time64Cols and convert them all to Float64 (and how
> do I do this?)
From: Julio T. <jul...@gm...> - 2013-04-15 14:40:13
|
Hi Anthony,

Thanks for adding this issue. Is there a way to use CS indexes to get the
row coordinates satisfying a simple condition (like using getWhereList()),
but with the coordinate list ordered by the column in the condition? I
would like to avoid using numpy.sort(), since the sorting order is probably
already available within the index information. My condition is simply

    "(timestamp >= %d) & (timestamp <= %d)" % (ts1, ts2)

I noticed that getWhereList() has a *sort* parameter that uses numpy.sort()
to do the work, and that readSorted() uses a full index to get the sorted
sequence, but I couldn't yet make complete sense of the "chunkmaps" being
passed to the numexpr evaluator inside _where().

If you could please give me some guidelines so that I could put together
such a method, that would be great.

Julio

On Thu, Apr 11, 2013 at 1:14 PM, Anthony Scopatz <sc...@gm...> wrote:

> Thanks for bringing this up, Julio.
>
> Hmm, I don't think that this exists currently, but since there are
> readWhere() and readSorted() it shouldn't be too hard to implement. I
> have opened issue #225 to this effect. Pull requests welcome!
>
> https://github.com/PyTables/PyTables/issues/225
>
> Be Well
> Anthony
>
> On Wed, Apr 10, 2013 at 1:02 PM, Dr. Louis Wicker <lou...@no...> wrote:
>
>> I am also interested in this capability, if it exists in some way...
>>
>> Lou
>>
>> On Apr 10, 2013, at 12:35 PM, Julio Trevisan <jul...@gm...> wrote:
>>
>>> Hi,
>>>
>>> Is there a way that I could have the ability of readWhere() (i.e.,
>>> specify a condition and get a fast result) but also use a CSIndex so
>>> that the rows come sorted in a particular order?
>>>
>>> I checked readSorted() but it is iterative and does not allow
>>> specifying a condition.
>>>
>>> Julio
>>
>> ----------------------------------------------------------------------------
>> | Dr. Louis J. Wicker
>> | NSSL/WRDD Rm 4366
>> | National Weather Center
>> | 120 David L. Boren Boulevard, Norman, OK 73072
>> |
>> | E-mail: Lou...@no...
>> | HTTP: http://www.nssl.noaa.gov/~lwicker
>> | Phone: (405) 325-6340
>> | Fax: (405) 325-6780
>> |
>> | For every complex problem, there is a solution that is simple,
>> | neat, and wrong.
>> |
>> | -- H. L. Mencken
>> ----------------------------------------------------------------------------
>> | "The contents of this message are mine personally and
>> | do not reflect any position of the Government or NOAA."
>> ----------------------------------------------------------------------------
 |
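Since Julio's condition is a pure range on the indexed column, one way to
get the matching rows already sorted is to read in index order and slice
the range with searchsorted(). The sketch below only illustrates that idea
against the table discussed earlier in the thread, assuming the PyTables
2.4 camelCase API; ts1/ts2 are invented bounds, and it reads the whole
sorted table, trading memory for simplicity:

    import numpy as np
    import tables

    h5 = tables.openFile('data.h5', mode='r')
    table = h5.root.anc.asc_wind_speed       # path borrowed from this thread

    ts1, ts2 = 1343779432.0, 1343866304.0    # query bounds (made up)

    # readSorted() walks the completely sorted index (CSI), so the rows come
    # back ordered by 'update_seconds'; the range condition then reduces to
    # slicing the sorted sequence -- no numpy.sort() required.
    rows = table.readSorted(sortby='update_seconds', checkCSI=True)
    ts = rows['update_seconds']
    lo = np.searchsorted(ts, ts1, side='left')
    hi = np.searchsorted(ts, ts2, side='right')
    result = rows[lo:hi]

    h5.close()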
From: Charles de V. <ch...@ya...> - 2013-04-15 14:20:31
|
I'm using PyTables 2.4.0 and Python 2.7. I've got a database that contains
the following typical table:

    /anc/asc_wind_speed (Table(87591,), shuffle, blosc(3)) 'Wind speed'
    description := {
        "value_seconds": Time64Col(shape=(), dflt=0.0, pos=0),
        "update_seconds": Time64Col(shape=(), dflt=0.0, pos=1),
        "status": UInt8Col(shape=(), dflt=0, pos=2),
        "value": Float64Col(shape=(), dflt=0.0, pos=3)}
    byteorder := 'little'
    chunkshape := (2621,)
    autoIndex := True
    colindexes := {
        "update_seconds": Index(9, full, shuffle, zlib(1)).is_CSI=True,
        "value": Index(9, full, shuffle, zlib(1)).is_CSI=True}

I populate the timestamp columns using float seconds. The data looks OK in
my IPython session:

    array([(1343779432.2160001, 1343779431.8529999, 0, 5.2975000000000003),
           (1343779433.2190001, 1343779432.9430001, 0, 5.7474999999999996),
           (1343779434.217, 1343779433.9809999, 0, 5.8600000000000003),
           ...,
           (1343866301.934, 1343866301.5139999, 0, 3.8424999999999998),
           (1343866302.934, 1343866302.5799999, 0, 4.0599999999999996),
           (1343866303.934, 1343866303.642, 0, 3.7825000000000002)],
          dtype=[('value_seconds', '<f8'), ('update_seconds', '<f8'),
                 ('status', '|u1'), ('value', '<f8')])

... but when I try to do an in-kernel search using the indexed column
'update_seconds', everything goes pear-shaped:

    len(wstable.readWhere('(update_seconds <= 1343866303.642)'))
    0

i.e. I get 0 rows returned when I was expecting all 87591 of them.
Occasionally I do manage to get some rows with a '>=' query, but the
timestamp columns are then returned as huge floats (~10^79). It seems that
there is some implicit type-conversion going on that causes the Time64Col
values to be misinterpreted. Can someone spot my mistake, or should I
forget about Time64Cols and convert them all to Float64 (and how do I do
this?)
 |
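For anyone trying to reproduce the symptom, here is a minimal
self-contained sketch that contrasts the in-kernel Time64Col query with a
plain NumPy comparison on the same column (PyTables 2.4 camelCase API; the
file and table names are invented, and the in-kernel count of 0 is the
behaviour reported in this thread, not a documented result):

    import time
    import tables

    class Sample(tables.IsDescription):
        update_seconds = tables.Time64Col(pos=0)
        value = tables.Float64Col(pos=1)

    h5 = tables.openFile('time64_repro.h5', mode='w')
    tbl = h5.createTable('/', 'samples', Sample)

    t0 = time.time()
    row = tbl.row
    for i in range(100):                  # one hundred one-second samples
        row['update_seconds'] = t0 + i
        row['value'] = float(i)
        row.append()
    tbl.flush()

    cond = '(update_seconds <= %r)' % (t0 + 99)
    in_kernel = len(tbl.readWhere(cond))        # 0, per the report above
    in_memory = (tbl.col('update_seconds') <= t0 + 99).sum()  # 100
    print(in_kernel, in_memory)
    h5.close()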
From: Francesc A. <fa...@gm...> - 2013-04-14 21:57:07
|
Uploaded numexpr 2.1 RC2 with your suggestions. Thanks!

Francesc

On 14/04/13 23:12, Christoph Gohlke wrote:
> On 4/14/2013 1:19 PM, Francesc Alted wrote:
>> ============================
>> Announcing Numexpr 2.1 RC1
>> ============================
>>
>> Numexpr is a fast numerical expression evaluator for NumPy. With it,
>> expressions that operate on arrays (like "3*a+4*b") are accelerated
>> and use less memory than doing the same calculation in Python.
>>
>> It has multi-threaded capabilities, as well as support for Intel's
>> VML library, which allows for squeezing the last drop of performance
>> out of your multi-core processors.
>>
>> What's new
>> ==========
>>
>> This version adds compatibility for Python 3. A bunch of thanks to
>> Antonio Valentino for his excellent work on this. I apologize for
>> taking so long in releasing his contributions.
>>
>> In case you want to know in more detail what has changed in this
>> version, see:
>>
>> http://code.google.com/p/numexpr/wiki/ReleaseNotes
>>
>> or have a look at RELEASE_NOTES.txt in the tarball.
>>
>> Where can I find Numexpr?
>> =========================
>>
>> The project is hosted at Google Code:
>>
>> http://code.google.com/p/numexpr/
>>
>> This is release candidate 1, so it will not be available on the PyPI
>> repository. I'll post it there when the final version is released.
>>
>> Share your experience
>> =====================
>>
>> Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.
>>
>> Enjoy!
>>
>> --
>> Francesc Alted
>
> Hello,
>
> Looks good. All tests pass here on Python 2.6-3.3, 32 & 64 bit, numpy
> 1.7.1, VML/MKL 11.0.3, Windows 8. PyTables 2.4 also tests OK against
> the RC.
>
> Two small issues:
>
> 1) numexpr-2.1-rc1.tar.gz is missing the file missing_posix_functions.hpp
>
> 2) The latest version of MKL requires the following change (see
> <http://software.intel.com/en-us/articles/some-service-functions-have-become-obsolete-and-will-be-removed-in-subsequent-releases>):
>
> diff -r 97ab97673591 numexpr/module.cpp
> --- a/numexpr/module.cpp Sun Apr 14 22:11:47 2013 +0200
> +++ b/numexpr/module.cpp Sun Apr 14 14:01:09 2013 -0700
> @@ -277,7 +277,7 @@
>  {
>      int len=198;
>      char buf[198];
> -    MKLGetVersionString(buf, len);
> +    MKL_Get_Version_String(buf, len);
>      return Py_BuildValue("s", buf);
>  }
>
> Thank you,
>
> Christoph
 |
From: Francesc A. <fa...@gm...> - 2013-04-14 21:44:28
|
On 14/04/13 23:12, Christoph Gohlke wrote:

> Hello,
>
> Looks good. All tests pass here on Python 2.6-3.3, 32 & 64 bit, numpy
> 1.7.1, VML/MKL 11.0.3, Windows 8. PyTables 2.4 also tests OK against
> the RC.
>
> Two small issues:
>
> 1) numexpr-2.1-rc1.tar.gz is missing the file missing_posix_functions.hpp

Oops. Will fix.

> 2) The latest version of MKL requires the following change (see
> <http://software.intel.com/en-us/articles/some-service-functions-have-become-obsolete-and-will-be-removed-in-subsequent-releases>):
>
> diff -r 97ab97673591 numexpr/module.cpp
> --- a/numexpr/module.cpp Sun Apr 14 22:11:47 2013 +0200
> +++ b/numexpr/module.cpp Sun Apr 14 14:01:09 2013 -0700
> @@ -277,7 +277,7 @@
>  {
>      int len=198;
>      char buf[198];
> -    MKLGetVersionString(buf, len);
> +    MKL_Get_Version_String(buf, len);
>      return Py_BuildValue("s", buf);
>  }

Okay, and is that backwards compatible with older versions of MKL too?

Thanks,

Francesc
 |
From: Christoph G. <cg...@uc...> - 2013-04-14 21:12:15
|
On 4/14/2013 1:19 PM, Francesc Alted wrote:
> ============================
> Announcing Numexpr 2.1 RC1
> ============================
>
> Numexpr is a fast numerical expression evaluator for NumPy. With it,
> expressions that operate on arrays (like "3*a+4*b") are accelerated
> and use less memory than doing the same calculation in Python.
>
> It has multi-threaded capabilities, as well as support for Intel's
> VML library, which allows for squeezing the last drop of performance
> out of your multi-core processors.
>
> What's new
> ==========
>
> This version adds compatibility for Python 3. A bunch of thanks to
> Antonio Valentino for his excellent work on this. I apologize for
> taking so long in releasing his contributions.
>
> In case you want to know in more detail what has changed in this
> version, see:
>
> http://code.google.com/p/numexpr/wiki/ReleaseNotes
>
> or have a look at RELEASE_NOTES.txt in the tarball.
>
> Where can I find Numexpr?
> =========================
>
> The project is hosted at Google Code:
>
> http://code.google.com/p/numexpr/
>
> This is release candidate 1, so it will not be available on the PyPI
> repository. I'll post it there when the final version is released.
>
> Share your experience
> =====================
>
> Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.
>
> Enjoy!
>
> --
> Francesc Alted

Hello,

Looks good. All tests pass here on Python 2.6-3.3, 32 & 64 bit, numpy
1.7.1, VML/MKL 11.0.3, Windows 8. PyTables 2.4 also tests OK against the RC.

Two small issues:

1) numexpr-2.1-rc1.tar.gz is missing the file missing_posix_functions.hpp

2) The latest version of MKL requires the following change (see
<http://software.intel.com/en-us/articles/some-service-functions-have-become-obsolete-and-will-be-removed-in-subsequent-releases>):

diff -r 97ab97673591 numexpr/module.cpp
--- a/numexpr/module.cpp Sun Apr 14 22:11:47 2013 +0200
+++ b/numexpr/module.cpp Sun Apr 14 14:01:09 2013 -0700
@@ -277,7 +277,7 @@
 {
     int len=198;
     char buf[198];
-    MKLGetVersionString(buf, len);
+    MKL_Get_Version_String(buf, len);
     return Py_BuildValue("s", buf);
 }

Thank you,

Christoph
 |
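For readers who haven't used the package, here is a minimal usage sketch of
the expression evaluator described in the announcement above; the arrays
and the expression are arbitrary examples:

    import numpy as np
    import numexpr as ne

    a = np.arange(1e6)
    b = np.arange(1e6)

    # numexpr compiles the expression once and evaluates it chunk-wise in a
    # multi-threaded virtual machine, avoiding the large temporaries that
    # plain NumPy would allocate for 3*a and 4*b.
    result = ne.evaluate('3*a + 4*b')

    assert np.allclose(result, 3*a + 4*b)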