From: Francesc A. <fa...@gm...> - 2013-04-14 20:19:42
============================
Announcing Numexpr 2.1RC1
============================

Numexpr is a fast numerical expression evaluator for NumPy. With it, expressions that operate on arrays (like "3*a+4*b") are accelerated and use less memory than doing the same calculation in Python. It is multi-threaded and also supports Intel's VML library, which allows for squeezing the last drop of performance out of your multi-core processors.

What's new
==========

This version adds compatibility with Python 3. A big bunch of thanks to Antonio Valentino for his excellent work on this. I apologize for taking so long to release his contributions.

If you want to know in more detail what has changed in this version, see:

http://code.google.com/p/numexpr/wiki/ReleaseNotes

or have a look at RELEASE_NOTES.txt in the tarball.

Where can I find Numexpr?
=========================

The project is hosted at Google Code:

http://code.google.com/p/numexpr/

This is release candidate 1, so it will not be available in the PyPI repository. I'll post it there when the final version is released.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy!

-- Francesc Alted
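A minimal sketch of the kind of expression Numexpr accelerates (the array names `a` and `b` and their sizes are arbitrary, not taken from the announcement):

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# evaluated chunk-by-chunk in numexpr's virtual machine, using all cores,
# without materializing the temporaries that plain NumPy would create
c = ne.evaluate("3*a + 4*b")

print(np.allclose(c, 3*a + 4*b))  # True
```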
From: Anthony S. <sc...@gm...> - 2013-04-11 23:02:48
On Thu, Apr 11, 2013 at 5:16 PM, Shyam Parimal Katti <sp...@ny...> wrote: > Hello Anthony, > > Thank you for replying back with suggestions. > > In response to your suggestions, I am *not reading the data from a file > in the first step, but instead a database*. > Hello Shyam, To put too fine a point on it, hdf5 databases are files. And reading from any kind of file incurs the same disk read overhead. > I did try out your 1st suggestion of doing a table.append(list of > tuples), which took a little more than the executed time I got with the > original code. Can you please guide me in how to chunk the data (that I got > from database and stored as a list of tuples in Python) ? > Ahh, so you should not be using list of tuples. These are Pythonic types and conversion between HDF5 types and Python types is what is slowing you down. You should be passing a numpy structured array into append(). Numpy types are very similar (and often exactly the same as) HDF5 types. For large, continuous, structured data you want to avoid the Python interpreter as much as possible. Use Python here as the glue code to compose a series of fast operations using the APIs exposed by numpy, pytables, etc. Be Well Anthony > > > Thanks, > Shyam > > > Hi Shyam, > > The pattern that you are using to write to a table is basically one for > writing Python data to HDF5. However, your data is already in a machine / > HDF5 native format. Thus what you are doing here is an excessive amount of > work: read data from file -> convert to Python data structures -> covert > back to HDF5 data structures -> write to file. > > When reading from a table you get back a numpy structured array (look them > up on the numpy website). > > Then instead of using rows to write back the data, just use Table.append() > [1] which lets you pass in a bunch of rows simultaneously. (Note: that you > data in this case is too large to fit into memory, so you may have to spit > it up into chunks or use the new iterators which are in the development > branch.) > > Additionally, if all you are doing is copying a table wholesale, you should > use the Table.copy(). [2] Or if you only want to copy some subset based on > a conditional you provide, use whereAppend() [3]. > > Finally, if you want to do math or evaluate expressions on one table to > create a new table, use the Expr class [4]. > > All of these will be waaaaay faster than what you are doing right now. > > Be Well > Anthony > > 1.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append > 2.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy > 3.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend > 4. http://pytables.github.io/usersguide/libref/expr_class.html > > > > > On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <sp...@ny...>wrote: > >> Hello, >> >> I am writing a lot of data(close to 122GB ) to a hdf5 file using >> PyTables. The execution time for writing the query result to the file is >> close to 10 hours, which includes querying the database and then writing to >> the file. When I timed the entire execution, I found that it takes as much >> time to get the data from the database as it takes to write to the hdf5 >> file. 
Here is the small snippet(P.S: the execution time noted below is not >> for 122GB data, but a small subset close to 10GB): >> >> class ContactClass(table.IsDescription): >> name= tb.StringCol(4200) >> address= tb.StringCol(4200) >> emailAddr= tb.StringCol(180) >> phone= tb.StringCol(256) >> >> h5File= table.openFile(<file name>, mode="a", title= "Contacts") >> t= h5File.createTable(h5File.root, 'ContactClass', ContactClass, >> filters=table.Filters(5, 'blosc'), expectedrows=77806938) >> >> resultSet= get data from database >> currRow= t.row >> print("Before appending data: %s" % str(datetime.now())) >> for (attributes ..) in resultSet: >> currRow['name']= attribute[0] >> currRow['address']= attribute[1] >> currRow['emailAddr']= attribute[2] >> currRow['phone']= attribute[3] >> currRow.append() >> print("After done appending: %s" % str(datetime.now())) >> t.flush() >> print("After done flushing: %s" % str(datetime.now())) >> >> .. gives me: >> *Before appending data 2013-04-11 10:42:39.903713 * >> *After done appending: 2013-04-11 11:04:10.002712* >> *After done flushing: 2013-04-11 11:05:50.059893* >> * >> * >> it seems like append() takes a lot of time. Any suggestions on how to >> improve this? >> >> Thanks, >> Shyam >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
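A minimal sketch of the chunked, structured-array append described above, using the ContactClass column sizes quoted in the thread. The file name and the DB-API cursor (`cursor`) are placeholders, and the query is assumed to return columns in the order name, address, emailAddr, phone; this uses the PyTables 2.x camelCase API as in the thread:

```python
import numpy as np
import tables

class ContactClass(tables.IsDescription):
    name      = tables.StringCol(4200, pos=0)
    address   = tables.StringCol(4200, pos=1)
    emailAddr = tables.StringCol(180, pos=2)
    phone     = tables.StringCol(256, pos=3)

# NumPy dtype matching the table description above (same field order)
contact_dtype = np.dtype([('name', 'S4200'), ('address', 'S4200'),
                          ('emailAddr', 'S180'), ('phone', 'S256')])

h5file = tables.openFile('contacts.h5', mode='a', title='Contacts')
t = h5file.createTable(h5file.root, 'ContactClass', ContactClass,
                       filters=tables.Filters(5, 'blosc'),
                       expectedrows=77806938)

chunksize = 100000  # rows per append; tune to the available memory
while True:
    # 'cursor' is the (assumed) DB-API cursor holding the query result;
    # fetchmany() returns a list of tuples, which a structured dtype expects
    rows = cursor.fetchmany(chunksize)
    if not rows:
        break
    # one list-of-tuples -> structured-array conversion per chunk,
    # instead of one Row field assignment per record
    t.append(np.array(rows, dtype=contact_dtype))

t.flush()
h5file.close()
```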
From: Shyam P. K. <sp...@ny...> - 2013-04-11 22:16:45
Hello Anthony, Thank you for replying back with suggestions. In response to your suggestions, I am *not reading the data from a file in the first step, but instead a database*. I did try out your 1st suggestion of doing a table.append(list of tuples), which took a little more than the executed time I got with the original code. Can you please guide me in how to chunk the data (that I got from database and stored as a list of tuples in Python) ? Thanks, Shyam Hi Shyam, The pattern that you are using to write to a table is basically one for writing Python data to HDF5. However, your data is already in a machine / HDF5 native format. Thus what you are doing here is an excessive amount of work: read data from file -> convert to Python data structures -> covert back to HDF5 data structures -> write to file. When reading from a table you get back a numpy structured array (look them up on the numpy website). Then instead of using rows to write back the data, just use Table.append() [1] which lets you pass in a bunch of rows simultaneously. (Note: that you data in this case is too large to fit into memory, so you may have to spit it up into chunks or use the new iterators which are in the development branch.) Additionally, if all you are doing is copying a table wholesale, you should use the Table.copy(). [2] Or if you only want to copy some subset based on a conditional you provide, use whereAppend() [3]. Finally, if you want to do math or evaluate expressions on one table to create a new table, use the Expr class [4]. All of these will be waaaaay faster than what you are doing right now. Be Well Anthony 1.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append 2.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy 3.http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend 4. http://pytables.github.io/usersguide/libref/expr_class.html On Thu, Apr 11, 2013 at 12:23 PM, Shyam Parimal Katti <sp...@ny...>wrote: > Hello, > > I am writing a lot of data(close to 122GB ) to a hdf5 file using PyTables. > The execution time for writing the query result to the file is close to 10 > hours, which includes querying the database and then writing to the file. > When I timed the entire execution, I found that it takes as much time to > get the data from the database as it takes to write to the hdf5 file. Here > is the small snippet(P.S: the execution time noted below is not for 122GB > data, but a small subset close to 10GB): > > class ContactClass(table.IsDescription): > name= tb.StringCol(4200) > address= tb.StringCol(4200) > emailAddr= tb.StringCol(180) > phone= tb.StringCol(256) > > h5File= table.openFile(<file name>, mode="a", title= "Contacts") > t= h5File.createTable(h5File.root, 'ContactClass', ContactClass, > filters=table.Filters(5, 'blosc'), expectedrows=77806938) > > resultSet= get data from database > currRow= t.row > print("Before appending data: %s" % str(datetime.now())) > for (attributes ..) in resultSet: > currRow['name']= attribute[0] > currRow['address']= attribute[1] > currRow['emailAddr']= attribute[2] > currRow['phone']= attribute[3] > currRow.append() > print("After done appending: %s" % str(datetime.now())) > t.flush() > print("After done flushing: %s" % str(datetime.now())) > > .. 
gives me: > *Before appending data 2013-04-11 10:42:39.903713 * > *After done appending: 2013-04-11 11:04:10.002712* > *After done flushing: 2013-04-11 11:05:50.059893* > * > * > it seems like append() takes a lot of time. Any suggestions on how to > improve this? > > Thanks, > Shyam > > |
From: Anthony S. <sc...@gm...> - 2013-04-11 18:14:41
Hi Shyam,

The pattern that you are using to write to a table is basically one for writing Python data to HDF5. However, your data is already in a machine / HDF5 native format. Thus what you are doing here is an excessive amount of work: read data from file -> convert to Python data structures -> convert back to HDF5 data structures -> write to file.

When reading from a table you get back a numpy structured array (look them up on the numpy website).

Then, instead of using rows to write back the data, just use Table.append() [1], which lets you pass in a bunch of rows simultaneously. (Note that your data in this case is too large to fit into memory, so you may have to split it up into chunks or use the new iterators which are in the development branch.)

Additionally, if all you are doing is copying a table wholesale, you should use Table.copy() [2]. Or if you only want to copy some subset based on a conditional you provide, use whereAppend() [3].

Finally, if you want to do math or evaluate expressions on one table to create a new table, use the Expr class [4].

All of these will be waaaaay faster than what you are doing right now.

Be Well
Anthony

1. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.append
2. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.copy
3. http://pytables.github.io/usersguide/libref/structured_storage.html#tables.Table.whereAppend
4. http://pytables.github.io/usersguide/libref/expr_class.html

On Thu, Apr 11, 2013 at 11:23 AM, Shyam Parimal Katti <sp...@ny...> wrote:

> Hello,
>
> I am writing a lot of data (close to 122 GB) to an HDF5 file using PyTables.
> The execution time for writing the query result to the file is close to 10
> hours, which includes querying the database and then writing to the file.
> When I timed the entire execution, I found that it takes as much time to
> get the data from the database as it takes to write to the HDF5 file. Here
> is a small snippet (P.S.: the execution time noted below is not for the
> 122 GB of data, but a small subset close to 10 GB):
>
>     class ContactClass(table.IsDescription):
>         name= tb.StringCol(4200)
>         address= tb.StringCol(4200)
>         emailAddr= tb.StringCol(180)
>         phone= tb.StringCol(256)
>
>     h5File= table.openFile(<file name>, mode="a", title= "Contacts")
>     t= h5File.createTable(h5File.root, 'ContactClass', ContactClass,
>         filters=table.Filters(5, 'blosc'), expectedrows=77806938)
>
>     resultSet= get data from database
>     currRow= t.row
>     print("Before appending data: %s" % str(datetime.now()))
>     for (attributes ..) in resultSet:
>         currRow['name']= attribute[0]
>         currRow['address']= attribute[1]
>         currRow['emailAddr']= attribute[2]
>         currRow['phone']= attribute[3]
>         currRow.append()
>     print("After done appending: %s" % str(datetime.now()))
>     t.flush()
>     print("After done flushing: %s" % str(datetime.now()))
>
> .. which gives me:
>
>     Before appending data: 2013-04-11 10:42:39.903713
>     After done appending:  2013-04-11 11:04:10.002712
>     After done flushing:   2013-04-11 11:05:50.059893
>
> It seems like append() takes a lot of time. Any suggestions on how to
> improve this?
>
> Thanks,
> Shyam
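For the copy-style cases in [2] and [3], a minimal sketch; the file and table names and the e-mail condition are placeholders, and the empty clone via copy(stop=0) is just one convenient way to reuse the source table's description:

```python
import tables

h5file = tables.openFile('contacts.h5', mode='a')
src = h5file.root.ContactClass

# 1. wholesale copy: stays inside HDF5, no Python-level row loop
backup = src.copy(h5file.root, 'ContactsBackup')

# 2. conditional copy: clone an empty table, then append only matching rows
gmail = src.copy(h5file.root, 'GmailContacts', stop=0)      # empty clone
src.whereAppend(gmail, 'emailAddr == "someone@gmail.com"')  # hypothetical condition

h5file.close()
```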
From: Shyam P. K. <sp...@ny...> - 2013-04-11 17:18:05
Hello,

I am writing a lot of data (close to 122 GB) to an HDF5 file using PyTables. The execution time for writing the query result to the file is close to 10 hours, which includes querying the database and then writing to the file. When I timed the entire execution, I found that it takes as much time to get the data from the database as it takes to write to the HDF5 file. Here is a small snippet (P.S.: the execution time noted below is not for the 122 GB of data, but a small subset close to 10 GB):

```python
class ContactClass(table.IsDescription):
    name= tb.StringCol(4200)
    address= tb.StringCol(4200)
    emailAddr= tb.StringCol(180)
    phone= tb.StringCol(256)

h5File= table.openFile(<file name>, mode="a", title= "Contacts")
t= h5File.createTable(h5File.root, 'ContactClass', ContactClass,
    filters=table.Filters(5, 'blosc'), expectedrows=77806938)

resultSet= get data from database
currRow= t.row
print("Before appending data: %s" % str(datetime.now()))
for (attributes ..) in resultSet:
    currRow['name']= attribute[0]
    currRow['address']= attribute[1]
    currRow['emailAddr']= attribute[2]
    currRow['phone']= attribute[3]
    currRow.append()
print("After done appending: %s" % str(datetime.now()))
t.flush()
print("After done flushing: %s" % str(datetime.now()))
```

.. which gives me:

    Before appending data: 2013-04-11 10:42:39.903713
    After done appending:  2013-04-11 11:04:10.002712
    After done flushing:   2013-04-11 11:05:50.059893

It seems like append() takes a lot of time. Any suggestions on how to improve this?

Thanks,
Shyam
From: Anthony S. <sc...@gm...> - 2013-04-11 16:15:12
Thanks for bringing this up, Julio. Hmm I don't think that this exists currently, but since there are readWhere() and readSorted() it shouldn't be too hard to implement. I have opened issue #225 to this effect. Pull requests welcome! https://github.com/PyTables/PyTables/issues/225 Be Well Anthony On Wed, Apr 10, 2013 at 1:02 PM, Dr. Louis Wicker <lou...@no...>wrote: > I am also interested in the this capability, if it exists in some way... > > Lou > > On Apr 10, 2013, at 12:35 PM, Julio Trevisan <jul...@gm...> > wrote: > > > Hi, > > > > Is there a way that I could have the ability of readWhere (i.e., specify > condition, and fast result) but also using a CSIndex so that the rows come > sorted in a particular order? > > > > I checked readSorted() but it is iterative and does not allow to specify > a condition. > > > > Julio > > > ------------------------------------------------------------------------------ > > Precog is a next-generation analytics platform capable of advanced > > analytics on semi-structured data. The platform includes APIs for > building > > apps and a phenomenal toolset for data science. Developers can use > > our toolset for easy data analysis & visualization. Get a free account! > > > http://www2.precog.com/precogplatform/slashdotnewsletter_______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > ---------------------------------------------------------------------------- > | Dr. Louis J. Wicker > | NSSL/WRDD Rm 4366 > | National Weather Center > | 120 David L. Boren Boulevard, Norman, OK 73072 > | > | E-mail: Lou...@no... > | HTTP: http://www.nssl.noaa.gov/~lwicker > | Phone: (405) 325-6340 > | Fax: (405) 325-6780 > | > | > I For every complex problem, there is a solution that is simple, > | neat, and wrong. > | > | -- H. L. Mencken > | > > ---------------------------------------------------------------------------- > | "The contents of this message are mine personally and > | do not reflect any position of the Government or NOAA." > > ---------------------------------------------------------------------------- > > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
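Until something like that lands, one workable sketch is to do the filtered read and then sort the resulting structured array in memory; `table` is assumed to be an already-open Table, and the column and condition names here are made up:

```python
import numpy as np

# filtered read (fast, can use indexes), then an in-memory sort on the key column
rows = table.readWhere('(price > 10.0) & (volume > 0)')
rows_sorted = rows[np.argsort(rows['timestamp'], kind='mergesort')]  # stable sort
```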
From: Dr. L. W. <lou...@no...> - 2013-04-10 18:03:02
I am also interested in the this capability, if it exists in some way... Lou On Apr 10, 2013, at 12:35 PM, Julio Trevisan <jul...@gm...> wrote: > Hi, > > Is there a way that I could have the ability of readWhere (i.e., specify condition, and fast result) but also using a CSIndex so that the rows come sorted in a particular order? > > I checked readSorted() but it is iterative and does not allow to specify a condition. > > Julio > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter_______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users ---------------------------------------------------------------------------- | Dr. Louis J. Wicker | NSSL/WRDD Rm 4366 | National Weather Center | 120 David L. Boren Boulevard, Norman, OK 73072 | | E-mail: Lou...@no... | HTTP: http://www.nssl.noaa.gov/~lwicker | Phone: (405) 325-6340 | Fax: (405) 325-6780 | | I For every complex problem, there is a solution that is simple, | neat, and wrong. | | -- H. L. Mencken | ---------------------------------------------------------------------------- | "The contents of this message are mine personally and | do not reflect any position of the Government or NOAA." ---------------------------------------------------------------------------- |
From: Julio T. <jul...@gm...> - 2013-04-10 17:35:59
Hi,

Is there a way to get the behavior of readWhere() (i.e., specify a condition and get the result quickly) while also using a CSIndex, so that the rows come back sorted in a particular order?

I checked readSorted(), but it is iterative and does not allow specifying a condition.

Julio
From: Julio T. <jul...@gm...> - 2013-04-10 17:12:01
Thanks again :) On Wed, Apr 10, 2013 at 1:53 PM, Anthony Scopatz <sc...@gm...> wrote: > On Wed, Apr 10, 2013 at 11:40 AM, Julio Trevisan <jul...@gm...>wrote: > >> Hi Anthony >> >> Thanks again.* *If it is a problem related to floating-point precision, >> I might use an Int64Col instead, since I don't need the timestamp >> miliseconds. >> > > Another good plan since integers are exact ;) > > >> >> >> Julio >> >> >> >> >> On Wed, Apr 10, 2013 at 1:17 PM, Anthony Scopatz <sc...@gm...>wrote: >> >>> On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm... >>> > wrote: >>> >>>> Hi, >>>> >>>> I am using a Time64Col called "timestamp" in a condition, and I noticed >>>> that the condition does not work (i.e., no rows are selected) if I write >>>> something as: >>>> >>>> for row in node.where("timestamp == %f" % t): >>>> ... >>>> >>>> However, I had this idea of dividing the values by, say 1000, and it >>>> does work: >>>> >>>> for row in node.where("timestamp/1000 == %f" % t/1000): >>>> ... >>>> >>>> However, this doesn't seem to be an elegant solution. Please could >>>> someone point out a better solution to this? >>>> >>> >>> Hello Julio, >>> >>> While this may not be the most elegant solution it is probably one of >>> the most appropriate. The problem here likely stems from the fact that >>> floating point numbers (which are how Time64Cols are stored) are not exact >>> representations of the desired value. For example: >>> >>> In [1]: 1.1 + 2.2 >>> Out[1]: 3.3000000000000003 >>> >>> So when you divide my some constant order of magnitude, you are chopping >>> off the error associated with floating point precision. You are creating >>> a bin of this constant's size around the target value that is "close >>> enough" to count as equivalent. There are other mechanisms for alleviating >>> this issue: dividing and multiplying back (x/10)*10 == y, right shifting >>> (platform dependent), taking the difference and have it be less than some >>> tolerance x - y <= t. You get the idea. You have to mitigate this effect >>> some how. >>> >>> For more information please refer to: >>> http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html >>> >>> >>>> Could this be related to the fact that my column name is "timestamp"? I >>>> ask this because I use a program called HDFView to brose the HDF5 file. >>>> This program refuses to show the first column when it is called >>>> "timestamp", but shows it when it is called "id". I don't know if the facts >>>> are related or not. >>>> >>> >>> This is probably unrelated. >>> >>> Be Well >>> Anthony >>> >>> >>>> >>>> I don't know if this is useful information, but the conversion of a >>>> typical "t" to string gives something like this: >>>> >>>> >> print "%f" % t >>>> 1365597435.000000 >>>> >>>> >>>> >>>> >>>> ------------------------------------------------------------------------------ >>>> Precog is a next-generation analytics platform capable of advanced >>>> analytics on semi-structured data. The platform includes APIs for >>>> building >>>> apps and a phenomenal toolset for data science. Developers can use >>>> our toolset for easy data analysis & visualization. Get a free account! >>>> http://www2.precog.com/precogplatform/slashdotnewsletter >>>> _______________________________________________ >>>> Pytables-users mailing list >>>> Pyt...@li... 
>>>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>>> >>>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Precog is a next-generation analytics platform capable of advanced >>> analytics on semi-structured data. The platform includes APIs for >>> building >>> apps and a phenomenal toolset for data science. Developers can use >>> our toolset for easy data analysis & visualization. Get a free account! >>> http://www2.precog.com/precogplatform/slashdotnewsletter >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> >> >> >> ------------------------------------------------------------------------------ >> Precog is a next-generation analytics platform capable of advanced >> analytics on semi-structured data. The platform includes APIs for building >> apps and a phenomenal toolset for data science. Developers can use >> our toolset for easy data analysis & visualization. Get a free account! >> http://www2.precog.com/precogplatform/slashdotnewsletter >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2013-04-10 16:54:22
On Wed, Apr 10, 2013 at 11:40 AM, Julio Trevisan <jul...@gm...>wrote: > Hi Anthony > > Thanks again.* *If it is a problem related to floating-point precision, I > might use an Int64Col instead, since I don't need the timestamp miliseconds. > Another good plan since integers are exact ;) > > > Julio > > > > > On Wed, Apr 10, 2013 at 1:17 PM, Anthony Scopatz <sc...@gm...>wrote: > >> On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm...>wrote: >> >>> Hi, >>> >>> I am using a Time64Col called "timestamp" in a condition, and I noticed >>> that the condition does not work (i.e., no rows are selected) if I write >>> something as: >>> >>> for row in node.where("timestamp == %f" % t): >>> ... >>> >>> However, I had this idea of dividing the values by, say 1000, and it >>> does work: >>> >>> for row in node.where("timestamp/1000 == %f" % t/1000): >>> ... >>> >>> However, this doesn't seem to be an elegant solution. Please could >>> someone point out a better solution to this? >>> >> >> Hello Julio, >> >> While this may not be the most elegant solution it is probably one of the >> most appropriate. The problem here likely stems from the fact that >> floating point numbers (which are how Time64Cols are stored) are not exact >> representations of the desired value. For example: >> >> In [1]: 1.1 + 2.2 >> Out[1]: 3.3000000000000003 >> >> So when you divide my some constant order of magnitude, you are chopping >> off the error associated with floating point precision. You are creating >> a bin of this constant's size around the target value that is "close >> enough" to count as equivalent. There are other mechanisms for alleviating >> this issue: dividing and multiplying back (x/10)*10 == y, right shifting >> (platform dependent), taking the difference and have it be less than some >> tolerance x - y <= t. You get the idea. You have to mitigate this effect >> some how. >> >> For more information please refer to: >> http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html >> >> >>> Could this be related to the fact that my column name is "timestamp"? I >>> ask this because I use a program called HDFView to brose the HDF5 file. >>> This program refuses to show the first column when it is called >>> "timestamp", but shows it when it is called "id". I don't know if the facts >>> are related or not. >>> >> >> This is probably unrelated. >> >> Be Well >> Anthony >> >> >>> >>> I don't know if this is useful information, but the conversion of a >>> typical "t" to string gives something like this: >>> >>> >> print "%f" % t >>> 1365597435.000000 >>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Precog is a next-generation analytics platform capable of advanced >>> analytics on semi-structured data. The platform includes APIs for >>> building >>> apps and a phenomenal toolset for data science. Developers can use >>> our toolset for easy data analysis & visualization. Get a free account! >>> http://www2.precog.com/precogplatform/slashdotnewsletter >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> >> >> >> ------------------------------------------------------------------------------ >> Precog is a next-generation analytics platform capable of advanced >> analytics on semi-structured data. The platform includes APIs for building >> apps and a phenomenal toolset for data science. 
Developers can use >> our toolset for easy data analysis & visualization. Get a free account! >> http://www2.precog.com/precogplatform/slashdotnewsletter >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Julio T. <jul...@gm...> - 2013-04-10 16:40:47
Hi Anthony Thanks again.* *If it is a problem related to floating-point precision, I might use an Int64Col instead, since I don't need the timestamp miliseconds. Julio On Wed, Apr 10, 2013 at 1:17 PM, Anthony Scopatz <sc...@gm...> wrote: > On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm...>wrote: > >> Hi, >> >> I am using a Time64Col called "timestamp" in a condition, and I noticed >> that the condition does not work (i.e., no rows are selected) if I write >> something as: >> >> for row in node.where("timestamp == %f" % t): >> ... >> >> However, I had this idea of dividing the values by, say 1000, and it does >> work: >> >> for row in node.where("timestamp/1000 == %f" % t/1000): >> ... >> >> However, this doesn't seem to be an elegant solution. Please could >> someone point out a better solution to this? >> > > Hello Julio, > > While this may not be the most elegant solution it is probably one of the > most appropriate. The problem here likely stems from the fact that > floating point numbers (which are how Time64Cols are stored) are not exact > representations of the desired value. For example: > > In [1]: 1.1 + 2.2 > Out[1]: 3.3000000000000003 > > So when you divide my some constant order of magnitude, you are chopping > off the error associated with floating point precision. You are creating > a bin of this constant's size around the target value that is "close > enough" to count as equivalent. There are other mechanisms for alleviating > this issue: dividing and multiplying back (x/10)*10 == y, right shifting > (platform dependent), taking the difference and have it be less than some > tolerance x - y <= t. You get the idea. You have to mitigate this effect > some how. > > For more information please refer to: > http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html > > >> Could this be related to the fact that my column name is "timestamp"? I >> ask this because I use a program called HDFView to brose the HDF5 file. >> This program refuses to show the first column when it is called >> "timestamp", but shows it when it is called "id". I don't know if the facts >> are related or not. >> > > This is probably unrelated. > > Be Well > Anthony > > >> >> I don't know if this is useful information, but the conversion of a >> typical "t" to string gives something like this: >> >> >> print "%f" % t >> 1365597435.000000 >> >> >> >> >> ------------------------------------------------------------------------------ >> Precog is a next-generation analytics platform capable of advanced >> analytics on semi-structured data. The platform includes APIs for building >> apps and a phenomenal toolset for data science. Developers can use >> our toolset for easy data analysis & visualization. Get a free account! >> http://www2.precog.com/precogplatform/slashdotnewsletter >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Precog is a next-generation analytics platform capable of advanced > analytics on semi-structured data. The platform includes APIs for building > apps and a phenomenal toolset for data science. Developers can use > our toolset for easy data analysis & visualization. Get a free account! > http://www2.precog.com/precogplatform/slashdotnewsletter > _______________________________________________ > Pytables-users mailing list > Pyt...@li... 
> https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2013-04-10 16:17:40
On Wed, Apr 10, 2013 at 7:44 AM, Julio Trevisan <jul...@gm...> wrote:

> Hi,
>
> I am using a Time64Col called "timestamp" in a condition, and I noticed
> that the condition does not work (i.e., no rows are selected) if I write
> something as:
>
>     for row in node.where("timestamp == %f" % t):
>         ...
>
> However, I had this idea of dividing the values by, say 1000, and it does
> work:
>
>     for row in node.where("timestamp/1000 == %f" % t/1000):
>         ...
>
> However, this doesn't seem to be an elegant solution. Please could someone
> point out a better solution to this?

Hello Julio,

While this may not be the most elegant solution, it is probably one of the most appropriate. The problem here likely stems from the fact that floating point numbers (which are how Time64Cols are stored) are not exact representations of the desired value. For example:

    In [1]: 1.1 + 2.2
    Out[1]: 3.3000000000000003

So when you divide by some constant order of magnitude, you are chopping off the error associated with floating point precision. You are creating a bin of this constant's size around the target value that is "close enough" to count as equivalent. There are other mechanisms for alleviating this issue: dividing and multiplying back ((x/10)*10 == y), right shifting (platform dependent), or taking the difference and requiring it to be less than some tolerance (x - y <= t). You get the idea. You have to mitigate this effect somehow.

For more information please refer to: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

> Could this be related to the fact that my column name is "timestamp"? I
> ask this because I use a program called HDFView to browse the HDF5 file.
> This program refuses to show the first column when it is called
> "timestamp", but shows it when it is called "id". I don't know if the
> facts are related or not.

This is probably unrelated.

Be Well
Anthony

> I don't know if this is useful information, but the conversion of a
> typical "t" to string gives something like this:
>
>     >> print "%f" % t
>     1365597435.000000
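In terms of Julio's snippet (with `node` and `t` as defined there), the tolerance approach could look like this; the half-millisecond tolerance is an arbitrary choice for data that only has one-second resolution:

```python
tol = 0.0005  # seconds; anything well below the data's real resolution works
cond = "(timestamp > %.6f) & (timestamp < %.6f)" % (t - tol, t + tol)
matches = [row['timestamp'] for row in node.where(cond)]
```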
From: Julio T. <jul...@gm...> - 2013-04-10 12:44:20
Hi,

I am using a Time64Col called "timestamp" in a condition, and I noticed that the condition does not work (i.e., no rows are selected) if I write something like:

```python
for row in node.where("timestamp == %f" % t):
    ...
```

However, I had the idea of dividing the values by, say, 1000, and that does work:

```python
for row in node.where("timestamp/1000 == %f" % (t/1000)):
    ...
```

However, this doesn't seem to be an elegant solution. Could someone please point out a better solution to this?

Could this be related to the fact that my column name is "timestamp"? I ask because I use a program called HDFView to browse the HDF5 file. This program refuses to show the first column when it is called "timestamp", but shows it when it is called "id". I don't know if the facts are related or not.

I don't know if this is useful information, but the conversion of a typical "t" to a string gives something like this:

    >> print "%f" % t
    1365597435.000000
From: Anthony S. <sc...@gm...> - 2013-04-08 18:57:03
I am glad =) On Apr 8, 2013 12:44 PM, "Julio Trevisan" <jul...@gm...> wrote: > Hey Anthony > > Thanks a lot for this. Your method with map() works around 30000 times > faster! > > > BEFORE: > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.096931 seconds to do > everything else > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.780372 seconds to ZIP > > > AFTER: > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.073058 seconds to do > everything else > (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.000024 seconds to ZIP > > > > > > On Fri, Mar 22, 2013 at 12:35 PM, Anthony Scopatz <sc...@gm...>wrote: > >> On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan <jul...@gm...>wrote: >> >>> Hi, >>> >>> I just joined this list, I am using PyTables for my project and it works >>> great and fast. >>> >>> I am just trying to optimize some parts of the program and I noticed >>> that zipping the tuples to get one tuple per column takes much longer than >>> reading the data itself. The thing is that readWhere() returns one tuple >>> per row, whereas I I need one tuple per column, so I have to use the zip() >>> function to achieve this. Is there a way to skip this zip() operation? >>> Please see below: >>> >>> >>> def quote_GetData(self, period, name, dt1, dt2): >>> """Returns timedata.Quotes object. >>> >>> Arguments: >>> period -- value from within infogetter.QuotePeriod >>> name -- quote symbol >>> dt1, dt2 -- datetime.datetime or timestamp values >>> >>> """ >>> t = time.time() >>> node = self.quote_GetNode(period, name) >>> ts1 = misc.datetime2timestamp(dt1) >>> ts2 = misc.datetime2timestamp(dt2) >>> >>> L = node.readWhere( \ >>> "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" % \ >>> (ts1/1000, ts2/1000)) >>> rowNum = len(L) >>> Q = timedata.Quotes() >>> print "%s: took %f seconds to do everything else" % (name, >>> time.time()-t) >>> >>> t = time.time() >>> if rowNum > 0: >>> (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume, \ >>> Q.numTrades) = zip(*L) >>> print "%s: took %f seconds to ZIP" % (name, time.time()-t) >>> return Q >>> >>> *And the printout:* >>> BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else >>> BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP >>> >> >> Hi Julio, >> >> The problem here isn't zip (packing and un-packing are generally >> fast operations -- they happen *all* the time in Python). Nor is the >> problem specifically with PyTables. Rather this is an issue with how you >> are using numpy structured arrays (look them up). Basically, this is slow >> because you are creating a list of column tuples where every element is a >> Python object of the corresponding type. For example upcasting every >> 32-bit integer to a Python int is very expensive! >> >> What you *should* be doing is keeping the columns as numpy arrays, which >> keeps the memory layout small, continuous, fast, and if done right does not >> require a copy (which you are doing now). >> >> The value of L here is a structured array. 
So say I have some >> other structured array with 4 fields, the right way to do this is to pull >> out each field individually by indexing >> >> a, b, c, d = x['a'], x['b'], x['c'], x['d'] >> >> or more generally (for all fields): >> >> a, b, c, d = map(lambda x: i[x], i.dtype.names) >> >> or for some list of fields: >> >> a, c, b = map(lambda x: i[x], ['a', 'c', 'b']) >> >> Timing both your original method and the new one gives: >> >> In [47]: timeit a, b, c, d = zip(*i) >> 1000 loops, best of 3: 1.3 ms per loop >> >> In [48]: timeit a, b, c, d = map(lambda x: i[x], i.dtype.names) >> 100000 loops, best of 3: 2.3 µs per loop >> >> So the method I propose is 500x-1000x times faster. Using numpy >> idiomatically is very important! >> >> Be Well >> Anthony >> >> >>> >>> >>> >>> >>> >>> ------------------------------------------------------------------------------ >>> Everyone hates slow websites. So do we. >>> Make your web apps faster with AppDynamics >>> Download AppDynamics Lite for free today: >>> http://p.sf.net/sfu/appdyn_d2d_mar >>> _______________________________________________ >>> Pytables-users mailing list >>> Pyt...@li... >>> https://lists.sourceforge.net/lists/listinfo/pytables-users >>> >>> >> >> >> ------------------------------------------------------------------------------ >> Everyone hates slow websites. So do we. >> Make your web apps faster with AppDynamics >> Download AppDynamics Lite for free today: >> http://p.sf.net/sfu/appdyn_d2d_mar >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Minimize network downtime and maximize team effectiveness. > Reduce network management and security costs.Learn how to hire > the most talented Cisco Certified professionals. Visit the > Employer Resources Portal > http://www.cisco.com/web/learning/employer_resources/index.html > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Julio T. <jul...@gm...> - 2013-04-08 17:43:52
Hey Anthony Thanks a lot for this. Your method with map() works around 30000 times faster! BEFORE: (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.096931 seconds to do everything else (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.780372 seconds to ZIP AFTER: (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.073058 seconds to do everything else (database)DEBUG:BOVESPA.VISTA.PETR4: took 0.000024 seconds to ZIP On Fri, Mar 22, 2013 at 12:35 PM, Anthony Scopatz <sc...@gm...> wrote: > On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan <jul...@gm...>wrote: > >> Hi, >> >> I just joined this list, I am using PyTables for my project and it works >> great and fast. >> >> I am just trying to optimize some parts of the program and I noticed that >> zipping the tuples to get one tuple per column takes much longer than >> reading the data itself. The thing is that readWhere() returns one tuple >> per row, whereas I I need one tuple per column, so I have to use the zip() >> function to achieve this. Is there a way to skip this zip() operation? >> Please see below: >> >> >> def quote_GetData(self, period, name, dt1, dt2): >> """Returns timedata.Quotes object. >> >> Arguments: >> period -- value from within infogetter.QuotePeriod >> name -- quote symbol >> dt1, dt2 -- datetime.datetime or timestamp values >> >> """ >> t = time.time() >> node = self.quote_GetNode(period, name) >> ts1 = misc.datetime2timestamp(dt1) >> ts2 = misc.datetime2timestamp(dt2) >> >> L = node.readWhere( \ >> "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" % \ >> (ts1/1000, ts2/1000)) >> rowNum = len(L) >> Q = timedata.Quotes() >> print "%s: took %f seconds to do everything else" % (name, >> time.time()-t) >> >> t = time.time() >> if rowNum > 0: >> (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume, \ >> Q.numTrades) = zip(*L) >> print "%s: took %f seconds to ZIP" % (name, time.time()-t) >> return Q >> >> *And the printout:* >> BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else >> BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP >> > > Hi Julio, > > The problem here isn't zip (packing and un-packing are generally > fast operations -- they happen *all* the time in Python). Nor is the > problem specifically with PyTables. Rather this is an issue with how you > are using numpy structured arrays (look them up). Basically, this is slow > because you are creating a list of column tuples where every element is a > Python object of the corresponding type. For example upcasting every > 32-bit integer to a Python int is very expensive! > > What you *should* be doing is keeping the columns as numpy arrays, which > keeps the memory layout small, continuous, fast, and if done right does not > require a copy (which you are doing now). > > The value of L here is a structured array. So say I have some > other structured array with 4 fields, the right way to do this is to pull > out each field individually by indexing > > a, b, c, d = x['a'], x['b'], x['c'], x['d'] > > or more generally (for all fields): > > a, b, c, d = map(lambda x: i[x], i.dtype.names) > > or for some list of fields: > > a, c, b = map(lambda x: i[x], ['a', 'c', 'b']) > > Timing both your original method and the new one gives: > > In [47]: timeit a, b, c, d = zip(*i) > 1000 loops, best of 3: 1.3 ms per loop > > In [48]: timeit a, b, c, d = map(lambda x: i[x], i.dtype.names) > 100000 loops, best of 3: 2.3 µs per loop > > So the method I propose is 500x-1000x times faster. Using numpy > idiomatically is very important! 
> > Be Well > Anthony > > >> >> >> >> >> >> ------------------------------------------------------------------------------ >> Everyone hates slow websites. So do we. >> Make your web apps faster with AppDynamics >> Download AppDynamics Lite for free today: >> http://p.sf.net/sfu/appdyn_d2d_mar >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> >> > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2013-03-22 15:35:36
On Fri, Mar 22, 2013 at 7:11 AM, Julio Trevisan <jul...@gm...>wrote: > Hi, > > I just joined this list, I am using PyTables for my project and it works > great and fast. > > I am just trying to optimize some parts of the program and I noticed that > zipping the tuples to get one tuple per column takes much longer than > reading the data itself. The thing is that readWhere() returns one tuple > per row, whereas I I need one tuple per column, so I have to use the zip() > function to achieve this. Is there a way to skip this zip() operation? > Please see below: > > > def quote_GetData(self, period, name, dt1, dt2): > """Returns timedata.Quotes object. > > Arguments: > period -- value from within infogetter.QuotePeriod > name -- quote symbol > dt1, dt2 -- datetime.datetime or timestamp values > > """ > t = time.time() > node = self.quote_GetNode(period, name) > ts1 = misc.datetime2timestamp(dt1) > ts2 = misc.datetime2timestamp(dt2) > > L = node.readWhere( \ > "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" % \ > (ts1/1000, ts2/1000)) > rowNum = len(L) > Q = timedata.Quotes() > print "%s: took %f seconds to do everything else" % (name, > time.time()-t) > > t = time.time() > if rowNum > 0: > (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume, \ > Q.numTrades) = zip(*L) > print "%s: took %f seconds to ZIP" % (name, time.time()-t) > return Q > > *And the printout:* > BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else > BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP > Hi Julio, The problem here isn't zip (packing and un-packing are generally fast operations -- they happen *all* the time in Python). Nor is the problem specifically with PyTables. Rather this is an issue with how you are using numpy structured arrays (look them up). Basically, this is slow because you are creating a list of column tuples where every element is a Python object of the corresponding type. For example upcasting every 32-bit integer to a Python int is very expensive! What you *should* be doing is keeping the columns as numpy arrays, which keeps the memory layout small, continuous, fast, and if done right does not require a copy (which you are doing now). The value of L here is a structured array. So say I have some other structured array with 4 fields, the right way to do this is to pull out each field individually by indexing a, b, c, d = x['a'], x['b'], x['c'], x['d'] or more generally (for all fields): a, b, c, d = map(lambda x: i[x], i.dtype.names) or for some list of fields: a, c, b = map(lambda x: i[x], ['a', 'c', 'b']) Timing both your original method and the new one gives: In [47]: timeit a, b, c, d = zip(*i) 1000 loops, best of 3: 1.3 ms per loop In [48]: timeit a, b, c, d = map(lambda x: i[x], i.dtype.names) 100000 loops, best of 3: 2.3 µs per loop So the method I propose is 500x-1000x times faster. Using numpy idiomatically is very important! Be Well Anthony > > > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
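A self-contained illustration of the field-extraction idiom above, with a small structured array standing in for what readWhere() returns (the field names are made up):

```python
import numpy as np

# stand-in for the structured array that readWhere() returns
L = np.zeros(5, dtype=[('timestamp', 'f8'), ('open', 'f8'),
                       ('close', 'f8'), ('volume', 'i8')])

# each name indexes out one column as a NumPy array view -- no per-element
# Python objects are created, unlike zip(*L)
timestamp, open_, close, volume = (L[name] for name in L.dtype.names)
```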
From: Julio T. <jul...@gm...> - 2013-03-22 12:11:17
Hi,

I just joined this list. I am using PyTables for my project and it works great and fast.

I am trying to optimize some parts of the program, and I noticed that zipping the tuples to get one tuple per column takes much longer than reading the data itself. The thing is that readWhere() returns one tuple per row, whereas I need one tuple per column, so I have to use the zip() function to achieve this. Is there a way to skip this zip() operation? Please see below:

```python
def quote_GetData(self, period, name, dt1, dt2):
    """Returns timedata.Quotes object.

    Arguments:
    period -- value from within infogetter.QuotePeriod
    name -- quote symbol
    dt1, dt2 -- datetime.datetime or timestamp values
    """
    t = time.time()
    node = self.quote_GetNode(period, name)
    ts1 = misc.datetime2timestamp(dt1)
    ts2 = misc.datetime2timestamp(dt2)

    L = node.readWhere(
        "(timestamp/1000 >= %f) & (timestamp/1000 <= %f)" %
        (ts1/1000, ts2/1000))
    rowNum = len(L)
    Q = timedata.Quotes()
    print "%s: took %f seconds to do everything else" % (name, time.time()-t)

    t = time.time()
    if rowNum > 0:
        (Q.timestamp, Q.open, Q.close, Q.high, Q.low, Q.volume,
         Q.numTrades) = zip(*L)
    print "%s: took %f seconds to ZIP" % (name, time.time()-t)
    return Q
```

And the printout:

    BOVESPA.VISTA.PETR4: took 0.068788 seconds to do everything else
    BOVESPA.VISTA.PETR4: took 0.379910 seconds to ZIP
From: Anthony S. <sc...@gm...> - 2013-03-18 20:49:42
On Fri, Mar 15, 2013 at 5:26 PM, Thadeus Burgess <tha...@th...>wrote: > Looks similar to https://github.com/PyTables/PyTables/issues/206 Yes this is the same problem. I proposed a test on the issue if someone wants to try working on it, that would be great! > > -- > Thadeus > > > > On Fri, Mar 15, 2013 at 6:58 PM, Dmitry Fedorov <fe...@ec...>wrote: > >> Hi, >> >> I'm trying to index a simple uint32 column and search if a value is >> there. Although during the in-kernel query i get an error like this: >> >> Traceback (most recent call last): >> File "G:\_2\PytablesTestCode\pytableindex.py", line 9, in <module> >> r = [ row.nrow for row in it ] >> File "tableExtension.pyx", line 858, in >> tables.tableExtension.Row.__next__ (tables\tableExtension.c:7788) >> File "tableExtension.pyx", line 879, in >> tables.tableExtension.Row.__next__indexed >> (tables\tableExtension.c:7922) >> AssertionError >> >> The second time I perform the same query I get no errors but also I >> got no results... >> >> I get this on python 2.7, pytables 2.4.0 and 2.3.1 under windows 64 (7 >> and 8). I'm using pytables distro from >> <http://www.lfd.uci.edu/~gohlke/pythonlibs/#pytables> >> >> I'm sure I've had code like this running just fine on older pytables >> 2.2.X (with pro) and older python... >> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<< >> Here's table creation code: >> >> import tables >> import numpy as np >> import random >> >> #initalizes column types >> class Columns(tables.IsDescription): >> imageid = tables.UInt32Col(pos=1) >> feature = tables.Float64Col(pos=2, shape=(30,)) >> >> #creates Table >> h5file=tables.openFile('features.h5','a', title='features') >> table = h5file.createTable('/', 'values', Columns, >> expectedrows=1000000000) >> table.flush() >> >> Columns = table.row >> >> #appends values to table >> for i in xrange(100): >> Columns['imageid'] = i >> Columns['feature'] = [np.random.random_sample((30,))] >> Columns.append() >> >> table.flush() >> >> table.cols.imageid.createCSIndex() >> h5file.close() >> >> <<<<<<<<<<<<<<<<<<<<<<<<<<<< >> Here's table query code: >> >> import time >> import tables >> >> #opens tables >> h5file=tables.openFile('features.h5','a') >> table=h5file.root.values >> >> it = table.where('imageid == 8') >> r = [ row.nrow for row in it ] >> >> print r >> >> ----- >> Thank you, >> Dmitry >> >> >> ------------------------------------------------------------------------------ >> Everyone hates slow websites. So do we. >> Make your web apps faster with AppDynamics >> Download AppDynamics Lite for free today: >> http://p.sf.net/sfu/appdyn_d2d_mar >> _______________________________________________ >> Pytables-users mailing list >> Pyt...@li... >> https://lists.sourceforge.net/lists/listinfo/pytables-users >> > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Thadeus B. <tha...@th...> - 2013-03-16 00:27:34
|
Looks similar to https://github.com/PyTables/PyTables/issues/206 -- Thadeus On Fri, Mar 15, 2013 at 6:58 PM, Dmitry Fedorov <fe...@ec...>wrote: > Hi, > > I'm trying to index a simple uint32 column and search if a value is > there. Although during the in-kernel query i get an error like this: > > Traceback (most recent call last): > File "G:\_2\PytablesTestCode\pytableindex.py", line 9, in <module> > r = [ row.nrow for row in it ] > File "tableExtension.pyx", line 858, in > tables.tableExtension.Row.__next__ (tables\tableExtension.c:7788) > File "tableExtension.pyx", line 879, in > tables.tableExtension.Row.__next__indexed > (tables\tableExtension.c:7922) > AssertionError > > The second time I perform the same query I get no errors but also I > got no results... > > I get this on python 2.7, pytables 2.4.0 and 2.3.1 under windows 64 (7 > and 8). I'm using pytables distro from > <http://www.lfd.uci.edu/~gohlke/pythonlibs/#pytables> > > I'm sure I've had code like this running just fine on older pytables > 2.2.X (with pro) and older python... > > <<<<<<<<<<<<<<<<<<<<<<<<<<<< > Here's table creation code: > > import tables > import numpy as np > import random > > #initalizes column types > class Columns(tables.IsDescription): > imageid = tables.UInt32Col(pos=1) > feature = tables.Float64Col(pos=2, shape=(30,)) > > #creates Table > h5file=tables.openFile('features.h5','a', title='features') > table = h5file.createTable('/', 'values', Columns, expectedrows=1000000000) > table.flush() > > Columns = table.row > > #appends values to table > for i in xrange(100): > Columns['imageid'] = i > Columns['feature'] = [np.random.random_sample((30,))] > Columns.append() > > table.flush() > > table.cols.imageid.createCSIndex() > h5file.close() > > <<<<<<<<<<<<<<<<<<<<<<<<<<<< > Here's table query code: > > import time > import tables > > #opens tables > h5file=tables.openFile('features.h5','a') > table=h5file.root.values > > it = table.where('imageid == 8') > r = [ row.nrow for row in it ] > > print r > > ----- > Thank you, > Dmitry > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_mar > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
From: Dmitry F. <fe...@ec...> - 2013-03-15 23:59:17
|
Hi, I'm trying to index a simple uint32 column and search for whether a value is there. However, during the in-kernel query I get an error like this: Traceback (most recent call last): File "G:\_2\PytablesTestCode\pytableindex.py", line 9, in <module> r = [ row.nrow for row in it ] File "tableExtension.pyx", line 858, in tables.tableExtension.Row.__next__ (tables\tableExtension.c:7788) File "tableExtension.pyx", line 879, in tables.tableExtension.Row.__next__indexed (tables\tableExtension.c:7922) AssertionError The second time I perform the same query I get no errors, but I also get no results... I get this on Python 2.7, PyTables 2.4.0 and 2.3.1 under Windows 64 (7 and 8). I'm using the PyTables distro from <http://www.lfd.uci.edu/~gohlke/pythonlibs/#pytables> I'm sure I've had code like this running just fine on older PyTables 2.2.X (with Pro) and older Python... <<<<<<<<<<<<<<<<<<<<<<<<<<<< Here's the table creation code: import tables import numpy as np import random #initializes column types class Columns(tables.IsDescription): imageid = tables.UInt32Col(pos=1) feature = tables.Float64Col(pos=2, shape=(30,)) #creates Table h5file=tables.openFile('features.h5','a', title='features') table = h5file.createTable('/', 'values', Columns, expectedrows=1000000000) table.flush() Columns = table.row #appends values to table for i in xrange(100): Columns['imageid'] = i Columns['feature'] = [np.random.random_sample((30,))] Columns.append() table.flush() table.cols.imageid.createCSIndex() h5file.close() <<<<<<<<<<<<<<<<<<<<<<<<<<<< Here's the table query code: import time import tables #opens tables h5file=tables.openFile('features.h5','a') table=h5file.root.values it = table.where('imageid == 8') r = [ row.nrow for row in it ] print r ----- Thank you, Dmitry |
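A quick diagnostic to try, sketched below with the PyTables 2.x method names: ask the table whether the condition will actually use the index, and drop and rebuild the index to rule out a stale one. This is only a guess at a workaround, not a confirmed fix for the AssertionError.

import tables

h5file = tables.openFile('features.h5', 'a')
table = h5file.root.values

# Report which column indexes (if any) this condition would use;
# an empty result means the query falls back to a plain in-kernel scan.
print table.willQueryUseIndexing('imageid == 8')

# Drop and recreate the completely sorted index in case it is stale.
table.cols.imageid.removeIndex()
table.cols.imageid.createCSIndex()

h5file.close()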
From: Tim B. <tim...@ma...> - 2013-03-11 22:24:14
|
The netCDF library gives me a masked array so I have to explicitly transform that into a regular numpy array. Ahh interesting. Depending on the netCDF version the file was made with, you should be able to read the file directly from PyTables. You could thus directly get a normal numpy array. This *should* be possible, but I have never tried it ;) I think the netCDF3 functionality has been taken out or at least deprecated (https://github.com/PyTables/PyTables/issues/68). Using the python-netCDF4 module allows me to pull from pretty much any netcdf file, and the inherent masking is sometimes very useful where the dataset is smaller and I can live with the lower performance of masks. I've looked under the covers and have seen that the ma masked implementation is all pure Python and so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation is (started a new job here). I tried to do an implementation in memory (except for the final write) and found that I have about 2GB of indices when I extract the quality indices. Simply using those indexes, memory usage grows to over 64GB and I eventually run out of memory and start churning away in swap. For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give me better performance and is code-wise pretty simple, so for the moment it's good enough. Awesome! I am glad that this is working for you. Yes - appears to work great! |
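For reference, the in-memory feature mentioned here is the HDF5 CORE driver exposed in the development branch that became PyTables 3.0. A minimal sketch follows, assuming the new-style open_file() API and the driver keyword names used in the development documentation; the file name and array shape are placeholders.

import numpy as np
import tables

# Build the file entirely in RAM. With backing_store enabled, the in-memory
# image is written out to disk when the file is closed; with it disabled,
# the data never touches the disk at all.
h5file = tables.open_file('sst_inmem.h5', 'w',
                          driver='H5FD_CORE',
                          driver_core_backing_store=1)

sst = h5file.create_array('/', 'sst',
                          np.zeros((365, 180, 360), dtype='float32'))
sst[0, :, :] = np.random.random((180, 360)).astype('float32')

h5file.close()   # the in-memory image is flushed to sst_inmem.h5 here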
From: Anthony S. <sc...@gm...> - 2013-03-11 19:16:14
|
On Sun, Mar 10, 2013 at 8:47 PM, Tim Burgess <tim...@ma...> wrote: > > On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote: > > > Hey Tim, > > > > Awesome dataset! And neat image! > > > > As per your request, a couple of minor things I noticed were that you > probably don't need to do the sanity check each time (great for debugging, > but not needed always), you are using masked arrays which while sometimes > convenient are generally slower than creating an array, a mask and applying > the mask to the array, and you seem to be downcasting from float64 to > float32 for some reason that I am not entirely clear on (size, speed?). > > > > To the more major question of write performance, one thing that you > could try is compression. You might want to do some timing studies to find > the best compressor and level. Performance here can vary a lot based on how > similar your data is (and how close similar data is to each other). If you > have got a bunch of zeros and only a few real data points, even zlib 1 is > going to be blazing fast compared to writing all those zeros out explicitly. > > > > Another thing you could try doing is switching to EArray and using the > append() method. This might save PyTables, numpy, hdf5, etc from having to > check that the shape of "sst_node[qual_indices]" is actually the same as > the data you are giving it. Additionally dumping a block of memory to the > file directly (via append()) is generally faster than having to resolve > fancy indexes (which are notoriously the slow part of even numpy). > > > > Lastly, as a general comment, you seem to be doing a lot of stuff in the > inner most loop -- including writing to disk. I would look at how you > could restructure this to move as much as possible out of this loop. Your > data seems to be about 12 GB for a year, so this is probably too big to > build up the full sst array completely in memory prior to writing. That > is, unless you have a computer much bigger than my laptop ;). But issuing > one fat write command is probably going to be faster than making 365 of > them. > > > > Happy hacking! > > Be Well > > Anthony > > > > > Thanks Anthony for being so responsive and touching on a number of points. > > The netCDF library gives me a masked array so I have to explicitly > transform that into a regular numpy array. Ahh interesting. Depending on the netCDF version the file was made with, you should be able to read the file directly from PyTables. You could thus directly get a normal numpy array. This *should* be possible, but I have never tried it ;) > I've looked under the covers and have seen that the ma masked > implementation is all pure Python and so there is a performance drawback. > I'm not up to speed yet on where the numpy.na masking implementation is > (started a new job here). > > I tried to do an implementation in memory (except for the final write) and > found that I have about 2GB of indices when I extract the quality indices. > Simply using those indexes, memory usage grows to over 64GB and I > eventually run out of memory and start churning away in swap. > > For the moment, I have pulled down the latest git master and am using the > new in-memory HDF feature. This seems to give be better performance and is > code-wise pretty simple so for the moment, it's good enough. > Awesome! I am glad that this is working for you. > Cheers and thanks again, Tim > > BTW I viewed your SciPy tutorial. Good stuff! > Thanks! 
|
From: Tim B. <tim...@ma...> - 2013-03-11 01:48:19
|
On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote: > Hey Tim, > > Awesome dataset! And neat image! > > As per your request, a couple of minor things I noticed were that you probably don't need to do the sanity check each time (great for debugging, but not needed always), you are using masked arrays which while sometimes convenient are generally slower than creating an array, a mask and applying the mask to the array, and you seem to be downcasting from float64 to float32 for some reason that I am not entirely clear on (size, speed?). > > To the more major question of write performance, one thing that you could try is compression. You might want to do some timing studies to find the best compressor and level. Performance here can vary a lot based on how similar your data is (and how close similar data is to each other). If you have got a bunch of zeros and only a few real data points, even zlib 1 is going to be blazing fast compared to writing all those zeros out explicitly. > > Another thing you could try doing is switching to EArray and using the append() method. This might save PyTables, numpy, hdf5, etc from having to check that the shape of "sst_node[qual_indices]" is actually the same as the data you are giving it. Additionally dumping a block of memory to the file directly (via append()) is generally faster than having to resolve fancy indexes (which are notoriously the slow part of even numpy). > > Lastly, as a general comment, you seem to be doing a lot of stuff in the inner most loop -- including writing to disk. I would look at how you could restructure this to move as much as possible out of this loop. Your data seems to be about 12 GB for a year, so this is probably too big to build up the full sst array completely in memory prior to writing. That is, unless you have a computer much bigger than my laptop ;). But issuing one fat write command is probably going to be faster than making 365 of them. > > Happy hacking! > Be Well > Anthony > Thanks Anthony for being so responsive and touching on a number of points. The netCDF library gives me a masked array so I have to explicitly transform that into a regular numpy array. I've looked under the covers and have seen that the ma masked implementation is all pure Python and so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation is (started a new job here). I tried to do an implementation in memory (except for the final write) and found that I have about 2GB of indices when I extract the quality indices. Simply using those indexes, memory usage grows to over 64GB and I eventually run out of memory and start churning away in swap. For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give be better performance and is code-wise pretty simple so for the moment, it's good enough. Cheers and thanks again, Tim BTW I viewed your SciPy tutorial. Good stuff! |
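To make Anthony's suggestions concrete, here is a rough sketch that fills the masked values up front and appends one day at a time to a compressed EArray. The array shape, fill value, compressor, and level are all illustrative assumptions rather than the poster's actual configuration, and the masked_all() call simply stands in for whatever the netCDF reader returns.

import numpy as np
import numpy.ma as ma
import tables

h5file = tables.openFile('sst_year.h5', 'w')
filters = tables.Filters(complevel=1, complib='zlib')   # cheap, fast compression

# EArray extendable along the first (time) axis; one append per day of data.
sst = h5file.createEArray('/', 'sst', tables.Float32Atom(),
                          shape=(0, 720, 1440), filters=filters,
                          expectedrows=365)

for day in range(365):
    masked = ma.masked_all((720, 1440), dtype='float32')   # stand-in for netCDF data
    # Convert the masked array to a plain ndarray once, up front; masked cells
    # become NaN, so no pure-Python mask handling happens during the write.
    plain = ma.filled(masked, np.nan).astype('float32')
    sst.append(plain[np.newaxis, ...])   # append one (1, 720, 1440) slab

h5file.close()

With real data, the expensive step then becomes a single contiguous append per day instead of a fancy-indexed assignment.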
From: Thadeus B. <tha...@th...> - 2013-03-08 01:14:28
|
Thank you for the information. I will run a few more tests over the next couple of days, one day with no compression, and one day with a chunksize similar to what will be appended each cycle; hopefully I will get a chance to report back. A ptrepack into a file with no compression is half the size of its append/compress/lots of unused space counterpart. The reason for using compression is to reduce the IO required from the network-backed storage, not necessarily to reduce disk space, although that is a plus. -- Thadeus On Thu, Mar 7, 2013 at 5:40 PM, Anthony Scopatz <sc...@gm...> wrote: > Hi Thadeus, > > HDF5 does not guarantee that the data is contiguous on disk between > blocks. That is, there may be empty space in your file. Furthermore, > compression really messes with HDF5's ability to predict how large blocks > will end up being. To avoid accidental data loss, HDF5 tends to over > predict the empty buffer space needed. > > Thus my guess is that by having this tight loop around open/append/close, > you keep accidentally triggering extraneous buffer space. You basically > have two options: > > 1. turn off compression. size prediction is exact without it. > 2. periodically run ptrepack. (every 10, 100, 1000 cycles? end of the > day?) > > Hope this helps > Be Well > Anthony > > > On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <tha...@th...> wrote: > >> I have a PyTables file that receives many appends to a Table throughout >> the day, the file is opened, a small bit of data is appended, and the file >> is closed. The open/append/close can happen many times in a minute. >> Anywhere from 1-500 rows are appended at any given time. By the end of the >> day, this file is expected to have roughly 66000 rows. Chunkshape is set to >> 1500 for no particular reason (doesn't seem to make a difference, and some >> other files can be 5 million/day). BLOSC with lvl 9 compression is used on >> the table. Data is never deleted from the table. There are roughly 12 >> columns on the Table. >> >> The problem is that at the end of the day this file is 1GB in size. I >> don't understand why the file is growing so big. The tbl.size_on_disk shows >> a meager 20MB. >> >> I have used ptrepack with --keep-source-filters and --chunkshape=keep. >> The new file is only 30MB in size which is reasonable. >> I have also used ptrepack with --chunkshape=auto and although it set the >> chunkshape to around 388, there was no significant change in filesize from >> chunkshape of 1500. >> >> Is pytables not re-using chunks on new appends. When 50 rows are >> appended, is it still writing a chunk sized for 1500 rows. When the next >> append comes along, it writes a brand new chunk instead of opening the old >> chunk and appending the data? >> >> Should my chunksize really be "expected rows to append each time" instead >> of "expected total rows"? >> >> -- >> Thadeus >> > > |
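Sketching the second experiment Thadeus describes: size the chunkshape to the typical append batch rather than the expected daily total. The column layout, batch size, and file name below are placeholders, not the real schema.

import tables

class Tick(tables.IsDescription):
    timestamp = tables.Int64Col(pos=1)
    price     = tables.Float64Col(pos=2)
    volume    = tables.UInt32Col(pos=3)

h5file = tables.openFile('ticks.h5', 'w')
filters = tables.Filters(complevel=9, complib='blosc')

# Chunks sized to the typical append batch (roughly 500 rows here) so that
# each open/append/close cycle touches only one or two chunks.
table = h5file.createTable('/', 'ticks', Tick,
                           filters=filters, chunkshape=(500,),
                           expectedrows=66000)

table.append([(1362700000000, 101.5, 200)] * 500)   # one batch of fake rows
table.flush()
h5file.close()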
From: Anthony S. <sc...@gm...> - 2013-03-07 23:40:48
|
Hi Thadeus, HDF5 does not guarantee that the data is contiguous on disk between blocks. That is, there may be empty space in your file. Furthermore, compression really messes with HDF5's ability to predict how large blocks will end up being. To avoid accidental data loss, HDF5 tends to over-predict the empty buffer space needed. Thus my guess is that by having this tight loop around open/append/close, you keep accidentally triggering extraneous buffer space. You basically have two options: 1. Turn off compression; size prediction is exact without it. 2. Periodically run ptrepack (every 10, 100, 1000 cycles? at the end of the day?). Hope this helps Be Well Anthony On Thu, Mar 7, 2013 at 5:26 PM, Thadeus Burgess <tha...@th...> wrote: > I have a PyTables file that receives many appends to a Table throughout > the day, the file is opened, a small bit of data is appended, and the file > is closed. The open/append/close can happen many times in a minute. > Anywhere from 1-500 rows are appended at any given time. By the end of the > day, this file is expected to have roughly 66000 rows. Chunkshape is set to > 1500 for no particular reason (doesn't seem to make a difference, and some > other files can be 5 million/day). BLOSC with lvl 9 compression is used on > the table. Data is never deleted from the table. There are roughly 12 > columns on the Table. > > The problem is that at the end of the day this file is 1GB in size. I > don't understand why the file is growing so big. The tbl.size_on_disk shows > a meager 20MB. > > I have used ptrepack with --keep-source-filters and --chunkshape=keep. The > new file is only 30MB in size which is reasonable. > I have also used ptrepack with --chunkshape=auto and although it set the > chunkshape to around 388, there was no significant change in filesize from > chunkshape of 1500. > > Is pytables not re-using chunks on new appends. When 50 rows are appended, > is it still writing a chunk sized for 1500 rows. When the next append comes > along, it writes a brand new chunk instead of opening the old chunk and > appending the data? > > Should my chunksize really be "expected rows to append each time" instead > of "expected total rows"? > > -- > Thadeus > |
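And a sketch of option 2, driving ptrepack from Python with the same flags used earlier in this thread. The file names and the repack cadence are placeholders; adjust to taste.

import os
import subprocess

def repack(src, dst):
    # Rewrite the file contiguously, keeping the source filters and
    # chunkshape, then swap the packed copy into place.
    if os.path.exists(dst):
        os.remove(dst)
    subprocess.check_call(['ptrepack', '--keep-source-filters',
                           '--chunkshape=keep', src + ':/', dst + ':/'])
    os.rename(dst, src)

# Call this from the writer process every N append cycles, or once at the
# end of the day, for example:
#     repack('ticks.h5', 'ticks_packed.h5')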