From: Anthony S. <sc...@gm...> - 2012-11-17 01:11:08
|
On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote: > Hi all, > I am trying to find the best way to make histograms from large data > sets. Up to now, I've been just loading entire columns into in-memory > numpy arrays and making histograms from those. However, I'm currently > working on a handful of datasets where this is prohibitively memory > intensive (causing an out-of-memory kernel panic on a shared machine > that you have to open a ticket to have rebooted makes you a little > gun-shy), so I am now exploring other options. > > I know that the Column object is rather nicely set up to act, in some > circumstances, like a numpy ndarray. So my first thought is to try just > creating the histogram out of the Column object directly. This is, > however, 1000x slower than loading it into memory and creating the > histogram from the in-memory array. Please see my test notebook at: > http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html > > For such a small table, loading into memory is not an issue. For larger > tables, though, it is a problem, and I had hoped that pytables was > optimized so that histogramming directly from disk would proceed no > slower than loading into memory and histogramming. Is there some other > way of accessing the column (or Array or CArray) data that will make > faster histograms? > Hi Jon, This is not surprising since the column object itself is going to be iterated over per row. As you found, reading in each row individually will be prohibitively expensive as compared to reading in all the data at once. To do this in the right way for data that is larger than system memory, you need to read it in in chunks. Luckily there are tools to help you automate this process already in PyTables. I would recommend that you use expressions [1] or queries [2] to do your histogramming more efficiently. Be Well Anthony 1. http://pytables.github.com/usersguide/libref/expr_class.html 2. http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying > Regards, > Jon > > > ------------------------------------------------------------------------------ > Monitor your physical, virtual and cloud infrastructure from a single > web console. Get in-depth insight into apps, servers, databases, vmware, > SAP, cloud infrastructure, etc. Download 30-day Free Trial. > Pricing starts from $795 for 25 servers or applications! > http://p.sf.net/sfu/zoho_dev2dev_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |
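A minimal sketch of the chunked approach Anthony describes above, using plain block reads plus numpy.histogram with fixed bin edges; this is an alternative to the Expr/query route he links to, and the file, node, column name and bin range are hypothetical.

```python
import numpy as np
import tables

h5f = tables.openFile("data.h5", "r")        # hypothetical file
table = h5f.root.mygroup.mytable             # hypothetical table

edges = np.linspace(0.0, 1.0, 101)           # same bin edges for every chunk
counts = np.zeros(len(edges) - 1, dtype=np.int64)

chunksize = 1000000
for start in xrange(0, table.nrows, chunksize):
    stop = min(start + chunksize, table.nrows)
    block = table.read(start, stop, field='energy')   # one chunk in memory at a time
    counts += np.histogram(block, bins=edges)[0]

h5f.close()
```

Only one chunk is ever resident, so memory stays bounded regardless of table size, at the cost of scanning the column once.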
From: Jon W. <js...@fn...> - 2012-11-16 17:20:37
|
Hi all, I am trying to find the best way to make histograms from large data sets. Up to now, I've been just loading entire columns into in-memory numpy arrays and making histograms from those. However, I'm currently working on a handful of datasets where this is prohibitively memory intensive (causing an out-of-memory kernel panic on a shared machine that you have to open a ticket to have rebooted makes you a little gun-shy), so I am now exploring other options. I know that the Column object is rather nicely set up to act, in some circumstances, like a numpy ndarray. So my first thought is to try just creating the histogram out of the Column object directly. This is, however, 1000x slower than loading it into memory and creating the histogram from the in-memory array. Please see my test notebook at: http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html For such a small table, loading into memory is not an issue. For larger tables, though, it is a problem, and I had hoped that pytables was optimized so that histogramming directly from disk would proceed no slower than loading into memory and histogramming. Is there some other way of accessing the column (or Array or CArray) data that will make faster histograms? Regards, Jon |
From: Juan M. V. T. <jmv...@gm...> - 2012-11-11 00:39:40
|
Hello, I have to deal in pytables with a very large dataset. The file already compressed with blosc5 is about 5GB. Is it possible to store objects within the same file, each of them containing a reference to a certain search over the dataset? It is like having a large numpy array and a mask of it in the same pytables file. Thank you, Juanma |
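One way to do what Juanma asks, sketched under the assumption that the "search" can be expressed as a Table query; the node names and condition are invented. The row numbers returned by the query are stored as a small array next to the data in the same file and re-used later with readCoordinates().

```python
import tables

h5f = tables.openFile("bigdata.h5", "a")         # hypothetical file
table = h5f.root.dataset                         # hypothetical table

# Save the row numbers matching a search as a node in the same file.
rows = table.getWhereList("pressure > 10")       # hypothetical condition
h5f.createArray("/selections", "high_pressure", rows, createparents=True)

# Later (possibly in another session): reuse the stored selection.
sel = h5f.root.selections.high_pressure.read()
subset = table.readCoordinates(sel)

h5f.close()
```

A boolean mask CArray of the same length as the dataset would work the same way if the selection is dense.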
From: Jim K. <jim...@sp...> - 2012-11-09 21:26:46
|
Thanks for taking the time. Most of our tables are very wide lots of col.... and simple conditions are common.... so that is why in-kernel makes almost no impact for me. -----Original Message----- From: Francesc Alted [mailto:fa...@gm...] Sent: Friday, November 09, 2012 11:27 AM To: pyt...@li... Subject: Re: [Pytables-users] pyTable index from c++ Well, expected performance of in-kernel (numexpr powered) queries wrt regular (python) queries largely depends on where the bottleneck is. If your table has a lot of columns, then the bottleneck is going to be more on the I/O side, so you cannot expect a large difference in performance. However, if your table has a small number of columns, then there is more likelihood that bottleneck is CPU, and your chances to experiment a difference are higher. Of course, having complex queries (i.e. queries that take conditions over several columns, or just combinations of conditions in the same column) makes the query more CPU intensive, and in-kernel normally wins by a comfortable margin. Finally, what indexing is doing is to reduce the number of rows where the conditions have to be evaluated, so depending on the cardinality of the query and the associated index, you can get more or less speedup. Francesc On 11/9/12 5:12 PM, Jim Knoll wrote: > > Thanks for the reply. I will put some investigation of C++ access on > my list for items to look at over the slow holiday season. > > For the short term we will store a C++ ready index as a different > table object in the same h5 file. It will work... just a bit of a waste > on disk space. > > One follow up question > > Why would my performance of > > for row in node.where('stringField == "SomeString"'): > > *not*be noticeably faster than > > for row in node: > > if row.stringField == "SomeString" : > > Specifically when there is no index. I understand and see the speed > improvement only when I have a index. I expected to see some benefit > from numexpr even with no index. I expected node.where() to be much > faster. What I see is identical performance. Is numexpr benefit only > seen for complex math like (floatField ** intField > otherFloatField) > I did not see that to be the case on my first attempt.... Seems that I > only benefit from a index. > > *From:*Anthony Scopatz [mailto:sc...@gm...] > *Sent:* Friday, November 09, 2012 12:24 AM > *To:* Discussion list for PyTables > *Subject:* Re: [Pytables-users] pyTable index from c++ > > On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll > <jim...@sp... <mailto:jim...@sp...>> > wrote: > > I love the index function and promote the internal use of PyTables at > my company. The availability of a indexed method to speed the search > is the main reason why. > > We are a mixed shop using c++ to create H5 (just for the raw speed ... > need to keep up with streaming data) End users start with python > pyTables to consume the data. (Often after we have created indexes > from python pytables.col.col1.createIndex()) > > Sometimes the users come up with something we want to do thousands of > times and performance is critical. But then we are falling back to c++ > We can use our own index method but would like to make dbl use of the > PyTables index. > > I know the python table.where( is implemented in C. > > Hi Jim, > > This is only kind of true. Querying (ie all of the where*() methods) > are actually mostly written in Python in the tables.py and > expressions.py files. However, they make use of numexpr [1]. > > Is there a way to access that from c or c++? 
Don't mind if I need > to do work to get the result I think in my case the work may be > worth it. > > *PLAN 1:* One possibility is that the parts of PyTables are written in > Cython. We could maybe try (without making any edits to these files) > to convert them to Cython. This has the advantage that for Cython > files, if you write the appropriate C++ header file and link against > the shared library correctly, it is possible to access certain > functions from C/C++. BUT, I am not sure how much of speed boost you > would get out of this since you would still be calling out to the > Python interpreter to get these result. You are just calling Python's > virtual machine from C++ rather than calling it from Python (like > normal). This has the advantage that you would basically get access to > these functions acting on tables from C++. > > *PLAN 2:* Alternatively, numexpr itself is mostly written in C++ > already. You should be able to call core numexpr functions directly. > However, you would have to feed it data that you read from the tables > yourself. These could even be table indexes. On a personal note, if > you get code working that does this, I would be interested in seeing > your implementation. (I have another project where I have tables that > I want to query from C++) > > Let us know what route you ultimately end up taking or if you have any > further questions! > > Be Well > > Anthony > > 1. http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr > > ------------------------------------------------------------------------ > > *Jim Knoll** > *Data Developer** > > Spot Trading L.L.C > 440 South LaSalle St., Suite 2800 > Chicago, IL 60605 > Office: 312.362.4550 <tel:312.362.4550> > Direct: 312-362-4798 <tel:312-362-4798> > Fax: 312.362.4551 <tel:312.362.4551> > jim...@sp... <mailto:jim...@sp...> > www.spottradingllc.com <http://www.spottradingllc.com/> > > ------------------------------------------------------------------------ > > The information contained in this message may be privileged and > confidential and protected from disclosure. If the reader of this > message is not the intended recipient, or an employee or agent > responsible for delivering this message to the intended recipient, > you are hereby notified that any dissemination, distribution or > copying of this communication is strictly prohibited. If you have > received this communication in error, please notify us immediately > by replying to the message and deleting it from your computer. > Thank you. Spot Trading, LLC > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted ------------------------------------------------------------------------------ Everyone hates slow websites. So do we. 
Make your web apps faster with AppDynamics Download AppDynamics Lite for free today: http://p.sf.net/sfu/appdyn_d2d_nov _______________________________________________ Pytables-users mailing list Pyt...@li... https://lists.sourceforge.net/lists/listinfo/pytables-users |
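For reference, a small sketch of the query styles being compared in this thread, plus index creation; the file, table, and column names are placeholders.

```python
import tables

h5f = tables.openFile("ticks.h5", "a")           # placeholder file
table = h5f.root.quotes                          # placeholder table

# Plain Python iteration: every row is materialized and tested in Python.
plain = [r['price'] for r in table if r['symbol'] == "SPY"]

# In-kernel (numexpr) query: the condition is evaluated on whole chunks.
# For very wide tables the read itself dominates, so these two can be close.
in_kernel = [r['price'] for r in table.where('symbol == "SPY"')]

# With an index on the queried column, where() can skip most of the table.
table.cols.symbol.createIndex()
indexed = [r['price'] for r in table.where('symbol == "SPY"')]

h5f.close()
```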
From: Francesc A. <fa...@gm...> - 2012-11-09 17:27:14
|
Well, expected performance of in-kernel (numexpr powered) queries wrt regular (python) queries largely depends on where the bottleneck is. If your table has a lot of columns, then the bottleneck is going to be more on the I/O side, so you cannot expect a large difference in performance. However, if your table has a small number of columns, then there is more likelihood that bottleneck is CPU, and your chances to experiment a difference are higher. Of course, having complex queries (i.e. queries that take conditions over several columns, or just combinations of conditions in the same column) makes the query more CPU intensive, and in-kernel normally wins by a comfortable margin. Finally, what indexing is doing is to reduce the number of rows where the conditions have to be evaluated, so depending on the cardinality of the query and the associated index, you can get more or less speedup. Francesc On 11/9/12 5:12 PM, Jim Knoll wrote: > > Thanks for the reply. I will put some investigation of C++ access on > my list for items to look at over the slow holiday season. > > For the short term we will store a C++ ready index as a different > table object in the same h5 file. It will work… just a bit of a waste > on disk space. > > One follow up question > > Why would my performance of > > for row in node.where('stringField == "SomeString"'): > > *not*be noticeably faster than > > for row in node: > > if row.stringField == "SomeString" : > > Specifically when there is no index. I understand and see the speed > improvement only when I have a index. I expected to see some benefit > from numexpr even with no index. I expected node.where() to be much > faster. What I see is identical performance. Is numexpr benefit only > seen for complex math like (floatField ** intField > otherFloatField) > I did not see that to be the case on my first attempt…. Seems that I > only benefit from a index. > > *From:*Anthony Scopatz [mailto:sc...@gm...] > *Sent:* Friday, November 09, 2012 12:24 AM > *To:* Discussion list for PyTables > *Subject:* Re: [Pytables-users] pyTable index from c++ > > On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll > <jim...@sp... <mailto:jim...@sp...>> > wrote: > > I love the index function and promote the internal use of PyTables at > my company. The availability of a indexed method to speed the search > is the main reason why. > > We are a mixed shop using c++ to create H5 (just for the raw speed … > need to keep up with streaming data) End users start with python > pyTables to consume the data. (Often after we have created indexes > from python pytables.col.col1.createIndex()) > > Sometimes the users come up with something we want to do thousands of > times and performance is critical. But then we are falling back to c++ > We can use our own index method but would like to make dbl use of the > PyTables index. > > I know the python table.where( is implemented in C. > > Hi Jim, > > This is only kind of true. Querying (ie all of the where*() methods) > are actually mostly written in Python in the tables.py and > expressions.py files. However, they make use of numexpr [1]. > > Is there a way to access that from c or c++? Don’t mind if I need > to do work to get the result I think in my case the work may be > worth it. > > *PLAN 1:* One possibility is that the parts of PyTables are written in > Cython. We could maybe try (without making any edits to these files) > to convert them to Cython. 
This has the advantage that for Cython > files, if you write the appropriate C++ header file and link against > the shared library correctly, it is possible to access certain > functions from C/C++. BUT, I am not sure how much of speed boost you > would get out of this since you would still be calling out to the > Python interpreter to get these result. You are just calling Python's > virtual machine from C++ rather than calling it from Python (like > normal). This has the advantage that you would basically get access to > these functions acting on tables from C++. > > *PLAN 2:* Alternatively, numexpr itself is mostly written in C++ > already. You should be able to call core numexpr functions directly. > However, you would have to feed it data that you read from the tables > yourself. These could even be table indexes. On a personal note, if > you get code working that does this, I would be interested in seeing > your implementation. (I have another project where I have tables that > I want to query from C++) > > Let us know what route you ultimately end up taking or if you have any > further questions! > > Be Well > > Anthony > > 1. http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr > > ------------------------------------------------------------------------ > > *Jim Knoll** > *Data Developer** > > Spot Trading L.L.C > 440 South LaSalle St., Suite 2800 > Chicago, IL 60605 > Office: 312.362.4550 <tel:312.362.4550> > Direct: 312-362-4798 <tel:312-362-4798> > Fax: 312.362.4551 <tel:312.362.4551> > jim...@sp... <mailto:jim...@sp...> > www.spottradingllc.com <http://www.spottradingllc.com/> > > ------------------------------------------------------------------------ > > The information contained in this message may be privileged and > confidential and protected from disclosure. If the reader of this > message is not the intended recipient, or an employee or agent > responsible for delivering this message to the intended recipient, > you are hereby notified that any dissemination, distribution or > copying of this communication is strictly prohibited. If you have > received this communication in error, please notify us immediately > by replying to the message and deleting it from your computer. > Thank you. Spot Trading, LLC > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > <mailto:Pyt...@li...> > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users -- Francesc Alted |
From: Jim K. <jim...@sp...> - 2012-11-09 16:12:56
|
Thanks for the reply. I will put some investigation of C++ access on my list for items to look at over the slow holiday season. For the short term we will store a C++ ready index as a different table object in the same h5 file. It will work... just a bit of a waste on disk space. One follow up question Why would my performance of for row in node.where('stringField == "SomeString"'): not be noticeably faster than for row in node: if row.stringField == "SomeString" : Specifically when there is no index. I understand and see the speed improvement only when I have a index. I expected to see some benefit from numexpr even with no index. I expected node.where() to be much faster. What I see is identical performance. Is numexpr benefit only seen for complex math like (floatField ** intField > otherFloatField) I did not see that to be the case on my first attempt.... Seems that I only benefit from a index. From: Anthony Scopatz [mailto:sc...@gm...] Sent: Friday, November 09, 2012 12:24 AM To: Discussion list for PyTables Subject: Re: [Pytables-users] pyTable index from c++ On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll <jim...@sp...<mailto:jim...@sp...>> wrote: I love the index function and promote the internal use of PyTables at my company. The availability of a indexed method to speed the search is the main reason why. We are a mixed shop using c++ to create H5 (just for the raw speed ... need to keep up with streaming data) End users start with python pyTables to consume the data. (Often after we have created indexes from python pytables.col.col1.createIndex()) Sometimes the users come up with something we want to do thousands of times and performance is critical. But then we are falling back to c++ We can use our own index method but would like to make dbl use of the PyTables index. I know the python table.where( is implemented in C. Hi Jim, This is only kind of true. Querying (ie all of the where*() methods) are actually mostly written in Python in the tables.py and expressions.py files. However, they make use of numexpr [1]. Is there a way to access that from c or c++? Don't mind if I need to do work to get the result I think in my case the work may be worth it. PLAN 1: One possibility is that the parts of PyTables are written in Cython. We could maybe try (without making any edits to these files) to convert them to Cython. This has the advantage that for Cython files, if you write the appropriate C++ header file and link against the shared library correctly, it is possible to access certain functions from C/C++. BUT, I am not sure how much of speed boost you would get out of this since you would still be calling out to the Python interpreter to get these result. You are just calling Python's virtual machine from C++ rather than calling it from Python (like normal). This has the advantage that you would basically get access to these functions acting on tables from C++. PLAN 2: Alternatively, numexpr itself is mostly written in C++ already. You should be able to call core numexpr functions directly. However, you would have to feed it data that you read from the tables yourself. These could even be table indexes. On a personal note, if you get code working that does this, I would be interested in seeing your implementation. (I have another project where I have tables that I want to query from C++) Let us know what route you ultimately end up taking or if you have any further questions! Be Well Anthony 1. 
http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr ________________________________ Jim Knoll Data Developer Spot Trading L.L.C 440 South LaSalle St., Suite 2800 Chicago, IL 60605 Office: 312.362.4550<tel:312.362.4550> Direct: 312-362-4798<tel:312-362-4798> Fax: 312.362.4551<tel:312.362.4551> jim...@sp...<mailto:jim...@sp...> www.spottradingllc.com<http://www.spottradingllc.com/> ________________________________ The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. Spot Trading, LLC ------------------------------------------------------------------------------ Everyone hates slow websites. So do we. Make your web apps faster with AppDynamics Download AppDynamics Lite for free today: http://p.sf.net/sfu/appdyn_d2d_nov _______________________________________________ Pytables-users mailing list Pyt...@li...<mailto:Pyt...@li...> https://lists.sourceforge.net/lists/listinfo/pytables-users |
From: Anthony S. <sc...@gm...> - 2012-11-09 06:24:36
|
On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll <jim...@sp...>wrote: > I love the index function and promote the internal use of PyTables at > my company. The availability of a indexed method to speed the search is > the main reason why.**** > > ** ** > > We are a mixed shop using c++ to create H5 (just for the raw speed … need > to keep up with streaming data) End users start with python pyTables to > consume the data. (Often after we have created indexes from python > pytables.col.col1.createIndex()) **** > > ** ** > > Sometimes the users come up with something we want to do thousands of > times and performance is critical. But then we are falling back to c++ We > can use our own index method but would like to make dbl use of the PyTables > index. **** > > ** ** > > I know the python table.where( is implemented in C. > Hi Jim, This is only kind of true. Querying (ie all of the where*() methods) are actually mostly written in Python in the tables.py and expressions.py files. However, they make use of numexpr [1]. > **** > > ** Is there a way to access that from c or c++? Don’t mind if I need > to do work to get the result I think in my case the work may be worth it. > *PLAN 1:* One possibility is that the parts of PyTables are written in Cython. We could maybe try (without making any edits to these files) to convert them to Cython. This has the advantage that for Cython files, if you write the appropriate C++ header file and link against the shared library correctly, it is possible to access certain functions from C/C++. BUT, I am not sure how much of speed boost you would get out of this since you would still be calling out to the Python interpreter to get these result. You are just calling Python's virtual machine from C++ rather than calling it from Python (like normal). This has the advantage that you would basically get access to these functions acting on tables from C++. *PLAN 2:* Alternatively, numexpr itself is mostly written in C++ already. You should be able to call core numexpr functions directly. However, you would have to feed it data that you read from the tables yourself. These could even be table indexes. On a personal note, if you get code working that does this, I would be interested in seeing your implementation. (I have another project where I have tables that I want to query from C++) Let us know what route you ultimately end up taking or if you have any further questions! Be Well Anthony 1. http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr > > > ------------------------------ > > * Jim Knoll* * > **Data Developer* > > Spot Trading L.L.C > 440 South LaSalle St., Suite 2800 > Chicago, IL 60605 > Office: 312.362.4550 > Direct: 312-362-4798 > Fax: 312.362.4551 > jim...@sp... > www.spottradingllc.com > ------------------------------ > > The information contained in this message may be privileged and > confidential and protected from disclosure. If the reader of this message > is not the intended recipient, or an employee or agent responsible for > delivering this message to the intended recipient, you are hereby notified > that any dissemination, distribution or copying of this communication is > strictly prohibited. If you have received this communication in error, > please notify us immediately by replying to the message and deleting it > from your computer. Thank you. Spot Trading, LLC > > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. 
> Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
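A rough Python-side illustration of the "PLAN 2" idea above: read a block out of the table yourself and hand it straight to numexpr. A C++ caller would do the equivalent against the numexpr core. File and column names here are invented.

```python
import numexpr as ne
import tables

h5f = tables.openFile("ticks.h5", "r")           # invented file
table = h5f.root.quotes                          # invented table

chunk = table.read(0, 1000000)                   # structured numpy array
mask = ne.evaluate('price > 100.0',
                   local_dict={'price': chunk['price']})
selected = chunk[mask]

h5f.close()
```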
From: Anthony S. <sc...@gm...> - 2012-11-09 05:57:24
|
Hello Jim, The major hurdle here is exposing 7Zip to HDF5. Luckily it appears as if this may have been taken care of for you by the HDF-group already [1]. You should google around to see what has already been done and how hard it is to install. The next step is to expose this as a compression option for filters [2]. I am fairly certain that this is just a matter of adding a simple flag and making sure 7Zip works if available. This should not be too difficult at all and we would happily consider/review any pull request that implemented this. Barring any major concerns, I feel that it would likely be accepted. Be Well Anthony 1. http://www.hdfgroup.org/ftp/HDF5/releases/hdf5-1.6/hdf5-1.6.7/src/unpacked/release_docs/INSTALL_Windows_From_Command_Line.txt 2. http://pytables.github.com/usersguide/libref/helper_classes.html#the-filters-class On Thu, Nov 8, 2012 at 9:52 PM, Jim Knoll <jim...@sp...>wrote: > I would like to squeeze out as much compression as I can get. I do not > mind spending time on the front end as long as I do not kill my read > performance. Seems like 7Zip is well suited to my data. Is it possible to > have 7Zip used as the native internal compression for a pytable?**** > > ** ** > > If not now hard would it be to add this option?**** > > > ------------------------------ > > * Jim Knoll* * > **Data Developer* > > Spot Trading L.L.C > 440 South LaSalle St., Suite 2800 > Chicago, IL 60605 > Office: 312.362.4550 > Direct: 312-362-4798 > Fax: 312.362.4551 > jim...@sp... > www.spottradingllc.com > ------------------------------ > > The information contained in this message may be privileged and > confidential and protected from disclosure. If the reader of this message > is not the intended recipient, or an employee or agent responsible for > delivering this message to the intended recipient, you are hereby notified > that any dissemination, distribution or copying of this communication is > strictly prohibited. If you have received this communication in error, > please notify us immediately by replying to the message and deleting it > from your computer. Thank you. Spot Trading, LLC > > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
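For comparison, the compressors that can already be selected through the Filters class Anthony links to; an LZMA/7-Zip filter would have to be registered with HDF5 and exposed as an extra complib, as he describes. Sketch only, with invented node names.

```python
import tables

# complib can be 'zlib', 'lzo', 'bzip2' or 'blosc'; higher complevel
# trades write time for size.
filters = tables.Filters(complevel=9, complib='zlib', shuffle=True)

h5f = tables.openFile("compressed.h5", "w")
carr = h5f.createCArray("/", "data", tables.Float64Atom(),
                        shape=(1000, 1000), filters=filters)
h5f.close()
```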
From: Jim K. <jim...@sp...> - 2012-11-09 04:19:40
|
I love the index function and promote the internal use of PyTables at my company. The availability of a indexed method to speed the search is the main reason why. We are a mixed shop using c++ to create H5 (just for the raw speed … need to keep up with streaming data) End users start with python pyTables to consume the data. (Often after we have created indexes from python pytables.col.col1.createIndex()) Sometimes the users come up with something we want to do thousands of times and performance is critical. But then we are falling back to c++ We can use our own index method but would like to make dbl use of the PyTables index. I know the python table.where( is implemented in C. Is there a way to access that from c or c++? Don’t mind if I need to do work to get the result I think in my case the work may be worth it. ________________________________ Jim Knoll Data Developer Spot Trading L.L.C 440 South LaSalle St., Suite 2800 Chicago, IL 60605 Office: 312.362.4550 Direct: 312-362-4798 Fax: 312.362.4551 jim...@sp... www.spottradingllc.com<http://www.spottradingllc.com/> ________________________________ The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. Spot Trading, LLC |
From: Jim K. <jim...@sp...> - 2012-11-09 04:04:37
|
I would like to squeeze out as much compression as I can get. I do not mind spending time on the front end as long as I do not kill my read performance. Seems like 7Zip is well suited to my data. Is it possible to have 7Zip used as the native internal compression for a pytable? If not, how hard would it be to add this option? ________________________________ Jim Knoll Data Developer Spot Trading L.L.C 440 South LaSalle St., Suite 2800 Chicago, IL 60605 Office: 312.362.4550 Direct: 312-362-4798 Fax: 312.362.4551 jim...@sp... www.spottradingllc.com<http://www.spottradingllc.com/> ________________________________ The information contained in this message may be privileged and confidential and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us immediately by replying to the message and deleting it from your computer. Thank you. Spot Trading, LLC |
From: Aquil H. A. <aqu...@gm...> - 2012-11-08 16:58:21
|
Thanks Anthony, This also did the trick: import tables h5f_in = tables.openFile('CO.h5') tbl_in = h5f_in.root.CO.DATA h5f_out = tables.openFile('test.h5', 'w') g = h5f_out.createGroup('/','CO') ot = tbl_in.copy(newparent=g) -- Aquil H. Abdullah "I never think of the future. It comes soon enough" - Albert Einstein On Thursday, November 8, 2012 at 11:19 AM, Anthony Scopatz wrote: > Hey Aquil, > > I think File.copyNode() [1] with the newparent argument as group on another file will do what you want. > > Be Well > Anthony > > 1. http://pytables.github.com/usersguide/libref/file_class.html?highlight=copy#tables.File.copyNode > > > On Thu, Nov 8, 2012 at 10:02 AM, Aquil H. Abdullah <aqu...@gm... (mailto:aqu...@gm...)> wrote: > > I create the tables in an HDF5 file from three different python processes. I needed to modify one of the processes, but not the others. Is there an easy way to copy the two tables that did not change to the new file? > > > > -- > > Aquil H. Abdullah > > "I never think of the future. It comes soon enough" - Albert Einstein > > > > > > ------------------------------------------------------------------------------ > > Everyone hates slow websites. So do we. > > Make your web apps faster with AppDynamics > > Download AppDynamics Lite for free today: > > http://p.sf.net/sfu/appdyn_d2d_nov > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... (mailto:Pyt...@li...) > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... (mailto:Pyt...@li...) > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Anthony S. <sc...@gm...> - 2012-11-08 16:19:58
|
Hey Aquil, I think File.copyNode() [1] with the newparent argument as group on another file will do what you want. Be Well Anthony 1. http://pytables.github.com/usersguide/libref/file_class.html?highlight=copy#tables.File.copyNode On Thu, Nov 8, 2012 at 10:02 AM, Aquil H. Abdullah <aqu...@gm... > wrote: > I create the tables in an HDF5 file from three different python processes. > I needed to modify one of the processes, but not the others. Is there an > easy way to copy the two tables that did not change to the new file? > > -- > Aquil H. Abdullah > "I never think of the future. It comes soon enough" - Albert Einstein > > > > ------------------------------------------------------------------------------ > Everyone hates slow websites. So do we. > Make your web apps faster with AppDynamics > Download AppDynamics Lite for free today: > http://p.sf.net/sfu/appdyn_d2d_nov > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Aquil H. A. <aqu...@gm...> - 2012-11-08 16:02:36
|
I create the tables in an HDF5 file from three different python processes. I needed to modify one of the processes, but not the others. Is there an easy way to copy the two tables that did not change to the new file? -- Aquil H. Abdullah "I never think of the future. It comes soon enough" - Albert Einstein |
From: Owen M. <owe...@bc...> - 2012-11-03 16:04:25
|
If you're reading the data out of the file from inside a generator (ie - if load_items returns a generator that accesses the HDF5 file) then as the worker processes consume the work items the file is actually being opened and read from a worker thread in the master process. Regards, Owen On 2 November 2012 21:49, Ben Elliston <bj...@ai...> wrote: > Hi Francesc > > > Hmm, now that I think, Blosc is not thread safe, and that can bring > > these sort of problems if you use it from several threads (but it > > should be safe when using several *processes*). > > I am using multiprocessing.Pool, like so: > > if __name__ == '__main__': > pool = Pool(processes=2) # start 2 worker processes > items = load_items () > pool.map (process_items, items) > > Ben > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.10 (GNU/Linux) > > iQIVAwUBUJQx0+ZqTimv57y9AQigFQ//XWpDoIve2PDag4SG/JBu6Y4D5X8pZcfA > froFDmju5dGtlCaRKUc/puioFRukyD2s5QCV9hvYIGQ2EYkts1eKLQnYRvcV/37D > J6QVEGUcuqfdj6lnGZSiDHr24rCeT3oGozbYO0/6casyV4iIuRjOzghWgKjV7ko1 > N/dy2UGK0S1S2Ws/OnkzDlbXiZShHfjLw3au2TtCdXaPcA4X1aMs1qLzEzvd+gJb > MLHy3MVtGCxjtnB3Vzi/2UmgMfB6hFuGugD2Yp2i0SxFIlTS7cIXYB2beL+x+y4Q > MtcZZO7QOvTvoExRje0BWR0e0BAOumKACKiU7Uq8L0+rwT6NWPfOVfe6KhieHvi/ > bh8oiNl2tekB+UE5JQ6Yi13YwfReyA1M8RFRsrQ3fXCWaQ6+Hx3m8+t/q4bfDhy4 > wFrC4N3hIkqMNI589aju8vVWerSdKDrzqFjcLBp8zfY718qm0ulGz0bBsKNPzr8y > 7AN+/5StSos29oBTv3p3s7YDLjXoDi4XwZPR0aMt9pO2JrPxVUda+oi8/1AT46RS > QoyFGmkqotucR22rWXbv9wZdHm9SsxGZakNpeQYxN9uf2aaqP0eodbgmSPx1ST8P > 8vU+uGqrCb564KW5eJmQCgNZMXo1uCAqCdF5LZT+h1ncWy9//bv9gW2mityDNs6j > YOqKMSOj9e0= > =ZvxS > -----END PGP SIGNATURE----- > > > ------------------------------------------------------------------------------ > LogMeIn Central: Instant, anywhere, Remote PC access and management. > Stay in control, update software, and manage PCs from one command center > Diagnose problems and improve visibility into emerging IT issues > Automate, monitor and manage. Do more in less time with Central > http://p.sf.net/sfu/logmein12331_d2d > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > |
From: Francesc A. <fa...@gm...> - 2012-11-02 21:30:10
|
On 11/2/12 5:19 PM, Ben Elliston wrote: > On Fri, Nov 02, 2012 at 04:56:55PM -0400, Francesc Alted wrote: > >> Hmm, that's strange. Using lzo or zlib works for you? > Well, it seems that switching compression algorithms could be a > nightmare (or can I do this with ptrepack?). Yes, ptrepack can do that very easily. > However, I may have a > workaround: I now open the HDF5 file with tables.openFile at the start > of each process rather than inherit the file descriptor from the > parent. That works, since it's just concurrent file I/O on the same > read-only file, and the start-up overhead is acceptable in this case. Mmh, I think that makes sense. I think the problem before was that you was sharing the same file description with different processes, and hence you ended with sync problems. Having different descriptors for different processes is definitely the way to go. > > Happy to try lzo or zlib, though, if you like. Provided the above, I don't think you need to (I mean, I'd say that lzo and zlib would have exactly the same problem). -- Francesc Alted |
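The pattern Ben and Francesc converge on above, sketched in its simplest form (one open per task, read-only); the file name, node, and slicing are placeholders.

```python
import tables
from multiprocessing import Pool

FILENAME = "results.h5"                          # placeholder

def process_item(i):
    # Each task opens its own descriptor instead of inheriting one
    # from the parent process.
    h5f = tables.openFile(FILENAME, "r")
    block = h5f.root.bigcarray[i * 1000:(i + 1) * 1000]
    h5f.close()
    return block.sum()

if __name__ == '__main__':
    pool = Pool(processes=2)
    totals = pool.map(process_item, range(10))
```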
From: Ben E. <bj...@ai...> - 2012-11-02 21:19:50
|
On Fri, Nov 02, 2012 at 04:56:55PM -0400, Francesc Alted wrote: > Hmm, that's strange. Using lzo or zlib works for you? Well, it seems that switching compression algorithms could be a nightmare (or can I do this with ptrepack?). However, I may have a workaround: I now open the HDF5 file with tables.openFile at the start of each process rather than inherit the file descriptor from the parent. That works, since it's just concurrent file I/O on the same read-only file, and the start-up overhead is acceptable in this case. Happy to try lzo or zlib, though, if you like. Cheers, Ben |
From: Francesc A. <fa...@gm...> - 2012-11-02 20:57:03
|
On 11/2/12 4:49 PM, Ben Elliston wrote: > Hi Francesc > >> Hmm, now that I think, Blosc is not thread safe, and that can bring >> these sort of problems if you use it from several threads (but it >> should be safe when using several *processes*). > I am using multiprocessing.Pool, like so: > > if __name__ == '__main__': > pool = Pool(processes=2) # start 2 worker processes > items = load_items () > pool.map (process_items, items) > Hmm, that's strange. Using lzo or zlib works for you? -- Francesc Alted |
From: Ben E. <bj...@ai...> - 2012-11-02 20:49:35
|
Hi Francesc > Hmm, now that I think, Blosc is not thread safe, and that can bring > these sort of problems if you use it from several threads (but it > should be safe when using several *processes*). I am using multiprocessing.Pool, like so: if __name__ == '__main__': pool = Pool(processes=2) # start 2 worker processes items = load_items () pool.map (process_items, items) Ben |
From: Francesc A. <fa...@gm...> - 2012-11-02 20:41:26
|
On 11/2/12 4:22 PM, Ben Elliston wrote: > My reading of the PyTables FAQ is that concurrent read access should > be safe with PyTables. However, when using a pool of worker processes > to read different parts of a large blosc-compressed CArray, I see: > > HDF5-DIAG: Error detected in HDF5 (1.8.8) thread 140476163647232: > #000: ../../../src/H5Dio.c line 174 in H5Dread(): can't read data > major: Dataset > minor: Read failed > #001: ../../../src/H5Dio.c line 448 in H5D_read(): can't read data > major: Dataset > minor: Read failed > etc. Hmm, now that I think, Blosc is not thread safe, and that can bring these sort of problems if you use it from several threads (but it should be safe when using several *processes*). In case your worker processes are threads, then it might help to deactivate threading in Blosc by setting the MAX_BLOSC_THREADS parameter: http://pytables.github.com/usersguide/parameter_files.html?#tables.parameters.MAX_BLOSC_THREADS to 1. HTH, -- Francesc Alted |
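The parameter Francesc points to can be changed globally, or (assuming a recent 2.x release) passed as a keyword override when opening the file; a sketch:

```python
import tables

# Globally, before any file is opened...
tables.parameters.MAX_BLOSC_THREADS = 1

# ...or just for one file, as a keyword override to openFile().
h5f = tables.openFile("data.h5", "r", MAX_BLOSC_THREADS=1)
```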
From: Ben E. <bj...@ai...> - 2012-11-02 20:22:42
|
My reading of the PyTables FAQ is that concurrent read access should be safe with PyTables. However, when using a pool of worker processes to read different parts of a large blosc-compressed CArray, I see: HDF5-DIAG: Error detected in HDF5 (1.8.8) thread 140476163647232: #000: ../../../src/H5Dio.c line 174 in H5Dread(): can't read data major: Dataset minor: Read failed #001: ../../../src/H5Dio.c line 448 in H5D_read(): can't read data major: Dataset minor: Read failed etc. Am I misunderstanding something? Thanks, Ben |
From: Francesc A. <fa...@gm...> - 2012-11-02 01:25:28
|
On 11/1/12 9:02 PM, Ben Elliston wrote: > Hi all. > > I have a very large CArray that I need to extend (of course, you can't > extend a CArray). I want to do this in the least operation intensive > way I can. > > What's the easiest way? Create a new CArray of the right size, copy > the data into it and delete the old one? Yes, this is the best approach. > I understand that deleting > the old one will not reclaim any space in the database -- is that > right? Yes, that's correct. For claiming the old space you should 'repack' the file either using ptrepack PyTable's own tool or the HDF5 native tool called h5repack. HTH, -- Francesc Alted |
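A sketch of the copy-and-replace approach described above (array names invented); for really big arrays the copy should go chunk by chunk, and an EArray avoids the problem entirely if the growth axis is known in advance.

```python
import tables

h5f = tables.openFile("big.h5", "a")
old = h5f.root.data                              # invented node name

new = h5f.createCArray("/", "data_tmp", old.atom,
                       shape=(old.shape[0] + 100000,) + old.shape[1:],
                       filters=old.filters)
new[:old.shape[0]] = old[:]                      # copy in chunks for huge arrays
old.remove()
new.rename("data")
h5f.close()

# Reclaim the space left behind by the removed node:
#   ptrepack big.h5:/ big_packed.h5:/
```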
From: Ben E. <bj...@ai...> - 2012-11-02 01:20:34
|
Hi all. I have a very large CArray that I need to extend (of course, you can't extend a CArray). I want to do this in the least operation intensive way I can. What's the easiest way? Create a new CArray of the right size, copy the data into it and delete the old one? I understand that deleting the old one will not reclaim any space in the database -- is that right? Thanks, Ben |
From: Andrea G. <and...@gm...> - 2012-10-31 20:59:31
|
Hi Francesc and All, On 31 October 2012 21:02, Francesc Alted wrote: > On 10/31/12 10:12 AM, Andrea Gavana wrote: >> Hi Francesc & All, >> >> On 31 October 2012 14:13, Francesc Alted wrote: >>> On 10/31/12 4:30 AM, Andrea Gavana wrote: >>>> Thank you for all your suggestions. I managed to slightly modify the >>>> script you attached and I am also experimenting with compression. >>>> However, in the newly attached script the underlying table is not >>>> modified, i.e., this assignment: >>>> >>>> for p in table: >>>> p['results'][:NUM_SIM, :, :] = numpy.random.random(size=(NUM_SIM, >>>> len(ALL_DATES), 7)) >>>> table.flush() >>> For modifying row values you need to assign a complete row object. >>> Something like: >>> >>> for i in range(len(table)): >>> myrow = table[i] >>> myrow['results'][:NUM_SIM, :, :] = >>> numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7)) >>> table[i] = myrow >>> >>> You may also use Table.modifyColumn() for better efficiency. Look at >>> the different modification methods here: >>> >>> http://pytables.github.com/usersguide/libref/structured_storage.html#table-methods-writing >>> >>> and experiment with them. >> Thank you, I have tried different approaches and they all seem to run >> more or less at the same speed (see below). I had to slightly modify >> your code from: >> >> table[i] = myrow >> >> to >> >> table[i] = [myrow] >> >> To avoid exceptions. >> >> In the newly attached file, I switched to blosc for compression (but >> with compression level 1) and run a few sensitivities. By calling the >> attached script as: >> >> python pytables_test.py NUM_SIM >> >> where "NUM_SIM" is an integer, I get the following timings and file sizes: >> >> C:\MyProjects\Phaser\tests>python pytables_test.py 10 >> Number of simulations : 10 >> H5 file creation time : 0.879s >> Saving results for table: 6.413s >> H5 file size (MB) : 193 >> >> >> C:\MyProjects\Phaser\tests>python pytables_test.py 100 >> Number of simulations : 100 >> H5 file creation time : 4.155s >> Saving results for table: 86.326s >> H5 file size (MB) : 1935 >> >> >> I dont think I will try the 1,000 simulations case :-) . I believe I >> still don't understand what the best strategy would be for my problem. >> I basically need to save all the simulation results for all the 1,200 >> "objects", each of which has a timeseries matrix of 600x7 size. In the >> GUI I have, these 1,200 "objects" are grouped into multiple >> categories, and multiple categories can reference the same "object", >> i.e.: >> >> Category_1: object_1, object_23, object_543, etc... >> Category_2: object_23, object_100, object_543, etc... >> >> So my idea was to save all the "objects" results to disk and, upon the >> user's choice, build the categories results "on the fly", i.e. by >> seeking the H5 file on disk for the "objects" belonging to that >> specific category and summing up all their results (over time, i.e. >> the 600 time-steps). Maybe I would be better off with a 4D array >> (NUM_OBJECTS, NUM_SIM, TSTEPS, 7) as a table, but then I will lose the >> ability to reference the "objects" by their names... > > You should keep trying experimenting with different approaches and > discover the one that works for you the best. Regarding using the 4D > array as a table, I might be misunderstanding your problem, but you can > still reference objects by name by using: > > row = table.where("name == %s" % my_name) > table[row.nrow] = ... > > You may want to index the 'name' column for better performance. 
I did spend quite some time experimenting today (actually almost the whole day), but even the task of writing a 4D array (created via createCArray) to disk is somehow overwhelming from a GUI point of view. My 4D array is a 1,200x100x600x7, and on my PC at work (16 cores, 96 GB RAM, 3.0 GHz Windows Vista 64bit with PyTables pro) it takes close to 80 seconds to populate it with random arrays and save it to disk. This is almost a lifetime in the GUI world, and 100 simulation is possibly the simplest case I have. As I said before, I am probably completely missing the point; the fact that my script seems to be "un-improvable" in terms of speed is quite a demonstration of it. But given the constraints (in terms of time and GUI responsiveness) we have I will probably go back to my previous approach of saving the simulation results for the higher level categories only, discarding the ones for the 1,200 "objects". I love the idea of being able to seek results in real-time via a HDF5 file on my drive and having all the simulation results readily available, but the actual "saving" time is somehow a showstopper. Andrea. "Imagination Is The Only Weapon In The War Against Reality." http://www.infinity77.net # ------------------------------------------------------------- # def ask_mailing_list_support(email): if mention_platform_and_version() and include_sample_app(): send_message(email) else: install_malware() erase_hard_drives() # ------------------------------------------------------------- # |
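For what it's worth, a sketch of the 4D-array layout discussed in this thread, writing one object's block at a time so that only a 100x600x7 slab is ever in memory; the shapes match the numbers quoted above, everything else (file name, compression level, fill pattern) is made up for illustration.

```python
import numpy
import tables

NUM_OBJECTS, NUM_SIM, TSTEPS, NRES = 1200, 100, 600, 7

h5f = tables.openFile("simulations.h5", "w")
filters = tables.Filters(complevel=1, complib='blosc')
results = h5f.createCArray("/", "results", tables.Float64Atom(),
                           shape=(NUM_OBJECTS, NUM_SIM, TSTEPS, NRES),
                           filters=filters)

for i in xrange(NUM_OBJECTS):
    # One object's simulated time series written per iteration.
    results[i, :, :, :] = numpy.random.random((NUM_SIM, TSTEPS, NRES))

h5f.flush()
h5f.close()
```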
From: Francesc A. <fa...@gm...> - 2012-10-31 20:10:58
|
On 10/31/12 4:05 PM, Francesc Alted wrote: > On 10/31/12 4:02 PM, Francesc Alted wrote: >> On 10/31/12 10:12 AM, Andrea Gavana wrote: >>> Hi Francesc & All, >>> >>> On 31 October 2012 14:13, Francesc Alted wrote: >>>> On 10/31/12 4:30 AM, Andrea Gavana wrote: >>>>> Thank you for all your suggestions. I managed to slightly modify the >>>>> script you attached and I am also experimenting with compression. >>>>> However, in the newly attached script the underlying table is not >>>>> modified, i.e., this assignment: >>>>> >>>>> for p in table: >>>>> p['results'][:NUM_SIM, :, :] = >>>>> numpy.random.random(size=(NUM_SIM, >>>>> len(ALL_DATES), 7)) >>>>> table.flush() >>>> For modifying row values you need to assign a complete row object. >>>> Something like: >>>> >>>> for i in range(len(table)): >>>> myrow = table[i] >>>> myrow['results'][:NUM_SIM, :, :] = >>>> numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7)) >>>> table[i] = myrow >>>> >>>> You may also use Table.modifyColumn() for better efficiency. Look at >>>> the different modification methods here: >>>> >>>> http://pytables.github.com/usersguide/libref/structured_storage.html#table-methods-writing >>>> >>>> >>>> and experiment with them. >>> Thank you, I have tried different approaches and they all seem to run >>> more or less at the same speed (see below). I had to slightly modify >>> your code from: >>> >>> table[i] = myrow >>> >>> to >>> >>> table[i] = [myrow] >>> >>> To avoid exceptions. >>> >>> In the newly attached file, I switched to blosc for compression (but >>> with compression level 1) and run a few sensitivities. By calling the >>> attached script as: >>> >>> python pytables_test.py NUM_SIM >>> >>> where "NUM_SIM" is an integer, I get the following timings and file >>> sizes: >>> >>> C:\MyProjects\Phaser\tests>python pytables_test.py 10 >>> Number of simulations : 10 >>> H5 file creation time : 0.879s >>> Saving results for table: 6.413s >>> H5 file size (MB) : 193 >>> >>> >>> C:\MyProjects\Phaser\tests>python pytables_test.py 100 >>> Number of simulations : 100 >>> H5 file creation time : 4.155s >>> Saving results for table: 86.326s >>> H5 file size (MB) : 1935 >>> >>> >>> I dont think I will try the 1,000 simulations case :-) . I believe I >>> still don't understand what the best strategy would be for my problem. >>> I basically need to save all the simulation results for all the 1,200 >>> "objects", each of which has a timeseries matrix of 600x7 size. In the >>> GUI I have, these 1,200 "objects" are grouped into multiple >>> categories, and multiple categories can reference the same "object", >>> i.e.: >>> >>> Category_1: object_1, object_23, object_543, etc... >>> Category_2: object_23, object_100, object_543, etc... >>> >>> So my idea was to save all the "objects" results to disk and, upon the >>> user's choice, build the categories results "on the fly", i.e. by >>> seeking the H5 file on disk for the "objects" belonging to that >>> specific category and summing up all their results (over time, i.e. >>> the 600 time-steps). Maybe I would be better off with a 4D array >>> (NUM_OBJECTS, NUM_SIM, TSTEPS, 7) as a table, but then I will lose the >>> ability to reference the "objects" by their names... >> >> You should keep trying experimenting with different approaches and >> discover the one that works for you the best. 
Regarding using the 4D >> array as a table, I might be misunderstanding your problem, but you >> can still reference objects by name by using: >> >> row = table.where("name == %s" % my_name) >> table[row.nrow] = ... > > Uh, I rather meant: > > row = table.readWhere("name == %s" % my_name) > table[row.nrow] = ... > > but you probably got the idea already. > Ups, that does not work either. It is probably something more like: rowid = table.readWhereList("name == %s" % my_name)[0] myrow = table[rowid] table[rowid] = ... (assuming that 'name' is a primary key here, i.e. values are not repeated). -- Francesc Alted |
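Putting Francesc's last suggestion into runnable form; in the PyTables API the lookup method is spelled Table.getWhereList(), and the write-back uses the table[i] = [myrow] form Andrea mentions earlier in the thread. File, table, and field names follow the examples above but are otherwise hypothetical.

```python
import numpy
import tables

h5f = tables.openFile("simulations.h5", "a")     # hypothetical file
table = h5f.root.objects                         # hypothetical table

# Row number of the (unique) object with this name.
rowid = table.getWhereList('name == "object_23"')[0]
myrow = table[rowid]
myrow['results'][:] = numpy.random.random(myrow['results'].shape)
table[rowid] = [myrow]                           # write the modified record back

h5f.close()
```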
From: Francesc A. <fa...@gm...> - 2012-10-31 20:05:38
|
On 10/31/12 4:02 PM, Francesc Alted wrote: > On 10/31/12 10:12 AM, Andrea Gavana wrote: >> Hi Francesc & All, >> >> On 31 October 2012 14:13, Francesc Alted wrote: >>> On 10/31/12 4:30 AM, Andrea Gavana wrote: >>>> Thank you for all your suggestions. I managed to slightly modify the >>>> script you attached and I am also experimenting with compression. >>>> However, in the newly attached script the underlying table is not >>>> modified, i.e., this assignment: >>>> >>>> for p in table: >>>> p['results'][:NUM_SIM, :, :] = >>>> numpy.random.random(size=(NUM_SIM, >>>> len(ALL_DATES), 7)) >>>> table.flush() >>> For modifying row values you need to assign a complete row object. >>> Something like: >>> >>> for i in range(len(table)): >>> myrow = table[i] >>> myrow['results'][:NUM_SIM, :, :] = >>> numpy.random.random(size=(NUM_SIM, len(ALL_DATES), 7)) >>> table[i] = myrow >>> >>> You may also use Table.modifyColumn() for better efficiency. Look at >>> the different modification methods here: >>> >>> http://pytables.github.com/usersguide/libref/structured_storage.html#table-methods-writing >>> >>> >>> and experiment with them. >> Thank you, I have tried different approaches and they all seem to run >> more or less at the same speed (see below). I had to slightly modify >> your code from: >> >> table[i] = myrow >> >> to >> >> table[i] = [myrow] >> >> To avoid exceptions. >> >> In the newly attached file, I switched to blosc for compression (but >> with compression level 1) and run a few sensitivities. By calling the >> attached script as: >> >> python pytables_test.py NUM_SIM >> >> where "NUM_SIM" is an integer, I get the following timings and file >> sizes: >> >> C:\MyProjects\Phaser\tests>python pytables_test.py 10 >> Number of simulations : 10 >> H5 file creation time : 0.879s >> Saving results for table: 6.413s >> H5 file size (MB) : 193 >> >> >> C:\MyProjects\Phaser\tests>python pytables_test.py 100 >> Number of simulations : 100 >> H5 file creation time : 4.155s >> Saving results for table: 86.326s >> H5 file size (MB) : 1935 >> >> >> I dont think I will try the 1,000 simulations case :-) . I believe I >> still don't understand what the best strategy would be for my problem. >> I basically need to save all the simulation results for all the 1,200 >> "objects", each of which has a timeseries matrix of 600x7 size. In the >> GUI I have, these 1,200 "objects" are grouped into multiple >> categories, and multiple categories can reference the same "object", >> i.e.: >> >> Category_1: object_1, object_23, object_543, etc... >> Category_2: object_23, object_100, object_543, etc... >> >> So my idea was to save all the "objects" results to disk and, upon the >> user's choice, build the categories results "on the fly", i.e. by >> seeking the H5 file on disk for the "objects" belonging to that >> specific category and summing up all their results (over time, i.e. >> the 600 time-steps). Maybe I would be better off with a 4D array >> (NUM_OBJECTS, NUM_SIM, TSTEPS, 7) as a table, but then I will lose the >> ability to reference the "objects" by their names... > > You should keep trying experimenting with different approaches and > discover the one that works for you the best. Regarding using the 4D > array as a table, I might be misunderstanding your problem, but you > can still reference objects by name by using: > > row = table.where("name == %s" % my_name) > table[row.nrow] = ... Uh, I rather meant: row = table.readWhere("name == %s" % my_name) table[row.nrow] = ... but you probably got the idea already. 
-- Francesc Alted |