From: Lukas S. <luk...@gm...> - 2012-11-18 08:11:00
On Nov 17, 2012, at 12:46 PM, <pyt...@li...> wrote:

Today's Topics:

   1. Re: pyTable index from c++ (Jim Knoll)
   2. Store a reference to a dataset (Juan Manuel Vázquez Tovar)
   3. Histogramming 1000x too slow (Jon Wilson)
   4. Re: Histogramming 1000x too slow (Anthony Scopatz)
   5. Re: Histogramming 1000x too slow (Jon Wilson)

----------------------------------------------------------------------

Message: 1
Date: Fri, 9 Nov 2012 15:26:38 -0600
From: Jim Knoll <jim...@sp...>
Subject: Re: [Pytables-users] pyTable index from c++

Thanks for taking the time.

Most of our tables are very wide (lots of columns) and simple conditions are common, so that is why in-kernel makes almost no impact for me.

-----Original Message-----
From: Francesc Alted [mailto:fa...@gm...]
Sent: Friday, November 09, 2012 11:27 AM
Subject: Re: [Pytables-users] pyTable index from c++

Well, the expected performance of in-kernel (numexpr-powered) queries versus regular (Python) queries largely depends on where the bottleneck is. If your table has a lot of columns, then the bottleneck is going to be more on the I/O side, so you cannot expect a large difference in performance. However, if your table has a small number of columns, then the bottleneck is more likely to be the CPU, and your chances of seeing a difference are higher.

Of course, complex queries (i.e. queries that take conditions over several columns, or combinations of conditions on the same column) are more CPU-intensive, and in-kernel normally wins by a comfortable margin.

Finally, what indexing does is reduce the number of rows where the conditions have to be evaluated, so depending on the cardinality of the query and the associated index, you can get more or less speedup.

Francesc

On 11/9/12 5:12 PM, Jim Knoll wrote:
> Thanks for the reply. I will put some investigation of C++ access on my list of items to look at over the slow holiday season.
>
> For the short term we will store a C++-ready index as a different table object in the same h5 file. It will work... just a bit of a waste of disk space.
>
> One follow-up question: why would the performance of
>
>     for row in node.where('stringField == "SomeString"'):
>
> *not* be noticeably faster than
>
>     for row in node:
>         if row['stringField'] == "SomeString":
>
> specifically when there is no index? I understand and see the speed improvement only when I have an index. I expected to see some benefit from numexpr even with no index; I expected node.where() to be much faster, but what I see is identical performance. Is the numexpr benefit only seen for complex math like (floatField ** intField > otherFloatField)? I did not see that to be the case on my first attempt... it seems that I only benefit from an index.
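A minimal sketch of the access patterns under discussion, assuming the PyTables 2.x API (the file, table, and column names here are illustrative, not from the thread):

    import tables

    h5 = tables.openFile("data.h5", mode="a")  # hypothetical file
    node = h5.root.mytable                     # hypothetical table

    # Naive iteration: every row is materialized and compared in Python.
    hits = [row['intField'] for row in node
            if row['stringField'] == "SomeString"]

    # In-kernel query: numexpr evaluates the condition chunk by chunk in C,
    # but every chunk is still read from disk, so I/O can dominate.
    hits = [row['intField']
            for row in node.where('stringField == "SomeString"')]

    # Indexed query: once an index exists, the same where() call only
    # visits the chunks the index says can contain matches.
    node.cols.stringField.createIndex()
    hits = [row['intField']
            for row in node.where('stringField == "SomeString"')]

    h5.close()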
-----Original Message-----
From: Anthony Scopatz [mailto:sc...@gm...]
Sent: Friday, November 09, 2012 12:24 AM
To: Discussion list for PyTables
Subject: Re: [Pytables-users] pyTable index from c++

On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll <jim...@sp...> wrote:

> I love the index function and promote the internal use of PyTables at my company. The availability of an indexed method to speed the search is the main reason why.
>
> We are a mixed shop using C++ to create the H5 files (just for the raw speed... we need to keep up with streaming data). End users start with Python and PyTables to consume the data, often after we have created indexes from Python with table.cols.col1.createIndex().
>
> Sometimes the users come up with something we want to do thousands of times, and performance is critical. But then we are falling back to C++. We can use our own index method, but we would like to make double use of the PyTables index.
>
> I know the Python table.where() is implemented in C.

Hi Jim,

This is only kind of true. Querying (i.e. all of the where*() methods) is actually mostly written in Python, in the tables.py and expressions.py files. However, those methods make use of numexpr [1].

> Is there a way to access that from C or C++? I don't mind if I need to do work to get the result; I think in my case the work may be worth it.

*PLAN 1:* One possibility is that, since these parts of PyTables are written in pure Python, we could maybe try (without making any edits to these files) to compile them with Cython. This has the advantage that, for Cython files, if you write the appropriate C++ header and link against the shared library correctly, it is possible to access certain functions from C/C++. BUT I am not sure how much of a speed boost you would get out of this, since you would still be calling out to the Python interpreter to get these results. You would just be calling Python's virtual machine from C++ rather than from Python (like normal). The advantage is that you would basically get access to these functions acting on tables from C++.

*PLAN 2:* Alternatively, numexpr itself is mostly written in C++ already. You should be able to call the core numexpr functions directly. However, you would have to feed them data that you read from the tables yourself. These could even be table indexes. On a personal note, if you get code working that does this, I would be interested in seeing your implementation. (I have another project with tables that I want to query from C++.)

Let us know what route you ultimately end up taking, or if you have any further questions!

Be Well

Anthony

1. http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr

--
Jim Knoll
Data Developer
Spot Trading L.L.C
440 South LaSalle St., Suite 2800
Chicago, IL 60605
Office: 312.362.4550
Direct: 312-362-4798
Fax: 312.362.4551
jim...@sp...
www.spottradingllc.com
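On the Python side, PLAN 2 amounts to something like the following: read column chunks yourself and hand them to numexpr's evaluate(), which compiles and runs the condition in C++. A rough sketch, assuming the PyTables 2.x API (the chunk size, file, and field names are illustrative):

    import numexpr as ne
    import numpy as np
    import tables

    h5 = tables.openFile("data.h5")   # hypothetical file
    table = h5.root.mytable           # hypothetical table

    hits = []
    chunk = 100000
    for start in range(0, table.nrows, chunk):
        stop = min(start + chunk, table.nrows)
        # Read only the columns the condition needs, one chunk at a time.
        a = table.read(start, stop, field='floatField')
        b = table.read(start, stop, field='otherFloatField')
        # numexpr evaluates the whole condition in C++ over the chunk.
        mask = ne.evaluate("a ** 2 > b")
        hits.append(np.flatnonzero(mask) + start)
    hits = np.concatenate(hits)

    h5.close()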
------------------------------

Message: 2
Date: Sun, 11 Nov 2012 01:39:33 +0100
From: Juan Manuel Vázquez Tovar <jmv...@gm...>
Subject: [Pytables-users] Store a reference to a dataset

Hello,

I have to deal in PyTables with a very large dataset. The file, already compressed with blosc5, is about 5 GB. Is it possible to store objects within the same file, each of them containing a reference to a certain search over the dataset? It would be like having a large numpy array and a mask of it in the same PyTables file.

Thank you,

Juanma
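One way to approximate what Juanma asks for with the standard API is to run the search once, then store the matching row coordinates in the same file so they can be reused as a mask. A hedged sketch, assuming the PyTables 2.x API (the file, group, and condition are illustrative):

    import tables

    h5 = tables.openFile("big.h5", mode="a")   # hypothetical file
    table = h5.root.data                       # hypothetical table

    # Run the search once; getWhereList() returns the matching row numbers.
    coords = table.getWhereList('energy > 100.0')   # illustrative condition

    # Store the coordinates next to the data, as a reusable "mask".
    h5.createArray('/masks', 'high_energy', coords,
                   title='energy > 100.0', createparents=True)
    h5.flush()

    # Later, re-read only the selected rows without repeating the search.
    coords = h5.root.masks.high_energy.read()
    subset = table.readCoordinates(coords)

    h5.close()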
------------------------------

Message: 3
Date: Fri, 16 Nov 2012 11:02:25 -0600
From: Jon Wilson <js...@fn...>
Subject: [Pytables-users] Histogramming 1000x too slow

Hi all,

I am trying to find the best way to make histograms from large data sets. Up to now I've been just loading entire columns into in-memory numpy arrays and making histograms from those. However, I'm currently working on a handful of datasets where this is prohibitively memory intensive (causing an out-of-memory kernel panic on a shared machine that you have to open a ticket to have rebooted makes you a little gun-shy), so I am now exploring other options.

I know that the Column object is rather nicely set up to act, in some circumstances, like a numpy ndarray. So my first thought was to try creating the histogram from the Column object directly. This is, however, 1000x slower than loading the column into memory and creating the histogram from the in-memory array. Please see my test notebook at:
http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html

For such a small table, loading into memory is not an issue. For larger tables, though, it is a problem, and I had hoped that PyTables was optimized so that histogramming directly from disk would proceed no slower than loading into memory and histogramming. Is there some other way of accessing the column (or Array or CArray) data that will make faster histograms?

Regards,
Jon

------------------------------

Message: 4
Date: Fri, 16 Nov 2012 17:10:41 -0800
From: Anthony Scopatz <sc...@gm...>
Subject: Re: [Pytables-users] Histogramming 1000x too slow

On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote:

> [...] Is there some other way of accessing the column (or Array or
> CArray) data that will make faster histograms?

Hi Jon,

This is not surprising, since the Column object is iterated over row by row. As you found, reading each row individually is prohibitively expensive compared to reading all the data at once.

To do this the right way for data that is larger than system memory, you need to read it in chunks. Luckily, PyTables already has tools to help you automate this process: I would recommend that you use expressions [1] or queries [2] to do your histogramming more efficiently.

Be Well

Anthony

1. http://pytables.github.com/usersguide/libref/expr_class.html
2. http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
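The chunked read that Anthony describes can also be written out by hand: accumulate a fixed set of bins while reading one column slice at a time, so memory use stays bounded. A minimal sketch, assuming the PyTables 2.x API (the file, field name, and bin range are illustrative):

    import numpy as np
    import tables

    h5 = tables.openFile("data.h5")   # hypothetical file
    table = h5.root.mytable           # hypothetical table

    edges = np.linspace(0.0, 1.0, 101)               # 100 illustrative bins
    counts = np.zeros(len(edges) - 1, dtype=np.int64)

    chunk = 1000000
    for start in range(0, table.nrows, chunk):
        # Only one slice of the column is in memory at any time.
        data = table.read(start, min(start + chunk, table.nrows), field='x')
        counts += np.histogram(data, bins=edges)[0]

    h5.close()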
------------------------------

Message: 5
Date: Fri, 16 Nov 2012 21:33:46 -0600
From: Jon Wilson <js...@fn...>
Subject: Re: [Pytables-users] Histogramming 1000x too slow

Hi Anthony,

I don't think that either of these helps me here (unless I've misunderstood something). I need to fill the histogram with every row in the table, so querying doesn't gain me anything (especially since a query just returns an iterator over rows). I also don't need (at the moment) to compute any function of the column data, just count (weighted) entries into various bins. I suppose I could write one Expr for each bin of my histogram, but that seems dreadfully inefficient and probably difficult to maintain.

It is a reduction operation, and would greatly benefit from chunking, I expect. Not unlike sum(), which is implemented as a specially supported reduction operation inside numexpr (buggily, last I checked). I suspect that a substantial improvement in histogramming requires direct support from either PyTables or numexpr. I don't suppose there might be a chunked-reduction interface exposed somewhere that I could hook into?

Jon

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
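As far as the thread establishes, neither PyTables nor numexpr exposes the chunked-reduction hook Jon asks about, but the pattern is straightforward to wrap yourself. A hypothetical helper, invented here for illustration (not part of any library):

    import numpy as np
    import tables

    def chunked_reduce(table, field, func, init, chunk=1000000):
        """Fold func(acc, block) over one column, one chunk at a time.

        Hypothetical helper, not part of PyTables: func combines the
        running accumulator with a numpy array and returns the new one.
        """
        acc = init
        for start in range(0, table.nrows, chunk):
            block = table.read(start, min(start + chunk, table.nrows),
                               field=field)
            acc = func(acc, block)
        return acc

    # Usage: a histogram expressed as a fold, with illustrative bin edges.
    edges = np.linspace(0.0, 1.0, 101)
    h5 = tables.openFile("data.h5")   # hypothetical file
    counts = chunked_reduce(
        h5.root.mytable, 'x',
        lambda acc, block: acc + np.histogram(block, bins=edges)[0],
        np.zeros(len(edges) - 1, dtype=np.int64))
    h5.close()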
Luckily there are tools to help you > >automate > >this process already in PyTables. I would recommend that you use > >expressions [1] or queries [2] to do your historgramming more > >efficiently. > > > >Be Well > >Anthony > > > >1. http://pytables.github.com/usersguide/libref/expr_class.html > >2. > > > http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying > > > > > > > >> Regards, > >> Jon > >> > >> > >> > > >------------------------------------------------------------------------------ > >> Monitor your physical, virtual and cloud infrastructure from a single > >> web console. Get in-depth insight into apps, servers, databases, > >vmware, > >> SAP, cloud infrastructure, etc. Download 30-day Free Trial. > >> Pricing starts from $795 for 25 servers or applications! > >> http://p.sf.net/sfu/zoho_dev2dev_nov > >> _______________________________________________ > >> Pytables-users mailing list > >> Pyt...@li... > >> https://lists.sourceforge.net/lists/listinfo/pytables-users > >> > > > > > >------------------------------------------------------------------------ > > > > >------------------------------------------------------------------------------ > >Monitor your physical, virtual and cloud infrastructure from a single > >web console. Get in-depth insight into apps, servers, databases, > >vmware, > >SAP, cloud infrastructure, etc. Download 30-day Free Trial. > >Pricing starts from $795 for 25 servers or applications! > >http://p.sf.net/sfu/zoho_dev2dev_nov > > > >------------------------------------------------------------------------ > > > >_______________________________________________ > >Pytables-users mailing list > >Pyt...@li... > >https://lists.sourceforge.net/lists/listinfo/pytables-users > > -- > Sent from my Android phone with K-9 Mail. Please excuse my brevity. > -------------- next part -------------- > An HTML attachment was scrubbed... > > ------------------------------ > > > ------------------------------------------------------------------------------ > Monitor your physical, virtual and cloud infrastructure from a single > web console. Get in-depth insight into apps, servers, databases, vmware, > SAP, cloud infrastructure, etc. Download 30-day Free Trial. > Pricing starts from $795 for 25 servers or applications! > http://p.sf.net/sfu/zoho_dev2dev_nov > > ------------------------------ > > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > End of Pytables-users Digest, Vol 78, Issue 6 > ********************************************* > |