Re: [Pytables-users] Pytables-users Digest, Vol 78, Issue 6

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

2012. 11. 17. 오후 12:46에 <pyt...@li...>님이 작성:

> Send Pytables-users mailing list submissions to
>         pyt...@li...
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         https://lists.sourceforge.net/lists/listinfo/pytables-users
> or, via email, send a message with subject or body 'help' to
>         pyt...@li...
>
> You can reach the person managing the list at
>         pyt...@li...
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Pytables-users digest..."
>
>
> Today's Topics:
>
>    1. Re: pyTable index from c++ (Jim Knoll)
>    2. Store a reference to a dataset (Juan Manuel V?zquez Tovar)
>    3. Histogramming 1000x too slow (Jon Wilson)
>    4. Re: Histogramming 1000x too slow (Anthony Scopatz)
>    5. Re: Histogramming 1000x too slow (Jon Wilson)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 9 Nov 2012 15:26:38 -0600
> From: Jim Knoll <jim...@sp...>
> Subject: Re: [Pytables-users] pyTable index from c++
> To: 'Discussion list for PyTables'
>         <pyt...@li...>
> Message-ID:
>         <
> 142...@SP...>
> Content-Type: text/plain; charset="us-ascii"
>
> Thanks for taking the time.
>
> Most of our tables are very wide  lots of col....  and simple conditions
> are common.... so that is why in-kernel makes almost no impact for me.
>
> -----Original Message-----
> From: Francesc Alted [mailto:fa...@gm...]
> Sent: Friday, November 09, 2012 11:27 AM
> To: pyt...@li...
> Subject: Re: [Pytables-users] pyTable index from c++
>
> Well, expected performance of in-kernel (numexpr powered) queries wrt
> regular (python) queries largely depends on where the bottleneck is. If
> your table has a lot of columns, then the bottleneck is going to be more
> on the I/O side, so you cannot expect a large difference in performance.
> However, if your table has a small number of columns, then there is more
> likelihood that bottleneck is CPU, and your chances to experiment a
> difference are higher.
>
> Of course, having complex queries (i.e. queries that take conditions
> over several columns, or just combinations of conditions in the same
> column) makes the query more CPU intensive, and in-kernel normally wins
> by a comfortable margin.
>
> Finally, what indexing is doing is to reduce the number of rows where
> the conditions have to be evaluated, so depending on the cardinality of
> the query and the associated index, you can get more or less speedup.
>
> Francesc
>
> On 11/9/12 5:12 PM, Jim Knoll wrote:
> >
> > Thanks for the reply. I will put some investigation of C++ access on
> > my list for items to look at over the slow holiday season.
> >
> > For the short term we will store a C++ ready index as a different
> > table object in the same h5 file. It will work... just a bit of a waste
> > on disk space.
> >
> > One follow up question
> >
> > Why would my performance of
> >
> > for row in node.where('stringField == "SomeString"'):
> >
> > *not*be noticeably faster than
> >
> > for row in node:
> >
> > if row.stringField == "SomeString" :
> >
> > Specifically when there is no index. I understand and see the speed
> > improvement only when I have a index. I expected to see some benefit
> > from numexpr even with no index. I expected node.where() to be much
> > faster. What I see is identical performance. Is numexpr benefit only
> > seen for complex math like (floatField ** intField > otherFloatField)
> > I did not see that to be the case on my first attempt.... Seems that I
> > only benefit from a index.
> >
> > *From:*Anthony Scopatz [mailto:sc...@gm...]
> > *Sent:* Friday, November 09, 2012 12:24 AM
> > *To:* Discussion list for PyTables
> > *Subject:* Re: [Pytables-users] pyTable index from c++
> >
> > On Thu, Nov 8, 2012 at 10:19 PM, Jim Knoll
> > <jim...@sp... <mailto:jim...@sp...>>
> > wrote:
> >
> > I love the index function and promote the internal use of PyTables at
> > my company. The availability of a indexed method to speed the search
> > is the main reason why.
> >
> > We are a mixed shop using c++ to create H5 (just for the raw speed ...
> > need to keep up with streaming data) End users start with python
> > pyTables to consume the data. (Often after we have created indexes
> > from python pytables.col.col1.createIndex())
> >
> > Sometimes the users come up with something we want to do thousands of
> > times and performance is critical. But then we are falling back to c++
> > We can use our own index method but would like to make dbl use of the
> > PyTables index.
> >
> > I know the python table.where( is implemented in C.
> >
> > Hi Jim,
> >
> > This is only kind of true. Querying (ie all of the where*() methods)
> > are actually mostly written in Python in the tables.py and
> > expressions.py files. However, they make use of numexpr [1].
> >
> >     Is there a way to access that from c or c++? Don't mind if I need
> >     to do work to get the result I think in my case the work may be
> >     worth it.
> >
> > *PLAN 1:* One possibility is that the parts of PyTables are written in
> > Cython. We could maybe try (without making any edits to these files)
> > to convert them to Cython. This has the advantage that for Cython
> > files, if you write the appropriate C++ header file and link against
> > the shared library correctly, it is possible to access certain
> > functions from C/C++. BUT, I am not sure how much of speed boost you
> > would get out of this since you would still be calling out to the
> > Python interpreter to get these result. You are just calling Python's
> > virtual machine from C++ rather than calling it from Python (like
> > normal). This has the advantage that you would basically get access to
> > these functions acting on tables from C++.
> >
> > *PLAN 2:* Alternatively, numexpr itself is mostly written in C++
> > already. You should be able to call core numexpr functions directly.
> > However, you would have to feed it data that you read from the tables
> > yourself. These could even be table indexes. On a personal note, if
> > you get code working that does this, I would be interested in seeing
> > your implementation. (I have another project where I have tables that
> > I want to query from C++)
> >
> > Let us know what route you ultimately end up taking or if you have any
> > further questions!
> >
> > Be Well
> >
> > Anthony
> >
> > 1. http://code.google.com/p/numexpr/source/browse/#hg%2Fnumexpr
> >
> >
> ------------------------------------------------------------------------
> >
> >     *Jim Knoll**
> >     *Data Developer**
> >
> >     Spot Trading L.L.C
> >     440 South LaSalle St., Suite 2800
> >     Chicago, IL 60605
> >     Office: 312.362.4550 <tel:312.362.4550>
> >     Direct: 312-362-4798 <tel:312-362-4798>
> >     Fax: 312.362.4551 <tel:312.362.4551>
> >     jim...@sp... <mailto:jim...@sp...>
> >     www.spottradingllc.com <http://www.spottradingllc.com/>
> >
> >
> ------------------------------------------------------------------------
> >
> >     The information contained in this message may be privileged and
> >     confidential and protected from disclosure. If the reader of this
> >     message is not the intended recipient, or an employee or agent
> >     responsible for delivering this message to the intended recipient,
> >     you are hereby notified that any dissemination, distribution or
> >     copying of this communication is strictly prohibited. If you have
> >     received this communication in error, please notify us immediately
> >     by replying to the message and deleting it from your computer.
> >     Thank you. Spot Trading, LLC
> >
> >
> >
> ------------------------------------------------------------------------------
> >     Everyone hates slow websites. So do we.
> >     Make your web apps faster with AppDynamics
> >     Download AppDynamics Lite for free today:
> >     http://p.sf.net/sfu/appdyn_d2d_nov
> >     _______________________________________________
> >     Pytables-users mailing list
> >     Pyt...@li...
> >     <mailto:Pyt...@li...>
> >     https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
> >
> >
> >
> ------------------------------------------------------------------------------
> > Everyone hates slow websites. So do we.
> > Make your web apps faster with AppDynamics
> > Download AppDynamics Lite for free today:
> > http://p.sf.net/sfu/appdyn_d2d_nov
> >
> >
> > _______________________________________________
> > Pytables-users mailing list
> > Pyt...@li...
> > https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>
> --
> Francesc Alted
>
>
>
> ------------------------------------------------------------------------------
> Everyone hates slow websites. So do we.
> Make your web apps faster with AppDynamics
> Download AppDynamics Lite for free today:
> http://p.sf.net/sfu/appdyn_d2d_nov
> _______________________________________________
> Pytables-users mailing list
> Pyt...@li...
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>
>
> ------------------------------
>
> Message: 2
> Date: Sun, 11 Nov 2012 01:39:33 +0100
> From: Juan Manuel V?zquez Tovar <jmv...@gm...>
> Subject: [Pytables-users] Store a reference to a dataset
> To: Discussion list for PyTables
>         <pyt...@li...>
> Message-ID:
>         <
> CAD...@ma...>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hello,
>
> I have to deal in pytables with a very large dataset. The file already
> compressed with blosc5 is about 5GB. Is it possible to store objects within
> the same file, each of them containing a reference to a certain search over
> the dataset?
> It is like having a large numpy array and a mask of it in the same pytables
> file.
>
> Thank you,
>
> Juanma
> -------------- next part --------------
> An HTML attachment was scrubbed...
>
> ------------------------------
>
> Message: 3
> Date: Fri, 16 Nov 2012 11:02:25 -0600
> From: Jon Wilson <js...@fn...>
> Subject: [Pytables-users] Histogramming 1000x too slow
> To: pyt...@li...
> Message-ID: <50A...@fn...>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi all,
> I am trying to find the best way to make histograms from large data
> sets.  Up to now, I've been just loading entire columns into in-memory
> numpy arrays and making histograms from those.  However, I'm currently
> working on a handful of datasets where this is prohibitively memory
> intensive (causing an out-of-memory kernel panic on a shared machine
> that you have to open a ticket to have rebooted makes you a little
> gun-shy), so I am now exploring other options.
>
> I know that the Column object is rather nicely set up to act, in some
> circumstances, like a numpy ndarray.  So my first thought is to try just
> creating the histogram out of the Column object directly. This is,
> however, 1000x slower than loading it into memory and creating the
> histogram from the in-memory array.  Please see my test notebook at:
> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>
> For such a small table, loading into memory is not an issue.  For larger
> tables, though, it is a problem, and I had hoped that pytables was
> optimized so that histogramming directly from disk would proceed no
> slower than loading into memory and histogramming. Is there some other
> way of accessing the column (or Array or CArray) data that will make
> faster histograms?
> Regards,
> Jon
>
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 16 Nov 2012 17:10:41 -0800
> From: Anthony Scopatz <sc...@gm...>
> Subject: Re: [Pytables-users] Histogramming 1000x too slow
> To: Discussion list for PyTables
>         <pyt...@li...>
> Message-ID:
>         <
> CAP...@ma...>
> Content-Type: text/plain; charset="iso-8859-1"
>
> On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote:
>
> > Hi all,
> > I am trying to find the best way to make histograms from large data
> > sets.  Up to now, I've been just loading entire columns into in-memory
> > numpy arrays and making histograms from those.  However, I'm currently
> > working on a handful of datasets where this is prohibitively memory
> > intensive (causing an out-of-memory kernel panic on a shared machine
> > that you have to open a ticket to have rebooted makes you a little
> > gun-shy), so I am now exploring other options.
> >
> > I know that the Column object is rather nicely set up to act, in some
> > circumstances, like a numpy ndarray.  So my first thought is to try just
> > creating the histogram out of the Column object directly. This is,
> > however, 1000x slower than loading it into memory and creating the
> > histogram from the in-memory array.  Please see my test notebook at:
> > http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
> >
> > For such a small table, loading into memory is not an issue.  For larger
> > tables, though, it is a problem, and I had hoped that pytables was
> > optimized so that histogramming directly from disk would proceed no
> > slower than loading into memory and histogramming. Is there some other
> > way of accessing the column (or Array or CArray) data that will make
> > faster histograms?
> >
>
> Hi Jon,
>
> This is not surprising since the column object itself is going to be
> iterated
> over per row.  As you found, reading in each row individually will be
> prohibitively expensive as compared to reading in all the data at one.
>
> To do this in the right way for data that is larger than system memory, you
> need to read it in in chunks.  Luckily there are tools to help you automate
> this process already in PyTables.  I would recommend that you use
> expressions [1] or queries [2] to do your historgramming more efficiently.
>
> Be Well
> Anthony
>
> 1. http://pytables.github.com/usersguide/libref/expr_class.html
> 2.
>
> http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
>
>
>
> > Regards,
> > Jon
> >
> >
> >
> ------------------------------------------------------------------------------
> > Monitor your physical, virtual and cloud infrastructure from a single
> > web console. Get in-depth insight into apps, servers, databases, vmware,
> > SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> > Pricing starts from $795 for 25 servers or applications!
> > http://p.sf.net/sfu/zoho_dev2dev_nov
> > _______________________________________________
> > Pytables-users mailing list
> > Pyt...@li...
> > https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
> -------------- next part --------------
> An HTML attachment was scrubbed...
>
> ------------------------------
>
> Message: 5
> Date: Fri, 16 Nov 2012 21:33:46 -0600
> From: Jon Wilson <js...@fn...>
> Subject: Re: [Pytables-users] Histogramming 1000x too slow
> To: Discussion list for PyTables
>         <pyt...@li...>, Anthony Scopatz
>         <sc...@gm...>
> Message-ID: <4c2...@em...>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Anthony,
> I don't think that either of these help me here (unless I've misunderstood
> something). I need to fill the histogram with every row in the table, so
> querying doesn't gain me anything. (especially since the query just returns
> an iterator over rows)  I also don't need (at the moment) to compute any
> function of the column data, just count (weighted) entries into various
> bins. I suppose I could write one Expr for each bin of my histogram, but
> that seems dreadfully inefficient and probably difficult to maintain.
>
> It is a reduction operation, and would greatly benefit from chunking, I
> expect. Not unlike sum(), which is implemented as a specially supported
> reduction operation inside numexpr (buggily, last I checked). I suspect
> that a substantial improvement in histogramming requires direct support
> from either pytables or from numexpr.  I don't suppose that there might be
> a chunked-reduction interface exposed somewhere that I could hook into?
> Jon
>
> Anthony Scopatz <sc...@gm...> wrote:
>
> >On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote:
> >
> >> Hi all,
> >> I am trying to find the best way to make histograms from large data
> >> sets.  Up to now, I've been just loading entire columns into
> >in-memory
> >> numpy arrays and making histograms from those.  However, I'm
> >currently
> >> working on a handful of datasets where this is prohibitively memory
> >> intensive (causing an out-of-memory kernel panic on a shared machine
> >> that you have to open a ticket to have rebooted makes you a little
> >> gun-shy), so I am now exploring other options.
> >>
> >> I know that the Column object is rather nicely set up to act, in some
> >> circumstances, like a numpy ndarray.  So my first thought is to try
> >just
> >> creating the histogram out of the Column object directly. This is,
> >> however, 1000x slower than loading it into memory and creating the
> >> histogram from the in-memory array.  Please see my test notebook at:
> >> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
> >>
> >> For such a small table, loading into memory is not an issue.  For
> >larger
> >> tables, though, it is a problem, and I had hoped that pytables was
> >> optimized so that histogramming directly from disk would proceed no
> >> slower than loading into memory and histogramming. Is there some
> >other
> >> way of accessing the column (or Array or CArray) data that will make
> >> faster histograms?
> >>
> >
> >Hi Jon,
> >
> >This is not surprising since the column object itself is going to be
> >iterated
> >over per row.  As you found, reading in each row individually will be
> >prohibitively expensive as compared to reading in all the data at one.
> >
> >To do this in the right way for data that is larger than system memory,
> >you
> >need to read it in in chunks.  Luckily there are tools to help you
> >automate
> >this process already in PyTables.  I would recommend that you use
> >expressions [1] or queries [2] to do your historgramming more
> >efficiently.
> >
> >Be Well
> >Anthony
> >
> >1. http://pytables.github.com/usersguide/libref/expr_class.html
> >2.
> >
> http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
> >
> >
> >
> >> Regards,
> >> Jon
> >>
> >>
> >>
>
> >------------------------------------------------------------------------------
> >> Monitor your physical, virtual and cloud infrastructure from a single
> >> web console. Get in-depth insight into apps, servers, databases,
> >vmware,
> >> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> >> Pricing starts from $795 for 25 servers or applications!
> >> http://p.sf.net/sfu/zoho_dev2dev_nov
> >> _______________________________________________
> >> Pytables-users mailing list
> >> Pyt...@li...
> >> https://lists.sourceforge.net/lists/listinfo/pytables-users
> >>
> >
> >
> >------------------------------------------------------------------------
> >
>
> >------------------------------------------------------------------------------
> >Monitor your physical, virtual and cloud infrastructure from a single
> >web console. Get in-depth insight into apps, servers, databases,
> >vmware,
> >SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> >Pricing starts from $795 for 25 servers or applications!
> >http://p.sf.net/sfu/zoho_dev2dev_nov
> >
> >------------------------------------------------------------------------
> >
> >_______________________________________________
> >Pytables-users mailing list
> >Pyt...@li...
> >https://lists.sourceforge.net/lists/listinfo/pytables-users
>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
> -------------- next part --------------
> An HTML attachment was scrubbed...
>
> ------------------------------
>
>
> ------------------------------------------------------------------------------
> Monitor your physical, virtual and cloud infrastructure from a single
> web console. Get in-depth insight into apps, servers, databases, vmware,
> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
> Pricing starts from $795 for 25 servers or applications!
> http://p.sf.net/sfu/zoho_dev2dev_nov
>
> ------------------------------
>
> _______________________________________________
> Pytables-users mailing list
> Pyt...@li...
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>
> End of Pytables-users Digest, Vol 78, Issue 6
> *********************************************
>