From: Jon W. <js...@fn...> - 2012-11-17 03:46:38
Hi Anthony,

I don't think that either of these helps me here (unless I've misunderstood something). I need to fill the histogram with every row in the table, so querying doesn't gain me anything, especially since the query just returns an iterator over rows. I also don't need (at the moment) to compute any function of the column data, just count (weighted) entries into various bins. I suppose I could write one Expr for each bin of my histogram, but that seems dreadfully inefficient and probably difficult to maintain.

Histogramming is a reduction operation, and I expect it would benefit greatly from chunking. It is not unlike sum(), which is implemented as a specially supported reduction operation inside numexpr (buggily, last I checked). I suspect that a substantial improvement in histogramming requires direct support from either pytables or numexpr.

I don't suppose that there might be a chunked-reduction interface exposed somewhere that I could hook into?

Jon

Anthony Scopatz <sc...@gm...> wrote:

>On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote:
>
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets. Up to now, I've been just loading entire columns into in-memory
>> numpy arrays and making histograms from those. However, I'm currently
>> working on a handful of datasets where this is prohibitively memory
>> intensive (causing an out-of-memory kernel panic on a shared machine
>> that you have to open a ticket to have rebooted makes you a little
>> gun-shy), so I am now exploring other options.
>>
>> I know that the Column object is rather nicely set up to act, in some
>> circumstances, like a numpy ndarray. So my first thought is to try just
>> creating the histogram out of the Column object directly. This is,
>> however, 1000x slower than loading it into memory and creating the
>> histogram from the in-memory array.
>> Please see my test notebook at:
>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue. For larger
>> tables, though, it is a problem, and I had hoped that pytables was
>> optimized so that histogramming directly from disk would proceed no
>> slower than loading into memory and histogramming. Is there some other
>> way of accessing the column (or Array or CArray) data that will make
>> faster histograms?
>>
>
>Hi Jon,
>
>This is not surprising since the column object itself is going to be
>iterated over per row. As you found, reading in each row individually
>will be prohibitively expensive as compared to reading in all the data
>at once.
>
>To do this in the right way for data that is larger than system memory,
>you need to read it in chunks. Luckily there are tools to help you
>automate this process already in PyTables. I would recommend that you
>use expressions [1] or queries [2] to do your histogramming more
>efficiently.
>
>Be Well
>Anthony
>
>1. http://pytables.github.com/usersguide/libref/expr_class.html
>2. http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
>
>> Regards,
>> Jon
>>
>> _______________________________________________
>> Pytables-users mailing list
>> Pyt...@li...
>> https://lists.sourceforge.net/lists/listinfo/pytables-users

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
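
The chunked reduction asked about in this thread can be sketched with plain NumPy, without relying on any reduction interface in PyTables or numexpr. This is only a minimal sketch under stated assumptions, not an API that either library exposes: `chunked_histogram` and its parameters are hypothetical names, the column-like object is assumed to support `len()` and slice indexing (as NumPy arrays do, and as a PyTables `Column` is expected to), and the bin edges are assumed to be fixed up front so that per-chunk counts can simply be added.

```python
import numpy as np


def chunked_histogram(values, bin_edges, weights=None, chunk_rows=1_000_000):
    """Accumulate a (weighted) histogram over `values`, one slice at a time.

    Hypothetical helper: `values` and `weights` may be any objects that
    support len() and slice indexing; only one chunk is held in memory.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)
    counts = np.zeros(len(bin_edges) - 1, dtype=float)
    n_rows = len(values)

    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        # Each slice is read as one block rather than row by row.
        chunk = np.asarray(values[start:stop])
        w = None if weights is None else np.asarray(weights[start:stop])
        # Fixed edges make the per-chunk histograms directly summable.
        chunk_counts, _ = np.histogram(chunk, bins=bin_edges, weights=w)
        counts += chunk_counts

    return counts
```

With a PyTables table one would presumably pass something like `table.cols.x` as `values`; that usage is an assumption and is not exercised here. Peak memory is bounded by `chunk_rows` rather than by the length of the table, and entries outside the range of `bin_edges` are dropped, as with `np.histogram` itself.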