Re: [Pytables-users] Histogramming 1000x too slow


Hi Anthony,
I don't think that either of these helps me here (unless I've misunderstood something). I need to fill the histogram with every row in the table, so querying doesn't gain me anything, especially since the query just returns an iterator over rows. I also don't need (at the moment) to compute any function of the column data, just count (weighted) entries into various bins. I suppose I could write one Expr for each bin of my histogram, but that seems dreadfully inefficient and probably difficult to maintain.

Histogramming is a reduction operation, and I expect it would benefit greatly from chunking. It is not unlike sum(), which is implemented as a specially supported reduction operation inside numexpr (buggily, last I checked). I suspect that a substantial improvement in histogramming requires direct support from either pytables or numexpr.  I don't suppose there is a chunked-reduction interface exposed somewhere that I could hook into?
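
To make the idea concrete, here is a minimal sketch of a chunked histogram reduction done by hand with plain numpy. It is not an existing pytables or numexpr interface: the function name chunked_histogram and the chunk_rows parameter are hypothetical, and it assumes only that the column supports len() and slicing (as a numpy array does, and as a Column should) and that the bin edges are fixed in advance so that per-chunk counts can simply be added.

```python
import numpy as np


def chunked_histogram(column, bin_edges, weights=None, chunk_rows=1_000_000):
    """Accumulate a (optionally weighted) histogram one slice at a time.

    `column` and `weights` only need to support len() and slicing, so at
    most `chunk_rows` values of each are held in memory at once.
    Hypothetical helper, not part of pytables or numexpr.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)
    counts = np.zeros(len(bin_edges) - 1, dtype=float)
    n_rows = len(column)

    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        values = column[start:stop]  # one bulk read per chunk
        w = None if weights is None else weights[start:stop]
        # Same fixed edges for every chunk, so the partial counts add up.
        chunk_counts, _ = np.histogram(values, bins=bin_edges, weights=w)
        counts += chunk_counts

    return counts
```

With a table open, this would be called along the lines of chunked_histogram(table.cols.x, edges, weights=table.cols.w), where x and w are placeholder column names. The chunk size trades memory for the number of reads, so it would need tuning per dataset.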
Jon

Anthony Scopatz <sc...@gm...> wrote:

>On Fri, Nov 16, 2012 at 9:02 AM, Jon Wilson <js...@fn...> wrote:
>
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets.  Up to now, I've been just loading entire columns into in-memory
>> numpy arrays and making histograms from those.  However, I'm currently
>> working on a handful of datasets where this is prohibitively memory
>> intensive (causing an out-of-memory kernel panic on a shared machine
>> that you have to open a ticket to have rebooted makes you a little
>> gun-shy), so I am now exploring other options.
>>
>> I know that the Column object is rather nicely set up to act, in some
>> circumstances, like a numpy ndarray.  So my first thought is to try just
>> creating the histogram out of the Column object directly. This is,
>> however, 1000x slower than loading it into memory and creating the
>> histogram from the in-memory array.  Please see my test notebook at:
>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue.  For larger
>> tables, though, it is a problem, and I had hoped that pytables was
>> optimized so that histogramming directly from disk would proceed no
>> slower than loading into memory and histogramming. Is there some other
>> way of accessing the column (or Array or CArray) data that will make
>> faster histograms?
>>
>
>Hi Jon,
>
>This is not surprising, since the Column object itself is going to be
>iterated over row by row.  As you found, reading in each row individually
>is prohibitively expensive compared to reading in all the data at once.
>
>To do this the right way for data that is larger than system memory, you
>need to read it in chunks.  Luckily, there are already tools in PyTables
>to help you automate this process.  I would recommend that you use
>expressions [1] or queries [2] to do your histogramming more efficiently.
>
>Be Well
>Anthony
>
>1. http://pytables.github.com/usersguide/libref/expr_class.html
>2. http://pytables.github.com/usersguide/libref/structured_storage.html?#table-methods-querying
>
>
>
>> Regards,
>> Jon
>>
>>
>>
>> _______________________________________________
>> Pytables-users mailing list
>> Pyt...@li...
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
>>
