From: Jon W. <js...@fn...> - 2012-11-18 21:12:47
Hi Stephen,

This sounds fantastic, and exactly what I'm looking for. I'll take a closer
look tomorrow.

Jon

Stephen Simmons <ma...@st...> wrote:

> Back in 2006/07 I wrote an optimized histogram function for pytables +
> numpy. The main steps were:
>
> - Read chunksize-sized sections of the pytables array, so the HDF5
>   library just needs to decompress full blocks of data from disk into
>   memory; this eliminates subsequent copying/merging of partial data
>   blocks.
> - Modify numpy's bincount function to be more suitable for high-speed
>   histograms by avoiding data type conversions, eliminating the initial
>   pass to determine bounds, etc.
> - I also modified the numpy histogram function to update existing
>   histogram counts. This meant huge pytables datasets could be
>   histogrammed by reading in successive chunks.
> - I also wrote a numpy function in C to do weighted averages and simple
>   joins.
>
> The net result of optimising both the pytables data storage and the
> numpy histogramming was probably a 50x increase in speed. Certainly I
> was getting over 1 million rows/sec for weighted-average histograms,
> using a 2005 Dell laptop. I had plans to submit it as a patch to numpy,
> but work priorities at the time took me in another direction. One email
> about it with some C code is here:
> http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
>
> I can send a proper Python source package for it if anyone is
> interested.
>
> Regards
> Stephen
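The chunk-accumulating scheme Stephen describes is easy to sketch in plain
numpy, even without his C-level modifications. The sketch below is
illustrative rather than his actual code; the function name, bin bounds and
default chunksize are made up, and `col` stands for any pytables Column,
Array or CArray:

    import numpy as np

    def chunked_histogram(col, nbins, lo, hi, chunksize=1000000):
        # Illustrative sketch, not Stephen's code: histogram a pytables
        # column in chunks, updating running counts as each chunk arrives.
        edges = np.linspace(lo, hi, nbins + 1)    # bin edges fixed up front
        counts = np.zeros(nbins, dtype=np.int64)  # running totals
        for start in range(0, len(col), chunksize):
            chunk = col[start:start + chunksize]  # one bulk read from disk
            counts += np.histogram(chunk, bins=edges)[0]
        return counts, edges

Choosing chunksize as a multiple of the dataset's HDF5 chunkshape keeps each
slice aligned with whole compressed blocks, which is the first of the
optimisations Stephen lists.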
> ------------------------------
>
> Message: 3
> Date: Sat, 17 Nov 2012 23:54:39 +0100
> From: Francesc Alted <fa...@gm...>
> Subject: Re: [Pytables-users] Histogramming 1000x too slow
> To: Discussion list for PyTables <pyt...@li...>
> Message-ID: <50A...@gm...>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 11/16/12 6:02 PM, Jon Wilson wrote:
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets. Up to now, I've been just loading entire columns into in-memory
>> numpy arrays and making histograms from those. However, I'm currently
>> working on a handful of datasets where this is prohibitively memory
>> intensive (causing an out-of-memory kernel panic on a shared machine
>> that you have to open a ticket to have rebooted makes you a little
>> gun-shy), so I am now exploring other options.
>>
>> I know that the Column object is rather nicely set up to act, in some
>> circumstances, like a numpy ndarray. So my first thought is to try
>> just creating the histogram out of the Column object directly. This
>> is, however, 1000x slower than loading it into memory and creating
>> the histogram from the in-memory array. Please see my test notebook
>> at: http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue. For
>> larger tables, though, it is a problem, and I had hoped that pytables
>> was optimized so that histogramming directly from disk would proceed
>> no slower than loading into memory and histogramming. Is there some
>> other way of accessing the column (or Array or CArray) data that will
>> make faster histograms?
>
> Indeed, a 1000x slowdown is quite a lot, but it is important to stress
> that you are doing a disk operation whenever you access a data element,
> and that takes time. Perhaps using Array or CArray would make times a
> bit better, but frankly, I don't think this is going to buy you much
> speed.
>
> The problem here is that you have too many layers, and this makes
> access slower. You may have better luck with carray
> (https://github.com/FrancescAlted/carray), which supports this sort of
> operation but uses a much simpler persistence machinery. At any rate,
> the results are far better than PyTables:
>
> In [6]: import numpy as np
>
> In [7]: import carray as ca
>
> In [8]: N = 1e7
>
> In [9]: a = np.random.rand(N)
>
> In [10]: %time h = np.histogram(a)
> CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
> Wall time: 0.55 s
>
> In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
>
> In [12]: %time h = np.histogram(ad)
> CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
> Wall time: 5.81 s
>
> So the overhead for using a disk-based array is just 10x (not 1000x as
> in PyTables). I don't know if a 10x slowdown is acceptable to you, but
> in case you need more speed, you could probably implement the histogram
> as a method of the carray class in Cython:
>
> https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
>
> It should not be too difficult to come up with an optimal
> implementation using a chunk-based approach.
>
> -- Francesc Alted

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
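Before reaching for Cython, the chunk-based approach Francesc suggests can
also be prototyped in pure Python. The sketch below is illustrative only
(not the carrayExtension implementation); it assumes carray's chunklen
attribute and slices on chunk boundaries, so each read decompresses whole
chunks rather than single elements:

    import numpy as np

    def carray_histogram(cda, nbins, lo, hi):
        # Illustrative sketch: walk a (disk-based) carray one native chunk
        # at a time, accumulating counts into a fixed set of bins.
        edges = np.linspace(lo, hi, nbins + 1)
        counts = np.zeros(nbins, dtype=np.int64)
        step = cda.chunklen                  # rows per compressed chunk
        for start in range(0, len(cda), step):
            block = cda[start:start + step]  # decompress one chunk to numpy
            counts += np.histogram(block, bins=edges)[0]
        return counts, edges

For the `ad` array in Francesc's session above, carray_histogram(ad, 10,
0.0, 1.0) would compute essentially the same 10-bin histogram while touching
one chunk at a time; porting the same loop into carrayExtension.pyx would
presumably remove most of the remaining Python overhead, as Francesc
suggests.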