From: Jon W. <js...@fn...> - 2012-11-18 21:12:47
Hi Stephen,

This sounds fantastic, and exactly what I'm looking for. I'll take a closer
look tomorrow.

Jon

Stephen Simmons <ma...@st...> wrote:

> Back in 2006/07 I wrote an optimized histogram function for pytables +
> numpy. The main steps were:
>
> - Read chunksize-sized sections of the pytables array, so the HDF5
>   library just needs to decompress full blocks of data from disk into
>   memory; this eliminates subsequent copying/merging of partial data
>   blocks.
> - Modify numpy's bincount function to be more suitable for high-speed
>   histograms by avoiding data type conversions, eliminating the initial
>   pass to determine bounds, etc.
> - I also modified the numpy histogram function to update existing
>   histogram counts. This meant huge pytables datasets could be
>   histogrammed by reading in successive chunks.
> - I also wrote a numpy function in C to do weighted averages and simple
>   joins.
>
> The net result of optimising both the pytables data storage and the
> numpy histogramming was probably a 50x increase in speed. Certainly I
> was getting over 1 million rows/sec for weighted-average histograms,
> using a 2005 Dell laptop. I had plans to submit it as a patch to numpy,
> but work priorities at the time took me in another direction. One email
> about it with some C code is here:
> http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
>
> I can send a proper Python source package for it if anyone is
> interested.
>
> Regards
> Stephen
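The chunk-accumulating scheme Stephen describes is easy to sketch in plain
numpy, even without his C-level modifications. The sketch below is
illustrative rather than his actual code; the function name, bin bounds and
default chunksize are made up, and `col` stands for any pytables Column,
Array or CArray:

    import numpy as np

    def chunked_histogram(col, nbins, lo, hi, chunksize=1000000):
        # Illustrative sketch, not Stephen's code: histogram a pytables
        # column in chunks, updating running counts as each chunk arrives.
        edges = np.linspace(lo, hi, nbins + 1)    # bin edges fixed up front
        counts = np.zeros(nbins, dtype=np.int64)  # running totals
        for start in range(0, len(col), chunksize):
            chunk = col[start:start + chunksize]  # one bulk read from disk
            counts += np.histogram(chunk, bins=edges)[0]
        return counts, edges

Choosing chunksize as a multiple of the dataset's HDF5 chunkshape keeps each
slice aligned with whole compressed blocks, which is the first of the
optimisations Stephen lists.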
> ------------------------------
>
> Message: 3
> Date: Sat, 17 Nov 2012 23:54:39 +0100
> From: Francesc Alted <fa...@gm...>
> Subject: Re: [Pytables-users] Histogramming 1000x too slow
> To: Discussion list for PyTables <pyt...@li...>
> Message-ID: <50A...@gm...>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 11/16/12 6:02 PM, Jon Wilson wrote:
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets. Up to now, I've been just loading entire columns into in-memory
>> numpy arrays and making histograms from those. However, I'm currently
>> working on a handful of datasets where this is prohibitively memory
>> intensive (causing an out-of-memory kernel panic on a shared machine
>> that you have to open a ticket to have rebooted makes you a little
>> gun-shy), so I am now exploring other options.
>>
>> I know that the Column object is rather nicely set up to act, in some
>> circumstances, like a numpy ndarray. So my first thought is to try
>> just creating the histogram out of the Column object directly. This
>> is, however, 1000x slower than loading it into memory and creating
>> the histogram from the in-memory array. Please see my test notebook
>> at: http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue. For
>> larger tables, though, it is a problem, and I had hoped that pytables
>> was optimized so that histogramming directly from disk would proceed
>> no slower than loading into memory and histogramming. Is there some
>> other way of accessing the column (or Array or CArray) data that will
>> make faster histograms?
>
> Indeed, a 1000x slowdown is quite a lot, but it is important to stress
> that you are doing a disk operation whenever you access a data element,
> and that takes time. Perhaps using Array or CArray would make times a
> bit better, but frankly, I don't think this is going to buy you much
> speed.
>
> The problem here is that you have too many layers, and this makes
> access slower. You may have better luck with carray
> (https://github.com/FrancescAlted/carray), which supports this sort of
> operation but uses a much simpler persistence machinery. At any rate,
> the results are far better than PyTables:
>
> In [6]: import numpy as np
>
> In [7]: import carray as ca
>
> In [8]: N = 1e7
>
> In [9]: a = np.random.rand(N)
>
> In [10]: %time h = np.histogram(a)
> CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
> Wall time: 0.55 s
>
> In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
>
> In [12]: %time h = np.histogram(ad)
> CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
> Wall time: 5.81 s
>
> So the overhead for using a disk-based array is just 10x (not 1000x as
> in PyTables). I don't know if a 10x slowdown is acceptable to you, but
> in case you need more speed, you could probably implement the histogram
> as a method of the carray class in Cython:
>
> https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
>
> It should not be too difficult to come up with an optimal
> implementation using a chunk-based approach.
>
> -- Francesc Alted

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
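Before reaching for Cython, the chunk-based approach Francesc suggests can
also be prototyped in pure Python. The sketch below is illustrative only
(not the carrayExtension implementation); it assumes carray's chunklen
attribute and slices on chunk boundaries, so each read decompresses whole
chunks rather than single elements:

    import numpy as np

    def carray_histogram(cda, nbins, lo, hi):
        # Illustrative sketch: walk a (disk-based) carray one native chunk
        # at a time, accumulating counts into a fixed set of bins.
        edges = np.linspace(lo, hi, nbins + 1)
        counts = np.zeros(nbins, dtype=np.int64)
        step = cda.chunklen                  # rows per compressed chunk
        for start in range(0, len(cda), step):
            block = cda[start:start + step]  # decompress one chunk to numpy
            counts += np.histogram(block, bins=edges)[0]
        return counts, edges

For the `ad` array in Francesc's session above, carray_histogram(ad, 10,
0.0, 1.0) would compute essentially the same 10-bin histogram while touching
one chunk at a time; porting the same loop into carrayExtension.pyx would
presumably remove most of the remaining Python overhead, as Francesc
suggests.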