From: David W. <so...@av...> - 2012-11-18 23:29:48
Yes _please_ Stephen. It would be much appreciated.

On 19/11/2012, at 8:12 AM, Jon Wilson wrote:

> Hi Stephen,
> This sounds fantastic, and exactly what I'm looking for. I'll take a
> closer look tomorrow.
> Jon
>
> Stephen Simmons <ma...@st...> wrote:
>> Back in 2006/07 I wrote an optimized histogram function for pytables +
>> numpy. The main steps were:
>> - Read in chunk-sized sections of the pytables array, so the HDF5
>>   library just needs to decompress full blocks of data from disk into
>>   memory; this eliminates the subsequent copying/merging of partial
>>   data blocks.
>> - Modify numpy's bincount function to be more suitable for high-speed
>>   histograms by avoiding data type conversions, eliminating the
>>   initial pass to determine bounds, etc.
>> - Modify the numpy histogram function to update existing histogram
>>   counts. This meant huge pytables datasets could be histogrammed by
>>   reading in successive chunks.
>> - I also wrote numpy functions in C to do weighted averages and simple
>>   joins.
>> The net result of optimizing both the pytables data storage and the
>> numpy histogramming was probably a 50x increase in speed. Certainly I
>> was getting >1M rows/sec for weighted-average histograms, using a 2005
>> Dell laptop. I had plans to submit it as a patch to numpy, but work
>> priorities at the time took me in another direction. One email about
>> it with some C code is here:
>> http://mail.scipy.org/pipermail/numpy-discussion/2007-March/026472.html
>> I can send a proper Python source package for it if anyone is
>> interested.
>> Regards
>> Stephen
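For readers who want to try this before Stephen's package is available, here is a minimal sketch of the chunked accumulation he describes. It assumes the PyTables 2.x API of the day, a 1-D numeric array node, and fixed bin edges; the file and node names are illustrative, and plain np.histogram stands in for his modified bincount:

    import numpy as np
    import tables

    def chunked_histogram(node, bin_edges, chunksize=1000000):
        # Accumulate counts over the on-disk array one slab at a time,
        # so at most `chunksize` elements are in memory at once.
        counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
        for start in range(0, node.nrows, chunksize):
            chunk = node[start:start + chunksize]             # one contiguous read
            counts += np.histogram(chunk, bins=bin_edges)[0]  # update running counts
        return counts

    fileh = tables.openFile('data.h5', 'r')  # PyTables 2.x spelling; hypothetical file
    edges = np.linspace(0.0, 1.0, 101)       # 100 equal-width bins over [0, 1]
    h = chunked_histogram(fileh.getNode('/mygroup/myarray'), edges)
    fileh.close()

The key point is the running update of `counts`: each chunk is read, binned, and discarded, so memory use stays flat no matter how large the dataset is.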
> Message: 3
> Date: Sat, 17 Nov 2012 23:54:39 +0100
> From: Francesc Alted <fa...@gm...>
> Subject: Re: [Pytables-users] Histogramming 1000x too slow
> To: Discussion list for PyTables <pyt...@li...>
> Message-ID: <50A...@gm...>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 11/16/12 6:02 PM, Jon Wilson wrote:
>> Hi all,
>> I am trying to find the best way to make histograms from large data
>> sets. Up to now, I've been just loading entire columns into in-memory
>> numpy arrays and making histograms from those. However, I'm currently
>> working on a handful of datasets where this is prohibitively memory
>> intensive (causing an out-of-memory kernel panic on a shared machine
>> that you have to open a ticket to have rebooted makes you a little
>> gun-shy), so I am now exploring other options.
>>
>> I know that the Column object is rather nicely set up to act, in some
>> circumstances, like a numpy ndarray. So my first thought was to try
>> creating the histogram from the Column object directly. This is,
>> however, 1000x slower than loading the column into memory and creating
>> the histogram from the in-memory array. Please see my test notebook at:
>> http://www-cdf.fnal.gov/~jsw/pytables%20test%20stuff.html
>>
>> For such a small table, loading into memory is not an issue. For
>> larger tables, though, it is a problem, and I had hoped that pytables
>> was optimized so that histogramming directly from disk would proceed
>> no slower than loading into memory and histogramming. Is there some
>> other way of accessing the column (or Array or CArray) data that will
>> make faster histograms?
>
> Indeed, a 1000x slowdown is quite a lot, but it is important to stress
> that you are doing a disk operation whenever you access a data element,
> and that takes time. Perhaps using Array or CArray would make times a
> bit better, but frankly, I don't think that is going to buy you much
> speed.
>
> The problem here is that you have too many layers, and this makes
> access slower. You may have better luck with carray
> (https://github.com/FrancescAlted/carray), which supports this sort of
> operation but uses a much simpler persistence machinery. At any rate,
> the results are far better than PyTables':
>
> In [6]: import numpy as np
>
> In [7]: import carray as ca
>
> In [8]: N = 1e7
>
> In [9]: a = np.random.rand(N)
>
> In [10]: %time h = np.histogram(a)
> CPU times: user 0.55 s, sys: 0.00 s, total: 0.55 s
> Wall time: 0.55 s
>
> In [11]: ad = ca.carray(a, rootdir='/tmp/a.carray')
>
> In [12]: %time h = np.histogram(ad)
> CPU times: user 5.72 s, sys: 0.07 s, total: 5.79 s
> Wall time: 5.81 s
>
> So the overhead for using a disk-based array is just 10x (not 1000x as
> in PyTables). I don't know if a 10x slowdown is acceptable to you, but
> in case you need more speed, you could probably implement the histogram
> as a method of the carray class in Cython:
>
> https://github.com/FrancescAlted/carray/blob/master/carray/carrayExtension.pyx#L651
>
> It should not be too difficult to come up with an optimal
> implementation using a chunk-based approach.
>
> -- Francesc Alted
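In the same spirit, here is a pure-Python sketch of the chunk-based approach Francesc suggests. It is not the Cython method he points to, so expect it to land somewhere between the 0.55 s in-memory time and the 5.8 s naive disk time rather than match Cython speed; it assumes carray's chunklen attribute gives the array's native chunk length:

    import numpy as np
    import carray as ca

    def carray_histogram(cad, bin_edges):
        # Walk the carray in steps of its native chunk length, so each
        # slice decompresses exactly one chunk into a temporary numpy array.
        counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
        step = cad.chunklen
        for start in range(0, len(cad), step):
            counts += np.histogram(cad[start:start + step], bins=bin_edges)[0]
        return counts

    ad = ca.carray(np.random.rand(int(1e7)), rootdir='/tmp/a.carray')
    h = carray_histogram(ad, np.linspace(0.0, 1.0, 11))  # 10 bins over [0, 1]

Stepping by the chunk length matters: slices aligned to chunk boundaries avoid decompressing the same chunk twice, which is where the naive np.histogram(ad) call loses its time.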