From: Jon W. <js...@fn...> - 2012-11-19 20:59:51
Hi Anthony,

On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
> Hi Jon,
>
> Barring changes to numexpr itself, this is exactly what I am
> suggesting. Well, either writing one query expr per bin or (more
> cleverly) writing one expr which, when evaluated for a row, returns
> the integer bin number (1, 2, 3, ...) that row falls in. Then you can
> simply count() for each bin number.
>
> For example, if you wanted to histogram data which ran over [0, 100]
> into 10 bins, then the expr "r/10" into a dtype=int would do the
> trick. This has the advantage of only running over the data once.
> (Also, I am not convinced that running over the data multiple times
> is less efficient than doing row-based iteration. You would have to
> test it on your data to find out.)
>
>> It is a reduction operation, and would greatly benefit from
>> chunking, I expect. Not unlike sum(), which is implemented as a
>> specially supported reduction operation inside numexpr (buggily,
>> last I checked). I suspect that a substantial improvement in
>> histogramming requires direct support from either pytables or from
>> numexpr. I don't suppose that there might be a chunked-reduction
>> interface exposed somewhere that I could hook into?
>
> This is definitely a feature to request from numexpr.

I've been fiddling around with Stephen's code a bit, and it looks like
the best way to do things is to read chunks of the data in at a time
(whether exactly of table.chunksize or not is a matter for
optimization) and create histograms of those chunks. Combining the
histograms is then a trivial sum operation. I suspect this type of
approach can be applied generically in many cases where row-by-row
iteration is prohibitively slow but the dataset is too large to fit
into memory. As I understand it, this idea is the primary win of
PyTables in the first place!

So, I think it would be extraordinarily helpful to provide a
chunked-iteration interface for this sort of use case. It could be as
simple as a wrapper around Table.read():

    class Table:
        def chunkiter(self, field=None):
            n = 0
            while n * self.chunksize < self.nrows:
                yield self.read(n * self.chunksize,
                                (n + 1) * self.chunksize, field=field)
                n += 1

Then I can write something like

    bins = linspace(-1, 1, 101)
    hist = sum(histogram(chunk, bins=bins)[0]
               for chunk in mytable.chunkiter(myfield))

(taking just the counts from histogram(), since it returns a
(counts, edges) tuple).

Preliminary tests seem to indicate that, for a table with 1 column and
10M rows, reading in "chunks" of 10x the chunksize gives the best read
time per row. This is perhaps naive as regards chunksize black magic,
though... And of course, if implemented by numexpr, it could benefit
from the nice automatic multithreading there.

Also, I might dig in a bit and see about extending the "field" argument
to read() so it can read multiple fields at once (to do N-dimensional
histograms), as you suggested in a previous mail some months ago.

Best Regards,
Jon
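P.S. To make the one-pass integer-bin idea above concrete, here is a
minimal sketch using numexpr directly on an in-memory array. The array
name "r", the [0, 100) range, and the use of bincount() to tally the
bin numbers are illustrative assumptions, not anything PyTables or
numexpr provides for you:

    import numexpr as ne
    import numpy as np

    # Stand-in for a table column with values in [0, 100).
    r = np.random.uniform(0, 100, size=10**6)

    # One expression, one pass over the data: evaluating "r / 10" and
    # casting to int assigns each row its integer bin number (0..9).
    bin_idx = ne.evaluate("r / 10").astype(np.intp)

    # Counting occurrences of each bin number gives the histogram;
    # minlength keeps empty bins from being dropped.
    counts = np.bincount(bin_idx, minlength=10)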
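P.P.S. For completeness, here is the chunked-histogram loop as a
self-contained function over an existing table, rather than a patch to
the Table class. This assumes the modern PyTables spelling
tables.open_file (older releases use openFile) and uses the real
chunkshape attribute in place of my hypothetical chunksize; the file
name, node path, field name, and the 10x read multiple are all
illustrative:

    import numpy as np
    import tables

    def chunkiter(table, field=None, multiple=10):
        # Table objects expose the HDF5 chunk shape as `chunkshape`
        # (a tuple); read `multiple` chunks' worth of rows per step.
        blocksize = table.chunkshape[0] * multiple
        for start in range(0, table.nrows, blocksize):
            yield table.read(start, min(start + blocksize, table.nrows),
                             field=field)

    with tables.open_file("data.h5", mode="r") as h5:
        mytable = h5.root.mytable  # hypothetical node path
        bins = np.linspace(-1, 1, 101)
        # histogram() returns (counts, edges); accumulate the counts
        # across chunks, which is the trivial sum mentioned above.
        hist = sum(np.histogram(chunk, bins=bins)[0]
                   for chunk in chunkiter(mytable, field="myfield"))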