Hi Anthony,



On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
Hi Jon, 

Barring changes to numexpr itself, this is exactly what I am suggesting.  Well, either writing one query expr per bin or (more cleverly) writing one expr which, when evaluated for a row, returns the integer bin number (1, 2, 3,...) this row falls in.  Then you can simply count() for each bin number.

For example, if you wanted to histogram data which ran from [0,100] into 10 bins, then the expr "r/10" into a dtype=int would do the trick.  This has the advantage of only running over the data once.  (Also, I am not convinced that running over the data multiple times is less efficient than doing row-based iteration.  You would have to test it on your data to find out.)
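
As a rough illustration of the bin-number idea, here is a sketch in plain Python rather than numexpr. The helper name bin_number and the sample values are made up for the example. One detail worth noting: a bare "r/10" cast to int on [0,100] sends r == 100 to an eleventh bin number (10), so the sketch clamps the top edge into the last bin.

```python
from collections import Counter

def bin_number(r, lo=0.0, hi=100.0, nbins=10):
    """Map a value in [lo, hi] to an integer bin number in 0..nbins-1."""
    width = (hi - lo) / nbins
    # Clamp so that r == hi lands in the last bin instead of a bin of its own.
    return min(int((r - lo) / width), nbins - 1)

# Hypothetical sample values standing in for one column of a table.
data = [0, 5, 9.9, 10, 55, 99, 100]

# One pass over the data: compute each row's bin number, then count per bin.
counts = Counter(bin_number(r) for r in data)
```

Here counts maps each bin number to the number of rows that fell into it, which is the per-bin count() described above.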
 
It is a reduction operation, and would greatly benefit from chunking, I expect. Not unlike sum(), which is implemented as a specially supported reduction operation inside numexpr (buggily, last I checked). I suspect that a substantial improvement in histogramming requires direct support from either pytables or from numexpr. I don't suppose that there might be a chunked-reduction interface exposed somewhere that I could hook into?

This is definitely a feature to request from numexpr.

I've been fiddling around with Stephen's code a bit, and it looks like the best way to do things is to read chunks (whether exactly of table.chunksize or not is a matter for optimization) of the data in at a time, and create histograms of those chunks.  Then combining the histograms is a trivial sum operation.  This type of approach can be generically applied in many cases, I suspect, where row-by-row iteration is prohibitively slow, but the dataset is too large to fit into memory.  As I understand it, this idea is the primary win of PyTables in the first place!

So, I think it would be extraordinarily helpful to provide a chunked-iteration interface for this sort of use case.  It can be as simple as a wrapper around Table.read():

class Table:
    def chunkiter(self, field=None):
        n = 0  # index of the next chunk to read
        while n*self.chunksize < self.nrows:
            yield self.read(n*self.chunksize, (n+1)*self.chunksize, field=field)
            n += 1

Then I can write something like
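from numpy import histogram, linspace
bins = linspace(-1, 1, 101)
# histogram() returns (counts, bin_edges); sum only the counts
hist = sum(histogram(chunk, bins=bins)[0] for chunk in mytable.chunkiter(myfield))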
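
To make the chunk-then-sum pattern concrete without needing a real table, here is a small self-contained sketch. It is only an illustration under stated assumptions: a plain list stands in for the table column, the function names are hypothetical, and the per-chunk counting is written out by hand (in practice numpy.histogram would do that step).

```python
from bisect import bisect_right

def chunkiter(values, chunksize):
    """Yield successive slices of at most `chunksize` items."""
    for start in range(0, len(values), chunksize):
        yield values[start:start + chunksize]

def histogram(chunk, edges):
    """Count values per bin; bins are half-open except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for x in chunk:
        if edges[0] <= x <= edges[-1]:  # values outside the range are ignored
            i = min(bisect_right(edges, x) - 1, len(counts) - 1)
            counts[i] += 1
    return counts

def chunked_histogram(values, edges, chunksize):
    """Histogram each chunk separately, then combine by elementwise sum."""
    total = [0] * (len(edges) - 1)
    for chunk in chunkiter(values, chunksize):
        part = histogram(chunk, edges)
        total = [t + p for t, p in zip(total, part)]
    return total

# Made-up data and bin edges, purely for illustration.
edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
data = [-1.0, -0.75, -0.5, -0.1, 0.0, 0.2, 0.5, 0.9, 1.0, 2.0]
result = chunked_histogram(data, edges, chunksize=3)
```

Because the per-chunk counts simply add, the result does not depend on the chunk size, which is what makes the chunk size a pure tuning knob.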

Preliminary tests seem to indicate that, for a table with 1 column and 10M rows, reading in "chunks" of 10x chunksize gives the best read-time-per-row.  This is perhaps naive as regards chunksize black magic, though...

And of course, if implemented by numexpr, it could benefit from the nice automatic multithreading there.

Also, I might dig in a bit and see about extending the "field" argument of read() so it can read multiple fields at once (to do N-dimensional histograms), as you suggested in a previous mail some months ago.

Best Regards,
Jon