From: Jon W. <js...@fn...> - 2012-11-19 20:59:51
Hi Anthony,
On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
> Hi Jon,
>
> Barring changes to numexpr itself, this is exactly what I am
> suggesting. Well, either writing one query expr per bin or (more
> cleverly) writing one expr which when evaluated for a row returns the
> integer bin number (1, 2, 3,...) this row falls in. Then you can
> simply count() for each bin number.
>
> For example, if you wanted to histogram data which ran from [0,100]
> into 10 bins, then the expr "r/10" into a dtype=int would do the
> trick. This has the advantage of only running over the data once.
> (Also, I am not convinced that running over the data multiple times
> is less efficient than doing row-based iteration. You would have to
> test it on your data to find out.)
>
> It is a reduction operation, and would greatly benefit from
> chunking, I expect. Not unlike sum(), which is implemented as a
> specially supported reduction operation inside numexpr (buggily,
> last I checked). I suspect that a substantial improvement in
> histogramming requires direct support from either pytables or from
> numexpr. I don't suppose that there might be a chunked-reduction
> interface exposed somewhere that I could hook into?
>
>
> This is definitely a feature to request from numexpr.
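(For concreteness, a minimal sketch of that bin-number-expr idea using numexpr
directly; 'mytable' and the column name 'r' are placeholders here, assuming
values in [0, 100] split into 10 equal bins:)

import numexpr as ne
import numpy as np

r = mytable.col('r')                        # pulls the whole column into memory
binno = ne.evaluate('r / 10').astype(int)   # integer bin number for every row
binno = binno.clip(0, 9)                    # fold the edge value 100 into the last bin
counts = np.bincount(binno, minlength=10)   # one count per bin, empty bins included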
I've been fiddling around with Stephen's code a bit, and it looks like
the best way to do things is to read the data in chunks (whether exactly
table.chunkshape rows or not is a matter for optimization) and create
histograms of those chunks. Then combining the per-chunk
histograms is a trivial sum operation. This type of approach can be
generically applied in many cases, I suspect, where row-by-row iteration
is prohibitively slow, but the dataset is too large to fit into memory.
As I understand it, this idea is the primary win of PyTables in the first
place!
So, I think it would be extraordinarily helpful to provide a
chunked-iteration interface for this sort of use case. It can be as
simple as a wrapper around Table.read():
class Table:
    def chunkiter(self, field=None):
        # Table.chunkshape is a 1-tuple giving the chunk length in rows
        chunksize = self.chunkshape[0]
        n = 0
        while n * chunksize < self.nrows:
            yield self.read(n * chunksize,
                            min((n + 1) * chunksize, self.nrows), field=field)
            n += 1
Then I can write something like

from numpy import linspace, histogram

bins = linspace(-1, 1, 101)
hist = sum(histogram(chunk, bins=bins)[0] for chunk in
           mytable.chunkiter(myfield))
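(Only element [0] of numpy.histogram's (counts, bin_edges) return value is
summed; since every chunk is binned against the same fixed edges, the
per-chunk counts add up to exactly the full histogram.)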
Preliminary tests seem to indicate that, for a table with 1 column and
10M rows, reading in "chunks" of 10x chunksize gives the best
read-time-per-row. This is perhaps naive as regards chunksize black
magic, though...
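(Roughly the kind of loop one can use to probe that, with 'mytable' and
'myfield' standing in for the real table and column, and the multiples chosen
arbitrarily:)

import time

rowchunk = mytable.chunkshape[0]                  # rows per HDF5 chunk
for k in (1, 2, 5, 10, 20, 50):                   # chunk multiples to try
    t0 = time.time()
    for lo in range(0, mytable.nrows, k * rowchunk):
        mytable.read(lo, min(lo + k * rowchunk, mytable.nrows), field=myfield)
    print(k, (time.time() - t0) / mytable.nrows)  # seconds per row at this multiple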
And of course, if implemented by numexpr, it could benefit from the nice
automatic multithreading there.
Also, I might dig in a bit and see about extending the "field" argument
to read so it can read multiple fields at once (to do N-dimensional
histograms), as you suggested in a previous mail some months ago.
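(Once read()/chunkiter can hand back several fields at once, the same
accumulate-per-chunk trick should carry straight over to numpy.histogramdd;
a rough sketch, with the field names 'x' and 'y' purely hypothetical:)

import numpy as np

bins = [np.linspace(-1, 1, 101), np.linspace(0, 10, 51)]
hist = None
for chunk in mytable.chunkiter():                 # one structured array per chunk
    sample = np.column_stack([chunk['x'], chunk['y']])
    counts, _ = np.histogramdd(sample, bins=bins)
    hist = counts if hist is None else hist + counts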
Best Regards,
Jon