From: Jon W. <js...@fn...> - 2012-11-19 20:59:51
Hi Anthony,
On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
> Hi Jon,
>
> Barring changes to numexpr itself, this is exactly what I am
> suggesting. Well, either writing one query expr per bin or (more
> cleverly) writing one expr which when evaluated for a row returns the
> integer bin number (1, 2, 3,...) this row falls in. Then you can
> simply count() for each bin number.
>
> For example, if you wanted to histogram data which ran from [0,100]
> into 10 bins, then the expr "r/10" into a dtype=int would do the
> trick. This has the advantage of only running over the data once.
> (Also, I am not convinced that running over the data multiple times
> is less efficient than doing row-based iteration. You would have to
> test it on your data to find out.)
>
> It is a reduction operation, and would greatly benefit from
> chunking, I expect. Not unlike sum(), which is implemented as a
> specially supported reduction operation inside numexpr (buggily,
> last I checked). I suspect that a substantial improvement in
> histogramming requires direct support from either pytables or from
> numexpr. I don't suppose that there might be a chunked-reduction
> interface exposed somewhere that I could hook into?
>
>
> This is definitely a feature to request from numexpr.
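(For concreteness, a minimal sketch of that bin-number-expr idea using numexpr
directly; 'mytable' and the column name 'r' are placeholders here, assuming
values in [0, 100] split into 10 equal bins:)

import numexpr as ne
import numpy as np

r = mytable.col('r')                        # pulls the whole column into memory
binno = ne.evaluate('r / 10').astype(int)   # integer bin number for every row
binno = binno.clip(0, 9)                    # fold the edge value 100 into the last bin
counts = np.bincount(binno, minlength=10)   # one count per bin, empty bins included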
I've been fiddling around with Stephen's code a bit, and it looks like
the best way to do things is to read the data in chunks (whether exactly
table.chunkshape rows or not is a matter for optimization) and create
histograms of those chunks. Then combining the per-chunk
histograms is a trivial sum operation. This type of approach can be
generically applied in many cases, I suspect, where row-by-row iteration
is prohibitively slow, but the dataset is too large to fit into memory.
As I understand it, this idea is the primary win of PyTables in the first
place!
So, I think it would be extraordinarily helpful to provide a
chunked-iteration interface for this sort of use case. It can be as
simple as a wrapper around Table.read():
class Table:
    def chunkiter(self, field=None):
        # Table.chunkshape is a 1-tuple giving the chunk length in rows
        chunksize = self.chunkshape[0]
        n = 0
        while n * chunksize < self.nrows:
            yield self.read(n * chunksize,
                            min((n + 1) * chunksize, self.nrows), field=field)
            n += 1
Then I can write something like

from numpy import linspace, histogram

bins = linspace(-1, 1, 101)
hist = sum(histogram(chunk, bins=bins)[0] for chunk in
           mytable.chunkiter(myfield))
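(Only element [0] of numpy.histogram's (counts, bin_edges) return value is
summed; since every chunk is binned against the same fixed edges, the
per-chunk counts add up to exactly the full histogram.)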
Preliminary tests seem to indicate that, for a table with 1 column and
10M rows, reading in "chunks" of 10x chunksize gives the best
read-time-per-row. This is perhaps naive as regards chunksize black
magic, though...
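(Roughly the kind of loop one can use to probe that, with 'mytable' and
'myfield' standing in for the real table and column, and the multiples chosen
arbitrarily:)

import time

rowchunk = mytable.chunkshape[0]                  # rows per HDF5 chunk
for k in (1, 2, 5, 10, 20, 50):                   # chunk multiples to try
    t0 = time.time()
    for lo in range(0, mytable.nrows, k * rowchunk):
        mytable.read(lo, min(lo + k * rowchunk, mytable.nrows), field=myfield)
    print(k, (time.time() - t0) / mytable.nrows)  # seconds per row at this multiple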
And of course, if implemented by numexpr, it could benefit from the nice
automatic multithreading there.
Also, I might dig in a bit and see about extending the "field" argument
to read so it can read multiple fields at once (to do N-dimensional
histograms), as you suggested in a previous mail some months ago.
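(Once read()/chunkiter can hand back several fields at once, the same
accumulate-per-chunk trick should carry straight over to numpy.histogramdd;
a rough sketch, with the field names 'x' and 'y' purely hypothetical:)

import numpy as np

bins = [np.linspace(-1, 1, 101), np.linspace(0, 10, 51)]
hist = None
for chunk in mytable.chunkiter():                 # one structured array per chunk
    sample = np.column_stack([chunk['x'], chunk['y']])
    counts, _ = np.histogramdd(sample, bins=bins)
    hist = counts if hist is None else hist + counts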
Best Regards,
Jon