Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

Hi Anthony,

On 06/06/2012 12:45 AM, Anthony Scopatz wrote:
>
>     I think something like
>     histogram(tables.Expr('col0 + col1**2', mytable.where('col2 > 15 &
>     abs(col3) < 5')).eval())
>     would be ideal, but since where() returns a row iterator, and not
>     something that I can extract Column objects from, I don't see any
>     way to make it work.
>
>
> You are probably looking for the readWhere() method 
> <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere> which 
> normally returns a numpy structured array.  The line you are looking 
> for is thus:
>
> histogram(tables.Expr('col0 + col1**2', mytable.readWhere('col2 > 15 & 
> abs(col3) < 5')).eval())
>
> This will likely be fast in both cases.  I hope this helps.

Oddly, it doesn't work with tables.Expr, but it does work with
numexpr.evaluate.  In the 7M-row case I described earlier, it does fine
when selecting very few rows (its speed falls between the other two
solutions), but when selecting all rows it is still about 2.75x slower
than using tables.Expr for both the histogram variable and the
condition.

I think this is because .readWhere() first pulls every row that
satisfies the where condition into memory, and it does so for all
columns of those rows.  For a table with many columns, that means
reading many times more data than the expression needs.  I can use the
field parameter, but it accepts only a single field, so I would have to
run the query once for each variable used in the histogram expression.
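
To put a rough number on "many times as much data", here is a
back-of-the-envelope sketch.  The 7M row count is from the case above.
The table width (20 columns) and the float64 column type are assumptions
made only for illustration, and the figures are the worst case where the
condition selects every row.

```python
import numpy as np

n_rows = 7_000_000  # row count from the case discussed in this thread
n_cols = 20         # hypothetical table width, not stated in the thread

# Assumed layout: every column is float64 (8 bytes per value).
full_row = np.dtype([("col%d" % i, "f8") for i in range(n_cols)])

# Only col0 and col1 appear in the histogram expression col0 + col1**2.
needed_row = np.dtype([("col0", "f8"), ("col1", "f8")])

full_bytes = n_rows * full_row.itemsize      # all columns of all rows
needed_bytes = n_rows * needed_row.itemsize  # just the expression inputs

print("all columns   : %.0f MB" % (full_bytes / 1e6))
print("needed columns: %.0f MB" % (needed_bytes / 1e6))
print("ratio         : %dx" % (full_bytes // needed_bytes))
```

Under these assumptions the full-row read moves ten times as many bytes
as the two columns the expression actually uses.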

Using .readWhere() gives medium-fast performance in both cases, but it
still doesn't feel like quite the right thing, because it reads the
data completely into memory instead of allowing the computation to be
performed out-of-core.  Perhaps it is not really feasible, but I think
the ideal would be a .where-style query operator that returns Column
objects or a Table object, with a "view" imposed in either case.
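
For reference, one way to approximate the out-of-core behaviour by hand
is to walk the table in fixed-size chunks and accumulate the histogram
as you go.  This is only a sketch, not an existing PyTables feature:
chunked_histogram and chunk_rows are hypothetical names, the bin edges
must be fixed up front, and the in-memory structured array below merely
stands in for a real table.  Since a PyTables Table can also be sliced
like table[start:stop], the same function should work on one, but I
have not tested that here.

```python
import numpy as np

def chunked_histogram(table, edges, chunk_rows=100_000):
    """Histogram of col0 + col1**2 over rows with col2 > 15 and |col3| < 5.

    Only chunk_rows rows are held in memory at a time.
    """
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, len(table), chunk_rows):
        chunk = table[start:start + chunk_rows]  # one slice of rows
        mask = (chunk["col2"] > 15) & (np.abs(chunk["col3"]) < 5)
        values = chunk["col0"][mask] + chunk["col1"][mask] ** 2
        # Fixed edges make the per-chunk counts additive.
        counts += np.histogram(values, bins=edges)[0]
    return counts

# Stand-in data: a small structured array with made-up values.
rng = np.random.default_rng(0)
n = 10_000
demo = np.zeros(n, dtype=[("col0", "f8"), ("col1", "f8"),
                          ("col2", "f8"), ("col3", "f8")])
demo["col0"] = rng.normal(size=n)
demo["col1"] = rng.normal(size=n)
demo["col2"] = rng.uniform(0, 30, size=n)
demo["col3"] = rng.uniform(-10, 10, size=n)

edges = np.linspace(-5.0, 20.0, 26)
chunked = chunked_histogram(demo, edges, chunk_rows=1_000)
```

This still reads every column of each chunk, so it bounds memory use but
does not avoid the extra I/O on wide tables.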

Regards,
Jon