Re: [Pytables-users] Combining in-kernel queries with out-of-core computations


On Wed, Jun 6, 2012 at 10:24 AM, Jon Wilson <js...@fn...> wrote:

>  Hi Anthony,
>
>
> On 06/06/2012 12:45 AM, Anthony Scopatz wrote:
>
>
>> I think something like
>> histogram(tables.Expr('col0 + col1**2', mytable.where('col2 > 15 &
>> abs(col3) < 5')).eval())
>> would be ideal, but since where() returns a row iterator, and not
>> something that I can extract Column objects from, I don't see any way to
>> make it work.
>>
>
>  You are probably looking for the readWhere() method <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere>, which
> normally returns a numpy structured array.  The line you are looking for is
> thus:
>
>  histogram(tables.Expr('col0 + col1**2', mytable.readWhere('col2 > 15 &
> abs(col3) < 5')).eval())
>
>  This will likely be fast in both cases.  I hope this helps.
>
>
> Oddly, it doesn't work with tables.Expr, but does work with
> numexpr.evaluate.  In the case I talked about before with 7M rows, when
> selecting very few rows, it does just fine (between the other two
> solutions), but when selecting all rows, it is still about 2.75x slower
> than the technique of using tables.Expr for both the histogram var and the
> condition.
>
> I think that this is because .readWhere() pulls all the table rows
> satisfying the where condition into memory first, and it furthermore does
> so for all columns of all selected rows, so, for a table with many columns,
> it has to read many times as much data into memory.
>

Yes, that is correct: it does have to read the data into memory.
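
One way to keep memory use bounded is to walk the table in fixed-size blocks
and add up the per-block histograms; with fixed bin edges, the block counts
sum to the histogram of the full selection.  What follows is only a sketch
under assumptions: a NumPy structured array stands in for the table, the names
blockwise_histogram and block_rows are made up for illustration, and with a
real table each block would have to come from a sliced read rather than an
in-memory slice.

```python
import numpy as np


def blockwise_histogram(table, edges, block_rows=100_000):
    """Histogram col0 + col1**2 over selected rows, one block at a time."""
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, len(table), block_rows):
        # Stand-in for reading rows [start, start + block_rows) from disk.
        block = table[start:start + block_rows]
        # Same selection as in the thread: col2 > 15 and abs(col3) < 5.
        mask = (block["col2"] > 15) & (np.abs(block["col3"]) < 5)
        values = block["col0"][mask] + block["col1"][mask] ** 2
        # Fixed edges make the per-block counts additive.
        counts += np.histogram(values, bins=edges)[0]
    return counts


# Hypothetical demo data, not taken from the thread.
rng = np.random.default_rng(0)
dtype = [("col0", "f8"), ("col1", "f8"), ("col2", "f8"), ("col3", "f8")]
table = np.zeros(10_000, dtype=dtype)
for name in table.dtype.names:
    table[name] = rng.uniform(-10.0, 30.0, size=len(table))

edges = np.linspace(-10.0, 930.0, 48)
counts = blockwise_histogram(table, edges, block_rows=1_024)
```

Only one block plus the bin counts is held at a time, so the peak memory is
set by block_rows rather than by the number of selected rows.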

>  I can use the field parameter, but it only accepts a single field, so I
> would have to run the query once for each column used in the histogram
> variable expression to do that.
>
> Using .readWhere() gives a medium-fast performance in both cases, but I
> still feel like it is not quite the right thing because it reads the data
> completely into memory instead of allowing the computation to be performed
> out-of-core.  Perhaps it is not really feasible,
>

Well, I think the issue at hand is that you are trying to support two
disparate cases with one expression: sparse and dense selection.  We have
tools for dealing with each of these cases individually while performing
out-of-core calculations.  And if you know a priori which case you are going
to fall into, you can do the right thing.  So, without doing anything special,
I think medium-fast is probably the best and easiest thing that you can
expect right now.  (Though I would be delighted to be proved wrong on this
point.)
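
For the sparse case, a row iterator such as the one where() returns can feed
a histogram directly, so that nothing but the bin counts stays in memory.  A
minimal sketch of that idea follows, assuming equal-width bins.  A plain
generator of dicts stands in for the row iterator; streaming_histogram,
fake_rows, and selected_values are hypothetical names; and the condition is
applied in Python here rather than in-kernel.

```python
def streaming_histogram(values, lo, hi, nbins):
    """Accumulate equal-width bin counts from an iterable, one value at a time."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if v < lo or v > hi:
            continue  # values outside [lo, hi] are dropped
        # The right edge hi is counted in the last bin.
        i = min(int((v - lo) / width), nbins - 1)
        counts[i] += 1
    return counts


def fake_rows(n):
    """Stand-in for a table: yields row-like dicts (made-up data)."""
    for k in range(n):
        yield {
            "col0": k % 10,
            "col1": (k % 4) * 0.5,
            "col2": k % 30,
            "col3": (k % 13) - 6,
        }


def selected_values(rows):
    """Apply the thread's condition and expression row by row."""
    for row in rows:
        if row["col2"] > 15 and abs(row["col3"]) < 5:
            yield row["col0"] + row["col1"] ** 2


counts = streaming_histogram(selected_values(fake_rows(10_000)), 0.0, 12.0, 6)
```

Because both selected_values and fake_rows are generators, no intermediate
array of selected rows is ever built; the cost is a Python-level loop per row,
which is why this shape suits sparse selections better than dense ones.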

> but I think the ideal would be to have a .where type query operator that
> returns Column objects or a Table object, with a "view" imposed in either
> case.
>

We are very open to pull requests if you come up with an implementation of
this that you like more ;).

Be Well
Anthony

> Regards,
> Jon
>
>