Thread: [Pytables-users] Combining in-kernel queries with out-of-core computations


[Pytables-users] Combining in-kernel queries with out-of-core computations

From: Jon W. <js...@fn...> - 2012-06-05 23:31:17

Hi all,
In looking through the docs, I see two very nice features: the .where() 
query method, and the tables.Expr computation mechanism.  But, it 
doesn't appear to be possible to combine the two.  It appears that, if I 
want to compute some function of my columns, but only for certain rows, 
I have two options.
  - I can use tables.Expr to compute the function, and then filter the 
results in python
  - I can use mytable.where() to select the rows I'm interested in, and 
then compute the function in python

Am I missing anything?  Is it possible to perform fast out-of-core 
computations with numexpr, but only on a subset of the existing rows?
Regards,
Jon Wilson

Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

From: Anthony S. <sc...@gm...> - 2012-06-06 02:46:17

Hello Jon,

I believe that the where() method just uses the Expr / numexpr
functionality under the covers.  Anything that you can do in Expr you
should be able to do in where().  Can you provide a short example where
this is not the case?

Be Well
Anthony

> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Pytables-users mailing list
> Pyt...@li...
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>

Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

From: Jon W. <js...@fn...> - 2012-06-06 03:32:07

Hi Anthony,
Allow me to clarify.  I wish to perform a reduction (histogramming,
specifically) over a function of some values, but only including certain
rows.  For instance, say I have a table with four columns, col0 through
col3.  I would like to create a histogram of col0 + col1**2, but only
where col2 > 15 and abs(col3) < 5.  As far as I understand, I can do the
following:
histogram(array([row['col0'] + row['col1']**2 for row in
mytable.where('(col2 > 15) & (abs(col3) < 5)')]))

And this does produce the desired histogram.  However, mytable.where()
returns an iterator over rows, and then the list comprehension computes
col0 + col1**2 for each row in python space, which lacks the
optimization and multithreading of the numexpr kernel.  It seems as
though it should be possible to have both the condition and the
histogramming variable (col0 + col1**2) computed in the parallelized and
optimized numexpr kernel, but I do not see a way to do this using where().
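
Note on condition strings of this form: as far as I know, PyTables and numexpr conditions follow Python's operator precedence, in which & binds more tightly than > and <, so each comparison should carry its own parentheses.  The sketch below uses only the standard-library ast module to show how the two spellings parse; the helper name top_level_node is made up for illustration.

```python
import ast


def top_level_node(expr):
    """Return the class name of the outermost AST node of an expression."""
    return type(ast.parse(expr, mode="eval").body).__name__


unparenthesized = "col2 > 15 & abs(col3) < 5"
parenthesized = "(col2 > 15) & (abs(col3) < 5)"

# Without parentheses the whole string is one chained comparison,
# col2 > (15 & abs(col3)) < 5, which is not the intended condition.
print(top_level_node(unparenthesized))  # Compare

# With parentheses the outermost operation is the bitwise AND of two
# separate comparisons.
print(top_level_node(parenthesized))  # BinOp
```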

The alternative that I can see would be to do something like:
histvar = tables.Expr('col0 + col1**2', vars(mytable.cols)).eval()
selection = tables.Expr('(col2 > 15) & (abs(col3) < 5)',
vars(mytable.cols)).eval()
histogram(histvar, weights = selection)

This should produce the same histogram as above, and it does compute
both the histogram variable and the query condition in the numexpr
kernel, but it requires the computation of the histogram variable even
for rows I do not wish to include in the histogram.  If the table is
very large and relatively few rows are selected, or if computing the
histogram variable is expensive, this is quite undesirable.
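
A small NumPy-only sketch of the equivalence claimed here: the weighted call and the filtered call agree bin for bin only when both share explicit bin edges, since otherwise each call derives its range from different data.  The arrays are synthetic stand-ins for the table columns, and the mask is cast to float as a precaution rather than relying on how numpy.histogram treats boolean weights.

```python
import numpy as np

# Synthetic stand-ins for the four table columns.
rng = np.random.default_rng(0)
n = 1000
col0 = rng.uniform(0, 10, n)
col1 = rng.uniform(0, 4, n)
col2 = rng.uniform(0, 30, n)
col3 = rng.normal(0, 5, n)

histvar = col0 + col1**2                      # computed for every row
selection = (col2 > 15) & (np.abs(col3) < 5)  # boolean mask over all rows

# Fixed edges shared by both calls; histvar always lies in [0, 26).
edges = np.linspace(0.0, 26.0, 14)

# Method 2 from the message: every row enters, unselected rows weigh 0.
counts_weighted, _ = np.histogram(
    histvar, bins=edges, weights=selection.astype(float)
)

# Reference: histogram only the selected rows.
counts_selected, _ = np.histogram(histvar[selection], bins=edges)
```

The weighted form still evaluates histvar for all n rows, which is the cost the paragraph above objects to.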

So, it seems that I can either a) use the fast query operator where();
or, b) perform all computation in numexpr.  But not both.

FWIW, a quick timeit test shows that, on a table with ~1M rows, for a
very simple condition and a very simple histogram variable, the first
method is faster than the second method even when all rows are selected.
For a table with ~7M rows, for a more complex histogram variable and
still a very simple condition, the first method is faster than the
second method when only a few rows are selected, but when all rows are
selected, the second method is more than 10x faster.  (2.16s vs 3.27s
for few rows, 43.1s vs 3.19s for all 7M rows)  So it is clear that in
some cases, method 2 could be sped up substantially, and in other cases,
method 1 could be sped up enormously.

I think something like
histogram(tables.Expr('col0 + col1**2', mytable.where('(col2 > 15) &
(abs(col3) < 5)')).eval())
would be ideal, but since where() returns a row iterator, and not
something that I can extract Column objects from, I don't see any way to
make it work.

So, am I missing some way to compute the histogram variable in the
numexpr kernel, but only for rows I'm interested in?
Regards,
Jon


Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

From: Anthony S. <sc...@gm...> - 2012-06-06 05:45:40

On Tue, Jun 5, 2012 at 10:32 PM, Jon Wilson <js...@fn...> wrote:

[snip]


>  I think something like
> histogram(tables.Expr('col0 + col1**2', mytable.where('(col2 > 15) &
> (abs(col3) < 5)')).eval())
> would be ideal, but since where() returns a row iterator, and not
> something that I can extract Column objects from, I don't see any way to
> make it work.
>

You are probably looking for the readWhere() method
<http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere>,
which normally returns a numpy structured array.  The line you are looking
for is thus:

histogram(tables.Expr('col0 + col1**2', mytable.readWhere('(col2 > 15) &
(abs(col3) < 5)')).eval())

This will likely be fast in both cases.  I hope this helps.

Be Well
Anthony



Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

From: Jon W. <js...@fn...> - 2012-06-06 15:24:44

Hi Anthony,

On 06/06/2012 12:45 AM, Anthony Scopatz wrote:
>
>     I think something like
>     histogram(tables.Expr('col0 + col1**2', mytable.where('(col2 > 15) &
>     (abs(col3) < 5)')).eval())
>     would be ideal, but since where() returns a row iterator, and not
>     something that I can extract Column objects from, I don't see any
>     way to make it work.
>
>
> You are probably looking for the readWhere() method 
> <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere> which 
> normally returns a numpy structured array.  The line you are looking 
> for is thus:
>
> histogram(tables.Expr('col0 + col1**2', mytable.readWhere('(col2 > 15) &
> (abs(col3) < 5)')).eval())
>
> This will likely be fast in both cases.  I hope this helps.

Oddly, it doesn't work with tables.Expr, but does work with 
numexpr.evaluate.  In the case I talked about before with 7M rows, when 
selecting very few rows, it does just fine (between the other two 
solutions), but when selecting all rows, it is still about 2.75x slower 
than the technique of using tables.Expr for both the histogram var and 
the condition.

I think that this is because .readWhere() pulls all the table rows 
satisfying the where condition into memory first, and it furthermore 
does so for all columns of all selected rows, so, for a table with many 
columns, it has to read many times as much data into memory.  I can use 
the field parameter, but it only accepts one single field, so I would 
have to perform the query once per variable used in the histogram 
variable expression to do that.
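
One possible way around the single-field limit is to run the query once for row coordinates only, then fetch each needed column at those coordinates.  In PyTables this would presumably map onto Table.getWhereList() followed by Table.readCoordinates(coords, field=...) (get_where_list / read_coordinates in the 3.x naming); that mapping is an assumption and was not tested in this thread.  The sketch below shows the pattern on an in-memory NumPy structured array standing in for the table, with made-up data.

```python
import numpy as np

# In-memory stand-in for a wide table: four used columns plus padding.
rng = np.random.default_rng(1)
table = np.zeros(
    1000,
    dtype=[("col0", "f8"), ("col1", "f8"), ("col2", "f8"),
           ("col3", "f8"), ("unused", "f8", (16,))],
)
for name in ("col0", "col1", "col2", "col3"):
    table[name] = rng.uniform(-20, 20, len(table))

# Evaluate the condition once and keep only the matching row indices.
coords = np.flatnonzero((table["col2"] > 15) & (np.abs(table["col3"]) < 5))

# Fetch just the fields the expression needs, at those rows only.
col0 = table["col0"][coords]
col1 = table["col1"][coords]
histvar = col0 + col1**2
```

This still holds the selected values in memory, so it narrows the read rather than making the computation out-of-core.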

Using .readWhere() gives a medium-fast performance in both cases, but I 
still feel like it is not quite the right thing because it reads the 
data completely into memory instead of allowing the computation to be 
performed out-of-core.  Perhaps it is not really feasible, but I think 
the ideal would be to have a .where type query operator that returns 
Column objects or a Table object, with a "view" imposed in either case.
Regards,
Jon

Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

From: Anthony S. <sc...@gm...> - 2012-06-06 16:07:49

On Wed, Jun 6, 2012 at 10:24 AM, Jon Wilson <js...@fn...> wrote:

> I think that this is because .readWhere() pulls all the table rows
> satisfying the where condition into memory first, and it furthermore does
> so for all columns of all selected rows, so, for a table with many columns,
> it has to read many times as much data into memory.
>

Yes that is correct, it does have to read the data into memory.


>  I can use the field parameter, but it only accepts one single field, so I
> would have to perform the query once per variable used in the histogram
> variable expression to do that.
>
> Using .readWhere() gives a medium-fast performance in both cases, but I
> still feel like it is not quite the right thing because it reads the data
> completely into memory instead of allowing the computation to be performed
> out-of-core.  Perhaps it is not really feasible,
>

Well I think the issue at hand is that you are trying to support two
disparate cases with one expression: sparse and dense selection.  We have
tools for dealing with these cases individually and performing out-of-core
calculations.  And if you know a priori which case you are going to fall
into, you can do the right thing.  So without doing anything special, I
think medium-fast is probably the best and easiest thing that you can
expect right now.  (Though I would be delighted to be proved wrong on this
point.)
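
For reference, a chunked loop is one way to cover both the sparse and the dense case without knowing the selection density in advance.  The sketch below is not PyTables code: chunked_histogram, its arguments, and the chunk size are hypothetical, and plain NumPy arrays stand in for on-disk columns that support slicing.  It evaluates the condition per chunk, skips chunks with no selected rows, and accumulates the histogram against fixed bin edges.

```python
import numpy as np


def chunked_histogram(columns, edges, chunk_rows=4096):
    """Histogram col0 + col1**2 over rows passing the cut, chunk by chunk.

    `columns` maps column names to equal-length sliceable array-likes;
    only one chunk of each column is held in memory at a time.
    """
    nrows = len(columns["col0"])
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, nrows, chunk_rows):
        stop = min(start + chunk_rows, nrows)
        # Evaluate the condition on this chunk only.
        col2 = np.asarray(columns["col2"][start:stop])
        col3 = np.asarray(columns["col3"][start:stop])
        mask = (col2 > 15) & (np.abs(col3) < 5)
        if not mask.any():
            continue  # sparse case: never touch col0/col1 for this chunk
        # Compute the histogram variable for the selected rows only.
        col0 = np.asarray(columns["col0"][start:stop])[mask]
        col1 = np.asarray(columns["col1"][start:stop])[mask]
        chunk_counts, _ = np.histogram(col0 + col1**2, bins=edges)
        counts += chunk_counts
    return counts


# Synthetic stand-in data; real columns would live on disk.
rng = np.random.default_rng(2)
n = 10_000
columns = {
    "col0": rng.uniform(0, 10, n),
    "col1": rng.uniform(0, 4, n),
    "col2": rng.uniform(0, 30, n),
    "col3": rng.normal(0, 5, n),
}
edges = np.linspace(0.0, 26.0, 14)
counts = chunked_histogram(columns, edges, chunk_rows=1024)
```

Because the bin edges are fixed, per-chunk counts simply add up, so the result does not depend on the chunk size.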


> but I think the ideal would be to have a .where type query operator that
> returns Column objects or a Table object, with a "view" imposed in either
> case.
>

We are very open to pull requests if you come up with an implementation of
this that you like more ;).

Be Well
Anthony



Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

From: Jon W. <js...@fn...> - 2012-06-06 16:18:33

Hi Anthony,
> Well I think the issue at hand is that you are trying to support two 
> disparate cases with one expression: sparse and dense selection.  We 
> have tools for dealing with these cases individually 
> and performing out-of-core calculations.  And if you know a priori 
> which case you are going to fall into, you can do the right thing.  So 
> without doing anything special, I think medium-fast is probably the 
> best and easiest thing that you can expect right now.  (Though I would 
> be delighted to be proved wrong on this point.)
True enough.  Sometimes I can't know anything a priori about the density 
of the selection, of course.  And I'm happy to worry about the 
internals, but some of my colleagues, not so much ;)
>
>     but I think the ideal would be to have a .where type query
>     operator that returns Column objects or a Table object, with a
>     "view" imposed in either case.
>
>
> We are very open to pull requests if you come up with an 
> implementation of this that you like more ;).
Very fair.  We'll see if I can get to it.  Is there any sort of 
guide-to-the-source to help me get started when and if that happens?  I 
guess just the reference guide?

I'll have a lot to learn before I can contribute usefully, I'm sure.  I 
don't know enough about the implementation details yet to know: would a 
selection make the out-of-core performance gains from chunking and other 
things moot because you'd have to skip around too much?
Regards,
Jon

Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

From: Anthony S. <sc...@gm...> - 2012-06-06 18:53:30

On Wed, Jun 6, 2012 at 11:18 AM, Jon Wilson <js...@fn...> wrote:

>  Hi Anthony,
>
>  Well I think the issue at hand is that you are trying to support two
> disparate cases with one expression: sparse and dense selection.  We have
> tools for dealing with these cases individually and performing out-of-core
> calculations.  And if you know a priori which case you are going to fall
> into, you can do the right thing.  So without doing anything special, I
> think medium-fast is probably the best and easiest thing that you can
> expect right now.  (Though I would be delighted to be proved wrong on this
> point.)
>
> True enough.  Sometimes I can't know anything a priori about the density
> of the selection, of course.  And I'm happy to worry about the internals,
> but some of my colleagues, not so much ;)
>

I fully understand your position!


>> but I think the ideal would be to have a .where type query operator that
>> returns Column objects or a Table object, with a "view" imposed in either
>> case.
>>
>
>  We are very open to pull requests if you come up with an implementation
> of this that you like more ;).
>
> Very fair.  We'll see if I can get to it.  Is there any sort of
> guide-to-the-source to help me get started when and if that happens?  I
> guess just the reference guide?
>

Not really as such.  There is this list, and then there is
pyt...@go... for development specific questions and issues.


>
> I'll have a lot to learn before I can contribute usefully, I'm sure.  I
> don't know enough about the implementation details yet to know: would a
> selection make the out-of-core performance gains from chunking and other
> things moot because you'd have to skip around too much?
>

Basically, I would say that if you were to write a function that does what
you described, we could take care of / help with the PyTables integration
issues.  You could start by looking at the code for Expr [1] and seeing if
you could track down the one field issue.  Or maybe there is a way to add
this functionality easily...  Don't hesitate to ask if you run into
problems.

Be Well
Anthony

