From: Jon W. <js...@fn...> - 2012-06-05 23:31:17
Hi all,

In looking through the docs, I see two very nice features: the .where() query method, and the tables.Expr computation mechanism. But it doesn't appear to be possible to combine the two. It appears that, if I want to compute some function of my columns, but only for certain rows, I have two options:

- I can use tables.Expr to compute the function, and then filter the results in Python
- I can use mytable.where() to select the rows I'm interested in, and then compute the function in Python

Am I missing anything? Is it possible to perform fast out-of-core computations with numexpr, but only on a subset of the existing rows?

Regards,
Jon Wilson
From: Anthony S. <sc...@gm...> - 2012-06-06 02:46:17
Hello Jon,

I believe that the where() method just uses the Expr / numexpr functionality under the covers. Anything that you can do in Expr you should be able to do in where(). Can you provide a short example where this is not the case?

Be Well
Anthony

------------------------------------------------------------------------------
Live Security Virtual Conference
Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
_______________________________________________
Pytables-users mailing list
Pyt...@li...
https://lists.sourceforge.net/lists/listinfo/pytables-users
From: Jon W. <js...@fn...> - 2012-06-06 03:32:07
Hi Anthony,

Allow me to clarify. I wish to perform a reduction (histogramming, specifically) over a function of some values, but only including certain rows. For instance, say I have a table with four columns, col0 through col3. I would like to create a histogram of col0 + col1**2, but only where col2 > 15 and abs(col3) < 5.

As far as I understand, I can do the following:

    histogram(array([row['col0'] + row['col1']**2 for row in mytable.where('(col2 > 15) & (abs(col3) < 5)')]))

And this does produce the desired histogram. However, mytable.where() returns an iterator over rows, and then the list comprehension computes col0 + col1**2 for each row in Python space, which lacks the optimization and multithreading of the numexpr kernel.

It seems as though it should be possible to have both the condition and the histogramming variable (col0 + col1**2) computed in the parallelized and optimized numexpr kernel, but I do not see a way to do this using where().

The alternative that I can see would be to do something like:

    histvar = tables.Expr('col0 + col1**2', vars(mytable.cols)).eval()
    selection = tables.Expr('(col2 > 15) & (abs(col3) < 5)', vars(mytable.cols)).eval()
    histogram(histvar, weights=selection)

This should produce the same histogram as above, and it does compute both the histogram variable and the query condition in the numexpr kernel, but it requires the computation of the histogram variable even for rows I do not wish to include in the histogram. If the table is very large and relatively few rows are selected, or if computing the histogram variable is expensive, this is quite undesirable.

So, it seems that I can either a) use the fast query operator where(); or, b) perform all computation in numexpr. But not both.

FWIW, a quick timeit test shows that, on a table with ~1M rows, for a very simple condition and a very simple histogram variable, the first method is faster than the second method even when all rows are selected.

For a table with ~7M rows, for a more complex histogram variable and still a very simple condition, the first method is faster than the second method when only a few rows are selected, but when all rows are selected, the second method is more than 10x faster. (2.16s vs 3.27s for few rows, 43.1s vs 3.19s for all 7M rows)

So it is clear that in some cases, method 2 could be sped up substantially, and in other cases, method 1 could be sped up enormously.

I think something like

    histogram(tables.Expr('col0 + col1**2', mytable.where('(col2 > 15) & (abs(col3) < 5)')).eval())

would be ideal, but since where() returns a row iterator, and not something that I can extract Column objects from, I don't see any way to make it work.

So, am I missing some way to compute the histogram variable in the numexpr kernel, but only for rows I'm interested in?

Regards,
Jon
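The claim that the two methods above yield the same histogram can be checked on a small in-memory example. The following is only a sketch using plain NumPy arrays as a stand-in for the table columns; the random data, the bin count, and the value range are assumptions made for illustration and are not from the thread. The boolean selection is cast to float before being passed as weights, so that the weights are treated as ordinary numbers.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
# In-memory stand-ins for the four table columns (hypothetical data).
col0, col1, col2, col3 = (rng.uniform(-20.0, 20.0, n) for _ in range(4))

histvar = col0 + col1**2
selection = (col2 > 15) & (np.abs(col3) < 5)

# Fixed binning so both methods use identical bin edges.
bins, value_range = 20, (-20.0, 420.0)

# Method 1: select first, then histogram only the selected values.
counts_selected, edges = np.histogram(
    histvar[selection], bins=bins, range=value_range
)

# Method 2: histogram every row, giving unselected rows a weight of zero.
counts_weighted, _ = np.histogram(
    histvar, bins=bins, range=value_range, weights=selection.astype(np.float64)
)
```

Both calls bin the same values with the same edges, so the counts agree; the difference is only in how much work is done for rows that end up being discarded.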
From: Anthony S. <sc...@gm...> - 2012-06-06 05:45:40
On Tue, Jun 5, 2012 at 10:32 PM, Jon Wilson <js...@fn...> wrote:
[snip]
> I think something like
>     histogram(tables.Expr('col0 + col1**2', mytable.where('(col2 > 15) & (abs(col3) < 5)')).eval())
> would be ideal, but since where() returns a row iterator, and not something that I can extract Column objects from, I don't see any way to make it work.

You are probably looking for the readWhere() method <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere>, which normally returns a NumPy structured array. The line you are looking for is thus:

    histogram(tables.Expr('col0 + col1**2', mytable.readWhere('(col2 > 15) & (abs(col3) < 5)')).eval())

This will likely be fast in both cases. I hope this helps.

Be Well
Anthony
From: Jon W. <js...@fn...> - 2012-06-06 15:24:44
Hi Anthony,

On 06/06/2012 12:45 AM, Anthony Scopatz wrote:
> You are probably looking for the readWhere() method <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere>, which normally returns a NumPy structured array. The line you are looking for is thus:
>
>     histogram(tables.Expr('col0 + col1**2', mytable.readWhere('(col2 > 15) & (abs(col3) < 5)')).eval())
>
> This will likely be fast in both cases. I hope this helps.

Oddly, it doesn't work with tables.Expr, but does work with numexpr.evaluate. In the case I talked about before with 7M rows, when selecting very few rows, it does just fine (between the other two solutions), but when selecting all rows, it is still about 2.75x slower than the technique of using tables.Expr for both the histogram var and the condition.

I think that this is because .readWhere() pulls all the table rows satisfying the where condition into memory first, and it furthermore does so for all columns of all selected rows. So, for a table with many columns, it has to read many times as much data into memory.

I can use the field parameter, but it only accepts a single field, so I would have to perform the query once per variable used in the histogram variable expression to do that.

Using .readWhere() gives medium-fast performance in both cases, but I still feel like it is not quite the right thing, because it reads the data completely into memory instead of allowing the computation to be performed out-of-core. Perhaps it is not really feasible, but I think the ideal would be to have a .where type query operator that returns Column objects or a Table object, with a "view" imposed in either case.

Regards,
Jon
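The combination described in this message (select with readWhere(), then evaluate the histogram variable with numexpr.evaluate on the returned structured array) might look roughly like the sketch below. It assumes PyTables 3.x, where readWhere() is spelled read_where(); the helper name selected_histogram, the throwaway file, the random data, and the binning are all hypothetical and only serve to make the example self-contained.

```python
import os
import tempfile

import numexpr as ne
import numpy as np
import tables


def selected_histogram(table, condition, expression, bins, value_range):
    """Histogram `expression` over the rows of `table` matching `condition`."""
    # read_where() loads every column of the matching rows into memory.
    rows = table.read_where(condition)
    # Expose each field of the structured array to numexpr by name.
    fields = {name: rows[name] for name in rows.dtype.names}
    values = ne.evaluate(expression, local_dict=fields)
    return np.histogram(values, bins=bins, range=value_range)


# Small throwaway table (hypothetical data) to exercise the helper.
rng = np.random.default_rng(0)
data = np.zeros(1000, dtype=[(f"col{i}", "f8") for i in range(4)])
for name in data.dtype.names:
    data[name] = rng.uniform(-20.0, 20.0, size=len(data))

with tempfile.TemporaryDirectory() as tmp:
    with tables.open_file(os.path.join(tmp, "demo.h5"), mode="w") as h5:
        table = h5.create_table("/", "mytable", obj=data)
        counts, edges = selected_histogram(
            table,
            "(col2 > 15) & (abs(col3) < 5)",
            "col0 + col1**2",
            bins=20,
            value_range=(-20.0, 420.0),
        )
```

As noted in the message, this reads all columns of the selected rows into memory at once, so it trades the out-of-core property for a single query.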
From: Anthony S. <sc...@gm...> - 2012-06-06 16:07:49
On Wed, Jun 6, 2012 at 10:24 AM, Jon Wilson <js...@fn...> wrote:
> Oddly, it doesn't work with tables.Expr, but does work with numexpr.evaluate. In the case I talked about before with 7M rows, when selecting very few rows, it does just fine (between the other two solutions), but when selecting all rows, it is still about 2.75x slower than the technique of using tables.Expr for both the histogram var and the condition.
>
> I think that this is because .readWhere() pulls all the table rows satisfying the where condition into memory first, and it furthermore does so for all columns of all selected rows, so, for a table with many columns, it has to read many times as much data into memory.

Yes, that is correct, it does have to read the data into memory.

> I can use the field parameter, but it only accepts one single field, so I would have to perform the query once per variable used in the histogram variable expression to do that.
>
> Using .readWhere() gives a medium-fast performance in both cases, but I still feel like it is not quite the right thing because it reads the data completely into memory instead of allowing the computation to be performed out-of-core. Perhaps it is not really feasible,

Well, I think the issue at hand is that you are trying to support two disparate cases with one expression: sparse and dense selection. We have tools for dealing with these cases individually and performing out-of-core calculations. And if you know a priori which case you are going to fall into, you can do the right thing. So without doing anything special, I think medium-fast is probably the best and easiest thing that you can expect right now. (Though I would be delighted to be proved wrong on this point.)

> but I think the ideal would be to have a .where type query operator that returns Column objects or a Table object, with a "view" imposed in either case.

We are very open to pull requests if you come up with an implementation of this that you like more ;).

Be Well
Anthony
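One way to keep memory bounded regardless of whether the selection turns out sparse or dense is to walk the data in fixed-size chunks, select within each chunk, and accumulate the histogram as you go. The sketch below is not a PyTables feature and not something proposed in the thread; histogram_in_chunks and its parameters are hypothetical names. It uses plain NumPy for both the selection and the computation, and it assumes only that the source supports len() and slicing to a NumPy structured array, so the demonstration runs on an in-memory array.

```python
import numpy as np


def histogram_in_chunks(source, select, compute, bins, value_range,
                        chunk_rows=100_000):
    """Accumulate a histogram of compute(rows) over rows where select(rows) is true.

    `source` is anything that supports len() and slicing to a NumPy
    structured array; only one chunk is held in memory at a time.
    """
    # Fix the bin edges up front so every chunk is binned identically.
    edges = np.linspace(value_range[0], value_range[1], bins + 1)
    counts = np.zeros(bins, dtype=np.int64)
    for start in range(0, len(source), chunk_rows):
        chunk = source[start:start + chunk_rows]
        kept = chunk[select(chunk)]  # discard unselected rows before computing
        if len(kept):
            counts += np.histogram(compute(kept), bins=edges)[0]
    return counts, edges


# In-memory stand-in for a table (hypothetical data).
rng = np.random.default_rng(1)
data = np.zeros(5_000, dtype=[(f"col{i}", "f8") for i in range(4)])
for name in data.dtype.names:
    data[name] = rng.uniform(-20.0, 20.0, len(data))

counts, edges = histogram_in_chunks(
    data,
    select=lambda c: (c["col2"] > 15) & (np.abs(c["col3"]) < 5),
    compute=lambda c: c["col0"] + c["col1"] ** 2,
    bins=20,
    value_range=(-20.0, 420.0),
    chunk_rows=1_000,
)
```

The histogram variable is computed only for selected rows, but every chunk is still read in full, so this does not benefit from any indexing that where() may use for sparse selections.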
From: Jon W. <js...@fn...> - 2012-06-06 16:18:33
Hi Anthony,

> Well I think the issue at hand is that you are trying to support two disparate cases with one expression: sparse and dense selection. We have tools for dealing with these cases individually and performing out-of-core calculations. And if you know a priori which case you are going to fall into, you can do the right thing. So without doing anything special, I think medium-fast is probably the best and easiest thing that you can expect right now. (Though I would be delighted to be proved wrong on this point.)

True enough. Sometimes I can't know anything a priori about the density of the selection, of course. And I'm happy to worry about the internals, but some of my colleagues, not so much ;)

> > but I think the ideal would be to have a .where type query operator that returns Column objects or a Table object, with a "view" imposed in either case.
>
> We are very open to pull requests if you come up with an implementation of this that you like more ;).

Very fair. We'll see if I can get to it. Is there any sort of guide-to-the-source to help me get started when and if that happens? I guess just the reference guide?

I'll have a lot to learn before I can contribute usefully, I'm sure. I don't know enough about the implementation details yet to know: would a selection make the out-of-core performance gains from chunking and other things moot because you'd have to skip around too much?

Regards,
Jon
From: Anthony S. <sc...@gm...> - 2012-06-06 18:53:30
On Wed, Jun 6, 2012 at 11:18 AM, Jon Wilson <js...@fn...> wrote:
> True enough. Sometimes I can't know anything a priori about the density of the selection, of course. And I'm happy to worry about the internals, but some of my colleagues, not so much ;)

I fully understand your position!

> Very fair. We'll see if I can get to it. Is there any sort of guide-to-the-source to help me get started when and if that happens? I guess just the reference guide?

Not really as such. There is this list, and then there is pyt...@go... for development-specific questions and issues.

> I'll have a lot to learn before I can contribute usefully, I'm sure. I don't know enough about the implementation details yet to know: would a selection make the out-of-core performance gains from chunking and other things moot because you'd have to skip around too much?

Basically, I would say that if you were to write a function that does what you described, we could take care of / help with the PyTables integration issues. You could start by looking at the code for Expr [1] and seeing if you could track down the one field issue. Or maybe there is a way to add this functionality easily...

Don't hesitate to ask if you run into problems.

Be Well
Anthony