From: Jon W. <js...@fn...> - 2012-06-06 03:32:07
Hi Anthony,

Allow me to clarify. I wish to perform a reduction (histogramming, specifically) over a function of some values, but only including certain rows. For instance, say I have a table with four columns, col0 through col3. I would like to create a histogram of col0 + col1**2, but only where col2 > 15 and abs(col3) < 5.

As far as I understand, I can do the following:

    histogram(array([row['col0'] + row['col1']**2
                     for row in mytable.where('(col2 > 15) & (abs(col3) < 5)')]))

And this does produce the desired histogram. However, mytable.where() returns an iterator over rows, and the list comprehension then computes col0 + col1**2 for each row in Python space, which lacks the optimization and multithreading of the numexpr kernel.

It seems as though it should be possible to have both the condition and the histogramming variable (col0 + col1**2) computed in the parallelized and optimized numexpr kernel, but I do not see a way to do this using where().

The alternative that I can see would be to do something like:

    histvar = tables.Expr('col0 + col1**2', vars(mytable.cols)).eval()
    selection = tables.Expr('(col2 > 15) & (abs(col3) < 5)', vars(mytable.cols)).eval()
    histogram(histvar, weights=selection)

This should produce the same histogram as above, and it does compute both the histogram variable and the query condition in the numexpr kernel, but it requires computing the histogram variable even for rows I do not wish to include in the histogram. If the table is very large and relatively few rows are selected, or if computing the histogram variable is expensive, this is quite undesirable.

So it seems that I can either (a) use the fast query method where(), or (b) perform all computation in numexpr, but not both.

FWIW, a quick timeit test shows that, on a table with ~1M rows, for a very simple condition and a very simple histogram variable, the first method is faster than the second method even when all rows are selected.
For a table with ~7M rows, for a more complex histogram variable and still a very simple condition, the first method is faster than the second method when only a few rows are selected, but when all rows are selected, the second method is more than 10x faster. (Method 1 vs method 2: 2.16s vs 3.27s for few rows, 43.1s vs 3.19s for all 7M rows.)

So it is clear that in some cases method 2 could be sped up substantially, and in other cases method 1 could be sped up enormously.

I think something like

    histogram(tables.Expr('col0 + col1**2',
                          mytable.where('(col2 > 15) & (abs(col3) < 5)')).eval())

would be ideal, but since where() returns a row iterator, and not something that I can extract Column objects from, I don't see any way to make it work.

So, am I missing some way to compute the histogram variable in the numexpr kernel, but only for the rows I'm interested in?

Regards,
Jon

On 06/05/2012 09:45 PM, Anthony Scopatz wrote:
> Hello Jon,
>
> I believe that the where() method just uses the Expr / numexpr
> functionality under the covers. Anything that you can do in Expr you
> should be able to do in where(). Can you provide a short example
> where this is not the case?
>
> Be Well
> Anthony
>
> On Tue, Jun 5, 2012 at 6:17 PM, Jon Wilson <js...@fn...> wrote:
>
> Hi all,
> In looking through the docs, I see two very nice features: the .where()
> query method, and the tables.Expr computation mechanism. But it
> doesn't appear to be possible to combine the two. It appears that, if I
> want to compute some function of my columns, but only for certain rows,
> I have two options.
> - I can use tables.Expr to compute the function, and then filter the
>   results in Python
> - I can use mytable.where() to select the rows I'm interested in, and
>   then compute the function in Python
>
> Am I missing anything? Is it possible to perform fast out-of-core
> computations with numexpr, but only on a subset of the existing rows?
> Regards,
> Jon Wilson
>
> _______________________________________________
> Pytables-users mailing list
> Pyt...@li...
> https://lists.sourceforge.net/lists/listinfo/pytables-users
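
One possible middle ground between the two methods discussed in this thread is to work chunk by chunk: evaluate the condition on every row of a chunk, then evaluate the histogram variable only on the rows that pass, and accumulate the counts. The sketch below is an illustration only, not something proposed in the thread. It uses plain NumPy on an in-memory structured array as a stand-in for mytable, and the names selected_histogram, chunk_rows, and bins are hypothetical. With a real PyTables table, the per-chunk selection could presumably come from Table.read_where() with start/stop arguments, and the arithmetic could be handed to numexpr.evaluate(); neither variant is shown or timed here.

```python
import numpy as np


def selected_histogram(table, bins, chunk_rows=100_000):
    """Histogram col0 + col1**2 over rows where col2 > 15 and |col3| < 5.

    `table` is any structured array (hypothetical stand-in for mytable).
    Rows are processed in chunks so memory use stays bounded.
    """
    counts = np.zeros(len(bins) - 1, dtype=np.int64)
    for start in range(0, len(table), chunk_rows):
        chunk = table[start:start + chunk_rows]
        # The condition is evaluated on every row of the chunk.
        mask = (chunk["col2"] > 15) & (np.abs(chunk["col3"]) < 5)
        selected = chunk[mask]
        # The histogram variable is evaluated on the selected rows only.
        histvar = selected["col0"] + selected["col1"] ** 2
        # Fixed bin edges make per-chunk counts additive.
        counts += np.histogram(histvar, bins=bins)[0]
    return counts


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
n = 10_000
table = np.zeros(n, dtype=[("col0", "f8"), ("col1", "f8"),
                           ("col2", "f8"), ("col3", "f8")])
table["col0"] = rng.normal(size=n)
table["col1"] = rng.uniform(0.0, 5.0, size=n)
table["col2"] = rng.uniform(0.0, 30.0, size=n)
table["col3"] = rng.normal(scale=5.0, size=n)

bins = np.linspace(0.0, 50.0, 26)
counts = selected_histogram(table, bins, chunk_rows=1_000)
```

Because the bin edges are fixed up front, summing the per-chunk counts gives the same result as histogramming all selected rows at once. Whether this beats either method from the thread on a real table would need to be measured.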