From: Jon W. <js...@fn...> - 2012-06-06 15:24:44
Hi Anthony,

On 06/06/2012 12:45 AM, Anthony Scopatz wrote:
> > I think something like
> >
> >     histogram(tables.Expr('col0 + col1**2', mytable.where('(col2 > 15) & (abs(col3) < 5)')).eval())
> >
> > would be ideal, but since where() returns a row iterator, and not
> > something that I can extract Column objects from, I don't see any
> > way to make it work.
>
> You are probably looking for the readWhere() method
> <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere>
> which normally returns a numpy structured array. The line you are
> looking for is thus:
>
>     histogram(tables.Expr('col0 + col1**2', mytable.readWhere('(col2 > 15) & (abs(col3) < 5)')).eval())
>
> This will likely be fast in both cases. I hope this helps.

Oddly, it doesn't work with tables.Expr, but does work with
numexpr.evaluate.

In the case I talked about before with 7M rows, it does just fine when
selecting very few rows (its speed falls between the other two
solutions). When selecting all rows, it is still about 2.75x slower than
the technique of using tables.Expr for both the histogram variable and
the condition.

I think that this is because .readWhere() first pulls all the table rows
satisfying the where condition into memory, and it does so for all
columns of every selected row. For a table with many columns, it
therefore has to read many times as much data into memory. I can use the
field parameter, but it accepts only a single field, so I would have to
run the query once per variable used in the histogram variable
expression.

Using .readWhere() gives medium-fast performance in both cases, but I
still feel it is not quite the right thing, because it reads the data
completely into memory instead of allowing the computation to be
performed out-of-core.

Perhaps it is not really feasible, but I think the ideal would be a
.where-type query operator that returns Column objects or a Table
object, with a "view" imposed in either case.

Regards,
Jon
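
As a rough illustration of the out-of-core idea mentioned above, the selection and the histogram variable can be evaluated one block of rows at a time, so that only one block is in memory at once. This is a minimal sketch and not a PyTables feature: the function name chunked_histogram, the chunk size, and the bin edges are all made up for the example, and a plain numpy structured array stands in for the table (the sketch only assumes the object supports len() and slicing that returns a structured array). Each block still contains every column, so this bounds memory use but does not reduce the amount of data read.

```python
import numpy as np


def chunked_histogram(table, bin_edges, chunk_size=100_000):
    """Histogram of col0 + col1**2 over rows with col2 > 15 and |col3| < 5.

    Hypothetical helper: `table` is anything that supports len() and
    slicing into a numpy structured array with fields col0..col3.
    """
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for start in range(0, len(table), chunk_size):
        # Only this block of rows is held in memory at a time.
        chunk = table[start:start + chunk_size]
        mask = (chunk["col2"] > 15) & (np.abs(chunk["col3"]) < 5)
        values = chunk["col0"][mask] + chunk["col1"][mask] ** 2
        # Fixed bin edges make the per-block counts additive.
        chunk_counts, _ = np.histogram(values, bins=bin_edges)
        counts += chunk_counts
    return counts


# Synthetic stand-in data (assumed column names and value ranges).
rng = np.random.default_rng(0)
n = 10_000
table = np.zeros(n, dtype=[("col0", "f8"), ("col1", "f8"),
                           ("col2", "f8"), ("col3", "f8")])
table["col0"] = rng.normal(size=n)
table["col1"] = rng.uniform(0, 5, size=n)
table["col2"] = rng.uniform(0, 30, size=n)
table["col3"] = rng.normal(scale=5, size=n)

bin_edges = np.linspace(0, 50, 26)
counts = chunked_histogram(table, bin_edges, chunk_size=1_000)
```

Because the bin edges are fixed up front, summing the per-block counts gives the same result as histogramming all selected rows at once.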