From: Anthony S. <sc...@gm...> - 2012-06-06 16:07:49
On Wed, Jun 6, 2012 at 10:24 AM, Jon Wilson <js...@fn...> wrote:

> Hi Anthony,
>
> On 06/06/2012 12:45 AM, Anthony Scopatz wrote:
>
>>> I think something like
>>> histogram(tables.Expr('col0 + col1**2', mytable.where('col2 > 15 & abs(col3) < 5')).eval())
>>> would be ideal, but since where() returns a row iterator, and not
>>> something that I can extract Column objects from, I don't see any way to
>>> make it work.
>>
>> You are probably looking for the readWhere() method
>> <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere>
>> which normally returns a numpy structured array. The line you are looking
>> for is thus:
>>
>> histogram(tables.Expr('col0 + col1**2', mytable.readWhere('col2 > 15 & abs(col3) < 5')).eval())
>>
>> This will likely be fast in both cases. I hope this helps.
>
> Oddly, it doesn't work with tables.Expr, but does work with
> numexpr.evaluate. In the case I talked about before with 7M rows, when
> selecting very few rows, it does just fine (between the other two
> solutions), but when selecting all rows, it is still about 2.75x slower
> than the technique of using tables.Expr for both the histogram var and the
> condition.
>
> I think that this is because .readWhere() pulls all the table rows
> satisfying the where condition into memory first, and it furthermore does
> so for all columns of all selected rows, so, for a table with many columns,
> it has to read many times as much data into memory.

Yes that is correct, it does have to read the data into memory.

> I can use the field parameter, but it only accepts one single field, so I
> would have to perform the query once per variable used in the histogram
> variable expression to do that.
>
> Using .readWhere() gives a medium-fast performance in both cases, but I
> still feel like it is not quite the right thing because it reads the data
> completely into memory instead of allowing the computation to be performed
> out-of-core.
> Perhaps it is not really feasible,

Well I think the issue at hand is that you are trying to support two
disparate cases with one expression: sparse and dense selection. We have
tools for dealing with these cases individually and performing out-of-core
calculations. And if you know a priori which case you are going to fall
into, you can do the right thing. So without doing anything special, I
think medium-fast is probably the best and easiest thing that you can
expect right now. (Though I would be delighted to be proved wrong on this
point.)

> but I think the ideal would be to have a .where type query operator that
> returns Column objects or a Table object, with a "view" imposed in either
> case.

We are very open to pull requests if you come up with an implementation of
this that you like more ;).

Be Well
Anthony

> Regards,
> Jon
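
[Editor's sketch] The out-of-core idea discussed in this thread can be illustrated with plain NumPy, without relying on any particular PyTables call. The function below is a hypothetical helper (`chunked_histogram`, `chunk_rows`, and the `columns` dict are names invented for this sketch, not part of PyTables). It assumes each column can be sliced by row range, as NumPy arrays can, and that the bin edges are fixed in advance so that per-chunk counts simply add up. It is one possible approach under those assumptions, not the method either poster used.

```python
import numpy as np


def chunked_histogram(columns, edges, chunk_rows=100_000):
    """Histogram col0 + col1**2 over rows where (col2 > 15) & (|col3| < 5).

    `columns` is a dict of equal-length, sliceable columns (NumPy arrays in
    this sketch). Only `chunk_rows` rows are held in memory at a time.
    """
    n_rows = len(columns["col0"])
    counts = np.zeros(len(edges) - 1, dtype=np.int64)
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        # Read one chunk of only the columns the two expressions need.
        c0 = np.asarray(columns["col0"][start:stop])
        c1 = np.asarray(columns["col1"][start:stop])
        c2 = np.asarray(columns["col2"][start:stop])
        c3 = np.asarray(columns["col3"][start:stop])
        # Selection condition from the thread, with explicit parentheses.
        mask = (c2 > 15) & (np.abs(c3) < 5)
        values = c0[mask] + c1[mask] ** 2
        # With fixed bin edges, per-chunk counts are additive.
        counts += np.histogram(values, bins=edges)[0]
    return counts


# Small synthetic table standing in for the real 7M-row one.
rng = np.random.default_rng(0)
n = 10_000
cols = {
    "col0": rng.normal(size=n),
    "col1": rng.normal(size=n),
    "col2": rng.uniform(0, 30, size=n),
    "col3": rng.uniform(-10, 10, size=n),
}
edges = np.linspace(-5.0, 20.0, 26)

out_of_core = chunked_histogram(cols, edges, chunk_rows=1_000)

# Reference result computed with everything in memory at once.
sel = (cols["col2"] > 15) & (np.abs(cols["col3"]) < 5)
in_memory = np.histogram(cols["col0"][sel] + cols["col1"][sel] ** 2, bins=edges)[0]
```

Because the work per chunk is the same whether the mask keeps few rows or all of them, this kind of loop handles both the sparse and the dense case without holding the full selection in memory. It still reads every row of the four columns it needs.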