From: Anthony S. <sc...@gm...> - 2012-04-13 16:29:40
|
On Fri, Apr 13, 2012 at 6:41 AM, Alvaro Tejero Cantero <al...@mi...>wrote: > Hi Anthony, > > > I can see how the virtual table interface could be made to work with > > PyTables, > > but I guess I don't understand why you would want to. It seems like in > this > > case you are querying using SQL rather than the more expressive Python. > > Yes, you'd be querying using SQL. > SQL is a documented declarative syntax for queries over relations. > Python offers many procedural routes to achieve e.g. joins, all of > them custom. If (a == b) | (c==d) is more expressive to you than > WHERE a=b OR c=d , then you can use SQLAlchemy [1], which wraps SQL in > a Pythonic query syntax. > Hello Alvaro, I am quite familiar with SQL and SQLAlchemy, having used these tools both personally and professionally. My initial question was not "What are you trying to do?" but rather "Why would you want to do it?" > > Moreover, you'd be sacrificing all of the 'H' in HDF5 features to obtain > > this. > [snip] > How does hierarchical help here? do you create a 'singer_name'/song > table? or a 'genre name'/song ?. Most of the time the physical layout > in the form of a hierarchy is just an annoyance. > I have to say that I disagree. The hierarchical features make it so that the data maps well to both Python objects and file systems. I feel that both of these are more natural to work with than having to construct a query of joins, groupbys, etc which reconstructs my data. So while this is just my opinion, I feel that hierarchies are much more natural to work with. > > > Also, my sense is that there would be a fair bit of overhead in this > > interface > > layer, which might not get you the speed boost you desire. I could be > wrong > > about this though. > > I think you're right in the wrapping of the results via the Python > interface to SQLite. I suspect you're not about the queries executed > in the virtual table, because that is left for you to implement and > thus you could turn the query terms (that are handed over to you) into > in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html) > This was informed by my experience with SQLAlchemy which in some situations added an excessively long computation times. With the PyTables infrastructure, we would at least have the option of writing the performance critical parts in C or Cython... > > If I saw a proof-of-concept implementation, I may grok better the > purpose. > > Do you have any code to share? > > No, but I have an example ER diagram which is only part of what I > need. You are welcome to have a look at it[2] Sorry the text in this image is too small for me to read. > and tell me how you'd > achieve to support the jungle of relationships there with the H of > HDF5. In SQL I have a syntax to declare all those relationships. In > HDF5 I must decide for one hierarchical cut of those relations and > since it won't be enough, implement the relational layer on top of it, > perhaphs using attrs to store paths everywhere. It can be done, but > the support out of the box at this point for this is next to nil > (maybe integrating something like recarray.joinby [5] would be > useful?) > Writing data-specific relational layers for your applications on top of HDF5 with PyTables is not hard (IMHO). Add in the features of NumPy to perform in-memory manipulations and you have pretty much everything that you need. I think this is why we don't have formal implementation of the SQ Language in PyTables. > It looks to me, at this moment, that as soon as the data model gets > complicated HDF5 is in trouble, and as soon as very large, contiguous, > read-only, datasets are involved relational RDBMSs are in trouble > (subsetting, speed). Since this is not a happy situation, several > people are interested in combining the strengths of both [3][4] and my > e-mail was just highlighting that there may be a way to go that may > make a self-contained, clear, understandable package for the scenarios > where PyTables is most often deployed (single-user). > Reference [4] is particularly interesting (they mention PyTables!) and they also propose basically what you are suggesting in their third option (integrated SQL & HDF5). > Or I am not seeing something obvious? > I guess what i don't understand still is why - if you wanted to do this - would you use the SQLite vtabs? This seems to have the worst of the SQL world in terms of vendor lock in, compatibility with other SQL implementations, etc. Instead, why not just write a SQLAlchemy dialect [6] that is backed by PyTables? Yes, this isn't 'self-contained' in that we know have a dependency on SQLAlchemy. However, if done right this would be an *optional* dependency. Are there reasons to not do this that I am missing? I think that including something like this as a subpackage in PyTables is more reasonable than interfacing with SQLite in specific. Thanks for fielding my questions here. Be Well Anthony [6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation > > Cheers, > > Álvaro. > -- > [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html > [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link > to HDF5 data, or other tables with the real measurements, white tables > are computed). > [3] http://www.scidb.org/ > [4] See p.26-29 and 32 > > http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf > [5] > https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826 > > > > Be Well > > Anthony > > > > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero <al...@mi... > > > > wrote: > >> > >> Hi, > >> > >> The topic of introducing some kind of relational management in > >> PyTables comes up with certain frequency. > >> > >> Would it be possible to combine the virtues of RDBMS and hdf5's speed > >> via a mechanism such as SQLite Virtual Tables? > >> > >> http://www.sqlite.org/vtab.html > >> > >> I wonder if the required x* functions could be written for PyTables, > >> or if it being in Python is an obstacle to this kind of interfacing > >> with SQLite. > >> > >> Something like that would be a truly powerful solution in use cases > >> that don't require concurrency. > >> > >> Cheers, > >> > >> -á. > >> > >> > >> > ------------------------------------------------------------------------------ > >> For Developers, A Lot Can Happen In A Second. > >> Boundary is the first to Know...and Tell You. > >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > >> http://p.sf.net/sfu/Boundary-d2dvs2 > >> _______________________________________________ > >> Pytables-users mailing list > >> Pyt...@li... > >> https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > > > > > ------------------------------------------------------------------------------ > > For Developers, A Lot Can Happen In A Second. > > Boundary is the first to Know...and Tell You. > > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > > http://p.sf.net/sfu/Boundary-d2dvs2 > > _______________________________________________ > > Pytables-users mailing list > > Pyt...@li... > > https://lists.sourceforge.net/lists/listinfo/pytables-users > > > > > ------------------------------------------------------------------------------ > For Developers, A Lot Can Happen In A Second. > Boundary is the first to Know...and Tell You. > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE! > http://p.sf.net/sfu/Boundary-d2dvs2 > _______________________________________________ > Pytables-users mailing list > Pyt...@li... > https://lists.sourceforge.net/lists/listinfo/pytables-users > |