Re: [Pytables-users] SQLite Virtual Tables

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

On Fri, Apr 13, 2012 at 6:41 AM, Alvaro Tejero Cantero <al...@mi...>wrote:

> Hi Anthony,
>
> > I can see how the virtual table interface could be made to work with
> > PyTables,
> > but I guess I don't understand why you would want to.  It seems like in
> this
> > case you are querying using SQL rather than the more expressive Python.
>
> Yes, you'd be querying using SQL.
> SQL is a documented declarative syntax for queries over relations.
> Python offers many procedural routes to achieve e.g. joins, all of
> them custom. If (a == b) |  (c==d) is more expressive to you than
> WHERE a=b OR c=d , then you can use SQLAlchemy [1], which wraps SQL in
> a Pythonic query syntax.
>

Hello Alvaro,

I am quite familiar with SQL and SQLAlchemy, having used these tools both
personally and professionally.  My initial question was not "What are you
trying to
do?" but rather "Why would you want to do it?"

> > Moreover, you'd be sacrificing all of the 'H' in HDF5 features to obtain
> > this.
>

[snip]

> How does hierarchical help here? do you create a 'singer_name'/song
> table? or a 'genre name'/song ?. Most of the time the physical layout
> in the form of a hierarchy is just an annoyance.
>

I have to say that I disagree.  The hierarchical features make it so that
the data maps well to both Python objects and file systems.  I feel that
both of these are more natural to work with than having to construct a
query of joins, groupbys, etc which reconstructs my data. So while this
is just my opinion, I feel that hierarchies are much more natural to work
with.

>
> > Also, my sense is that there would be a fair bit of overhead in this
> > interface
> > layer, which might not get you the speed boost you desire.  I could be
> wrong
> > about this though.
>
> I think you're right in the wrapping of the results via the Python
> interface to SQLite. I suspect you're not about the queries executed
> in the virtual table, because that is left for you to implement and
> thus you could turn the query terms (that are handed over to you) into
> in-kernel expressions if you so wish (http://www.sqlite.org/vtab.html)
>

This was informed by my experience with SQLAlchemy which in some
situations added an excessively long computation times.  With the PyTables
infrastructure, we would at least have the option of writing the
performance
critical parts in C or Cython...

> > If I saw a proof-of-concept implementation, I may grok better the
> purpose.
> > Do you have any code to share?
>
> No, but I have an example ER diagram which is only part of what I
> need.  You are welcome to have a look at it[2]

Sorry the text in this image is too small for me to read.

> and tell me how you'd
> achieve to support the jungle of relationships there with the H of
> HDF5. In SQL I have a syntax to declare all those relationships. In
> HDF5 I must decide for one hierarchical cut of those relations and
> since it won't be enough, implement the relational layer on top of it,
> perhaphs using attrs to store paths everywhere. It can be done, but
> the support out of the box at this point for this is next to nil
> (maybe integrating something like recarray.joinby [5] would be
> useful?)
>

Writing data-specific relational layers for your applications on top of
HDF5 with PyTables is not hard (IMHO).  Add in the features of NumPy
to perform in-memory manipulations and you have pretty much everything
that you need.  I think this is why we don't have formal implementation
of the SQ Language in PyTables.

> It looks to me, at this moment, that as soon as the data model gets
> complicated HDF5 is in trouble, and as soon as very large, contiguous,
> read-only, datasets are involved relational RDBMSs are in trouble
> (subsetting, speed). Since this is not a happy situation, several
> people are interested in combining the strengths of both [3][4] and my
> e-mail was just highlighting that there may be a way to go that may
> make a self-contained, clear, understandable package for the scenarios
> where PyTables is most often deployed  (single-user).
>

Reference [4] is particularly interesting (they mention PyTables!) and
they also propose basically what you are suggesting in their third option
(integrated SQL & HDF5).

> Or I am not seeing something obvious?
>

I guess what i don't understand still is why - if you wanted to do this -
would
you use the SQLite vtabs?  This seems to have the worst of the SQL world
in terms of vendor lock in, compatibility with other SQL implementations,
etc.

Instead, why not just write a SQLAlchemy dialect [6] that is backed by
PyTables?
Yes, this isn't 'self-contained' in that we know have a dependency on
SQLAlchemy.
However, if done right this would be an *optional* dependency.  Are there
reasons
to not do this that I am missing?  I think that including something like
this as a
subpackage in PyTables is more reasonable than interfacing with SQLite
in specific.

Thanks for fielding my questions here.

Be Well
Anthony

[6] http://docs.sqlalchemy.org/en/latest/#dialect-documentation

>
> Cheers,
>
> Álvaro.
> --
> [1] http://www.rmunn.com/sqlalchemy-tutorial/tutorial.html
> [2] http://dl.dropbox.com/u/2467197/ER-simple.png (yellow tables link
> to HDF5 data, or other tables with the real measurements, white tables
> are computed).
> [3] http://www.scidb.org/
> [4] See p.26-29 and 32
>
> http://www.itea-wsmr.org/ITEA%20Papers%20%20Presentations/2006%20ITEA%20Papers%20and%20Presentations/folk_HDF5_databases_pres.pdf
> [5]
> https://github.com/numpy/numpy/blob/master/numpy/lib/recfunctions.py#L826
>
>
> > Be Well
> > Anthony
> >
> > On Thu, Apr 12, 2012 at 11:03 AM, Alvaro Tejero Cantero <al...@mi...
> >
> > wrote:
> >>
> >> Hi,
> >>
> >> The topic of introducing some kind of relational management in
> >> PyTables comes up with certain frequency.
> >>
> >> Would it be possible to combine the virtues of RDBMS and hdf5's speed
> >> via a mechanism such as SQLite Virtual Tables?
> >>
> >> http://www.sqlite.org/vtab.html
> >>
> >> I wonder if the required x* functions could be written for PyTables,
> >> or if it being in Python is an obstacle to this kind of interfacing
> >> with SQLite.
> >>
> >> Something like that would be a truly powerful solution in use cases
> >> that don't require concurrency.
> >>
> >> Cheers,
> >>
> >> -á.
> >>
> >>
> >>
> ------------------------------------------------------------------------------
> >> For Developers, A Lot Can Happen In A Second.
> >> Boundary is the first to Know...and Tell You.
> >> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
> >> http://p.sf.net/sfu/Boundary-d2dvs2
> >> _______________________________________________
> >> Pytables-users mailing list
> >> Pyt...@li...
> >> https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
> >
> >
> >
> ------------------------------------------------------------------------------
> > For Developers, A Lot Can Happen In A Second.
> > Boundary is the first to Know...and Tell You.
> > Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
> > http://p.sf.net/sfu/Boundary-d2dvs2
> > _______________________________________________
> > Pytables-users mailing list
> > Pyt...@li...
> > https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
>
>
> ------------------------------------------------------------------------------
> For Developers, A Lot Can Happen In A Second.
> Boundary is the first to Know...and Tell You.
> Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
> http://p.sf.net/sfu/Boundary-d2dvs2
> _______________________________________________
> Pytables-users mailing list
> Pyt...@li...
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>