[Pytables-users] Main differences between PyTables and Relational

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

* Hello list,
The relational model has a strong foundation and I have spent a few hours
thinking about what in PyTables is structurally different from it. Here are
my thoughts. I would be delighted if you could add/comment/correct on these
ideas. This could eventually help people with a relational/SQL background
who want to know how to best use the idioms of PyTables for their data
modeling

---
I make a distinction between relational and SQL (see CJ Date’s "SQL and
relational theory" for more on that). From a purely structural point of
view, the following differences are apparent:

   1. relations vs. sequences. Relations are sets (i.e. not ordered) of
   tuples (again, not ordered).
      1. rows: In PyTables, every container has an implicit row number,
      meaning there is always a candidate key and order matters. Although
      strictly an implementation-level concern, row numbers in PyTables are not
      stored but computed, thanks to the in-disk adjacency of the records. This
      is important for large datasets, where adding storage of row
numbers means
      roughly a doubling of diskspace.
      2. columns: In PyTables columns are ordered. That is not the case in
      a purely relational system but it is the case in SQL.
   2. Flat tablespace vs. hierarchical tablespace. SQL tables live in a
   global namespace. PyTables objects can be put inside Groups. Each approach
   can be mapped onto the other by name mangling. Groups in PyTables are like
   tables of tables -- for each node in a group there is a full table (or
   another group...). This introduces a possible ambiguity in data modeling:

Consider a table of car parts, one column is Part ID and the other is Model
ID, indicating in what car models a particular part is built in. In
PyTables you can construct the same table /or/ create a /models group and
create one table per model consisting of a single column of Part IDs e.g.
/model/sedan, /model/cabrio... etc. The same is possible in a relational
setting (dividing the tables according to one attribute, and naming them
according to the attribute value, e.g. model_sedan, model_cabrio...). The
defining difference is that the interface to manipulate that list is the
same (it is a table) whereas in PyTables one listing is a Table object and
the other is a list of Nodes, and the API for both is a bit different.

   1. Attributes of tables and integrity. Any Node (Groups and Tables
   included) can receive a limited amount of metadata in PyTables, by using
   the attached attributeset. In SQL, metadata is limited to some keywords, to
   be used upon table creation, that establish constraints on the columns that
   have a functional significance. A prime example of this is identifying
   foreign keys. SQL allows to use this information structurally at the time
   of joins, whereas in PyTables one is free to implement this or any other
   navigational scheme in a customized way using the attributes of the table.

When designing such a scheme it has to be remembered that PyTables tables
have always an implicit column containing the row numbers, and this is
likely to be used as a key.

NOTE: I intentionally excluded here implementation issues whenever they are
not related to structural ones, e.g. SQL tables do not play well with Numpy
containers and are thus ill-suited for big data with Python. Another
example would be all the features related to transactions/concurrency and
authorization, which are orthogonal to the data model.

*

-á.