From: Alvaro T. C. <al...@mi...> - 2012-04-25 21:41:40
|
* Hello list, The relational model has a strong foundation and I have spent a few hours thinking about what in PyTables is structurally different from it. Here are my thoughts. I would be delighted if you could add/comment/correct on these ideas. This could eventually help people with a relational/SQL background who want to know how to best use the idioms of PyTables for their data modeling --- I make a distinction between relational and SQL (see CJ Date’s "SQL and relational theory" for more on that). From a purely structural point of view, the following differences are apparent: 1. relations vs. sequences. Relations are sets (i.e. not ordered) of tuples (again, not ordered). 1. rows: In PyTables, every container has an implicit row number, meaning there is always a candidate key and order matters. Although strictly an implementation-level concern, row numbers in PyTables are not stored but computed, thanks to the in-disk adjacency of the records. This is important for large datasets, where adding storage of row numbers means roughly a doubling of diskspace. 2. columns: In PyTables columns are ordered. That is not the case in a purely relational system but it is the case in SQL. 2. Flat tablespace vs. hierarchical tablespace. SQL tables live in a global namespace. PyTables objects can be put inside Groups. Each approach can be mapped onto the other by name mangling. Groups in PyTables are like tables of tables -- for each node in a group there is a full table (or another group...). This introduces a possible ambiguity in data modeling: Consider a table of car parts, one column is Part ID and the other is Model ID, indicating in what car models a particular part is built in. In PyTables you can construct the same table /or/ create a /models group and create one table per model consisting of a single column of Part IDs e.g. /model/sedan, /model/cabrio... etc. The same is possible in a relational setting (dividing the tables according to one attribute, and naming them according to the attribute value, e.g. model_sedan, model_cabrio...). The defining difference is that the interface to manipulate that list is the same (it is a table) whereas in PyTables one listing is a Table object and the other is a list of Nodes, and the API for both is a bit different. 1. Attributes of tables and integrity. Any Node (Groups and Tables included) can receive a limited amount of metadata in PyTables, by using the attached attributeset. In SQL, metadata is limited to some keywords, to be used upon table creation, that establish constraints on the columns that have a functional significance. A prime example of this is identifying foreign keys. SQL allows to use this information structurally at the time of joins, whereas in PyTables one is free to implement this or any other navigational scheme in a customized way using the attributes of the table. When designing such a scheme it has to be remembered that PyTables tables have always an implicit column containing the row numbers, and this is likely to be used as a key. NOTE: I intentionally excluded here implementation issues whenever they are not related to structural ones, e.g. SQL tables do not play well with Numpy containers and are thus ill-suited for big data with Python. Another example would be all the features related to transactions/concurrency and authorization, which are orthogonal to the data model. * -á. |