Re: [Pytables-users] Advise needed : can PyTables be used as a database ?

Hi Bernard,

On Monday 08 March 2004 10:01, Bernard Kaplan wrote:
> Dear community,
> 
> I have to develop a program that performs numerical analysis on data 
> that come from a fab production line. Every month I can count on 
> approximately 100 000 new entries. Each entry is composed on the one 
> hand of general information (such as date, machine, ...) and on the 
> other hand of raw data that we measure (a matrix of size 2000x1000 or 
> more). So far I gather the general information in a relational database 
> (firebird - kinterbasdb) and the data are just kept in individual files. 
> I appreciate the database because I can sort my data on the different 
> columns of my table and I can perform fast search to organize my huge 
> number of entries. But I also realize that the numerical treatment that 
> will follow will become quite cumbersome. This is why I am interested in 
> PyTables (to be honest I am also interested in PyTables because I truly 
> hate SQL and love Python)
> 
> Here are my questions:
> - can I replace my database with PyTables ?

Well, it depends. PyTables is not designed to work as an RDB replacement,
but rather as a companion to one (or on its own if you don't need
relational or indexing capabilities). Read on for a fuller explanation.

> - is it possible to sort efficiently (meaning fast) a table in PyTables 
> along a specific column ? How ?

It is possible, but you need to do some hacking. You can read the column
into memory, sort it with the numarray.argsort function
(http://stsdas.stsci.edu/numarray/numarray-0.8.html/node33.html) to get the
sorted indices, and then rewrite the table following this new order. However,
this will only work for columns that fit in memory. An out-of-core
algorithm for doing the same could be written if there is enough interest.
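
The three steps above (read the column, get the sorted indices, rewrite the
rows in that order) can be sketched in plain Python. This is only an
illustration under assumptions: a list of tuples stands in for the table, the
small argsort helper stands in for numarray.argsort, and the column layout is
hypothetical. No PyTables call is shown.

```python
def argsort(values):
    """Return the indices that would sort `values` (stand-in for numarray.argsort)."""
    return sorted(range(len(values)), key=values.__getitem__)


# Hypothetical in-memory copy of a small table: (date, machine, run_id).
rows = [
    ("2004-03-03", "M2", 17),
    ("2004-03-01", "M1", 42),
    ("2004-03-02", "M3", 5),
]

# 1. Read the column to sort on (it must fit in memory).
dates = [row[0] for row in rows]

# 2. Compute the indices that put that column in ascending order.
order = argsort(dates)

# 3. Rewrite the rows following the new order.
sorted_rows = [rows[i] for i in order]
```

Only the sort column and the index array need to be held in memory at once;
the rows themselves could be copied to the new table one at a time.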

> - does the concept of primary key in a database exist in PyTables ? I 
> use primary key to avoid inserting two times the same row in my table.

No, the only primary key is the row number.

> Is there an equivalent way to do it in PyTables?

What about caching the primary keys in a list and checking whether a key
is already in it before adding the row to the table?
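
A minimal sketch of that check, under stated assumptions: the key is taken to
be a hypothetical (date, machine) pair, a Python list stands in for the table,
and a set is used for the cache instead of a list (a substitution, chosen
because membership tests on a set stay fast as the number of keys grows).

```python
seen_keys = set()   # cache of primary keys already stored
table_rows = []     # stand-in for the real table


def append_unique(key, row):
    """Append `row` only if `key` has not been stored yet.

    Returns True when the row was added and False for a duplicate.
    """
    if key in seen_keys:
        return False
    seen_keys.add(key)
    table_rows.append(row)
    return True


first = append_unique(("2004-03-01", "M1"), {"date": "2004-03-01", "machine": "M1"})
again = append_unique(("2004-03-01", "M1"), {"date": "2004-03-01", "machine": "M1"})
other = append_unique(("2004-03-01", "M2"), {"date": "2004-03-01", "machine": "M2"})
```

When an existing file is reopened, the cache would have to be rebuilt by
reading the key columns once before any new row is appended.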

> - how does PyTables compare with relational databases such as Firebird, 
> SQLite,... in terms of performance ?

See http://pytables.sourceforge.net/html/HowFast.html

> - Are my questions relevant or do you instead advise me to keep to 
> relational database ?

Completely relevant. I would advise you to combine an RDB and PyTables and
get the best of both worlds.
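
One possible reading of that combination, offered as a sketch and not as a
recommended design: the relational side keeps the general information plus a
pointer to where each raw matrix lives, and the matrices themselves go into
the HDF5 file. The example uses the standard-library sqlite3 module in place
of Firebird, and the table name, column names and node paths are all
hypothetical. The PyTables side is not shown.

```python
import sqlite3

# In-memory database standing in for the relational side.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries ("
    " entry_date TEXT, machine TEXT, node_path TEXT,"
    " PRIMARY KEY (entry_date, machine))"
)

# General information plus the (hypothetical) HDF5 node holding each matrix.
entries = [
    ("2004-03-02", "M1", "/raw/entry_000002"),
    ("2004-03-01", "M2", "/raw/entry_000001"),
]
conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", entries)

# The primary key rejects a second row with the same (date, machine).
try:
    conn.execute(
        "INSERT INTO entries VALUES (?, ?, ?)",
        ("2004-03-01", "M2", "/raw/duplicate"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Sorting and searching stay in the database; the result lists which
# nodes to read from the HDF5 file, and in which order.
paths = [
    row[0]
    for row in conn.execute("SELECT node_path FROM entries ORDER BY entry_date")
]
```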

Regards,

-- 
Francesc Alted