From: Francesc A. <fa...@op...> - 2003-09-18 08:50:12
On Thursday 18 September 2003 01:10, you wrote:

> I'll update to the latest version and try it out. I'll let you know if I
> notice any slowness or excessive memory use. What do you mean by *lots* of
> memory? The data itself that is written out varies from 10's to 100's of
> megabytes. Is it comparable to that?

By *lots* of memory I meant that, roughly, 6 KB will be booked for each Group object, 6 KB for each Array and 12 KB for each Table. That is where most of the memory consumption goes: building the structure (i.e. the object tree) from metadata. So a tree with 1000 Groups, 2000 Arrays and 4000 Tables can account for 66 MB (add 10 MB for the python interpreter and the pytables modules).

Also, the I/O buffers for Tables grow with dataset size: to give you an idea, they can be 5 KB for small datasets (less than 100 KB) and up to 60 KB for larger ones (greater than 200 MB). However, these buffers are built dynamically (this is new in the CVS version, I forgot to mention that!), so if you don't access the actual data in a Table, this memory will not be booked. I'm pondering now whether I should release these buffers after use or keep them in memory for possible later use (it's a matter of balancing CPU against memory consumption), but I will probably release all buffers after a read or write operation, at the expense of more CPU consumption.

Array objects are not buffered (you can only read them completely or not at all), so the amount of actual data saved (whether your Array is 1 byte or 1 TB in size) is not going to affect your memory demands much, except that you will need enough memory to hold a large Array if you actually want to read its data!

> In the files, I am writing time-dependent data produced by my code. I
> write the data out as arrays and not as tables since the data size varies
> from step to step. The data that is written each step is seven arrays all
> the same size and an integer. Would it be more optimal to create a table
> for each step and write the seven arrays as elements of the table? The
> total number of steps is typically on the order of one thousand.

Well, I'm afraid your best bet would be to use Variable Length Arrays, like the ones Nicola Larosa was asking about in an earlier message, but those will take some time to be implemented.

In the meantime, if you use Tables you would reduce the number of nodes by a factor of seven. On the other hand, Tables need more memory per node than Arrays (twice as much for the object itself, and about twice as much again for the internal buffer when working with small datasets), so one can conclude that, with Tables, you would need roughly 4/7 of the memory you are using now. In addition, Tables are quite a bit more flexible than Array entities (you can do selections without loading all the data into memory, or load just parts of the dataset), so I would recommend using Tables with arrays as columns. Keep in mind, too, that Array entities do not support on-the-fly compression, while Tables do.

Another possibility with Tables is to define several Tables with two columns: one to store the actual array and the other to save its actual length. You can then set up the series of Tables with different array column lengths in such a way that your arrays fit well into one of them (I mean, without wasting too much space). For example, if your arrays range from (2, 1) to (2, 100), you can set up several Tables with array columns of shapes (2, 10), (2, 20), ..., (2, 100).
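In case it helps, here is a minimal sketch of that two-column layout. Be aware that it is written against a much newer PyTables API than the one discussed in this thread (open_file, create_table, IsDescription, the *Col classes and NumPy), so the spelling of those calls, as well as the file name, bucket width and dtype below, are assumptions you would have to adapt to your own setup:

    import numpy as np
    import tables


    class PaddedArray(tables.IsDescription):
        """One row = one array, zero-padded to the bucket width."""
        length = tables.Int32Col()                  # number of valid columns
        data   = tables.Float64Col(shape=(2, 100))  # bucket for arrays up to (2, 100)


    h5 = tables.open_file("steps.h5", mode="w")
    tbl = h5.create_table("/", "bucket_100", PaddedArray, "arrays up to (2, 100)")

    # Write: pad the actual array into the fixed-size column, record its width.
    arr = np.random.rand(2, 37)                     # one step's data, shape (2, 37)
    padded = np.zeros((2, 100))
    padded[:, :arr.shape[1]] = arr

    row = tbl.row
    row["length"] = arr.shape[1]
    row["data"] = padded
    row.append()
    tbl.flush()

    # Read back: use the length field to strip the padding off again.
    recs = tbl.read()                               # structured NumPy array
    first = recs[0]
    original = first["data"][:, :first["length"]]   # back to shape (2, 37)

    h5.close()

Since Tables do support on-the-fly compression, the zero padding in each bucket should compress away fairly well if you enable it.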
You can then save each array in the Table whose column fits it best and store its actual shape in the other field. After retrieving the arrays, you can use the length field to strip out the padding you are not interested in. I agree that this solution is a bit contrived, but if you have a large number of arrays, it may be your best choice until VLArrays are done.

Cheers,

--
Francesc Alted