From: Luke L. <dur...@gm...> - 2012-09-19 13:38:00
Hi all,

I'm attempting to optimize my HDF5/PyTables application for reading entire columns at a time, and I was wondering what the best way to go about this is. My HDF5 file has the following properties:

- 400,000+ rows
- 25 columns
- 147 MB in total size
- 1 string column of size 12
- 1 column of type 'Float'
- 23 columns of type 'Float64'

My access pattern for this data is generally to read an entire column out at a time, so I want to minimize the number of disk accesses this takes and store the data contiguously by column. I think the proper way to do this in HDF5 is to use 'chunking'. I'm creating my HDF5 files via PyTables, so I guess using the 'chunkshape' parameter at creation time is the correct way to do this?

All of the HDF5 documentation I have read discusses chunk size in terms of rows and columns. However, the PyTables 'chunkshape' parameter only takes a single number. I looked through the source and see that I can in fact pass a tuple, which I assume is (row, column) as the HDF5 documentation would suggest. Is it best to use the 'expectedrows' parameter instead of 'chunkshape', or to use both?

I have done some debugging/profiling and discovered that the default chunkshape is 321 for this dataset. Increasing it to 1000 gives quite a bit better speeds. I'm sure I could keep changing these numbers and find what is best for this particular dataset, but I'm seeking a bit more knowledge on how PyTables uses each of these parameters, how they relate to the HDF5 'chunking' concept, and what the best practices are. This will help me understand how to optimize in the future instead of just for this particular dataset.

Is there any documentation on best practices for using the 'expectedrows' and 'chunkshape' parameters?

Thank you for your time,
Luke Lee
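P.S. For reference, here is a stripped-down sketch of what my creation and read code roughly looks like. The file name, table name, and column names are just placeholders, and only one of the 23 Float64 columns is shown:

    import tables

    # Table description matching the layout above (names are placeholders)
    class Record(tables.IsDescription):
        label = tables.StringCol(12)   # the 12-byte string column
        f0    = tables.Float32Col()    # the single 'Float' column
        f1    = tables.Float64Col()    # ...plus 22 more Float64 columns like this

    h5 = tables.openFile('data.h5', mode='w')
    table = h5.createTable('/', 'records', Record,
                           expectedrows=400000,  # hint PyTables uses to pick a chunkshape
                           chunkshape=(1000,))   # explicit chunkshape (in rows) I'm experimenting with
    # ... append rows here ...
    h5.close()

    # Typical access pattern: pull one whole column at a time
    h5 = tables.openFile('data.h5', mode='r')
    column = h5.root.records.read(field='f1')
    h5.close()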