From: Francesc A. <fa...@ca...> - 2007-08-24 12:36:06
Hi Elias,

On Thursday 23 August 2007, you wrote:
> Francesc,
>
> Here's my setup:
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> PyTables version:  1.3
> HDF5 version:      1.6.5
> numarray version:  1.5.1
> Zlib version:      1.2.1
> BZIP2 version:     1.0.2 (30-Dec-2001)
> Python version:    2.4.3 (#1, Apr 21 2006, 14:31:08)
>                    [GCC 3.3.3 (SuSE Linux)]
> Platform:          linux2-x86_64
> Byte-ordering:     little
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>
> I recently switched from 'h5import' to PyTables to convert the output
> from large finite element models into HDF5 format. I like the PyTables
> approach because it gives me more control than the shell scripts I
> cobbled together to use 'h5import'.
>
> However, the most recent file takes much longer to search. Here are
> the results of a simple test I ran with the old and new databases:
>
> 'New':
> $ python test_finder.py
> Found 3 results for your search
> CQUAD4 1121910
> fh.find('1121910') took 2.37 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 9.44 sec
>
> 'Old':
> $ python test_finder.py
> Found 3 results for your search
> CQUAD4 1121910
> fh.find('1121910') took 0.664 sec
> Found 3 results for your search
> fh.find('1121910', gpf=True) took 0.638 sec
>
> The only difference I could detect between the two files is that the
> PyTables-created version uses the 'shuffle' parameter. Here is some
> ptdump output for some nodes:
>
> 'New':
> $ ptdump -v xxx_lev_1_1.h5:/results/oef1/quad4
> /results/oef1/quad4 (EArray(1022L, 17759L, 3L), shuffle, zlib(6)) ''
>   atom = Atom(dtype='Float32', shape=(0, 17759L, 3L), flavor='numarray')
>   nrows = 1022
>   extdim = 0
>   flavor = 'numarray'
>   byteorder = 'little'
                ^^^^^^^^  <- Notice this
>
> 'Old':
> $ ptdump -v xxx_lev_0.h5:/results/oef1/quad4
> /cluster/stress/methods/local/lib/python2.4/site-packages/tables/File.py:227:
> UserWarning: file ``xxx_lev_0.h5`` exists and it is an HDF5 file, but
> it does not have a PyTables format; I will try to do my best to guess
> what's there using HDF5 metadata
>   METADATA_CACHE_SIZE, nodeCacheSize)
> /results/oef1/quad4 (EArray(1018L, 17402L, 3L), zlib(6)) ''
>   atom = Atom(dtype='Float32', shape=(0, 17402L, 3L), flavor='numarray')
>   nrows = 1018
>   extdim = 0
>   flavor = 'numarray'
>   byteorder = 'big'
                ^^^^^  <- Notice this
>
> My client code is completely unchanged in this testing: only the
> databases were created by two different methods. I have yet to do
> more testing with smaller files (these are ~2.2 GB). I read the
> section on shuffling in the manual, where it suggests that shuffle
> will actually improve throughput, but this is the only difference I
> could detect. It is not a trivial matter to produce these large
> files, so I need to get it right. I know it's not much to go on, but
> any suggestions are appreciated.

As I remarked above, another difference is that the 'new' files are
converted to little-endian byteorder, and that could affect performance
if you process those files on a big-endian machine.

However, my guess is that the real problem in this case effectively
lies in the shuffle filter. The thing is that in the PyTables 1.x
series, the algorithm for computing the chunksize
(i.e. the size on which compression is applied) was not very
fine-tuned, and the computed size can be as high as 600 KB, putting too
much stress on the shuffle filter. This has been improved in the 2.x
series, so the chunksize for your files (~2.2 GB) would be something
like 32 KB or 64 KB, which is a more reasonable figure for shuffling
(besides allowing far better performance in sparse reads).

So, you may want to try PyTables 2.0 or, if you want to stick with 1.3,
try disabling the shuffle filter (at the expense of reducing the
compression effectiveness) when creating the 'new' arrays. My
recommendation, though, is that you switch to 2.0, as there are more
optimizations there (like using NumPy natively, among others) that can
help improve your times still more.

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
 V V    Cárabos Coop. V.   Enjoy Data
  "-"
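
P.S. In case it helps, here is a minimal, untested sketch of the 1.3
workaround (disabling shuffle at array creation time); the file and
node names are just taken from your ptdump output above:

    import tables

    # Hypothetical output file; group/array names follow the ptdump
    # output above.
    fileh = tables.openFile("xxx_lev_1_1.h5", mode="w")
    results = fileh.createGroup("/", "results")
    oef1 = fileh.createGroup(results, "oef1")

    # Keep zlib level 6 as before, but switch the shuffle filter off.
    filters = tables.Filters(complevel=6, complib="zlib", shuffle=False)

    # In the 1.x API the atom carries the array shape, with 0 marking
    # the extendable dimension.
    atom = tables.Float32Atom(shape=(0, 17759, 3), flavor="numarray")

    quad4 = fileh.createEArray(oef1, "quad4", atom, title="",
                               filters=filters, expectedrows=1022)

    # ... quad4.append(...) your numarray chunks as usual, then:
    fileh.close()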
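
P.P.S. And roughly the equivalent under 2.0 (again untested): the atom
no longer carries the shape, shuffle can stay on thanks to the smaller
computed chunksizes, and you can also keep your native big-endian
byteorder explicitly:

    import tables

    fileh = tables.openFile("xxx_lev_1_1.h5", mode="w")
    results = fileh.createGroup("/", "results")
    oef1 = fileh.createGroup(results, "oef1")

    # Shuffle is cheap with the smaller 2.0 chunks, so leave it on.
    filters = tables.Filters(complevel=6, complib="zlib", shuffle=True)

    quad4 = fileh.createEArray(oef1, "quad4",
                               tables.Float32Atom(),  # per-element atom
                               (0, 17759, 3),         # 0 = extendable dim
                               filters=filters,
                               expectedrows=1022,
                               chunkshape=None,   # None = auto-computed
                               byteorder="big")   # keep the native order

    fileh.close()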