From: Francesc A. <fa...@ca...> - 2007-08-28 11:33:44
On Monday 27 August 2007, you wrote:
> > Yeah, that's a bit strange. If 're-adding' shuffle is actually
> > improving your search times, then perhaps it is not the actual
> > problem. Now, I think that the main issue should be the length of
> > the chunksize of 'new' files. Can you run the 'h5ls -v' utility
> > that comes with HDF5 and send the 'Chunks:' fields of the output
> > for the '/results/oef1/quad4' dataset for both 'old' and 'new'
> > files?
>
> $ h5ls -v old.h5/results/oef1/quad4
> Opened "old.h5" with sec2 driver.
> results/oef1/quad4       Dataset {1018/Inf, 17402/Inf, 3/3}
>     Location:  0:1:0:28034319
>     Links:     1
>     Modified:  2007-01-04 15:45:37 EST
>     Chunks:    {119, 100, 3} 142800 bytes
>     Storage:   212582832 logical bytes, 196302976 allocated bytes, 108.29% utilization
>     Filter-0:  deflate-1 OPT {6}
>     Type:      IEEE 32-bit big-endian float
>
> $ h5ls -v new.h5/results/oef1/quad4
> Opened "new.h5" with sec2 driver.
> results/oef1/quad4       Dataset {1022/Inf, 17759/17759, 3/3}
>     Attribute: CLASS scalar
>         Type:  7-byte null-terminated ASCII string
>         Data:  "EARRAY"
>     Attribute: EXTDIM scalar
>         Type:  native int
>         Data:  0
>     Attribute: FLAVOR scalar
>         Type:  9-byte null-terminated ASCII string
>         Data:  "numarray"
>     Attribute: VERSION scalar
>         Type:  4-byte null-terminated ASCII string
>         Data:  "1.3"
>     Attribute: TITLE scalar
>         Type:  1-byte null-terminated ASCII string
>         Data:  ""
>     Location:  0:1:0:1126352
>     Links:     1
>     Modified:  2007-08-21 08:08:41 EDT
>     Chunks:    {1, 17759, 3} 213108 bytes
>     Storage:   217796376 logical bytes, 183047210 allocated bytes, 118.98% utilization
>     Filter-0:  shuffle-2 OPT {4}
>     Filter-1:  deflate-1 OPT {6}
>     Type:      native float
>
> > Also, it would be nice to know the way you are doing the search
> > process (sequential or sparse access?); if you can send the search
> > algorithm that would be nice. The only thing that comes to my mind
> > is that, if your search process is based on a sparse access
> > pattern, having a large chunksize can severely penalize the times;
> > in this case, using PyTables 2.0, which creates far smaller
> > chunksizes by default, will help. If you are using sequential
> > access, then I don't really understand what can be the cause of
> > the slowdown.
>
> Well, the related arrays are stored in the same order. Then I use a
> simple binary search of an 'index' to determine the offset to find
> the related data. For example, say that in a mesh, the index is a
> rank-1 array of integer identifiers, and the associated space
> coordinates are stored as a rank-2 array, where the second dimension
> is like a tuple of (x, y, z).

Aha, so you are doing a binary search in an 'index' first; then it is
almost certain that most of the time is spent performing the lookup in
this rank-1 array. Since you are doing a binary search, and the minimum
unit of I/O in HDF5 is precisely one chunk, small chunksizes will favor
performance. Looking at your lookup times, my guess is that your 'index'
array is on disk, and the sparse access to it (i.e. the binary search)
is your bottleneck. Unfortunately, you did not send the chunksizes for
the rank-1 index array, but most probably the chunksize in the 'old'
files is rather small compared with the 'new' ones. In this case, and
as I said in another message, creating the 'new' files with PyTables 2.0
will help because it uses far smaller chunksizes by default.
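Just to make the access pattern explicit, here is a minimal sketch of
the kind of lookup I have in mind (the file name and array location are
made up); the point is that every probe of the on-disk index forces
HDF5 to read and decompress at least one whole chunk, so a binary
search over N rows costs roughly log2(N) chunk reads:

import tables

def find_offset(idx, key):
    # Plain binary search over the on-disk rank-1 array; each idx[mid]
    # access reads (and decompresses) a complete chunk from disk.
    lo, hi = 0, idx.nrows
    while lo < hi:
        mid = (lo + hi) // 2
        if idx[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo

fileh = tables.openFile('new.h5', mode='r')   # made-up file name
idx = fileh.root.results.index                # made-up array location
offset = find_offset(idx, 154092)
fileh.close()

With the small default chunksizes of PyTables 2.0 each of those probes
is cheap; with a very large chunk, every probe pays the cost of reading
the whole chunk.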
Also, PyTables 2.0 lets you set even smaller chunksizes than the
default (see the new 'chunkshape' parameter in the create*Array
factories), allowing better fine-tuning of query times.

As an aside, and just in case you are not aware of it: PyTables Pro
allows you to index table columns and then perform binary searches on
them very quickly. So, if you want maximum performance in your lookups,
one possibility is to declare a Table with a single column (the
identifiers), index it, and then do the query:

offset = [r['index'] for r in table.where('index == 154092')][0]

Of course, all the parameters in the Pro indexing engine have already
been fine-tuned so as to get pretty optimal query times (see [1] for a
detailed description of how Pro indexes work and their performance).

[snip]

> The new ptrepack seems to work OK. I did observe that if I used
> --complevel and --shuffle at the same time, shuffle was always set
> to "off" no matter the value of --shuffle.

This is a bug in ptrepack. The attached patch should solve the problem.

> Unfortunately, I can't test the effect of the new files:
>
> $ python test_finder.py
> Testing file /cluster/stress/p20loads/gac/lev_0_test.hdf5
> HDF5-DIAG: Error detected in HDF5 library version: 1.6.5 thread 0.
> Back trace follows.
>   #000: H5A.c line 457 in H5Aopen_name(): attribute not found
>     major(18): Attribute layer
>     minor(05): Bad value
>   #001: H5A.c line 404 in H5A_get_index(): attribute not found
>     major(18): Attribute layer
>     minor(48): Object not found
> Segmentation fault
>
> So I tried with PyTables 2.0:
>
> $ python test_finder.py
> Testing file /cluster/stress/p20loads/gac/lev_0_test.hdf5
> Traceback (most recent call last):
>   File "test_finder.py", line 16, in ?
>     fh.find_gpfb('121731')
>   File "/cluster/stress/u308168/public_html/pyloads/model/finder.py",
>     line 210, in find_gpfb
>     r = nasob.NodalResult(self.fileh, g, balance=not oelop)
>   File "../nasob.py", line 375, in __init__
>     elements = grid.elements
>   File "../nasob.py", line 52, in _elements
>     self._elist.append(Result(self.fileh, eid, ogpf=True))
>   File "../nasob.py", line 288, in __init__
>     g.ogpf.T1 = g.ogpf.t1 = g.fx = g.FX = g.ogpf[:,0]
> AttributeError: 'numpy.ndarray' object has no attribute 'T1'
> Closing remaining open files:
> /cluster/stress/p20loads/gac/lev_0_test.hdf5... done
>
> I guess I'll have to read the migration docs ;)

Well, I think so ;)

[1] http://www.carabos.com/docs/OPSI-indexes.pdf

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V V     Cárabos Coop. V.   Enjoy Data
 "-"
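PS: In case it helps, here is a slightly fuller sketch of the
single-column Table idea above. All the names here are made up, and
createIndex() needs the Pro indexing engine:

import tables

class Identifiers(tables.IsDescription):
    index = tables.Int64Col()

ids = range(200000)   # stand-in for your rank-1 array of identifiers

fileh = tables.openFile('ids.h5', mode='w')
table = fileh.createTable('/', 'ids', Identifiers)
table.append([(i,) for i in ids])
table.flush()
table.cols.index.createIndex()   # build the index (PyTables Pro only)

# r.nrow is the row position of the match, i.e. the offset into the
# companion arrays that are stored in the same order.
offset = [r.nrow for r in table.where('index == 154092')][0]
fileh.close()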