From: Giovanni L. C. <glc...@gm...> - 2013-06-24 15:52:13
Hi Anthony,

thanks for the explanation and the links, it's much clearer now. So without compression a CArray is really a smarter type of sparse file, but you have to set a sensible chunk shape. Do you know how the default value is set, btw? I am asking because I didn't see any change in performance between using the default value and using (1, N), where (N, N) is the shape of the matrix. I guess that the write performance depends crucially on the size of the I/O buffer, so the default must be choosing a similar setting.

Anyway, I have played a bit with other values of the chunk shape in conjunction with the compression level, and using a shape (1,100) and a complevel=5 gives speeds that are only 10-15% slower than what I get at shape=(1,1) and complevel=0. The resulting file is 10 times smaller, and something like 35 times smaller than a NPY sparse file, btw!

Thanks!

Giovanni

On 06/24/2013 05:25 AM, pyt...@li... wrote:
> Hi Giovanni!
>
> I think that you may have some misunderstanding about how chunking works,
> which is leading you to get terrible performance. In fact what you
> describe is a great strategy (write all and zip) for using normal Arrays.
>
> However, chunking and CArrays don't work like this. If a chunk contains no
> data, it is not written at all! Also, all zipping takes place on the chunk
> level. Thus for very small chunks you can actually increase the file size
> and access time by using compression.
>
> For sparse matrices and CArrays, you need to play around with the
> chunkshape argument to create_carray() and compression. Performance is
> going to be affected by how dense the matrix is and how grouped it is. For
> example, for a very dense and randomly distributed matrix, chunkshape=1 and
> no compression is best. For block diagonal matrices, the chunkshape should
> be the nominal block shape. Compression is only useful here if the blocks
> all have similar values or the block shape is large. For example
>
> 1 1 0 0 0 0
> 1 1 0 0 0 0
> 0 0 1 1 0 0
> 0 0 1 1 0 0
> 0 0 0 0 1 1
> 0 0 0 0 1 1
>
> is well suited to a chunkshape=(2, 2)
>
> For more information on the HDF model please see my talk slides and video
> [1,2]. I hope this helps.
>
> Be Well
> Anthony
>
> PS. Glad to see you using the new API
>
> 1. https://github.com/scopatz/hdf5-is-for-lovers
> 2. http://www.youtube.com/watch?v=Nzx0HAd3FiI

--
Giovanni Luca Ciampaglia
Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University
✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gci...@in...
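
[Editor's sketch] Anthony's point that empty chunks are never written can be illustrated without PyTables at all. The snippet below is a simplified model, not how HDF5 is implemented: it assumes a chunk is stored only if it contains at least one non-zero entry, and the helper name count_nonempty_chunks is made up for this illustration. It counts how many chunks the 6x6 block-diagonal example from the quoted message would need under a few candidate chunk shapes.

```python
import math


def count_nonempty_chunks(matrix, chunkshape):
    """Return (chunks holding a non-zero entry, total chunks in the grid).

    Simplified model: a chunk counts as "stored" only if at least one
    non-zero value falls inside it.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    c_rows, c_cols = chunkshape
    nonempty = set()
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                # Integer division maps a cell to the chunk that covers it.
                nonempty.add((i // c_rows, j // c_cols))
    total = math.ceil(n_rows / c_rows) * math.ceil(n_cols / c_cols)
    return len(nonempty), total


# The block-diagonal matrix from the quoted message.
block_diagonal = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

for shape in [(1, 1), (2, 2), (1, 6), (3, 3)]:
    used, total = count_nonempty_chunks(block_diagonal, shape)
    cells = used * shape[0] * shape[1]
    print(f"chunkshape={shape}: {used} of {total} chunks, {cells} cells stored")
```

Under this model, chunkshape=(2, 2) needs only 3 of 9 chunks and stores exactly the 12 non-zero cells, while (1, 6) and (3, 3) touch every chunk and store all 36 cells. A shape of (1, 1) also stores only 12 cells, but as 12 separate chunks, and the model ignores the per-chunk overhead a real file would carry.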