Re: [Pytables-users] Speed of CArray writing sparse matrices

Hi Anthony,

Thanks for the explanation and the links, it's much clearer now. So without
compression a CArray is really a smarter type of sparse file, but you have to
set a sensible chunk shape. Do you know how the default value is chosen, btw? I
am asking because I didn't see any difference in performance between using the
default value and using (1, N), where (N, N) is the shape of the matrix. I guess
that the write performance depends crucially on the size of the I/O buffer, so
the default must be choosing a similar setting.
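
One way to check which chunk shape was actually picked is to read the
chunkshape attribute of the node after creating it without a chunkshape
argument. This is only a minimal sketch, assuming PyTables 3.x; the file name,
the node names and the value of N are placeholders, not the real setup.

```python
import os
import tempfile

import tables

N = 1000  # hypothetical matrix size, not the one from the real experiment

# Placeholder file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "default_chunks.h5")

with tables.open_file(path, mode="w") as h5:
    # No chunkshape given, so PyTables chooses one itself.
    default = h5.create_carray(h5.root, "default",
                               atom=tables.Float64Atom(), shape=(N, N))
    # Explicit row-shaped chunks, as in the (1, N) experiment.
    rows = h5.create_carray(h5.root, "rows",
                            atom=tables.Float64Atom(), shape=(N, N),
                            chunkshape=(1, N))
    default_chunkshape = tuple(int(n) for n in default.chunkshape)
    rows_chunkshape = tuple(int(n) for n in rows.chunkshape)

print("default chunkshape: ", default_chunkshape)
print("explicit chunkshape:", rows_chunkshape)
```

Comparing the two printed shapes would show whether the default really is close
to (1, N) for a given N.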

Anyway, I have played a bit with other values of the chunk shape in conjunction
with the compression level: using chunkshape=(1, 100) and complevel=5 gives
write speeds that are only 10-15% slower than what I get with chunkshape=(1, 1)
and complevel=0. The resulting file is 10 times smaller, and something like 35
times smaller than a NPY sparse file, btw!
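
The comparison above can be reproduced roughly along these lines. This is a
sketch under assumptions, not the script that was actually used: the matrix
size, the random nonzero positions, the zlib compressor and the helper name
write_sparse are all made up for illustration, and the resulting sizes will
depend on the data.

```python
import os
import tempfile

import numpy as np
import tables

N = 1000  # hypothetical matrix size
rng = np.random.default_rng(0)
# Hypothetical sparse data: 500 random positions that get the value 1.0.
rows = rng.integers(0, N, size=500)
cols = rng.integers(0, N, size=500)


def write_sparse(path, chunkshape, complevel):
    """Write the nonzero entries into an N x N CArray; return the file size in bytes."""
    # complevel=0 disables compression; zlib is an assumed choice of compressor.
    filters = tables.Filters(complevel=complevel, complib="zlib")
    with tables.open_file(path, mode="w") as h5:
        arr = h5.create_carray(h5.root, "m", atom=tables.Float64Atom(),
                               shape=(N, N), chunkshape=chunkshape,
                               filters=filters)
        # Only the nonzero entries are assigned; everything else keeps the default value.
        for i, j in zip(rows, cols):
            arr[int(i), int(j)] = 1.0
    return os.path.getsize(path)


tmp = tempfile.mkdtemp()
size_plain = write_sparse(os.path.join(tmp, "plain.h5"), (1, 1), 0)
size_zip = write_sparse(os.path.join(tmp, "zip.h5"), (1, 100), 5)
print("chunkshape=(1, 1),   complevel=0:", size_plain, "bytes")
print("chunkshape=(1, 100), complevel=5:", size_zip, "bytes")
```

Wrapping the calls to write_sparse in time.perf_counter() would give the
corresponding write timings.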

Thanks!

Giovanni

On 06/24/2013 05:25 AM, pyt...@li... wrote:
> Hi Giovanni!
>
> I think that you may have some misunderstanding about how chunking works,
> which is leading you to get terrible performance.  In fact what you
> describe is a great strategy (write all and zip) for using normal Arrays.
>
> However, chunking and CArrays don't work like this.  If a chunk contains no
> data, it is not written at all!  Also, all zipping takes place on the chunk
> level.  Thus for very small chunks you can actually increase the file size
> and access time by using compression.
>
> For sparse matrices and CArrays, you need to play around with the
> chunkshape argument to create_carray() and with compression.  Performance is
> going to be affected by how dense the matrix is and how grouped it is.  For
> example, for a very dense and randomly distributed matrix, chunkshape=1 and
> no compression is best.  For block diagonal matrices, the chunkshape should
> be the nominal block shape.  Compression is only useful here if the blocks
> all have similar values or the block shape is large.  For example,
>
> 1 1 0 0 0 0
> 1 1 0 0 0 0
> 0 0 1 1 0 0
> 0 0 1 1 0 0
> 0 0 0 0 1 1
> 0 0 0 0 1 1
>
> is well suited to a chunkshape=(2, 2)
>
> For more information on the HDF model please see my talk slides and video
> [1,2].  I hope this helps.
>
> Be Well
> Anthony
>
> PS. Glad to see you using the new API
>
> 1. https://github.com/scopatz/hdf5-is-for-lovers
> 2. http://www.youtube.com/watch?v=Nzx0HAd3FiI
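
To make the block-diagonal example from the quoted message concrete, the
sketch below counts how many chunks of a given shape contain at least one
nonzero value. Going by the statement that empty chunks are not written at
all, that count is a rough proxy for how many chunks would end up in the file.
It is plain Python and never touches HDF5; the helper name
count_nonempty_chunks is made up.

```python
def count_nonempty_chunks(matrix, chunkshape):
    """Count the chunks of the given shape that hold at least one nonzero value."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    c_rows, c_cols = chunkshape
    nonempty = 0
    # Walk over the top-left corner of every chunk in the grid.
    for r0 in range(0, n_rows, c_rows):
        for c0 in range(0, n_cols, c_cols):
            chunk = [matrix[r][c]
                     for r in range(r0, min(r0 + c_rows, n_rows))
                     for c in range(c0, min(c0 + c_cols, n_cols))]
            if any(chunk):
                nonempty += 1
    return nonempty


# The 6 x 6 block-diagonal matrix from the quoted message.
matrix = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]

for shape in [(1, 1), (2, 2), (1, 6), (3, 3)]:
    n = count_nonempty_chunks(matrix, shape)
    cells = n * shape[0] * shape[1]
    print(f"chunkshape={shape}: {n} non-empty chunks, {cells} cells stored")
```

With chunkshape=(2, 2) only 3 chunks (12 cells) are non-empty, while (3, 3)
and (1, 6) each cover 36 cells, i.e. the whole matrix. chunkshape=(1, 1) also
covers just 12 cells, but split over 12 separate chunks.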

-- 
Giovanni Luca Ciampaglia

Postdoctoral fellow
Center for Complex Networks and Systems Research
Indiana University

✎ 910 E 10th St ∙ Bloomington ∙ IN 47408
☞ http://cnets.indiana.edu/
✉ gci...@in...