From: Tim B. <tim...@ma...> - 2013-03-11 01:48:19
On 08/03/2013, at 2:51 AM, Anthony Scopatz wrote:

> Hey Tim,
>
> Awesome dataset! And neat image!
>
> As per your request, a couple of minor things I noticed: you probably don't need to do the sanity check each time (great for debugging, but not always needed); you are using masked arrays, which, while sometimes convenient, are generally slower than creating an array and a mask and applying the mask to the array; and you seem to be downcasting from float64 to float32 for some reason that I am not entirely clear on (size, speed?).
>
> To the more major question of write performance, one thing that you could try is compression. You might want to do some timing studies to find the best compressor and level. Performance here can vary a lot based on how similar your data is (and how close similar data is to each other). If you have a bunch of zeros and only a few real data points, even zlib level 1 is going to be blazing fast compared to writing all those zeros out explicitly.
>
> Another thing you could try is switching to EArray and using the append() method. This might save PyTables, numpy, hdf5, etc. from having to check that the shape of "sst_node[qual_indices]" actually matches the data you are giving it. Additionally, dumping a block of memory directly to the file (via append()) is generally faster than having to resolve fancy indexes (which are notoriously the slow part of even numpy).
>
> Lastly, as a general comment, you seem to be doing a lot of stuff in the innermost loop -- including writing to disk. I would look at how you could restructure this to move as much as possible out of this loop. Your data seems to be about 12 GB for a year, so it is probably too big to build up the full sst array completely in memory prior to writing. That is, unless you have a computer much bigger than my laptop ;). But issuing one fat write command is probably going to be faster than making 365 of them.
>
> Happy hacking!
> Be Well
> Anthony

Thanks Anthony for being so responsive and touching on a number of points.

The netCDF library gives me a masked array, so I have to explicitly transform that into a regular numpy array. I've looked under the covers and have seen that the numpy.ma masked implementation is all pure Python, so there is a performance drawback. I'm not up to speed yet on where the numpy.na masking implementation stands (started a new job here).

I tried to do an implementation in memory (except for the final write) and found that I have about 2 GB of indices when I extract the quality indices. Simply using those indices, memory usage grows to over 64 GB, and I eventually run out of memory and start churning away in swap.

For the moment, I have pulled down the latest git master and am using the new in-memory HDF feature. This seems to give me better performance and is code-wise pretty simple, so for now it's good enough.

Cheers and thanks again,
Tim

BTW I viewed your SciPy tutorial. Good stuff!
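[Editor's note: for readers following along, below is a minimal sketch of the approach discussed above -- pulling the plain data and mask out of the netCDF masked array rather than operating through numpy.ma, writing into an in-memory PyTables file via the H5FD_CORE driver, and appending compressed day-sized blocks to an EArray. File names, variable names, and array shapes are illustrative assumptions, not taken from Tim's actual script.]

    import numpy as np
    import tables
    from netCDF4 import Dataset

    # Hypothetical input file and variable name -- adjust to the real dataset.
    NC_PATH = "sst_day_001.nc"
    VAR_NAME = "sst"

    # netCDF4 hands back a masked array; extract the plain data and the
    # mask separately instead of going through numpy.ma operations.
    nc = Dataset(NC_PATH)
    masked = nc.variables[VAR_NAME][0, :, :]
    data = np.ma.getdata(masked).astype(np.float32)
    mask = np.ma.getmaskarray(masked)
    ny, nx = data.shape

    # In-memory HDF5 file via the CORE driver (the in-memory feature in
    # PyTables git master / 3.0).  With backing_store enabled, the whole
    # in-memory image is written to disk once, when the file is closed.
    h5f = tables.open_file("sst_year.h5", mode="w",
                           driver="H5FD_CORE",
                           driver_core_backing_store=1)

    # Extendable, zlib-compressed array; one append per day instead of
    # fancy-indexed assignment into a preallocated array.
    filters = tables.Filters(complevel=1, complib="zlib")
    sst = h5f.create_earray(h5f.root, "sst",
                            atom=tables.Float32Atom(),
                            shape=(0, ny, nx),
                            filters=filters,
                            expectedrows=365)

    # Apply the mask explicitly (NaN here; a fill value would also work)
    # and dump the day as a single (1, ny, nx) block.
    day = np.where(mask, np.float32(np.nan), data)
    sst.append(day[np.newaxis, ...])

    h5f.close()
    nc.close()

In a real run the extract/append pair would sit inside the per-day loop, with the file opened once outside it, so the only disk I/O is the single flush at close.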