From: Francesc A. <fa...@py...> - 2004-11-10 10:50:10
I've ended up with a new rewrite of the EArray._calcBufferSize method, which I'm including at the end of this message. Please play with the different values in these lines:

    #bufmultfactor = int(1000 * 10)   # Conservative value
    bufmultfactor = int(1000 * 20)    # Medium value
    #bufmultfactor = int(1000 * 50)   # Aggressive value
    #bufmultfactor = int(1000 * 100)  # Very aggressive value

and tell me your feedback.

--
Francesc Altet

def _calcBufferSize(self, atom, extdim, expectedrows, compress):
    """Calculate the buffer size and the HDF5 chunk size.

    The logic used here is based purely on experiments playing with
    different buffer sizes, chunk sizes and the compression flag. It
    is clear that using big buffers optimizes I/O speed. This might
    (should) be further optimized by doing more experiments.
    """
    rowsize = atom.atomsize()
    #bufmultfactor = int(1000 * 10)   # Conservative value
    bufmultfactor = int(1000 * 20)    # Medium value
    #bufmultfactor = int(1000 * 50)   # Aggressive value
    #bufmultfactor = int(1000 * 100)  # Very aggressive value
    rowsizeinfile = rowsize
    expectedfsizeinKb = (expectedrows * rowsizeinfile) / 1024
    if expectedfsizeinKb <= 100:
        # Files up to 100 KB
        buffersize = 5 * bufmultfactor
    elif expectedfsizeinKb <= 1000:
        # Files between 100 KB and 1 MB
        buffersize = 10 * bufmultfactor
    elif expectedfsizeinKb <= 20 * 1000:
        # Sizes between 1 MB and 20 MB
        buffersize = 20 * bufmultfactor
    elif expectedfsizeinKb <= 200 * 1000:
        # Sizes between 20 MB and 200 MB
        buffersize = 40 * bufmultfactor
    elif expectedfsizeinKb <= 2000 * 1000:
        # Sizes between 200 MB and 2 GB
        buffersize = 50 * bufmultfactor
    else:
        # Greater than 2 GB
        buffersize = 60 * bufmultfactor

    # Maximum number of rows (tuples) that fit in the buffer
    maxTuples = buffersize // rowsize
    chunksizes = list(atom.shape)
    # Check whether more than one row fits in the buffer
    if maxTuples > 1:
        # Yes. The chunk sizes for the non-extendable dims are left
        # unchanged.
        chunksizes[extdim] = maxTuples
    else:
        # No. Reduce the other dimensions until we get a proper
        # chunksizes shape.
        chunksizes[extdim] = 1  # Only one row in the extendable dimension
        for j in range(len(chunksizes)):
            newrowsize = atom.itemsize
            for i in chunksizes[j+1:]:
                newrowsize *= i
            maxTuples = buffersize // newrowsize
            if maxTuples > 1:
                break
            chunksizes[j] = 1
        # Compute the chunk size correctly for this j index
        chunksize = maxTuples
        if j < len(chunksizes):
            # Only modify chunksizes[j] if needed
            if chunksize < chunksizes[j]:
                chunksizes[j] = chunksize
        else:
            chunksizes[-1] = 1  # very large itemsizes!
        # Compute the correct maxTuples number
        newrowsize = atom.itemsize
        for i in chunksizes:
            newrowsize *= i
        maxTuples = buffersize // newrowsize

    return (buffersize, maxTuples, chunksizes)
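In case you want to experiment with these values outside of a full PyTables install, here is a minimal standalone harness. Note that FakeAtom is only a hypothetical stand-in that exposes the three pieces of information the method actually reads (shape, itemsize and atomsize()); it is not a real PyTables atom, and the numbers below are just illustrative.

class FakeAtom:
    """Hypothetical stand-in for a PyTables atom (illustration only)."""
    def __init__(self, shape, itemsize):
        self.shape = shape        # e.g. (0, 100), extendable dim holds a 0
        self.itemsize = itemsize  # bytes per element, e.g. 8 for Float64

    def atomsize(self):
        # Bytes per row: itemsize times all the non-extendable dimensions
        size = self.itemsize
        for dim in self.shape:
            if dim != 0:
                size *= dim
        return size

# _calcBufferSize never touches self, so passing None works for testing.
atom = FakeAtom(shape=(0, 100), itemsize=8)  # 800-byte rows
for expectedrows in (10, 10000, 10000000):
    buffersize, maxTuples, chunksizes = _calcBufferSize(
        None, atom, extdim=0, expectedrows=expectedrows, compress=None)
    print(expectedrows, buffersize, maxTuples, chunksizes)

With the medium bufmultfactor this prints buffer sizes of 100 KB, 400 KB and 1.2 MB respectively, so you can see directly how each bufmultfactor setting scales the chunk size in the extendable dimension.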