From: James B. <jk...@sa...> - 2012-09-21 13:06:06
On Fri, Sep 21, 2012 at 03:50:14PM +0400, Artem Tarasov wrote:
> In fact, if you look at the documentation of the Zlib library
> ( http://www.zlib.net/manual.html#Utility ), it mentions a function
> compressBound(ulong) that returns the upper bound of the compressed
> block size. So anything less than or equal to
> max { v | compressBound(v) <= 65536 } would do.

Thanks, I hadn't noticed that, although it's at odds with the RFC, maybe
due to headers.

    uLong ZEXPORT compressBound (sourceLen)
        uLong sourceLen;
    {
        return sourceLen + (sourceLen >> 12) + (sourceLen >> 14) +
               (sourceLen >> 25) + 13;
    }

Whereas from RFC 1951:

    "A simple counting argument shows that no lossless compression
    algorithm can compress every possible input data set. For the format
    defined here, the worst case expansion is 5 bytes per 32K-byte block,
    i.e., a size increase of 0.015% for large data sets."

5 bytes every 32K is 1+4 every 32K, or (sourceLen >> 13) + (sourceLen >> 15).
compressBound seems to indicate it's ~10 bytes every 32K, plus some for the
header. Anyway, it's better to err on the side of caution and trust the more
conservative zlib version instead. That implies 65477 as the maximum, I
think...

James

--
James Bonfield (jk...@sa...)  | Hora aderat briligi. Nunc et Slythia Tova
                              | Plurima gyrabant gymbolitare vabo;
A Staden Package developer:   | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.