Re: [Samtools-devel] Indexing a BAM Index (BAI)

Hello Dan,

Sorry for the delay in responding.

CSI's compression makes indexing the index more difficult. We would
need to keep mappings both to the compressed blocks and to the
uncompressed offsets within those blocks. These mappings would have to
be relative to the index header, because the header's compressed size
is unknown at the time the body is written.
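
To make the two-part mapping concrete, here is a minimal sketch. It
assumes the BGZF virtual offset convention already used by BAM (the
compressed block start in the high 48 bits, the offset within the
uncompressed block in the low 16 bits). The helper names and the idea of
rebasing header-relative offsets after the header has been compressed
are hypothetical illustrations, not part of CSI, and the rebasing step
assumes the header ends on a block boundary.

```python
def pack_voffset(coffset: int, uoffset: int) -> int:
    """Pack a BGZF-style virtual offset from its two parts."""
    if not 0 <= coffset < (1 << 48):
        raise ValueError("compressed offset must fit in 48 bits")
    if not 0 <= uoffset < (1 << 16):
        raise ValueError("uncompressed offset must fit in 16 bits")
    return (coffset << 16) | uoffset


def unpack_voffset(voffset: int) -> tuple[int, int]:
    """Split a virtual offset into (compressed block start, offset in block)."""
    return voffset >> 16, voffset & 0xFFFF


def rebase_voffset(relative: int, header_compressed_size: int) -> int:
    """Turn a header-relative virtual offset into a file-absolute one.

    Hypothetical helper: it only shifts the compressed block start, so it
    assumes the compressed header ends exactly on a block boundary.
    """
    coffset, uoffset = unpack_voffset(relative)
    return pack_voffset(coffset + header_compressed_size, uoffset)
```

A writer could record body offsets relative to the header while writing,
and a reader would call `rebase_voffset` once the header's compressed
size is known.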

It is doable, but it increases the complexity and we are not sure it is
worth it. For high-coverage data the CSI (v1) index is about 6.5MB,
which is comparable to the size of a 100kbp chunk of BAM data and is a
negligible fraction of the whole BAM file. I don't know what the
typical access pattern in genome viewers is: how does the size of the
transferred data compare to the size of the index?

Also, we expect CRAM to start replacing BAM soon, so this should
probably become less of a problem in the near future.

Best wishes,
Petr

On Tue, 2014-11-04 at 13:50 -0500, Dan Vanderkam wrote:
> A 4.2M file is an improvement, but still quite large to pull in while
> loading a visualization on a web page.
> 
> 
> re: Heng's comment about CSI, it would be great if CSI included a list
> of virtual offsets for each chromosome at the start of the file. This
> would work best if the length of the index index (in bytes) were
> encoded in the header before it. This would support the following
> access pattern:
> 
> 
> 1. HTTP request for first 8 bytes (to get index index length)
> 2. HTTP request for the full index index
> 
> 
> or, as an optimization
> 
> 
> 1. HTTP request for the first, say, 64k (to hopefully grab the full
> index index)
> 2. HTTP request for the rest of the index index (if it's longer than
> 64k)
> 
> 
> Prefixing structured fields with their length in bytes is quite common
> in binary formats, e.g. in the Google Protocol Buffers wire format.
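
The access pattern proposed above can be sketched as follows. This is
only an illustration under stated assumptions: the layout (an 8-byte
little-endian length followed immediately by the "index index") is
hypothetical and not part of CSI, and `http_fetcher` and
`read_index_index` are made-up names. The fetcher uses a standard HTTP
Range header via urllib.

```python
import struct
import urllib.request
from typing import Callable

# fetch(start, end) returns the bytes in the half-open range [start, end).
Fetch = Callable[[int, int], bytes]


def http_fetcher(url: str) -> Fetch:
    """Build a fetcher that issues HTTP Range requests against url."""
    def fetch(start: int, end: int) -> bytes:
        # HTTP byte ranges are inclusive at both ends, hence end - 1.
        headers = {"Range": f"bytes={start}-{end - 1}"}
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch


def read_index_index(fetch: Fetch, first_chunk: int = 64 * 1024) -> bytes:
    """Read a length-prefixed block using at most two range requests."""
    head = fetch(0, first_chunk)
    if len(head) < 8:
        raise ValueError("file too short to hold the 8-byte length prefix")
    # Assumed layout: unsigned 64-bit little-endian length, then the payload.
    (length,) = struct.unpack_from("<Q", head, 0)
    body = head[8:8 + length]
    if len(body) < length:
        # The optimistic first request was too small; fetch the remainder.
        body += fetch(len(head), 8 + length)
    if len(body) < length:
        raise ValueError("index index is truncated")
    return body
```

With `first_chunk=8` this reduces to the first variant (one request for
the length, one for the payload); the 64k default is the optimistic
variant that usually needs a single request.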
> 
> 
>   - Dan
> 
> On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...>
> wrote:
>         On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...>
>         wrote:
>         > My group's BAI files have gotten quite large (10+MB) and are
>         > proving to be a bottleneck when loading interactive
>         > visualizations like IGV or BioDalliance. Downloading a 10MB
>         > file takes many seconds, during which time the visualization
>         > can't display anything.
>         [...]
>         >
>         > - Does CSI (instead of BAI) help with this?
>         
>         As Heng just mentioned in passing, CSI is compressed while BAI
>         is not.  For example, for a 42G BAM file I just compared, the
>         CSI index is half the size of the BAM index (4.2M v. 8.4M).
>         
>             John
>         
>         --
>          The Wellcome Trust Sanger Institute is operated by Genome Research
>          Limited, a charity registered in England with number 1021457 and a
>          company registered in England with number 2742969, whose registered
>          office is 215 Euston Road, London, NW1 2BE.
>         