Re: [Samtools-devel] Indexing a BAM Index (BAI)


A 4.2M file is an improvement, but still quite large to pull in while
loading a visualization on a web page.

Re: Heng's comment about CSI: it would be great if CSI included a list of
virtual offsets for each chromosome at the start of the file. This would
work best if the length of this "index index" (in bytes) were encoded in the
header before it. This would support the following access pattern:

1. HTTP request for first 8 bytes (to get index index length)
2. HTTP request for the full index index

or, as an optimization

1. HTTP request for the first, say, 64k (to hopefully grab the full index
index)
2. HTTP request for the rest of the index index (if it's longer than 64k)
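
As a rough sketch of the optimized pattern above, assuming a purely
hypothetical layout (nothing BAI or CSI defines today) in which the file
begins with an 8-byte little-endian length followed by that many bytes of
index index. The `fetch` callable is a stand-in for an HTTP Range request, so
the logic can be tried without a server:

```python
import struct
from typing import Callable

# ASSUMPTION: hypothetical layout, not part of the BAI or CSI formats.
# Bytes 0..7 hold N as a little-endian uint64; the next N bytes are the
# "index index".
LENGTH_FIELD = struct.Struct("<Q")

# fetch(start, end) returns bytes [start, end) of the remote file and may
# return fewer bytes if the file is shorter than `end`.
Fetch = Callable[[int, int], bytes]


def read_index_index(fetch: Fetch, first_chunk: int = 64 * 1024) -> bytes:
    """Return the index index using one request, or two if it is large."""
    head = fetch(0, first_chunk)
    if len(head) < LENGTH_FIELD.size:
        raise ValueError("file too short to hold the length field")
    (length,) = LENGTH_FIELD.unpack_from(head, 0)

    end = LENGTH_FIELD.size + length
    body = head[LENGTH_FIELD.size:end]
    if end > len(head):
        # The first chunk did not cover everything: fetch only the remainder.
        body += fetch(len(head), end)
    if len(body) != length:
        raise ValueError("truncated index index")
    return body
```

Over HTTP, `fetch(start, end)` would correspond to a request carrying a
`Range: bytes=<start>-<end - 1>` header, since HTTP byte ranges are inclusive
at both ends.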

Prefixing structured fields with their length in bytes is quite common in
binary formats, e.g. in the Google Protocol Buffers wire format
<https://developers.google.com/protocol-buffers/docs/encoding>.
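
For illustration, here is a minimal sketch of length-prefixing using the
base-128 varint that Protocol Buffers uses for lengths. The `frame` and
`unframe` helpers are made up for this example; a real protobuf
length-delimited field also carries a tag before the length, which is left
out here:

```python
from typing import Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (low 7 bits first)."""
    if n < 0:
        raise ValueError("varint lengths must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position just past the varint)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def frame(payload: bytes) -> bytes:
    """Hypothetical helper: prefix a payload with its length in bytes."""
    return encode_varint(len(payload)) + payload


def unframe(buf: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Hypothetical helper: read one payload, return it and the next offset."""
    length, pos = decode_varint(buf, pos)
    return buf[pos:pos + length], pos + length
```

A reader that sees the length first knows exactly how many bytes to request
next, which is what makes the two-request pattern above possible.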

  - Dan

On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...> wrote:

> On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...> wrote:
> > My group's BAI files have gotten quite large (10+MB) and are proving to
> be a bottleneck when loading interactive visualizations like IGV or
> BioDalliance. Downloading a 10MB file takes many seconds, during which time
> the visualization can't display anything.
> [...]
> >
> > - Does CSI (instead of BAI) help with this?
>
> As Heng just mentioned in passing, CSI is compressed while BAI is not.
> For example, for a 42G BAM file I just compared, the CSI index is half the
> size of the BAM index (4.2M v. 8.4M).
>
>     John
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>