Re: [Samtools-devel] Indexing a BAM Index (BAI)


On 10/11/2014 15:34, Petr Danecek wrote:
> Hello Dan,
>
> sorry for the delay in responding.
>
> The compression of CSI makes the indexing of the index more difficult.
> We'd need to keep mappings to compressed blocks and to the uncompressed
> offsets within the blocks. This needs to be relative to the index
> header, because its compressed size is unknown at the time of writing
> the body.
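
For reference, a BGZF virtual offset already packs these two numbers into one 64-bit value: the compressed block offset goes in the upper 48 bits and the uncompressed offset within the block in the lower 16. Below is a minimal sketch of that packing. The function names are made up for illustration, and `to_absolute` is only an assumption about how a header-relative offset might be turned into an absolute one once the compressed header size is known. It is not taken from the CSI specification.

```python
def make_virtual_offset(block_offset: int, within_block: int) -> int:
    """Pack a compressed block offset and an in-block offset into 64 bits."""
    if not 0 <= block_offset < (1 << 48):
        raise ValueError("block offset must fit in 48 bits")
    if not 0 <= within_block < (1 << 16):
        raise ValueError("in-block offset must fit in 16 bits")
    return (block_offset << 16) | within_block


def split_virtual_offset(voffset: int):
    """Return (compressed block offset, uncompressed offset within the block)."""
    return voffset >> 16, voffset & 0xFFFF


def to_absolute(relative_voffset: int, header_compressed_size: int) -> int:
    """Hypothetical helper: shift a header-relative virtual offset by the
    compressed size of the header, which is only known after writing it."""
    block, within = split_virtual_offset(relative_voffset)
    return make_virtual_offset(block + header_compressed_size, within)
```

Storing the body offsets relative to the end of the header means they can be written before the header is compressed, at the cost of one addition per lookup.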
>
> It is doable, but it increases the complexity and we are not sure it
> is worth it: for high-coverage data the CSI (v1) index is about 6.5MB,
> which is comparable to the size of a 100kbp BAM data chunk and is a
> negligible fraction of the whole BAM file. I don't know what the
> typical access pattern in genome viewers is; how does the size of the
> transferred data compare to the size of the index?
>
> Also we expect CRAM to start replacing BAM soon, so this should
> probably become less of a problem in the near future.
The index for a CRAM file should be smaller than for the
corresponding BAM file because each container/slice is indexed as a
single contiguous read, leaving finer positioning to the iterators.
This is equivalent to indexing every 10k reads (by default) as one
long read. I wonder if reduced resolution could be used for BAM files
as well. If I'm not mistaken, this should not affect readers.
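
To make the reduced-resolution idea concrete, here is a rough sketch. It assumes a single reference, reads given in file order as (start, end, file_offset) tuples with half-open coordinates, and invented names throughout. It is not how htsjdk or samtools build their indexes, only an illustration of grouping every N reads into one index entry.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    start: int   # smallest alignment start in the group
    end: int     # largest alignment end in the group
    offset: int  # file offset of the first read in the group


def build_coarse_index(reads: List[Tuple[int, int, int]],
                       reads_per_entry: int = 10_000) -> List[Entry]:
    """Index each run of `reads_per_entry` reads as one long pseudo-read."""
    entries = []
    for i in range(0, len(reads), reads_per_entry):
        group = reads[i:i + reads_per_entry]
        entries.append(Entry(min(r[0] for r in group),
                             max(r[1] for r in group),
                             group[0][2]))
    return entries


def candidate_offsets(index: List[Entry], start: int, end: int) -> List[int]:
    """Offsets of groups overlapping [start, end); the reader (iterator)
    still has to filter the individual reads inside each group."""
    return [e.offset for e in index if e.start < end and e.end > start]
```

The index shrinks roughly in proportion to `reads_per_entry`, while a query may have to decode up to that many extra reads per matching group.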

Vadim
>
> Best wishes,
> Petr
>
>
>
> On Tue, 2014-11-04 at 13:50 -0500, Dan Vanderkam wrote:
>> A 4.2M file is an improvement, but still quite large to pull in while
>> loading a visualization on a web page.
>>
>>
>> re: Heng's comment about CSI, it would be great if CSI included a list
>> of virtual offsets for each chromosome at the start of the file. This
>> would work best if the length of the index index (in bytes) were
>> encoded in the header before it. This would support the following
>> access pattern:
>>
>>
>> 1. HTTP request for first 8 bytes (to get index index length)
>> 2. HTTP request for the full index index
>>
>>
>> or, as an optimization
>>
>>
>> 1. HTTP request for the first, say, 64k (to hopefully grab the full
>> index index)
>> 2. HTTP request for the rest of the index index (if it's longer than
>> 64k)
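
The optimized access pattern above can be sketched as follows. This assumes a hypothetical layout in which the first 8 bytes hold the length of the index index as a little-endian unsigned 64-bit integer, immediately followed by the index index itself. Neither BAI nor CSI is laid out this way today, and all names here are illustrative.

```python
import struct
import urllib.request
from typing import Callable

PREFIX_SIZE = 8          # assumed: little-endian uint64 length prefix
FIRST_CHUNK = 64 * 1024  # speculative first request size


def http_range_fetcher(url: str) -> Callable[[int, int], bytes]:
    """Return fetch(start, end) that reads bytes [start, end) via HTTP Range."""
    def fetch(start: int, end: int) -> bytes:
        request = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end - 1}"})
        with urllib.request.urlopen(request) as response:
            return response.read()
    return fetch


def load_index_index(fetch: Callable[[int, int], bytes],
                     first_chunk: int = FIRST_CHUNK) -> bytes:
    """Fetch the length-prefixed index index in one or two requests."""
    head = fetch(0, first_chunk)
    if len(head) < PREFIX_SIZE:
        raise ValueError("response too short to contain the length prefix")
    (length,) = struct.unpack_from("<Q", head, 0)
    total = PREFIX_SIZE + length
    body = head[PREFIX_SIZE:total]
    if len(head) < total:
        # The first chunk did not cover everything: request the remainder.
        body += fetch(len(head), total)
    if len(body) != length:
        raise ValueError("index index is truncated")
    return body
```

Passing the fetcher in as a callable keeps the logic testable against an in-memory byte string; `http_range_fetcher` is only one possible implementation and relies on the server honouring Range requests.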
>>
>>
>> Prefixing structured fields with their length in bytes is quite common
>> in binary formats, e.g. in the Google Protocol Buffers wire format.
>>
>>
>>    - Dan
>>
>> On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...>
>> wrote:
>>          On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...> wrote:
>>          > My group's BAI files have gotten quite large (10+MB) and are
>>          > proving to be a bottleneck when loading interactive
>>          > visualizations like IGV or BioDalliance. Downloading a 10MB
>>          > file takes many seconds, during which time the visualization
>>          > can't display anything.
>>          [...]
>>          >
>>          > - Does CSI (instead of BAI) help with this?
>>          
>>          As Heng just mentioned in passing, CSI is compressed while BAI
>>          is not.  For example, for a 42G BAM file I just compared, the
>>          CSI index is half the size of the BAM index (4.2M v. 8.4M).
>>          
>>              John
>>          
>>
>>


-- 
Vadim Zalunin
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Tel: + 44 (0) 1223 494 614
Fax: + 44 (0) 1223 494 468