Re: [Samtools-devel] Indexing a BAM Index (BAI)

Petr is right that keeping the virtual offsets in CSI is overcomplicated. We should not go there.

In general, there is a tension between the index resolution (proportional to the index size) and the amount of data read per query. With a lower resolution, we have to read more data linearly before reaching the records we want to retrieve. IIRC, James showed that we need to read more data with the CRAM index. I doubt there is a universally better solution.
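
A minimal sketch of this trade-off, assuming a toy index that keeps one entry per `every` sorted records. The function names and the evenly spaced start positions are made up for illustration; this is not the BAI, CSI, or CRAM index layout. It only shows that a coarser index has fewer entries but needs a longer linear scan per query.

```python
import bisect

def build_index(starts, every):
    """Keep one (start_position, record_number) entry per `every` records."""
    return [(starts[i], i) for i in range(0, len(starts), every)]

def records_scanned(index, starts, target):
    """Count records read linearly from the nearest preceding index entry
    until the first record starting at or after `target`."""
    keys = [pos for pos, _ in index]
    slot = max(bisect.bisect_right(keys, target) - 1, 0)
    i = index[slot][1]
    scanned = 0
    while i < len(starts) and starts[i] < target:
        i += 1
        scanned += 1
    return scanned

# Made-up data: 100,000 records with sorted start positions 0, 10, 20, ...
starts = list(range(0, 1_000_000, 10))
for every in (100, 1_000, 10_000):
    index = build_index(starts, every)
    # columns: records per entry, index entries, records scanned for one query
    print(every, len(index), records_scanned(index, starts, 999_985))
```

With this data, going from one entry per 100 records to one per 10,000 shrinks the index from 1000 entries to 10, while the scan for the same query grows from 99 records to 9999.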

Heng

PS: I acknowledge that the BAM index frequently causes trouble, especially for remote access. The BigBED integrated index is friendlier and more advanced in this respect. Jim et al. had remote access in mind when they designed BigBED. Remote access to BAM was an afterthought suggested by Lincoln.

On Nov 10, 2014, at 10:55, Vadim Zalunin <va...@eb...> wrote:

> On 10/11/2014 15:34, Petr Danecek wrote:
>> Hello Dan,
>> 
>> sorry for the delay in responding.
>> 
>> The compression of CSI makes the indexing of the index more difficult.
>> We'd need to keep mappings to compressed blocks and to the uncompressed
>> offsets within the blocks. This needs to be relative to the index
>> header, because its compressed size is unknown at the time of writing
>> the body.
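
A minimal sketch of the bookkeeping described above, assuming the BGZF convention of packing the compressed block start into the upper 48 bits and the offset within the uncompressed block into the lower 16 bits. The helper `to_absolute` and its `body_start` parameter are hypothetical; they only illustrate storing block offsets relative to the end of the index header, whose compressed size is not known until it has been written.

```python
def pack_virtual_offset(block_start, within_block):
    """Pack a compressed block start and an offset inside the
    uncompressed block into one 64-bit virtual offset."""
    if not 0 <= block_start < (1 << 48):
        raise ValueError("block start must fit in 48 bits")
    if not 0 <= within_block < (1 << 16):
        raise ValueError("offset within block must fit in 16 bits")
    return (block_start << 16) | within_block

def unpack_virtual_offset(voffset):
    """Split a virtual offset into (block_start, within_block)."""
    return voffset >> 16, voffset & 0xFFFF

def to_absolute(voffset, body_start):
    """Hypothetical helper: treat block_start as relative to the index body
    and shift it by the file position where the body begins."""
    block_start, within_block = unpack_virtual_offset(voffset)
    return body_start + block_start, within_block

v = pack_virtual_offset(1000, 42)
print(unpack_virtual_offset(v))        # block start and in-block offset
print(to_absolute(v, body_start=512))  # same entry once the header size is known
```

A reader following such a mapping needs two steps per lookup: seek to the compressed block, decompress it, then skip to the in-block offset. That is the extra complexity being weighed here.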
>> 
>> It is doable, but it increases the complexity and we are not sure if it
>> is worth it: For high coverage data the CSI(v1) index is about 6.5MB,
>> which is comparable to the size of a 100 kbp BAM data chunk, and is a
>> negligible fraction of the whole BAM file. I don't know what the
>> typical access pattern in genome viewers is; how does the size of the
>> transferred data compare to the size of the index?
>> 
>> Also, we expect CRAM to start replacing BAM soon, so this should
>> probably become less of a problem in the near future.
> The index for a CRAM file should be smaller than for the
> corresponding BAM file because each container/slice is indexed as one
> contiguous read, leaving the rest to the iterators. This is equivalent
> to indexing every 10k (by default) reads as one long read. I wonder if
> reduced resolution could be used for BAM files as well. This should not
> affect readers, if I'm not mistaken.
> 
> Vadim
>> 
>> Best wishes,
>> Petr
>> 
>> 
>> 
>> On Tue, 2014-11-04 at 13:50 -0500, Dan Vanderkam wrote:
>>> A 4.2M file is an improvement, but still quite large to pull in while
>>> loading a visualization on a web page.
>>> 
>>> 
>>> re: Heng's comment about CSI, it would be great if CSI included a list
>>> of virtual offsets for each chromosome at the start of the file. This
>>> would work best if the length of the index index (in bytes) were
>>> encoded in the header before it. This would support the following
>>> access pattern:
>>> 
>>> 
>>> 1. HTTP request for first 8 bytes (to get index index length)
>>> 2. HTTP request for the full index index
>>> 
>>> 
>>> or, as an optimization
>>> 
>>> 
>>> 1. HTTP request for the first, say, 64k (to hopefully grab the full
>>> index index)
>>> 2. HTTP request for the rest of the index index (if it's longer than
>>> 64k)
>>> 
>>> 
>>> Prefixing structured fields with their length in bytes is quite common
>>> in binary formats, e.g. in the Google Protocol Buffers wire format.
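
A minimal sketch of the optimized access pattern above, assuming the length prefix is an 8-byte little-endian unsigned integer (the proposal does not fix an encoding). The `fetch(start, end)` callable is a hypothetical stand-in for an HTTP range request returning bytes `[start, end)`; over HTTP this would correspond to a `Range: bytes=start-(end-1)` header, since HTTP byte ranges are inclusive.

```python
import struct

PREFIX = 8               # assumed size of the length field, in bytes
FIRST_CHUNK = 64 * 1024  # speculative first request, as suggested above

def read_index_index(fetch):
    """Return the length-prefixed block using at most two fetches."""
    head = fetch(0, FIRST_CHUNK)
    (length,) = struct.unpack_from("<Q", head, 0)  # assumed little-endian u64
    end = PREFIX + length
    body = head[PREFIX:end]
    if end > len(head):
        # The block is longer than the first chunk: fetch the remainder.
        body += fetch(len(head), end)
    return body

def make_fetcher(data, log):
    """In-memory stand-in for a remote file; records each requested range."""
    def fetch(start, end):
        log.append((start, end))
        return data[start:end]
    return fetch

payload = bytes(100_000)  # a made-up index index larger than 64k
remote = struct.pack("<Q", len(payload)) + payload + b"rest of the index"
log = []
block = read_index_index(make_fetcher(remote, log))
print(len(block), log)
```

When the block fits in the first 64k, the second request is skipped entirely, which is the point of the optimization: one round trip in the common case, two otherwise.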
>>> 
>>> 
>>>   - Dan
>>> 
>>> On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...>
>>> wrote:
>>>         On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...>
>>>         wrote:
>>>> My group's BAI files have gotten quite large (10+MB) and are
>>>> proving to be a bottleneck when loading interactive
>>>> visualizations like IGV or BioDalliance. Downloading a 10MB
>>>> file takes many seconds, during which time the visualization
>>>> can't display anything.
>>>> [...]
>>>> 
>>>> - Does CSI (instead of BAI) help with this?
>>> 
>>>         As Heng just mentioned in passing, CSI is compressed while BAI
>>>         is not.  For example, for a 42G BAM file I just compared, the
>>>         CSI index is half the size of the BAM index (4.2M v. 8.4M).
>>> 
>>>             John
>>> 
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Samtools-devel mailing list
>>> Sam...@li...
>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>> 
>> 
>> 
>> 
> 
> 
> -- 
> Vadim Zalunin
> European Bioinformatics Institute (EMBL-EBI)
> European Molecular Biology Laboratory
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> United Kingdom
> Tel: + 44 (0) 1223 494 614
> Fax: + 44 (0) 1223 494 468