Re: [Samtools-devel] Indexing a BAM Index (BAI)

I use the Java gzip library to un-gzip it, and it has a bug if there is
more than one block.  So for IGV it has to be plain gzipped, not bgzipped.
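
A minimal sketch of the multi-block issue, in Python rather than Java and
not taken from IGV. It assumes only that a bgzipped file is a series of
gzip blocks laid end to end; real BGZF blocks carry extra header fields
that are left out here. A reader that handles every block recovers all the
data, while a reader that stops after the first block does not.

```python
import gzip
import zlib

# Two independent gzip blocks back to back. This mimics the overall shape
# of a bgzipped file (assumption: real BGZF headers are omitted here).
block_a = gzip.compress(b"first block;")
block_b = gzip.compress(b"second block;")
multi_block = block_a + block_b

# A reader that follows the gzip format decodes every block in turn.
all_data = gzip.decompress(multi_block)

# A reader that decodes a single block and stops sees only the first one.
single = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip framing
first_only = single.decompress(multi_block)
leftover = single.unused_data  # the second block, still compressed

print(all_data)
print(first_only)
print(len(leftover), "bytes left undecoded")
```

Compressing the index as a single gzip stream sidesteps the problem for
any reader of the second kind.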

> Hi Jim,
>
> A bgzipped file is a valid gzipped file, so it should work either way.
>
> -Alec
>
> On 11/30/14, 10:01 PM, Jim Robinson wrote:
>> Hi,  I just pushed a simple change to IGV that should help with the
>> original problem.   IGV will now accept a remote gzipped index file,
>> with naming convention http://..../foo.bam.bai.gz.   The index should be
>> gzipped, not "bgzipped".   IGV will check for presence of the .gz file
>> first, so that both gzipped and non-gzipped bai files can coexist.
>>
>> This only works with remote resources (http & https),  not local files.
>>
>> Jim
>>
>>> Petr is right that keeping the virtual offsets in CSI is 
>>> overcomplicated. We should not go there.
>>>
>>> In general, there is a tension between the index resolution 
>>> (proportional to the index size) and the amount of data read per 
>>> query. When we have lower resolution, we have to linearly read more 
>>> data before reaching the data slot we want to retrieve. IIRC, James 
>>> showed that we need to read more data with the CRAM index. I doubt 
>>> there is a universally better solution.
>>>
>>> Heng
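
A toy model of the trade-off Heng describes, as a sketch only. It indexes
every `step`-th record of a sorted list, which is an assumption made for
illustration and not how BAI, CSI or the CRAM index actually bin data. A
larger `step` gives a smaller index but a longer linear scan per query.

```python
import bisect


def build_index(positions, step):
    """Keep one (start position, record number) entry per `step` records."""
    return [(positions[i], i) for i in range(0, len(positions), step)]


def records_scanned(positions, index, query):
    """Count records read linearly from the chosen index entry until the
    first record at or after `query` is reached."""
    starts = [start for start, _ in index]
    slot = max(bisect.bisect_right(starts, query) - 1, 0)
    i = index[slot][1]
    scanned = 0
    while i < len(positions) and positions[i] < query:
        i += 1
        scanned += 1
    return scanned


# Hypothetical data: 100,000 sorted records, one every 10 bases.
positions = list(range(0, 1_000_000, 10))
for step in (100, 10_000):
    index = build_index(positions, step)
    cost = records_scanned(positions, index, 123_995)
    print(f"step={step}: {len(index)} index entries, {cost} records scanned")
```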
>>>
>>> PS: I acknowledge that the BAM index frequently causes trouble,
>>> especially for remote access. The BigBED integrated index is
>>> friendlier and more advanced in this respect. Jim et al had remote
>>> access in mind when they designed BigBED. The BAM remote access is an
>>> afterthought suggested by Lincoln.
>>>
>>> On Nov 10, 2014, at 10:55, Vadim Zalunin <va...@eb...> wrote:
>>>
>>>> On 10/11/2014 15:34, Petr Danecek wrote:
>>>>> Hello Dan,
>>>>>
>>>>> sorry for the delay in responding.
>>>>>
>>>>> The compression of CSI makes the indexing of the index more difficult.
>>>>> We'd need to keep mappings to compressed blocks and to the uncompressed
>>>>> offsets within the blocks. This needs to be relative to the index
>>>>> header, because its compressed size is unknown at the time of writing
>>>>> the body.
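
For reference, the two numbers Petr mentions (a compressed block offset
and an uncompressed offset within that block) are what a BAM-style virtual
offset packs into one 64-bit value: the block offset in the upper 48 bits
and the within-block offset in the lower 16. The sketch below shows that
packing. The `resolve` helper and its `body_start` parameter are
hypothetical, added only to illustrate "relative to the index header";
nothing like them exists in CSI.

```python
def pack_voffset(coffset, uoffset):
    """Pack a compressed block offset and a within-block offset into one
    64-bit virtual offset (upper 48 bits / lower 16 bits)."""
    if not 0 <= coffset < 1 << 48:
        raise ValueError("coffset must fit in 48 bits")
    if not 0 <= uoffset < 1 << 16:
        raise ValueError("uoffset must fit in 16 bits")
    return (coffset << 16) | uoffset


def unpack_voffset(voffset):
    """Split a virtual offset back into (coffset, uoffset)."""
    return voffset >> 16, voffset & 0xFFFF


def resolve(voffset, body_start):
    """Hypothetical: turn an offset stored relative to the end of the index
    header into an absolute file position, once the compressed header size
    (`body_start`) is known."""
    coffset, uoffset = unpack_voffset(voffset)
    return body_start + coffset, uoffset


v = pack_voffset(1000, 42)
print(v, unpack_voffset(v), resolve(v, body_start=512))
```

Storing offsets relative to the header means the body can be written
before the header's compressed size is known, at the cost of one extra
addition on every lookup.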
>>>>>
>>>>> It is doable, but it increases the complexity and we are not sure if it
>>>>> is worth it: for high coverage data the CSI(v1) index is about 6.5MB,
>>>>> which is comparable to the size of a 100kbp BAM data chunk, and is a
>>>>> negligible fraction of the whole BAM file. I don't know what the
>>>>> typical access pattern in genome viewers is; how does the size of the
>>>>> transferred data compare to the size of the index?
>>>>>
>>>>> Also we expect CRAM to start replacing BAM soon, so this should
>>>>> probably become less of a problem in the near future.
>>>> Index file size for a CRAM file should be smaller than for a
>>>> corresponding BAM file because each container/slice is used as a
>>>> contiguous read, leaving the rest to the iterators. This is equivalent
>>>> to indexing every 10k (by default) reads as one long read. I wonder if
>>>> reduced resolution could be used for BAM files as well. This should not
>>>> affect readers if I'm not mistaken.
>>>>
>>>> Vadim
>>>>> Best wishes,
>>>>> Petr
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 2014-11-04 at 13:50 -0500, Dan Vanderkam wrote:
>>>>>> A 4.2M file is an improvement, but still quite large to pull in while
>>>>>> loading a visualization on a web page.
>>>>>>
>>>>>>
>>>>>> re: Heng's comment about CSI, it would be great if CSI included a list
>>>>>> of virtual offsets for each chromosome at the start of the file. This
>>>>>> would work best if the length of the index index (in bytes) were
>>>>>> encoded in the header before it. This would support the following
>>>>>> access pattern:
>>>>>>
>>>>>>
>>>>>> 1. HTTP request for first 8 bytes (to get index index length)
>>>>>> 2. HTTP request for the full index index
>>>>>>
>>>>>>
>>>>>> or, as an optimization
>>>>>>
>>>>>>
>>>>>> 1. HTTP request for the first, say, 64k (to hopefully grab the full
>>>>>> index index)
>>>>>> 2. HTTP request for the rest of the index index (if it's longer than
>>>>>> 64k)
>>>>>>
>>>>>>
>>>>>> Prefixing structured fields with their length in bytes is quite common
>>>>>> in binary formats, e.g. in the Google Protocol Buffers wire format.
>>>>>>
>>>>>>
>>>>>>     - Dan
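
Dan's optimized access pattern could be sketched roughly as below. The
file layout is hypothetical: it assumes the first 8 bytes hold the length
of the index index as a little-endian unsigned 64-bit integer, followed
immediately by the index index itself. No such field exists in BAI or CSI.
`http_range_reader` and `fetch_index_index` are names made up for this
sketch; the demo at the end uses an in-memory stand-in instead of a server.

```python
import struct
import urllib.request


def http_range_reader(url):
    """Return a function that fetches bytes [start, end) of `url` with an
    HTTP Range request (the server must support byte ranges)."""
    def read_range(start, end):
        request = urllib.request.Request(
            url, headers={"Range": f"bytes={start}-{end - 1}"})
        with urllib.request.urlopen(request) as response:
            return response.read()
    return read_range


def fetch_index_index(read_range, first_chunk=64 * 1024):
    """Fetch the length-prefixed index index in one request if it fits in
    `first_chunk` bytes, otherwise in two."""
    head = read_range(0, first_chunk)
    (length,) = struct.unpack_from("<Q", head, 0)  # hypothetical 8-byte prefix
    body = head[8:8 + length]
    if len(body) < length:
        # The first request was too short; fetch only the missing tail.
        body += read_range(8 + len(body), 8 + length)
    return body


# Demo against an in-memory "file" so the sketch runs without a server.
payload = bytes(range(256)) * 400  # 102,400 bytes, longer than 64k
blob = struct.pack("<Q", len(payload)) + payload + b"rest of the index"
requests_made = []


def memory_reader(start, end):
    requests_made.append((start, end))
    return blob[start:end]


result = fetch_index_index(memory_reader)
print(len(result), requests_made)
```

For a real file the reader would come from
`http_range_reader("http://..../foo.bam.bai")` or similar.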
>>>>>>
>>>>>> On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...>
>>>>>> wrote:
>>>>>>> On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...> wrote:
>>>>>>>> My group's BAI files have gotten quite large (10+MB) and are
>>>>>>>> proving to be a bottleneck when loading interactive
>>>>>>>> visualizations like IGV or BioDalliance. Downloading a 10MB
>>>>>>>> file takes many seconds, during which time the visualization
>>>>>>>> can't display anything.
>>>>>>> [...]
>>>>>>>> - Does CSI (instead of BAI) help with this?
>>>>>>> As Heng just mentioned in passing, CSI is compressed while BAI
>>>>>>> is not.  For example, for a 42G BAM file I just compared, the
>>>>>>> CSI index is half the size of the BAM index (4.2M v. 8.4M).
>>>>>>>
>>>>>>>     John
>>>>>>
>>>>>> ------------------------------------------------------------------------------ 
>>>>>>
>>>>>> _______________________________________________
>>>>>> Samtools-devel mailing list
>>>>>> Sam...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>>
>>>>>
>>>> -- 
>>>> Vadim Zalunin
>>>> European Bioinformatics Institute (EMBL-EBI)
>>>> European Molecular Biology Laboratory
>>>> Wellcome Trust Genome Campus
>>>> Hinxton
>>>> Cambridge CB10 1SD
>>>> United Kingdom
>>>> Tel: + 44 (0) 1223 494 614
>>>> Fax: + 44 (0) 1223 494 468
>>>>