Re: [Samtools-devel] Indexing a BAM Index (BAI)

Shouldn't that be handled by HTTP? I.e., if IGV sets "Accept-Encoding: gzip"
in the request for the bai file, then the remote HTTP server is free to
respond with "Content-Encoding: gzip" and compress the response.
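
A minimal Python sketch of that negotiation, for illustration only. The
function names (build_index_request, decode_body, fetch_index) are
hypothetical and not part of IGV or samtools, and the sketch assumes the
server is configured to compress .bai responses, which may not hold everywhere.

```python
import gzip
import urllib.request


def build_index_request(url):
    """Build a GET request that advertises gzip support to the server."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Undo the transfer compression, if the server applied any."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(body)
    # No Content-Encoding (or an unrelated one): return the bytes unchanged.
    return body


def fetch_index(url):
    """Fetch an index file, handling both plain and gzip-encoded responses."""
    with urllib.request.urlopen(build_index_request(url)) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

With this approach a single foo.bam.bai on the server would be enough; whether
the bytes travel compressed is decided per request by the two headers.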
On Sun, Nov 30, 2014 at 10:01 PM Jim Robinson <jro...@br...>
wrote:

> Hi,  I just pushed a simple change to IGV that should help with the
> original problem.   IGV will now accept a remote gzipped index file,
> with naming convention http://..../foo.bam.bai.gz.   The index should be
> gzipped, not "bgzipped".   IGV will check for presence of the .gz file
> first, so that both gzipped and non-gzipped bai files can coexist.
>
> This only works with remote resources (http & https),  not local files.
>
> Jim
>
> > Petr is right that keeping the virtual offsets in CSI is
> > overcomplicated. We should not go there.
> >
> > In general, there is a tension between the index resolution
> > (proportional to the index size) and the amount of data read per query.
> > When we have lower resolution, we have to linearly read more data before
> > reaching the data slot we want to retrieve. IIRC, James showed that we need
> > to read more data with the CRAM index. I doubt there is a universally
> > better solution.
> >
> > Heng
> >
> > PS: I acknowledge that the BAM index frequently causes trouble,
> > especially for remote access. The BigBED integrated index is friendlier
> > and more advanced in this respect. Jim et al. had remote access in mind when
> > they designed BigBED. The BAM remote access is an afterthought suggested by
> > Lincoln.
> >
> > On Nov 10, 2014, at 10:55, Vadim Zalunin <va...@eb...> wrote:
> >
> >> On 10/11/2014 15:34, Petr Danecek wrote:
> >>> Hello Dan,
> >>>
> >>> sorry for the delay in responding.
> >>>
> >>> The compression of CSI makes the indexing of the index more difficult.
> >>> We'd need to keep mappings to compressed blocks and to the uncompressed
> >>> offsets within the blocks. This needs to be relative to the index
> >>> header, because its compressed size is unknown at the time of writing
> >>> the body.
> >>>
> >>> It is doable, but it increases the complexity and we are not sure if it
> >>> is worth it: For high coverage data the CSI(v1) index is about 6.5MB,
> >>> which is comparable to the size of a 100kbp bam data chunk, and is a
> >>> negligible fraction of the whole BAM file. I don't know what is the
> >>> typical access pattern in genome viewers; how does the size of the
> >>> transferred data compare to the size of the index?
> >>>
> >>> Also, we expect CRAM to start replacing BAM soon, so this should
> >>> probably become less of a problem in the near future.
> >> Index file size for a CRAM file should be smaller than for a
> >> corresponding BAM file because each container/slice is used as a
> >> contiguous read, leaving the rest to the iterators. This is equivalent
> >> to indexing every 10k (by default) reads as one long read. I wonder if
> >> reduced resolution could be used for BAM files as well. This should not
> >> affect readers if I'm not mistaken.
> >>
> >> Vadim
> >>> Best wishes,
> >>> Petr
> >>>
> >>>
> >>>
> >>> On Tue, 2014-11-04 at 13:50 -0500, Dan Vanderkam wrote:
> >>>> A 4.2M file is an improvement, but still quite large to pull in while
> >>>> loading a visualization on a web page.
> >>>>
> >>>>
> >>>> re: Heng's comment about CSI, it would be great if CSI included a list
> >>>> of virtual offsets for each chromosome at the start of the file. This
> >>>> would work best if the length of the index index (in bytes) were
> >>>> encoded in the header before it. This would support the following
> >>>> access pattern:
> >>>>
> >>>>
> >>>> 1. HTTP request for first 8 bytes (to get index index length)
> >>>> 2. HTTP request for the full index index
> >>>>
> >>>>
> >>>> or, as an optimization
> >>>>
> >>>>
> >>>> 1. HTTP request for the first, say, 64k (to hopefully grab the full
> >>>> index index)
> >>>> 2. HTTP request for the rest of the index index (if it's longer than
> >>>> 64k)
> >>>>
> >>>>
> >>>> Prefixing structured fields with their length in bytes is quite common
> >>>> in binary formats, e.g. in the Google Protocol Buffers wire format.
> >>>>
> >>>>
> >>>>    - Dan
> >>>>
> >>>> On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...>
> >>>> wrote:
> >>>>          On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...>
> >>>>          wrote:
> >>>>> My group's BAI files have gotten quite large (10+MB) and are
> >>>>> proving to be a bottleneck when loading interactive
> >>>>> visualizations like IGV or BioDalliance. Downloading a 10MB
> >>>>> file takes many seconds, during which time the visualization
> >>>>> can't display anything.
> >>>>          [...]
> >>>>> - Does CSI (instead of BAI) help with this?
> >>>>          As Heng just mentioned in passing, CSI is compressed while BAI
> >>>>          is not.  For example, for a 42G BAM file I just compared, the
> >>>>          CSI index is half the size of the BAM index (4.2M v. 8.4M).
> >>>>
> >>>>              John
> >>>>
> >>>>          --
> >>>>           The Wellcome Trust Sanger Institute is operated by Genome
> >>>>          Research
> >>>>           Limited, a charity registered in England with number 1021457
> >>>>          and a
> >>>>           company registered in England with number 2742969, whose
> >>>>          registered
> >>>>           office is 215 Euston Road, London, NW1 2BE.
> >>>>
> >>>>
> >>>>
> >>>> ------------------------------------------------------------------------------
> >>>> _______________________________________________
> >>>> Samtools-devel mailing list
> >>>> Sam...@li...
> >>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
> >>>
> >>>
> >>>
> >>
> >> --
> >> Vadim Zalunin
> >> European Bioinformatics Institute (EMBL-EBI)
> >> European Molecular Biology Laboratory
> >> Wellcome Trust Genome Campus
> >> Hinxton
> >> Cambridge CB10 1SD
> >> United Kingdom
> >> Tel: + 44 (0) 1223 494 614
> >> Fax: + 44 (0) 1223 494 468
> >>
> >>
> >
>
>
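
For reference, the two-request access pattern Dan proposes in the quoted
thread could be sketched roughly as follows. This is only an illustration
under stated assumptions: CSI has no such "index index" today, the layout used
here (an 8-byte little-endian length followed by that many bytes) is
hypothetical, the names http_range_fetcher and read_index_index are made up for
the example, and the server is assumed to support HTTP Range requests.

```python
import struct
import urllib.request

FIRST_CHUNK = 64 * 1024  # optimistic first read, as suggested in the thread


def http_range_fetcher(url):
    """Return a function that fetches bytes [start, end) of url via Range."""
    def fetch(start, end):
        req = urllib.request.Request(
            url, headers={"Range": "bytes=%d-%d" % (start, end - 1)})
        with urllib.request.urlopen(req) as resp:
            return resp.read()
    return fetch


def read_index_index(fetch, first_chunk=FIRST_CHUNK):
    """Read a length-prefixed block using at most two range requests.

    Assumed (hypothetical) layout: an 8-byte little-endian length N,
    followed by N bytes of "index index".
    """
    head = fetch(0, first_chunk)
    if len(head) < 8:
        raise ValueError("file too short to contain the length prefix")
    (length,) = struct.unpack_from("<Q", head, 0)
    end = 8 + length
    body = head[8:end]
    if end > len(head):
        # The first chunk did not cover everything: fetch the remainder.
        body += fetch(len(head), end)
    if len(body) != length:
        raise ValueError("index index is truncated")
    return body
```

Passing the fetch function in as a parameter keeps the parsing logic separate
from the transport, so the same code could be exercised against an in-memory
buffer or against http_range_fetcher("http://.../foo.bam.csi").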