From: Vadim Z. <va...@eb...> - 2014-11-10 15:55:57
On 10/11/2014 15:34, Petr Danecek wrote:
> Hello Dan,
>
> sorry for the delay in responding.
>
> The compression of CSI makes the indexing of the index more difficult.
> We'd need to keep mappings to compressed blocks and to the uncompressed
> offsets within the blocks. This needs to be relative to the index
> header, because its compressed size is unknown at the time of writing
> the body.
>
> It is doable, but it increases the complexity and we are not sure if it
> is worth it: For high coverage data the CSI(v1) index is about 6.5MB,
> which is comparable to the size of a 100kbp bam data chunk, and is a
> negligible fraction of the whole BAM file. I don't know what is the
> typical access pattern in genome viewers; how does the size of the
> transferred data compare to the size of the index?
>
> Also we expect CRAM to start replacing BAM soon, so this probably
> should become less of a problem in near future.

Index file size for a CRAM file should be smaller than for a
corresponding BAM file because each container/slice is used as a
contiguous read, leaving the rest to the iterators. This is equivalent
to indexing every 10k (by default) reads as one long read. I wonder if
reduced resolution could be used for BAM files as well. This should not
affect readers if I'm not mistaken.

Vadim

>
> Best wishes,
> Petr
>
>
>
> On Tue, 2014-11-04 at 13:50 -0500, Dan Vanderkam wrote:
>> A 4.2M file is an improvement, but still quite large to pull in while
>> loading a visualization on a web page.
>>
>>
>> re: Heng's comment about CSI, it would be great if CSI included a list
>> of virtual offsets for each chromosome at the start of the file. This
>> would work best if the length of the index index (in bytes) were
>> encoded in the header before it. This would support the following
>> access pattern:
>>
>>
>> 1. HTTP request for first 8 bytes (to get index index length)
>> 2. HTTP request for the full index index
>>
>>
>> or, as an optimization
>>
>>
>> 1. HTTP request for the first, say, 64k (to hopefully grab the full
>> index index)
>> 2. HTTP request for the rest of the index index (if it's longer than
>> 64k)
>>
>>
>> Prefixing structured fields with their length in bytes is quite common
>> in binary formats, e.g. in the google protocol buffer wire format.
>>
>>
>> - Dan
>>
>> On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...>
>> wrote:
>>         On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...>
>>         wrote:
>>         > My group's BAI files have gotten quite large (10+MB) and are
>>         proving to be a bottleneck when loading interactive
>>         visualizations like IGV or BioDalliance. Downloading a 10MB
>>         file takes many seconds, during which time the visualization
>>         can't display anything.
>>         [...]
>>         >
>>         > - Does CSI (instead of BAI) help with this?
>>
>>         As Heng just mentioned in passing, CSI is compressed while BAI
>>         is not. For example, for a 42G BAM file I just compared, the
>>         CSI index is half the size of the BAM index (4.2M v. 8.4M).
>>
>>             John
>>
>>         --
>>          The Wellcome Trust Sanger Institute is operated by Genome
>>         Research
>>          Limited, a charity registered in England with number 1021457
>>         and a
>>          company registered in England with number 2742969, whose
>>         registered
>>          office is 215 Euston Road, London, NW1 2BE.
>>
>>
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Samtools-devel mailing list
>> Sam...@li...
>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>
>
>
>

--
Vadim Zalunin
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge
CB10 1SD
United Kingdom
Tel: + 44 (0) 1223 494 614
Fax: + 44 (0) 1223 494 468
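
For reference, the access pattern Dan proposes in the quoted message
(read a length prefix, then the "index index", ideally in one request)
could be sketched roughly as below. This is only an illustration under
assumptions: the layout (an 8-byte little-endian length followed by the
payload) is hypothetical and is not the actual CSI format, and the names
fetch_index_index, read_range and make_memory_reader are made up here.
read_range stands in for an HTTP range request; an in-memory reader is
used so the sketch runs without a network.

```python
import struct
from typing import Callable, List, Tuple

# Hypothetical layout, NOT the real CSI format: an 8-byte little-endian
# length N, followed by N bytes of "index index".
LENGTH_PREFIX = struct.Struct("<Q")


def fetch_index_index(read_range: Callable[[int, int], bytes],
                      first_chunk: int = 64 * 1024) -> bytes:
    """Return the index index using at most two range reads.

    read_range(start, length) is assumed to return up to `length` bytes
    beginning at byte offset `start` (like an HTTP Range request).
    """
    # First request: speculatively grab the first chunk (64k by default).
    head = read_range(0, first_chunk)
    if len(head) < LENGTH_PREFIX.size:
        raise ValueError("file too short to hold the length prefix")
    (n,) = LENGTH_PREFIX.unpack_from(head, 0)

    # Whatever part of the payload already arrived with the first chunk.
    body = head[LENGTH_PREFIX.size:LENGTH_PREFIX.size + n]

    # Second request, only if the payload is longer than the first chunk.
    if len(body) < n:
        start = LENGTH_PREFIX.size + len(body)
        body += read_range(start, n - len(body))
    if len(body) != n:
        raise ValueError("index index is truncated")
    return body


def make_memory_reader(blob: bytes):
    """Build a read_range over an in-memory blob and record each call."""
    calls: List[Tuple[int, int]] = []

    def read_range(start: int, length: int) -> bytes:
        calls.append((start, length))
        return blob[start:start + length]

    return read_range, calls


if __name__ == "__main__":
    payload = bytes(range(256)) * 4            # 1024-byte stand-in payload
    blob = LENGTH_PREFIX.pack(len(payload)) + payload + b"rest of index"
    reader, calls = make_memory_reader(blob)
    assert fetch_index_index(reader) == payload
    print("range reads issued:", calls)
```

Over HTTP, read_range(start, length) would correspond to a request with
the header "Range: bytes=<start>-<start+length-1>". Setting first_chunk
to 8 gives the plain two-request variant from the quoted list.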