From: Jim R. <jro...@br...> - 2014-12-01 15:11:54
I use the Java gzip library to un-gzip it, and it has a bug if there is
more than one block. So for IGV it has to be gzipped.

> Hi Jim,
>
> A bgzipped file is a valid gzipped file, so it should work either way.
>
> -Alec
>
> On 11/30/14, 10:01 PM, Jim Robinson wrote:
>> Hi, I just pushed a simple change to IGV that should help with the
>> original problem. IGV will now accept a remote gzipped index file,
>> with naming convention http://..../foo.bam.bai.gz. The index should be
>> gzipped, not "bgzipped". IGV will check for presence of the .gz file
>> first, so that both gzipped and non-gzipped bai files can coexist.
>>
>> This only works with remote resources (http & https), not local files.
>>
>> Jim
>>
>>> Petr is right that keeping the virtual offsets in CSI is
>>> overcomplicated. We should not go there.
>>>
>>> In general, there is a tension between the index resolution
>>> (proportional to the index size) and the amount of data read per
>>> query. When we have lower resolution, we have to linearly read more
>>> data before reaching the data slot we want to retrieve. IIRC, James
>>> showed that we need to read more data with the CRAM index. I doubt
>>> there is a universally better solution.
>>>
>>> Heng
>>>
>>> PS: I acknowledge that the BAM index frequently brings troubles
>>> especially for remote access. The BigBED integrated index is more
>>> friendly and advanced in this aspect. Jim et al had remote access in
>>> mind when they designed BigBED. The BAM remote access is an
>>> afterthought suggested by Lincoln.
>>>
>>> On Nov 10, 2014, at 10:55, Vadim Zalunin <va...@eb...> wrote:
>>>
>>>> On 10/11/2014 15:34, Petr Danecek wrote:
>>>>> Hello Dan,
>>>>>
>>>>> sorry for the delay in responding.
>>>>>
>>>>> The compression of CSI makes the indexing of the index more
>>>>> difficult. We'd need to keep mappings to compressed blocks and to
>>>>> the uncompressed offsets within the blocks.
>>>>> This needs to be relative to the index header, because its
>>>>> compressed size is unknown at the time of writing the body.
>>>>>
>>>>> It is doable, but it increases the complexity and we are not sure
>>>>> if it is worth it: For high coverage data the CSI(v1) index is
>>>>> about 6.5MB, which is comparable to the size of a 100kbp bam data
>>>>> chunk, and is a negligible fraction of the whole BAM file. I don't
>>>>> know what is the typical access pattern in genome viewers; how does
>>>>> the size of the transferred data compare to the size of the index?
>>>>>
>>>>> Also we expect CRAM to start replacing BAM soon, so this probably
>>>>> should become less of a problem in near future.
>>>> Index file size for a CRAM file should be smaller than for a
>>>> corresponding BAM file because each container/slice is used as a
>>>> contiguous read, leaving the rest to the iterators. This is equivalent
>>>> to indexing every 10k (by default) reads as one long read. I wonder if
>>>> reduced resolution could be used for BAM files as well. This should
>>>> not affect readers if I'm not mistaken.
>>>>
>>>> Vadim
>>>>> Best wishes,
>>>>> Petr
>>>>>
>>>>> On Tue, 2014-11-04 at 13:50 -0500, Dan Vanderkam wrote:
>>>>>> A 4.2M file is an improvement, but still quite large to pull in
>>>>>> while loading a visualization on a web page.
>>>>>>
>>>>>> re: Heng's comment about CSI, it would be great if CSI included a
>>>>>> list of virtual offsets for each chromosome at the start of the
>>>>>> file. This would work best if the length of the index index (in
>>>>>> bytes) were encoded in the header before it. This would support
>>>>>> the following access pattern:
>>>>>>
>>>>>> 1. HTTP request for first 8 bytes (to get index index length)
>>>>>> 2. HTTP request for the full index index
>>>>>>
>>>>>> or, as an optimization
>>>>>>
>>>>>> 1. HTTP request for the first, say, 64k (to hopefully grab the full
>>>>>>    index index)
>>>>>> 2. HTTP request for the rest of the index index (if it's longer
>>>>>>    than 64k)
>>>>>>
>>>>>> Prefixing structured fields with their length in bytes is quite
>>>>>> common in binary formats, e.g. in the google protocol buffer wire
>>>>>> format.
>>>>>>
>>>>>> - Dan
>>>>>>
>>>>>> On Wed, Oct 29, 2014 at 9:43 AM, John Marshall <jm...@sa...> wrote:
>>>>>> On 24 Oct 2014, at 19:54, Dan Vanderkam <da...@gm...> wrote:
>>>>>>> My group's BAI files have gotten quite large (10+MB) and are
>>>>>>> proving to be a bottleneck when loading interactive
>>>>>>> visualizations like IGV or BioDalliance. Downloading a 10MB
>>>>>>> file takes many seconds, during which time the visualization
>>>>>>> can't display anything.
>>>>>> [...]
>>>>>>> - Does CSI (instead of BAI) help with this?
>>>>>> As Heng just mentioned in passing, CSI is compressed while BAI
>>>>>> is not. For example, for a 42G BAM file I just compared, the
>>>>>> CSI index is half the size of the BAM index (4.2M v. 8.4M).
>>>>>>
>>>>>> John
>>>>>>
>>>>>> --
>>>>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>>>>> Limited, a charity registered in England with number 1021457 and a
>>>>>> company registered in England with number 2742969, whose registered
>>>>>> office is 215 Euston Road, London, NW1 2BE.
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>>
>>>>>> _______________________________________________
>>>>>> Samtools-devel mailing list
>>>>>> Sam...@li...
>>>>>> https://lists.sourceforge.net/lists/listinfo/samtools-devel
>>>>>
>>>> --
>>>> Vadim Zalunin
>>>> European Bioinformatics Institute (EMBL-EBI)
>>>> European Molecular Biology Laboratory
>>>> Wellcome Trust Genome Campus
>>>> Hinxton
>>>> Cambridge CB10 1SD
>>>> United Kingdom
>>>> Tel: + 44 (0) 1223 494 614
>>>> Fax: + 44 (0) 1223 494 468
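
Note on the gzip vs. bgzip exchange between Jim and Alec at the top of the thread: a bgzipped file is a series of gzip members written back to back, so a reader that stops after the first member returns only part of the data. The Python sketch below illustrates that failure mode and a loop that reads every member. It is not IGV's code, and it is only an assumption that the Java bug Jim mentions behaves this way. The blocks here are plain gzip members; real BGZF blocks also carry an extra header field, which is omitted. The function names are made up for this note.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tell zlib to expect a gzip header and trailer


def read_first_member(data: bytes) -> bytes:
    """Decode only the first gzip member and ignore whatever follows it."""
    decoder = zlib.decompressobj(GZIP_WBITS)
    return decoder.decompress(data)


def read_all_members(data: bytes) -> bytes:
    """Decode every gzip member in turn and concatenate the results."""
    out = []
    while data:
        decoder = zlib.decompressobj(GZIP_WBITS)
        out.append(decoder.decompress(data))
        out.append(decoder.flush())
        # Bytes after the end of this member belong to the next one.
        data = decoder.unused_data
    return b"".join(out)


# Two independently compressed blocks, concatenated into one file image.
blocks = [b"first block;", b"second block"]
multi_member = b"".join(gzip.compress(block) for block in blocks)

print(read_first_member(multi_member))
print(read_all_members(multi_member))
```

This is consistent with both statements in the thread: the concatenated file is valid gzip, as Alec says, and a reader that handles only one member will still mis-read it, which would explain why Jim asks for a plainly gzipped index.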
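
Note on Dan's proposed access pattern (a length-prefixed "index index" fetched with one or two HTTP range requests): the Python sketch below walks through both variants against an in-memory byte string instead of a real server. The file layout is hypothetical and is not part of CSI or BAI: an unsigned 64-bit length, then one entry per chromosome (name length, name, offset stored as a plain unsigned 64-bit integer), then the rest of the index. All function names are invented for illustration.

```python
import struct
from typing import Callable, Dict, List, Tuple

RangeFetcher = Callable[[int, int], bytes]  # (start, end_exclusive) -> bytes


def build_index_file(offsets: Dict[str, int], body: bytes) -> bytes:
    """Hypothetical layout: uint64 length, the index index, then the body."""
    entries = b"".join(
        struct.pack("<H", len(name.encode())) + name.encode() + struct.pack("<Q", off)
        for name, off in offsets.items()
    )
    return struct.pack("<Q", len(entries)) + entries + body


def parse_index_index(blob: bytes) -> Dict[str, int]:
    """Decode the (name, offset) entries written by build_index_file."""
    offsets, pos = {}, 0
    while pos < len(blob):
        (name_len,) = struct.unpack_from("<H", blob, pos)
        pos += 2
        name = blob[pos:pos + name_len].decode()
        pos += name_len
        (offsets[name],) = struct.unpack_from("<Q", blob, pos)
        pos += 8
    return offsets


def load_index_index(fetch: RangeFetcher, first_read: int = 64 * 1024) -> Dict[str, int]:
    """Read the length prefix, then fetch the remainder only if it is missing."""
    head = fetch(0, first_read)
    (length,) = struct.unpack_from("<Q", head, 0)
    blob = head[8:8 + length]
    if len(blob) < length:
        # The first request did not cover the whole index index.
        blob += fetch(len(head), 8 + length)
    return parse_index_index(blob)


def make_fetcher(data: bytes, log: List[Tuple[int, int]]) -> RangeFetcher:
    """Stand-in for an HTTP range request; records each requested range."""
    def fetch(start: int, end: int) -> bytes:
        log.append((start, end))
        return data[start:end]
    return fetch


offsets = {"chr1": 100, "chr2": 2000}
index_file = build_index_file(offsets, body=b"\0" * 1000)
requests: List[Tuple[int, int]] = []
print(load_index_index(make_fetcher(index_file, requests), first_read=8), requests)
```

With `first_read=8` this is Dan's first pattern (two requests); with the default of 64k it is the optimized pattern, which needs a second request only when the index index is longer than the first read.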
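
Note on Heng's point about index resolution versus data read per query, and Vadim's remark that a CRAM index effectively covers every 10k reads with one entry: the toy model below makes the trade-off concrete. It is a deliberate simplification, not how BAI, CSI or CRAM indexes are laid out: reads are reduced to sorted start positions, the index keeps one entry per fixed number of records, and a query scans linearly from the nearest preceding entry. The function names and the numbers are made up for illustration.

```python
import bisect
from typing import List, Tuple


def build_index(starts: List[int], records_per_entry: int) -> List[Tuple[int, int]]:
    """Keep one (start position, record number) entry per `records_per_entry` records."""
    return [(starts[i], i) for i in range(0, len(starts), records_per_entry)]


def records_scanned(
    starts: List[int],
    index: List[Tuple[int, int]],
    query_start: int,
    query_end: int,
) -> int:
    """Count records read linearly from the chosen index entry to the end of the query."""
    keys = [pos for pos, _ in index]
    # Last index entry at or before the query start (or the first entry).
    entry = max(bisect.bisect_right(keys, query_start) - 1, 0)
    first_record = index[entry][1]
    # One past the last record that starts inside the query window.
    last_record = bisect.bisect_right(starts, query_end)
    return max(last_record - first_record, 0)


# 100,000 records, one every 10 bp; query a 100 bp window.
starts = list(range(0, 1_000_000, 10))
for per_entry in (100, 10_000):
    index = build_index(starts, per_entry)
    scanned = records_scanned(starts, index, 557_005, 557_105)
    print(per_entry, len(index), scanned)
```

In this model the coarser index is a hundred times smaller but makes the reader scan far more records for the same window, which is the tension Heng describes.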