|
From: Petr D. <pd...@sa...> - 2014-10-29 11:22:02
|
Hi, On Mon, 2014-10-27 at 10:20 +0000, James Bonfield wrote: > On Fri, Oct 24, 2014 at 02:54:43PM -0400, Dan Vanderkam wrote: > > My group's BAI files have gotten quite large (10+MB) and are proving to be > > a bottleneck when loading interactive visualizations like IGV or > > BioDalliance. Downloading a 10MB file takes many seconds, during which time > > the visualization can't display anything. > > > > A simple index of the BAM Index would solve this -- if I'm viewing short > > reads on a small portion of chr20, there's no reason that the visualization > > should have to pull in all the index data for the other chromosomes. > > I'm not sure if indexing the index is the right solution, but I agree > some internal structure to the index to allow random access on it > would help. This isn't there in CSI either I *think*. That's correct, I am afraid one has to read the whole index http://samtools.github.io/hts-specs/CSIv1.pdf To make things worse, we were thinking about increasing the index size by storing also the number of records in each chunk. This seemed like a trivial increase in size compared to the size of the data files, but I see how inconvenient this would be in your application. petr > > I'm curious what others thoughts are on this: > > - Does the CRAM format mitigate this problem? > > Mitigate, but not solve. > > CRAM indices are normally smaller and faster. I haven't tried it on > real whoppers to test though. > > Some quick benchmarks on a small file (a C.Elegans run): > > $ time samtools index ~/scratch/data/SRR065390_1.srt.bam > real 0m26.906s > user 0m26.580s > sys 0m0.280s > > time samtools index ~/scratch/data/SRR065390_1.srt.cram > real 0m7.785s > user 0m0.150s > sys 0m0.530s > > (Ie CRAM is mostly I/O time. The BAM file had already been read > recently so was probably in cache, but I expect the I/O wouldn't be > more than CPU anyway.) > > jkb@seq3a[nfs_j/jkb] ls -l ~/scratch/data/SRR065390_1.srt* > -rw-r--r-- 1 jkb team117 1830573786 May 8 2012 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam > -rw-r--r-- 1 jkb team117 301248 Oct 27 10:02 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam.bai > -rw-r--r-- 1 jkb team117 1112133803 Sep 5 12:26 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram > -rw-r--r-- 1 jkb team117 53708 Oct 27 10:01 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram.crai > > Cram index is 5-6x smaller than the BAM one for this case. > > In terms of file I/O, a view of CHROMOSOME_III:1000000-1010000 in both > cases reads the entire index, but the BAM vs CRAM file I/O itself > differs slightly: > > BAM: > open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=1830573786, ...}) = 0 > read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768 > lseek(3, -28, SEEK_END) = 1830573758 > read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\33\0\3\0\0\0\0\0\0\0\0\0", 32768) = 28 > lseek(3, 0, SEEK_SET) = 0 > read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768 > lseek(3, 531356626, SEEK_SET) = 531356626 > read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0g:\305}{\214e[ZWM\317\355\276\267\373"..., 32768) = 32768 > read(3, "\276\340c\3060e\372s\373\334\341\263\266\336\342\360\277gX\3S\227\323\\\27\7X17\2426\t"..., 32768) = 32768 > read(3, "cMjn&\300\353&\315\353\352T\324\4KKpoS\204c\n!d\334\323\213\3023\211\261'"..., 32768) = 32768 > read(3, "\237\252\243\271\t'$+\200nV\273\223\221\362i\253\334@\206\354\335\355\317\261\2471\330o\363\345\22"..., 32768) = 32768 > read(3, "\274\271\316A \312g\315\264@\342Oa+\256\361\271\33003>m$\4\331x\\\203\241\2416\243"..., 32768) = 32768 > read(3, "\330\243\326\201F\24\260\326\363g\217[\2\211Nqz\"\360\374v\324\351\4W\335.{\322\0023\36"..., 32768) = 32768 > > CRAM: > open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram", O_RDONLY) = 3 > fstat(3, {st_mode=S_IFREG|0644, st_size=1112133803, ...}) = 0 > read(3, "CRAM\2\1SRR065390_1.srt.cram0+\0\0\0\0"..., 32768) = 32768 > lseek(3, 312180418, SEEK_SET) = 312180418 > read(3, "i\242\4\0\2\3175\277\300S\330\247\20\340\226\337H\317B@\20\1\201\250\0\1\0\201\241\201\241y"..., 32768) = 32768 > read(3, "g\306[\205~\246U\252\215\262r\2s\245\375t2\3\226\330\32;\260$\331\336Y\263m\307o\272"..., 19584) = 19584 > read(3, "\1\4\f\303\207<\317B@\37\213\10\0\0\0\0\0\4\3\314\341\211r\33\331\226m\333\246\301,\367"..., 32768) = 32768 > read(3, ".d\255\261t\267t\323\330\203\246\2\22P\223\252Z\313\277\f\177\6*\24\370^|\245\335\304\35\32"..., 198469) = 198469 > read(3, "\1\4\r\200\261\201\315\37\213\10\0\0\0\0\0\0\3M\220\261\r\4A\f\2imD@\3\323\177"..., 32768) = 32768 > > > The initial lseeks in BAM are to verify EOF. Possibly this should be > done via the same method for CRAM, but right now it only checks EOF > when streaming, to detect proper end of stream. (Ie it'll spot EOF > during a full file conversion, but not if you do random access unless > it actually needs to read past end of file.) > > After that they boil down to much the same thing; decoder header, seek > once and read data. > > That's only a single trivial test so I've no idea how it holds up to > larger files. > > James > > -- > James Bonfield (jk...@sa...) | Hora aderat briligi. Nunc et Slythia Tova > | Plurima gyrabant gymbolitare vabo; > A Staden Package developer: | Et Borogovorum mimzebant undique formae, > https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. > > -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |