|
From: Jim R. <jro...@br...> - 2014-10-29 12:12:09
|
Just a thought, if you're reving the spec anyway would you consider embedding an index of the index, along the lines of the json posted in the orginal message? It could be an optional header field which programs like IGV could take advantage of if present. > Hi, > > On Mon, 2014-10-27 at 10:20 +0000, James Bonfield wrote: >> On Fri, Oct 24, 2014 at 02:54:43PM -0400, Dan Vanderkam wrote: >>> My group's BAI files have gotten quite large (10+MB) and are proving to be >>> a bottleneck when loading interactive visualizations like IGV or >>> BioDalliance. Downloading a 10MB file takes many seconds, during which time >>> the visualization can't display anything. >>> >>> A simple index of the BAM Index would solve this -- if I'm viewing short >>> reads on a small portion of chr20, there's no reason that the visualization >>> should have to pull in all the index data for the other chromosomes. >> I'm not sure if indexing the index is the right solution, but I agree >> some internal structure to the index to allow random access on it >> would help. This isn't there in CSI either I *think*. > That's correct, I am afraid one has to read the whole index > http://samtools.github.io/hts-specs/CSIv1.pdf > > To make things worse, we were thinking about increasing the index size > by storing also the number of records in each chunk. This seemed like a > trivial increase in size compared to the size of the data files, but I > see how inconvenient this would be in your application. > > petr > > > >>> I'm curious what others thoughts are on this: >>> - Does the CRAM format mitigate this problem? >> Mitigate, but not solve. >> >> CRAM indices are normally smaller and faster. I haven't tried it on >> real whoppers to test though. >> >> Some quick benchmarks on a small file (a C.Elegans run): >> >> $ time samtools index ~/scratch/data/SRR065390_1.srt.bam >> real 0m26.906s >> user 0m26.580s >> sys 0m0.280s >> >> time samtools index ~/scratch/data/SRR065390_1.srt.cram >> real 0m7.785s >> user 0m0.150s >> sys 0m0.530s >> >> (Ie CRAM is mostly I/O time. The BAM file had already been read >> recently so was probably in cache, but I expect the I/O wouldn't be >> more than CPU anyway.) >> >> jkb@seq3a[nfs_j/jkb] ls -l ~/scratch/data/SRR065390_1.srt* >> -rw-r--r-- 1 jkb team117 1830573786 May 8 2012 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam >> -rw-r--r-- 1 jkb team117 301248 Oct 27 10:02 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam.bai >> -rw-r--r-- 1 jkb team117 1112133803 Sep 5 12:26 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram >> -rw-r--r-- 1 jkb team117 53708 Oct 27 10:01 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram.crai >> >> Cram index is 5-6x smaller than the BAM one for this case. >> >> In terms of file I/O, a view of CHROMOSOME_III:1000000-1010000 in both >> cases reads the entire index, but the BAM vs CRAM file I/O itself >> differs slightly: >> >> BAM: >> open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam", O_RDONLY) = 3 >> fstat(3, {st_mode=S_IFREG|0644, st_size=1830573786, ...}) = 0 >> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768 >> lseek(3, -28, SEEK_END) = 1830573758 >> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\33\0\3\0\0\0\0\0\0\0\0\0", 32768) = 28 >> lseek(3, 0, SEEK_SET) = 0 >> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768 >> lseek(3, 531356626, SEEK_SET) = 531356626 >> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0g:\305}{\214e[ZWM\317\355\276\267\373"..., 32768) = 32768 >> read(3, "\276\340c\3060e\372s\373\334\341\263\266\336\342\360\277gX\3S\227\323\\\27\7X17\2426\t"..., 32768) = 32768 >> read(3, "cMjn&\300\353&\315\353\352T\324\4KKpoS\204c\n!d\334\323\213\3023\211\261'"..., 32768) = 32768 >> read(3, "\237\252\243\271\t'$+\200nV\273\223\221\362i\253\334@\206\354\335\355\317\261\2471\330o\363\345\22"..., 32768) = 32768 >> read(3, "\274\271\316A \312g\315\264@\342Oa+\256\361\271\33003>m$\4\331x\\\203\241\2416\243"..., 32768) = 32768 >> read(3, "\330\243\326\201F\24\260\326\363g\217[\2\211Nqz\"\360\374v\324\351\4W\335.{\322\0023\36"..., 32768) = 32768 >> >> CRAM: >> open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram", O_RDONLY) = 3 >> fstat(3, {st_mode=S_IFREG|0644, st_size=1112133803, ...}) = 0 >> read(3, "CRAM\2\1SRR065390_1.srt.cram0+\0\0\0\0"..., 32768) = 32768 >> lseek(3, 312180418, SEEK_SET) = 312180418 >> read(3, "i\242\4\0\2\3175\277\300S\330\247\20\340\226\337H\317B@\20\1\201\250\0\1\0\201\241\201\241y"..., 32768) = 32768 >> read(3, "g\306[\205~\246U\252\215\262r\2s\245\375t2\3\226\330\32;\260$\331\336Y\263m\307o\272"..., 19584) = 19584 >> read(3, "\1\4\f\303\207<\317B@\37\213\10\0\0\0\0\0\4\3\314\341\211r\33\331\226m\333\246\301,\367"..., 32768) = 32768 >> read(3, ".d\255\261t\267t\323\330\203\246\2\22P\223\252Z\313\277\f\177\6*\24\370^|\245\335\304\35\32"..., 198469) = 198469 >> read(3, "\1\4\r\200\261\201\315\37\213\10\0\0\0\0\0\0\3M\220\261\r\4A\f\2imD@\3\323\177"..., 32768) = 32768 >> >> >> The initial lseeks in BAM are to verify EOF. Possibly this should be >> done via the same method for CRAM, but right now it only checks EOF >> when streaming, to detect proper end of stream. (Ie it'll spot EOF >> during a full file conversion, but not if you do random access unless >> it actually needs to read past end of file.) >> >> After that they boil down to much the same thing; decoder header, seek >> once and read data. >> >> That's only a single trivial test so I've no idea how it holds up to >> larger files. >> >> James >> >> -- >> James Bonfield (jk...@sa...) | Hora aderat briligi. Nunc et Slythia Tova >> | Plurima gyrabant gymbolitare vabo; >> A Staden Package developer: | Et Borogovorum mimzebant undique formae, >> https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi. >> >> > > > |