|
From: James B. <jk...@sa...> - 2014-10-27 10:20:56
|
On Fri, Oct 24, 2014 at 02:54:43PM -0400, Dan Vanderkam wrote:
> My group's BAI files have gotten quite large (10+MB) and are proving to be
> a bottleneck when loading interactive visualizations like IGV or
> BioDalliance. Downloading a 10MB file takes many seconds, during which time
> the visualization can't display anything.
>
> A simple index of the BAM Index would solve this -- if I'm viewing short
> reads on a small portion of chr20, there's no reason that the visualization
> should have to pull in all the index data for the other chromosomes.
I'm not sure if indexing the index is the right solution, but I agree
some internal structure to the index to allow random access on it
would help. This isn't there in CSI either I *think*.
> I'm curious what others thoughts are on this:
> - Does the CRAM format mitigate this problem?
Mitigate, but not solve.
CRAM indices are normally smaller and faster. I haven't tried it on
real whoppers to test though.
Some quick benchmarks on a small file (a C.Elegans run):
$ time samtools index ~/scratch/data/SRR065390_1.srt.bam
real 0m26.906s
user 0m26.580s
sys 0m0.280s
time samtools index ~/scratch/data/SRR065390_1.srt.cram
real 0m7.785s
user 0m0.150s
sys 0m0.530s
(Ie CRAM is mostly I/O time. The BAM file had already been read
recently so was probably in cache, but I expect the I/O wouldn't be
more than CPU anyway.)
jkb@seq3a[nfs_j/jkb] ls -l ~/scratch/data/SRR065390_1.srt*
-rw-r--r-- 1 jkb team117 1830573786 May 8 2012 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam
-rw-r--r-- 1 jkb team117 301248 Oct 27 10:02 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam.bai
-rw-r--r-- 1 jkb team117 1112133803 Sep 5 12:26 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram
-rw-r--r-- 1 jkb team117 53708 Oct 27 10:01 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram.crai
Cram index is 5-6x smaller than the BAM one for this case.
In terms of file I/O, a view of CHROMOSOME_III:1000000-1010000 in both
cases reads the entire index, but the BAM vs CRAM file I/O itself
differs slightly:
BAM:
open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1830573786, ...}) = 0
read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768
lseek(3, -28, SEEK_END) = 1830573758
read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\33\0\3\0\0\0\0\0\0\0\0\0", 32768) = 28
lseek(3, 0, SEEK_SET) = 0
read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768
lseek(3, 531356626, SEEK_SET) = 531356626
read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0g:\305}{\214e[ZWM\317\355\276\267\373"..., 32768) = 32768
read(3, "\276\340c\3060e\372s\373\334\341\263\266\336\342\360\277gX\3S\227\323\\\27\7X17\2426\t"..., 32768) = 32768
read(3, "cMjn&\300\353&\315\353\352T\324\4KKpoS\204c\n!d\334\323\213\3023\211\261'"..., 32768) = 32768
read(3, "\237\252\243\271\t'$+\200nV\273\223\221\362i\253\334@\206\354\335\355\317\261\2471\330o\363\345\22"..., 32768) = 32768
read(3, "\274\271\316A \312g\315\264@\342Oa+\256\361\271\33003>m$\4\331x\\\203\241\2416\243"..., 32768) = 32768
read(3, "\330\243\326\201F\24\260\326\363g\217[\2\211Nqz\"\360\374v\324\351\4W\335.{\322\0023\36"..., 32768) = 32768
CRAM:
open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1112133803, ...}) = 0
read(3, "CRAM\2\1SRR065390_1.srt.cram0+\0\0\0\0"..., 32768) = 32768
lseek(3, 312180418, SEEK_SET) = 312180418
read(3, "i\242\4\0\2\3175\277\300S\330\247\20\340\226\337H\317B@\20\1\201\250\0\1\0\201\241\201\241y"..., 32768) = 32768
read(3, "g\306[\205~\246U\252\215\262r\2s\245\375t2\3\226\330\32;\260$\331\336Y\263m\307o\272"..., 19584) = 19584
read(3, "\1\4\f\303\207<\317B@\37\213\10\0\0\0\0\0\4\3\314\341\211r\33\331\226m\333\246\301,\367"..., 32768) = 32768
read(3, ".d\255\261t\267t\323\330\203\246\2\22P\223\252Z\313\277\f\177\6*\24\370^|\245\335\304\35\32"..., 198469) = 198469
read(3, "\1\4\r\200\261\201\315\37\213\10\0\0\0\0\0\0\3M\220\261\r\4A\f\2imD@\3\323\177"..., 32768) = 32768
The initial lseeks in BAM are to verify EOF. Possibly this should be
done via the same method for CRAM, but right now it only checks EOF
when streaming, to detect proper end of stream. (Ie it'll spot EOF
during a full file conversion, but not if you do random access unless
it actually needs to read past end of file.)
After that they boil down to much the same thing; decoder header, seek
once and read data.
That's only a single trivial test so I've no idea how it holds up to
larger files.
James
--
James Bonfield (jk...@sa...) | Hora aderat briligi. Nunc et Slythia Tova
| Plurima gyrabant gymbolitare vabo;
A Staden Package developer: | Et Borogovorum mimzebant undique formae,
https://sf.net/projects/staden/ | Momiferique omnes exgrabure Rathi.
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
|