Re: [Samtools-devel] Indexing a BAM Index (BAI)

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi,

On Mon, 2014-10-27 at 10:20 +0000, James Bonfield wrote: 
> On Fri, Oct 24, 2014 at 02:54:43PM -0400, Dan Vanderkam wrote:
> > My group's BAI files have gotten quite large (10+MB) and are proving to be
> > a bottleneck when loading interactive visualizations like IGV or
> > BioDalliance. Downloading a 10MB file takes many seconds, during which time
> > the visualization can't display anything.
> > 
> > A simple index of the BAM Index would solve this -- if I'm viewing short
> > reads on a small portion of chr20, there's no reason that the visualization
> > should have to pull in all the index data for the other chromosomes.
> 
> I'm not sure if indexing the index is the right solution, but I agree
> some internal structure to the index to allow random access on it
> would help.  This isn't there in CSI either I *think*.

That's correct, I am afraid one has to read the whole index
http://samtools.github.io/hts-specs/CSIv1.pdf

To make things worse, we were thinking about increasing the index size
by storing also the number of records in each chunk. This seemed like a
trivial increase in size compared to the size of the data files, but I
see how inconvenient this would be in your application.

petr

> > I'm curious what others thoughts are on this:
> > - Does the CRAM format mitigate this problem?
> 
> Mitigate, but not solve.
> 
> CRAM indices are normally smaller and faster. I haven't tried it on
> real whoppers to test though.
> 
> Some quick benchmarks on a small file (a C.Elegans run):
> 
> $ time samtools index ~/scratch/data/SRR065390_1.srt.bam
> real    0m26.906s
> user    0m26.580s
> sys     0m0.280s
> 
> time samtools index ~/scratch/data/SRR065390_1.srt.cram
> real    0m7.785s
> user    0m0.150s
> sys     0m0.530s
> 
> (Ie CRAM is mostly I/O time. The BAM file had already been read
> recently so was probably in cache, but I expect the I/O wouldn't be
> more than CPU anyway.)
> 
> jkb@seq3a[nfs_j/jkb] ls -l ~/scratch/data/SRR065390_1.srt*
> -rw-r--r-- 1 jkb team117 1830573786 May  8  2012 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam
> -rw-r--r-- 1 jkb team117     301248 Oct 27 10:02 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam.bai
> -rw-r--r-- 1 jkb team117 1112133803 Sep  5 12:26 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram
> -rw-r--r-- 1 jkb team117      53708 Oct 27 10:01 /nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram.crai
> 
> Cram index is 5-6x smaller than the BAM one for this case.
> 
> In terms of file I/O, a view of CHROMOSOME_III:1000000-1010000 in both
> cases reads the entire index, but the BAM vs CRAM file I/O itself
> differs slightly:
> 
> BAM:
> open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.bam", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=1830573786, ...}) = 0
> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768
> lseek(3, -28, SEEK_END)                 = 1830573758
> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\33\0\3\0\0\0\0\0\0\0\0\0", 32768) = 28
> lseek(3, 0, SEEK_SET)                   = 0
> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0\377\0sr\364eTed`p\360p\341\f\363"..., 32768) = 32768
> lseek(3, 531356626, SEEK_SET)           = 531356626
> read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0g:\305}{\214e[ZWM\317\355\276\267\373"..., 32768) = 32768
> read(3, "\276\340c\3060e\372s\373\334\341\263\266\336\342\360\277gX\3S\227\323\\\27\7X17\2426\t"..., 32768) = 32768
> read(3, "cMjn&\300\353&\315\353\352T\324\4KKpoS\204c\n!d\334\323\213\3023\211\261'"..., 32768) = 32768
> read(3, "\237\252\243\271\t'$+\200nV\273\223\221\362i\253\334@\206\354\335\355\317\261\2471\330o\363\345\22"..., 32768) = 32768
> read(3, "\274\271\316A \312g\315\264@\342Oa+\256\361\271\33003>m$\4\331x\\\203\241\2416\243"..., 32768) = 32768
> read(3, "\330\243\326\201F\24\260\326\363g\217[\2\211Nqz\"\360\374v\324\351\4W\335.{\322\0023\36"..., 32768) = 32768
> 
> CRAM:
> open("/nfs/users/nfs_j/jkb/scratch/data/SRR065390_1.srt.cram", O_RDONLY) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=1112133803, ...}) = 0
> read(3, "CRAM\2\1SRR065390_1.srt.cram0+\0\0\0\0"..., 32768) = 32768
> lseek(3, 312180418, SEEK_SET)           = 312180418
> read(3, "i\242\4\0\2\3175\277\300S\330\247\20\340\226\337H\317B@\20\1\201\250\0\1\0\201\241\201\241y"..., 32768) = 32768
> read(3, "g\306[\205~\246U\252\215\262r\2s\245\375t2\3\226\330\32;\260$\331\336Y\263m\307o\272"..., 19584) = 19584
> read(3, "\1\4\f\303\207<\317B@\37\213\10\0\0\0\0\0\4\3\314\341\211r\33\331\226m\333\246\301,\367"..., 32768) = 32768
> read(3, ".d\255\261t\267t\323\330\203\246\2\22P\223\252Z\313\277\f\177\6*\24\370^|\245\335\304\35\32"..., 198469) = 198469
> read(3, "\1\4\r\200\261\201\315\37\213\10\0\0\0\0\0\0\3M\220\261\r\4A\f\2imD@\3\323\177"..., 32768) = 32768
> 
> 
> The initial lseeks in BAM are to verify EOF. Possibly this should be
> done via the same method for CRAM, but right now it only checks EOF
> when streaming, to detect proper end of stream. (Ie it'll spot EOF
> during a full file conversion, but not if you do random access unless
> it actually needs to read past end of file.)
> 
> After that they boil down to much the same thing; decoder header, seek
> once and read data.
> 
> That's only a single trivial test so I've no idea how it holds up to
> larger files.
> 
> James
> 
> -- 
> James Bonfield (jk...@sa...) | Hora aderat briligi. Nunc et Slythia Tova
>                                   | Plurima gyrabant gymbolitare vabo;
>   A Staden Package developer:     | Et Borogovorum mimzebant undique formae,
> https://sf.net/projects/staden/   | Momiferique omnes exgrabure Rathi. 
> 
> 

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.