From: Heng Li <lh...@sa...> - 2013-12-05 22:27:39
Randomly accessing millions of regions is indeed inefficient. One possible way to alleviate the problem a little is to enable caching with "bgzf_set_cache_size(fp, 8000000)" right after opening the BAM; this saves some time on decompression. However, that alone will probably not be fast enough, as you still pay a large penalty on BAM reading and on lseek() system calls. An iterator over a set of regions would be nice, and this is a very good suggestion; unfortunately, we do not have that now.

Heng

On Dec 5, 2013, at 4:49 PM, Alessandro Mammana <ma...@mo...> wrote:

> Dear all,
> I am trying to write some efficient code to process reads from a BAM file.
> Because the main goal is efficiency, and because C/C++ can easily be
> interfaced with R, I decided to use the samtools C library. I managed to
> write a small program which, for now, just counts reads from a (possibly
> very long) list of genomic intervals.
>
> The problem I have now is the following: if, for every genomic interval I
> am interested in, I use the function bam_iter_query to loop through the
> relevant reads, I observe relatively unsatisfying performance. However, if
> I sort the genomic intervals, merge any that are on the same chromosome
> and closer than "maxgap" base pairs apart, and then query the merged
> intervals, I observe a substantial increase in performance.
> These are some simple benchmarks.
>
> Here I am just looping through all the reads using the function bam_read1
> and counting them:
>
>> system.time(loop(bampath))
> There are 33578855 total reads in the bam file
>    user  system elapsed
>  30.570   0.765  31.611
>
> Here I am calling my C++ code with the maxgap parameter described above:
>
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=10000000))
>    user  system elapsed
>  32.158   0.764  32.994
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=1000000))
>    user  system elapsed
>  32.021   0.720  32.812
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=100000))
>    user  system elapsed
>  31.770   0.764  32.604
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=10000))
>    user  system elapsed
>  46.571   1.003  47.677
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=1000))
>    user  system elapsed
> 112.101   1.847 114.194
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=100))
>    user  system elapsed
> 173.572   2.664 176.617
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=10))
>    user  system elapsed
> 192.863   2.876 196.371
>> system.time(counts <- pileup(bins, bampath, binsize="count", shift=90, ss=F, maxgap=0))
>    user  system elapsed
> 195.549   2.979 199.166
>
> So it looks like a blind usage of the BAM index performs poorly, and that
> intervals closer than 100000 bp should be queried and processed together.
>
> My questions are the following:
> 1. Why does this happen? I think it has to do with the calls to bam_seek
> and with loading chunks of the compressed BAM file.
> 2. Is there a way of choosing an optimal maxgap parameter independently of
> the particular BAM file used? (It could be that maxgap=100000 works well
> only on this particular file.)
> 3. Or even better, is there a library function that takes care of this for
> me? In theory it should be possible to generalize the function
> bam_iter_query so that it returns an iterator that loops through not just
> one range but many, and does so efficiently.
>
> Sorry for the very long mail, and thank you very much in advance.
>
> Ale
>
> --
> Alessandro Mammana, PhD Student
> Max Planck Institute for Molecular Genetics
> Ihnestraße 63-73
> D-14195 Berlin, Germany
>
> _______________________________________________
> Samtools-devel mailing list
> Sam...@li...
> https://lists.sourceforge.net/lists/listinfo/samtools-devel

--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited,
a charity registered in England with number 1021457 and a company registered
in England with number 2742969, whose registered office is 215 Euston Road,
London, NW1 2BE.