Re: [maq-announce] MAQ poster and so on
Status: Beta
Brought to you by:
lh3lh3
From: Benjamin B. <bb...@us...> - 2008-05-12 19:09:55
|
Heng, Thanks for sending these along. Have you looked at all at how read length affects alignability and accuracy? I think a lot of people will be looking for guidance in terms of the optimal number of cycles to sequence. Thanks, ben. On May 12, 2008, at 3:25 AM, Heng Li wrote: > Hello MAQ Users, > > Recently I have presented a poster about MAQ algorithm at the CSHL > meeting. You could download the poster at: > > http://www.sanger.ac.uk/Users/lh3/maq-poster.pdf > > I also gave a 5-minute talk at a small workshop about issues in short > read mapping. Part of the slides can be found here: > > http://www.sanger.ac.uk/Users/lh3/CSHL-mapping-20080511.pdf > > In addition, I did a small benchmark on various read mapping software. > Although I was trying to make a fair comparison in mind, it obviously > biases towards maq in fact. Anyway, I think it might still be helpful > to someone. The comparison is appended at the end of this email. > > with kind regards, > > Heng > > > > > > Here I will present a small comparison between different read > alignment programs, including eland, soap, rmap and maq. The > comparison is designed as follows. From human reference chrX, I > simulated 1 million pairs of 32bp reads (i.e. 2 million reads) with > maq. Maq first simulated a diploid sequence (two haploid sequences) by > adding 135351 substitutions, 7650 1bp insertions and 7510 1bp > deletions, and then generated reads from the diploid sequence. The > base quality were generated based on real data. Sequencing errors were > added based on qualities. As we know the exact position where the > reads come from, we can exactly count the number of reads that are > wrongly mapped by a certain alignment program. > > To facilitate comparison, I wrote a Perl script 'paf_utils.pl' that > converts various alignment format to the socalled PAF format (Pairwise > Alignment Format), which is only for my own use at the moment. > > Important Notes > =============== > > * This benchmark favours maq because I have not squeezed the best > performance out of other software. Here are the reasons. > > * eland: eland is able to make use of paired end (PE) information and > generate accurate mapping qualities. However, these functionality are > assisted by several scripts in the GAPipeline-0.3.0 and for the moment > I do not know how to run these scripts independently without invoking > the full Gerald module of the pipeline. I was running eland in the > single end mode without mapping qualities. > > * rmap: rmapq is able to use base qualities to improve the alignment. > However, it requires _prb.txt format which is rarely used at the > Sanger Institute and in Illumina as well. Maq does not support this > format in simulation and therefore rmapq is not evaluated. I am sure > rmapq would work much better than rmap if base qualities were > available. > > * soap: soap also comes with the PE alignment mode. Unfortunately, it > segfaults when I try to use this mode. Perhaps my input file is buggy > or I was doing stupid things, but anyway, I could not evaluate soap in > PE alignment mode. If this mode worked, soap would get higher mapping > accuracy. > > * shrimp: I tuned the parameters for faster speed but with lower > sensitivity. The overall accuracy would be affected. In addition, I > was using version 1.04 instead of the latest 1.05. The latest version > has been optimized a bit for mapping Illumina reads. > > * In simulation we assume that base qualities are accurate, but this > is not the case on real data. When base qualities are inaccurate, > maq's mapping qualities will be less accurate, which will also reduce > the accuracy of alignments. rmapq will suffer from the similar problem > more or less. > > * The CPU time is only a rough approximate. It is subjected to the > conditions (e.g. memory and cache uses) of the machine where the > program runs. On different conditions, the speed may vary up to 30%. > > Results > ======= > > * eland (GAPipeline-0.3.0, mapping quality ROUGHLY estimated assuming > uniform quality Q25) > - CMD: ./eland_32 r12.fa chrX eland.txt > - CPU time: 284.64 sec > - res memory: 383 MB > - # Q0;Q10;Q20 mapped reads: 1,686,129;1678776;1622577 > - # Q0;Q10;Q20 wrongly mapped reads: 7,406;4234;819 (p_err: 0.439%; > 0.252%;0.0505%) > - Other features: PE alignment (not evaluated); mapping qualities > (not evaluated); counts of the alternative places. > > * rmap (0.2) > - CMD: ./rmap r12.fa -m 2 -o rmap.txt -c chrX.fa -w 32 -v > - CPU time: 2230.07 sec > - res memory: 894 MB > - # mapped reads: 1,686,129 > - # wrongly mapped reads: 7,408 (p_err: 0.4393%) > - Other features: mapping using base qualities (not evaluated); 3 > or more mismatches (not evaluated) > > * maq-se (0.6.3-33, almost identical to 0.6.4) > - CMD: maq map maq_se.map chrX.bfa r12.bfq > - CPU time: 1039.89 sec > - res memory: 819 MB > - # Q0;Q10;Q20 mapped reads: 1946152; 1665959; 1614122 > - # Q0;Q10;Q20 wrongly mapped reads: 200563; 1317; 320 (p_err: > 10.33%; 0.0790%; 0.0198%) > > * maq-pe (0.6.3-33, almost identical to 0.6.4) > - CMD: maq map maq_pe.map chrX.bfa r1.bfq r2.bfq > - CPU time: 1256.13 sec > - res memory: 674 MB > - # Q0;Q10;Q20 mapped reads: 1949177; 1756368; 1728480 > - # Q0;Q10;Q20 wrongly mapped reads: 68542; 281; 153 (p_err: > 3.516%; 0.0160%; 0.0088%) > - # reads mapped with indel: 2277 > - # wrongly mapped indel reads: 0 (p_err = 0/2277 = 0%) > > * soap-c0r0g0s (1.07, CPU time is measured with 1.05) > - CMD: ./soap -c 0 -r 0 -g 0 -a r12.fa -d chrX.fa -o > soap_c0r0g0s.sop > - CPU time: 1228.120 sec (or 1417.29 user time/372.44 real time if > '-p 4' is applied) > - res memory: 963 MB > - # mapped reads: 1,686,034 > - # wrongly mapped reads: 7,840 (p_err: 0.4650%) > > * soap-c0r0g3s (1.07, CPU time with 1.05) > - CMD: ./soap -c 0 -r 0 -g 3 -a r12.fa -d chrX.fa -o > soap_c0r0g3s.sop > - CPU time: 1398.95 sec > - res memory: 963 MB > - # mapped reads: 1,687,887 > - # wrongly mapped reads: 7,854 (p_err: 0.4653%) > - # reads mapped with indel: 1853 > - # wrongly mapped indel reads: 14 (p_err: 0.756%) > > * soap-c42r0g3s (1.07, CPU time with 1.05) > - CMD: ./soap -c 42 -r 0 -g 3 -a r12.fa -d chrX.fa -o > soap_c42r0g3s.sop > - CPU time: 1517.41 sec > - res memory: 963 MB > - # mapped reads: 1,698,212 > - # wrongly mapped reads: 8,131 (p_err: 0.4788%) > - # reads mapped with indel: 2207 > - # wrongly mapped indel reads: 41 (p_err: 1.86%) > > * shrimp (1.04) > - CMD: ./rmapper-ls -s 111111011111 -w 35 -n 3 -r 32 -o 20 -d -1 -h > 2675 r12.fa chrX.fa > - CPU time: 11875.75 sec > - res memory: 3085MB > - # all;normodds>0.5 mapped reads: 1930350;1587003 > - # all;normodds>0.5 wrongly mapped reads: 204355;2414 (p_err: > 10.6%;0.152%) > - # all;normodds>0.5 reads mapped with indel: 2062;1817 > - # all;normodds>0.5 wrongly mapped indel reads: 212;24 (p_err: > 10.3%;1.38%) > > Additional Notes > ================ > > * eland: eland is definitely the fastest, much faster than all the > competitors. What is more important, eland gives the number of > alternative places, which makes it possible for you to get further > information about the repetitive structures of the genome and to > select reads that can be really confidently mapped. In addition, with > the help of additional scripts, Eland IS able to map reads longer than > 32bp. Eland is one of the best software I ever used. It would be even > superior if Tony could make it easier to use for a user, like me, who > wants to run eland independently of the GAPipeline. > > * rmap: the strength of rmap is to use base qualities to improve the > alignment accuracy. I believe it can produce better alignment than > maq- > se because maq trades accuracy for speed at this point (technically it > is a bit hard to explain here). Nonetheless, I think rmap would be > more popular if its authors could add support for fastq-like quality > string which is now the standard in both Illumina and the Sanger > Institute (although maybe not elsewhere). rmap supports longer reads, > which is also a gain. Furthermore, I did learn a lot from its way to > count the number of mismatches. > > * soap: soap is a versatile program. It supports iterative-trimmed > alignment, long reads, gapped alignment, TAG alignment and PE mode. > Its PE mode is easier to use than eland. In principle, soap and eland > should give almost the same number of wrong alignments. However, soap > gives 442 more wrong alignments. Further investigation shows that most > of these 442 wrong ones are flagged as R? (repeat) by eland. > > * shrimp: Actually I was not expecting that a program using seeding > +Smith-Waterman could be that fast. So far as I know, all the other > software in the list do not do Smith-Waterman (maq does for PE data > only), which is why they are fast. SHRiMP's normodds score has similar > meaning to mapping quality. Such score helps to determine whether an > alignment is reliable. The most obvious advantage is SHRiMP can map > long reads (454/capillary) with the standard gapped alignment. If you > only work with small genomes, SHRiMP is a worthy choice. I think > SHRiMP would be better if it could make use of paired end information; > it would be even better if it could calculate mapping quality. The > current normodds score helps but is not exactly the mapping quality. > In addition, I also modified probcalc codes because in 1.04 underflow > may occur to long reads and leads to "nan" normodds. However, although > my revision fixes the underflow, it may lead to some inaccurate > normodds. > > * maq: at the moment maq is easier to use than eland. Supporting SNP > calling is maq's huge gain. Its paired end mode is also highly helpful > to recover some repetitive regions. Maq's random mapping, which is > frequently misused by users who have not noticed mapping qualities, > may be useful to some people, too, and at least it helps to call SNPs > at the verge of repeats. > > Update > ====== > > * 2008-03-13 > - Use maq version 0.6.3-33, which is faster than the public version > 0.6.3. > - Compile all the programs with Intel C/C++ compiler and run all > the programs on the same 8-core 64bit machine (Xeon-L5320 1.86GHz) > with 8GB memory. Multi-threading is NOT used. Previously I run some > programs on a faster machine (Xeon-5150 2.66GHz) and some on a slower > machine (1.86GHz), which is unfair. > - I invoked two programs at a time. The combination is: > (maq_se,maq_pe), (eland,eland-multi), (rmap,soap_c0r0g0s), > (soap_c0r0g3s,soap_c42r0g3s). Note that processes on the same machine > may compete with the memory bandwidth and cache uses, which may affect > the speed. > > * 2008-03-14 > - Test multithreading of SOAP. > > * 2008-03-22 > - Use soap_1.07. But PE mode did not work for me unfortunately. > - Add comments for the additional wrong alignments of soap (in > comparison to eland). > - Add "fake" eland mapping quality (done by my paf_utils.pl script). > - Evaluate indel alignment. Note that I am using soap's single-end > mode. PE mode would give better accuracy. > > * 2008-03-27 > - Evaluate SHRiMP-1.04 > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > ------------------------------------------------------------------------- > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference > Don't miss this year's exciting event. There's still time to save > $100. > Use priority code J8TL2D2. > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone > _______________________________________________ > maq-announce mailing list > maq...@li... > https://lists.sourceforge.net/lists/listinfo/maq-announce |