Re: [maq-announce] MAQ poster and so on

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Heng,

Thanks for sending these along.   Have you looked at all at how read  
length affects alignability and accuracy?  I think a lot of people  
will be looking for guidance in terms of the optimal number of cycles  
to sequence.

Thanks,
ben.

On May 12, 2008, at 3:25 AM, Heng Li wrote:

> Hello MAQ Users,
>
> Recently I have presented a poster about MAQ algorithm at the CSHL
> meeting. You could download the poster at:
>
> http://www.sanger.ac.uk/Users/lh3/maq-poster.pdf
>
> I also gave a 5-minute talk at a small workshop about issues in short
> read mapping. Part of the slides can be found here:
>
> http://www.sanger.ac.uk/Users/lh3/CSHL-mapping-20080511.pdf
>
> In addition, I did a small benchmark on various read mapping software.
> Although I was trying to make a fair comparison in mind, it obviously
> biases towards maq in fact. Anyway, I think it might still be helpful
> to someone. The comparison is appended at the end of this email.
>
> with kind regards,
>
> Heng
>
>
>
>
>
> Here I will present a small comparison between different read
> alignment programs, including eland, soap, rmap and maq. The
> comparison is designed as follows. From human reference chrX, I
> simulated 1 million pairs of 32bp reads (i.e. 2 million reads) with
> maq. Maq first simulated a diploid sequence (two haploid sequences) by
> adding 135351 substitutions, 7650 1bp insertions and 7510 1bp
> deletions, and then generated reads from the diploid sequence. The
> base quality were generated based on real data. Sequencing errors were
> added based on qualities. As we know the exact position where the
> reads come from, we can exactly count the number of reads that are
> wrongly mapped by a certain alignment program.
>
> To facilitate comparison, I wrote a Perl script 'paf_utils.pl' that
> converts various alignment format to the socalled PAF format (Pairwise
> Alignment Format), which is only for my own use at the moment.
>
> Important Notes
> ===============
>
> * This benchmark favours maq because I have not squeezed the best
> performance out of other software. Here are the reasons.
>
> * eland: eland is able to make use of paired end (PE) information and
> generate accurate mapping qualities. However, these functionality are
> assisted by several scripts in the GAPipeline-0.3.0 and for the moment
> I do not know how to run these scripts independently without invoking
> the full Gerald module of the pipeline. I was running eland in the
> single end mode without mapping qualities.
>
> * rmap: rmapq is able to use base qualities to improve the alignment.
> However, it requires _prb.txt format which is rarely used at the
> Sanger Institute and in Illumina as well. Maq does not support this
> format in simulation and therefore rmapq is not evaluated. I am sure
> rmapq would work much better than rmap if base qualities were  
> available.
>
> * soap: soap also comes with the PE alignment mode. Unfortunately, it
> segfaults when I try to use this mode. Perhaps my input file is buggy
> or I was doing stupid things, but anyway, I could not evaluate soap in
> PE alignment mode. If this mode worked, soap would get higher mapping
> accuracy.
>
> * shrimp: I tuned the parameters for faster speed but with lower
> sensitivity. The overall accuracy would be affected. In addition, I
> was using version 1.04 instead of the latest 1.05. The latest version
> has been optimized a bit for mapping Illumina reads.
>
> * In simulation we assume that base qualities are accurate, but this
> is not the case on real data. When base qualities are inaccurate,
> maq's mapping qualities will be less accurate, which will also reduce
> the accuracy of alignments. rmapq will suffer from the similar problem
> more or less.
>
> * The CPU time is only a rough approximate. It is subjected to the
> conditions (e.g. memory and cache uses) of the machine where the
> program runs. On different conditions, the speed may vary up to 30%.
>
> Results
> =======
>
> * eland (GAPipeline-0.3.0, mapping quality ROUGHLY estimated assuming
> uniform quality Q25)
>   - CMD: ./eland_32 r12.fa chrX eland.txt
>   - CPU time: 284.64 sec
>   - res memory: 383 MB
>   - # Q0;Q10;Q20 mapped reads: 1,686,129;1678776;1622577
>   - # Q0;Q10;Q20 wrongly mapped reads: 7,406;4234;819 (p_err: 0.439%;
> 0.252%;0.0505%)
>   - Other features: PE alignment (not evaluated); mapping qualities
> (not evaluated); counts of the alternative places.
>
> * rmap (0.2)
>   - CMD: ./rmap r12.fa -m 2 -o rmap.txt -c chrX.fa -w 32 -v
>   - CPU time: 2230.07 sec
>   - res memory: 894 MB
>   - # mapped reads: 1,686,129
>   - # wrongly mapped reads: 7,408 (p_err: 0.4393%)
>   - Other features: mapping using base qualities (not evaluated); 3
> or more mismatches (not evaluated)
>
> * maq-se (0.6.3-33, almost identical to 0.6.4)
>   - CMD: maq map maq_se.map chrX.bfa r12.bfq
>   - CPU time: 1039.89 sec
>   - res memory: 819 MB
>   - # Q0;Q10;Q20 mapped reads: 1946152; 1665959; 1614122
>   - # Q0;Q10;Q20 wrongly mapped reads: 200563; 1317; 320 (p_err:
> 10.33%; 0.0790%; 0.0198%)
>
> * maq-pe (0.6.3-33, almost identical to 0.6.4)
>   - CMD: maq map maq_pe.map chrX.bfa r1.bfq r2.bfq
>   - CPU time: 1256.13 sec
>   - res memory: 674 MB
>   - # Q0;Q10;Q20 mapped reads: 1949177; 1756368; 1728480
>   - # Q0;Q10;Q20 wrongly mapped reads: 68542; 281; 153 (p_err:
> 3.516%; 0.0160%; 0.0088%)
>   - # reads mapped with indel: 2277
>   - # wrongly mapped indel reads: 0 (p_err = 0/2277 = 0%)
>
> * soap-c0r0g0s (1.07, CPU time is measured with 1.05)
>   - CMD: ./soap -c 0 -r 0 -g 0 -a r12.fa -d chrX.fa -o  
> soap_c0r0g0s.sop
>   - CPU time: 1228.120 sec (or 1417.29 user time/372.44 real time if
> '-p 4' is applied)
>   - res memory: 963 MB
>   - # mapped reads: 1,686,034
>   - # wrongly mapped reads: 7,840 (p_err: 0.4650%)
>
> * soap-c0r0g3s (1.07, CPU time with 1.05)
>   - CMD: ./soap -c 0 -r 0 -g 3 -a r12.fa -d chrX.fa -o  
> soap_c0r0g3s.sop
>   - CPU time: 1398.95 sec
>   - res memory: 963 MB
>   - # mapped reads: 1,687,887
>   - # wrongly mapped reads: 7,854 (p_err: 0.4653%)
>   - # reads mapped with indel: 1853
>   - # wrongly mapped indel reads: 14 (p_err: 0.756%)
>
> * soap-c42r0g3s (1.07, CPU time with 1.05)
>   - CMD: ./soap -c 42 -r 0 -g 3 -a r12.fa -d chrX.fa -o
> soap_c42r0g3s.sop
>   - CPU time: 1517.41 sec
>   - res memory: 963 MB
>   - # mapped reads: 1,698,212
>   - # wrongly mapped reads: 8,131 (p_err: 0.4788%)
>   - # reads mapped with indel: 2207
>   - # wrongly mapped indel reads: 41 (p_err: 1.86%)
>
> * shrimp (1.04)
>   - CMD: ./rmapper-ls -s 111111011111 -w 35 -n 3 -r 32 -o 20 -d -1 -h
> 2675 r12.fa chrX.fa
>   - CPU time: 11875.75 sec
>   - res memory: 3085MB
>   - # all;normodds>0.5 mapped reads: 1930350;1587003
>   - # all;normodds>0.5 wrongly mapped reads: 204355;2414 (p_err:
> 10.6%;0.152%)
>   - # all;normodds>0.5 reads mapped with indel: 2062;1817
>   - # all;normodds>0.5 wrongly mapped indel reads: 212;24 (p_err:
> 10.3%;1.38%)
>
> Additional Notes
> ================
>
> * eland: eland is definitely the fastest, much faster than all the
> competitors. What is more important, eland gives the number of
> alternative places, which makes it possible for you to get further
> information about the repetitive structures of the genome and to
> select reads that can be really confidently mapped. In addition, with
> the help of additional scripts, Eland IS able to map reads longer than
> 32bp. Eland is one of the best software I ever used. It would be even
> superior if Tony could make it easier to use for a user, like me, who
> wants to run eland independently of the GAPipeline.
>
> * rmap: the strength of rmap is to use base qualities to improve the
> alignment accuracy. I believe it can produce better alignment than  
> maq-
> se because maq trades accuracy for speed at this point (technically it
> is a bit hard to explain here). Nonetheless, I think rmap would be
> more popular if its authors could add support for fastq-like quality
> string which is now the standard in both Illumina and the Sanger
> Institute (although maybe not elsewhere). rmap supports longer reads,
> which is also a gain. Furthermore, I did learn a lot from its way to
> count the number of mismatches.
>
> * soap: soap is a versatile program. It supports iterative-trimmed
> alignment, long reads, gapped alignment, TAG alignment and PE mode.
> Its PE mode is easier to use than eland. In principle, soap and eland
> should give almost the same number of wrong alignments. However, soap
> gives 442 more wrong alignments. Further investigation shows that most
> of these 442 wrong ones are flagged as R? (repeat) by eland.
>
> * shrimp: Actually I was not expecting that a program using seeding
> +Smith-Waterman could be that fast. So far as I know, all the other
> software in the list do not do Smith-Waterman (maq does for PE data
> only), which is why they are fast. SHRiMP's normodds score has similar
> meaning to mapping quality. Such score helps to determine whether an
> alignment is reliable. The most obvious advantage is SHRiMP can map
> long reads (454/capillary) with the standard gapped alignment. If you
> only work with small genomes, SHRiMP is a worthy choice. I think
> SHRiMP would be better if it could make use of paired end information;
> it would be even better if it could calculate mapping quality. The
> current normodds score helps but is not exactly the mapping quality.
> In addition, I also modified probcalc codes because in 1.04 underflow
> may occur to long reads and leads to "nan" normodds. However, although
> my revision fixes the underflow, it may lead to some inaccurate
> normodds.
>
> * maq: at the moment maq is easier to use than eland. Supporting SNP
> calling is maq's huge gain. Its paired end mode is also highly helpful
> to recover some repetitive regions. Maq's random mapping, which is
> frequently misused by users who have not noticed mapping qualities,
> may be useful to some people, too, and at least it helps to call SNPs
> at the verge of repeats.
>
> Update
> ======
>
> * 2008-03-13
>   - Use maq version 0.6.3-33, which is faster than the public version
> 0.6.3.
>   - Compile all the programs with Intel C/C++ compiler and run all
> the programs on the same 8-core 64bit machine (Xeon-L5320 1.86GHz)
> with 8GB memory. Multi-threading is NOT used. Previously I run some
> programs on a faster machine (Xeon-5150 2.66GHz) and some on a slower
> machine (1.86GHz), which is unfair.
>   - I invoked two programs at a time. The combination is:
> (maq_se,maq_pe), (eland,eland-multi), (rmap,soap_c0r0g0s),
> (soap_c0r0g3s,soap_c42r0g3s). Note that processes on the same machine
> may compete with the memory bandwidth and cache uses, which may affect
> the speed.
>
> * 2008-03-14
>   - Test multithreading of SOAP.
>
> * 2008-03-22
>   - Use soap_1.07. But PE mode did not work for me unfortunately.
>   - Add comments for the additional wrong alignments of soap (in
> comparison to eland).
>   - Add "fake" eland mapping quality (done by my paf_utils.pl script).
>   - Evaluate indel alignment. Note that I am using soap's single-end
> mode. PE mode would give better accuracy.
>
> * 2008-03-27
>   - Evaluate SHRiMP-1.04
>
>
>
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> Don't miss this year's exciting event. There's still time to save  
> $100.
> Use priority code J8TL2D2.
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> maq-announce mailing list
> maq...@li...
> https://lists.sourceforge.net/lists/listinfo/maq-announce