Re: [Bio-bwa-help] Problems with edit distance, unique mappings and mapping qualities
Status: Beta
Brought to you by:
lh3lh3
From: Heng Li <lh...@sa...> - 2010-10-29 14:35:07
|
On Oct 29, 2010, at 9:30 AM, Eva Berglund wrote: > Hi all, > > I have some problems parsing the output from BWA (Version 0.5.7 (r1310)). > > First, how can I see how many errors (mismatches+gaps) I have in an > alignment? > > The BWA manual says: > XM Number of mismatches in the alignment > XO Number of gap opens > XG Number of gap extentions > > The SAM format specification says: > NM Number of nucleotide differences (i.e. edit distance to the > reference sequence). Edit distance equals the sum of mismatches, gap > opens and gap extensions. Mismatches or gaps in clipped sequences are > not counted. > > So, one would think that NM=XM+X0+XG, and that NM equals the number of > errors. NM and XM/XO/XG are calculated at different procedures. In general, trust NM and ignore XM/XO/XG. > However, these numbers do not always add up. It seems to me that there > are no cases where XO>XG, possibly suggesting that one single gap > corresponds to XO=1 and XG=1 (so that XG is the number of errors, > excluding mismatches), and that NM=XM+XG. There are however also other > cases, including some where NM is really high, although XM=XO=XG=0 (I > include some examples below). How is NM defined in BWA, and what is the > actual number of errors? > > Second, how can I see whether a read maps uniquely? I have used the tags > X0 (Number of best hits) and XT (Type: Unique/Repeat/N/Mate-sw), but it > is not clear to me if there is any way of telling whether reads with > XT=M or XT=N are uniquely aligned or not? (I read the previous thread > Meaning of the XT tag) Use mapping quality and do not look at any tags. > Third, what does it mean that a read is not "mapped in a proper pair"? > It seems to me that this includes all reads where any of the two reads > in the pair is not mapped to the reference sequence, but also some > additional reads. Which are these? Proper pairs are ones with correct orientation and within a sensible distance. For illumina short-insert libraries, the distance is about 6 sigma of the insert size distribution. The exact boundaries are outputted in the stderr of sampe. > Finally, I wonder how the mapping qualities are calculated? It seems > like the range of MAPQ differs depending on whether I do a single-end or > a paired-end alignment (in my run, they range up to 37 in the first > case, and to 60 in the second)? Also, if MAPQ is specific for a read > rather than a pair, how come a read which is uniquely aligned with no > mismatches or errors have MAPQ=0 (see example below)? I would like to > determine a cutoff for what is a "good" MAPQ, but it is difficult since > I do not know how to interpret these values. Unless you develop algorithms, you do not need to understand how mapping quality is calculated. Different programs have different models, but they all mean the same thing: the Phred scaled probability of the mapping not overlapping the true position. As to your specific example, the top hit is perfect, but there are 61 1-mismatch hits. It is quite possible that one of these 61 1-mismatch hit is true while the perfect hit just arises due to an sequencing error or a true mutation. Heng > > Thanks for any advice! > Eva Berglund > > # NM=XM+XG, for the first read > EAS18_0111_FC627M2AAXX:1:1:18902:1035#0 83 chr10 42662134 > 15 1S29M3I14M2S = 42661895 -282 > CATCATTGAATGGAATCGAATGGAATCATCAAAAAATGTAAACGAATAG > IIGIIIHIIIIHIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIII XT:A:M > NM:i:6 SM:i:15 AM:i:15 XM:i:3 XO:i:1 XG:i:3 > MD:Z:16A17G2T5 > EAS18_0111_FC627M2AAXX:1:1:18902:1035#0 163 chr10 42661895 > 15 49M = 42662134 282 > AATCATCATCGAATGGAATCGAATGGAATCATCATTGAATGGAATCAAA > IIIIIIIIIIHIIIIIIIIIGGGGGGGGGGIIIHIIGEFIDBHHHIIFH XT:A:U > NM:i:0 SM:i:15 AM:i:15 X0:i:1 X1:i:7 XM:i:0 XO:i:0 > XG:i:0 MD:Z:49 > > # NM=4 (XM+XO+XG=3, XM+XG=2), for the first read > EAS18_0111_FC627M2AAXX:1:1:6631:2402#0 83 chr16 19759371 > 60 47M2D2M = 19759224 -198 > GTTCAACCATCTGGGCTCAAATTCTTTTTTTTTTTTTTTTTTTTTTGAG > :=>81716.*306;:*335@969)DIIIIIIIIIIIIIIIIIIIIIIII XT:A:U > NM:i:4 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:1 > XG:i:2 MD:Z:44G0A1^AC2 > EAS18_0111_FC627M2AAXX:1:1:6631:2402#0 163 chr16 19759224 > 60 49M = 19759371 198 > AATGGTGGCCCTTTTTGCAACTTACCGTATACACTGCGTTAGAAATAAT > IIIIIIIIIIIIIIIIIFIIIIIIIIHHEIFIIIIDGIFGHIIHIHIIF XT:A:U > NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 > XG:i:0 MD:Z:49 > > # NM=29, although XM=XO=XG=0, for the first read > EAS18_0111_FC627M2AAXX:1:35:1616:7565#0 99 chr2 3314245 29 > 48M = 3314427 231 > AACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCG > G:GEEGBGG=G=GG?GGEDBEEDGG>DDDA@?GBGG>GGB?D:8D>BA XT:A:R NM:i:29 > SM:i:0 AM:i:0 X0:i:2214 XM:i:0 XO:i:0 XG:i:0 > MD:Z:2A3C0G0T0C0T0C0T0A0C0T4A0T0A0C5A1T0A0G1C2G0C0G0T0A0G0T1G0C0 > EAS18_0111_FC627M2AAXX:1:35:1616:7565#0 147 chr2 3314427 > 37 49M = 3314245 -231 > AAAAAAAAAAAAAAAAAAAAGTCTTGATAATTTAAAACTAAAAACTTAT > IHDIFIHGID>GGD@II@IEF?BBDDGGGE??DDGGGEGDGEGG>F:EF XT:A:U > NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 > XG:i:0 MD:Z:49 > > # MAPQ=0 for the first read, although it is uniquely aligned and XM=XO=XG=0 > EAS18_0111_FC627M2AAXX:1:1:13564:1065#0 99 chr13 60479205 > 0 49M = 60479265 109 > GGTGGCTCATGCCTGTAATCCCAGCAGTTTGGGAGGCCGAGACGGGCAG > GGFGGHHHHHHHHHHGGBGGGEGG@AFCEFGGBGGHFHHEDDDGGEEEG XT:A:U > NM:i:0 SM:i:0 AM:i:0 X0:i:1 X1:i:61 XM:i:0 XO:i:0 > XG:i:0 MD:Z:49 > EAS18_0111_FC627M2AAXX:1:1:13564:1065#0 147 chr13 60479265 > 0 49M = 60479205 -109 > AGGAGTGCGAGACTAGCCTGGCGAACGTGGCAAAACCCTGTCTCTACTA > BEABGD+GFEGBGDEIIIDGIHDIIIGHIIIIIFGIIIGIIIHIGIDHB XT:A:M > NM:i:11 SM:i:0 AM:i:0 XM:i:11 XO:i:0 XG:i:0 > MD:Z:5A0T5T0C1T6T2A0A3T0G6C10 XA:Z:chr4,-14830339,49M,2; > > > ------------------------------------------------------------------------------ > Nokia and AT&T present the 2010 Calling All Innovators-North America contest > Create new apps & games for the Nokia N8 for consumers in U.S. and Canada > $10 million total in prizes - $4M cash, 500 devices, nearly $6M in marketing > Develop with Nokia Qt SDK, Web Runtime, or Java and Publish to Ovi Store > http://p.sf.net/sfu/nokia-dev2dev > _______________________________________________ > Bio-bwa-help mailing list > Bio...@li... > https://lists.sourceforge.net/lists/listinfo/bio-bwa-help -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |