Re: [Bio-bwa-help] BWA allowing too many mismatches?

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

OK, thanks again, Joe. I'm going to run a script to pick out all the reads
that were mapped this way (have the XT:A:M tag) and see if that accounts for
all the reads with large edit distances...

On Sun, Jan 30, 2011 at 6:03 PM, Joseph Fass <jos...@gm...> wrote:

> Hi Kris,
> I'm sure there's a cutoff for the Smith-Waterman alignment score, but I
> don't know what it is (or if it depends on length, e.g.). I don't know of a
> way to turn it off, but I think the "XT:A:M" tag will reliably indicate that
> the read was aligned using Smith-Waterman after being anchored by its mate.
> You could selectively edit those SAM records if you wanted to disqualify
> those reads?
> ~Joe
>
>
> On Sun, Jan 30, 2011 at 1:26 PM, Kris W <kri...@gm...> wrote:
>
>> Hi Joe,
>> Thanks for replying. Well, the mate is mapped with 2 mismatches. Is this
>> the default behavior: if one mate maps within the mismatch threshold, the
>> other mate is mapped with no threshold at all? If that is the default
>> behavior, is there a way to turn it off? Do you know where I can find the
>> documentation for this?
>>
>> Here is the mapping of the mate, if that is helpful. (But keep in mind
>> that 100% or our reads mapped):
>>
>> IPAR1:4:104:13763:17273:1#0     99      chr10   42597783        15
>> 76M =42597861        142
>> CAAATGGATTTATCGAATGCAATCGAATGGAATTATCGAATGCAATCGAATAGAATCATCGAATGGACTCGAATGG
>> GGBGGGGGGGIIIIIHIHIIIIIIIIIIIIIIIGIIIIIIIIIIIHIIIGHIHHHFHFHIIIIHHGGGGHGFGEEC
>> XT:A:U  NM:i:2  SM:i:15 AM:i:15 X0:i:1  X1:i:3  XM:i:2       XO:i:0  XG:i:0
>> MD:Z:8A57G9
>> XA:Z:chr10,+42395979,76M,3;chr10,+42390905,76M,3;chr10,+42599159,76M,3;
>>
>>
>>
>> On Sun, Jan 30, 2011 at 9:11 AM, Joseph Fass <jos...@gm...>wrote:
>>
>>> Hi Kris,
>>> Unmapped reads can be "anchored" by their mapped mates for -more
>>> permissive- Smith-Waterman alignment. Does that explain it for the read you
>>> show?
>>> ~Joe
>>> On Jan 30, 2011 6:25 AM, "Kris W" <kri...@gm...> wrote:
>>> > After aligning paired-end Illumina reads with BWA (see bottom for
>>> details),
>>> > I'm finding reads that are mapped with a high number of mismatches -
>>> higher
>>> > than should be allowed, I think.
>>> >
>>> > Here is an example of one of these reads (one end of a paired read): It
>>> is
>>> > 76 bp long and so - if I understand correctly - should not be allowed
>>> to
>>> > have more than 5 mismatches:
>>> > @IPAR1:4:104:13763:17273:1#0/3
>>> >
>>> CGATTGTATTCGATGGTGATTGATTTCGAGTCCATGGATTAATCCATTGCATTCCATTCGGTGANNNNNNNNNNNN
>>> > +
>>> > IIIIIIIIIIIIIIHIHEHIHGIIHIHIIGHIIHIEECEE?BBB@EEEEEIIIIIIIIIH
>>> ?=A#############
>>> >
>>> > However, BWA mapped it with 16 mismatches:
>>> > IPAR1:4:104:13763:17273:1#0/3 147 chr10 42597861 15
>>> > 12S64M = 42597783 -142
>>> >
>>> NNNNNNNNNNNNTCACCGAATGGAATGCAATGGATTAATCCATGGACTCGAAATCAATCACCATCGAATACAATCG
>>> > #############A=?HIIIIIIIIIEEEEE@BBB
>>> ?EECEEIHIIHGIIHIHIIGHIHEHIHIIIIIIIIIIIIII
>>> > XT:A:M NM:i:16 SM:i:15 AM:i:15 XM:i:16 XO:i:0 XG:i:0
>>> > MD:Z:3T11G6A2G0A1A3A1T0A4T0G0G5T3T4G0G5
>>> >
>>> > Here are the details of how I used BWA. I tried this with both versions
>>> > 0.5.6 and 0.5.9 (RHEL5, 64 bit) and got the same results:
>>> >
>>> > 1. index the reference genome: (this step was only done once - with
>>> > version 0.5.6)
>>> > bwa index -a bwtsw <reference fasta (hg19 complete)>
>>> >
>>> > 2. make one sai file from each fastq file:
>>> > bwa aln <reference fasta> <reads1.fq> > <reads1.sai>
>>> > bwa aln <reference fasta> <reads2.fq> > <reads2.sai>
>>> >
>>> > 3. output alignment to sam file:
>>> > bwa sampe <reference fasta> <reads1.sai> <reads2.sai> <reads1.fq>
>>> > <reads2.fq> > <alignment.sam>
>>> >
>>> > (I used defaults for all parameters.)
>>> >
>>> > After converting the SAM file to BAM, sorting it, and removing PCR
>>> > duplicates (all with samtools), the "samtools flagstat" command shows
>>> that
>>> > 100% of the reads are mapped.
>>> >
>>> > Is this a bug, or am I doing something wrong, or have I misunderstood
>>> what
>>> > should happen with the default parameters? The output from running "bwa
>>> > aln" includes these lines, and I'm assuming "max_diff" means the
>>> maximum
>>> > edit distance allowed for a read to be mapped:
>>> > [bwa_aln] 17bp reads: max_diff = 2
>>> > [bwa_aln] 38bp reads: max_diff = 3
>>> > [bwa_aln] 64bp reads: max_diff = 4
>>> > [bwa_aln] 93bp reads: max_diff = 5
>>> > [bwa_aln] 124bp reads: max_diff = 6
>>> > [bwa_aln] 157bp reads: max_diff = 7
>>> > [bwa_aln] 190bp reads: max_diff = 8
>>> > [bwa_aln] 225bp reads: max_diff = 9
>>> >
>>> > Thanks for any help!
>>> >
>>> > -Kris
>>>
>>
>>
>
>
> --
> Joseph Fass
> Bioinformatics Programmer
> UC Davis Bioinformatics Core
> joseph.fass -at- gmail.com (professional)
> 970.227.5928 (c)  ||  530.752.2698 (w)
>