Re: [Bio-bwa-help] Bug in BWA
Status: Beta
Brought to you by:
lh3lh3
From: Neva D. <ne...@br...> - 2017-09-11 18:40:42
|
We can use these reads, by taking as the Hi-C contact the portions of the read that map uniquely. I.e. in this example we would take the first read and the uniquely mapping portion of the second read, ignoring the repetitive read. Though this actually represent a 3 way contact in which one end maps to a repetitive region, we would at least be able to make use of the information about the two uniquely mapping portions being in contact. With the -5 flag on and using the master version, running with splitting fastqs and without, there are still differences; they are quite minor as a percentage of reads, but we would like to be able to report the exact same Hi-C contacts regardless of whether or not the fastqs were split. I am now getting cases where one read end is entirely reported as MAPQ 0 on one run and on another run, is reported as chimeric with at least a portion mappable. Example below (again the ligation junction is present). Do you know the source of this inconsistency / is it possible to mitigate it? Note that we can be totally consistent (in terms of splitting fastqs) by throwing out any read with minimum MAPQ = 0, but then we lose a great deal of reads; in this latest run, we lose 5% (i.e. 5% of the reads have a supplementary alignment that is MAPQ0 and if we ignore just the MAPQ0 portion, we can recover a Hi-C contact). *M00336:181:000000000-A29H6:1:1103:25909:14227* 97 16 46398932 15 47M53S = 46391997 -6837 CGATTCCATTTGATGATTTCATTTGATTCCATTTGTTAATGATTTCGATCGATCATTCCATTTGATTTCAGTCGATGATGATTCCATTCGATTCCATTCG AAABBFFFFFFFGGGGGGGGGGHHHGHHHFHHHHHHHGHHHHHHHHEGHHGGHHGHHHHFHHHHFHHHHBFHHGEGHFHHHHHHHHHHHHDHHHHHHHHH NM:i:2 MD:Z:27C5C13 MC:Z:100M AS:i:37 XS:i:33 SA:Z:16,46403672,+,50S50M,0,1; XA:Z:2,-89845901,67S33M,0;16,-46450076,65S35M,1; *M00336:181:000000000-A29H6:1:1103:25909:14227* 353 16 46403672 0 50H50M = 46391997 -11577 GATCATTCCATTTGATTTCAGTCGATGATGATTCCATTCGATTCCATTCG GGHHGHHHHFHHHHFHHHHBFHHGEGHFHHHHHHHHHHHHDHHHHHHHHH NM:i:1 MD:Z:20T29 MC:Z:100M AS:i:45 XS:i:45 SA:Z:16,46398932,+,47M53S,15,2; *M00336:181:000000000-A29H6:1:1103:25909:14227* 145 16 46391997 26 100M = 46398932 6837 TTCTATTCGATTCTCTTCGATGGTGATTCAATTCTATTATATTGGATGATTCCATTCGATTCCATTTGATGTTGATTCAATTCGATTCTATTCGATGATG FHHEHHHHEHHHHFGHHGHHHHEHHGGHHGHHEHHHGHEHHHHHHGHHGHHHHHFHHHHHHHHHHHHHHHHG5FHHHHGGGGGGGGFGBFFFFFFBBBBB NM:i:0 MD:Z:100 MC:Z:47M53S AS:i:100 XS:i:90 XA:Z:16,+46428782,100M,2; ------------------------------------ *M00336:181:000000000-A29H6:1:1103:25909:14227* 97 7 64954041 0 100M 16 46391997 0 CGATTCCATTTGATGATTTCATTTGATTCCATTTGTTAATGATTTCGATCGATCATTCCATTTGATTTCAGTCGATGATGATTCCATTCGATTCCATTCG AAABBFFFFFFFGGGGGGGGGGHHHGHHHFHHHHHHHGHHHHHHHHEGHHGGHHGHHHHFHHHHFHHHHBFHHGEGHFHHHHHHHHHHHHDHHHHHHHHH NM:i:11 MD:Z:33C1A1G6C1A0T5G8C4C2T2T26 MC:Z:100M AS:i:45 XS:i:45 *M00336:181:000000000-A29H6:1:1103:25909:14227* 145 16 46391997 26 100M 7 64954041 0 TTCTATTCGATTCTCTTCGATGGTGATTCAATTCTATTATATTGGATGATTCCATTCGATTCCATTTGATGTTGATTCAATTCGATTCTATTCGATGATG FHHEHHHHEHHHHFGHHGHHHHEHHGGHHGHHEHHHGHEHHHHHHGHHGHHHHHFHHHHHHHHHHHHHHHHG5FHHHHGGGGGGGGFGBFFFFFFBBBBB NM:i:0 MD:Z:100 MC:Z:100M AS:i:100 XS:i:90 XA:Z:16,+46428782,100M,2; On Mon, Sep 11, 2017 at 10:51 AM, Heng Li <lh...@me...> wrote: > The difference between with -5 and without -5 should be big, because the > primary assignment is different. Without -5, the run-to-run difference > should be tiny – it only occurs when the two parts of a read have identical > alignment score, just as your example shows. In addition, I am not sure how > this difference matters in downstream. In your example, you have a unique > part of read and a repetitive part of the read. You can't reliably build a > map from this read no matter how mapq is assigned to the unique part. > > Anyway, with the new change to the master, just apply -5 for Hi-C or > similar data. > > Heng > > > On Sep 11, 2017, at 10:35, Neva Durand <ne...@br...> wrote: > > > > I will also note that it's inconsistent - depending on the fastq splits, > I can align the same reads and will recover different chimeras. I suppose > I could take all reads of this nature (supplementary alignments with two > reads MAPQ0) and rerun them with a different split size until I recover > everything, but this is a pretty hacky solution. > > > > On Mon, Sep 11, 2017 at 10:32 AM, Neva Durand <ne...@br...> > wrote: > > So the problem is that BWA tries to align the portion that is a repeat > region first, and then the supplementary alignment cannot be found even > though it is alignable? This still seems to be a bug, because a perfectly > alignable read has to be thrown away due to arbitrarily looking at the > "bad" portion first. > > > > For us, the fact that this arises when the fastqs are split is very > problematic. I only mentioned the -5 flag because collaborators use it and > I saw the problem then as well. > > > > On Mon, Sep 11, 2017 at 10:12 AM, Heng Li <lh...@me...> wrote: > > By default, bwa-mem sets the primary alignment as the one with the > highest alignment score and it doesn't give supplementary alignments a > higher mapq than the mapq of the primary because shorter alignments are > usually less reliable. In your example, the primary alignment is different. > That leads to the mapq difference. > > > > Please check out the master. I have disabled this heuristic when you use > option -5. > > > > Heng > > > > > On Sep 11, 2017, at 09:03, Neva Durand <ne...@br...> > wrote: > > > > > > I believe I have found a bug in BWA. This affects only supplementary > alignments, but those alignments are very important to us as our primary > experiment is a ligation assay where many reads are chimeric. > > > > > > When conditions are slightly changed between runs, BWA will report > supplementary reads that previously had been reported as MAPQ 0 with a MAPQ > of 60, and vice versa. One way to induce this is to simply split the > fastqs. Another way is to run with the "-5" flag on or off (available in > bwa-0.7.15-1142-dirty). > > > > > > Though I know the random seed would change based on such conditions, I > would think it would only affect where the MAPQ0 read mapped, not whether > it was a MAPQ 0 read. > > > > > > Note also, in spot checking these reads that are inconsistent between > runs against BLAT, the choice of MAPQ 60 is actually the correct one; these > are not repeat regions. > > > > > > Here is an example. In this case, the first read end maps to one > place. The second read end is chimeric (perfectly so, in fact - with 50 bp > in each portion and the ligation junction in the middle). One time I ran > it, both ends of the second read end had MAPQ0. Another time, one end of > the second read mapped with MAPQ 0 - to the exact same place as in the > first run - and upon double checking with BLAT, indeed this is an alignable > read. I have highlighted the discordant read. > > > > > > M00336:181:000000000-A29H6:1:1102:3011:19564 97 9 > 85403418 51 100M 2 219173546 0 > ATGGTGAAACCCCATCTCTACTAAAAATACAAAACTTAGCCAGGTGTGGTGGCACATGCC > TGTAGTCCCAGCTACTGGGAGGCTGAGGCAGGAGAATCGC BBABAFFFFFFBGGGGGGGGGGHHHHHHHH > HHFFHHFHGGHHFDBFGGGHGGHHHHHGHGHHHHHDHGHFBFHHFHHHGEGCGHHGHFGGGCEGHHFHHG > NM:i:0 MD:Z:100 AS:i:100 XS:i:80 > > > > > > M00336:181:000000000-A29H6:1:1102:3011:19564 145 2 > 219173546 0 50S50M 9 85403418 0 > AAAGAGTGCATGAAGAAGGAATATCTCCACACTTATCAAACCATCAGATCGATCTGCCTG > CCTCAACCTCCCAAAGTGCTGGGATTACAGGAGTGAGCCA FHHHHHHFDGFBDFHHEHHHFBGB? > 3DEGDABD5FHEHGBBBGFGGFEFHHGGGFFEEFB2F3HHFGGHGHHHGHHHFFGGGGGFFGFFFFFFBFAA3BA > NM:i:0 MD:Z:50 AS:i:50 XS:i:50 SA:Z:9,85403764,-,50M50S,0,0; > > > > > > M00336:181:000000000-A29H6:1:1102:3011:19564 2193 9 > 85403764 0 50M50H = 85403418 -396 > AAAGAGTGCATGAAGAAGGAATATCTCCACACTTATCAAACCATCAGATC > FHHHHHHFDGFBDFHHEHHHFBGB?3DEGDABD5FHEHGBBBGFGGFEFH NM:i:0 MD:Z:50 > AS:i:50 XS:i:20 SA:Z:2,219173546,-,50S50M,0,0; > > > > > > -------------------------- > > > > > > M00336:181:000000000-A29H6:1:1102:3011:19564 97 9 > 85403418 51 100M = 85403764 396 > ATGGTGAAACCCCATCTCTACTAAAAATACAAAACTTAGCCAGGTGTGGTGGCACATGCC > TGTAGTCCCAGCTACTGGGAGGCTGAGGCAGGAGAATCGC BBABAFFFFFFBGGGGGGGGGGHHHHHHHH > HHFFHHFHGGHHFDBFGGGHGGHHHHHGHGHHHHHDHGHFBFHHFHHHGEGCGHHGHFGGGCEGHHFHHG > NM:i:0 MD:Z:100 AS:i:100 XS:i:80 > > > > > > M00336:181:000000000-A29H6:1:1102:3011:19564 145 9 > 85403764 60 50M50S = 85403418 -396 > AAAGAGTGCATGAAGAAGGAATATCTCCACACTTATCAAACCATCAGATCGATCTGCCTG > CCTCAACCTCCCAAAGTGCTGGGATTACAGGAGTGAGCCA FHHHHHHFDGFBDFHHEHHHFBGB? > 3DEGDABD5FHEHGBBBGFGGFEFHHGGGFFEEFB2F3HHFGGHGHHHGHHHFFGGGGGFFGFFFFFFBFAA3BA > NM:i:0 MD:Z:50 AS:i:50 XS:i:20 SA:Z:19,17818690,-,50S50M,0,0; > > > > > > > > > M00336:181:000000000-A29H6:1:1102:3011:19564 2193 19 > 17818690 0 50H50M 9 85403418 0 > GATCTGCCTGCCTCAACCTCCCAAAGTGCTGGGATTACAGGAGTGAGCCA > HGGGFFEEFB2F3HHFGGHGHHHGHHHFFGGGGGFFGFFFFFFBFAA3BA NM:i:0 MD:Z:50 > AS:i:50 XS:i:50 SA:Z:9,85403764,-,50M50S,60,0; > > > > > > ------------------------- > > > > > > Please advise - BWA is really the only aligner for chimeric reads, but > this case comes up frequently (at least 1% of the time - and as the scale > of our sequencing is billions of reads, we lose tens of millions of exactly > the kinds of reads we're interested in analyzing). > > > > > > -- > > > Neva Cherniavsky Durand, Ph.D. > > > Staff Scientist, Aiden Lab > > > www.aidenlab.org > > > ------------------------------------------------------------ > ------------------ > > > Check out the vibrant tech community on one of the world's most > > > engaging tech sites, Slashdot.org! http://sdm.link/slashdot______ > _________________________________________ > > > Bio-bwa-help mailing list > > > Bio...@li... > > > https://lists.sourceforge.net/lists/listinfo/bio-bwa-help > > > > > > > > > > -- > > Neva Cherniavsky Durand, Ph.D. > > Staff Scientist, Aiden Lab > > www.aidenlab.org > > > > > > > > -- > > Neva Cherniavsky Durand, Ph.D. > > Staff Scientist, Aiden Lab > > www.aidenlab.org > > -- Neva Cherniavsky Durand, Ph.D. Staff Scientist, Aiden Lab www.aidenlab.org |