Menu

#11 2 different REF bases reported by Scalpel

1.0
closed
None
1
2017-06-19
2017-06-09
No

Hi Giuseppe,

I am getting a weird output from Scalpel, where it is reporting 2 different bases as REF at the same genomic location (example: MT:309 correct REF: "C", Scalpel reports "C" and "T" at the same location, MT:310 is a "T" base) and I can't seem to understand why that's happening. Ref at a particular genomic position should always be constant based on the reference used.

The VCF output looks like:
MT 309 . C CCT 119.95 HighChi2score AVGCOV=269.0;MINCOV=269;ALTCOV=590;ZYG=het;COVRATIO=0.31;CHI2=119.95;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:590,269:859
MT 309 . T TC 758.15 LowVaf;HighChi2score AVGCOV=26.0;MINCOV=26;ALTCOV=833;ZYG=het;COVRATIO=0.03;CHI2=758.15;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:833,26:859
MT 309 . C CCCT 758.15 LowVaf;HighChi2score AVGCOV=26.0;MINCOV=26;ALTCOV=833;ZYG=het;COVRATIO=0.03;CHI2=758.15;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:833,26:859

Can you please help me figure this out.

Thank you,
Ashini

Discussion

  • Giuseppe Narzisi

    Hi Ashini,

    a similar problem was reported in the past by another user but at a different location:

    https://sourceforge.net/p/scalpel/bugs/7/

    even MuTect was reported to preduce the same error at your same location MT:309 :
    https://github.com/chapmanb/bcbio-nextgen/issues/1362

    I am listing below a few possible scenarios that can couse this problem:

    1. The reference used to align the reads may be different from the one used to run scalpel. Please double check that you are provinding the exact same reference file.

    2. It is possible that this a complex variant where there is a SNP right before the indel. Since scalpel extract the previous base from the assembled sequence, it is possible that it does not match the reference. Inspection of the alignements in IGV could help figure this out.

    3. Are you running scalpel using a BED file as input? Or are you runing it on a small region (chr:start-end) around the variant? If a very short region is provided scalpel uses samtools to load the reference sequence for the provided region (to speed things up) and it was previsouly reported to generate different results compared to loading the whole reference sequence: https://github.com/chapmanb/bcbio-nextgen/issues/1237

    If none of these seem to be your problem, please extarct a small bam around the variant and send it to me so that I can reproduce your problem and possibly debug it.

     
  • Ashini Bolia

    Ashini Bolia - 2017-06-14

    Thanks for your reply. We are using the same refernece to align reads and to run Scalpel. So, the first scenario is ruled out.
    Second, Looking at the bam in IGV (snapshot attached0, I can see an insertion (bases CCT) happening between 309-310 position and a SNP at 310 (T>C), so the SNP is right after the insertion. Even though, scalpel extract the previous base from the assembled sequence, shouldn't that effect "ALT", because the assmeble sequence will reflect the alternate base, not the reference as that comes from reference fasta file. I am not sure how will it effect the reference base ?

    Third, I am listing all the variants found (5) in the region MT:305-310 here below, when you give bedfile (bed has region: MT 0 16569),
    MT 308 . C CCT 209.28 HighChi2score AVGCOV=35.5;MINCOV=33;ALTCOV=295;ZYG=het;COVRATIO=0.10;CHI2=209.28;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:295,33:328
    MT 309 . C CCT 119.95 HighChi2score AVGCOV=269.0;MINCOV=269;ALTCOV=590;ZYG=het;COVRATIO=0.31;CHI2=119.95;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:590,269:859
    MT 309 . T TC 758.15 LowVaf;HighChi2score AVGCOV=26.0;MINCOV=26;ALTCOV=833;ZYG=het;COVRATIO=0.03;CHI2=758.15;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:833,26:859
    MT 309 . C CCCT 758.15 LowVaf;HighChi2score AVGCOV=26.0;MINCOV=26;ALTCOV=833;ZYG=het;COVRATIO=0.03;CHI2=758.15;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:833,26:859
    MT 310 . T TC 148.51 HighChi2score AVGCOV=51.0;MINCOV=51;ALTCOV=269;ZYG=het;COVRATIO=0.16;CHI2=148.51;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:269,51:320

    However, when you give a small region (I used MT:305-310) on the command line, I get only 3 variants, now I loose the avriant MT:309 T>TC), thius solving the problem. I also loose MT:308 C>CCT variant, which I cannot see in IGV, so looks like a fasle psoitive either.
    MT 309 . C CCCT . LowVaf;HighChi2score AVGCOV=9.0;MINCOV=9;ALTCOV=410;ZYG=het;COVRATIO=0.02;CHI2=383.77;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:410,9:419
    MT 309 . C CCT . HighChi2score AVGCOV=117.0;MINCOV=117;ALTCOV=302;ZYG=het;COVRATIO=0.28;CHI2=81.68;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:302,117:419
    MT 310 . T TC . HighChi2score AVGCOV=30.0;MINCOV=30;ALTCOV=176;ZYG=het;COVRATIO=0.15;CHI2=103.48;FISHERPHREDSCORE=0;INH=na;BESTSTATE=na;COVSTATE=na GT:AD:DP 0/1:176,30:206

    The third scenario seem to have solved the problem, however I am not sure how to implement that in our workflow. We need to analyze large regions and giving a subset wouldn't make sense. Do you have any recommendation?

    FYI: I am using Scalpel 0.5.3 version.

    Thank you for all your help.
    Ashini

     
  • Giuseppe Narzisi

    Is CCT the inserted sequence shown in the IGV for all the alignments? Or just CT? It is possible that scalpel is joining together the two variants into one. Also this is within a long streatch of homopolymer C and I would be suprised if there is no subset of reads that show a deletion of a C also. The deletion combined with the T variant may cause the T to shift down of one base and, the alignment of the assembled sequence to the reference, could cause the MT 309 . T TC

    When you run scalpel on a small region you should alway choose a windows size that is at least 2 to 3 times the read size. This is because only reads fully aligning within the region are used for assembly. The region that you used, MT:305-310, is too small. Please try again wiht a larger region to check that those mutation indeed disappear.

     

    Last edit: Giuseppe Narzisi 2017-06-19
  • Ashini Bolia

    Ashini Bolia - 2017-06-16

    Ah, you are right. There is an insertion for CT. only few reads support CCT. Also, where do you see a deletion. I can only see insertions here.

    Moreover, since our read length are 150bp, I played with different regions sizes. When I used MT:100-500, I get all 5 variants including both refs "C" and "T" like I would get with a bed. However, when I use MT:1-650 (exact 3 times read length), I get only 3 variants (loosing MT:308 C>CCTand MT:309 T>TC variant). So, this seems to be very sepcific to the size range of the region I choose.

     
  • Giuseppe Narzisi

    It is possible, in rare cases, to get a different set of variants when using different regions sizes. This is because scalpel automatically adjust the k-mer size (to build a cycle free deBruijn graph) as a function of the sequnce compositon of the region. In your case the larger region may contain a more complecated repeat that requires a larger k-mer. But a larger k-mer makes the tool less sensitive to very low coverage variants, which is likely what it is happening here.

     
  • Ashini Bolia

    Ashini Bolia - 2017-06-19

    So, in this particular scenario, do you have any recommendation on how to have a general approach to solve this problem to seeing 2 ref bases or does this have to be a special case that could be flagged and manually evaluated?

     
  • Giuseppe Narzisi

    Unfortunately I don't have a solution for the 2 ref bases issue. Yes, it would be good to flag this scenario when encountered and manually inspect. Another option is to filter them out after they are called if you don't trust such variants.

     
  • Ashini Bolia

    Ashini Bolia - 2017-06-19

    Okay.. Sounds good. Thank you so much for all your help. I really appreciate you helping me with this issue.

    Ashini

     
  • Giuseppe Narzisi

    • status: open --> closed
     

Log in to post a comment.

MongoDB Logo MongoDB