Menu

#7 scalpel output vcf with inconsistent reference

1.0
closed
nobody
None
1
2016-03-16
2016-02-23
No

I recently found when running the bcbio pipeline, that scalpel generates a vcf output with an inconsistent reference in some cases. See here. I have a small example with bam, bed, output, etc files here.

Discussion

  • Giuseppe Narzisi

    Thank you Endre for your feedback.
    I will look into it and, if needed, I will submit a patch to fix the problem.

     
  • Giuseppe Narzisi

    The reson for the error is that the previous base reported for one of your variants is not matching the reference. In fact (by contradiction) there are two variant in your vcf file that have different base-pair matching the reference GL000225.1 at position 206:

    GL000225.1 206 . CT C
    GL000225.1 206 . GC G

    This is not possible.

    I ran both scalpel v0.5.1 (that you used) and the most recent version v0.5.3 and in both cases for your small dataset my job produces only 1 variant in ouput:

    GL000225.1 546 . GA G

    Are you sure you used the same reference sequence file both for calling and aligning the reads?

     
    • Endre Sebestyen

      Endre Sebestyen - 2016-03-04

      Thanks for checking this. I used the bcbio pipeline for the bwa alignment and running scalpel, so the reference should be the same. It was not set by me, I only specified genome_build: GRCh37 in the config file. I'll try to dig into the details and check if there are any discrepancies.

       
  • Endre Sebestyen

    Endre Sebestyen - 2016-03-06

    Hi! Brad Chapman can reproduce the error that I also see, both with 0.5.1 and 0.5.3. Can you check if you get the error with this reference file:

    ftp://gsapubftp-anonymous:none@ftp.broadinstitute.org/bundle/2.8/b37/

    The reference files can't really be out of sync, but maybe using the above link helps in finding out what is happening.

     
  • Giuseppe Narzisi

    This is pazzling! I downloaded the referece (human_g1k_v37.fasta and human_g1k_v37.decoy.fasta, both with and without decoy) from the GATK bundle ftp site that you gave me. I still get the same results as before. Only one variant:

    GL000225.1 546 . GA G

    Can you try to download both scalpel and the reference that you sent me and try your test case outside of the bcbio framework?

    Also, I have been thinking about your problem and the only other possibility is that this a complex variant where there is a SNP right before the deletion. Since scalpel extract the previous base from the assembled sequence, it is possible that it does not match the reference...

     
  • Endre Sebestyen

    Endre Sebestyen - 2016-03-15

    Hmmm, so if I download scalpel 0.5.1, compile and run it with exactly the same parameters as in the bcbio pipeline I do not get the bug. Weird ...

     
  • Endre Sebestyen

    Endre Sebestyen - 2016-03-16

    Please see the comments at github. The use of samtools in FindVariants.pl seems to lead to a variants.indel.vcf that contains errors. I have no idea why.

     
  • Giuseppe Narzisi

    Thank you for the update Endre. Glad the source of the probolem was found.

    Maybe scalpel's code does not handle properly patch chromosomes names (e.g., GL000225.1) when loading the BED/FASTA files (which happens when setting $USEFAIDX=1). I will do some additional investigation in the future, but it is not high prioprity for me at this time.

    Thank you again for your feedback and the investigation work!

     
  • Giuseppe Narzisi

    • status: open --> closed
     

Log in to post a comment.

MongoDB Logo MongoDB