I recently found when running the bcbio pipeline, that scalpel generates a vcf output with an inconsistent reference in some cases. See here. I have a small example with bam, bed, output, etc files here.
The reson for the error is that the previous base reported for one of your variants is not matching the reference. In fact (by contradiction) there are two variant in your vcf file that have different base-pair matching the reference GL000225.1 at position 206:
GL000225.1 206 . CT C
GL000225.1 206 . GC G
This is not possible.
I ran both scalpel v0.5.1 (that you used) and the most recent version v0.5.3 and in both cases for your small dataset my job produces only 1 variant in ouput:
GL000225.1 546 . GA G
Are you sure you used the same reference sequence file both for calling and aligning the reads?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thanks for checking this. I used the bcbio pipeline for the bwa alignment and running scalpel, so the reference should be the same. It was not set by me, I only specified genome_build: GRCh37 in the config file. I'll try to dig into the details and check if there are any discrepancies.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
This is pazzling! I downloaded the referece (human_g1k_v37.fasta and human_g1k_v37.decoy.fasta, both with and without decoy) from the GATK bundle ftp site that you gave me. I still get the same results as before. Only one variant:
GL000225.1 546 . GA G
Can you try to download both scalpel and the reference that you sent me and try your test case outside of the bcbio framework?
Also, I have been thinking about your problem and the only other possibility is that this a complex variant where there is a SNP right before the deletion. Since scalpel extract the previous base from the assembled sequence, it is possible that it does not match the reference...
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Please see the comments at github. The use of samtools in FindVariants.pl seems to lead to a variants.indel.vcf that contains errors. I have no idea why.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thank you for the update Endre. Glad the source of the probolem was found.
Maybe scalpel's code does not handle properly patch chromosomes names (e.g., GL000225.1) when loading the BED/FASTA files (which happens when setting $USEFAIDX=1). I will do some additional investigation in the future, but it is not high prioprity for me at this time.
Thank you again for your feedback and the investigation work!
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thank you Endre for your feedback.
I will look into it and, if needed, I will submit a patch to fix the problem.
The reson for the error is that the previous base reported for one of your variants is not matching the reference. In fact (by contradiction) there are two variant in your vcf file that have different base-pair matching the reference GL000225.1 at position 206:
GL000225.1 206 . CT C
GL000225.1 206 . GC G
This is not possible.
I ran both scalpel v0.5.1 (that you used) and the most recent version v0.5.3 and in both cases for your small dataset my job produces only 1 variant in ouput:
GL000225.1 546 . GA G
Are you sure you used the same reference sequence file both for calling and aligning the reads?
Thanks for checking this. I used the bcbio pipeline for the bwa alignment and running scalpel, so the reference should be the same. It was not set by me, I only specified genome_build: GRCh37 in the config file. I'll try to dig into the details and check if there are any discrepancies.
Hi! Brad Chapman can reproduce the error that I also see, both with 0.5.1 and 0.5.3. Can you check if you get the error with this reference file:
ftp://gsapubftp-anonymous:none@ftp.broadinstitute.org/bundle/2.8/b37/
The reference files can't really be out of sync, but maybe using the above link helps in finding out what is happening.
This is pazzling! I downloaded the referece (human_g1k_v37.fasta and human_g1k_v37.decoy.fasta, both with and without decoy) from the GATK bundle ftp site that you gave me. I still get the same results as before. Only one variant:
GL000225.1 546 . GA G
Can you try to download both scalpel and the reference that you sent me and try your test case outside of the bcbio framework?
Also, I have been thinking about your problem and the only other possibility is that this a complex variant where there is a SNP right before the deletion. Since scalpel extract the previous base from the assembled sequence, it is possible that it does not match the reference...
Hmmm, so if I download scalpel 0.5.1, compile and run it with exactly the same parameters as in the bcbio pipeline I do not get the bug. Weird ...
Please see the comments at github. The use of samtools in FindVariants.pl seems to lead to a variants.indel.vcf that contains errors. I have no idea why.
Thank you for the update Endre. Glad the source of the probolem was found.
Maybe scalpel's code does not handle properly patch chromosomes names (e.g., GL000225.1) when loading the BED/FASTA files (which happens when setting $USEFAIDX=1). I will do some additional investigation in the future, but it is not high prioprity for me at this time.
Thank you again for your feedback and the investigation work!