I downloaded a VCF file that was produced with the GATK pipeline. I used NGSEP to filter it: remove some samples, filter out sites with only one allele (-fi option), keep only biallelic SNVs (-s option), and a minimum MAF of 0.05. I noticed the filtered VCF file now includes the field ADP in the FORMAT column (I guess replacing the previous AD field), and the NS, AFS, MAF fields in the INFO column. However, these new fields are not described in the header of the filtered VCF. This generated an issue when trying to use bcftools with the final VCF file for further processing.
Thanks in advance for your help.
Daniel Ariza Suarez
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Many thanks for reporting this issue. Although the VCF files produced using the full NGSEP pipeline are completely compliant with the v4.2 specification, it is true that modules that read and write VCF files (namely annotation, filtering and imputation) do not double check if the header has the particular INFO and FORMAT tags that we created for NGSEP. Hence, output VCFs could have the issue that you mention if the input VCF was produced with some other tool / pipeline. We will have this behavior improved in future versions. In the mean time, please find attached a header that has the union of the INFO and format tags produced by GATK and NGSEP. You can just replace the header of your filtered VCF following these instructions:
I am not sure if you plan to use bcftools or vcftools for further processing. In either case, it could be the case that NGSEP already has the analysis that you need to perform and then you can just keep using the filtered VCF file without adding the missing headers. In any case, let me know if you have any trouble following the instructions.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2018-04-19
Hi Jorge,
Thank you very much for your help. I was able to properly modify the header in the VCF and now it's working. Thanks for sharing the GATK-NGSEP header file.
I'm trying to split multi-allelic variants into bi-allelic sites following this blog post, recommended in the PLINK data management page.
Best regards,
Daniel
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
No problem. About the blog post, I am still wondering which is the exact meaning of the expression "best practices" that some people just use to describe their pipelines. If you want to use Plink for LD-decay estimations or even for GWAS, a simpler solution is just to use the FilterVCF command keeping only biallelic SNVs (-s option in the command line). Because SNVs are the most frequent and stable type of variation, the loss of power produced by this filter should be minimal. If you want to keep biallelic indels, we still do not have a explicit filter for all biallelic variants but that can be done with a simple awk:
In general it does not sound right to me to break a multiallelic variant into two separate biallelic variants spanning the same location because the variant is actually multiallelic in the population. Depending of the exact analysis that you would like to make on multiallelic sites, you may want to look for other software that could be able to process this type of variation instead of trying to artificially break multiallelic variants.
In any case, if you want to follow the instructions of the blog post, I guess you can now do it with the VCF with the adjusted headers. Let us know if you have any further question.
Best regards
Jorge
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
Anonymous
-
2018-04-26
Hi Jorge,
Thanks for your suggestions. I'll give it a try and see what happen, but I agree with you in the sense that splitting multi-allelic variants may not be the best solution to use them for GWAS. There should be a specific tool for that, I'll search if there's any.
Best,
Daniel
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hi there,
I downloaded a VCF file that was produced with the GATK pipeline. I used NGSEP to filter it: remove some samples, filter out sites with only one allele (-fi option), keep only biallelic SNVs (-s option), and a minimum MAF of 0.05. I noticed the filtered VCF file now includes the field ADP in the FORMAT column (I guess replacing the previous AD field), and the NS, AFS, MAF fields in the INFO column. However, these new fields are not described in the header of the filtered VCF. This generated an issue when trying to use bcftools with the final VCF file for further processing.
Thanks in advance for your help.
Daniel Ariza Suarez
Hi Daniel
Many thanks for reporting this issue. Although the VCF files produced using the full NGSEP pipeline are completely compliant with the v4.2 specification, it is true that modules that read and write VCF files (namely annotation, filtering and imputation) do not double check if the header has the particular INFO and FORMAT tags that we created for NGSEP. Hence, output VCFs could have the issue that you mention if the input VCF was produced with some other tool / pipeline. We will have this behavior improved in future versions. In the mean time, please find attached a header that has the union of the INFO and format tags produced by GATK and NGSEP. You can just replace the header of your filtered VCF following these instructions:
cat headerGATKNGSEP.vcf > filtered_fixed.vcf
grep "##contig" filtered.vcf >> filtered_fixed.vcf
awk '{if(substr($1,1,2)!="##")print $0}' filtered.vcf >> filtered_fixed.vcf
I am not sure if you plan to use bcftools or vcftools for further processing. In either case, it could be the case that NGSEP already has the analysis that you need to perform and then you can just keep using the filtered VCF file without adding the missing headers. In any case, let me know if you have any trouble following the instructions.
Best regards
Jorge
Hi Jorge,
Thank you very much for your help. I was able to properly modify the header in the VCF and now it's working. Thanks for sharing the GATK-NGSEP header file.
I'm trying to split multi-allelic variants into bi-allelic sites following this blog post, recommended in the PLINK data management page.
Best regards,
Daniel
Hi Daniel
No problem. About the blog post, I am still wondering which is the exact meaning of the expression "best practices" that some people just use to describe their pipelines. If you want to use Plink for LD-decay estimations or even for GWAS, a simpler solution is just to use the FilterVCF command keeping only biallelic SNVs (-s option in the command line). Because SNVs are the most frequent and stable type of variation, the loss of power produced by this filter should be minimal. If you want to keep biallelic indels, we still do not have a explicit filter for all biallelic variants but that can be done with a simple awk:
awk '{if(substr($1,1,1)=="#" || index($5,",")<=0) print $0}' file.vcf > file_onlyBiallelic.vcf
In general it does not sound right to me to break a multiallelic variant into two separate biallelic variants spanning the same location because the variant is actually multiallelic in the population. Depending of the exact analysis that you would like to make on multiallelic sites, you may want to look for other software that could be able to process this type of variation instead of trying to artificially break multiallelic variants.
In any case, if you want to follow the instructions of the blog post, I guess you can now do it with the VCF with the adjusted headers. Let us know if you have any further question.
Best regards
Jorge
Hi Jorge,
Thanks for your suggestions. I'll give it a try and see what happen, but I agree with you in the sense that splitting multi-allelic variants may not be the best solution to use them for GWAS. There should be a specific tool for that, I'll search if there's any.
Best,
Daniel