From: David N. <Dav...@hc...> - 2013-05-17 18:49:21
|
OK Folks, Here's a new USeq app for comparing VCF files to a key of trusted calls. To use it: Download the latest NIST key set (a vcf file of het and homo non ref calls and a bed file of interrogated regions) from Justin Zook <jus...@ni...<mailto:jus...@ni...>> or https://bioserver.hci.utah.edu/gnomex/gnomexFlex.jsp?analysisNumber=A1489 (this isn't necessarily the most recent). Download the USeq package from SourceForge http://sourceforge.net/projects/useq/ . Run the VCFComparator. [u0028003@hci-alta NIST]$ cd /Repository/AnalysisData/2013/A1489/NIST/5thCallSet_2.14/ [u0028003@hci-alta 5thCallSet_2.14]$ ls ExampleCalls Key VCFCompTest [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator ************************************************************************************** ** VCF Comparator : April 2013 ** ************************************************************************************** Compares test vcf file(s) against a gold standard key of trusted vcf calls. Only calls that fall in the common interrogated regions are compared. WARNING tabix gzipped files often fail to parse correctly with java. Seeing odd error messages? Try uncompressing. Required Options: -a VCF file for the key dataset (xxx.vcf(.gz/.zip OK)). -b Bed file of interrogated regions for the key dataset (xxx.bed(.gz/.zip OK)). -c VCF file for the test dataset (xxx.vcf(.gz/.zip OK)). May also provide a directory containing xxx.vcf(.gz/.zip OK) files to compare. -d Bed file of interrogated regions for the test dataset (xxx.bed(.gz/.zip OK)). Optional Options: -g Require the genotype to match, defaults to scoring a match when the alternate allele is present. -v Use VQSLOD score as ranking statistic in place of the QUAL score. -s Only compare SNPs, defaults to all. -n Only compare non SNPs, defaults to all. -p Provide a full path directory for saving the parsed data. Defaults to not saving. -e Exclude test and key records whose FILTER field is not . or PASS. Defaults to scoring all. Example: java -Xmx10G -jar pathTo/USeq/Apps/VCFComparator -a /NIST/NA12878/key.vcf -b /NIST/NA12878/regions.bed.gz -c /EdgeBio/Exome/testHaploCaller.vcf.zip -d /EdgeBio/Exome/NimbleGenExomeV3.bed -g -v -s -e -p /CompRes/ ************************************************************************************** [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator -a Key/nist2.14.vcf.gz -b Key/nist2.14.bed.gz -c ExampleCalls/ -d Key/nimExomeV3.bed.gz -g -s -e -p VCFCompTest This generates a variety of files including matching and non matching vcf files, spreadsheets of stats for each test set, and a combine file listing just the FDR and TPR's for each dataset over a range of thresholds enabling the generation of TPR vs FDR pseudo ROC curves. [cid:48F...@hc...] Basically the method/ condition that returns the largest TPRs over a range of relevant FDRs for your experiment, say 1-5% is the "best". How much better can be measured off the graph. In the example above, sequencing 3 exomes per lane instead of 8 exomes increases the recovery of the key SNP variants by 5% ( at an FDR of 5%). Is that 5% improvement worth 2.7x the sequencing cost? Possibly. One issue I am a little worried about is the low FDR values for unfiltered variants call files with these datasets. Setting GATK's strand_call and strand_emit confidence to 0 doesn't produce a list of unfiltered SNPs with > 7 %FDR. This seems wrong given our experience with variant analysis of non NA12878 exomes. Any suggestions/ debugging tips would be most appreciated. -cheers, David David Austin Nix, PhD Huntsman Cancer Institute Dept. OncSci University of Utah Bioinformatics Shared Resource HCI 3165 (801) 587-4611 dav...@hc...<mailto:dav...@hc...> |