From: Barry M. <bar...@ge...> - 2013-05-18 17:06:18
|
Nice David - thanks. B On May 17, 2013, at 12:47 PM, David Nix wrote: > OK Folks, > > Here's a new USeq app for comparing VCF files to a key of trusted calls. To use it: > > Download the latest NIST key set (a vcf file of het and homo non ref calls and a bed file of interrogated regions) from Justin Zook <jus...@ni...<mailto:jus...@ni...>> or https://bioserver.hci.utah.edu/gnomex/gnomexFlex.jsp?analysisNumber=A1489 (this isn't necessarily the most recent). > > Download the USeq package from SourceForge http://sourceforge.net/projects/useq/ . > > Run the VCFComparator. > > [u0028003@hci-alta NIST]$ cd /Repository/AnalysisData/2013/A1489/NIST/5thCallSet_2.14/ > [u0028003@hci-alta 5thCallSet_2.14]$ ls > ExampleCalls Key VCFCompTest > [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator > > ************************************************************************************** > ** VCF Comparator : April 2013 ** > ************************************************************************************** > Compares test vcf file(s) against a gold standard key of trusted vcf calls. Only calls > that fall in the common interrogated regions are compared. WARNING tabix gzipped files > often fail to parse correctly with java. Seeing odd error messages? Try uncompressing. > > Required Options: > -a VCF file for the key dataset (xxx.vcf(.gz/.zip OK)). > -b Bed file of interrogated regions for the key dataset (xxx.bed(.gz/.zip OK)). > -c VCF file for the test dataset (xxx.vcf(.gz/.zip OK)). May also provide a directory > containing xxx.vcf(.gz/.zip OK) files to compare. > -d Bed file of interrogated regions for the test dataset (xxx.bed(.gz/.zip OK)). > > Optional Options: > -g Require the genotype to match, defaults to scoring a match when the alternate > allele is present. > -v Use VQSLOD score as ranking statistic in place of the QUAL score. > -s Only compare SNPs, defaults to all. > -n Only compare non SNPs, defaults to all. > -p Provide a full path directory for saving the parsed data. Defaults to not saving. > -e Exclude test and key records whose FILTER field is not . or PASS. Defaults to > scoring all. > > Example: java -Xmx10G -jar pathTo/USeq/Apps/VCFComparator -a /NIST/NA12878/key.vcf > -b /NIST/NA12878/regions.bed.gz -c /EdgeBio/Exome/testHaploCaller.vcf.zip > -d /EdgeBio/Exome/NimbleGenExomeV3.bed -g -v -s -e -p /CompRes/ > > ************************************************************************************** > > [u0028003@hci-alta 5thCallSet_2.14]$ java -jar -Xmx10G /home/BioApps/USeq/Apps/VCFComparator -a Key/nist2.14.vcf.gz -b Key/nist2.14.bed.gz -c ExampleCalls/ -d Key/nimExomeV3.bed.gz -g -s -e -p VCFCompTest > > This generates a variety of files including matching and non matching vcf files, spreadsheets of stats for each test set, and a combine file listing just the FDR and TPR's for each dataset over a range of thresholds enabling the generation of TPR vs FDR pseudo ROC curves. > > [cid:48F...@hc...] > > Basically the method/ condition that returns the largest TPRs over a range of relevant FDRs for your experiment, say 1-5% is the "best". How much better can be measured off the graph. In the example above, sequencing 3 exomes per lane instead of 8 exomes increases the recovery of the key SNP variants by 5% ( at an FDR of 5%). Is that 5% improvement worth 2.7x the sequencing cost? Possibly. > > One issue I am a little worried about is the low FDR values for unfiltered variants call files with these datasets. Setting GATK's strand_call and strand_emit confidence to 0 doesn't produce a list of unfiltered SNPs with > 7 %FDR. This seems wrong given our experience with variant analysis of non NA12878 exomes. Any suggestions/ debugging tips would be most appreciated. > > -cheers, David > > > > David Austin Nix, PhD > Huntsman Cancer Institute > Dept. OncSci University of Utah > Bioinformatics Shared Resource > HCI 3165 (801) 587-4611 > dav...@hc...<mailto:dav...@hc...> > > > > <PastedGraphic-2.tiff> Barry Moore Research Scientist Dept. of Human Genetics University of Utah Salt Lake City, UT 84112 -------------------------------------------- (801) 585-3543 |