To define a set of SNPs common to all platforms in use at QCMG, we selected all single-base, dbSNP-derived SNPs included on the OMNI-1Mquad genotyping array (~1.4 million SNPs). These SNPs are shared with other members of the Illumina OMNI array family and are also covered by whole genome data and by some regions of exome data and targeted gene panels. We then determine the nucleotide frequencies at each of these SNPs from array intensities or BAM read counts.
Genotyping array intensities are transformed into relative nucleotide counts using the following formulas:
T = ⌊C ⋅ e^LRR⌋
A = ⌊BAF ⋅ T⌋
R = T - A
where:
T = total counts
A = alternate allele count
R = reference allele count
C = pseudocount (20)
LRR = log R ratio
BAF = B-allele frequency
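The transformation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the qsignature implementation: the function name array_to_counts is hypothetical, and the exponential is taken as the natural exponential because the formula is written with e.

```python
import math

PSEUDOCOUNT = 20  # C in the formulas above


def array_to_counts(lrr, baf, pseudocount=PSEUDOCOUNT):
    """Return (total, alternate, reference) relative counts for one SNP.

    lrr is the log R ratio and baf the B-allele frequency reported by
    the genotyping array. The function name is illustrative only.
    """
    total = math.floor(pseudocount * math.exp(lrr))  # T = floor(C * e^LRR)
    alternate = math.floor(baf * total)              # A = floor(BAF * T)
    reference = total - alternate                    # R = T - A
    return total, alternate, reference


# A heterozygous SNP at normal copy number (LRR = 0, BAF = 0.5)
print(array_to_counts(0.0, 0.5))
```

With LRR = 0 the total equals the pseudocount, so a heterozygous SNP yields 10 alternate and 10 reference counts out of 20.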
To calculate nucleotide frequencies from BAM read counts, we perform a pileup at each of the selected SNP positions and report the total count of each nucleotide. A base is counted only if its base quality is at least 10 and its read has a mapping quality of at least 10, has passed the vendor quality check, is the primary alignment, and is not marked as a duplicate.
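The read and base filters described above can be illustrated with a small Python sketch. It is not the qsignature code: the function names and the tuple layout of the input are hypothetical, and it assumes that "primary alignment" means the SAM secondary-alignment flag (0x100) is unset. How supplementary or unmapped reads are treated is not stated here, so the sketch does not filter on them.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_SECONDARY = 0x100  # not the primary alignment
FLAG_QC_FAIL = 0x200    # failed vendor/platform quality checks
FLAG_DUPLICATE = 0x400  # PCR or optical duplicate

MIN_MAPPING_QUALITY = 10
MIN_BASE_QUALITY = 10


def read_passes(flag, mapping_quality):
    """True if the read meets the read-level filters described above."""
    rejected = FLAG_SECONDARY | FLAG_QC_FAIL | FLAG_DUPLICATE
    return mapping_quality >= MIN_MAPPING_QUALITY and (flag & rejected) == 0


def count_bases(observations):
    """Count nucleotides at one SNP position.

    observations is an iterable of (base, base_quality, mapping_quality,
    flag) tuples, one per read overlapping the position. This layout is
    an assumption made for the example.
    """
    counts = {base: 0 for base in "ACGTN"}
    for base, base_quality, mapping_quality, flag in observations:
        if base_quality < MIN_BASE_QUALITY:
            continue  # base quality below threshold
        if not read_passes(flag, mapping_quality):
            continue  # low mapping quality, QC fail, secondary or duplicate
        counts[base] = counts.get(base, 0) + 1
    counts["TOTAL"] = sum(counts.values())
    return counts
```

In practice the observations would come from a pileup library rather than hand-built tuples; the sketch only shows which reads and bases contribute to the counts.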
On a single core, VCF generation takes about 20 minutes to report nucleotide counts from 500 million reads, and less than a minute to estimate counts from a genotyping array. This step needs to be performed only once per file.
Example command:
java -cp qsignature.jar org.qcmg.sig.SignatureGenerator \
-log $BAM.qsig.log \
-i qsignature_positions.txt \
-i $BAM \
-i Illumina_arrays_design.txt
Example output:
##fileformat=VCFv4.0
##patient_id=ABCD_1234
##library=Library_EXT20140505_C
##bam=/bamFile.bam
##snp_file=/qsignature_positions.txt
##filter_q_score=10
##filter_match_qual=10
##FILTER=<ID=LowQual,Description="REQUIRED: QUAL < 50.0">
##INFO=<ID=FULLCOV,Number=.,Type=String,Description="all bases at position">
##INFO=<ID=NOVELCOV,Number=.,Type=String,Description="bases at position from reads with novel starts">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 89788 cnvi0159992 G . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 90900 cnvi0135911 G . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 91152 cnvi0111730 A . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 91467 cnvi0132916 G . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 91472 rs6680825 C . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 91538 cnvi0158801 T . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 91719 cnvi0131353 C . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 98222 cnvi0147298 C . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 99236 cnvi0131297 T . . FULLCOV=A:0,C:0,G:0,T:2,N:0,TOTAL:2;NOVELCOV=A:0,C:0,G:0,T:2,N:0,TOTAL:2
chr1 100622 cnvi0147523 G . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 101095 cnvi0133071 T . . FULLCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0;NOVELCOV=A:0,C:0,G:0,T:0,N:0,TOTAL:0
chr1 102954 cnvi0120648 T . . FULLCOV=A:0,C:0,G:0,T:2,N:0,TOTAL:2;NOVELCOV=A:0,C:0,G:0,T:2,N:0,TOTAL:2
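For downstream use, the per-nucleotide counts in the INFO column can be read back with a short parser. The following Python sketch assumes the INFO layout shown in the example output (semicolon-separated KEY=value entries, each value a comma-separated list of NAME:count pairs). The function name parse_coverage is hypothetical and is not part of qsignature.

```python
def parse_coverage(info):
    """Parse an INFO string such as 'FULLCOV=A:0,...;NOVELCOV=A:0,...'.

    Returns a dict mapping each INFO key (e.g. 'FULLCOV') to a dict of
    integer counts keyed by nucleotide name or 'TOTAL'.
    """
    coverage = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        counts = {}
        for pair in value.split(","):
            name, count = pair.split(":")
            counts[name] = int(count)
        coverage[key] = counts
    return coverage


# INFO field taken from the chr1:99236 record in the example output
info = ("FULLCOV=A:0,C:0,G:0,T:2,N:0,TOTAL:2;"
        "NOVELCOV=A:0,C:0,G:0,T:2,N:0,TOTAL:2")
coverage = parse_coverage(info)
print(coverage["FULLCOV"]["T"], coverage["FULLCOV"]["TOTAL"])
```

Dividing a nucleotide count by TOTAL gives the nucleotide frequency at that SNP. Positions with TOTAL of 0 have no usable coverage and need to be skipped before dividing.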