Download the most recent version of the ART-DeCo program from the code page and extract the gzipped archive:
tar -zxvf ART-DeCo.tar.gz
It contains the following:
1. The list of SNPs of interest to analyse, in scripts/FILES/PolymorphismsInformation.tsv
2. In the directory "DATA", example SNP allele coverage reports for 8 samples (obtained from GATK DepthOfCoverage with the base count option)
3. A file "CoveragePathExample.txt" containing the sample IDs and the paths to the samples' SNP coverage reports
4. A directory "scripts" with the ART-DeCo scripts
5. A readme file
Before running ART-DeCo, users have to select a list of polymorphisms to analyse. This list should contain polymorphisms covered by the design being analysed. If users want to add relevant variants to a design, we recommend keeping variants observed in 50% of any population, as was done for the 479 polymorphisms extracted from the 1000 Genomes annotation. Only regions with good mappability (Duke Uniqueness 35) and positions outside short tandem repeats (UCSC repeat tracks) have been kept.
Since some polymorphisms are likely to generate false positive contaminations, we recommend filtering them:
- Exclude SNPs with recurrent unexpected allelic ratios (between background noise and the heterozygosity threshold) by plotting an AAR histogram of each polymorphism across all samples
- Exclude SNPs close to homopolymer stretches
- Exclude SNPs in paralogous genes or repeated sequences
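As a complement to the histograms suggested in the first filter above, recurrence can also be counted. The following is a minimal sketch, not part of ART-DeCo: the function name, the 1% noise level, the 25% / 75% thresholds used as "unexpected" bounds and the `min_fraction` cut-off are all assumptions made for illustration.

```python
def recurrent_unexpected_snps(aar_by_snp, noise=1.0, het_low=25.0,
                              het_high=75.0, min_fraction=0.25):
    """Return SNP IDs whose AAR is unexpected in at least min_fraction of samples.

    aar_by_snp maps a SNP ID to the list of its AAR values (in percent),
    one value per sample. All thresholds are illustrative assumptions.
    """
    flagged = []
    for snp_id, aars in aar_by_snp.items():
        # "Unexpected" = between noise and the heterozygous zone, on either side.
        unexpected = [a for a in aars
                      if noise < a < het_low or het_high < a < 100.0 - noise]
        if aars and len(unexpected) / len(aars) >= min_fraction:
            flagged.append(snp_id)
    return flagged


# Hypothetical AAR values for two SNPs across four samples.
example = {
    "polymA": [0.1, 49.0, 99.9, 0.3],
    "polymB": [12.0, 14.5, 0.2, 88.0],
}
print(recurrent_unexpected_snps(example))
```

SNPs returned by such a count are candidates for exclusion; a visual check of their histograms remains advisable.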
Please run GATK DepthOfCoverage with the -baseCounts parameter on your list of polymorphisms (bed file) to generate the coverage report files required by ART-DeCo.
To run ART-DeCo, call the Perl script as follows:
perl scripts/ART-DeCo.pl
The -h option prints the following usage:
Usage:
    ART-DeCo (Allelic Ratio based Tool for Detection of Contamination)
    version: 1.1
    ART-DeCo.pl -a <path for information about SNPs file> -i <path for sample ID and SNP allele coverage report file> -o <output file name> [options]
    Input options:
        -a PATH    File containing information about SNPs of interest to analyse
        -i PATH    File containing sample ID and SNP allele coverage report file
        (optional)
        -c INT     Minimal number of reads required to analyse a SNP [200]
        -f INT     Maximum allele frequency noise allowed [1]
    Output options:
        -o STR     Output file name
        (optional)
        -d PATH    Output directory
There are three mandatory inputs for ART-DeCo.
Input | Option | Description and example |
---|---|---|
File containing SNP information | -a | File with chromosome, position, reference base, alternative base and SNP ID |
File containing sample IDs and paths to SNP allele coverage reports | -i | 2-column file: sample ID and path to the sample's allele coverage report (see below for a report example) |
output file name | -o | any string as run ID for example |
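To make the three mandatory inputs concrete, here is a small sketch that formats the 2-column sample list and assembles the command line. The helper names, the example paths and the run ID are hypothetical, and the tab used as column separator is an assumption; check CoveragePathExample.txt for the exact layout expected by ART-DeCo.

```python
import shlex


def format_sample_list(reports):
    """Return the 2-column text: one 'sample_id<TAB>report_path' line per sample.

    The tab separator is an assumption, not confirmed by the ART-DeCo docs.
    """
    return "".join(f"{sample_id}\t{path}\n" for sample_id, path in reports.items())


def build_command(snp_info, sample_list, run_id):
    """Assemble the ART-DeCo call with its three mandatory options (-a, -i, -o)."""
    return ["perl", "scripts/ART-DeCo.pl",
            "-a", snp_info, "-i", sample_list, "-o", run_id]


# Hypothetical sample IDs and report paths.
reports = {"Sample1": "DATA/Sample1.txt", "Sample2": "DATA/Sample2.txt"}
print(format_sample_list(reports), end="")
print(shlex.join(build_command("scripts/FILES/PolymorphismsInformation.tsv",
                               "MyCoveragePaths.txt", "myRun")))
```

The resulting argument list can be passed to `subprocess.run` once the sample list has been written to disk.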
Note: the SNP information file provided as an example, scripts/FILES/PolymorphismsInformation.tsv, must contain the chromosome, position, reference base, alternative base and any information useful to annotate the variant (an rsID might be used, for example):
chr1 position1 A T polym1
chr2 position2 G A polym2
chr3 position3 C A polym3
Example of a SNP allele coverage report (obtained using GATK DepthOfCoverage with the -baseCounts option):
chr1:position1 A:0 C:496 G:0 T:946
chr2:position2 A:154 C:0 G:169 T:0
chr3:position3 A:598 C:0 G:0 T:0
Note: the three lines above are an example with absolute positions. Any output produced by GATK DepthOfCoverage with the -baseCounts option matches the expected format.
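The sketch below shows how one report line in the simplified layout above could be parsed and turned into an alternative allele ratio (AAR). It is an illustration, not ART-DeCo's code: the function names are hypothetical, and the AAR is assumed here to be alt / (ref + alt) expressed in percent, which may differ from the exact formula used by the tool.

```python
def parse_coverage_line(line):
    """Parse 'chr:pos A:n C:n G:n T:n' into (locus, {base: count})."""
    fields = line.split()
    locus = fields[0]
    counts = {}
    for field in fields[1:]:
        base, count = field.split(":")
        counts[base] = int(count)
    return locus, counts


def alternative_allele_ratio(counts, ref, alt):
    """AAR in percent, assumed to be alt / (ref + alt); None if no reads."""
    total = counts.get(ref, 0) + counts.get(alt, 0)
    if total == 0:
        return None
    return 100.0 * counts.get(alt, 0) / total


# Second example line above, with ref G and alt A taken from the SNP file example.
locus, counts = parse_coverage_line("chr2:position2 A:154 C:0 G:169 T:0")
aar = alternative_allele_ratio(counts, ref="G", alt="A")
print(locus, round(aar, 2))
```

With 154 alternative reads out of 323 reference-plus-alternative reads, this position falls in the heterozygous range.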
There are several options you should adjust to your run and design characteristics.
Input | Option | Recommendation |
---|---|---|
Minimal number of reads required to analyse a SNP | -c | Please adjust it to your sample depth and the number of SNPs in your design. |
Maximum allele frequency noise | -f | Please adjust it to your sequencer's expected noise |
Regarding the minimal number of reads required to analyse a SNP (option -c), please note that:
A high value leads to a stringent SNP selection, with fewer positions analysed but meaningful statistical reports.
A low value leads to a permissive SNP selection, with more positions analysed but less powerful statistical reports.
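The effect of the -c threshold can be illustrated with a short sketch. It assumes that the depth of a SNP is the sum of its four base counts, which may not be exactly how ART-DeCo counts reads; the function name is hypothetical and 200 is the default shown in the usage.

```python
def filter_by_depth(snp_counts, min_reads=200):
    """Keep SNPs whose summed base counts reach min_reads.

    Depth is assumed to be the sum of all base counts at the position.
    """
    return {
        locus: counts
        for locus, counts in snp_counts.items()
        if sum(counts.values()) >= min_reads
    }


# Base counts from the coverage report example.
snp_counts = {
    "chr1:position1": {"A": 0, "C": 496, "G": 0, "T": 946},
    "chr2:position2": {"A": 154, "C": 0, "G": 169, "T": 0},
    "chr3:position3": {"A": 598, "C": 0, "G": 0, "T": 0},
}
print(sorted(filter_by_depth(snp_counts)))                 # default threshold
print(sorted(filter_by_depth(snp_counts, min_reads=500)))  # more stringent
```

Raising the threshold from 200 to 500 drops the second position (323 reads), which shows the trade-off between stringency and the number of positions analysed.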
The file ART-DeCo.tsv reports the worst case scenario (WCS), which aims to highlight contamination.
Sample | WCS percentage of contamination | Contaminant | p-value | Percentage of contamination by the contaminant |
---|---|---|---|---|
Sample1 | 49.52 | | | |
Sample2 | 0.37 | | | |
Sample3 | 46.65 | Sample2 | 0.00158 | 45.44 |
Sample4 | 45.37 | Sample2 | 8.97e-12 | 22.55 |
Sample5 | 25.21 | Sample2 | 7.31e-13 | 12.04 |
Sample6 | 13.72 | Sample2 | 7.31e-13 | 5.65 |
Sample7 | 6.58 | Sample2 | 1.1e-05 | 2.66 |
Sample8 | 0.44 | Sample2 | | |
The example provided shows Sample1 with a high WCS but no contaminant detected. Given such a high rate, this could be due to an allograft, index issues (several samples with identical indexes) or a contaminant not included in the run. A false positive prediction is unlikely at such a high rate.
For a closer look at the predictions, the file ART-DeCo_Long_Report.tsv helps to distinguish real contaminations.
Sample | Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 | Sample7 | Sample8 |
---|---|---|---|---|---|---|---|---|
AAR mean variants < 25% | 12.33 | 0.02 | 7.75 | 6.58 | 3.76 | 1.89 | 0.83 | 0.03 |
AAR sd variants < 25% | 10.77 | 0.05 | 10.59 | 7.93 | 4.3 | 2.26 | 1.02 | 0.07 |
number of variants < 25% | 28 | 42 | 31 | 37 | 39 | 39 | 39 | 33 |
AAR mean variants > 75% | 85.7 | 99.87 | 77.68 | 79.85 | 88.01 | 95.05 | 96.92 | 99.88 |
AAR sd variants > 75% | 12.29 | 0.09 | NA | 5.54 | 3.08 | 0.58 | 0.81 | 0.14 |
number of variants > 75% | 3 | 9 | 1 | 4 | 6 | 6 | 6 | 6 |
AAR mean variants in 25-75% | 44.82 | 47.37 | 37.63 | 40.6 | 42.16 | 46.07 | 46.43 | 48.03 |
AAR sd variants in 25-75 % | 14.86 | 3.1 | 12.29 | 12.95 | 4.95 | 3.56 | 3.89 | 3.66 |
number of variants in 25-75% | 32 | 15 | 34 | 25 | 21 | 21 | 21 | 27 |
number of most informative variants (30-70% of run allelic population) | 12 | 8 | 25 | 26 | 24 | 24 | 24 | 12 |
WCS percentage of contamination | 49.52 | 0.37 | 46.65 | 45.37 | 25.21 | 13.72 | 6.58 | 0.44 |
Contaminant | | | Sample2 | Sample2 | Sample2 | Sample2 | Sample2 | |
p-value | | | 0.00158 | 8.97e-12 | 7.31e-13 | 7.31e-13 | 1.1e-05 | |
% homozygous variants compatible with contamination | | | 90.62 | 100 | 100 | 100 | 86.67 | |
# homozygous variants in contaminated covered in contaminant | | | 32 | 41 | 45 | 45 | 45 | |
# covered variants | 64 | 66 | 66 | 66 | 66 | 66 | 66 | 66 |
Percentage of contamination by the contaminant | | | 45.44 | 22.55 | 12.04 | 5.65 | 2.66 | |
sd for percentage of contamination by the contaminant | | | 3.75 | 9.05 | 6.53 | 2.53 | 1.92 | |
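The first nine rows of this report summarise AAR values in three bins. The sketch below shows one way such a summary could be computed for a single sample. It is not ART-DeCo's implementation: the function name is hypothetical, the handling of values exactly at 25% or 75% is an assumption, and the sample standard deviation is used because the report shows NA when a bin holds a single variant.

```python
from statistics import mean, stdev


def summarise_bins(aars, low=25.0, high=75.0):
    """Return count, mean and sd of AAR values below low, above high and in between.

    Values equal to low or high are placed in the middle bin (an assumption).
    sd is None when a bin has fewer than two values.
    """
    bins = {"< 25%": [], "25-75%": [], "> 75%": []}
    for aar in aars:
        if aar < low:
            bins["< 25%"].append(aar)
        elif aar > high:
            bins["> 75%"].append(aar)
        else:
            bins["25-75%"].append(aar)
    summary = {}
    for name, values in bins.items():
        summary[name] = {
            "n": len(values),
            "mean": round(mean(values), 2) if values else None,
            "sd": round(stdev(values), 2) if len(values) > 1 else None,
        }
    return summary


# Hypothetical AAR values (in percent) for one sample.
for name, stats in summarise_bins([0.0, 2.0, 50.0, 80.0]).items():
    print(name, stats)
```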
A high mean AAR for variants < 25% and a low mean AAR for variants > 75% might help to highlight a contamination.
A high AAR sd might help to highlight a contamination.
A high number of homozygous variants / a low number of heterozygous variants might help to highlight a contamination.
A high % of homozygous variants compatible with contamination might help to highlight a contamination.
The number of variants observed in 30-70% of the run population with AAR 25% in the sample helps users make sure that enough informative variants were analysed.
For each contaminated sample whose contaminant is in the run, homozygous variants are analysed. Some contaminant AARs would be inconsistent with contamination (e.g. if the contaminated sample is at 2%, the contaminant MUST be higher, and not at 0%). Based on this logic, we report the number of homozygous variants in the contaminated sample that are covered in the contaminant (the ones that can be investigated) and the percentage of homozygous variants compatible with contamination. The closer this percentage is to 100%, the more evident the contamination.
Note that a 50% contamination might lower this percentage, since some heterozygous variants might cross the 25% / 75% heterozygous thresholds. Nevertheless, the percentage remains close to 100%.
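The compatibility logic described above can be sketched as follows. This is one possible reading of the rule, not the tool's actual code: the function names and the 1% noise level are assumptions, and homozygous variants that stay within noise of 0% or 100% are simply counted as compatible because they show no deviation to explain.

```python
def is_compatible(contaminated_aar, contaminant_aar, noise=1.0):
    """Tell whether the contaminant could explain the contaminated sample's AAR.

    Only defined for variants that look homozygous (< 25% or > 75%)
    in the contaminated sample.
    """
    if contaminated_aar < 25.0:
        deviates = contaminated_aar > noise
        # A raised frequency needs a contaminant with a higher frequency.
        return (not deviates) or contaminant_aar > contaminated_aar
    if contaminated_aar > 75.0:
        deviates = contaminated_aar < 100.0 - noise
        # A lowered frequency needs a contaminant with a lower frequency.
        return (not deviates) or contaminant_aar < contaminated_aar
    raise ValueError("variant is not homozygous in the contaminated sample")


def percent_compatible(pairs, noise=1.0):
    """Percentage of homozygous variants compatible with contamination.

    pairs holds (contaminated_aar, contaminant_aar) tuples; heterozygous
    variants of the contaminated sample are ignored.
    """
    homozygous = [(c, k) for c, k in pairs if c < 25.0 or c > 75.0]
    if not homozygous:
        return None
    compatible = sum(is_compatible(c, k, noise) for c, k in homozygous)
    return round(100.0 * compatible / len(homozygous), 2)


# Hypothetical (contaminated, contaminant) AAR pairs in percent.
pairs = [(2.0, 48.0), (2.0, 0.0), (97.0, 50.0), (0.2, 0.0), (50.0, 10.0)]
print(percent_compatible(pairs))
```

In this example the second pair is the inconsistent case quoted in the text: the contaminated sample is at 2% while the contaminant is at 0%.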
For low WCS values, a look at the sample's alternative allele frequencies can help to distinguish between false positive predictions (due to run noise / index hopping) and real unexpected distributions. The file Frequencies.tsv is meant for plotting SNP distributions, and users should look at it.
This file highlights unexpected frequencies (matching a contamination prediction), such as these examples for Sample 1:
SNP id | Alternative allele frequency |
---|---|
polym86 | 13.94 |
polym60 | 15.22 |
polym82 | 18.26 |
polym90 | 76.53 |
polym106 | 80.89 |
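Besides plotting, such frequencies can be listed programmatically. The sketch below flags frequencies lying between the noise level and the 25% / 75% heterozygous thresholds. The function name and the 1% noise level are assumptions, and the input is a plain dictionary rather than the actual layout of Frequencies.tsv, which is not described here.

```python
def unexpected_frequencies(freqs, noise=1.0, het_low=25.0, het_high=75.0):
    """Return (snp_id, frequency, side) tuples for unexpected frequencies.

    A frequency is unexpected when it lies between noise and het_low ("low")
    or between het_high and 100 - noise ("high"). Sorted by frequency.
    """
    flagged = []
    for snp_id, freq in freqs.items():
        if noise < freq < het_low:
            flagged.append((snp_id, freq, "low"))
        elif het_high < freq < 100.0 - noise:
            flagged.append((snp_id, freq, "high"))
    return sorted(flagged, key=lambda item: item[1])


# Sample 1 values from the table above, plus three hypothetical expected ones.
sample1 = {
    "polym86": 13.94, "polym60": 15.22, "polym82": 18.26,
    "polym90": 76.53, "polym106": 80.89,
    "polymX": 0.3, "polymY": 48.0, "polymZ": 99.8,
}
for snp_id, freq, side in unexpected_frequencies(sample1):
    print(snp_id, freq, side)
```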
If a contaminant is predicted, the same kind of plot helps: every unexpected frequency in the contaminated sample should be easily explained by the contaminant frequencies, as below:
SNP id | Sample 2: contaminant | Sample 3: contaminated | Comments |
---|---|---|---|
polym60 | 0 | 22.9 | contaminated sample was likely heterozygous and contamination led to a frequency decrease |
polym51 | 0 | 23.2 | contaminated sample was likely heterozygous and contamination led to a frequency decrease |
polym15 | 46.6 | 23.3 | contaminated sample was likely homozygous reference and contamination led to a frequency increase |
polym93 | 51.6 | 23.4 | contaminated sample was likely homozygous reference and contamination led to a frequency increase |
polym82 | 45.2 | 77.7 | contaminated sample was likely homozygous alternative and contamination led to a frequency decrease |
Following this logic, no SNP should be around 0% in one sample and around 100% in the other if they are contaminant / contaminated.
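This last rule lends itself to a quick sanity check on a predicted contaminant / contaminated pair. The sketch below is illustrative only: the function name and the 1% tolerance around 0% and 100% are assumptions.

```python
def opposite_homozygotes(freqs_a, freqs_b, noise=1.0):
    """Return SNP IDs near 0% in one sample and near 100% in the other.

    freqs_a and freqs_b map SNP IDs to alternative allele frequencies in
    percent; only SNPs present in both samples are compared.
    """
    conflicts = []
    for snp_id in sorted(freqs_a.keys() & freqs_b.keys()):
        a, b = freqs_a[snp_id], freqs_b[snp_id]
        if (a <= noise and b >= 100.0 - noise) or (b <= noise and a >= 100.0 - noise):
            conflicts.append(snp_id)
    return conflicts


# Frequencies from the table above, plus one hypothetical conflicting SNP.
contaminant = {"polym60": 0.0, "polym15": 46.6, "polym82": 45.2, "polymQ": 0.1}
contaminated = {"polym60": 22.9, "polym15": 23.3, "polym82": 77.7, "polymQ": 99.9}
print(opposite_homozygotes(contaminant, contaminated))
```

Any SNP returned by this check argues against the predicted pair, or points to a problematic polymorphism.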