ART-DeCo Wiki

ART-DeCo uses polymorphism allelic ratios to predict contamination

Brought to you by: vber

ART-DeCo

Installation of ART-DeCo

Download the most recent version of the ART-DeCo program from the code page and extract the gzipped archive:
tar -zxvf ART-DeCo.tar.gz

The archive contains the following:
1. The list of SNPs of interest to analyse, in scripts/FILES/PolymorphismsInformation.tsv
2. In the directory "DATA", example SNP allele coverage reports for 8 samples (obtained from GATK DepthOfCoverage with the base count option)
3. A file "CoveragePathExample.txt" containing the sample IDs and the paths to the samples' SNP coverage reports
4. A directory "scripts" with the ART-DeCo scripts
5. A readme file

Polymorphism selection & base count report

Before running ART-DeCo, users have to select a list of polymorphisms to analyse. This list should contain polymorphisms covered by the design being analysed. If users want to add relevant variants to a design, we recommend keeping variants observed in 50% of any population, like the 479 polymorphisms extracted from the 1000 Genomes annotation. Only regions of good mappability (Duke Uniqueness 35) and positions outside short tandem repeats (UCSC repeat tracks) have been kept.

Since some polymorphisms are likely to generate false positive contaminations, we recommend filtering them:
- Exclude SNPs with recurrent unexpected allelic ratios (between the background noise and the heterozygosity threshold) by plotting an AAR histogram for each polymorphism across all samples
- Exclude SNPs close to homopolymer stretches
- Exclude SNPs in paralogous genes or repeated sequences
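The first filter above can be approximated without plotting. The following is a minimal sketch, not part of ART-DeCo: it assumes the AAR values are percentages, takes 1% as the noise level (the default of -f) and 25% / 75% as heterozygosity thresholds (the bins used in the long report), and flags a SNP when too many samples fall in between. The function names, the mirrored upper window and the `max_fraction` cutoff are hypothetical.

```python
def is_unexpected(aar, noise=1.0, het_low=25.0, het_high=75.0):
    """True when an AAR (in %) lies between the noise level and a heterozygosity threshold."""
    # The upper window (het_high .. 100 - noise) mirrors the lower one; this is an assumption.
    return noise < aar < het_low or het_high < aar < 100.0 - noise


def recurrent_outliers(aar_by_snp, max_fraction=0.2):
    """Return the SNP IDs whose share of unexpected AAR values exceeds max_fraction."""
    flagged = []
    for snp_id, values in aar_by_snp.items():
        if not values:
            continue  # SNP not covered in any sample
        share = sum(is_unexpected(v) for v in values) / len(values)
        if share > max_fraction:
            flagged.append(snp_id)
    return flagged


# Made-up AAR values (%) for two hypothetical SNPs across five samples.
aar_by_snp = {
    "polymA": [0.0, 49.5, 100.0, 0.2, 51.0],
    "polymB": [8.0, 12.5, 47.0, 9.1, 90.0],
}
print(recurrent_outliers(aar_by_snp))
```

Here only polymB is flagged, because four of its five values sit outside the expected homozygous and heterozygous ranges. A histogram remains the better tool for choosing the cutoff.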

Please run GATK DepthOfCoverage with the -baseCounts parameter on your list of polymorphisms (BED file) to generate the coverage report files required by ART-DeCo.
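If the list of polymorphisms already exists in the SNP information layout (chromosome, position, reference base, alternative base, SNP ID), the BED file can be derived from it. The sketch below assumes whitespace-separated columns and 1-based positions, neither of which is confirmed here; the function name and the example coordinates are made up. BED intervals are 0-based and half-open, so a SNP at position p becomes the interval from p - 1 to p.

```python
def snp_info_to_bed(lines):
    """Convert SNP information lines (chrom, pos, ref, alt, id) into BED lines.

    Assumes whitespace-separated columns and 1-based positions (an assumption).
    """
    bed = []
    for line in lines:
        fields = line.split()
        # Skip blank, incomplete, comment or header lines.
        if len(fields) < 5 or fields[0].startswith("#") or not fields[1].isdigit():
            continue
        chrom, pos, snp_id = fields[0], int(fields[1]), fields[4]
        # BED is 0-based, half-open: [pos - 1, pos)
        bed.append(f"{chrom}\t{pos - 1}\t{pos}\t{snp_id}")
    return bed


# Made-up coordinates, for illustration only.
example = ["chr1 1000 A T polym1", "chr2 2500 G A polym2"]
for row in snp_info_to_bed(example):
    print(row)
```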

Running ART-DeCo

To run ART-DeCo, call the Perl script as follows:

perl scripts/ART-DeCo.pl

The -h option prints the following usage:

Usage:

ART-DeCo (Allelic Ratio based Tool for Detection of Contamination)
version: 1.1

ART-DeCo.pl -a <path for information about SNPs file> -i <path for sample ID and SNP allele coverage report file> -o <output file name> [options]

Input options:
-a PATH File containing information about SNPs of interest to analyse
-i PATH File containing sample ID and SNP allele coverage report file
(optional)
-c INT Minimal number of reads required to analyse a SNP [200]
-f INT Maximum allele frequency noise allowed [1]

Output options:
-o STR Output file name
(optional)
-d PATH Output directory
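
When ART-DeCo is launched from a pipeline, the command line can be assembled from the options listed in the usage above. This is a minimal sketch: only the flags -a, -i, -o, -c, -f and -d come from the usage text, while the function name, the run ID "run1" and the directory "results" are hypothetical.

```python
import shlex


def build_artdeco_command(snp_info, sample_list, output_name,
                          min_reads=None, max_noise=None, out_dir=None,
                          script="scripts/ART-DeCo.pl"):
    """Assemble the ART-DeCo argument list from the documented options."""
    cmd = ["perl", script, "-a", snp_info, "-i", sample_list, "-o", output_name]
    if min_reads is not None:
        cmd += ["-c", str(min_reads)]   # minimal number of reads per SNP
    if max_noise is not None:
        cmd += ["-f", str(max_noise)]   # maximum allele frequency noise
    if out_dir is not None:
        cmd += ["-d", out_dir]          # output directory
    return cmd


cmd = build_artdeco_command(
    "scripts/FILES/PolymorphismsInformation.tsv",
    "CoveragePathExample.txt",
    "run1",                 # hypothetical run ID
    min_reads=200,
    max_noise=1,
    out_dir="results",      # hypothetical output directory
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory where the archive was extracted.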

Inputs and options

Mandatory inputs:

There are three mandatory inputs for ART-DeCo.

Input	Option	Description and example
File containing SNP information	-a	File with chromosome, position, reference base, alternative base and SNP ID
File containing sample IDs and paths to SNP allele coverage reports	-i	2-column file: sample ID and path to the sample's allele coverage report (see below for a report example)
Output file name	-o	Any string, a run ID for example

Note: the SNP information file (an example is provided as scripts/FILES/PolymorphismsInformation.tsv) must contain the chromosome, position, reference base, alternative base and any information useful to annotate the variant (an rsID might be used, for example):
chr1 position1 A T polym1
chr2 position2 G A polym2
chr3 position3 C A polym3

Example of a SNP allele coverage report (obtained using GATK DepthOfCoverage with the -baseCounts option):

chr1:position1 A:0 C:496 G:0 T:946
chr2:position2 A:154 C:0 G:169 T:0
chr3:position3 A:598 C:0 G:0 T:0

Note: the three lines above are an example in which position1, position2 and position3 stand for absolute positions. Any output produced by GATK DepthOfCoverage with the -baseCounts option will match the expected format.
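
To make the report layout concrete, the sketch below parses one report line and derives an alternative allele frequency from it. It is an illustration only, not ART-DeCo's own parser: the function names are hypothetical, and taking the depth as the sum of the A, C, G and T counts is an assumption.

```python
def parse_base_counts(line):
    """Parse 'chr2:position2 A:154 C:0 G:169 T:0' into (locus, {'A': 154, ...})."""
    fields = line.split()
    locus = fields[0]
    counts = {}
    for field in fields[1:]:
        base, _, count = field.partition(":")
        # Keep only the four nucleotide counters; ignore any other column.
        if base in ("A", "C", "G", "T") and count.isdigit():
            counts[base] = int(count)
    return locus, counts


def alt_allele_frequency(counts, alt):
    """Alternative allele frequency in %, or None when the position is not covered."""
    depth = sum(counts.values())  # assumption: depth = A + C + G + T
    if depth == 0:
        return None
    return 100.0 * counts.get(alt, 0) / depth


# Second example line of the report; its alternative base is A in the SNP information file.
locus, counts = parse_base_counts("chr2:position2 A:154 C:0 G:169 T:0")
print(locus, counts, round(alt_allele_frequency(counts, "A"), 2))
```

For this line the frequency is 154 / 323, about 47.7%, which is what a heterozygous position would be expected to show.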

Options and recommendations:

There are several options you should adjust to your run and design characteristics.

Input	Option	Recommendation
Minimal number of reads required to analyse a SNP	-c	Please adjust it to your sample depth and the number of SNPs in your design.
Maximum allele frequency noise	-f	Please adjust it to your sequencer expected noise

Regarding the minimal number of reads required to analyse a SNP (option -c), please note that:
A high value leads to a stringent SNP selection, with fewer positions analysed but meaningful statistical reports.
A low value leads to a permissive SNP selection, with more positions analysed but less powerful statistical reports.
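
The trade-off can be previewed before a run by counting how many SNPs would pass a given threshold. The sketch below uses the depths of the three example report lines shown earlier (1442, 323 and 598 reads). The function name is hypothetical, and treating the threshold as inclusive is an assumption.

```python
def split_by_depth(depth_by_snp, min_reads=200):
    """Split SNP IDs into (kept, dropped) according to a minimal read depth."""
    kept = sorted(s for s, d in depth_by_snp.items() if d >= min_reads)
    dropped = sorted(s for s, d in depth_by_snp.items() if d < min_reads)
    return kept, dropped


# Depths summed from the example report lines: 496 + 946, 154 + 169, 598.
depths = {"polym1": 1442, "polym2": 323, "polym3": 598}
for threshold in (200, 500):
    kept, dropped = split_by_depth(depths, threshold)
    print(threshold, "kept:", kept, "dropped:", dropped)
```

With the default of 200 all three positions are kept; raising the threshold to 500 drops polym2.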

Outputs

The file ART-DeCo.tsv reports the worst case scenario (WCS), which aims to highlight contamination.

Sample	WCS percentage of contamination	Contaminant	p-value	Percentage of contamination by the contaminant
Sample1	49.52
Sample2	0.37
Sample3	46.65	Sample2	0.00158	45.44
Sample4	45.37	Sample2	8.97e-12	22.55
Sample5	25.21	Sample2	7.31e-13	12.04
Sample6	13.72	Sample2	7.31e-13	5.65
Sample7	6.58	Sample2	1.1e-05	2.66
Sample8	0.44	Sample2

The example provided shows Sample1 with a high WCS but no contaminant observed. Such a high rate could be due to an allograft, index issues (several samples with identical indexes) or a contaminant not included in the run. A false positive prediction is unlikely at such a high rate.

To look deeper into the predictions, the file ART-DeCo_Long_Report.tsv helps to better distinguish real contaminations.

Sample	Sample1	Sample2	Sample3	Sample4	Sample5	Sample6	Sample7	Sample8
AAR mean variants < 25%	12.33	0.02	7.75	6.58	3.76	1.89	0.83	0.03
AAR sd variants < 25%	10.77	0.05	10.59	7.93	4.3	2.26	1.02	0.07
number of variants < 25%	28	42	31	37	39	39	39	33
AAR mean variants > 75%	85.7	99.87	77.68	79.85	88.01	95.05	96.92	99.88
AAR sd variants > 75%	12.29	0.09	NA	5.54	3.08	0.58	0.81	0.14
number of variants > 75%	3	9	1	4	6	6	6	6
AAR mean variants in 25-75%	44.82	47.37	37.63	40.6	42.16	46.07	46.43	48.03
AAR sd variants in 25-75%	14.86	3.1	12.29	12.95	4.95	3.56	3.89	3.66
number of variants in 25-75%	32	15	34	25	21	21	21	27
number of most informative variants (30-70% of run allelic population)	12	8	25	26	24	24	24	12
WCS percentage of contamination	49.52	0.37	46.65	45.37	25.21	13.72	6.58	0.44
Contaminant			Sample2	Sample2	Sample2	Sample2	Sample2
p-value			0.00158	8.97e-12	7.31e-13	7.31e-13	1.1e-05
% homozygous variants compatible with contamination			90.62	100	100	100	86.67
# homozygous variants in contaminated covered in contaminant			32	41	45	45	45
# covered variants	64	66	66	66	66	66	66	66
Percentage of contamination by the contaminant			45.44	22.55	12.04	5.65	2.66
sd for percentage of contamination by the contaminant			3.75	9.05	6.53	2.53	1.92

High values of AAR mean variants < 25% and low values of AAR mean variants > 75% might help to highlight a contamination.
High values of AAR sd variants might help to highlight a contamination.
A high number of homozygous variants relative to a low number of heterozygous variants might help to highlight a contamination.
A high % homozygous variants compatible with contamination might help to highlight a contamination.

The number of most informative variants (observed in 30-70% of the run population, with AAR 25% in the sample) helps users make sure that enough informative variants were analysed.
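
The per-bin rows of the long report (mean, sd and number of variants below 25%, between 25% and 75%, and above 75%) can be reproduced in spirit with a few lines. This is a sketch under assumptions: the function name is hypothetical, the handling of values exactly at 25% or 75% is a guess, and the sample standard deviation is used because it is undefined for a single variant, which is consistent with the NA shown for Sample3.

```python
from statistics import mean, stdev


def summarise_aar(aar_values, low=25.0, high=75.0):
    """Group AAR values (%) into three bins and report n, mean and sd per bin."""
    bins = {"< 25%": [], "25-75%": [], "> 75%": []}
    for value in aar_values:
        if value < low:
            bins["< 25%"].append(value)
        elif value > high:
            bins["> 75%"].append(value)
        else:
            bins["25-75%"].append(value)
    summary = {}
    for label, values in bins.items():
        summary[label] = {
            "n": len(values),
            "mean": mean(values) if values else None,
            # Sample sd needs at least two values; None plays the role of NA.
            "sd": stdev(values) if len(values) >= 2 else None,
        }
    return summary


# The five Sample 1 frequencies listed in the Frequencies.tsv example of this page.
for label, stats in summarise_aar([13.94, 15.22, 18.26, 76.53, 80.89]).items():
    print(label, stats)
```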

For each contaminated sample whose contaminant is in the run, homozygous variants are analysed. Some contaminant AAR values would be irrational (e.g. if a variant is at 2% in the contaminated sample, the contaminant MUST be higher, not at 0%). Based on this logic, we report the number of homozygous variants in the contaminated sample that are covered in the contaminant (the ones we can investigate) and the percentage of homozygous variants compatible with contamination. The closer this percentage is to 100%, the more evident the contamination is.
Note that a 50% contamination might lower the percentage, since some heterozygous variants might cross the 25% / 75% heterozygous threshold. Nevertheless, the percentage remains close to 100%.
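
The rule described above can be illustrated as follows. This is not ART-DeCo's implementation, only a sketch of the stated logic: a shifted homozygous reference variant needs a contaminant with a higher AAR, and a shifted homozygous alternative variant needs a contaminant with a lower AAR. The function name, the 1% noise level and the treatment of the remaining variants as not applicable are assumptions. The values are the Sample 2 / Sample 3 pairs from the comparison table at the end of this page.

```python
def is_compatible(contaminated, contaminant, noise=1.0, het_low=25.0, het_high=75.0):
    """Apply the compatibility rule to one variant; None when the rule does not apply."""
    if noise < contaminated < het_low:
        # Alternative reads appeared: the contaminant must carry more of them.
        return contaminant > contaminated
    if het_high < contaminated < 100.0 - noise:
        # Alternative reads were diluted: the contaminant must carry fewer of them.
        return contaminant < contaminated
    return None  # unshifted homozygous or heterozygous variant


# (contaminant AAR, contaminated AAR) pairs in %.
pairs = {
    "polym60": (0.0, 22.9),
    "polym51": (0.0, 23.2),
    "polym15": (46.6, 23.3),
    "polym93": (51.6, 23.4),
    "polym82": (45.2, 77.7),
}
results = {snp: is_compatible(cd, ct) for snp, (ct, cd) in pairs.items()}
applicable = [r for r in results.values() if r is not None]
print(results)
print(100.0 * sum(applicable) / len(applicable))
```

polym60 and polym51 come out as incompatible although the table explains them as heterozygous variants pushed below 25%. This is the effect mentioned in the note above: at around 50% contamination, heterozygous variants crossing the threshold lower the percentage.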

For low WCS values, a look at the sample's alternative allele frequencies can help to distinguish between false positive predictions (due to run noise or index hopping) and real unexpected distributions. The file Frequencies.tsv is intended for plotting SNP distributions, and users should look at it.

This file highlights unexpected frequencies (matching a contamination prediction), as in these examples for Sample 1:

SNP id	Alternative allele frequency
polym86	13.94
polym60	15.22
polym82	18.26
polym90	76.53
polym106	80.89

If a contaminant is predicted, the same plot as above helps, and every unexpected frequency in the contaminated sample should be easily explained by the contaminant's frequencies, as below:

SNP id	Sample 2 : contaminant	Sample 3 : contaminated	comments
polym60	0	22.9	contaminated sample was likely heterozygous and contamination led to a frequency decrease
polym51	0	23.2	contaminated sample was likely heterozygous and contamination led to a frequency decrease
polym15	46.6	23.3	contaminated sample was likely homozygous reference and contamination led to a frequency increase
polym93	51.6	23.4	contaminated sample was likely homozygous reference and contamination led to a frequency increase
polym82	45.2	77.7	contaminated sample was likely homozygous alternative and contamination led to a frequency decrease

Following this logic, no SNP should be around 0% in one sample and 100% in the other if they are contaminant and contaminated.
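
The comments in the table above can be turned into a rough per-SNP estimate with a simple linear mixing model: observed = (1 - c) x own + c x contaminant, where c is the contamination fraction. This is a back-of-the-envelope sketch, not the estimator used by ART-DeCo. It assumes that reads mix linearly and that the contaminated sample's own genotype has an ideal frequency of 0%, 50% or 100%, taken here from the table comments. The function name is hypothetical.

```python
def contamination_percent(observed, own, contaminant):
    """Solve observed = (1 - c) * own + c * contaminant for c, returned in %.

    All arguments are alternative allele frequencies in %. Returns None when
    both samples share the same frequency, since c cannot be determined.
    """
    if contaminant == own:
        return None
    return 100.0 * ((observed - own) / (contaminant - own))


# (SNP, contaminant AAR, contaminated AAR, assumed own genotype frequency)
rows = [
    ("polym60", 0.0, 22.9, 50.0),    # likely heterozygous
    ("polym51", 0.0, 23.2, 50.0),    # likely heterozygous
    ("polym15", 46.6, 23.3, 0.0),    # likely homozygous reference
    ("polym93", 51.6, 23.4, 0.0),    # likely homozygous reference
    ("polym82", 45.2, 77.7, 100.0),  # likely homozygous alternative
]
for snp, contaminant, observed, own in rows:
    print(snp, round(contamination_percent(observed, own, contaminant), 1))
```

Under these assumptions the five estimates fall roughly between 40% and 55%, the same order of magnitude as the percentage of contamination reported for Sample3 by Sample2.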