Re: [svtoolkit-help] RedundancyAnnotator

Hi, Thomas,

The RedundancyAnnotator is designed to work with VCFs generated by
multiple tools (it was originally developed to help merge SVs from
multiple callers as part of the 1000 Genomes project). It does require
consistent annotations (e.g. all input files have genotype likelihoods),
and the quality of the results will depend on how well calibrated the
likelihoods are. I haven't used it with lumpy, so your mileage may vary.

There are a couple of different approaches/modes you can use. The
defaults are designed more for multiple calling methods on the same
samples, which is what it sounds like you have. The tool compares all
"nearby" variants pairwise, where "nearby" is usually determined by 
degree of overlap (default is 50% I believe, but you can change this). 
For the pairwise comparison, the default mode calculates the likelihood
that any sample is more likely to have a different genotype than the
same genotype, basically off-diagonal vs. on-diagonal (again, you can
adjust the threshold). If no samples have sufficiently confidently
different genotypes, then the two variants are deemed redundant. In this 
case, we want to filter one of the two redundant variants, and this is 
done by setting a filter on the variant with the smallest posterior 
genotype likelihoods (least confident genotype calls). The method 
attempts to compute a stable dominance order so that if there are 
multiple overlapping calls the minimal set is removed.
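
To make the default comparison concrete, here is a rough Python sketch of
the idea. It is not the RedundancyAnnotator code: the function names, the
use of reciprocal overlap, the assumption that the two calls are
independent, and the 0.5 threshold are all invented for illustration.

```python
# Illustrative sketch only; names and thresholds are hypothetical,
# not taken from svtoolkit.

def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Shared length divided by the longer interval (half-open coordinates)."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return shared / max(a_end - a_start, b_end - b_start)

def prob_discordant(post_a, post_b):
    """Off-diagonal mass for one sample: P(genotype differs between calls).

    post_a and post_b are posterior probabilities over the same genotype
    classes; the calls are assumed independent given the data.
    """
    p_same = sum(pa * pb for pa, pb in zip(post_a, post_b))  # on-diagonal
    return 1.0 - p_same

def is_redundant(posts_a, posts_b, threshold=0.5):
    """Two variants are redundant if no sample is confidently discordant."""
    return all(prob_discordant(a, b) <= threshold
               for a, b in zip(posts_a, posts_b))

def less_confident(posts_a, posts_b):
    """Pick which of two redundant variants to filter: the one whose
    genotype calls are less confident (lower summed best posterior)."""
    conf_a = sum(max(p) for p in posts_a)
    conf_b = sum(max(p) for p in posts_b)
    return "a" if conf_a < conf_b else "b"
```

For example, two deletions at 100-200 and 150-250 have a reciprocal
overlap of 0.5 in this sketch, so they would be compared at a 50% cutoff.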

The default settings tend to produce rather "light" filtering, erring on 
the side of leaving calls unfiltered and only filtering those that are 
confidently quite similar. This may be appropriate for an association 
study, where you don't mind a few extra tests. If you are trying to 
create a reference map, you may get better results by turning up the 
thresholds. You can evaluate, for example, by looking at how many 
overlapping calls remain and the degree of overlap. If you turn the 
thresholds high enough, you can force the output to have no overlapping 
variants.
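
One way to run that evaluation is to list the overlapping pairs among the
calls that pass the filter. The sketch below assumes you have already
pulled the unfiltered calls out of the VCF as (chrom, start, end) tuples
with half-open coordinates; the function name is made up for this example.

```python
# Illustrative sketch only; the input format is an assumption.

def remaining_overlaps(calls):
    """Return (call_a, call_b, shared_bp) for every overlapping pair.

    calls: iterable of (chrom, start, end) tuples, half-open coordinates.
    """
    ordered = sorted(calls)
    pairs = []
    for i, (chrom, start, end) in enumerate(ordered):
        for other in ordered[i + 1:]:
            # Sorted by chrom then start, so once a later call starts at or
            # past our end (or is on another chrom) nothing further overlaps.
            if other[0] != chrom or other[1] >= end:
                break
            shared = min(end, other[2]) - other[1]
            pairs.append(((chrom, start, end), other, shared))
    return pairs
```

An empty result means the output has no overlapping variants left; the
shared_bp values give a feel for how much the remaining calls overlap.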

As an aside, the other application we use this for is to combine calls 
across disjoint sets of samples. In this case, we need more aggressive 
merging, so we change the settings to ignore the genotype likelihoods in 
the pairwise comparisons and just use the hard genotype calls and set a 
threshold on the allowable number of genotype discordances.
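
As a sketch of that hard-call mode (again illustrative Python, not the
tool's implementation): genotypes are represented here as alternate-allele
counts, None marks a missing call, and skipping missing calls is an
assumption of this example, as is the max_discordant parameter name.

```python
# Illustrative sketch only; names and missing-call handling are assumptions.

def count_discordances(gts_a, gts_b):
    """Count samples whose hard genotype calls differ between two variants.

    Samples with a missing call (None) in either variant are skipped.
    """
    return sum(1 for a, b in zip(gts_a, gts_b)
               if a is not None and b is not None and a != b)

def redundant_by_hard_calls(gts_a, gts_b, max_discordant=0):
    """Redundant if the number of discordant samples is within the limit."""
    return count_discordances(gts_a, gts_b) <= max_discordant
```

Raising max_discordant makes the merging more aggressive.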

-Bob

On 10/3/17 8:09 AM, Thomas Faraut wrote:
> Dear GenomeSTRIP team,
>
> We used successfully genomeSTRIP to detect medium to large deletions in
> goats.
> For smaller deletions, we use another variant detection tool (lumpy) but
> would like to be able to use the RedundancyAnnotator from the svtoolkit to
> detect duplicate calls.
>
> Is it possible to use the RedundancyAnnotator with a vcf file provided by
> another SV genotyping tool provided that the genotype likelihoods are
> available ?
> Or is this redundancy score calculation described in one of the genomeSTRIP
> paper ?
>
> Thank you in advance for your help.
>
> Best regards,
> Thomas Faraut
>