Re: [VCFtools-spec] Wishlist for 4.2: allow space characters in INFO; functional annotation


Both Will's and Petr's suggestions make sense in a human-readable format.
There are a few points I'd like to highlight (not necessarily with very
solid proposals):

** Whether to distinguish coding sequence annotation from annotation of
other elements?*
  1. We can separate CDS annotations (which are associated with protein
changes or alternative splicing) from other annotations (mostly based on
genomic coordinates).
  2. Or we can treat a CDS annotation as a 'specialized, finer-grained'
form of a more general genomic annotation.
  - In the former case, we will need two separate annotations (a
CDS-specific one and a general, region-based one) for a coding variant, and
zero to multiple annotations for non-coding variants.
  - The first approach is probably easier to handle separately for genomes
and exomes. The benefit of the second approach is the possibility of more
fine-grained non-coding information later on (as with missense or nonsense
variants, some specific mutations within a region could be characterized).
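
To make the second option concrete, here is a minimal sketch in which a CDS
annotation is a specialization of a general region-based annotation. All
class and field names are hypothetical and only illustrate the idea; nothing
here is taken from the VCF spec or from any annotation tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegionAnnotation:
    """General annotation that needs only genomic coordinates (hypothetical)."""
    feature_id: str     # e.g. a transcript or regulatory feature identifier
    feature_type: str   # e.g. "Transcript"
    consequence: str    # e.g. "intron_variant"


@dataclass
class CodingAnnotation(RegionAnnotation):
    """Finer-grained specialization that adds protein-level detail."""
    cds_position: Optional[int] = None
    protein_position: Optional[int] = None
    amino_acids: Optional[str] = None   # e.g. "T/M"


def describe(ann: RegionAnnotation) -> str:
    """Render the general part, plus coding detail when it is available."""
    base = f"{ann.feature_type}:{ann.feature_id}:{ann.consequence}"
    if isinstance(ann, CodingAnnotation) and ann.amino_acids:
        return f"{base}:{ann.protein_position}:{ann.amino_acids}"
    return base
```

A parser that only understands the general form can still read a coding
annotation, because every CodingAnnotation is also a RegionAnnotation.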

** Whether (and how) to make the "dictionary" of the annotations in the VCF
header for facilitating automated parsing?*
 - The two suggested formats are very readable for humans, but in order
to enable automated queries without knowing the specific details of how the
annotation was performed, we will need a "dictionary" in the header. For
example, there needs to be a consensus on whether to use the keyword
"Transcript" to represent a transcript (as an Ensembl ID?), and "Gene" to
represent a gene (as a gene symbol?). Another important issue is whether to
hard-code the hierarchical relationship between different elements (e.g.
Gene-Transcript-Exon-AminoAcid) or to allow flexibility there.
 - The granularity of the "Consequence" part in particular varies quite a
lot between different annotation software packages. The consequences can
also be structured hierarchically (e.g. a frameshift indel is an instance
of a LoF variant). The question is whether to predefine such rules in the
VCF spec, or to have metadata representing the relationships between
different types of "functional consequences".
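
As an illustration of what a header "dictionary" buys us, here is a rough
sketch that reads the field names from the VEP-style CSQ header line quoted
further down in this thread and uses them to parse the CSQ value. The
function names are my own, and the "Format:" convention is just what Will's
example shows, not something the spec defines.

```python
import re
from typing import Dict, List


def csq_fields(header_line: str) -> List[str]:
    """Extract field names from a 'Format: a|b|c' list in the Description."""
    match = re.search(r'Format:\s*([^">]+)', header_line)
    if match is None:
        raise ValueError("no 'Format:' list found in header line")
    return match.group(1).strip().split("|")


def parse_csq(value: str, fields: List[str]) -> List[Dict[str, str]]:
    """Split a CSQ value into one dict per comma-separated chunk."""
    records = []
    for chunk in value.split(","):
        parts = chunk.split("|")   # empty fields stay as empty strings
        if len(parts) != len(fields):
            raise ValueError(f"expected {len(fields)} fields, got {len(parts)}")
        records.append(dict(zip(fields, parts)))
    return records


header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type '
          'as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|'
          'Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|'
          'Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">')
value = ("A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|"
         "1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)")

records = parse_csq(value, csq_fields(header))
print(records[0]["Consequence"], records[0]["PolyPhen"])
```

This works without knowing in advance which fields the tool emitted, but only
because the field list happens to be embedded in free text. A real
"dictionary" would make both the field names and their meanings explicit.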

** Distinguishing region-based annotations and variant-specific annotations?*
 - Some annotation categories require only knowledge of the REF allele (e.g.
whether the variant overlaps with an exon or an ENCODE region), and some
categories require knowledge of both the REF and ALT alleles (e.g. missense
and nonsense variants). I prefer either to separate these two types of
annotations, or to make the former (REF-only) an instance of the latter
(requiring REF and ALT). The features for these types of annotations can
also differ. For example, a conservation score fits the former category
well (and also the latter), and it is possible to annotate a non-variant
site without having an ALT allele. However, PolyPhen scores cannot be
computed without knowing the variant allele, so they are specific to the
latter category.
 - For the former category, it may be worthwhile to annotate every possible
genomic coordinate in some shareable format (this would include e.g. GENCODE
and ENCODE annotations, conservation scores, 1000G masks, and ancestral
allele information).
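
The distinction above could be sketched as two lookup tables with different
keys: site-level annotations keyed by position only, and allele-level
annotations keyed by REF and ALT as well. The table layout and names are
hypothetical; the example values are taken from Will's VEP record quoted
below.

```python
from typing import Dict, List, Tuple

Site = Tuple[str, int]               # (chrom, pos): no ALT allele needed
Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

# Region-based annotations: available even for non-variant sites.
site_annotations: Dict[Site, Dict[str, str]] = {
    ("21", 26960070): {"EXON": "10/11"},
}

# Variant-specific annotations: only defined once the ALT allele is known.
variant_annotations: Dict[Variant, Dict[str, str]] = {
    ("21", 26960070, "G", "A"): {
        "Consequence": "missense_variant",
        "PolyPhen": "benign(0.001)",
    },
}


def annotate(chrom: str, pos: int, ref: str, alts: List[str]) -> dict:
    """Combine the site-level annotation with one entry per ALT allele."""
    site = site_annotations.get((chrom, pos), {})
    alleles = {
        alt: variant_annotations.get((chrom, pos, ref, alt), {})
        for alt in alts
    }
    return {"site": site, "alleles": alleles}
```

Calling annotate("21", 26960070, "G", []) on a non-variant site still returns
the site-level part, which is the property that makes the REF-only category
shareable across call sets.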

** Whether to encode the function annotation into VCF only, or make a
separate file if necessary?*
 - Writing out every detailed annotation for every variant may make the
file quite large and redundant in some cases (especially when considering
non-coding annotations). Another option is to put only "annotation IDs"
in the VCF file and have a separate file describing the details of the
annotations (the separate file may also represent the hierarchy of the
annotations if necessary).
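
A minimal sketch of the "annotation IDs plus separate file" idea, assuming the
VCF carries only short IDs and a side table holds the details and an optional
parent for each ID. The table contents and the function name are made up for
illustration; the only relationship taken from this discussion is that a
frameshift indel is an instance of a LoF variant.

```python
from typing import Dict, Optional

# Hypothetical side table, as it might be loaded from a separate file.
ANNOTATIONS: Dict[str, Dict[str, Optional[str]]] = {
    "LoF": {"parent": None, "description": "loss-of-function variant"},
    "frameshift": {"parent": "LoF", "description": "frameshift indel"},
    "missense": {"parent": None, "description": "missense variant"},
}


def is_a(ann_id: str, ancestor: str,
         table: Dict[str, Dict[str, Optional[str]]] = ANNOTATIONS) -> bool:
    """Walk the parent links to test whether ann_id falls under ancestor."""
    seen = set()
    current: Optional[str] = ann_id
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)   # guards against accidental cycles in the table
        current = table.get(current, {}).get("parent")
    return False


print(is_a("frameshift", "LoF"))
print(is_a("missense", "LoF"))
```

With this layout a query such as "all LoF variants" needs only the IDs stored
in the VCF and the side table, regardless of how fine-grained the annotating
software was.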

On Mon, Nov 26, 2012 at 11:47 AM, Will McLaren <wm...@eb...> wrote:

> Hello all,
>
> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor)
> software - I'd like to give the list our perspective on how we add
> functional annotations in VCF 4.1 currently.
>
> The VEP parses VCF (alongside other formats) and users can choose to
> output in VCF format too (though this is not the default, many of our users
> use it).
>
> The format for the functional data that we use is similar to that
> described by Petr (I suspect the example he shows is derived from VEP
> output of some form). We use the CSQ key in the INFO field, with the value
> consisting of "|" (pipe) separated chunks of data fields; the chunks
> themselves are separated by commas.
>
> Each chunk contains functional annotation for one alt allele + functional
> element combination. At the moment a functional element can be a
> transcript, regulatory feature or transcription factor binding motif.
>
> The fields and their order vary according to which command line options
> are used (and therefore which additional data is added). The order of
> fields is defined in a header line added to the VCF. The user may also
> specify a list of fields that they would like included, somewhat similar to
> a roll-your-own format.
>
> Missing data are left empty (i.e. you will see two consecutive "|"
> delimiters if a field is empty).
>
> Example:
>
> ##fileformat=VCFv4.1
> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
> predicted by VEP. Format:
> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
> 21      26960070        rs116645811     G       A       .       .
> CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>
> Some thoughts from me:
>
> - I would definitely prefer to avoid introducing additional whitespace.
> Currently I am whitespace ambivalent when parsing input; changing this
> would cause a lot of problems for users without any major benefits that I
> can see. In the few cases we have to push in data that might have spaces,
> we replace them with "_" underscores (and commas are replaced by "&"
> ampersands)
>
> - I would be strongly in favour of enforcing standards on the functional
> types called - we use Sequence Ontology (SO) types, and we've encouraged
> UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using them too.
> The SO guys are very open to contributions if there are types not yet
> described
>
> - some flexibility in the data fields that go with the functional
> annotation would be great - we report, for example, SIFT and PolyPhen
> predictions which are very popular with our users, but there's no reason to
> suppose these will be the flavours of the day in 1 or 2 years' time. Not to
> mention the potential expansion in non-coding annotations in a post-ENCODE
> world. But of course I recognise flexibility scales inversely with ease of
> parsing
>
> - beyond not wanting to disrupt our users' parsers, I don't have a problem
> changing the delimiters etc that we currently use
>
> - some of our fields are duplicated as they are specific only to the
> variant, not particularly to the allele/functional element combination -
> e.g. the Existing_variation field. This is a hangover of the VCF output
> being derived from our default output, which has one line per
> allele/functional element combo. I'd also be in favour of resolving these
> out somehow to reduce duplications
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
>
> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>
>> Hi Gonzalo,
>>
>> I welcome the idea of standardizing the functional annotations. Here is
>> an example of a wildly evolved format that we have been using so far:
>>
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT
>> alleles from Ensembl 66 VEP v2.4, format
>>
>> transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>
>> and two concrete examples, the first for a multiallelic site:
>>
>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>
>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>
>>
>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>
>> I am curious what other formats are in use?
>>
>>
>> I'd prefer not to introduce whitespaces in the INFO field or change the
>> column delimiters to spaces or extend to whitespaces; it would break
>> existing software and wouldn't bring much benefit.
>>
>> Petr
>>
>>
>>
>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...>
>> wrote:
>> > > Hi Bradford,
>> > >
>> > > I do understand where you're coming from, but truthfully I'd prefer
>> to go in
>> > > the opposite direction once we're open to changing delimiters.  I've
>> never
>> > > quite understood why VCF is tab-delimited and not
>> whitespace-delimited.
>> >
>> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel -
>> please
>> > keep that. It is a good thing!
>> >
>> > > You wouldn't believe how many times people have manually generated
>> > > VCFs that were space-delimited and couldn't understand why they were
>> > > failing in VCF parsers.
>> >
>> > I'd be asking why doesn't your parser give a clearer error message?
>> > If you've seen people fall over this pothole many times the parser
>> > concerned should be fixed.
>> >
>> > > I'd much rather that all whitespace be treated equally (as it is
>> > > visually).  It makes for a much simpler spec.
>> >
>> > The problem with white space is you can't see how many characters
>> > there are - spaces and tabs are not treated equally visually. What
>> > would you expect if there were several spaces in a row? If you treat
>> > it as one separator you prevent using empty cells (I'm thinking in
>> > terms of generalities here, not just VCF).
>> >
>> > Regards,
>> >
>> > Peter
>>
>>
>>
>>
>> --
>>  The Wellcome Trust Sanger Institute is operated by Genome Research
>>  Limited, a charity registered in England with number 1021457 and a
>>  company registered in England with number 2742969, whose registered
>>  office is 215 Euston Road, London, NW1 2BE.
>>
>>
>> ------------------------------------------------------------------------------
>> Monitor your physical, virtual and cloud infrastructure from a single
>> web console. Get in-depth insight into apps, servers, databases, vmware,
>> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
>> Pricing starts from $795 for 25 servers or applications!
>> http://p.sf.net/sfu/zoho_dev2dev_nov
>> _______________________________________________
>> VCFtools-spec mailing list
>> VCF...@li...
>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>
>
>
>
>
>

-- 
-----------------------------------------------------
Hyun Min Kang, Ph.D.
Assistant Professor of Biostatistics
University of Michigan, Ann Arbor
Email : hm...@um...