Re: [VCFtools-spec] Wishlist for 4.2: allow space characters in INFO; functional annotation


That is a good list of questions. I would also add, keeping in mind the
VCF/BCF duality:

** How should functional consequence be encoded in BCF?*
- Encoding the consequences as a string with subdelimiters gives a lot of
flexibility and makes it easy to associate multiple consequences with a
variant (e.g. a variant that can affect multiple transcripts) or with
multiallelic sites (where the different alleles have different
consequences). However, should there be a way to keep the strict typing
within the consequence fields, particularly to allow a BCF writer to know
what the types should be and to encode numeric values appropriately?
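
To make the typing question concrete, here is a minimal sketch in Python. It
assumes a hypothetical per-sub-field schema: the names CSQ_SCHEMA,
parse_typed_chunk and SIFT_score are made up for illustration and are not part
of the VCF/BCF spec, htslib or GATK. It only shows how a writer that knew the
types could coerce the pipe-delimited sub-fields before encoding them.

```python
# Hypothetical schema: sub-field name -> Python type. Not part of any spec.
CSQ_SCHEMA = [
    ("Allele", str),
    ("Consequence", str),
    ("cDNA_position", int),
    ("SIFT_score", float),
]


def parse_typed_chunk(chunk, schema=CSQ_SCHEMA, sep="|"):
    """Split one consequence chunk and coerce each sub-field to its type.

    Empty sub-fields are treated as missing and returned as None.
    """
    values = chunk.split(sep)
    if len(values) != len(schema):
        raise ValueError(
            f"expected {len(schema)} sub-fields, got {len(values)}"
        )
    return {
        name: (typ(raw) if raw != "" else None)
        for (name, typ), raw in zip(schema, values)
    }


typed = parse_typed_chunk("A|missense_variant|1043|0.05")
```

With something like this declared in the header, a BCF writer would at least
know which sub-fields are numeric; how that would map onto BCF's own typed
encoding is exactly the open question.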

It may be a premature optimization to think about this, but with the recent
enhanced tool support for BCF (in htslib and GATK) I think more people will
be using BCF going forward.

On Mon, Nov 26, 2012 at 1:35 PM, Hyun Min Kang <hm...@um...> wrote:

> Both Will's and Petr's suggestions make sense in a human-readable format.
> There are a few points I'd like to highlight (not necessarily with very
> solid proposals)
>
> ** Whether to distinguish coding sequence annotation from annotation of
> other elements?*
>   1. We can separate a CDS annotation (which is associated with protein
> changes or alternative splicing) and other annotations (mostly based on
> genomic coordinates)
>   2. Or we can make a CDS annotation as a 'specialized, and finer-grained'
> annotation of a more general form of other genomic annotations.
>   - In the former case, we will need two separate annotations
> (CDS-specific one, and a general, region-based one) for a coding variant,
> and zero to multiple annotations for non-coding variants
>   - The first approach is probably easier to handle when genomes and
> exomes are treated separately. The benefit of the second approach is the
> possibility of more fine-grained non-coding information later on (much as
> with missense or nonsense variants, some specific mutations within a region
> can be characterized)
>
> ** Whether (and how) to make the "dictionary" of the annotations in the
> VCF header for facilitating automated parsing?*
>  - The two suggested formats are very readable by humans, but in order
> to enable automated queries without knowing the specific details of how the
> annotation was performed, we will need a "dictionary" in the header. For
> example, there needs to be a consensus on whether to use the keyword
> "Transcript" to represent a transcript (as an Ensembl ID?), and "Gene" to
> represent a gene (as a gene symbol?). Also, whether to hard-code the
> hierarchical relationship between different elements (e.g.
> Gene-Transcript-Exon-AminoAcid) or to allow flexibility there is an
> important issue.
>  - The granularity of the "Consequence" part in particular varies quite a
> lot between different annotation software packages. The consequences can
> also be structured hierarchically (e.g. a frameshift indel is an instance
> of a LoF variant). The question is whether to predefine such rules in the
> VCF spec, or to have metadata representing the relationship between
> different types of "functional consequences".
>
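
As a rough illustration of the "metadata representing the relationship"
option, the sketch below stores a child-to-parent table and walks it to answer
"is this consequence an instance of that one?". The table contents and the
names PARENT and is_a are hypothetical; they are not taken from the Sequence
Ontology or from any annotation tool.

```python
# Hypothetical child -> parent table; the terms are illustrative only.
PARENT = {
    "frameshift_variant": "loss_of_function",
    "stop_gained": "loss_of_function",
    "loss_of_function": "coding_variant",
    "missense_variant": "coding_variant",
}


def is_a(term, ancestor, parent=PARENT):
    """Return True if `term` equals `ancestor` or descends from it."""
    seen = set()  # guards against accidental cycles in the table
    while term is not None and term not in seen:
        if term == ancestor:
            return True
        seen.add(term)
        term = parent.get(term)
    return False
```

A query such as is_a("frameshift_variant", "loss_of_function") then works
regardless of how fine-grained the annotating software's terms are, provided
the table is shipped in the header or in a shared file.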
> ** Distinguishing region-based annotations and variant-specific
> annotations?*
>  - Some annotation categories require only knowledge of the REF allele
> (e.g. whether the variant overlaps with an exon, or an ENCODE region), and
> some categories require knowledge of both the REF and ALT alleles (e.g.
> missense and nonsense variants). I prefer to separate these two types of
> annotations, or to make the former (REF-only) an instance of the latter
> (requiring REF and ALT). The features for these types of annotations can
> also be different. For example, a conservation score fits well in the
> former (and the latter) category, and it is possible to annotate a
> non-variant site without having the ALT allele. However, PolyPhen scores
> cannot be assigned without knowing the variant allele, so they are specific
> to the latter category.
>  - For the former category, it may be worthwhile to annotate every
> possible genomic coordinate in some sharable format (e.g. this includes
> GENCODE and ENCODE annotation, conservation scores, 1000G masks, ancestral
> allele information).
>
> ** Whether to encode the function annotation into VCF only, or make a
> separate file if necessary?*
>  - Writing down every detailed annotation for every variant may make the
> file quite large and redundant in some cases (especially when considering
> non-coding annotations). It is also possible to put only "annotation IDs"
> in the VCF file and have a separate file describing the details of the
> annotations (the separate file may also represent the hierarchy of the
> annotations if necessary).
>
>
> On Mon, Nov 26, 2012 at 11:47 AM, Will McLaren <wm...@eb...> wrote:
>
>> Hello all,
>>
>> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor)
>> software - I'd like to give the list our perspective on how we add
>> functional annotations in VCF 4.1 currently.
>>
>> The VEP parses VCF (alongside other formats) and users can choose to
>> output in VCF format too (though this is not the default, many of our users
>> use it).
>>
>> The format for the functional data that we use is similar to that
>> described by Petr (I suspect the example he shows is derived from VEP
>> output of some form). We use the CSQ key in the INFO field, with the value
>> consisting of "|" (pipe) separated chunks of data fields; the chunks
>> themselves are separated by commas.
>>
>> Each chunk contains functional annotation for one alt allele + functional
>> element combination. At the moment a functional element can be a
>> transcript, regulatory feature or transcription factor binding motif.
>>
>> The fields and their order vary according to which command line options
>> are used (and therefore which additional data is added). The order of
>> fields is defined in a header line added to the VCF. The user may also
>> specify a list of fields that they would like included, somewhat similar to
>> a roll-your-own format.
>>
>> Missing data are left empty (i.e. you will see two consecutive "|"
>> delimiters if a field is empty).
>>
>> Example:
>>
>> ##fileformat=VCFv4.1
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>> 21      26960070        rs116645811     G       A       .       .       CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>>
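
For readers who want to see how the header-defined field order can drive
parsing, here is a minimal sketch in Python using the example above. The
function names csq_field_names and parse_csq are my own and are not part of
VEP; the sketch assumes the "Format: ..." list is the last thing in the
Description string, as it is in this example.

```python
import re

header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as '
    'predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|'
    'cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|'
    'Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">'
)
info = (
    'CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|'
    '1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)'
)


def csq_field_names(header_line):
    """Extract the ordered sub-field names from the CSQ header line."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' list found in header line")
    return match.group(1).strip().split("|")


def parse_csq(info_column, names):
    """Return one dict per comma-separated chunk of the CSQ value."""
    records = []
    for entry in info_column.split(";"):
        key, _, value = entry.partition("=")
        if key != "CSQ":
            continue
        for chunk in value.split(","):
            # Empty strings between consecutive "|" mark missing data.
            records.append(dict(zip(names, chunk.split("|"))))
    return records


names = csq_field_names(header)
records = parse_csq(info, names)
```

Because the names come from the header rather than being hard-coded, the same
code keeps working when command line options change which fields are present.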
>> Some thoughts from me:
>>
>> - I would definitely prefer to avoid introducing additional whitespace.
>> Currently I am whitespace ambivalent when parsing input; changing this
>> would cause a lot of problems for users without any major benefits that I
>> can see. In the few cases we have to push in data that might have spaces,
>> we replace them with "_" underscores (and commas are replaced by "&"
>> ampersands)
>>
>> - I would be strongly in favour of enforcing standards on the functional
>> types called - we use Sequence Ontology (SO) types, and we've encouraged
>> UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using them too.
>> The SO guys are very open to contributions if there are types not yet
>> described
>>
>> - some flexibility in the data fields that go with the functional
>> annotation would be great - we report, for example, SIFT and PolyPhen
>> predictions which are very popular with our users, but there's no reason to
>> suppose these will be the flavours of the day in 1 or 2 years' time. Not to
>> mention the potential expansion in non-coding annotations in a post-ENCODE
>> world. But of course I recognise flexibility scales inversely with ease of
>> parsing
>>
>> - beyond not wanting to disrupt our users' parsers, I don't have a
>> problem changing the delimiters etc that we currently use
>>
>> - some of our fields are duplicated as they are specific only to the
>> variant, not particularly to the allele/functional element combination -
>> e.g. the Existing_variation field. This is a hangover of the VCF output
>> being derived from our default output, which has one line per
>> allele/functional element combo. I'd also be in favour of resolving these
>> out somehow to reduce duplications
>>
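
The character replacement described in the first point above (spaces to "_",
commas to "&") can be sketched as follows. The function name escape_value is
hypothetical and this is not VEP's actual code.

```python
def escape_value(text):
    """Make a free-text value safe to embed in a comma-separated INFO value.

    Spaces become underscores and commas become ampersands, as described
    in the message above.
    """
    return text.replace(" ", "_").replace(",", "&")


escaped = escape_value("splice region, intron variant")
```

Note that this mapping is lossy: a value that already contained "_" or "&"
cannot be told apart from an escaped one after the fact.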
>> Regards
>>
>> Will McLaren
>> Ensembl Variation
>>
>>
>> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>>
>>> Hi Gonzalo,
>>>
>>> I welcome the idea of standardizing the functional annotations. Here is
>>> an example of a wildly evolved format that we have been using so far:
>>>
>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT alleles from Ensembl 66 VEP v2.4, format transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>>
>>> and two concrete examples, the first for a multiallelic site:
>>>
>>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>>
>>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>>
>>>
>>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>>
>>> I am curious what other formats are in use?
>>>
>>>
>>> I'd prefer not to introduce whitespace in the INFO field, change the
>>> column delimiters to spaces, or extend them to any whitespace; it would
>>> break existing software and wouldn't bring much benefit.
>>>
>>> Petr
>>>
>>>
>>>
>>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...>
>>> wrote:
>>> > > Hi Bradford,
>>> > >
>>> > > I do understand where you're coming from, but truthfully I'd prefer
>>> to go in
>>> > > the opposite direction once we're open to changing delimiters.  I've
>>> never
>>> > > quite understood why VCF is tab-delimited and not
>>> whitespace-delimited.
>>> >
>>> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel -
>>> please
>>> > keep that. It is a good thing!
>>> >
>>> > > You wouldn't believe how many times people have manually generated
>>> > > VCFs that were space-delimited and couldn't understand why they were
>>> > > failing in VCF parsers.
>>> >
>>> > I'd be asking why doesn't your parser give a clearer error message?
>>> > If you've seen people fall over this pothole many times the parser
>>> > concerned should be fixed.
>>> >
>>> > > I'd much rather that all whitespace be treated equally (as it is
>>> > > visually).  It makes for a much simpler spec.
>>> >
>>> > The problem with white space is you can't see how many characters
>>> > there are - spaces and tabs are not treated equally visually. What
>>> > would you expect if there were several spaces in a row? If you treat
>>> > it as one separator you prevent using empty cells (I'm thinking in
>>> > terms of generalities here, not just VCF).
>>> >
>>> > Regards,
>>> >
>>> > Peter
>>>
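
Peter's point about empty cells is easy to demonstrate. The sketch below uses
a made-up five-column row whose third column is empty, and compares splitting
on tabs with splitting on any run of whitespace in Python.

```python
# A made-up row with five tab-separated columns; the third one is empty.
row = "21\t26960070\t\tG\tA"

# Splitting on the tab character keeps the empty cell in place.
by_tab = row.split("\t")

# Splitting on arbitrary whitespace collapses the two adjacent tabs,
# so the empty cell disappears and later columns shift left.
by_whitespace = row.split()
```

Here by_tab has five elements and by_whitespace has only four, so any parser
that treats all whitespace as one separator would silently misalign columns.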
>>>
>>>
>>>
>>> --
>>>  The Wellcome Trust Sanger Institute is operated by Genome Research
>>>  Limited, a charity registered in England with number 1021457 and a
>>>  company registered in England with number 2742969, whose registered
>>>  office is 215 Euston Road, London, NW1 2BE.
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Monitor your physical, virtual and cloud infrastructure from a single
>>> web console. Get in-depth insight into apps, servers, databases, vmware,
>>> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
>>> Pricing starts from $795 for 25 servers or applications!
>>> http://p.sf.net/sfu/zoho_dev2dev_nov
>>> _______________________________________________
>>> VCFtools-spec mailing list
>>> VCF...@li...
>>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>>
>>
>>
>>
>>
>>
>
>
> --
> -----------------------------------------------------
> Hyun Min Kang, Ph.D.
> Assistant Professor of Biostatistics
> University of Michigan, Ann Arbor
> Email : hm...@um...
>
>
>
>
>