Re: [VCFtools-spec] Wishlist for 4.2: allow space characters in INFO; functional annotation


I agree that we should have allowed spaces in INFO in VCF v4. However, changing it now is a little late. Quite a lot of in-house scripts may have been written assuming no spaces in INFO because the 4.1 spec mandates this. Allowing spaces will break all of them, which is a serious concern. The best solution now is simply to replace spaces. It requires just one additional line of code.
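
As a rough illustration of that one-line replacement, here is a minimal Python sketch. The function name `sanitize_info_value` and the choice of an underscore as the substitute character are assumptions made for the example; nothing in the 4.1 spec prescribes a particular replacement.

```python
def sanitize_info_value(text: str) -> str:
    """Return the text with every space replaced by an underscore."""
    # The actual fix is this single call; the wrapper only gives it a name.
    return text.replace(" ", "_")


# Hypothetical description copied verbatim from an annotation database.
print(sanitize_info_value("zinc finger protein"))
```

Only the space character is handled here; other characters that are reserved in INFO (such as semicolons) would need their own treatment.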

Heng

On Dec 3, 2012, at 3:32 PM, Danny Challis wrote:

> Hi all,
>    We would like to speak in favor of allowing for spaces in the INFO 
> column.  Many annotation and analysis pipelines involve copying 
> descriptive statements directly from a database which often includes 
> whitespace.  Rather than requiring all such pipelines to include an 
> additional search/replace function to remove spaces, it would be better 
> to simply allow the space characters.  Aside from the AWK issues which 
> are easily dealt with, I don't foresee any serious problems with this 
> approach as the space character has no other specified significance in 
> VCF.
> Thanks,
> 
> Danny.
> HGSC-BCM
> 
> On Mon 26 Nov 2012 04:03:43 PM CST, Heng Li wrote:
>> Hyun, what's your annotation format in VCF? Could anyone show an
>> example output of snpeff and/or other popular annotation tools?
>> Perhaps it would be good to start from the existing ones and reach a
>> consensus on how functional annotations should be stored in VCF. We
>> then encourage everyone to stick to the convention. Note that I am
>> thinking about a convention, which requires no changes to the spec.
>> 
>> Heng
>> 
>> On Nov 26, 2012, at 2:35 PM, Hyun Min Kang wrote:
>> 
>>> Both Will's and Petr's suggestions make sense in a human-readable
>>> format. There are a few points I'd like to highlight (not necessarily
>>> with very solid proposals)
>>> 
>>> ** Whether to distinguish coding sequence annotation from annotation
>>> of other elements?*
>>>  1. We can separate a CDS annotation (which is associated with
>>> protein changes or alternative splicing) and other annotations
>>> (mostly based on genomic coordinates)
>>>  2. Or we can make a CDS annotation as a 'specialized, and
>>> finer-grained' annotation of a more general form of other genomic
>>> annotations.
>>>  - In the former case, we will need two separate annotations
>>> (CDS-specific one, and a general, region-based one) for a coding
>>> variant, and zero to multiple annotations for non-coding variants
>>>  - The first approach is probably easier to handle separately
>>> between genomes and exomes, and the benefit of second approaches is
>>> the possibility of more fine-grained non-coding information later on
>>> (such as missense or nonsense variants, some specific mutations
>>> within a region can be characterized)
>>> 
>>> ** Whether (and how) to make the "dictionary" of the annotations in
>>> the VCF header for facilitating automated parsing?*
>>> - The two suggested formats are very well readable by human, but in
>>> order to enable automated query without knowing the specific details
>>> on how the annotation was performed, we will need a "dictionary" in
>>> the header. For example, there needs to be a consensus on whether to
>>> use the keyword "Transcript" to represent a transcript (in Ensembl
>>> ID?), and "Gene" to represent a gene (in gene symbol?). Also, whether
>>> to hard-code the hierarchical relationship between different elements
>>> (e.g. Gene-Transcript-Exon-AminoAcid), or allow flexibility there is
>>> an important issue.
>>> - Especially the granularity of the "Consequence" part varies quite
>>> a lot between different annotation softwares. And the consequence can
>>> be structured in a hierarchical way (e.g. a frameshift indel is an
>>> instance of LoF variant). Whether to predefine such rule in the VCF
>>> spec, or have a meta-data representing the relationship between
>>> different types of "functional consequences"
>>> 
>>> ** Distinguishing region-based annotations and variant-specific
>>> annotations?*
>>> - Some annotation category requires only the knowledge of REF allele
>>> (e.g. whether the variant overlaps with an exon, or an ENCODE
>>> region), and some category requires the knowledge of both REF and ALT
>>> alleles (e.g. missense and nonsense variants). I prefer to separate
>>> these two types of annotations, or making the former (REF-only) as an
>>> instance of the latter (REF and ALT requiring). The features for
>>> these types of annotations can also be different. For example,
>>> conservation score fits well to the former (and latter) category and
>>> it is possible to make an annotation on non-variant site without
>>> having the ALT allele. However, polyphen scores do not conform to the
>>> former category without knowing the variant allele, so it is specific
>>> to latter category.
>>> - For the former category, it may be worthwhile to annotate every
>>> possible genomic coordinate as some sharable format (e.g. this
>>> includes, gencode and ENCODE annotation, conservation scores, 1000G
>>> masks, ancestral allele information)
>>> 
>>> ** Whether to encode the function annotation into VCF only, or make a
>>> separate file if necessary?*
>>> - Writing down all detailed annotation on every variant may be quite
>>> large and redundant in some cases (especially when considering
>>> non-coding annotations). It is also possible to consider to put only
>>> "annotation IDs" in the VCF file and have a separate file describing
>>> the details of the annotations (and the separate file may also
>>> represent the hierarchy of the annotations if necessary).
>>> 
>>> 
>>> On Mon, Nov 26, 2012 at 11:47 AM, Will McLaren <wm...@eb...
>>> <mailto:wm...@eb...>> wrote:
>>> 
>>>    Hello all,
>>> 
>>>    I'm the lead developer on the Ensembl VEP (Variant Effect
>>>    Predictor) software - I'd like to give the list our perspective
>>>    on how we add functional annotations in VCF 4.1 currently.
>>> 
>>>    The VEP parses VCF (alongside other formats) and users can choose
>>>    to output in VCF format too (though this is not the default, many
>>>    of our users use it).
>>> 
>>>    The format for the functional data that we use is similar to that
>>>    described by Petr (I suspect the example he shows is derived from
>>>    VEP output of some form). We use the CSQ key in the INFO field,
>>>    with the value consisting of "|" (pipe) separated chunks of data
>>>    fields; the chunks themselves are separated by commas.
>>> 
>>>    Each chunk contains functional annotation for one alt allele +
>>>    functional element combination. At the moment a functional
>>>    element can be a transcript, regulatory feature or transcription
>>>    factor binding motif.
>>> 
>>>    The fields and their order vary according to which command line
>>>    options are used (and therefore which additional data is added).
>>>    The order of fields is defined in a header line added to the VCF.
>>>    The user may also specify a list of fields that they would like
>>>    included, somewhat similar to a roll-your-own format.
>>> 
>>>    Missing data are left empty (i.e. you will see two consecutive
>>>    "|" delimiters if a field is empty).
>>> 
>>>    Example:
>>> 
>>>    ##fileformat=VCFv4.1
>>>    ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type
>>>    as predicted by VEP. Format:
>>>    Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>>>    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>>>    21      26960070        rs116645811     G       A       .       .
>>> 
>>>    CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
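
The CSQ layout described above (field order declared in the header, "|" between fields, "," between chunks, empty strings for missing data) could be unpacked roughly as follows. This is a sketch in Python and not VEP's own parser: the function names `csq_field_names` and `parse_csq` are hypothetical, and the header is re-joined onto one line because the email wrapped it.

```python
import re

# Header and INFO value copied from the example above.
CSQ_HEADER = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type '
    "as predicted by VEP. Format: "
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|"
    "Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|"
    'DISTANCE|SIFT|PolyPhen">'
)
INFO = (
    "CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|"
    "1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)"
)


def csq_field_names(header_line):
    """Read the field order from the 'Format:' part of the CSQ header."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' list found in the CSQ header")
    return match.group(1).strip().split("|")


def parse_csq(info, names):
    """Return one dict per comma-separated chunk of the CSQ value."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "CSQ":
            return [dict(zip(names, chunk.split("|")))
                    for chunk in value.split(",")]
    return []  # no CSQ key in this INFO string


names = csq_field_names(CSQ_HEADER)
records = parse_csq(INFO, names)
print(records[0]["Consequence"], records[0]["SIFT"])
```

Because the field order is read from the header instead of being hard-coded, the same sketch would cope with the option-dependent field lists mentioned above, as long as the header and the records agree.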
>>> 
>>>    Some thoughts from me:
>>> 
>>>    - I would definitely prefer to avoid introducing additional
>>>    whitespace. Currently I am whitespace ambivalent when parsing
>>>    input; changing this would cause a lot of problems for users
>>>    without any major benefits that I can see. In the few cases we
>>>    have to push in data that might have spaces, we replace them with
>>>    "_" underscores (and commas are replaced by "&" ampersands)
>>> 
>>>    - I would be strongly in favour of enforcing standards on the
>>>    functional types called - we use Sequence Ontology (SO) types,
>>>    and we've encouraged UCSC (successfully) and NCBI/dbSNP (not yet)
>>>    to switch to using them too. The SO guys are very open to
>>>    contributions if there are types not yet described
>>> 
>>>    - some flexibility in the data fields that go with the functional
>>>    annotation would be great - we report, for example, SIFT and
>>>    PolyPhen predictions which are very popular with our users, but
>>>    there's no reason to suppose these will be the flavours of the
>>>    day in 1 or 2 years' time. Not to mention the potential expansion
>>>    in non-coding annotations in a post-ENCODE world. But of course I
>>>    recognise flexibility scales inversely with ease of parsing
>>> 
>>>    - beyond not wanting to disrupt our users' parsers, I don't have
>>>    a problem changing the delimiters etc that we currently use
>>> 
>>>    - some of our fields are duplicated as they are specific only to
>>>    the variant, not particularly to the allele/functional element
>>>    combination - e.g. the Existing_variation field. This is a
>>>    hangover of the VCF output being derived from our default output,
>>>    which has one line per allele/functional element combo. I'd also
>>>    be in favour of resolving these out somehow to reduce duplications
>>> 
>>>    Regards
>>> 
>>>    Will McLaren
>>>    Ensembl Variation
>>> 
>>> 
>>>    On 26 November 2012 12:52, Petr Danecek <pd...@sa...
>>>    <mailto:pd...@sa...>> wrote:
>>> 
>>>        Hi Gonzalo,
>>> 
>>>        I welcome the idea of standardizing the functional
>>>        annotations. Here is
>>>        an example of a wildly evolved format that we have been using
>>>        so far:
>>> 
>>>        ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence
>>>        of the ALT
>>>        alleles from Ensembl 66 VEP v2.4, format
>>>        transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>> 
>>>        and two concrete examples, the first for a multiallelic site:
>>> 
>>>        CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>>        +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>> 
>>>        CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>> 
>>>        I am curious what other formats are in use?
>>> 
>>> 
>>>        I'd prefer not to introduce whitespaces in the INFO field or
>>>        change the
>>>        column delimiters to spaces or extend to whitespaces; it
>>>        would break
>>>        existing software and wouldn't bring much benefit.
>>> 
>>>        Petr
>>> 
>>> 
>>> 
>>>        On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>>> On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks
>>>        <eb...@br...
>>>        <mailto:eb...@br...>> wrote:
>>>>> Hi Bradford,
>>>>> 
>>>>> I do understand where you're coming from, but truthfully
>>>        I'd prefer to go in
>>>>> the opposite direction once we're open to changing
>>>        delimiters.  I've never
>>>>> quite understood why VCF is tab-delimited and not
>>>        whitespace-delimited.
>>>> 
>>>> Tab separated makes it easy to use in Galaxy, R, etc, even
>>>        Excel - please
>>>> keep that. It is a good thing!
>>>> 
>>>>> You wouldn't believe how many times people have manually
>>>        generated
>>>>> VCFs that were space-delimited and couldn't understand
>>>        why they were
>>>>> failing in VCF parsers.
>>>> 
>>>> I'd be asking why doesn't your parser give a clearer error
>>>        message?
>>>> If you've seen people fall over this pothole many times the
>>>        parser
>>>> concerned should be fixed.
>>>> 
>>>>> I'd much rather that all whitespace be treated equally
>>>        (as it is
>>>>> visually).  It makes for a much simpler spec.
>>>> 
>>>> The problem with white space is you can't see how many
>>>        characters
>>>> there are - spaces and tabs are not treated equally
>>>        visually. What
>>>> would you expect if there were several spaces in a row? If
>>>        you treat
>>>> it as one separator you prevent using empty cells (I'm
>>>        thinking in
>>>> terms of generalities here, not just VCF).
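
The point about empty cells can be seen in a short Python sketch. The row below is a made-up tab-separated line with an empty third column, used only to illustrate the general behaviour and not a valid VCF record.

```python
# Hypothetical tab-separated row whose third column is empty.
row = "21\t26960070\t\tG\tA"

by_tab = row.split("\t")      # every single tab is a separator
by_whitespace = row.split()   # any run of whitespace counts as one separator

print(len(by_tab), by_tab)
print(len(by_whitespace), by_whitespace)
```

Splitting on tabs keeps the empty cell in place, while splitting on arbitrary whitespace silently drops it and shifts every later column one position to the left.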
>>>> 
>>>> Regards,
>>>> 
>>>> Peter
>>> 
>>> 
>>> 
>>> 
>>>        --
>>>         The Wellcome Trust Sanger Institute is operated by Genome
>>>        Research
>>>         Limited, a charity registered in England with number 1021457
>>>        and a
>>>         company registered in England with number 2742969, whose
>>>        registered
>>>         office is 215 Euston Road, London, NW1 2BE.
>>> 
>>>        _______________________________________________
>>>        VCFtools-spec mailing list
>>>        VCF...@li...
>>>        <mailto:VCF...@li...>
>>>        https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> -----------------------------------------------------
>>> Hyun Min Kang, Ph.D.
>>> Assistant Professor of Biostatistics
>>> University of Michigan, Ann Arbor
>>> Email : hm...@um... <mailto:hm...@um...>
>>> 
>> 
>> 
>> 
> 
> 
> 
