Re: [VCFtools-spec] Wishlist for 4.2: allow space characters in INFO; functional annotation

Hi all,
    We would like to speak in favor of allowing spaces in the INFO
column.  Many annotation and analysis pipelines copy descriptive
statements directly from a database, and these often include
whitespace.  Rather than requiring every such pipeline to add a
search/replace step to remove spaces, it would be better simply to
allow the space character.  Aside from the AWK issues, which are
easily dealt with, I don't foresee any serious problems with this
approach, as the space character has no other specified significance in
VCF.
Thanks,

Danny.
HGSC-BCM
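
A minimal sketch of the point about field splitting: a parser that splits on the tab character alone leaves spaces inside INFO untouched, while splitting on any whitespace (the AWK default) breaks the column apart. The record and the `NOTE` key below are made up for illustration and do not come from any real pipeline.

```python
# Hypothetical VCF data line whose INFO column contains spaces.
# The NOTE key and its value are illustrative only.
line = "21\t26960070\trs116645811\tG\tA\t.\t.\tNOTE=copied from a database"

# Splitting on tabs only keeps the eight fixed columns intact.
by_tab = line.split("\t")

# Splitting on any whitespace breaks the INFO column into several pieces.
by_whitespace = line.split()

print(len(by_tab))         # 8
print(by_tab[7])           # NOTE=copied from a database
print(len(by_whitespace))  # 11
```

In AWK the comparable step would be setting the field separator to a tab, for example with `-F'\t'`.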

On Mon 26 Nov 2012 04:03:43 PM CST, Heng Li wrote:
> Hyun, what's your annotation format in VCF? Could anyone show an
> example output of snpeff and/or other popular annotation tools?
> Perhaps it would be good to start from the existing ones and reach a
> consensus on how functional annotations should be stored in VCF. We
> then encourage everyone to stick to the convention. Note that I am
> thinking about a convention, which requires no changes to the spec.
>
> Heng
>
> On Nov 26, 2012, at 2:35 PM, Hyun Min Kang wrote:
>
>> Both Will's and Petr's suggestions make sense in a human-readable
>> format. There are a few points I'd like to highlight (not necessarily
>> with very solid proposals)
>>
>> *Whether to distinguish coding sequence annotation from annotation
>> of other elements?*
>>   1. We can separate a CDS annotation (which is associated with
>> protein changes or alternative splicing) from other annotations
>> (mostly based on genomic coordinates).
>>   2. Or we can make a CDS annotation a 'specialized, finer-grained'
>> form of a more general genomic annotation.
>>   - In the former case, we will need two separate annotations
>> (a CDS-specific one and a general, region-based one) for a coding
>> variant, and zero to multiple annotations for non-coding variants.
>>   - The first approach is probably easier to handle separately
>> for genomes and exomes. The benefit of the second approach is
>> the possibility of more fine-grained non-coding information later on
>> (as with missense or nonsense variants, some specific mutations
>> within a region could be characterized).
>>
>> *Whether (and how) to make a "dictionary" of the annotations in
>> the VCF header to facilitate automated parsing?*
>>  - The two suggested formats are very readable by humans, but to
>> enable automated queries without knowing the specific details
>> of how the annotation was performed, we will need a "dictionary" in
>> the header. For example, there needs to be a consensus on whether to
>> use the keyword "Transcript" to represent a transcript (as an Ensembl
>> ID?), and "Gene" to represent a gene (as a gene symbol?). Whether
>> to hard-code the hierarchical relationship between different elements
>> (e.g. Gene-Transcript-Exon-AminoAcid) or to allow flexibility there is
>> also an important issue.
>>  - The granularity of the "Consequence" part in particular varies quite
>> a lot between different annotation software. The consequences can also
>> be structured hierarchically (e.g. a frameshift indel is an
>> instance of a LoF variant). The question is whether to predefine such
>> rules in the VCF spec, or to have metadata representing the
>> relationship between different types of "functional consequences".
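
The hierarchy of consequences mentioned above could be represented as a simple child-to-parent map, whether it lives in the spec or in header metadata. This is only a sketch: the term names and the `PARENT` table are hypothetical and are not taken from the VCF spec or from any ontology release.

```python
# Hypothetical child -> parent map of consequence terms. The term names are
# illustrative and not taken from the VCF spec or any ontology release.
PARENT = {
    "frameshift_indel": "loss_of_function",
    "nonsense": "loss_of_function",
    "loss_of_function": "coding_variant",
    "missense": "coding_variant",
}

def is_a(term, ancestor):
    """Return True if `term` equals `ancestor` or descends from it."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)  # None once the root has been passed
    return False

print(is_a("frameshift_indel", "loss_of_function"))  # True
print(is_a("missense", "loss_of_function"))          # False
```

With such a table a query for "all LoF variants" can match records annotated at a finer granularity, without the querying tool knowing each annotator's vocabulary in advance.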
>>
>> *Distinguishing region-based annotations and variant-specific
>> annotations?*
>>  - Some annotation categories require only knowledge of the REF allele
>> (e.g. whether the variant overlaps with an exon or an ENCODE
>> region), and some categories require knowledge of both the REF and ALT
>> alleles (e.g. missense and nonsense variants). I prefer to separate
>> these two types of annotations, or to make the former (REF-only) an
>> instance of the latter (requiring REF and ALT). The features for
>> these types of annotations can also be different. For example, a
>> conservation score fits the former category (and the latter), and
>> it is possible to annotate a non-variant site without
>> having the ALT allele. However, PolyPhen scores cannot be assigned
>> without knowing the variant allele, so they are specific
>> to the latter category.
>>  - For the former category, it may be worthwhile to annotate every
>> possible genomic coordinate in some sharable format (e.g. covering
>> GENCODE and ENCODE annotation, conservation scores, 1000G
>> masks, ancestral allele information).
>>
>> *Whether to encode the functional annotation in the VCF only, or to
>> use a separate file if necessary?*
>>  - Writing out all detailed annotations for every variant may be quite
>> large and redundant in some cases (especially when considering
>> non-coding annotations). It is also possible to put only
>> "annotation IDs" in the VCF file and have a separate file describing
>> the details of the annotations (the separate file may also
>> represent the hierarchy of the annotations if necessary).
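
One way to picture the "annotation IDs plus separate file" idea is a lookup table keyed by ID. Everything named here is an assumption made for illustration: the `ANN_ID` INFO key, the IDs, and the record fields are hypothetical, and in practice the table would be loaded from the separate file.

```python
# Hypothetical side table keyed by annotation ID; in practice this would be
# loaded from a separate file. The IDs, the ANN_ID key and the record fields
# are all made up for this sketch.
ANNOTATIONS = {
    "A1": {"type": "exon", "parent": None},
    "A2": {"type": "missense", "parent": "A1"},
}

def resolve(info_value):
    """Expand a comma-separated list of annotation IDs into full records."""
    return [ANNOTATIONS[i] for i in info_value.split(",") if i in ANNOTATIONS]

info = "ANN_ID=A1,A2"
key, value = info.split("=", 1)
records = resolve(value)
print([r["type"] for r in records])  # ['exon', 'missense']
```

The `parent` field shows where the side file could also carry the hierarchy, so the VCF itself stays compact.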
>>
>>
>> On Mon, Nov 26, 2012 at 11:47 AM, Will McLaren <wm...@eb...
>> <mailto:wm...@eb...>> wrote:
>>
>>     Hello all,
>>
>>     I'm the lead developer on the Ensembl VEP (Variant Effect
>>     Predictor) software - I'd like to give the list our perspective
>>     on how we add functional annotations in VCF 4.1 currently.
>>
>>     The VEP parses VCF (alongside other formats) and users can choose
>>     to output in VCF format too (though this is not the default, many
>>     of our users use it).
>>
>>     The format for the functional data that we use is similar to that
>>     described by Petr (I suspect the example he shows is derived from
>>     VEP output of some form). We use the CSQ key in the INFO field,
>>     with the value consisting of "|" (pipe) separated chunks of data
>>     fields; the chunks themselves are separated by commas.
>>
>>     Each chunk contains functional annotation for one alt allele +
>>     functional element combination. At the moment a functional
>>     element can be a transcript, regulatory feature or transcription
>>     factor binding motif.
>>
>>     The fields and their order vary according to which command line
>>     options are used (and therefore which additional data is added).
>>     The order of fields is defined in a header line added to the VCF.
>>     The user may also specify a list of fields that they would like
>>     included, somewhat similar to a roll-your-own format.
>>
>>     Missing data are left empty (i.e. you will see two consecutive
>>     "|" delimiters if a field is empty).
>>
>>     Example:
>>
>>     ##fileformat=VCFv4.1
>>     ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>>     #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>>     21      26960070        rs116645811     G       A       .       .       CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
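
The CSQ layout described above (field order taken from the header, "|" between fields, commas between chunks, empty strings for missing data) can be read back roughly as follows. This is a sketch that assumes the header text after "Format: " is exactly the pipe-separated field list; the function names `csq_fields` and `parse_csq` are hypothetical and are not part of VEP.

```python
# Header and INFO value copied from the example in the message.
HEADER = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type '
          'as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|'
          'Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|'
          'Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">')
INFO = ('CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|'
        '1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|'
        'benign(0.001)')

def csq_fields(header_line):
    """Read the field names that follow 'Format: ' in the CSQ header line."""
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip('">').split("|")

def parse_csq(info, fields):
    """Return one dict per comma-separated chunk of the CSQ value."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            chunks = entry[len("CSQ="):].split(",")
            return [dict(zip(fields, chunk.split("|"))) for chunk in chunks]
    return []

record = parse_csq(INFO, csq_fields(HEADER))[0]
print(record["Consequence"])   # missense_variant
print(repr(record["INTRON"]))  # ''
print(record["SIFT"])          # tolerated(0.05)
```

Because the field list is read from the header rather than hard-coded, the same code copes with the field order changing between runs.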
>>
>>     Some thoughts from me:
>>
>>     - I would definitely prefer to avoid introducing additional
>>     whitespace. Currently I am whitespace ambivalent when parsing
>>     input; changing this would cause a lot of problems for users
>>     without any major benefits that I can see. In the few cases we
>>     have to push in data that might have spaces, we replace them with
>>     "_" underscores (and commas are replaced by "&" ampersands)
>>
>>     - I would be strongly in favour of enforcing standards on the
>>     functional types called - we use Sequence Ontology (SO) types,
>>     and we've encouraged UCSC (successfully) and NCBI/dbSNP (not yet)
>>     to switch to using them too. The SO guys are very open to
>>     contributions if there are types not yet described
>>
>>     - some flexibility in the data fields that go with the functional
>>     annotation would be great - we report, for example, SIFT and
>>     PolyPhen predictions which are very popular with our users, but
>>     there's no reason to suppose these will be the flavours of the
>>     day in 1 or 2 years' time. Not to mention the potential expansion
>>     in non-coding annotations in a post-ENCODE world. But of course I
>>     recognise flexibility scales inversely with ease of parsing
>>
>>     - beyond not wanting to disrupt our users' parsers, I don't have
>>     a problem changing the delimiters etc that we currently use
>>
>>     - some of our fields are duplicated as they are specific only to
>>     the variant, not particularly to the allele/functional element
>>     combination - e.g. the Existing_variation field. This is a
>>     hangover of the VCF output being derived from our default output,
>>     which has one line per allele/functional element combo. I'd also
>>     be in favour of resolving these out somehow to reduce duplications
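
The substitution mentioned in the first point above (spaces to "_", commas to "&") amounts to a pair of string replacements. A minimal sketch; the function name is hypothetical, and the mapping is lossy because an original "_" or "&" cannot be distinguished afterwards.

```python
def escape_info_value(text):
    """Replace characters that clash with INFO parsing.

    Follows the convention described in the message: spaces become "_"
    and commas become "&". The mapping is lossy, since an original "_"
    or "&" cannot be told apart afterwards.
    """
    return text.replace(" ", "_").replace(",", "&")

print(escape_info_value("tolerated, low confidence"))
# tolerated&_low_confidence
```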
>>
>>     Regards
>>
>>     Will McLaren
>>     Ensembl Variation
>>
>>
>>     On 26 November 2012 12:52, Petr Danecek <pd...@sa...
>>     <mailto:pd...@sa...>> wrote:
>>
>>         Hi Gonzalo,
>>
>>         I welcome the idea of standardizing the functional annotations.
>>         Here is an example of a wildly evolved format that we have been
>>         using so far:
>>
>>         ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT alleles from Ensembl 66 VEP v2.4, format transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>
>>         and two concrete examples, the first for a multiallelic site:
>>
>>         CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3+ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>
>>         CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
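
A rough sketch of how a single entry in this colon-separated format might be split, based only on the second example above. It assumes the first three parts are always transcript, gene and consequence, and that any later part containing a comma is a `name,value` prediction; both assumptions come from reading the examples, not from a published definition. It does not handle the "+" separator between entries or the optional GERP score.

```python
def parse_entry(entry):
    """Split one colon-separated consequence entry into named parts."""
    parts = entry.split(":")
    record = dict(zip(("transcriptId", "geneName", "consequence"), parts[:3]))
    positional = []
    predictions = {}
    for part in parts[3:]:
        if "," in part:
            # Assumed to be a "name,value" prediction such as SIFT,tolerated(0.41)
            name, value = part.split(",", 1)
            predictions[name] = value
        else:
            positional.append(part)
    record.update(zip(
        ("codingSeqPosition", "proteinPosition", "proteinAlleles"), positional))
    record["predictions"] = predictions
    return record

entry = ("ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:"
         "SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):"
         "Grantham,83")
parsed = parse_entry(entry)
print(parsed["geneName"])             # DEFB125
print(parsed["proteinAlleles"])       # H>Y
print(parsed["predictions"]["SIFT"])  # tolerated(0.41)
```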
>>
>>         I am curious what other formats are in use?
>>
>>
>>         I'd prefer not to introduce whitespace in the INFO field, or to
>>         change the column delimiters to spaces or extend them to any
>>         whitespace; it would break existing software and wouldn't bring
>>         much benefit.
>>
>>         Petr
>>
>>
>>
>>         On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>         > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br... <mailto:eb...@br...>> wrote:
>>         > > Hi Bradford,
>>         > >
>>         > > I do understand where you're coming from, but truthfully I'd prefer to go in
>>         > > the opposite direction once we're open to changing delimiters.  I've never
>>         > > quite understood why VCF is tab-delimited and not whitespace-delimited.
>>         >
>>         > Tab separated makes it easy to use in Galaxy, R, etc, even Excel - please
>>         > keep that. It is a good thing!
>>         >
>>         > > You wouldn't believe how many times people have manually generated
>>         > > VCFs that were space-delimited and couldn't understand why they were
>>         > > failing in VCF parsers.
>>         >
>>         > I'd be asking why doesn't your parser give a clearer error message?
>>         > If you've seen people fall over this pothole many times the parser
>>         > concerned should be fixed.
>>         >
>>         > > I'd much rather that all whitespace be treated equally (as it is
>>         > > visually).  It makes for a much simpler spec.
>>         >
>>         > The problem with white space is you can't see how many characters
>>         > there are - spaces and tabs are not treated equally visually. What
>>         > would you expect if there were several spaces in a row? If you treat
>>         > it as one separator you prevent using empty cells (I'm thinking in
>>         > terms of generalities here, not just VCF).
>>         >
>>         > Regards,
>>         >
>>         > Peter
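
Both points in this exchange can be illustrated briefly: collapsing runs of whitespace into one separator loses empty cells, and a parser can detect the common space-delimited mistake and say so. This is a sketch, not how any existing VCF parser behaves; `split_record` is a hypothetical helper and the wording of the error message is made up.

```python
def split_record(line):
    """Split a VCF data line on tabs, with a clearer error for spaces."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        # Enough whitespace-separated tokens suggests a space-delimited file.
        if len(line.split()) >= 8:
            raise ValueError(
                "fewer than 8 tab-separated columns; the line looks "
                "space-delimited, but VCF columns must be separated by tabs")
        raise ValueError("fewer than 8 tab-separated columns")
    return columns

# Collapsing runs of whitespace loses empty cells; splitting on tabs keeps them.
print("a\t\tc".split("\t"))  # ['a', '', 'c']
print("a\t\tc".split())      # ['a', 'c']
```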
>>
>>
>>
>>
>>         --
>>          The Wellcome Trust Sanger Institute is operated by Genome
>>         Research
>>          Limited, a charity registered in England with number 1021457
>>         and a
>>          company registered in England with number 2742969, whose
>>         registered
>>          office is 215 Euston Road, London, NW1 2BE.
>>
>>         _______________________________________________
>>         VCFtools-spec mailing list
>>         VCF...@li...
>>         <mailto:VCF...@li...>
>>         https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>
>> --
>> -----------------------------------------------------
>> Hyun Min Kang, Ph.D.
>> Assistant Professor of Biostatistics
>> University of Michigan, Ann Arbor
>> Email : hm...@um... <mailto:hm...@um...>