Re: [VCFtools-spec] Wishlist for 4.2: allow space characters in INFO; functional annotation

Hello Anja,

Thank you for raising this. VCF allows you to define your own INFO fields,
so much of what you propose is already valid within the specification.
However, while there is nothing technically wrong with them, I would
suggest that SV, I, and FT are perhaps not ideal abbreviations. SV could be
confused for Structural Variant, FT is already used to define a filter in
the FORMAT field, and for "I" we suggest avoiding single letter fields.

As for List, isn't this already allowed by the "Number" part of the INFO
definition? For example,

##INFO=<ID=LIST,Number=4,Type=String,Description="A list with 4 entries">

Does this not meet your requirements?
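
As a rough illustration of the Number=4 approach, here is a minimal Python sketch that reads the header line above and checks that an INFO value carries four comma-separated entries. The helper names parse_info_header and split_info_value are made up for this example, the value "a,b,c,d" is hypothetical, and the regular expression only copes with simple header lines like this one.

```python
import re

def parse_info_header(line):
    """Parse a simple ##INFO=<...> meta-line into a dict (illustrative only)."""
    match = re.match(r'##INFO=<(.*)>$', line.strip())
    if match is None:
        raise ValueError("not an INFO meta-line")
    # Each field is key=value; a value may be a double-quoted string.
    pairs = re.findall(r'(\w+)=("[^"]*"|[^,]*)', match.group(1))
    return {key: value.strip('"') for key, value in pairs}

def split_info_value(value, header):
    """Split a comma-separated INFO value and check it against Number."""
    entries = value.split(",")
    number = header["Number"]
    # Only fixed integer counts are checked; "." and similar are accepted as-is.
    if number.isdigit() and len(entries) != int(number):
        raise ValueError(f"expected {number} entries, got {len(entries)}")
    return entries

header = parse_info_header(
    '##INFO=<ID=LIST,Number=4,Type=String,Description="A list with 4 entries">'
)
entries = split_info_value("a,b,c,d", header)  # hypothetical INFO value
```

A value with three or five entries would raise a ValueError here, which is the kind of check a fixed Number makes possible.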

Kind regards,

Adam

On 20 February 2013 06:49, Anja Thormann <an...@eb...> wrote:

> Hello,
>
> We want to provide data dumps in VCF format for the next ensembl release
> 71. We already provide data dumps in GVF format.
> The first step for us would be to parse existing GVF files to VCF files:
> However, so far VCF has no predefined way of storing variant effects.
> Using the given tools provided by
> the VCF specification and following GVF specification and Will's
> suggestions we came up with the following:
>
> In GVF a variant effect is defined as:
> Variant_effect=sequence_variant index feature_type feature_ID feature_ID (
> http://www.sequenceontology.org/resources/gvf.html#gvf_pragmas)
>
> In VCF the variant effect could be part of the INFO column:
> 1. Define INFO fields:
> ##INFO=<ID=SV,Number=String,Type=,Description="Sequence_variant. Term that
> describes the effect of the sequence_alteration on a sequence feature. SO
> term.">
> ##INFO=<ID=I,Number=Integer,Type=,Description="Index. 0-based index value
> that identifies which Variant_seq the effect is being described for.">
> ##INFO=<ID=FT,Number=String,Type=,Description="Feature type. Sequence
> feature that is being affected. SO term.">
> ##INFO=<ID=FID,Number=List,Type=,Description="Feature IDs. These feature
> IDs correspond to ID attributes in a GFF3 file that describe the sequence
> features.">
>
> 2. Introduce a Format tag to the INFO field and allow a new Type: List?
> A list is separated by commas.
> ##INFO=<ID=VE,Number=.,Type=List,Description="Variant effect: Effect that
> a sequence alteration has on a sequence feature that overlaps
> it.",Format="SV|I|FT|FID">
>
> Then a possible row in a VCF file could look like this:
> 1 847514 rs28651100 C T . .
> VE=downstream_gene_variant|0|transcript|ENST00000417705,upstream_gene_variant|0|transcript|ENST00000398216,nc_transcript_variant|0|ncRNA|ENST00000448179,non_coding_exon_variant|0|ncRNA|ENST00000448179;VS_Freq;VS_1000G;dbSNP_137
>
> At least this is what we will include in our VCF (v4.1) files and
> hopefully this will not clash with existing parsers.
>
> I would be very interested in some feedback.
>
> Best regards,
>
> Anja Thormann
> Ensembl-Variation
>
> On 26 Nov 2012, at 16:47, Will McLaren wrote:
>
> Hello all,
>
> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor)
> software - I'd like to give the list our perspective on how we add
> functional annotations in VCF 4.1 currently.
>
> The VEP parses VCF (alongside other formats) and users can choose to
> output in VCF format too (though this is not the default, many of our users
> use it).
>
> The format for the functional data that we use is similar to that
> described by Petr (I suspect the example he shows is derived from VEP
> output of some form). We use the CSQ key in the INFO field, with the value
> consisting of "|" (pipe) separated chunks of data fields; the chunks
> themselves are separated by commas.
>
> Each chunk contains functional annotation for one alt allele + functional
> element combination. At the moment a functional element can be a
> transcript, regulatory feature or transcription factor binding motif.
>
> The fields and their order vary according to which command line options
> are used (and therefore which additional data is added). The order of
> fields is defined in a header line added to the VCF. The user may also
> specify a list of fields that they would like included, somewhat similar to
> a roll-your-own format.
>
> Missing data are left empty (i.e. you will see two consecutive "|"
> delimiters if a field is empty).
>
> Example:
>
> ##fileformat=VCFv4.1
> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
> predicted by VEP. Format:
> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
> 21      26960070        rs116645811     G       A       .       .
> CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
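
The layout described above (comma-separated chunks, pipe-separated fields, field order taken from the header) can be read back with a few lines of Python. This is only a sketch: csq_field_names and parse_csq are hypothetical helper names, not part of VEP, and it assumes that commas inside field values have already been replaced, so splitting on "," is safe.

```python
def csq_field_names(description):
    """Extract the pipe-separated field names that follow 'Format:'."""
    return description.split("Format:", 1)[1].strip().split("|")

def parse_csq(value, field_names):
    """Return one dict per comma-separated chunk; empty fields become None."""
    records = []
    for chunk in value.split(","):
        fields = chunk.split("|")
        records.append(
            {name: (field or None) for name, field in zip(field_names, fields)}
        )
    return records

# Description and CSQ value copied from the example above.
description = (
    "Consequence type as predicted by VEP. Format: "
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|"
    "CDS_position|Protein_position|Amino_acids|Codons|"
    "Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen"
)
csq = (
    "A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|"
    "1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||"
    "tolerated(0.05)|benign(0.001)"
)
records = parse_csq(csq, csq_field_names(description))
```

Because the field names come from the header rather than being hard-coded, the same code keeps working when command line options change which fields are present.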
>
> Some thoughts from me:
>
> - I would definitely prefer to avoid introducing additional whitespace.
> Currently I am whitespace ambivalent when parsing input; changing this
> would cause a lot of problems for users without any major benefits that I
> can see. In the few cases we have to push in data that might have spaces,
> we replace them with "_" underscores (and commas are replaced by "&"
> ampersands)
>
> - I would be strongly in favour of enforcing standards on the functional
> types called - we use Sequence Ontology (SO) types, and we've encouraged
> UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using them too.
> The SO guys are very open to contributions if there are types not yet
> described
>
> - some flexibility in the data fields that go with the functional
> annotation would be great - we report, for example, SIFT and PolyPhen
> predictions which are very popular with our users, but there's no reason to
> suppose these will be the flavours of the day in 1 or 2 years' time. Not to
> mention the potential expansion in non-coding annotations in a post-ENCODE
> world. But of course I recognise flexibility scales inversely with ease of
> parsing
>
> - beyond not wanting to disrupt our users' parsers, I don't have a problem
> changing the delimiters etc that we currently use
>
> - some of our fields are duplicated as they are specific only to the
> variant, not particularly to the allele/functional element combination -
> e.g. the Existing_variation field. This is a hangover of the VCF output
> being derived from our default output, which has one line per
> allele/functional element combo. I'd also be in favour of resolving these
> out somehow to reduce duplications
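
The replacement rule mentioned in the first point (spaces become "_", commas become "&") could look roughly like this in Python. The function name escape_info_value and the sample string are hypothetical, and this is not the VEP code itself.

```python
def escape_info_value(text):
    """Replace characters that would clash with INFO delimiters."""
    # Spaces become underscores; commas become ampersands.
    return text.replace(" ", "_").replace(",", "&")

escaped = escape_info_value("a b,c")  # hypothetical input
```

Note that the substitution is one-way: a value that already contained "_" or "&" cannot be told apart from an escaped one afterwards.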
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
>
> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>
>> Hi Gonzalo,
>>
>> I welcome the idea of standardizing the functional annotations. Here is
>> an example of a wildly evolved format that we have been using so far:
>>
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT
>> alleles from Ensembl 66 VEP v2.4, format
>>
>> transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>
>> and two concrete examples, the first for a multiallelic site:
>>
>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>
>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>
>>
>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>
>> I am curious what other formats are in use?
>>
>>
>> I'd prefer not to introduce whitespaces in the INFO field or change the
>> column delimiters to spaces or extend to whitespaces; it would break
>> existing software and wouldn't bring much benefit.
>>
>> Petr
>>
>>
>>
>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...>
>> wrote:
>> > > Hi Bradford,
>> > >
>> > > I do understand where you're coming from, but truthfully I'd prefer
>> to go in
>> > > the opposite direction once we're open to changing delimiters.  I've
>> never
>> > > quite understood why VCF is tab-delimited and not
>> whitespace-delimited.
>> >
>> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel -
>> please
>> > keep that. It is a good thing!
>> >
>> > > You wouldn't believe how many times people have manually generated
>> > > VCFs that were space-delimited and couldn't understand why they were
>> > > failing in VCF parsers.
>> >
>> > I'd be asking why doesn't your parser give a clearer error message?
>> > If you've seen people fall over this pothole many times the parser
>> > concerned should be fixed.
>> >
>> > > I'd much rather that all whitespace be treated equally (as it is
>> > > visually).  It makes for a much simpler spec.
>> >
>> > The problem with white space is you can't see how many characters
>> > there are - spaces and tabs are not treated equally visually. What
>> > would you expect if there were several spaces in a row? If you treat
>> > it as one separator you prevent using empty cells (I'm thinking in
>> > terms of generalities here, not just VCF).
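
The empty-cell point can be shown with a short Python sketch. The row below is a made-up tab-separated line with an empty third cell, not a real VCF record.

```python
row = "chr1\t100\t\tA"  # hypothetical row; the third cell is empty

by_tab = row.split("\t")     # splits on every single tab, keeps the empty cell
by_whitespace = row.split()  # treats any run of whitespace as one separator
```

Here by_tab has four cells and by_whitespace has three, so the columns after the empty cell shift left when runs of whitespace count as a single separator.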
>> >
>> > Regards,
>> >
>> > Peter
>>
>>
>>
>>
>> --
>>  The Wellcome Trust Sanger Institute is operated by Genome Research
>>  Limited, a charity registered in England with number 1021457 and a
>>  company registered in England with number 2742969, whose registered
>>  office is 215 Euston Road, London, NW1 2BE.
>>
>>
>> _______________________________________________
>> VCFtools-spec mailing list
>> VCF...@li...
>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>
>
>

-- 
Adam Auton
Assistant Professor,
Department of Genetics,
Albert Einstein College of Medicine,
1301 Morris Park Avenue,
Van Etten B06,
Bronx, New York 10461

Tel: +1 (718) 839 7216