Re: [VCFtools-spec] Wishlist for 4.2: allow space characters in INFO; functional annotation

If the number of entries is unknown, then you can just specify this in the
INFO metadata using "Number=.".

##INFO=<ID=LIST,Number=.,Type=String,Description="A list with an unknown number of entries">

This would define a comma-separated list of strings. I don't think you
could use tab or semicolon characters within the strings, as these would
clash with the delimiters that VCF already uses, but otherwise it would
be fine.
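
As a rough illustration of how a generic parser might read such a field,
here is a minimal Python sketch. It is not taken from VCFtools; the function
name parse_info is made up for this example, and it assumes that values
contain no ';', '=' or ',' characters of their own.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict (hypothetical helper).

    Keys without '=' are treated as flags and mapped to True.
    All other values are returned as lists, split on ','.
    """
    fields = {}
    if info == ".":  # '.' marks an empty INFO column
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value.split(",") if sep else True
    return fields


print(parse_info("LIST=a,b,c;DP=10;DB"))
```

With Number=. the length of the list is simply whatever the split returns,
so nothing has to be declared up front.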

Alternatively, you could just define your list as a single string using
an ad hoc delimiter, which would be supported by your software (although not
by other parsers of VCF). Something like:

##INFO=<ID=LIST,Number=1,Type=String,Description="A simple string">

which you could use within the VCF file as:

LIST="Effect_Impact|Functional_Class|Codon_Change".
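
To make the trade-off concrete: a generic parser would hand back the whole
value as one string, and only software that knows about the '|' convention
would split it further. A minimal Python sketch, where split_adhoc is a
hypothetical name and the surrounding quotes are stripped only because the
example above shows them:

```python
def split_adhoc(value, delimiter="|"):
    """Split a single-string INFO value on an ad hoc delimiter (hypothetical helper)."""
    return value.strip('"').split(delimiter)


value = '"Effect_Impact|Functional_Class|Codon_Change"'
print(split_adhoc(value))
```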

Adam

On 20 February 2013 10:58, Anja Thormann <an...@eb...> wrote:

> Thank you for the clarification. That is exactly what I was asking for: A
> standardised way of describing multi-value (number unknown) key=value pairs
> in the INFO column.
> This would help a lot with storing e.g. variant effect data.
> Anja
>
>
> On 20 Feb 2013, at 15:07, bradford powell wrote:
>
> I think what Anja's proposal is trying to do is to group a set of info
> tags that should appear in association with each other. This is similar to
> how effects are encoded in VCF by SnpEff. The idea is to keep the sequence
> alteration associated with, for instance, a specific transcript. It is
> another level of "join" to the denormalization of the VCF format:
>
> 1. A variant site can have one or more alternate alleles
> 2. A called variant site can have zero or more sample calls (the end of
> the line, after FORMAT)
> 3. Each sample call can have one or more pieces of sample-level
> information (i.e. GT, PL, etc...)
> 4. Each variant site can have zero or more FILTERs
> 5. Each variant site can have zero or more INFO data
>
> (under discussion)
>  -> 6. Each INFO field (if present) can have one or more _groups_ of
> "sub-INFO", which within a group are associated with each other.
>
> For comparison, SnpEff does this:
>
>  ##SnpEffVersion="SnpEff 3.1k (build 2012-12-17), by Pablo Cingolani"
> ##SnpEffCmd="SnpEff  -no-downstream -no-upstream -no-intergenic -no-intron
> -noStats GRCh37.68 "
> ##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for
> this variant.Format: 'Effect ( Effect_Impact | Functional_Class |
> Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name |
> Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ">
>
> These are the levels of delimiters:
>   1st level: '\t' preferred (but whitespace in practice for
> backward-compatibility)
>   2nd level: ';' between INFO fields, '=' to separate key/value, ',' for
> multi-valued fields
>   3rd level: '|'? (it is what SnpEff uses and what Anja proposes)
>
> I think the underlying question is how/if the subfields of VE could be
> documented in the header (are they separate INFO tags? Just documented as
> part of the header line for the VE tag?). It would be nice to standardize
> this so there would not need to be a proliferation of special-cases to
> handle how different groups encode the 'value' portion of a key=value pair
> in the INFO field.
>
> --
> Bradford Powell
> Clinical Genetics Academic Research Fellow
> Baylor College of Medicine
>
>
> On Wed, Feb 20, 2013 at 8:31 AM, Adam Auton <ada...@gm...> wrote:
>
>> Hello Anja,
>>
>> Thank you for raising this. VCF allows you to define your own INFO
>> fields, so much of what you propose is already valid within the
>> specification. However, while there is nothing technically wrong with them,
>> I would suggest that SV, I, and FT are perhaps not ideal abbreviations. SV
>> could be confused for Structural Variant, FT is already used to define a
>> filter in the FORMAT field, and for "I" we suggest avoiding single letter
>> fields.
>>
>> As for List, isn't this already allowed by the "Number" part of the INFO
>> definition? For example,
>>
>> ##INFO=<ID=LIST,Number=4,Type=String,Description="A list with 4 entries">
>>
>> Does this not meet your requirements?
>>
>> Kind regards,
>>
>> Adam
>>
>> On 20 February 2013 06:49, Anja Thormann <an...@eb...> wrote:
>>
>>> Hello,
>>>
>>> We want to provide data dumps in VCF format for the next ensembl release
>>> 71. We already provide data dumps in GVF format.
>>> The first step for us would be to convert existing GVF files to VCF files.
>>> However, so far VCF has no predefined way of storing variant effects.
>>> Using the tools provided by the VCF specification, and following the GVF
>>> specification and Will's suggestions, we came up with the following:
>>>
>>> In GVF a variant effect is defined as:
>>> Variant_effect=sequence_variant index feature_type feature_ID feature_ID
>>> (http://www.sequenceontology.org/resources/gvf.html#gvf_pragmas)
>>>
>>> In VCF the variant effect could be part of the INFO column:
>>> 1. Define INFO fields:
>>> ##INFO=<ID=SV,Number=String,Type=,Description="Sequence_variant. Term
>>> that describes the effect of the sequence_alteration on a sequence feature.
>>> SO term.">
>>> ##INFO=<ID=I,Number=Integer,Type=,Description="Index. 0-based index
>>> value that identifies which Variant_seq the effect is being described for.">
>>> ##INFO=<ID=FT,Number=String,Type=,Description="Feature type. Sequence
>>> feature that is being affected. SO term.">
>>> ##INFO=<ID=FID,Number=List,Type=,Description="Feature IDs. These feature
>>> IDs correspond to ID attributes in a GFF3 file that describe the sequence
>>> features.">
>>>
>>> 2. Introduce a Format tag to the INFO field and allow a new Type: List?
>>> A list is separated by commas.
>>> ##INFO=<ID=VE,Number=.,Type=List,Description="Variant effect: Effect
>>> that a sequence alteration has on a sequence feature that overlaps
>>> it.",Format="SV|I|FT|FID">
>>>
>>> Then a possible row in a VCF file could look like this:
>>> 1 847514 rs28651100 C T . .
>>> VE=downstream_gene_variant|0|transcript|ENST00000417705,upstream_gene_variant|0|transcript|ENST00000398216,nc_transcript_variant|0|ncRNA|ENST00000448179,non_coding_exon_variant|0|ncRNA|ENST00000448179;VS_Freq;VS_1000G;dbSNP_137
>>>
>>> At least this is what we will include in our VCF (v4.1) files and
>>> hopefully this will not clash with existing parsers.
>>>
>>> I would be very interested in some feedback.
>>>
>>> Best regards,
>>>
>>> Anja Thormann
>>> Ensembl-Variation
>>>
>>> On 26 Nov 2012, at 16:47, Will McLaren wrote:
>>>
>>> Hello all,
>>>
>>> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor)
>>> software - I'd like to give the list our perspective on how we add
>>> functional annotations in VCF 4.1 currently.
>>>
>>> The VEP parses VCF (alongside other formats) and users can choose to
>>> output in VCF format too (though this is not the default, many of our users
>>> use it).
>>>
>>> The format for the functional data that we use is similar to that
>>> described by Petr (I suspect the example he shows is derived from VEP
>>> output of some form). We use the CSQ key in the INFO field, with the value
>>> consisting of "|" (pipe) separated chunks of data fields; the chunks
>>> themselves are separated by commas.
>>>
>>> Each chunk contains functional annotation for one alt allele +
>>> functional element combination. At the moment a functional element can be a
>>> transcript, regulatory feature or transcription factor binding motif.
>>>
>>> The fields and their order vary according to which command line options
>>> are used (and therefore which additional data is added). The order of
>>> fields is defined in a header line added to the VCF. The user may also
>>> specify a list of fields that they would like included, somewhat similar to
>>> a roll-your-own format.
>>>
>>> Missing data are left empty (i.e. you will see two consecutive "|"
>>> delimiters if a field is empty).
>>>
>>> Example:
>>>
>>> ##fileformat=VCFv4.1
>>>  ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>>> predicted by VEP. Format:
>>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>>> 21      26960070        rs116645811     G       A       .       .
>>> CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>>>
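
The layout described above (field order declared in the header, chunks
separated by commas, fields by pipes, empty fields left blank) could be read
back along these lines. This is only a sketch and not VEP's own code; the
function names csq_field_names and parse_csq are invented for the example,
and it assumes the header keeps the field list after the word "Format:".

```python
import re


def csq_field_names(header_line):
    """Extract the '|'-separated field names after 'Format:' in a CSQ header."""
    match = re.search(r'Format:\s*([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' section found in header line")
    return match.group(1).strip().split("|")


def parse_csq(value, names):
    """Return one dict per comma-separated chunk; empty fields become None."""
    records = []
    for chunk in value.split(","):
        parts = chunk.split("|")
        records.append({n: (p or None) for n, p in zip(names, parts)})
    return records


header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as '
          'predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|'
          'cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|'
          'Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">')
value = ("A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|"
         "334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)")

names = csq_field_names(header)
record = parse_csq(value, names)[0]
print(record["Consequence"], record["SIFT"], record["INTRON"])
```

Because the field names come from the header rather than being hard-coded,
the same reader keeps working when command line options change which fields
are present.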
>>> Some thoughts from me:
>>>
>>> - I would definitely prefer to avoid introducing additional whitespace.
>>> Currently I am whitespace-agnostic when parsing input; changing this
>>> would cause a lot of problems for users without any major benefits that I
>>> can see. In the few cases we have to push in data that might have spaces,
>>> we replace them with "_" underscores (and commas are replaced by "&"
>>> ampersands).
>>>
>>> - I would be strongly in favour of enforcing standards on the functional
>>> types called - we use Sequence Ontology (SO) types, and we've encouraged
>>> UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using them too.
>>> The SO guys are very open to contributions if there are types not yet
>>> described
>>>
>>> - some flexibility in the data fields that go with the functional
>>> annotation would be great - we report, for example, SIFT and PolyPhen
>>> predictions which are very popular with our users, but there's no reason to
>>> suppose these will be the flavours of the day in 1 or 2 years' time. Not to
>>> mention the potential expansion in non-coding annotations in a post-ENCODE
>>> world. But of course I recognise flexibility scales inversely with ease of
>>> parsing
>>>
>>> - beyond not wanting to disrupt our users' parsers, I don't have a
>>> problem changing the delimiters etc that we currently use
>>>
>>> - some of our fields are duplicated as they are specific only to the
>>> variant, not particularly to the allele/functional element combination -
>>> e.g. the Existing_variation field. This is a hangover of the VCF output
>>> being derived from our default output, which has one line per
>>> allele/functional element combo. I'd also be in favour of resolving these
>>> out somehow to reduce duplications
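
The substitution mentioned in the first point above (spaces to "_", commas
to "&") is small enough to sketch in Python. The helper name
sanitize_info_value is hypothetical, and other reserved characters such as
';' or '=' are deliberately left alone because the message does not say how
they are handled.

```python
def sanitize_info_value(text):
    """Replace characters that clash with VCF delimiters (hypothetical helper).

    Spaces become '_' and commas become '&'. The mapping is lossy:
    an original '_' or '&' cannot be told apart afterwards.
    """
    return text.replace(" ", "_").replace(",", "&")


print(sanitize_info_value("splice region, intron variant"))
```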
>>>
>>> Regards
>>>
>>> Will McLaren
>>> Ensembl Variation
>>>
>>>
>>> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>>>
>>>> Hi Gonzalo,
>>>>
>>>> I welcome the idea of standardizing the functional annotations. Here is
>>>> an example of a wildly evolved format that we have been using so far:
>>>>
>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT
>>>> alleles from Ensembl 66 VEP v2.4, format
>>>> transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>>>
>>>> and two concrete examples, the first for a multiallelic site:
>>>>
>>>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>>>
>>>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>>>
>>>>
>>>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>>>
>>>> I am curious what other formats are in use?
>>>>
>>>>
>>>> I'd prefer not to introduce whitespaces in the INFO field or change the
>>>> column delimiters to spaces or extend to whitespaces; it would break
>>>> existing software and wouldn't bring much benefit.
>>>>
>>>> Petr
>>>>
>>>>
>>>>
>>>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...> wrote:
>>>> > > Hi Bradford,
>>>> > >
>>>> > > I do understand where you're coming from, but truthfully I'd prefer
>>>> > > to go in the opposite direction once we're open to changing
>>>> > > delimiters. I've never quite understood why VCF is tab-delimited
>>>> > > and not whitespace-delimited.
>>>> >
>>>> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel -
>>>> > please keep that. It is a good thing!
>>>> >
>>>> > > You wouldn't believe how many times people have manually generated
>>>> > > VCFs that were space-delimited and couldn't understand why they were
>>>> > > failing in VCF parsers.
>>>> >
>>>> > I'd be asking why doesn't your parser give a clearer error message?
>>>> > If you've seen people fall over this pothole many times the parser
>>>> > concerned should be fixed.
>>>> >
>>>> > > I'd much rather that all whitespace be treated equally (as it is
>>>> > > visually).  It makes for a much simpler spec.
>>>> >
>>>> > The problem with white space is you can't see how many characters
>>>> > there are - spaces and tabs are not treated equally visually. What
>>>> > would you expect if there were several spaces in a row? If you treat
>>>> > it as one separator you prevent using empty cells (I'm thinking in
>>>> > terms of generalities here, not just VCF).
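
The point about empty cells can be seen in a few lines of Python. This is a
generic illustration, not taken from any VCF parser; the row is made up, with
the third column deliberately left empty (VCF itself would normally write '.'
for a missing value).

```python
row = "21\t26960070\t\tG\tA"  # third column deliberately left empty

by_tab = row.split("\t")      # one separator per tab: the empty cell survives
by_whitespace = row.split()   # runs of whitespace collapse: the empty cell vanishes

print(by_tab)
print(by_whitespace)
```

Splitting on tabs yields five columns, while splitting on any whitespace
yields four, so every column after the empty one shifts left.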
>>>> >
>>>> > Regards,
>>>> >
>>>> > Peter
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>  The Wellcome Trust Sanger Institute is operated by Genome Research
>>>>  Limited, a charity registered in England with number 1021457 and a
>>>>  company registered in England with number 2742969, whose registered
>>>>  office is 215 Euston Road, London, NW1 2BE.
>>>>
>>>>
>>>> _______________________________________________
>>>> VCFtools-spec mailing list
>>>> VCF...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>>>
>>
>>
>> --
>> Adam Auton
>> Assistant Professor,
>> Department of Genetics,
>> Albert Einstein College of Medicine,
>> 1301 Morris Park Avenue,
>> Van Etten B06,
>> Bronx, New York 10461
>>
>> Tel: +1 (718) 839 7216
>>
>>
>

-- 
Adam Auton
Assistant Professor,
Department of Genetics,
Albert Einstein College of Medicine,
1301 Morris Park Avenue,
Van Etten B06,
Bronx, New York 10461

Tel: +1 (718) 839 7216