Re: [VCFtools-spec] Wishlist for 4.2: allow space characters in INFO; functional annotation

A downside to using an ad hoc delimiter and defining the whole list as a
string is that there is no type information for the subfields within the
list. Type safety could be handled at the application-specific parser
layer, but BCF would have no way to encode the numbers in something
like 'VE=NONSYN|122|C>A|0.34|0.22' as typed values, losing potential
file-size savings (the numbers might represent something like
SIFT/PolyPhen/MutationTaster/etc. scores).
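
To make the typing problem concrete, here is a minimal Python sketch. The
subfield types in SUBFIELD_TYPES are my own assumption for this one example
value; nothing in the VCF header declares them, which is exactly the issue.

```python
def split_untyped(info_value, delimiter="|"):
    """Split an ad hoc delimited INFO value; every subfield comes back as str."""
    return info_value.split(delimiter)


def apply_types(subfields, types):
    """Cast each subfield with a converter supplied by the application."""
    if len(subfields) != len(types):
        raise ValueError("expected %d subfields, got %d" % (len(types), len(subfields)))
    return [convert(value) for convert, value in zip(types, subfields)]


# Hypothetical per-subfield types: the application has to know these out of band.
SUBFIELD_TYPES = (str, int, str, float, float)

raw = "NONSYN|122|C>A|0.34|0.22"
untyped = split_untyped(raw)                  # all strings, as a generic parser sees them
typed = apply_types(untyped, SUBFIELD_TYPES)  # numbers recovered only by convention
```

A generic VCF or BCF tool only ever sees the untyped form, so it cannot
store 122 or 0.34 as binary numbers.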

I think the only way to do something like this under the current spec is to
use parallel lists:
TYPE=downstream_gene_variant,upstream_gene_variant,nc_transcript_variant,non_coding_exon_variant;IDX=0,0,0,0;FTYPE=transcript,transcript,ncRNA,exon_variant

Then it would be the application's responsibility to ensure that the number
of items in each list matches. I presume most applications that
process VCF files maintain the order of lists (i.e. the order of key-value
pairs need not be maintained, but the order of items in a list-type
value probably should be... maybe this should be part of the spec?);
otherwise there is no guarantee that the grouped values (between associated
keys) would maintain their order/association.
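
A rough sketch of what that application-level check could look like, in
Python. The function names parse_info and zip_parallel are made up for
illustration, and the sketch assumes the parser preserves list order:

```python
def parse_info(info):
    """Parse an INFO column into {key: [values]}; flags map to an empty list."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value.split(",") if sep else []
    return fields


def zip_parallel(fields, keys):
    """Regroup parallel lists by position, refusing lists of unequal length."""
    lists = [fields[key] for key in keys]
    if len({len(values) for values in lists}) != 1:
        raise ValueError("parallel lists differ in length: %s" % ", ".join(keys))
    return [dict(zip(keys, group)) for group in zip(*lists)]


info = (
    "TYPE=downstream_gene_variant,upstream_gene_variant,"
    "nc_transcript_variant,non_coding_exon_variant;"
    "IDX=0,0,0,0;FTYPE=transcript,transcript,ncRNA,exon_variant"
)
effects = zip_parallel(parse_info(info), ("TYPE", "IDX", "FTYPE"))
```

Each element of effects is one group, e.g. the third one pairs
nc_transcript_variant with index 0 and ncRNA. Note that IDX still comes back
as a string here; a typed parser would use the header's Type for that.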

--
Bradford Powell
Clinical Genetics Academic Research Fellow
Baylor College of Medicine

On Wed, Feb 20, 2013 at 10:51 AM, Adam Auton <ada...@gm...> wrote:

> If the number of entries is unknown, then you can just specify this in the
> INFO metadata using "Number=.".
>
> ##INFO=<ID=LIST,Number=.,Type=String,Description="A list with an unknown
> number of entries">
>
> This would define a comma-separated list of strings. I don't think you
> could use tab or semicolon delimiters within the strings, as this would
> interfere with the many delimiters already used by VCF, but
> otherwise it would be fine.
>
> Alternatively, you could just define your list as a single string using
> an ad hoc delimiter, which would be supported by your software (although not
> by other parsers of VCF). Something like:
>
> ##INFO=<ID=LIST,Number=1,Type=String,Description="A simple string">
>
> which you could use within the VCF file as:
>
> LIST="Effect_Impact|Functional_Class|Codon_Change".
>
>
> Adam
>
> On 20 February 2013 10:58, Anja Thormann <an...@eb...> wrote:
>
>> Thank you for the clarification. That is exactly what I was asking for: A
>> standardised way of describing multi-value (number unknown) key=value pairs
>> in the INFO column.
>> This would help a lot with storing e.g. variant effect data.
>> Anja
>>
>>
>> On 20 Feb 2013, at 15:07, bradford powell wrote:
>>
>> I think what Anja's proposal is trying to do is to group a set of info
>> tags that should appear in association with each other. This is similar to
>> how effects are encoded in VCF by SnpEff. The idea is to keep the sequence
>> alteration associated with, for instance, a specific transcript. It is
>> another level of "join" to the denormalization of the VCF format:
>>
>> 1. A variant site can have one or more alternate alleles
>> 2. A called variant site can have zero or more sample calls (the end of
>> the line, after FORMAT)
>> 3. Each sample call can have one or more pieces of sample-level
>> information (e.g. GT, PL, etc.)
>> 4. Each variant site can have zero or more FILTERs
>> 5. Each variant site can have zero or more INFO data
>>
>> (under discussion)
>>  -> 6. Each INFO field (if present) can have one or more _groups_ of
>> "sub-INFO", which within a group are associated with each other.
>>
>> For comparison, SnpEff does this:
>>
>>  ##SnpEffVersion="SnpEff 3.1k (build 2012-12-17), by Pablo Cingolani"
>> ##SnpEffCmd="SnpEff  -no-downstream -no-upstream -no-intergenic
>> -no-intron -noStats GRCh37.68 "
>> ##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for
>> this variant.Format: 'Effect ( Effect_Impact | Functional_Class |
>> Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name |
>> Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ">
>>
>> These are the levels of delimiters:
>>   1st level: '\t' preferred (but whitespace in practice for
>> backward-compatibility)
>>   2nd level: ';' between INFO fields, '=' to separate key/value, ',' for
>> multi-valued fields
>>   3rd level: '|'? (it is what SnpEff uses and what Anja proposes)
>>
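
As a sketch of how the three delimiter levels nest, here is a small Python
example. The record is a shortened version of the VE row proposed elsewhere
in this thread, and grouped_keys is a hypothetical parameter: the parser has
to be told which keys carry '|' groups, since the header does not say so.

```python
def parse_record_info(line, grouped_keys=("VE",)):
    """Split one VCF data line down to the third ('|') delimiter level."""
    columns = line.rstrip("\n").split("\t")   # 1st level: tab between columns
    info_column = columns[7]                  # INFO is the 8th fixed column
    parsed = {}
    for item in info_column.split(";"):       # 2nd level: ';' between INFO fields
        key, sep, value = item.partition("=") # '=' separates key from value
        if not sep:
            parsed[key] = True                # flag with no value
            continue
        values = value.split(",")             # ',' for multi-valued fields
        if key in grouped_keys:
            values = [v.split("|") for v in values]  # 3rd level: '|' subfields
        parsed[key] = values
    return parsed


record = (
    "1\t847514\trs28651100\tC\tT\t.\t.\t"
    "VE=downstream_gene_variant|0|transcript|ENST00000417705,"
    "upstream_gene_variant|0|transcript|ENST00000398216;dbSNP_137"
)
parsed = parse_record_info(record)
```

The special-casing in grouped_keys is the part a standard would remove: if
the header declared the subfields, no per-tool list would be needed.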
>> I think the underlying question is how/if the subfields of VE could be
>> documented in the header (are they separate INFO tags? Just documented as
>> part of the header line for the VE tag?). It would be nice to standardize
>> this so there would not need to be a proliferation of special-cases to
>> handle how different groups encode the 'value' portion of a key=value pair
>> in the INFO field.
>>
>> --
>> Bradford Powell
>> Clinical Genetics Academic Research Fellow
>> Baylor College of Medicine
>>
>>
>> On Wed, Feb 20, 2013 at 8:31 AM, Adam Auton <ada...@gm...> wrote:
>>
>>> Hello Anja,
>>>
>>> Thank you for raising this. VCF allows you to define your own INFO
>>> fields, so much of what you propose is already valid within the
>>> specification. However, while there is nothing technically wrong with them,
>>> I would suggest that SV, I, and FT are perhaps not ideal abbreviations. SV
>>> could be confused for Structural Variant, FT is already used to define a
>>> filter in the FORMAT field, and for "I" we suggest avoiding single letter
>>> fields.
>>>
>>> As for List, isn't this already allowed by the "Number" part of the INFO
>>> definition? For example,
>>>
>>> ##INFO=<ID=LIST,Number=4,Type=String,Description="A list with 4 entries">
>>>
>>> Does this not meet your requirements?
>>>
>>> Kind regards,
>>>
>>> Adam
>>>
>>> On 20 February 2013 06:49, Anja Thormann <an...@eb...> wrote:
>>>
>>>> Hello,
>>>>
>>>> We want to provide data dumps in VCF format for the next Ensembl
>>>> release (71). We already provide data dumps in GVF format.
>>>> The first step for us would be to convert existing GVF files to VCF
>>>> files.
>>>> However, so far VCF has no predefined way of storing variant effects.
>>>> Using the mechanisms provided by
>>>> the VCF specification and following the GVF specification and Will's
>>>> suggestions, we came up with the following:
>>>>
>>>> In GVF a variant effect is defined as:
>>>> Variant_effect=sequence_variant index feature_type feature_ID
>>>> feature_ID (
>>>> http://www.sequenceontology.org/resources/gvf.html#gvf_pragmas)
>>>>
>>>> In VCF the variant effect could be part of the INFO column:
>>>> 1. Define INFO fields:
>>>> ##INFO=<ID=SV,Number=String,Type=,Description="Sequence_variant. Term
>>>> that describes the effect of the sequence_alteration on a sequence feature.
>>>> SO term.">
>>>> ##INFO=<ID=I,Number=Integer,Type=,Description="Index. 0-based index
>>>> value that identifies which Variant_seq the effect is being described for.">
>>>> ##INFO=<ID=FT,Number=String,Type=,Description="Feature type. Sequence
>>>> feature that is being affected. SO term.">
>>>> ##INFO=<ID=FID,Number=List,Type=,Description="Feature IDs. These
>>>> feature IDs correspond to ID attributes in a GFF3 file that describe the
>>>> sequence features.">
>>>>
>>>> 2. Introduce a Format tag to the INFO field and allow a new Type: List?
>>>> A list is separated by commas.
>>>> ##INFO=<ID=VE,Number=.,Type=List,Description="Variant effect: Effect
>>>> that a sequence alteration has on a sequence feature that overlaps
>>>> it.",Format="SV|I|FT|FID">
>>>>
>>>> Then a possible row in a VCF file could look like this:
>>>> 1 847514 rs28651100 C T . .
>>>> VE=downstream_gene_variant|0|transcript|ENST00000417705,upstream_gene_variant|0|transcript|ENST00000398216,nc_transcript_variant|0|ncRNA|ENST00000448179,non_coding_exon_variant|0|ncRNA|ENST00000448179;VS_Freq;VS_1000G;dbSNP_137
>>>>
>>>> At least this is what we will include in our VCF (v4.1) files and
>>>> hopefully this will not clash with existing parsers.
>>>>
>>>> I would be very interested in some feedback.
>>>>
>>>> Best regards,
>>>>
>>>> Anja Thormann
>>>> Ensembl-Variation
>>>>
>>>> On 26 Nov 2012, at 16:47, Will McLaren wrote:
>>>>
>>>> Hello all,
>>>>
>>>> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor)
>>>> software - I'd like to give the list our perspective on how we add
>>>> functional annotations in VCF 4.1 currently.
>>>>
>>>> The VEP parses VCF (alongside other formats) and users can choose to
>>>> output in VCF format too (though this is not the default, many of our users
>>>> use it).
>>>>
>>>> The format for the functional data that we use is similar to that
>>>> described by Petr (I suspect the example he shows is derived from VEP
>>>> output of some form). We use the CSQ key in the INFO field, with the value
>>>> consisting of "|" (pipe) separated chunks of data fields; the chunks
>>>> themselves are separated by commas.
>>>>
>>>> Each chunk contains functional annotation for one alt allele +
>>>> functional element combination. At the moment a functional element can be a
>>>> transcript, regulatory feature or transcription factor binding motif.
>>>>
>>>> The fields and their order vary according to which command line options
>>>> are used (and therefore which additional data is added). The order of
>>>> fields is defined in a header line added to the VCF. The user may also
>>>> specify a list of fields that they would like included, somewhat similar to
>>>> a roll-your-own format.
>>>>
>>>> Missing data are left empty (i.e. you will see two consecutive "|"
>>>> delimiters if a field is empty).
>>>>
>>>> Example:
>>>>
>>>> ##fileformat=VCFv4.1
>>>>  ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as
>>>> predicted by VEP. Format:
>>>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>>>> #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
>>>> 21      26960070        rs116645811     G       A       .       .
>>>> CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>>>>
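
A minimal Python sketch of a header-driven parser for the CSQ value shown in
the example above. The function names are hypothetical, and the regular
expression assumes the field order always follows the literal text "Format:"
inside the Description, as it does in this example:

```python
import re


def csq_field_names(header_line):
    """Read the '|'-separated field names that follow 'Format:' in the header."""
    match = re.search(r'Format:\s*([^">]+)', header_line)
    if match is None:
        raise ValueError("no Format: section found in header line")
    return match.group(1).strip().split("|")


def parse_csq(value, names):
    """Split a CSQ value into one dict per comma-separated chunk."""
    records = []
    for chunk in value.split(","):
        parts = chunk.split("|")
        if len(parts) != len(names):
            raise ValueError("chunk has %d fields, header declares %d"
                             % (len(parts), len(names)))
        # Empty strings (two consecutive '|') mean missing data.
        records.append({name: (part or None) for name, part in zip(names, parts)})
    return records


header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as '
    'predicted by VEP. Format: '
    'Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|'
    'Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|'
    'DISTANCE|SIFT|PolyPhen">'
)
value = (
    "A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|"
    "334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)"
)
names = csq_field_names(header)
records = parse_csq(value, names)
```

Because the field list is read from the header rather than hard-coded, the
same code keeps working when command line options change which fields are
present.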
>>>> Some thoughts from me:
>>>>
>>>> - I would definitely prefer to avoid introducing additional whitespace.
>>>> Currently I am whitespace ambivalent when parsing input; changing this
>>>> would cause a lot of problems for users without any major benefits that I
>>>> can see. In the few cases we have to push in data that might have spaces,
>>>> we replace them with "_" underscores (and commas are replaced by "&"
>>>> ampersands)
>>>>
>>>> - I would be strongly in favour of enforcing standards on the
>>>> functional types called - we use Sequence Ontology (SO) types, and we've
>>>> encouraged UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using
>>>> them too. The SO guys are very open to contributions if there are types not
>>>> yet described
>>>>
>>>> - some flexibility in the data fields that go with the functional
>>>> annotation would be great - we report, for example, SIFT and PolyPhen
>>>> predictions which are very popular with our users, but there's no reason to
>>>> suppose these will be the flavours of the day in 1 or 2 years' time. Not to
>>>> mention the potential expansion in non-coding annotations in a post-ENCODE
>>>> world. But of course I recognise flexibility scales inversely with ease of
>>>> parsing
>>>>
>>>> - beyond not wanting to disrupt our users' parsers, I don't have a
>>>> problem changing the delimiters etc that we currently use
>>>>
>>>> - some of our fields are duplicated as they are specific only to the
>>>> variant, not particularly to the allele/functional element combination -
>>>> e.g. the Existing_variation field. This is a hangover of the VCF output
>>>> being derived from our default output, which has one line per
>>>> allele/functional element combo. I'd also be in favour of resolving these
>>>> out somehow to reduce duplications
>>>>
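
The replacement of spaces and commas mentioned in the first point can be
sketched in a few lines of Python. The function name is made up, and this is
only an illustration of the substitution described, not VEP's actual code:

```python
def sanitize_info_value(text):
    """Replace characters that would clash with VCF delimiters.

    Spaces become '_' and commas become '&'. The mapping is lossy: a value
    that already contained '_' or '&' cannot be told apart afterwards.
    """
    return text.replace(" ", "_").replace(",", "&")


cleaned = sanitize_info_value("upstream, 5 prime")
```

Since the substitution cannot be reversed reliably, a reader of the file has
to treat '_' and '&' in such values as possibly standing for the originals.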
>>>> Regards
>>>>
>>>> Will McLaren
>>>> Ensembl Variation
>>>>
>>>>
>>>> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>>>>
>>>>> Hi Gonzalo,
>>>>>
>>>>> I welcome the idea of standardizing the functional annotations. Here is
>>>>> an example of a wildly evolved format that we have been using so far:
>>>>>
>>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT
>>>>> alleles from Ensembl 66 VEP v2.4, format
>>>>> transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>>>>
>>>>> and two concrete examples, the first for a multiallelic site:
>>>>>
>>>>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>>>>
>>>>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>>>>
>>>>>
>>>>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>>>>
>>>>> I am curious what other formats are in use?
>>>>>
>>>>>
>>>>> I'd prefer not to introduce whitespaces in the INFO field or change the
>>>>> column delimiters to spaces or extend to whitespaces; it would break
>>>>> existing software and wouldn't bring much benefit.
>>>>>
>>>>> Petr
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>>>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...> wrote:
>>>>> > > Hi Bradford,
>>>>> > >
>>>>> > > I do understand where you're coming from, but truthfully I'd
>>>>> > > prefer to go in the opposite direction once we're open to
>>>>> > > changing delimiters. I've never quite understood why VCF is
>>>>> > > tab-delimited and not whitespace-delimited.
>>>>> >
>>>>> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel -
>>>>> > please keep that. It is a good thing!
>>>>> >
>>>>> > > You wouldn't believe how many times people have manually generated
>>>>> > > VCFs that were space-delimited and couldn't understand why they
>>>>> > > were failing in VCF parsers.
>>>>> >
>>>>> > I'd be asking why doesn't your parser give a clearer error message?
>>>>> > If you've seen people fall over this pothole many times the parser
>>>>> > concerned should be fixed.
>>>>> >
>>>>> > > I'd much rather that all whitespace be treated equally (as it is
>>>>> > > visually).  It makes for a much simpler spec.
>>>>> >
>>>>> > The problem with white space is you can't see how many characters
>>>>> > there are - spaces and tabs are not treated equally visually. What
>>>>> > would you expect if there were several spaces in a row? If you treat
>>>>> > it as one separator you prevent using empty cells (I'm thinking in
>>>>> > terms of generalities here, not just VCF).
>>>>> >
>>>>> > Regards,
>>>>> >
>>>>> > Peter
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>  The Wellcome Trust Sanger Institute is operated by Genome Research
>>>>>  Limited, a charity registered in England with number 1021457 and a
>>>>>  company registered in England with number 2742969, whose registered
>>>>>  office is 215 Euston Road, London, NW1 2BE.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> VCFtools-spec mailing list
>>>>> VCF...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Adam Auton
>>> Assistant Professor,
>>> Department of Genetics,
>>> Albert Einstein College of Medicine,
>>> 1301 Morris Park Avenue,
>>> Van Etten B06,
>>> Bronx, New York 10461
>>>
>>> Tel: +1 (718) 839 7216
>>>
>>>
>>
>
>
> --
> Adam Auton
> Assistant Professor,
> Department of Genetics,
> Albert Einstein College of Medicine,
> 1301 Morris Park Avenue,
> Van Etten B06,
> Bronx, New York 10461
>
> Tel: +1 (718) 839 7216
>