From: bradford p. <bra...@gm...> - 2013-02-21 04:20:22
A downside to using an adhoc delimiter and defining the whole list as a
string is that there is no type information for the subfields within a
list. Type safety could be handled at the application-specific parser
layer, but BCF would have no ability to encode the numbers in something
like 'VE=NONSYN|122|C>A|0.34|0.22', losing potential file-size savings
(the numbers might represent something like SIFT/polyphen/MutTaster/etc
scores).

I think the only way to do something like this under the current spec is
to use parallel lists:

TYPE=downstream_gene_variant,upstream_gene_variant,nc_transcript_variant,non_coding_exon_variant;IDX=0,0,0,0;FTYPE=transcript,transcript,ncRNA,exon_variant

Then it would be the application's responsibility to ensure that the
number of items in the lists at least match. I presume most applications
that process VCF files maintain the order of lists (i.e. the order of
key-value pairs need not be maintained, but the order of items in a
list-type probably should be maintained... maybe this should be part of
the spec?); otherwise there is no guarantee that the grouped values
(between associated keys) would maintain their order/association.

--
Bradford Powell
Clinical Genetics Academic Research Fellow
Baylor College of Medicine

On Wed, Feb 20, 2013 at 10:51 AM, Adam Auton <ada...@gm...> wrote:

> If the number of entries is unknown, then you can just specify this in the
> INFO metadata using "Number=.".
>
> ##INFO=<ID=LIST,Number=.,Type=String,Description="A list with an unknown number of entries">
>
> This would define a comma separated list of strings. I don't think you
> could use tab or semicolon delimiters within the strings, as this would
> interfere with the legion number of delimiters included as part of VCF, but
> otherwise it would be fine.
>
> Alternatively, you could just define your list as a single string using
> an adhoc delimiter, which would be supported by your software (although not
> by other parsers of VCF).
> Something like:
>
> ##INFO=<ID=LIST,Number=1,Type=String,Description="A simple string">
>
> which you could use within the VCF file as:
>
> LIST="Effect_Impact|Functional_Class|Codon_Change".
>
>
> Adam
>
> On 20 February 2013 10:58, Anja Thormann <an...@eb...> wrote:
>
>> Thank you for the clarification. That is exactly what I was asking for: A
>> standardised way of describing multi-value (number unknown) key=value pairs
>> in the INFO column.
>> This would help a lot with storing e.g. variant effect data.
>> Anja
>>
>>
>> On 20 Feb 2013, at 15:07, bradford powell wrote:
>>
>> I think what Anja's proposal is trying to do is to group a set of info
>> tags that should appear in association with each other. This is similar to
>> how effects are encoded in VCF by SnpEff. The idea is to keep the sequence
>> alteration associated with, for instance, a specific transcript. It is
>> another level of "join" to the denormalization of the VCF format:
>>
>> 1. A variant site can have one or more alternate alleles
>> 2. A called variant site can have zero or more sample calls (the end of
>> the line, after FORMAT)
>> 3. Each sample call can have one or more pieces of sample-level
>> information (i.e. GT, PL, etc...)
>> 4. Each variant site can have zero or more FILTERs
>> 5. Each variant site can have zero or more INFO data
>>
>> (under discussion)
>> -> 6. Each INFO field (if present) can have one or more _groups_ of
>> "sub-INFO", which within a group are associated with each other.
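The proposed sixth level can be pictured with a minimal Python sketch. It assumes the subfield names are declared as a '|'-separated list (as in the VE proposal quoted further down in this thread) and that groups are separated by commas. The function name parse_grouped_info and the choice to return dicts are illustrative only; they are not part of the VCF spec or of any existing parser.

```python
def parse_grouped_info(value, subfields, group_sep=",", field_sep="|"):
    """Split one INFO value into groups, then each group into named subfields."""
    groups = []
    for chunk in value.split(group_sep):
        parts = chunk.split(field_sep)
        # Every group must carry exactly the declared subfields.
        if len(parts) != len(subfields):
            raise ValueError(
                f"expected {len(subfields)} subfields, got {len(parts)}: {chunk!r}"
            )
        groups.append(dict(zip(subfields, parts)))
    return groups


# Subfield names and values taken from the VE proposal in this thread.
subfields = "SV|I|FT|FID".split("|")
value = (
    "downstream_gene_variant|0|transcript|ENST00000417705,"
    "upstream_gene_variant|0|transcript|ENST00000398216"
)
groups = parse_grouped_info(value, subfields)
for group in groups:
    print(group["SV"], group["FID"])
```

Every subfield comes back as a string here, which is the lack of type information raised in the newest message of the thread: the index "0" is not known to be an integer unless the header says so.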
>>
>> For comparison, SnpEff does this:
>>
>> ##SnpEffVersion="SnpEff 3.1k (build 2012-12-17), by Pablo Cingolani"
>> ##SnpEffCmd="SnpEff -no-downstream -no-upstream -no-intergenic -no-intron -noStats GRCh37.68 "
>> ##INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ">
>>
>> This is the level of sub-sub-delimiters
>> 1st level: '\t' preferred (but whitespace in practice for
>> backward-compatibility)
>> 2nd level: ';' between INFO fields, '=' to separate key/value, ',' for
>> multi-valued fields
>> 3rd level: '|'? (it is what SnpEff uses and what Anja proposes)
>>
>> I think the underlying question is how/if the subfields of VE could be
>> documented in the header (are they separate INFO tags? Just documented as
>> part of the header line for the VE tag?). It would be nice to standardize
>> this so there would not need to be a proliferation of special-cases to
>> handle how different groups encode the 'value' portion of a key=value pair
>> in the INFO field.
>>
>> --
>> Bradford Powell
>> Clinical Genetics Academic Research Fellow
>> Baylor College of Medicine
>>
>>
>> On Wed, Feb 20, 2013 at 8:31 AM, Adam Auton <ada...@gm...> wrote:
>>
>>> Hello Anja,
>>>
>>> Thank you for raising this. VCF allows you to define your own INFO
>>> fields, so much of what you propose is already valid within the
>>> specification. However, while there is nothing technically wrong with them,
>>> I would suggest that SV, I, and FT are perhaps not ideal abbreviations. SV
>>> could be confused for Structural Variant, FT is already used to define a
>>> filter in the FORMAT field, and for "I" we suggest avoiding single letter
>>> fields.
>>>
>>> As for List, isn't this already allowed by the "Number" part of the INFO
>>> definition?
>>> For example,
>>>
>>> ##INFO=<ID=LIST,Number=4,Type=String,Description="A list with 4 entries">
>>>
>>> Does this not meet your requirements?
>>>
>>> Kind regards,
>>>
>>> Adam
>>>
>>> On 20 February 2013 06:49, Anja Thormann <an...@eb...> wrote:
>>>
>>>> Hello,
>>>>
>>>> We want to provide data dumps in VCF format for the next ensembl
>>>> release 71. We already provide data dumps in GVF format.
>>>> The first step for us would be to parse existing GVF files to VCF
>>>> files:
>>>> However, so far VCF has no predefined way of storing variant effects.
>>>> Using the given tools provided by
>>>> the VCF specification and following GVF specification and Will's
>>>> suggestions we came up with the following:
>>>>
>>>> In GVF a variant effect is defined as:
>>>> Variant_effect=sequence_variant index feature_type feature_ID feature_ID
>>>> (http://www.sequenceontology.org/resources/gvf.html#gvf_pragmas)
>>>>
>>>> In VCF the variant effect could be part of the INFO column:
>>>> 1. Define INFO fields:
>>>> ##INFO=<ID=SV,Number=String,Type=,Description="Sequence_variant. Term that describes the effect of the sequence_alteration on a sequence feature. SO term.">
>>>> ##INFO=<ID=I,Number=Integer,Type=,Description="Index. 0-based index value that identifies which Variant_seq the effect is being described for.">
>>>> ##INFO=<ID=FT,Number=String,Type=,Description="Feature type. Sequence feature that is being affected. SO term.">
>>>> ##INFO=<ID=FID,Number=List,Type=,Description="Feature IDs. These feature IDs correspond to ID attributes in a GFF3 file that describe the sequence features.">
>>>>
>>>> 2. Introduce a Format tag to the INFO field and allow a new Type: List?
>>>> A list is separated by commas.
>>>> ##INFO=<ID=VE,Number=.,Type=List,Description="Variant effect: Effect that a sequence alteration has on a sequence feature that overlaps it.",Format=SV|I|FT|FID">', "\n";
>>>>
>>>> Then a possible row in a VCF file could look like this:
>>>> 1 847514 rs28651100 C T . . VE=downstream_gene_variant|0|transcript|ENST00000417705,upstream_gene_variant|0|transcript|ENST00000398216,nc_transcript_variant|0|ncRNA|ENST00000448179,non_coding_exon_variant|0|ncRNA|ENST00000448179;VS_Freq;VS_1000G;dbSNP_137
>>>>
>>>> At least this is what we will include in our VCF (v4.1) files and
>>>> hopefully this will not clash with existing parsers.
>>>>
>>>> I would be very interested in some feedback.
>>>>
>>>> Best regards,
>>>>
>>>> Anja Thormann
>>>> Ensembl-Variation
>>>>
>>>> On 26 Nov 2012, at 16:47, Will McLaren wrote:
>>>>
>>>> Hello all,
>>>>
>>>> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor)
>>>> software - I'd like to give the list our perspective on how we add
>>>> functional annotations in VCF 4.1 currently.
>>>>
>>>> The VEP parses VCF (alongside other formats) and users can choose to
>>>> output in VCF format too (though this is not the default, many of our users
>>>> use it).
>>>>
>>>> The format for the functional data that we use is similar to that
>>>> described by Petr (I suspect the example he shows is derived from VEP
>>>> output of some form). We use the CSQ key in the INFO field, with the value
>>>> consisting of "|" (pipe) separated chunks of data fields; the chunks
>>>> themselves are separated by commas.
>>>>
>>>> Each chunk contains functional annotation for one alt allele +
>>>> functional element combination. At the moment a functional element can be a
>>>> transcript, regulatory feature or transcription factor binding motif.
>>>>
>>>> The fields and their order vary according to which command line options
>>>> are used (and therefore which additional data is added).
>>>> The order of fields is defined in a header line added to the VCF. The
>>>> user may also specify a list of fields that they would like included,
>>>> somewhat similar to a roll-your-own format.
>>>>
>>>> Missing data are left empty (i.e. you will see two consecutive "|"
>>>> delimiters if a field is empty).
>>>>
>>>> Example:
>>>>
>>>> ##fileformat=VCFv4.1
>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>>> 21 26960070 rs116645811 G A . . CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>>>>
>>>> Some thoughts from me:
>>>>
>>>> - I would definitely prefer to avoid introducing additional whitespace.
>>>> Currently I am whitespace ambivalent when parsing input; changing this
>>>> would cause a lot of problems for users without any major benefits that I
>>>> can see. In the few cases we have to push in data that might have spaces,
>>>> we replace them with "_" underscores (and commas are replaced by "&"
>>>> ampersands)
>>>>
>>>> - I would be strongly in favour of enforcing standards on the
>>>> functional types called - we use Sequence Ontology (SO) types, and we've
>>>> encouraged UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using
>>>> them too. The SO guys are very open to contributions if there are types not
>>>> yet described
>>>>
>>>> - some flexibility in the data fields that go with the functional
>>>> annotation would be great - we report, for example, SIFT and PolyPhen
>>>> predictions which are very popular with our users, but there's no reason to
>>>> suppose these will be the flavours of the day in 1 or 2 years' time.
>>>> Not to mention the potential expansion in non-coding annotations in a
>>>> post-ENCODE world. But of course I recognise flexibility scales inversely
>>>> with ease of parsing
>>>>
>>>> - beyond not wanting to disrupt our users' parsers, I don't have a
>>>> problem changing the delimiters etc that we currently use
>>>>
>>>> - some of our fields are duplicated as they are specific only to the
>>>> variant, not particularly to the allele/functional element combination -
>>>> e.g. the Existing_variation field. This is a hangover of the VCF output
>>>> being derived from our default output, which has one line per
>>>> allele/functional element combo. I'd also be in favour of resolving these
>>>> out somehow to reduce duplications
>>>>
>>>> Regards
>>>>
>>>> Will McLaren
>>>> Ensembl Variation
>>>>
>>>>
>>>> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>>>>
>>>>> Hi Gonzalo,
>>>>>
>>>>> I welcome the idea of standardizing the functional annotations. Here is
>>>>> an example of a wildly evolved format that we have been using so far:
>>>>>
>>>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT alleles from Ensembl 66 VEP v2.4, format transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>>>>
>>>>> and two concrete examples, the first for a multiallelic site:
>>>>>
>>>>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>>>>
>>>>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>>>>
>>>>>
>>>>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>>>>
>>>>> I am curious what other formats are in use?
>>>>>
>>>>>
>>>>> I'd prefer not to introduce whitespaces in the INFO field or change the
>>>>> column delimiters to spaces or extend to whitespaces; it would break
>>>>> existing software and wouldn't bring much benefit.
>>>>>
>>>>> Petr
>>>>>
>>>>>
>>>>>
>>>>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>>>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...> wrote:
>>>>> > > Hi Bradford,
>>>>> > >
>>>>> > > I do understand where you're coming from, but truthfully I'd prefer to go in
>>>>> > > the opposite direction once we're open to changing delimiters. I've never
>>>>> > > quite understood why VCF is tab-delimited and not whitespace-delimited.
>>>>> >
>>>>> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel - please
>>>>> > keep that. It is a good thing!
>>>>> >
>>>>> > > You wouldn't believe how many times people have manually generated
>>>>> > > VCFs that were space-delimited and couldn't understand why they were
>>>>> > > failing in VCF parsers.
>>>>> >
>>>>> > I'd be asking why doesn't your parser give a clearer error message?
>>>>> > If you've seen people fall over this pothole many times the parser
>>>>> > concerned should be fixed.
>>>>> >
>>>>> > > I'd much rather that all whitespace be treated equally (as it is
>>>>> > > visually). It makes for a much simpler spec.
>>>>> >
>>>>> > The problem with white space is you can't see how many characters
>>>>> > there are - spaces and tabs are not treated equally visually. What
>>>>> > would you expect if there were several spaces in a row? If you treat
>>>>> > it as one separator you prevent using empty cells (I'm thinking in
>>>>> > terms of generalities here, not just VCF).
>>>>> >
>>>>> > Regards,
>>>>> >
>>>>> > Peter
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>>>> Limited, a charity registered in England with number 1021457 and a
>>>>> company registered in England with number 2742969, whose registered
>>>>> office is 215 Euston Road, London, NW1 2BE.
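Peter's point about empty cells can be illustrated in a few lines of Python. This is only a sketch using the standard str.split; the sample row is made up for the illustration and is not a valid VCF record.

```python
# Hypothetical row whose third cell is empty (two tabs in a row).
row = "21\t26960070\t\tG\tA"

# Splitting on the tab character keeps one field per delimiter,
# so the empty cell survives as an empty string.
by_tab = row.split("\t")

# Splitting on any whitespace collapses runs of whitespace into a
# single separator, so the empty cell silently disappears.
by_whitespace = row.split()

print(by_tab)
print(by_whitespace)
```

With tab splitting the row keeps five columns; with whitespace splitting it drops to four and every later column shifts left, which is the ambiguity Peter describes.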
>>>>>
>>>>> _______________________________________________
>>>>> VCFtools-spec mailing list
>>>>> VCF...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>>
>>> --
>>> Adam Auton
>>> Assistant Professor,
>>> Department of Genetics,
>>> Albert Einstein College of Medicine,
>>> 1301 Morris Park Avenue,
>>> Van Etten B06,
>>> Bronx, New York 10461
>>>
>>> Tel: +1 (718) 839 7216
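The parallel-list encoding suggested in the newest message at the top of this thread leaves the length check to the application. A minimal Python sketch of that check follows. It assumes the INFO column has already been split into a dict mapping each key to its comma-separated string; the helper name zip_parallel_lists is hypothetical and not part of any VCF library.

```python
def zip_parallel_lists(info, keys):
    """Regroup parallel comma-separated INFO lists into one tuple per position."""
    columns = [info[key].split(",") for key in keys]
    lengths = {key: len(col) for key, col in zip(keys, columns)}
    # The spec does not tie these lists together, so the application
    # has to verify that they line up.
    if len(set(lengths.values())) != 1:
        raise ValueError(f"parallel lists differ in length: {lengths}")
    return list(zip(*columns))


# Values taken from the TYPE/IDX/FTYPE example in this thread.
info = {
    "TYPE": "downstream_gene_variant,upstream_gene_variant,"
            "nc_transcript_variant,non_coding_exon_variant",
    "IDX": "0,0,0,0",
    "FTYPE": "transcript,transcript,ncRNA,exon_variant",
}
records = zip_parallel_lists(info, ["TYPE", "IDX", "FTYPE"])
for record in records:
    print(record)
```

The regrouping is only meaningful if every tool that touched the file preserved the order of items within each list, which is the guarantee the message asks whether the spec should state.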