From: Anja T. <an...@eb...> - 2013-02-20 11:50:04
Hello,

We want to provide data dumps in VCF format for the next Ensembl release 71. We already provide data dumps in GVF format. The first step for us would be to convert existing GVF files to VCF files. However, so far VCF has no predefined way of storing variant effects. Using the tools provided by the VCF specification, and following the GVF specification and Will's suggestions, we came up with the following.

In GVF a variant effect is defined as:

Variant_effect=sequence_variant index feature_type feature_ID feature_ID
(http://www.sequenceontology.org/resources/gvf.html#gvf_pragmas)

In VCF the variant effect could be part of the INFO column:

1. Define INFO fields:

##INFO=<ID=SV,Number=String,Type=,Description="Sequence_variant. Term that describes the effect of the sequence_alteration on a sequence feature. SO term.">
##INFO=<ID=I,Number=Integer,Type=,Description="Index. 0-based index value that identifies which Variant_seq the effect is being described for.">
##INFO=<ID=FT,Number=String,Type=,Description="Feature type. Sequence feature that is being affected. SO term.">
##INFO=<ID=FID,Number=List,Type=,Description="Feature IDs. These feature IDs correspond to ID attributes in a GFF3 file that describe the sequence features.">

2. Introduce a Format tag to the INFO field and allow a new Type: List? A list is separated by commas.

##INFO=<ID=VE,Number=.,Type=List,Description="Variant effect: Effect that a sequence alteration has on a sequence feature that overlaps it.",Format="SV|I|FT|FID">

Then a possible row in a VCF file could look like this:

1 847514 rs28651100 C T . . VE=downstream_gene_variant|0|transcript|ENST00000417705,upstream_gene_variant|0|transcript|ENST00000398216,nc_transcript_variant|0|ncRNA|ENST00000448179,non_coding_exon_variant|0|ncRNA|ENST00000448179;VS_Freq;VS_1000G;dbSNP_137

At least this is what we will include in our VCF (v4.1) files, and hopefully this will not clash with existing parsers. I would be very interested in some feedback.

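To make the proposal easier to test against existing parsers, here is a minimal Python sketch of how the VE key above might be read and written. It assumes the field order SV|I|FT|FID from the Format tag, commas between effects, and exactly one feature ID per effect (as in the example row). The names `VariantEffect`, `parse_info`, `parse_ve` and `gvf_effect_to_ve` are made up for this illustration and are not part of any Ensembl tool; emitting one VE chunk per GVF feature ID is likewise an assumption about the conversion, not a description of the actual dump code.

```python
from typing import Dict, List, NamedTuple


class VariantEffect(NamedTuple):
    sequence_variant: str  # SV: SO term describing the effect
    index: int             # I: 0-based index of the affected Variant_seq
    feature_type: str      # FT: SO term for the affected feature
    feature_id: str        # FID: ID of the affected feature


def parse_info(info: str) -> Dict[str, str]:
    """Split an INFO column into key/value pairs; flag keys map to ''."""
    fields = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        fields[key] = value
    return fields


def parse_ve(value: str) -> List[VariantEffect]:
    """Parse a VE value: comma-separated effects, pipe-separated fields."""
    effects = []
    for chunk in value.split(","):
        sv, idx, ft, fid = chunk.split("|")
        effects.append(VariantEffect(sv, int(idx), ft, fid))
    return effects


def gvf_effect_to_ve(gvf_value: str) -> str:
    """Turn one GVF Variant_effect value into VE chunks.

    Assumption: one VE chunk is written per feature ID.
    """
    sv, idx, ft, *feature_ids = gvf_value.split()
    return ",".join("|".join((sv, idx, ft, fid)) for fid in feature_ids)


# INFO column taken from the example row in the message above.
info = ("VE=downstream_gene_variant|0|transcript|ENST00000417705,"
        "upstream_gene_variant|0|transcript|ENST00000398216,"
        "nc_transcript_variant|0|ncRNA|ENST00000448179,"
        "non_coding_exon_variant|0|ncRNA|ENST00000448179;"
        "VS_Freq;VS_1000G;dbSNP_137")
effects = parse_ve(parse_info(info)["VE"])
for effect in effects:
    print(effect.sequence_variant, effect.index, effect.feature_id)

# Hypothetical GVF value, built from IDs in the example row for illustration.
gvf_value = "upstream_gene_variant 0 transcript ENST00000398216 ENST00000417705"
print(gvf_effect_to_ve(gvf_value))
```

Because commas separate the effects, a comma-separated list of feature IDs inside a single chunk would be ambiguous; that is why the sketch repeats the effect once per feature ID.
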
Best regards,
Anja Thormann
Ensembl-Variation

On 26 Nov 2012, at 16:47, Will McLaren wrote:

> Hello all,
>
> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor) software - I'd like to give the list our perspective on how we add functional annotations in VCF 4.1 currently.
>
> The VEP parses VCF (alongside other formats) and users can choose to output in VCF format too (though this is not the default, many of our users use it).
>
> The format for the functional data that we use is similar to that described by Petr (I suspect the example he shows is derived from VEP output of some form). We use the CSQ key in the INFO field, with the value consisting of "|" (pipe) separated chunks of data fields; the chunks themselves are separated by commas.
>
> Each chunk contains functional annotation for one alt allele + functional element combination. At the moment a functional element can be a transcript, regulatory feature or transcription factor binding motif.
>
> The fields and their order vary according to which command line options are used (and therefore which additional data is added). The order of fields is defined in a header line added to the VCF. The user may also specify a list of fields that they would like included, somewhat similar to a roll-your-own format.
>
> Missing data are left empty (i.e. you will see two consecutive "|" delimiters if a field is empty).
>
> Example:
>
> ##fileformat=VCFv4.1
> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
> #CHROM POS ID REF ALT QUAL FILTER INFO
> 21 26960070 rs116645811 G A . . CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>
> Some thoughts from me:
>
> - I would definitely prefer to avoid introducing additional whitespace. Currently I am whitespace ambivalent when parsing input; changing this would cause a lot of problems for users without any major benefits that I can see. In the few cases we have to push in data that might have spaces, we replace them with "_" underscores (and commas are replaced by "&" ampersands)
>
> - I would be strongly in favour of enforcing standards on the functional types called - we use Sequence Ontology (SO) types, and we've encouraged UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using them too. The SO guys are very open to contributions if there are types not yet described
>
> - some flexibility in the data fields that go with the functional annotation would be great - we report, for example, SIFT and PolyPhen predictions which are very popular with our users, but there's no reason to suppose these will be the flavours of the day in 1 or 2 years' time. Not to mention the potential expansion in non-coding annotations in a post-ENCODE world. But of course I recognise flexibility scales inversely with ease of parsing
>
> - beyond not wanting to disrupt our users' parsers, I don't have a problem changing the delimiters etc that we currently use
>
> - some of our fields are duplicated as they are specific only to the variant, not particularly to the allele/functional element combination - e.g. the Existing_variation field. This is a hangover of the VCF output being derived from our default output, which has one line per allele/functional element combo.
>   I'd also be in favour of resolving these out somehow to reduce duplications
>
> Regards
>
> Will McLaren
> Ensembl Variation
>
>
> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
> Hi Gonzalo,
>
> I welcome the idea of standardizing the functional annotations. Here is
> an example of a wildly evolved format that we have been using so far:
>
> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT
> alleles from Ensembl 66 VEP v2.4, format
> transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>
> and two concrete examples, the first for a multiallelic site:
>
> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>
> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>
> I am curious what other formats are in use?
>
>
> I'd prefer not to introduce whitespaces in the INFO field or change the
> column delimiters to spaces or extend to whitespaces; it would break
> existing software and wouldn't bring much benefit.
>
> Petr
>
>
>
> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...> wrote:
> > > Hi Bradford,
> > >
> > > I do understand where you're coming from, but truthfully I'd prefer to go in
> > > the opposite direction once we're open to changing delimiters. I've never
> > > quite understood why VCF is tab-delimited and not whitespace-delimited.
> >
> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel - please
> > keep that. It is a good thing!
> >
> > > You wouldn't believe how many times people have manually generated
> > > VCFs that were space-delimited and couldn't understand why they were
> > > failing in VCF parsers.
> >
> > I'd be asking why doesn't your parser give a clearer error message?
> > If you've seen people fall over this pothole many times the parser
> > concerned should be fixed.
> >
> > > I'd much rather that all whitespace be treated equally (as it is
> > > visually). It makes for a much simpler spec.
> >
> > The problem with white space is you can't see how many characters
> > there are - spaces and tabs are not treated equally visually. What
> > would you expect if there were several spaces in a row? If you treat
> > it as one separator you prevent using empty cells (I'm thinking in
> > terms of generalities here, not just VCF).
> >
> > Regards,
> >
> > Peter
> >
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
> _______________________________________________
> VCFtools-spec mailing list
> VCF...@li...
> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
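
The CSQ layout Will describes in the quoted message (pipe-separated fields, comma-separated chunks, field order declared in the header, empty strings for missing data) can be sketched as a small Python reader. This is only an illustration under those stated rules, not VEP's own code: `csq_field_names` and `parse_csq` are hypothetical helper names, and pulling the field order out of the Description text with a regular expression is an assumption about how a consumer might do it.

```python
import re
from typing import Dict, List

# Header line and CSQ value copied from the example in Will's message.
header = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type '
          'as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|'
          'Consequence|cDNA_position|CDS_position|Protein_position|'
          'Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|'
          'SIFT|PolyPhen">')
csq = ("A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|"
       "1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|"
       "benign(0.001)")


def csq_field_names(header_line: str) -> List[str]:
    """Read the field order from the 'Format: ...' part of the header."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format: ...' section found in header line")
    return match.group(1).split("|")


def parse_csq(value: str, names: List[str]) -> List[Dict[str, str]]:
    """Return one dict per chunk; missing data stay as empty strings."""
    records = []
    for chunk in value.split(","):
        values = chunk.split("|")
        if len(values) != len(names):
            raise ValueError("chunk does not match the header field count")
        records.append(dict(zip(names, values)))
    return records


names = csq_field_names(header)
records = parse_csq(csq, names)
print(records[0]["Consequence"], records[0]["SIFT"], records[0]["PolyPhen"])
```

Reading the field order from the header rather than hard-coding it matters here because, as noted in the thread, the fields and their order change with the command line options used.
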