From: Heng Li <lh...@sa...> - 2012-12-03 21:04:14
I agree that we should have allowed spaces in INFO in VCF v4. However, changing it now is a little late. Quite a lot of in-house scripts may have been written assuming no space in INFO because the 4.1 spec mandates so. Allowing spaces will break all of them, which is a serious concern. The best solution now is to simply replace the spaces. It requires just one additional line of code.

Heng

On Dec 3, 2012, at 3:32 PM, Danny Challis wrote:

> Hi all,
> We would like to speak in favor of allowing for spaces in the INFO
> column. Many annotation and analysis pipelines involve copying
> descriptive statements directly from a database which often includes
> whitespace. Rather than requiring all such pipelines to include an
> additional search/replace function to remove spaces, it would be better
> to simply allow the space characters. Aside from the AWK issues which
> are easily dealt with, I don't foresee any serious problems with this
> approach as the space character has no other specified significance in
> VCF.
> Thanks,
>
> Danny.
> HGSC-BCM
>
> On Mon 26 Nov 2012 04:03:43 PM CST, Heng Li wrote:
>> Hyun, what's your annotation format in VCF? Could anyone show an
>> example output of snpeff and/or other popular annotation tools?
>> Perhaps it would be good to start from the existing ones and reach a
>> consensus on how functional annotations should be stored in VCF. We
>> then encourage everyone to stick to the convention. Note that I am
>> thinking about a convention, which requires no changes to the spec.
>>
>> Heng
>>
>> On Nov 26, 2012, at 2:35 PM, Hyun Min Kang wrote:
>>
>>> Both Will's and Petr's suggestions make sense in a human-readable
>>> format. There are a few points I'd like to highlight (not necessarily
>>> with very solid proposals)
>>>
>>> ** Whether to distinguish coding sequence annotation from annotation
>>> of other elements?*
>>> 1. We can separate a CDS annotation (which is associated with
>>> protein changes or alternative splicing) and other annotations
>>> (mostly based on genomic coordinates)
>>> 2. Or we can make a CDS annotation as a 'specialized, and
>>> finer-grained' annotation of a more general form of other genomic
>>> annotations.
>>> - In the former case, we will need two separate annotations
>>> (CDS-specific one, and a general, region-based one) for a coding
>>> variant, and zero to multiple annotations for non-coding variants
>>> - The first approach is probably easier to handle separately
>>> between genomes and exomes, and the benefit of second approaches is
>>> the possibility of more fine-grained non-coding information later on
>>> (such as missense or nonsense variants, some specific mutations
>>> within a region can be characterized)
>>>
>>> ** Whether (and how) to make the "dictionary" of the annotations in
>>> the VCF header for facilitating automated parsing?*
>>> - The two suggested formats are very well readable by human, but in
>>> order to enable automated query without knowing the specific details
>>> on how the annotation was performed, we will need a "dictionary" in
>>> the header. For example, there needs to be a consensus on whether to
>>> use the keyword "Transcript" to represent a transcript (in Ensembl
>>> ID?), and "Gene" to represent a gene (in gene symbol?). Also, whether
>>> to hard-code the hierarchical relationship between different elements
>>> (e.g. Gene-Transcript-Exon-AminoAcid), or allow flexibility there is
>>> an important issue.
>>> - Especially the granularity of the "Consequence" part varies quite
>>> a lot between different annotation softwares. And the consequence can
>>> be structured in a hierarchical way (e.g. a frameshift indel is an
>>> instance of LoF variant). Whether to predefine such rule in the VCF
>>> spec, or have a meta-data representing the relationship between
>>> different types of "functional consequences"
>>>
>>> ** Distinguishing region-based annotations and variant-specific
>>> annotations?*
>>> - Some annotation category requires only the knowledge of REF allele
>>> (e.g. whether the variant overlaps with an exon, or an ENCODE
>>> region), and some category requires the knowledge of both REF and ALT
>>> alleles (e.g. missense and nonsense variants). I prefer to separate
>>> these two types of annotations, or making the former (REF-only) as an
>>> instance of the latter (REF and ALT requiring). The features for
>>> these types of annotations can also be different. For example,
>>> conservation score fits well to the former (and latter) category and
>>> it is possible to make an annotation on non-variant site without
>>> having the ALT allele. However, polyphen scores do not conform to the
>>> former category without knowing the variant allele, so it is specific
>>> to latter category.
>>> - For the former category, it may be worthwhile to annotate every
>>> possible genomic coordinate as some sharable format (e.g. this
>>> includes, gencode and ENCODE annotation, conservation scores, 1000G
>>> masks, ancestral allele information)
>>>
>>> ** Whether to encode the function annotation into VCF only, or make a
>>> separate file if necessary?*
>>> - Writing down all detailed annotation on every variant may be quite
>>> large and redundant in some cases (especially when considering
>>> non-coding annotations). It is also possible to consider to put only
>>> "annotation IDs" in the VCF file and have a separate file describing
>>> the details of the annotations (and the separate file may also
>>> represent the hierarchy of the annotations if necessary).
>>>
>>>
>>> On Mon, Nov 26, 2012 at 11:47 AM, Will McLaren <wm...@eb...
>>> <mailto:wm...@eb...>> wrote:
>>>
>>> Hello all,
>>>
>>> I'm the lead developer on the Ensembl VEP (Variant Effect
>>> Predictor) software - I'd like to give the list our perspective
>>> on how we add functional annotations in VCF 4.1 currently.
>>>
>>> The VEP parses VCF (alongside other formats) and users can choose
>>> to output in VCF format too (though this is not the default, many
>>> of our users use it).
>>>
>>> The format for the functional data that we use is similar to that
>>> described by Petr (I suspect the example he shows is derived from
>>> VEP output of some form). We use the CSQ key in the INFO field,
>>> with the value consisting of "|" (pipe) separated chunks of data
>>> fields; the chunks themselves are separated by commas.
>>>
>>> Each chunk contains functional annotation for one alt allele +
>>> functional element combination. At the moment a functional
>>> element can be a transcript, regulatory feature or transcription
>>> factor binding motif.
>>>
>>> The fields and their order vary according to which command line
>>> options are used (and therefore which additional data is added).
>>> The order of fields is defined in a header line added to the VCF.
>>> The user may also specify a list of fields that they would like
>>> included, somewhat similar to a roll-your-own format.
>>>
>>> Missing data are left empty (i.e. you will see two consecutive
>>> "|" delimiters if a field is empty).
>>>
>>> Example:
>>>
>>> ##fileformat=VCFv4.1
>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>>> #CHROM POS ID REF ALT QUAL FILTER INFO
>>> 21 26960070 rs116645811 G A . . CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>>>
>>> Some thoughts from me:
>>>
>>> - I would definitely prefer to avoid introducing additional
>>> whitespace. Currently I am whitespace ambivalent when parsing
>>> input; changing this would cause a lot of problems for users
>>> without any major benefits that I can see. In the few cases we
>>> have to push in data that might have spaces, we replace them with
>>> "_" underscores (and commas are replaced by "&" ampersands)
>>>
>>> - I would be strongly in favour of enforcing standards on the
>>> functional types called - we use Sequence Ontology (SO) types,
>>> and we've encouraged UCSC (successfully) and NCBI/dbSNP (not yet)
>>> to switch to using them too. The SO guys are very open to
>>> contributions if there are types not yet described
>>>
>>> - some flexibility in the data fields that go with the functional
>>> annotation would be great - we report, for example, SIFT and
>>> PolyPhen predictions which are very popular with our users, but
>>> there's no reason to suppose these will be the flavours of the
>>> day in 1 or 2 years' time. Not to mention the potential expansion
>>> in non-coding annotations in a post-ENCODE world. But of course I
>>> recognise flexibility scales inversely with ease of parsing
>>>
>>> - beyond not wanting to disrupt our users' parsers, I don't have
>>> a problem changing the delimiters etc that we currently use
>>>
>>> - some of our fields are duplicated as they are specific only to
>>> the variant, not particularly to the allele/functional element
>>> combination - e.g. the Existing_variation field. This is a
>>> hangover of the VCF output being derived from our default output,
>>> which has one line per allele/functional element combo. I'd also
>>> be in favour of resolving these out somehow to reduce duplications
>>>
>>> Regards
>>>
>>> Will McLaren
>>> Ensembl Variation
>>>
>>>
>>> On 26 November 2012 12:52, Petr Danecek <pd...@sa...
>>> <mailto:pd...@sa...>> wrote:
>>>
>>> Hi Gonzalo,
>>>
>>> I welcome the idea of standardizing the functional annotations. Here is
>>> an example of a wildly evolved format that we have been using so far:
>>>
>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT alleles from Ensembl 66 VEP v2.4, format transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>>
>>> and two concrete examples, the first for a multiallelic site:
>>>
>>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>>
>>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>>
>>> I am curious what other formats are in use?
>>>
>>>
>>> I'd prefer not to introduce whitespaces in the INFO field or change the
>>> column delimiters to spaces or extend to whitespaces; it would break
>>> existing software and wouldn't bring much benefit.
>>>
>>> Petr
>>>
>>>
>>>
>>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>>> On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...
>>>> <mailto:eb...@br...>> wrote:
>>>>> Hi Bradford,
>>>>>
>>>>> I do understand where you're coming from, but truthfully I'd prefer to go in
>>>>> the opposite direction once we're open to changing delimiters. I've never
>>>>> quite understood why VCF is tab-delimited and not whitespace-delimited.
>>>>
>>>> Tab separated makes it easy to use in Galaxy, R, etc, even Excel - please
>>>> keep that. It is a good thing!
>>>>
>>>>> You wouldn't believe how many times people have manually generated
>>>>> VCFs that were space-delimited and couldn't understand why they were
>>>>> failing in VCF parsers.
>>>>
>>>> I'd be asking why doesn't your parser give a clearer error message?
>>>> If you've seen people fall over this pothole many times the parser
>>>> concerned should be fixed.
>>>>
>>>>> I'd much rather that all whitespace be treated equally (as it is
>>>>> visually). It makes for a much simpler spec.
>>>>
>>>> The problem with white space is you can't see how many characters
>>>> there are - spaces and tabs are not treated equally visually. What
>>>> would you expect if there were several spaces in a row? If you treat
>>>> it as one separator you prevent using empty cells (I'm thinking in
>>>> terms of generalities here, not just VCF).
>>>>
>>>> Regards,
>>>>
>>>> Peter
>>>
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose registered
>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>> _______________________________________________
>>> VCFtools-spec mailing list
>>> VCF...@li...
>>> <mailto:VCF...@li...>
>>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>>
>>> --
>>> -----------------------------------------------------
>>> Hyun Min Kang, Ph.D.
>>> Assistant Professor of Biostatistics
>>> University of Michigan, Ann Arbor
>>> Email : hm...@um... <mailto:hm...@um...>
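
For reference, the two conventions discussed in this thread (replacing spaces before writing INFO, and reading CSQ chunks against the field order given in the header) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names sanitize_info_value and parse_csq are made up for this note, the field order is hard-coded from the Description in Will's example header rather than read from a file, and the free-text string passed to the sanitizer is invented. The replacement characters ("_" for spaces, "&" for commas) follow what Will describes for VEP; other tools may use something else.

```python
"""Sketch only: VEP-style escaping of INFO values and CSQ parsing."""

# Field order copied from the Description of the CSQ header line in Will's example.
CSQ_FORMAT = (
    "Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|"
    "Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|"
    "DISTANCE|SIFT|PolyPhen"
)


def sanitize_info_value(text):
    """Hypothetical helper: spaces become "_" and commas become "&"."""
    return text.replace(" ", "_").replace(",", "&")


def parse_csq(info, csq_format=CSQ_FORMAT):
    """Hypothetical helper: return one dict per comma-separated CSQ chunk."""
    keys = csq_format.split("|")
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            chunks = entry[len("CSQ="):].split(",")
            # Empty fields ("||") come back as empty strings.
            return [dict(zip(keys, chunk.split("|"))) for chunk in chunks]
    return []


# INFO value from the example record in Will's message.
info = (
    "CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|"
    "T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)"
)
records = parse_csq(info)

print(sanitize_info_value("probably damaging, low confidence"))
print(records[0]["Consequence"], records[0]["SIFT"], repr(records[0]["INTRON"]))
```

The sanitizer is the "one additional line" on the writing side. On the reading side, the point is that a parser keyed on the header's field order keeps working when the set of fields changes, as long as the delimiters stay the same.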