From: bradford p. <bra...@gm...> - 2012-11-26 19:48:38
That is a good list of questions. I would also add, keeping in mind the VCF/BCF duality:

*How should functional consequence be encoded in BCF?*
- Encoding the consequences as a string with subdelimiters gives a lot of flexibility and makes it easy to associate multiple consequences with a variant (e.g. a variant that can affect multiple transcripts) or with multiallelic sites (where the different alleles have different consequences). However, should there be a way to keep strict typing within the consequence fields, particularly so that a BCF writer knows what the types should be and can encode numeric values appropriately? It may be a premature optimization to think about this, but with the recent enhanced tool support for BCF (in htslib and GATK) I think more people will be using BCF going forward.

On Mon, Nov 26, 2012 at 1:35 PM, Hyun Min Kang <hm...@um...> wrote:

> Both Will's and Petr's suggestions make sense in a human-readable format.
> There are a few points I'd like to highlight (not necessarily with very
> solid proposals)
>
> *Whether to distinguish coding sequence annotation from annotation of
> other elements?*
> 1. We can separate a CDS annotation (which is associated with protein
> changes or alternative splicing) and other annotations (mostly based on
> genomic coordinates)
> 2. Or we can treat a CDS annotation as a 'specialized, and finer-grained'
> annotation of a more general form of other genomic annotations.
> - In the former case, we will need two separate annotations
> (a CDS-specific one, and a general, region-based one) for a coding variant,
> and zero to multiple annotations for non-coding variants
> - The first approach is probably easier to handle separately between
> genomes and exomes, and the benefit of the second approach is the possibility
> of more fine-grained non-coding information later on (such as missense or
> nonsense variants, some specific mutations within a region can be
> characterized)
>
> *Whether (and how) to make the "dictionary" of the annotations in the
> VCF header for facilitating automated parsing?*
> - The two suggested formats are very readable by humans, but in order
> to enable automated queries without knowing the specific details of how the
> annotation was performed, we will need a "dictionary" in the header. For
> example, there needs to be a consensus on whether to use the keyword
> "Transcript" to represent a transcript (in Ensembl ID?), and "Gene" to
> represent a gene (in gene symbol?). Also, whether to hard-code the
> hierarchical relationship between different elements (e.g.
> Gene-Transcript-Exon-AminoAcid), or allow flexibility there is an important
> issue.
> - The granularity of the "Consequence" part especially varies quite a
> lot between different annotation software. And the consequence can be
> structured in a hierarchical way (e.g. a frameshift indel is an instance of
> a LoF variant). Whether to predefine such rules in the VCF spec, or to have
> metadata representing the relationship between different types of
> "functional consequences", is an open question.
>
> *Distinguishing region-based annotations and variant-specific
> annotations?*
> - Some annotation categories require only knowledge of the REF allele
> (e.g. whether the variant overlaps with an exon, or an ENCODE region), and
> some categories require knowledge of both REF and ALT alleles (e.g.
> missense and nonsense variants).
> I prefer to separate these two types of
> annotations, or to make the former (REF-only) an instance of the latter
> (requiring REF and ALT). The features for these types of annotations can
> also be different. For example, a conservation score fits well in the former
> (and latter) category and it is possible to make an annotation on a
> non-variant site without having the ALT allele. However, PolyPhen scores do
> not conform to the former category without knowing the variant allele, so
> they are specific to the latter category.
> - For the former category, it may be worthwhile to annotate every
> possible genomic coordinate in some sharable format (e.g. GENCODE and
> ENCODE annotations, conservation scores, 1000G masks, ancestral
> allele information)
>
> *Whether to encode the function annotation into VCF only, or make a
> separate file if necessary?*
> - Writing down all detailed annotation on every variant may be quite
> large and redundant in some cases (especially when considering non-coding
> annotations). It is also possible to put only "annotation IDs"
> in the VCF file and have a separate file describing the details of the
> annotations (and the separate file may also represent the hierarchy of the
> annotations if necessary).
>
>
> On Mon, Nov 26, 2012 at 11:47 AM, Will McLaren <wm...@eb...> wrote:
>
>> Hello all,
>>
>> I'm the lead developer on the Ensembl VEP (Variant Effect Predictor)
>> software - I'd like to give the list our perspective on how we add
>> functional annotations in VCF 4.1 currently.
>>
>> The VEP parses VCF (alongside other formats) and users can choose to
>> output in VCF format too (though this is not the default, many of our users
>> use it).
>>
>> The format for the functional data that we use is similar to that
>> described by Petr (I suspect the example he shows is derived from VEP
>> output of some form).
>> We use the CSQ key in the INFO field, with the value
>> consisting of chunks of "|" (pipe) separated data fields; the chunks
>> themselves are separated by commas.
>>
>> Each chunk contains functional annotation for one alt allele + functional
>> element combination. At the moment a functional element can be a
>> transcript, regulatory feature or transcription factor binding motif.
>>
>> The fields and their order vary according to which command line options
>> are used (and therefore which additional data is added). The order of
>> fields is defined in a header line added to the VCF. The user may also
>> specify a list of fields that they would like included, somewhat similar to
>> a roll-your-own format.
>>
>> Missing data are left empty (i.e. you will see two consecutive "|"
>> delimiters if a field is empty).
>>
>> Example:
>>
>> ##fileformat=VCFv4.1
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>> #CHROM POS ID REF ALT QUAL FILTER INFO
>> 21 26960070 rs116645811 G A . . CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>>
>> Some thoughts from me:
>>
>> - I would definitely prefer to avoid introducing additional whitespace.
>> Currently I am whitespace ambivalent when parsing input; changing this
>> would cause a lot of problems for users without any major benefits that I
>> can see.
>> In the few cases we have to push in data that might have spaces,
>> we replace them with "_" underscores (and commas are replaced by "&"
>> ampersands)
>>
>> - I would be strongly in favour of enforcing standards on the functional
>> types called - we use Sequence Ontology (SO) types, and we've encouraged
>> UCSC (successfully) and NCBI/dbSNP (not yet) to switch to using them too.
>> The SO guys are very open to contributions if there are types not yet
>> described
>>
>> - some flexibility in the data fields that go with the functional
>> annotation would be great - we report, for example, SIFT and PolyPhen
>> predictions which are very popular with our users, but there's no reason to
>> suppose these will be the flavours of the day in 1 or 2 years' time. Not to
>> mention the potential expansion in non-coding annotations in a post-ENCODE
>> world. But of course I recognise flexibility scales inversely with ease of
>> parsing
>>
>> - beyond not wanting to disrupt our users' parsers, I don't have a
>> problem changing the delimiters etc that we currently use
>>
>> - some of our fields are duplicated as they are specific only to the
>> variant, not particularly to the allele/functional element combination -
>> e.g. the Existing_variation field. This is a hangover of the VCF output
>> being derived from our default output, which has one line per
>> allele/functional element combo. I'd also be in favour of resolving these
>> out somehow to reduce duplications
>>
>> Regards
>>
>> Will McLaren
>> Ensembl Variation
>>
>>
>> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>>
>>> Hi Gonzalo,
>>>
>>> I welcome the idea of standardizing the functional annotations.
>>> Here is
>>> an example of a wildly evolved format that we have been using so far:
>>>
>>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence of the ALT alleles from Ensembl 66 VEP v2.4, format transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>>
>>> and two concrete examples, the first for a multiallelic site:
>>>
>>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>>>
>>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>>
>>>
>>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>>
>>> I am curious what other formats are in use?
>>>
>>>
>>> I'd prefer not to introduce whitespace in the INFO field or change the
>>> column delimiters to spaces or extend to whitespace; it would break
>>> existing software and wouldn't bring much benefit.
>>>
>>> Petr
>>>
>>>
>>>
>>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...> wrote:
>>> > > Hi Bradford,
>>> > >
>>> > > I do understand where you're coming from, but truthfully I'd prefer to go in
>>> > > the opposite direction once we're open to changing delimiters. I've never
>>> > > quite understood why VCF is tab-delimited and not whitespace-delimited.
>>> >
>>> > Tab separated makes it easy to use in Galaxy, R, etc, even Excel - please
>>> > keep that. It is a good thing!
>>> >
>>> > > You wouldn't believe how many times people have manually generated
>>> > > VCFs that were space-delimited and couldn't understand why they were
>>> > > failing in VCF parsers.
>>> >
>>> > I'd be asking why doesn't your parser give a clearer error message?
>>> > If you've seen people fall over this pothole many times the parser
>>> > concerned should be fixed.
>>> >
>>> > > I'd much rather that all whitespace be treated equally (as it is
>>> > > visually). It makes for a much simpler spec.
>>> >
>>> > The problem with white space is you can't see how many characters
>>> > there are - spaces and tabs are not treated equally visually. What
>>> > would you expect if there were several spaces in a row? If you treat
>>> > it as one separator you prevent using empty cells (I'm thinking in
>>> > terms of generalities here, not just VCF).
>>> >
>>> > Regards,
>>> >
>>> > Peter
>>>
>>>
>>> --
>>> The Wellcome Trust Sanger Institute is operated by Genome Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose registered
>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Monitor your physical, virtual and cloud infrastructure from a single
>>> web console. Get in-depth insight into apps, servers, databases, vmware,
>>> SAP, cloud infrastructure, etc. Download 30-day Free Trial.
>>> Pricing starts from $795 for 25 servers or applications!
>>> http://p.sf.net/sfu/zoho_dev2dev_nov
>>> _______________________________________________
>>> VCFtools-spec mailing list
>>> VCF...@li...
>>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>>
>>
>
> --
> -----------------------------------------------------
> Hyun Min Kang, Ph.D.
> Assistant Professor of Biostatistics
> University of Michigan, Ann Arbor
> Email : hm...@um...
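
A footnote to Will's description of the CSQ field quoted above: a minimal Python sketch of how a reader might unpack it, using the header and record from his example. The function names (csq_field_names, parse_csq, convert) and the FIELD_TYPES map are hypothetical, not part of VEP or the VCF spec; the type map only illustrates the kind of per-field typing a BCF writer would need to know about, as raised at the top of this message.

```python
import re

# Header and INFO value copied from the VEP example in the thread.
HEADER = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as '
          'predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|'
          'cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|'
          'Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">')
INFO = ('CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|'
        '1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)')

# Hypothetical type hints; neither VEP nor VCF 4.1 declares sub-field types.
FIELD_TYPES = {"cDNA_position": int, "CDS_position": int,
               "Protein_position": int, "DISTANCE": int}


def csq_field_names(header_line):
    """Read the pipe-separated field order out of the CSQ header description."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format:' list found in the CSQ header line")
    return match.group(1).strip().split("|")


def convert(raw, caster):
    """Cast a value, keeping the raw string if the cast fails (e.g. a range)."""
    try:
        return caster(raw)
    except ValueError:
        return raw


def parse_csq(info, names, types=None):
    """Return one dict per chunk (allele + functional element combination)."""
    types = types or {}
    records = []
    for entry in info.split(";"):          # INFO entries are ';'-separated
        key, sep, value = entry.partition("=")
        if key != "CSQ" or not sep:
            continue
        for chunk in value.split(","):     # chunks are comma-separated
            values = chunk.split("|")      # fields are pipe-separated
            if len(values) != len(names):
                raise ValueError("chunk does not match the header field list")
            record = {}
            for name, raw in zip(names, values):
                # Empty string between two pipes means missing data.
                record[name] = None if raw == "" else convert(raw, types.get(name, str))
            records.append(record)
    return records


names = csq_field_names(HEADER)
records = parse_csq(INFO, names, FIELD_TYPES)
print(records[0]["Consequence"], records[0]["Protein_position"], records[0]["INTRON"])
```

Splitting on "," and "|" without any escaping relies on the convention Will describes, where commas inside values are replaced by "&" and spaces by "_". Reading the field order from the header rather than hard-coding it matters because, as he notes, the fields and their order depend on the command line options used.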