From: Danny C. <ch...@bc...> - 2012-12-03 20:32:35
Hi all,
We would like to speak in favor of allowing spaces in the INFO
column. Many annotation and analysis pipelines copy descriptive
statements directly from a database, and these often include
whitespace. Rather than requiring all such pipelines to add a
search/replace step to remove spaces, it would be better to simply
allow the space character. Aside from the AWK issues, which are easily
dealt with, I don't foresee any serious problems with this approach, as
the space character has no other specified significance in VCF.
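
For context, the search/replace step in question can be as small as the
following Python sketch. It assumes the underscore and ampersand
substitutions that Will describes further down this thread; the function
name sanitize_info_value is made up for illustration and is not part of
any existing tool.

```python
def sanitize_info_value(text: str) -> str:
    """Make a free-text description safe to store in a VCF INFO value.

    Assumed convention (taken from the thread, not from the VCF spec):
    whitespace becomes "_" and commas become "&".
    """
    # str.split() with no argument splits on any run of whitespace and
    # drops leading/trailing whitespace, so runs collapse to one "_".
    cleaned = "_".join(text.split())
    # Commas separate list items inside an INFO value, so replace them.
    return cleaned.replace(",", "&")


if __name__ == "__main__":
    print(sanitize_info_value("zinc finger protein, putative"))
```

The substitution is lossy: an underscore in the output may have been a
space or a run of spaces in the database, which is part of the cost
every pipeline pays for this step.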
Thanks,
Danny.
HGSC-BCM
On Mon 26 Nov 2012 04:03:43 PM CST, Heng Li wrote:
> Hyun, what's your annotation format in VCF? Could anyone show an
> example output of snpeff and/or other popular annotation tools?
> Perhaps it would be good to start from the existing ones and reach a
> consensus on how functional annotations should be stored in VCF. We
> then encourage everyone to stick to the convention. Note that I am
> thinking about a convention, which requires no changes to the spec.
>
> Heng
>
> On Nov 26, 2012, at 2:35 PM, Hyun Min Kang wrote:
>
>> Both Will's and Petr's suggestions make sense in a human-readable
>> format. There are a few points I'd like to highlight (not necessarily
>> with very solid proposals)
>>
>> *Whether to distinguish coding sequence annotation from annotation
>> of other elements?*
>> 1. We can separate a CDS annotation (which is associated with
>> protein changes or alternative splicing) from other annotations
>> (mostly based on genomic coordinates).
>> 2. Or we can treat a CDS annotation as a specialized, finer-grained
>> instance of a more general form of genomic annotation.
>> - In the former case, we will need two separate annotations
>> (a CDS-specific one and a general, region-based one) for a coding
>> variant, and zero to multiple annotations for non-coding variants.
>> - The first approach is probably easier to handle separately
>> for genomes and exomes. The benefit of the second approach is
>> the possibility of more fine-grained non-coding information later on
>> (just as coding variants are classified as missense or nonsense,
>> specific mutations within a region could be characterized).
>>
>> *Whether (and how) to make a "dictionary" of the annotations in
>> the VCF header to facilitate automated parsing?*
>> - The two suggested formats are very readable by humans, but to
>> enable automated queries without knowing the specific details
>> of how the annotation was performed, we will need a "dictionary" in
>> the header. For example, there needs to be a consensus on whether to
>> use the keyword "Transcript" to represent a transcript (as an Ensembl
>> ID?), and "Gene" to represent a gene (as a gene symbol?). Whether
>> to hard-code the hierarchical relationship between different elements
>> (e.g. Gene-Transcript-Exon-AminoAcid) or to allow flexibility there is
>> also an important issue.
>> - The granularity of the "Consequence" part in particular varies quite
>> a lot between annotation tools. The consequences can also
>> be structured hierarchically (e.g. a frameshift indel is an
>> instance of a LoF variant). The question is whether to predefine such
>> rules in the VCF spec, or to have metadata representing the
>> relationships between different types of "functional consequences".
>>
>> *Distinguishing region-based annotations and variant-specific
>> annotations?*
>> - Some annotation categories require only knowledge of the REF allele
>> (e.g. whether the variant overlaps an exon or an ENCODE
>> region), and some require knowledge of both the REF and ALT
>> alleles (e.g. missense and nonsense variants). I prefer to separate
>> these two types of annotations, or to treat the former (REF-only) as a
>> special case of the latter (requiring REF and ALT). The features for
>> these types of annotations can also differ. For example, a
>> conservation score fits the former category (and also the latter), and
>> it is possible to annotate a non-variant site without
>> having the ALT allele. PolyPhen scores, however, cannot be computed
>> without knowing the variant allele, so they are specific
>> to the latter category.
>> - For the former category, it may be worthwhile to annotate every
>> possible genomic coordinate in some shareable format (e.g. this
>> includes GENCODE and ENCODE annotation, conservation scores, 1000G
>> masks, and ancestral allele information).
>>
>> *Whether to encode the functional annotation in the VCF only, or in a
>> separate file if necessary?*
>> - Writing out every detailed annotation for every variant may become
>> quite large and redundant in some cases (especially when considering
>> non-coding annotations). Another option is to put only
>> "annotation IDs" in the VCF file and keep a separate file describing
>> the details of the annotations (the separate file could also
>> represent the hierarchy of the annotations if necessary).
>>
>>
>> On Mon, Nov 26, 2012 at 11:47 AM, Will McLaren <wm...@eb...> wrote:
>>
>> Hello all,
>>
>> I'm the lead developer on the Ensembl VEP (Variant Effect
>> Predictor) software - I'd like to give the list our perspective
>> on how we add functional annotations in VCF 4.1 currently.
>>
>> The VEP parses VCF (alongside other formats) and users can choose
>> to output in VCF format too (though this is not the default, many
>> of our users use it).
>>
>> The format for the functional data that we use is similar to that
>> described by Petr (I suspect the example he shows is derived from
>> VEP output of some form). We use the CSQ key in the INFO field,
>> with the value consisting of "|" (pipe) separated chunks of data
>> fields; the chunks themselves are separated by commas.
>>
>> Each chunk contains functional annotation for one alt allele +
>> functional element combination. At the moment a functional
>> element can be a transcript, regulatory feature or transcription
>> factor binding motif.
>>
>> The fields and their order vary according to which command line
>> options are used (and therefore which additional data is added).
>> The order of fields is defined in a header line added to the VCF.
>> The user may also specify a list of fields that they would like
>> included, somewhat similar to a roll-your-own format.
>>
>> Missing data are left empty (i.e. you will see two consecutive
>> "|" delimiters if a field is empty).
>>
>> Example:
>>
>> ##fileformat=VCFv4.1
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type
>> as predicted by VEP. Format:
>> Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|SIFT|PolyPhen">
>> #CHROM POS ID REF ALT QUAL FILTER INFO
>> 21 26960070 rs116645811 G A . . CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||tolerated(0.05)|benign(0.001)
>>
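
The layout described above (field order declared in the header, chunks
separated by commas, fields within a chunk separated by pipes, empty
fields left blank) can be read back with a few lines of Python. This is
only a sketch built from the example in this message: the function names
csq_field_names and parse_csq are invented for illustration, it is not
VEP's own code, and it assumes the header Description contains the text
"Format:" followed by the pipe-separated field names.

```python
HEADER = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type '
          'as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|'
          'Consequence|cDNA_position|CDS_position|Protein_position|'
          'Amino_acids|Codons|Existing_variation|EXON|INTRON|DISTANCE|'
          'SIFT|PolyPhen">')

INFO = ('CSQ=A|ENSG00000154719|ENST00000307301|Transcript|missense_variant|'
        '1043|1001|334|T/M|aCg/aTg|rs116645811|10/11|||'
        'tolerated(0.05)|benign(0.001)')


def csq_field_names(header_line):
    """Return the field names listed after 'Format:' in the CSQ header."""
    after_format = header_line.split("Format:", 1)[1]
    # The list of names ends at the closing quote of the Description.
    names = after_format.split('"', 1)[0]
    return [name.strip() for name in names.split("|")]


def parse_csq(info, names):
    """Return one dict per CSQ chunk; an absent CSQ key gives an empty list."""
    for entry in info.split(";"):          # INFO entries are ";"-separated
        if entry.startswith("CSQ="):
            value = entry[len("CSQ="):]
            break
    else:
        return []
    # Chunks are comma-separated; fields inside a chunk are pipe-separated.
    # Empty fields come back as empty strings.
    return [dict(zip(names, chunk.split("|"))) for chunk in value.split(",")]


if __name__ == "__main__":
    for record in parse_csq(INFO, csq_field_names(HEADER)):
        print(record["Feature"], record["Consequence"], record["SIFT"])
```

Because the field order comes from the header rather than being
hard-coded, the same reader keeps working when command line options
change which fields are present.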
>> Some thoughts from me:
>>
>> - I would definitely prefer to avoid introducing additional
>> whitespace. Currently I am whitespace-agnostic when parsing
>> input; changing this would cause a lot of problems for users
>> without any major benefits that I can see. In the few cases where we
>> have to push in data that might contain spaces, we replace them with
>> "_" underscores (and commas are replaced by "&" ampersands)
>>
>> - I would be strongly in favour of enforcing standards on the
>> functional types called - we use Sequence Ontology (SO) types,
>> and we've encouraged UCSC (successfully) and NCBI/dbSNP (not yet)
>> to switch to using them too. The SO guys are very open to
>> contributions if there are types not yet described
>>
>> - some flexibility in the data fields that go with the functional
>> annotation would be great - we report, for example, SIFT and
>> PolyPhen predictions which are very popular with our users, but
>> there's no reason to suppose these will be the flavours of the
>> day in 1 or 2 years' time. Not to mention the potential expansion
>> in non-coding annotations in a post-ENCODE world. But of course I
>> recognise flexibility scales inversely with ease of parsing
>>
>> - beyond not wanting to disrupt our users' parsers, I don't have
>> a problem changing the delimiters etc. that we currently use
>>
>> - some of our fields are duplicated because they are specific only to
>> the variant, not to the allele/functional element
>> combination - e.g. the Existing_variation field. This is a
>> hangover from the VCF output being derived from our default output,
>> which has one line per allele/functional element combo. I'd also
>> be in favour of factoring these out somehow to reduce duplication
>>
>> Regards
>>
>> Will McLaren
>> Ensembl Variation
>>
>>
>> On 26 November 2012 12:52, Petr Danecek <pd...@sa...> wrote:
>>
>> Hi Gonzalo,
>>
>> I welcome the idea of standardizing the functional
>> annotations. Here is
>> an example of a wildly evolved format that we have been using
>> so far:
>>
>> ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence
>> of the ALT
>> alleles from Ensembl 66 VEP v2.4, format
>> transcriptId:geneName:consequence[:codingSeqPosition:proteinPosition:proteinAlleles:proteinPredictions]+...[+gerpScore]">
>>
>> and two concrete examples, the first for a multiallelic site:
>>
>> CSQ=621:S>R:Grantham,110:Allele,C:Gene,Ssh3
>> +ENSMUST00000037992:ENSMUSG00000034616:SYNONYMOUS_CODING:1863:621:S>S:Allele,A:Gene,Ssh3
>>
>> CSQ=ENST00000382410:DEFB125:NON_SYNONYMOUS_CODING:184:62:H>Y:SIFT,tolerated(0.41):PolyPhen,benign(0):Condel,neutral(0.015):Grantham,83
>>
>> I am curious what other formats are in use?
>>
>>
>> I'd prefer not to introduce whitespace in the INFO field, or to
>> change the column delimiter to spaces, or to extend it to any
>> whitespace; it would break existing software and wouldn't bring
>> much benefit.
>>
>> Petr
>>
>>
>>
>> On Mon, 2012-11-26 at 11:20 +0000, Peter Cock wrote:
>> > On Mon, Nov 26, 2012 at 5:12 AM, Eric Banks <eb...@br...> wrote:
>> > > Hi Bradford,
>> > >
>> > > I do understand where you're coming from, but truthfully
>> I'd prefer to go in
>> > > the opposite direction once we're open to changing
>> delimiters. I've never
>> > > quite understood why VCF is tab-delimited and not
>> whitespace-delimited.
>> >
>> > Tab separated makes it easy to use in Galaxy, R, etc, even
>> Excel - please
>> > keep that. It is a good thing!
>> >
>> > > You wouldn't believe how many times people have manually
>> generated
>> > > VCFs that were space-delimited and couldn't understand
>> why they were
>> > > failing in VCF parsers.
>> >
>> > I'd be asking why doesn't your parser give a clearer error
>> message?
>> > If you've seen people fall over this pothole many times the
>> parser
>> > concerned should be fixed.
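
A clearer error of the kind asked for here is cheap to produce. The
following Python sketch shows one way a parser might tell a
space-delimited line apart from a line that is simply too short. The
function name split_vcf_record and the default of 8 fixed columns are
assumptions made for illustration, not the behaviour of any particular
VCF parser.

```python
def split_vcf_record(line, expected_columns=8):
    """Split one VCF data line on tabs, with a targeted error message."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= expected_columns:
        return fields
    # Too few tab-separated columns: check whether splitting on any
    # whitespace would have produced enough, which suggests the file
    # was written with spaces instead of tabs.
    if len(line.split()) >= expected_columns:
        raise ValueError(
            "line appears to be space-delimited; "
            "VCF columns must be separated by tab characters"
        )
    raise ValueError(
        f"expected at least {expected_columns} tab-separated columns, "
        f"found {len(fields)}"
    )


if __name__ == "__main__":
    try:
        split_vcf_record("21 26960070 rs116645811 G A . . DP=10")
    except ValueError as error:
        print(error)
```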
>> >
>> > > I'd much rather that all whitespace be treated equally
>> (as it is
>> > > visually). It makes for a much simpler spec.
>> >
>> > The problem with white space is you can't see how many
>> characters
>> > there are - spaces and tabs are not treated equally
>> visually. What
>> > would you expect if there were several spaces in a row? If
>> you treat
>> > it as one separator you prevent using empty cells (I'm
>> thinking in
>> > terms of generalities here, not just VCF).
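
The empty-cell point is easy to demonstrate in Python. The row below is
a made-up generic tab-separated line with its third cell left empty; it
is not meant as a valid VCF record, where a missing value is written
as ".".

```python
# A generic tab-separated row whose third cell is intentionally empty.
row = "21\t26960070\t\tG\tA"

# Splitting on the tab character keeps the empty cell in place.
by_tab = row.split("\t")

# Splitting on any run of whitespace silently drops the empty cell,
# so every later column shifts left by one.
by_whitespace = row.split()

if __name__ == "__main__":
    print(len(by_tab), by_tab)
    print(len(by_whitespace), by_whitespace)
```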
>> >
>> > Regards,
>> >
>> > Peter
>>
>>
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome
>> Research
>> Limited, a charity registered in England with number 1021457
>> and a
>> company registered in England with number 2742969, whose
>> registered
>> office is 215 Euston Road, London, NW1 2BE.
>>
>> _______________________________________________
>> VCFtools-spec mailing list
>> VCF...@li...
>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
>>
>>
>> --
>> -----------------------------------------------------
>> Hyun Min Kang, Ph.D.
>> Assistant Professor of Biostatistics
>> University of Michigan, Ann Arbor
>> Email : hm...@um...
>>
>