From: Eric B. <eb...@br...> - 2012-11-25 16:54:50
I personally use awk with my VCFs all the time. It would be very
unfortunate if we lost the ability to do simple awk queries on VCFs
because we start to allow whitespace in the INFO field.
I don't see why the spec should change to conform to the random
tools out there that don't adhere to 4.1. It's just not a compelling
reason. I'd much rather maintain the ability to use awk and require
that external tools adhere to the official spec.
-e
On 11/24/12 11:11 PM, bradford powell wrote:
> Since changes for 4.2 have been brought up, I suggest the following:
>
> Section 3, #8 contains the text:
> 8. INFO additional information: (String, no white-space, semi-colons,
> or equals-signs permitted; commas are permitted only as delimiters for
> lists of values) INFO fields are encoded as a semicolon-separated
> series of short keys with optional values in the format:
> <key>=<data>[,data]. ...
>
> I request that it be changed to:
> 8. INFO additional information: (String, no tabs, semi-colons, or
> equals-signs permitted; commas are permitted only as delimiters for
> lists of values) INFO fields are encoded as a semicolon-separated
> series of short keys with optional values in the format:
> <key>=<data>[,data]. ...
>
> (that is, change "white-space" to "tabs")
>
> The justification:
> There are some tools in the wild that use the info field to store
> functional annotation information including brief descriptions of the
> expected function of the gene. Well, at least I've written one and I
> know of a few others. These tools do not follow the 4.1 spec unless
> the descriptions are mangled (by changing space characters to
> underscores, for instance).
>
> While the other characters listed (tabs, semi-colons, equals-signs and
> commas) serve a syntactic purpose in the vcf file (to delimit overall
> fields, INFO fields, keys from values and lists of data respectively),
> I would argue that there is no need to exclude the space character
> (ascii 0x20) since the tab is already being used as a delimiter.
>
> Other implications (downsides?):
> One could not use just "awk" any more, but would need to use "awk -F$'\t'"
> In perl, "split /\t/" instead of "split"
> In python, "line.split('\t')" instead of "line.split()"
> ... etc.
>
> Again, I would argue that splitting on tab characters is the proper
> thing to do anyway, as tab is specified as the separator of fixed
> fields (not "any whitespace character").
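To make the trade-off above concrete, here is a minimal sketch in Python of the two splitting strategies. The record is made up for illustration (it is not from any real VCF), and the helper names are hypothetical. Its INFO value contains a space, which the 4.1 wording forbids and the proposed wording would allow.

```python
# Hypothetical VCF data line with the eight fixed fields, tab-separated.
# The INFO value (last field) contains spaces, as the proposal would permit.
record = "1\t12345\trs1\tA\tG\t50\tPASS\tDP=10;DESC=zinc finger protein"

INFO_COLUMN = 7  # zero-based index of INFO among the fixed fields


def info_by_whitespace(line):
    """Split on any run of whitespace, like bare awk or line.split()."""
    return line.split()[INFO_COLUMN]


def info_by_tab(line):
    """Split on the tab character only, like a tab-delimited awk call."""
    return line.rstrip("\n").split("\t")[INFO_COLUMN]


print(info_by_whitespace(record))  # DP=10;DESC=zinc
print(info_by_tab(record))         # DP=10;DESC=zinc finger protein
```

With whitespace splitting the INFO value is cut at the first space and the remaining words spill into extra columns, while tab splitting returns the field whole. On records that follow the 4.1 wording, both approaches give the same result.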
>
> Discussion...?
>
> -- Bradford
>
> _______________________________________________
> VCFtools-spec mailing list
> VCF...@li...
> https://lists.sourceforge.net/lists/listinfo/vcftools-spec
--
Eric Banks, PhD
Broad Institute of Harvard and MIT
eb...@br...
617-714-7636