From: bradford p. <bra...@gm...> - 2012-11-26 02:33:23
Again, this change would not prevent the use of awk, but would require you
to specify the awk input field separator. So the question should not be
"Are we willing to give up the ability to use awk to allow spaces in INFO?"
but instead "Are we willing to require awk users to specify an input field
separator to allow spaces in INFO?" I want to clarify this so people do not
have the wrong impression in their minds about the impact of this change.
If the scripts you are writing are stand-alone awk scripts (i.e., meant to
be kept around and run multiple times), then you can set FS in your
BEGIN block. If the issue is convenience on the command line, then you
could always `alias tawk="awk -F $'\t'"` in your .bashrc (if you use bash)
and retrain your fingers to call tawk instead of awk...
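
To make the difference concrete, here is a minimal Python sketch (an illustration, not something from the spec). The record is a made-up VCF data line whose INFO value contains spaces, and the helper names fields_by_tab and fields_by_whitespace are hypothetical. Python's no-argument split() stands in for awk's default field splitting, which it only approximates.

```python
# Hypothetical VCF data line (illustrative values only); the INFO column
# carries a free-text description that contains space characters.
record = "\t".join([
    "1", "12345", "rs1", "A", "G", "50", "PASS",
    "DP=10;DESC=tumor protein p53",
])


def fields_by_tab(line):
    """Split a VCF data line on tab characters only."""
    return line.rstrip("\n").split("\t")


def fields_by_whitespace(line):
    """Split on any run of whitespace, roughly like awk's default FS."""
    return line.split()


tab_fields = fields_by_tab(record)
ws_fields = fields_by_whitespace(record)

# Column 8 (index 7) is INFO in the fixed-field layout.
print(len(tab_fields), repr(tab_fields[7]))
print(len(ws_fields), repr(ws_fields[7]))
```

Under these assumptions, splitting on tabs keeps the eight fixed fields intact, while whitespace splitting cuts the INFO column at its first space and shifts every later column.
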
One reason for having a specification is to allow "random tools" to
interact with each other, and the reason I brought this up is because I
would like for what I write to adhere to the VCF specification.
It is admittedly difficult to write a specification to encompass the right
level of flexibility. If too flexible, it is just too complex for anyone to
implement and use. If too simple, people will end up developing other
specifications to fill the void, leading to a proliferation of competing
standards. I think that loosening the character restrictions in this way
(allowing space characters) would allow for useful functionality that might
even prevent the need for a separate functional annotation format.
-- Bradford
On Sun, Nov 25, 2012 at 10:54 AM, Eric Banks <eb...@br...> wrote:
> I personally use awk with my VCFs all the time. It would be very
> unfortunate if we lost the ability to do simple awk queries on VCFs because
> we start to allow whitespace in the INFO field.
> I don't see why we should change the spec to conform to the random
> tools out there that don't adhere to 4.1. It's just not a compelling
> reason. I'd much rather maintain the ability to use awk and require that
> external tools adhere to the official spec.
> -e
>
>
> On 11/24/12 11:11 PM, bradford powell wrote:
>
> Since changes for 4.2 have been brought up, I suggest the following:
>
> Section 3, #8 contains the text:
> 8. INFO additional information: (String, no white-space, semi-colons, or
> equals-signs permitted; commas are permitted only as delimiters for lists
> of values) INFO fields are encoded as a semicolon-separated series of short
> keys with optional values in the format: <key>=<data>[,data]. ...
>
> I request that it be changed to:
> 8. INFO additional information: (String, no tabs, semi-colons, or
> equals-signs permitted; commas are permitted only as delimiters for lists
> of values) INFO fields are encoded as a semicolon-separated series of short
> keys with optional values in the format: <key>=<data>[,data]. ...
>
> (that is, change "white-space" to "tabs")
>
> The justification:
> There are some tools in the wild that use the info field to store
> functional annotation information including brief descriptions of the
> expected function of the gene. Well, at least I've written one and I know
> of a few others. These tools do not follow the 4.1 spec unless the
> descriptions are mangled (by changing space characters to underscores, for
> instance).
>
> While the other characters listed (tabs, semi-colons, equals-signs and
> commas) serve a syntactic purpose in the VCF file (to delimit overall
> fields, INFO fields, keys from values and lists of data respectively), I
> would argue that there is no need to exclude the space character (ascii
> 0x20) since the tab is already being used as a delimiter.
>
> Other implications (downsides?):
> One could not use just "awk" any more, but would need to use "awk -F$'\t'"
> In Perl, "split /\t/" instead of "split"
> In Python, "line.split('\t')" instead of "line.split()"
> ... etc.
>
> Again, I would argue that splitting on tab characters is the proper
> thing to do anyway, as tab is specified as the separator of fixed fields
> (not "any whitespace character").
>
> Discussion...?
>
> -- Bradford
>
>
>
> --
> Eric Banks, PhD
> Broad Institute of Harvard and MIT
> ebanks@broadinstitute.org
> 617-714-7636
>
>