From: Eric B. <eb...@br...> - 2012-11-26 05:12:55
Hi Bradford,
I do understand where you're coming from, but truthfully I'd prefer to
go in the opposite direction once we're open to changing delimiters.
I've never quite understood why VCF is tab-delimited and not
whitespace-delimited. You wouldn't believe how many times people have
manually generated VCFs that were space-delimited and couldn't
understand why they were failing in VCF parsers. I'd much rather that
all whitespace be treated equally (as it is visually). It makes for a
much simpler spec.
-e
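The failure mode described above (a hand-made, space-delimited file rejected by a tab-splitting parser) can be caught early with a simple field-count check. The following is only a rough Python sketch: the function name `looks_space_delimited` and the two sample lines are hypothetical, and the threshold of 8 is the number of fixed columns in a VCF data line.

```python
N_FIXED = 8  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO

def looks_space_delimited(line: str) -> bool:
    """Heuristic (an assumption, not part of any spec): the line has too few
    tab-separated fields but enough whitespace-separated ones."""
    line = line.rstrip("\n")
    return len(line.split("\t")) < N_FIXED and len(line.split()) >= N_FIXED

# Made-up records for illustration only.
good = "1\t12345\trs1\tA\tG\t50\tPASS\tDP=10"
bad = "1 12345 rs1 A G 50 PASS DP=10"

print(looks_space_delimited(good))
print(looks_space_delimited(bad))
```

A check like this only flags the likely cause; it does not decide whether the spec should accept such files.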
On 11/25/12 9:33 PM, bradford powell wrote:
> Again, this change would not prevent the use of awk, but would require
> you to specify the awk input field separator. So the question should
> not be "Are we willing to give up the ability to use awk to allow
> spaces in INFO?" but instead "Are we willing to require awk users to
> specify an input field separator to allow spaces in INFO?" I want to
> clarify this so people do not have the wrong impression in their minds
> about the impact of this change.
>
> If the scripts you are writing are stand-alone awk scripts (i.e. meant
> to be kept around and run multiple times) then you can specify FS in
> your begin block. If the issue is convenience from the command line,
> then you could always `alias tawk="awk -F $'\t'"` in your .bashrc (if
> you use bash) and retrain your fingers to call tawk instead of awk...
>
> One reason for having a specification is to allow "random tools" to
> interact with each other, and the reason I brought this up is because
> I would like for what I write to adhere to the VCF specification.
>
> It is admittedly difficult to write a specification to encompass the
> right level of flexibility. If too flexible, it is just too complex
> for anyone to implement and use. If too simple, then people will end
> up developing other specifications to fill in the void, leading to
> proliferation of an excess of standards. I think that loosening the
> character restrictions in this way (allowing space characters) would
> allow for useful functionality that might even prevent the need for a
> separate functional annotation format.
>
> -- Bradford
>
> On Sun, Nov 25, 2012 at 10:54 AM, Eric Banks
> <eb...@br...> wrote:
>
> I personally use awk with my VCFs all the time. It would be very
> unfortunate if we lost the ability to do simple awk queries on
> VCFs because we start to allow whitespace in the INFO field.
> I don't see why we should change the spec to conform to the
> random tools out there that don't adhere to 4.1. It's just not a
> compelling reason. I'd much rather maintain the ability to use
> awk and require that external tools adhere to the official spec.
> -e
>
>
> On 11/24/12 11:11 PM, bradford powell wrote:
>> Since changes for 4.2 have been brought up, I suggest the following:
>>
>> Section 3, #8 contains the text:
>> 8. INFO additional information: (String, no white-space,
>> semi-colons, or equals-signs permitted; commas are permitted only
>> as delimiters for lists of values) INFO fields are encoded as a
>> semicolon-separated series of short keys with optional values in
>> the format: <key>=<data>[,data]. ...
>>
>> I request that it be changed to:
>> 8. INFO additional information: (String, no tabs, semi-colons, or
>> equals-signs permitted; commas are permitted only as delimiters
>> for lists of values) INFO fields are encoded as a
>> semicolon-separated series of short keys with optional values in
>> the format: <key>=<data>[,data]. ...
>>
>> (that is, change "white-space" to "tabs")
>>
>> The justification:
>> There are some tools in the wild that use the info field to store
>> functional annotation information including brief descriptions of
>> the expected function of the gene. Well, at least I've written
>> one and I know of a few others. These tools do not follow the 4.1
>> spec unless the descriptions are mangled (by changing space
>> characters to underscores, for instance).
>>
>> While the other characters listed (tabs, semi-colons,
>> equals-signs and commas) serve a syntactic purpose in the VCF
>> file (to delimit overall fields, INFO fields, keys from values and
>> lists of data respectively), I would argue that there is no need
>> to exclude the space character (ascii 0x20) since the tab is
>> already being used as a delimiter.
>>
>> Other implications (downsides?):
>> One could not use just "awk" any more, but would need to use "awk
>> -F$'\t'"
>> In Perl, "split /\t/" instead of "split"
>> In Python, "line.split('\t')" instead of "line.split()"
>> ... etc.
>>
>> Again, I would argue that splitting on tab characters is the
>> proper thing to do anyway, as tab is specified as the separator
>> of fixed fields (not "any whitespace character").
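To make the difference between the two splitting strategies concrete, here is a minimal Python sketch. The sample record and its `DESC` INFO key are invented for illustration; they do not come from the spec or from any real annotation tool.

```python
# Hypothetical VCF data line: the INFO value for the made-up key DESC
# contains space characters, as the proposed change would allow.
record = "\t".join([
    "1", "12345", "rs1", "A", "G", "50", "PASS",
    "DP=10;DESC=putative kinase domain",
])

# Splitting on tabs keeps the 8 fixed fields intact.
tab_fields = record.split("\t")

# Splitting on any whitespace breaks the INFO column into several pieces.
ws_fields = record.split()

print(len(tab_fields), tab_fields[7])
print(len(ws_fields), ws_fields[7:])
```

With tab splitting the INFO column stays in one piece; with whitespace splitting it is spread across the trailing elements, which is the behaviour the "downsides" above refer to.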
>>
>> Discussion...?
>>
>> -- Bradford
>
> --
> Eric Banks, PhD
> Broad Institute of Harvard and MIT
> eb...@br...
> 617-714-7636
>
>
--
Eric Banks, PhD
Broad Institute of Harvard and MIT
eb...@br...
617-714-7636