From: bradford p. <bra...@gm...> - 2012-11-25 04:11:20
Since changes for 4.2 have been brought up, I suggest the following:
Section 3, #8 contains the text:
8. INFO additional information: (String, no white-space, semi-colons, or
equals-signs permitted; commas are permitted only as delimiters for lists
of values) INFO fields are encoded as a semicolon-separated series of short
keys with optional values in the format: <key>=<data>[,data]. ...
I request that it be changed to:
8. INFO additional information: (String, no tabs, semi-colons, or
equals-signs permitted; commas are permitted only as delimiters for lists
of values) INFO fields are encoded as a semicolon-separated series of short
keys with optional values in the format: <key>=<data>[,data]. ...
(that is, change "white-space" to "tabs")
The justification:
There are some tools in the wild that use the INFO field to store
functional annotation information including brief descriptions of the
expected function of the gene. Well, at least I've written one and I know
of a few others. These tools do not follow the 4.1 spec unless the
descriptions are mangled (by changing space characters to underscores, for
instance).
While the other characters listed (tabs, semi-colons, equals-signs and
commas) serve a syntactic purpose in the VCF file (to delimit overall
fields, INFO fields, keys from values and lists of data respectively), I
would argue that there is no need to exclude the space character (ASCII
0x20) since the tab is already being used as a delimiter.
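To illustrate that the space character plays no syntactic role, here is a
minimal sketch of an INFO parser in Python. The function name parse_info, the
GENEDESC key and the sample record are all hypothetical, made up for this
example; the parser relies only on tab, semi-colon, equals-sign and comma.

```python
def parse_info(info):
    """Parse an INFO string into a dict; keys without a value map to True."""
    fields = {}
    for entry in info.split(";"):          # semi-colon separates INFO entries
        key, sep, data = entry.partition("=")  # equals-sign separates key from data
        fields[key] = data.split(",") if sep else True  # comma separates list values
    return fields


# Hypothetical record; GENEDESC and its description are invented for illustration.
line = "1\t12345\t.\tA\tG\t50\tPASS\tDP=14;AF=0.5,0.25;DB;GENEDESC=zinc finger protein"
columns = line.split("\t")                 # tab separates the fixed fields
info = parse_info(columns[7])
print(info["GENEDESC"])
```

The description comes through with its spaces intact, because none of the four
delimiters is a space.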
Other implications (downsides?):
One could not use just "awk" any more, but would need to use "awk -F$'\t'"
In Perl, "split /\t/" instead of "split"
In Python, "line.split('\t')" instead of "line.split()"
... etc.
Again, I would argue that splitting on tab characters is the proper thing
to do anyway, as tab is specified as the separator of fixed fields (not
"any whitespace character").
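As a rough illustration of the difference in Python, assuming a made-up record
whose INFO value contains spaces (the GENEDESC key is hypothetical):

```python
# Hypothetical record with a space-containing description in the INFO column.
line = "1\t12345\t.\tA\tG\t50\tPASS\tGENEDESC=zinc finger protein"

by_tab = line.split("\t")      # splits on tab only
by_whitespace = line.split()   # splits on any run of whitespace

print(len(by_tab), len(by_whitespace))
print(by_tab[7])
```

Splitting on tab keeps the INFO column as a single field, while splitting on
any whitespace breaks the description into extra columns.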
Discussion...?
-- Bradford