From: Bob H. <han...@br...> - 2011-03-03 19:31:48
Hi, Heng,
What I wrote before was wrong, but I still think there is a problem.
The correct motivating example is CNL (copy number likelihood), which
was recently added to the VCF 4.1 spec (by me).
Analogous to GL, it captures the likelihood of different copy numbers
for events that are copy number variable,
but without more precise allelic descriptions.
##FORMAT=<ID=CNL,Number=.,Type=Float,Description="Copy number genotype likelihood for imprecise events">
An example would be something like:
1 1000 . A <CNV> . . SVTYPE=CNV;END=2000 CN:CNL 2:-1000.00,-53.32,0.00,-5.89 3:-1000,-322.0,-9.6,-0.3,-96.3
Here, the first sample is most likely copy number 2, the second is most
likely copy number 3.
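
As a rough illustration of how the example record reads, here is a small Python sketch. It assumes (this is not stated in the header line) that the i-th CNL value is the log-likelihood of copy number i, counting from 0, which is consistent with the two samples above. The function name is made up for the example.

```python
def most_likely_copy_number(cnl_field):
    """Return the index of the largest value in a comma-separated CNL string.

    Assumption: the i-th value is the log-likelihood of copy number i.
    """
    likelihoods = [float(x) for x in cnl_field.split(",")]
    return max(range(len(likelihoods)), key=likelihoods.__getitem__)


# The two sample columns from the example record (FORMAT is CN:CNL).
samples = ["2:-1000.00,-53.32,0.00,-5.89", "3:-1000,-322.0,-9.6,-0.3,-96.3"]

calls = []
for sample in samples:
    cn, cnl = sample.split(":")
    # Pair the reported CN with the copy number that CNL favours.
    calls.append((int(cn), most_likely_copy_number(cnl)))

print(calls)
```

Note that the two samples carry arrays of different lengths (4 and 5 values), which is why the field needs Number=. rather than a fixed count.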
More generally, though, the spec allows variable length arrays
(Number=.) for both INFO and FORMAT fields.
I think this is a good thing, and allows flexibility in the vcf format.
I also really like how you've made vcf
efficient for the common case, but probably no worse than the text
format for the rarer general cases.
To handle Number=., I think we would need alternative encodings for
Integer, Numeric and String:
Integer[Number=.] uint32_t, n*<uint32_t, int32_t[Y]>
Numeric[Number=.] uint32_t, n*<uint32_t, double[Y]>
String[Number=.] uint32_t, n*<uint32_t, char[Y]>
In each case, the first uint32 gives the total length of the remaining
data, in bytes.
In the repeating group, the uint32 is the number of following entries (Y).
For the string case, individual strings are null separated, as elsewhere
in the spec.
A value of "." for any sample is encoded as Y=0.
A value of ".,." is not supported for Integer or Numeric, but this is
also true of the rest of the spec, I think.
Perhaps this is a defect.
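
To make the proposed Integer[Number=.] layout concrete, here is a minimal Python sketch of an encoder and decoder. It is only an illustration of the layout described above, not code from any existing BCF implementation. Little-endian byte order and the function names are my assumptions, and I read the repeating group as one entry per sample.

```python
import struct


def encode_int_field(per_sample):
    """Encode one Integer[Number=.] field for all samples.

    per_sample holds one list of ints per sample; an empty list stands
    for "." and is written as Y=0.
    Assumption: little-endian byte order.
    """
    body = b""
    for values in per_sample:
        body += struct.pack("<I", len(values))               # Y
        body += struct.pack("<%di" % len(values), *values)   # int32[Y]
    # Leading uint32: total length of the remaining data, in bytes.
    return struct.pack("<I", len(body)) + body


def decode_int_field(buf, n_samples):
    """Inverse of encode_int_field; returns one list of ints per sample."""
    (total,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    out = []
    for _ in range(n_samples):
        (y,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        values = list(struct.unpack_from("<%di" % y, buf, offset)) if y else []
        offset += 4 * y
        out.append(values)
    if offset != 4 + total:
        raise ValueError("length prefix does not match the decoded data")
    return out


data = [[1, 2, 3], [], [7]]   # second sample is "."
buf = encode_int_field(data)
print(len(buf), decode_int_field(buf, len(data)))
```

The leading byte count lets a reader skip the whole field without walking the per-sample groups, which is the main reason to carry it in addition to the per-sample counts.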
I also note that the spec is silent on how custom FORMAT fields of type
Character are encoded.
They could plausibly either be null-separated single-character strings,
or the nulls could be elided.
Either way, they would generalize to the Number=. case.
On a separate note, I'm now confused about whether the "missing" bit
represents "." or reduced ploidy.
I thought there was a difference between a genotype of "1" (ploidy 1,
one alt allele) vs. "1/."
(ploidy 2, one alt allele, other allele not confidently callable).
Maybe this could be solved by using -1 (i.e. 0x7) for a "." allele and
encoding variants with more
than 7 alleles using _GT.
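
A small Python sketch of what I mean, dealing only with the per-allele codes. How the codes are packed into the GT byte is left out, since that depends on the rest of the draft. The names are hypothetical, and the number of codes stands in for ploidy.

```python
MISSING = 0x7  # -1 in a 3-bit two's-complement field; proposed code for "."


def encode_allele(allele):
    """Map one VCF allele token ("0".."6" or ".") to a 3-bit code."""
    if allele == ".":
        return MISSING
    index = int(allele)
    if not 0 <= index < MISSING:
        # Index 7 and above would collide with the "." code or not fit.
        raise ValueError("allele index %d needs the generic _GT encoding" % index)
    return index


def encode_genotype(gt):
    """Return one code per allele; the list length carries the ploidy."""
    return [encode_allele(a) for a in gt.replace("|", "/").split("/")]


for gt in ("1", "1/.", "./.", "0/0"):
    print(gt, encode_genotype(gt))
```

With this scheme "1" and "1/." stay distinguishable (one code versus two), and "./." no longer looks like "0/0".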
-Bob
On 3/3/11 1:17 PM, Heng Li wrote:
> On Mar 3, 2011, at 11:19 AM, Bob Handsaker wrote:
>
>> Hi, Adam,
>>
>> I apologize for the tardy reply - I actually didn't see this go by on the first pass.
>> I have a couple of comments for a few cases that I have thought about.
>>
>> 1. I think we should make a small change to support FORMAT fields with "Number=.".
>> For example, CN and CNQ (copy number and copy number quality).
>> If Number=., then the custom Integer and Numeric fields could be of the form:
>> uint32_t+int32_t[n*X] and uint32_t+double[n*X] respectively where X is the count.
> Do you have an example to elaborate the use of CN?
>
>> 2. I think maybe we should say that the recommended usage is that the special formats _GT, _IBD, etc.
>> be used only internally within the BCF format. So, if you do a VCF->BCF conversion on GT, you might get
>> either GT or _GT, depending on the data, but if you then convert back to VCF, you will always get GT.
>> Of course, we can't prevent somebody from explicitly using the _GT format in VCF, but I think setting this out
>> as a clear recommendation would be good.
> Yes. The intention is that the _GT field should never appear in VCF at all.
>
>> 3. I also think the description could use more clarity and some examples about how the GT field is
>> encoded in certain cases, especially if you want it to be possible to write interoperable implementations
>> from the spec. For example, how is "./." encoded? If it is {0,0}, then how is this different from "0/0"?
>>
>> Similarly, I think some examples around the use of the "haploid bit" would be useful.
>> For example, if you have a chrX variant for a male with genotype "1", then is this encoded as {0x81,0x0} ?
>> How about a chrY variant for a female?
> Petr has reminded me that the highest bit in BCF/GT has been used to indicate whether the genotype is missing (i.e. the "./." genotype), which means haploid data can only be stored in the generic _GT field.
>
>> And, related I think, for _GT with variable ploidy > 2, what values are put in unused entries in the array
>> and how can you tell if a particular array entry carries a meaningful value or not?
>> Maybe the "haploid bit" can be used to mark an array entry as being unused.
> Yes, we can use the highest bit to indicate if the allele is missing/unused.
>
> I am attaching an updated document to address the points except 1 which I have not fully understood.
>
> Many thanks,
>
> Heng
>
>> Also, in general, I think it's preferable to specify a fixed value for unused entries (as opposed to saying it's undefined).
>> This both makes comparing entries easier and it can give you freedom to make backwards-compatible
>> changes to the spec in the future.
>>
>> -Bob
>>
>> On 2/26/11 7:20 AM, Adam Auton wrote:
>>> Dear All,
>>>
>>> Please find attached a proposed specification from Heng Li for a binary representation of the VCF format (BCF). As VCF files being generated by large scale sequencing projects grow in size, it is becoming more important to consider issues relating to the size and access speed of the VCF format. As such, the intention of BCF is to allow more efficient storage and access of all information stored in VCF files. Ideally, the two formats would be completely interchangeable. It would be useful if those working with VCF could look at this spec to see if they can identify problems, especially regarding information that could be represented in VCF but not easily in BCF, and vice versa.
>>>
>>> A couple of points about this specification that I think are worth raising.
>>> . The specification makes use of the 'A' and 'G' values proposed by Goncalo for determining the size of arrays. We would need to add these to the VCF4.1 spec if this is to be adopted. (And are there any other cases that people foresee needing?)
>>> . BCF currently allows predefined/reserved FORMAT headers to be missing, which would require software to explicitly handle these cases. Heng is keen on this for performance reasons, although I am less sure. Do others have opinions? Technically, VCF also allows such meta-information to be missing, which I personally think is a flaw.
>>> Comments and suggestions welcome.
>>>
>>> Adam
>>>
>>> --
>>> Adam Auton
>>> Post-Doctoral Research Scientist,
>>> Wellcome Trust Centre for Human Genetics,
>>> University of Oxford
>>> Roosevelt Drive
>>> Oxford
>>> OX3 7BN
>>> UK
>>>
>>> ------------------------------------------------------------------------------
>>> Free Software Download: Index, Search& Analyze Logs and other IT data in
>>> Real-Time with Splunk. Collect, index and harness all the fast moving IT data
>>> generated by your applications, servers and devices whether physical, virtual
>>> or in the cloud. Deliver compliance at lower cost and gain new business
>>> insights.
>>> http://p.sf.net/sfu/splunk-dev2dev
>>>
>>> _______________________________________________
>>> VCFtools-spec mailing list
>>>
>>> VCF...@li...
>>> https://lists.sourceforge.net/lists/listinfo/vcftools-spec