|
From: Petr D. <pd...@sa...> - 2014-10-10 13:13:53
|
On Fri, 2014-10-10 at 08:36 -0400, lh3 wrote: > I have just run a test. If we write 0x80 in BCF, bcftools-1.0 will > output allele 63 in VCF. It doesn't matter how bcf_get_genotypes is > coded if the VCF writer doesn't support 0x80. The current behavior is b, > not a, c or d. As I said - you are considering only the BCF producing part of htslib, not the reader or other functionalities. Let's not argue about that. What I am trying to achieve in this thread is to reach a consensus about what should appear in the specification. As I said I can accept b) if also other think that's the way to go. However, saying "whatever htslib currently does" does not seem like a good way of closing this issue. > Using 0x80 is barely an improvement by itself - I won't expand on this > as it is minor either way. It is bad if we break compatibility just for > an opinion. As others said, if bcf keeps changing even for tiny things > like this, how can we use bcf? BCF is not changing constantly, I can't agree with that. We are about to release a new version and hopefully never touch it again. In the draft there were only three changes proposed for BCF, and only one was tiny, the other two were substantial: - IDX header field enables to edit headers without redoing the whole BCF file - end-of-vector byte allows to represent genotypes (and other fields) with different ploidies - removal of the leading string is a cosmetic change, I don't believe anyone was hurt by this Petr > Heng > > On 2014-10-10 04:41, Petr Danecek wrote: > > Heng, you checked only the code which concerns writing, but there are > > other places in htslib where only 0x80 is checked - see for example > > bcf_get_genotypes in vcf.[ch] or vcfutils.c. The code is not > > consistent, > > that's why I brought this up. If we want to freeze the specification in > > the state htslib is right now, closest is the option a) plus we need to > > fix the code. > > > > Speaking for bcftools/htslib, whatever option we go with, it could be > > done in couple of days including testing. Most difficult and > > error-prone > > would be the option d) and easiest one to change throughout > > htslib/bcftools would be the option c) and a). I don't really like the > > option b) but I will not stand in the way if everyone else likes it > > best. > > > > Petr > > > > > > On Thu, 2014-10-09 at 10:18 -0400, Heng Li wrote: > >> I had a look at the htslib code just now. Currently, htslib may write > >> vector-end > >> <https://github.com/samtools/htslib/blob/develop/vcf.c#L1618>, but it > >> never writes other negative values. When htslib outputs VCF > >> <https://github.com/samtools/htslib/blob/develop/htslib/vcf.h#L780>, > >> it > >> only recognizes vector-end, too. Other negative values are treated as > >> normal values. > > > >> This behavior is closer to b) but not exactly. In the spec, we could > >> say that except vector-end, other negative values are disallowed in > >> the > >> GT array. If we adopt option a, c or d, htslib-1.0/1.1 will not work > >> any more. It is not worth breaking the compatibility completely for > >> something that is working. If we want to change the encoding of GT, we > >> may wait until the next major BCF update. > > > >> Heng > >> > >> On Oct 9, 2014, at 7:53 AM, Heng Li <lh...@sa...> wrote: > >> > >> > We just need to write the htslib-1.0 behavior in the spec and make sure that the spec is fully compatible with htslib-1.0. I only care about compatibility. > >> > > >> > Heng > >> > > >> > On Oct 9, 2014, at 5:25 AM, Petr Danecek <pd...@sa...> wrote: > >> > > >> >> OK, so we have these options: > >> >> > >> >> a) missing GTs can be represented both as 0 and 0x80 > >> >> > >> >> b) 0x80 is not allowed, only 0 can be used. From the eight reserved values 0x80-0x87, only 0x81 has an interpretation > >> >> > >> >> c) 0x80 is interpreted as missing value, 0 is an error. Phase is attached to the first non-missing allele. > >> >> > >> >> d) 0x80 is interpreted as missing value, genotype is changed from (allele<<1)|phase to allele|phase. As in (c), the phase is attached to the first non-missing allele. > >> >> > >> >> e) is there some solution I missed? > >> >> > >> >> From the discussion so far there is no clear winner, therefore I'd like to identify the least objectionable option. Can you please +1 solutions which you find acceptable and if there are multiple, list them in the order of preference, your favourite as first? > >> >> > >> >> My preferences are: > >> >> +1 c > >> >> +1 d > >> >> +1 a > >> >> > >> >> Cheers, > >> >> Petr > >> >> > >> >> > >> >> > >> >> On 7 Oct 2014, at 17:07, Petr Danecek wrote: > >> >> > >> >>> Yeah, that was my original motivation. I just realized that we still have the 0x81 end-of-vector byte and six other reserved values. This makes it a bit awkward. > >> >>> > >> >>> On 7 Oct 2014, at 16:49, Heng Li wrote: > >> >>> > >> >>>> > >> >>>> On Oct 7, 2014, at 11:42 AM, Petr Danecek <pd...@sa...> wrote: > >> >>>> > >> >>>>> OK, that's fine with me. Would you allow both 0x80 and 0 for missing value or only allow 0? > >> >>>> > >> >>>> No. Because the current htslib will interpret [0x80,0x80] as a genotype “64/64”. Having two ways to represent missing values is also error-prone. > >> >>>> > >> >>>> Heng > >> >>>> > >> >>>>> I'd like to make this clear in the specification. > >> >>>>> > >> >>>>> > >> >>>>> On 7 Oct 2014, at 16:36, Heng Li wrote: > >> >>>>> > >> >>>>>> > >> >>>>>> On Oct 7, 2014, at 11:23 AM, Petr Danecek <pd...@sa...> wrote: > >> >>>>>> > >> >>>>>>> > >> >>>>>>> On 7 Oct 2014, at 16:11, Heng Li wrote: > >> >>>>>>> > >> >>>>>>>> If we have a genotype “1|.” or “.|1”, how could we keep it with 0x80? GT will always be a special case because it is not a standard array. > >> >>>>>>> > >> >>>>>>> Wouldn't that be representable as 0x80 followed by (1+1)<<1 | 1? > >> >>>>>> > >> >>>>>> Then it is [0x80,2<<1|1] for “.|1” (phasing goes with the second allele), but [2<<1|1,0x80] for “1|.” (phasing goes with the first allele). We are introducing a special case for 0x80 in GT. > >> >>>>>> > >> >>>>>> In the end, these are all negligible details. Compatibility is far more important. I would strongly prefer to keep the current GT encoding. > >> >>>>>> > >> >>>>>> Heng > >> >> > >> > > >> > > >> > > >> > -- > >> > The Wellcome Trust Sanger Institute is operated by Genome Research > >> > Limited, a charity registered in England with number 1021457 and a > >> > company registered in England with number 2742969, whose registered > >> > office is 215 Euston Road, London, NW1 2BE. > >> > > >> > ------------------------------------------------------------------------------ > >> > Meet PCI DSS 3.0 Compliance Requirements with EventLog Analyzer > >> > Achieve PCI DSS 3.0 Compliant Status with Out-of-the-box PCI DSS Reports > >> > Are you Audit-Ready for PCI DSS 3.0 Compliance? Download White paper > >> > Comply to PCI DSS 3.0 Requirement 10 and 11.5 with EventLog Analyzer > >> > http://pubads.g.doubleclick.net/gampad/clk?id=154622311&iu=/4140/ostg.clktrk > >> > _______________________________________________ > >> > VCFtools-spec mailing list > >> > VCF...@li... > >> > https://lists.sourceforge.net/lists/listinfo/vcftools-spec > >> -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |