From: Petr D. <pd...@sa...> - 2020-07-09 07:24:53
The specification is correct in what it says. They are genotype
likelihoods, P(data|genotype). They don't scale to 1, as can be
immediately seen in any VCF with the PL field: one of the values is
almost always 0, for example "0,31,178".

Petr

On 09/07/2020 02:20, Jonathan Margoliash via VCFtools-spec wrote:
> Or maybe this is the point of my confusion:
>
> What constitutes the model and its parameters, and what constitutes
> the data? If the underlying reads (or whatever sequencing data is
> available) is used as parameters to create the model, and the
> genotypes are the different outcomes the model assigns different
> probabilities based on that data, then this seems to be a probability
> function which is incorrectly being called a likelihood function. But
> if the choice of genotype for each sample at each locus is a set of
> parameters in the model, and the reads (or underlying sequencing data)
> is the dataset whose probability is being estimated here, then this is
> an appropriately named likelihood function.
>
> I don't know which interpretation is correct.
>
> Thank you,
>
> Jonathan
>
> On Wed, Jul 8, 2020 at 3:50 PM Jonathan Margoliash
> <jma...@en... <mailto:jma...@en...>> wrote:
>
>     Hello there,
>
>     I'm trying to interpret what the GL and PL fields in the VCF spec
>     mean. They're called likelihoods. But I believe they are
>     probability functions. Can someone clarify?
>
>     My understanding (in line with this wikipedia article
>     [en.wikipedia.org]
>     <https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Likelihood-5Ffunction-23Likelihood-5Ffunction-5Fof-5Fa-5Fparameterized-5Fmodel&d=DwMFaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=xdvdTaAZDWitAtUqWIZL0A&m=jYWtNGrhV2_93ufy8-aApWZ4gnv7EYP0PwR6AOnvbI4&s=AFghjTcK3NOfa-UY9oa1b5dEFn5sAMHy9gSYpOdl5C8&e=>)
>     is that likelihood functions are functions over different sets of
>     models with a fixed data set in mind, while in the VCF spec the GL
>     and PL fields vary over different genotypes (different data
>     outcomes), while the model that produced the VCF remains the same.
>
>     Or another way of putting it: the values of the GL field, after
>     exponentiating them to remove the log_10 scaling, should always
>     sum to 1, yes? That's a property of probability functions, not
>     likelihood functions. Or maybe I'm misunderstanding.
>
>     My guess is that the VCF spec is calling them likelihoods because
>     they're the log-scaled, even though then they should correctly be
>     called log-probabilities.
>
>     Any clarification appreciated. Thanks!
>
>     Jonathan

-- 
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company
registered in England with number 2742969, whose registered office is
215 Euston Road, London, NW1 2BE.
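
To make the point in the reply concrete, here is a minimal Python sketch that converts the example PL field "0,31,178" back to the linear scale. It assumes PL is -10*log10 of the genotype likelihood, shifted so that the best genotype gets 0, which is how callers commonly write the field. The function names are hypothetical and only for illustration. The flat-prior normalisation at the end is also an assumption, added only to show what a set of values that does sum to 1 would look like.

```python
def pl_to_relative_likelihoods(pl_values):
    """Convert Phred-scaled PL integers to likelihoods relative to the best genotype.

    Assumes PL = -10 * log10(likelihood), shifted so the best genotype has PL 0.
    """
    return [10 ** (-pl / 10) for pl in pl_values]


def normalize(values):
    """Rescale non-negative values so that they sum to 1.

    Applied to likelihoods, this gives genotype posteriors only under the
    (assumed) flat prior over genotypes.
    """
    total = sum(values)
    return [v / total for v in values]


pl = [0, 31, 178]  # the example PL field quoted in the reply
relative = pl_to_relative_likelihoods(pl)

print(relative)             # the genotype with PL 0 maps to exactly 1.0
print(sum(relative))        # slightly above 1, so this is not a probability distribution
print(normalize(relative))  # sums to 1 only after explicit normalisation
```

Because the genotype with PL 0 already contributes a relative likelihood of 1, the linear-scale values sum to slightly more than 1 (about 1.0008 here). That is consistent with reading the field as P(data|genotype) for each genotype rather than as a distribution over genotypes.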