|
From: Fatima E. <fe...@we...> - 2023-03-30 17:35:44
|
Hi everyone, I am trying to figure out how I can use the vcfR package to see which genes in the reference genome the variants in my samples fall in. I have a GFF file and also have filtered VCF files for my samples. I work with fish and am doing a population genomic study. There will be 60 VCF files: six sampling locations with 10 fish from each. Does anyone have an idea for how I can get started on this? I would like to know whether I can combine the 10 samples per location into one output file. My goal is to see which genes, and which chromosomes, the variants in my samples occur in. I very deeply appreciate your help. I am a first-year MA student and my project is due within a month. I am new to the coding world and will be very grateful for any help I can get. Best, Fatima |
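One possible starting point, sketched in Python rather than vcfR: VCF POS and GFF start/end coordinates are both 1-based and inclusive, so a variant can be assigned to a gene by checking whether its position falls inside a `gene` feature on the same sequence. The function names `parse_gff_genes` and `genes_at` and the sample records are hypothetical, and this is only a minimal illustration of the idea, not how vcfR works.

```python
def parse_gff_genes(lines):
    """Collect (seqid, start, end, attributes) for every 'gene' feature."""
    genes = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/directive lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep only well-formed gene features
        genes.append((fields[0], int(fields[3]), int(fields[4]), fields[8]))
    return genes


def genes_at(genes, chrom, pos):
    """Return the attribute strings of genes overlapping a 1-based position."""
    return [attrs for seqid, start, end, attrs in genes
            if seqid == chrom and start <= pos <= end]


# Hypothetical records, for illustration only.
gff_lines = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t500\t.\t+\t.\tID=geneA",
    "chr1\tsrc\tmRNA\t100\t500\t.\t+\t.\tID=mrnaA;Parent=geneA",
    "chr2\tsrc\tgene\t50\t80\t.\t-\t.\tID=geneB",
]
genes = parse_gff_genes(gff_lines)
print(genes_at(genes, "chr1", 250))  # variant inside geneA
print(genes_at(genes, "chr1", 900))  # variant outside any gene
```

The linear scan is fine for a sketch; with whole-genome annotations an interval index or a dedicated annotation tool would be the more practical choice.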
|
From: Samragni De <sam...@gm...> - 2022-03-27 16:24:00
|
How do I convert an xlsx file to a Variant Call Format (VCF) file? Regards, Samragni De |
|
From: Petr D. <pd...@sa...> - 2020-07-09 07:24:53
|
The specification is correct in what it says. They are genotype likelihoods, P(data|genotype). They don't scale to 1, as can be seen immediately in any VCF with the PL field: one of the values is almost always 0, for example "0,31,178". Petr -- The Wellcome Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |
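Petr's example can be checked numerically. The sketch below assumes the usual phred scaling, in which a PL value p corresponds to a relative likelihood of 10^(-p/10), so the best genotype (PL 0) maps to exactly 1. The helper names are hypothetical, and the final normalisation step corresponds to assuming a flat prior over genotypes, which is an illustrative choice and not something the VCF fields themselves imply.

```python
def pl_to_relative_likelihoods(pl):
    """Undo the phred scaling: PL = -10 * log10(likelihood / best likelihood)."""
    return [10 ** (-p / 10) for p in pl]


def normalise(values):
    """Rescale values so they sum to 1 (equivalent to assuming a flat prior)."""
    total = sum(values)
    return [v / total for v in values]


pl = [0, 31, 178]                      # the example from the message above
rel = pl_to_relative_likelihoods(pl)   # relative likelihoods, best genotype = 1.0
print(rel)
print(sum(rel))                        # slightly above 1, so not a probability distribution

post = normalise(rel)                  # only these rescaled values sum to 1
print(sum(post))
```

The raw values are P(data|genotype) for each candidate genotype, so nothing forces them to sum to 1; a distribution over genotypes only appears once a prior is chosen and the values are rescaled.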
|
From: Jonathan M. <jma...@en...> - 2020-07-09 00:48:30
|
Or maybe this is the point of my confusion: What constitutes the model and its parameters, and what constitutes the data? If the underlying reads (or whatever sequencing data is available) are used as parameters to create the model, and the genotypes are the different outcomes to which the model assigns probabilities based on that data, then this seems to be a probability function which is incorrectly being called a likelihood function. But if the choice of genotype for each sample at each locus is a set of parameters in the model, and the reads (or underlying sequencing data) are the dataset whose probability is being estimated here, then this is an appropriately named likelihood function. I don't know which interpretation is correct. Thank you, Jonathan |
|
From: Jonathan M. <jma...@en...> - 2020-07-08 22:57:29
|
Hello there, I'm trying to interpret what the GL and PL fields in the VCF spec mean. They're called likelihoods. But I believe they are probability functions. Can someone clarify? My understanding (in line with this Wikipedia article <https://en.wikipedia.org/wiki/Likelihood_function#Likelihood_function_of_a_parameterized_model>) is that likelihood functions are functions over different sets of models with a fixed data set in mind, while in the VCF spec the GL and PL fields vary over different genotypes (different data outcomes), while the model that produced the VCF remains the same. Or another way of putting it: the values of the GL field, after exponentiating them to remove the log_10 scaling, should always sum to 1, yes? That's a property of probability functions, not likelihood functions. Or maybe I'm misunderstanding. My guess is that the VCF spec is calling them likelihoods because they're log-scaled, even though they should then correctly be called log-probabilities. Any clarification appreciated. Thanks! Jonathan |
|
From: Jonathan M. <jma...@en...> - 2020-07-01 01:23:19
|
Hey there, I'm new to this mailing list. I'm a grad student at UCSD, and I'm interested in proposing an addition to the VCF spec, but please tell me if this is not a proper proposal or a proper place to make a proposal. I have a use case where I want a FORMAT field to include one number for each haploid call, e.g. if it is a Y-chromosome call, then I want to emit one number; if it is an autosome call, then I want to emit two numbers. Currently the spec doesn't seem to allow for that: the options in 1.4.2 are based only on the number of alleles, not ploidy. Can we get a P option added to the list? Jonathan |
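To make the request concrete, here is a rough sketch of how many values each `Number` code implies for one record. `A`, `R`, `G`, fixed integers and `.` follow the existing spec; `P` is the option proposed in this message (one value per haploid call) and is included purely as a hypothetical. The function name `expected_count` is made up for illustration.

```python
from math import comb


def expected_count(number, n_alt, ploidy):
    """Number of values implied by a Number= code for one sample at one site."""
    if number == "A":
        return n_alt                      # one value per ALT allele
    if number == "R":
        return n_alt + 1                  # one value per allele, REF included
    if number == "G":
        n_alleles = n_alt + 1
        # unordered genotypes: multisets of size `ploidy` drawn from the alleles
        return comb(n_alleles + ploidy - 1, ploidy)
    if number == "P":
        return ploidy                     # hypothetical: the option proposed above
    if number == ".":
        return None                       # count unknown or variable
    return int(number)                    # fixed count


print(expected_count("G", n_alt=1, ploidy=2))  # diploid, biallelic: 0/0, 0/1, 1/1
print(expected_count("P", n_alt=1, ploidy=1))  # haploid call, e.g. Y chromosome
print(expected_count("P", n_alt=1, ploidy=2))  # autosomal diploid call
```

Of the existing codes only `G` depends on ploidy, and it counts genotypes rather than haplotypes, which is why it does not cover the use case described.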
|
From: Martin O. <doc...@gm...> - 2017-02-17 01:17:15
|
Dear spec list, I've been playing with some graphics for various genetic file types on the side as I develop a genetics testing website. I've posted them to GitHub on the off chance other people may find them useful as visual icons in their projects. https://github.com/doctormo/glyphicons-science/blob/master/genetic_formats.svg I'm also posting to see if there are any objections or existing graphics. Searches online never show graphics for these formats, so I assumed there were none before doodling. Best Regards, Martin Owens GenTB |
|
From: Petr D. <pd...@sa...> - 2016-01-08 16:32:40
|
Hello, there is an open issue on GitHub which I'd like to bring to your attention: https://github.com/samtools/hts-specs/issues/115 The proposal is to change the meaning of the FILTER column so that it becomes a list of filters, one per ALT allele. An alternative proposal is to reserve an INFO tag for this. Best wishes, Petr -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |
|
From: Petr D. <pd...@sa...> - 2016-01-08 16:25:39
|
Hi Aye, the genotype likelihoods do not take phasing into account. The ordering of the genotypes is the same regardless of the phasing and the 0|1, 1|0, 0/1, 1/0 appear at the same position in the list. Best wishes, Petr |
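A small sketch of the point Petr makes, for the diploid case only: the spec orders genotype j/k (with j <= k) at index k*(k+1)/2 + j, so sorting the two allele indices before applying the formula makes phased and unphased spellings land on the same position. The function name `gl_index` is hypothetical, and missing alleles (`.`) and other ploidies are deliberately not handled.

```python
import re


def gl_index(gt):
    """Position of a diploid genotype in a Number=G list (e.g. GL or PL)."""
    # Split on either separator: '/' (unphased) or '|' (phased).
    j, k = sorted(int(allele) for allele in re.split(r"[/|]", gt))
    return k * (k + 1) // 2 + j


for gt in ["0|1", "1|0", "0/1", "1/0"]:
    print(gt, gl_index(gt))   # all four map to the same index

# Order for three alleles: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
print([gl_index(g) for g in ["0/0", "0/1", "1/1", "0/2", "1/2", "2/2"]])
```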
|
From: Moe, A. <ay...@ms...> - 2016-01-05 20:44:10
|
Hi, I would like to get a clarification on whether genotypes should be enumerated for Number=G where phased data is involved. In VCF spec 4.3, there’s some pseudo code for genotype ordering for GL/PL fields. From the spec, it looks like the pseudo code is applicable to unphased genotypes only. If we have phased genotypes, the pseudo code will not work, as it will consider 0|1 and 1|0 the same and will omit one of them. So for fields with Number=G, should we not consider phased data at all? Thanks. Aye Sandar Moe Bioinformatics Engineer Icahn School of Medicine at Mount Sinai 1255 Fifth Avenue Suite C2 New York, NY 10029 (212)824-9656 |
|
From: Tommy C. <tc...@sa...> - 2015-12-07 16:39:08
|
Using the first 5 columns works fine, if you are not dealing with multiallelic SVs, which are not properly left aligned. I never use rsIDs myself, but instead rely on left alignment and chrom:pos:ref:alt comparison of normalised biallelic variants. |
|
From: Laura C. <la...@eb...> - 2015-12-07 15:56:49
|
If I wanted a unique ID from a VCF file, I have generally used at least the first 3 columns, if not the first 5 (with appropriate delimiter changes). This should still work when considering alt haplotypes, shouldn't it, as the chromosome name and position will be different? Laura |
|
From: Martin P. <mp...@sa...> - 2015-12-07 15:50:53
|
Hi, I’m not sure we should consider the plink use case here, as it is not a VCF based tool, nor would VCF’s ID field be particularly useful for Plink as a unique ID as any novel variants would have an ID of “.”. The TLDR of the hts-specs issue is: where the same variant exists on multiple alternate haplotypes (as happens with dbsnp) what can we do? Martin |
|
From: Thomas W. B. <tb...@um...> - 2015-12-07 15:40:55
|
Petr - Older versions of plink certainly rely on unique IDs. Don't know about plink 1.9, but I'd be surprised if it's different. - tom blackwell - |
|
From: Petr D. <pd...@sa...> - 2015-12-07 15:35:02
|
Hello, there is an open thread on hts-specs which discusses the problem of unique IDs in VCF https://github.com/samtools/hts-specs/issues/105 I don't know of any tool that depends on unique IDs, are there any? If not, perhaps the requirement could be dropped. Best wishes, Petr |
|
From: Tim P. <tim...@gm...> - 2015-11-25 10:43:39
|
Hi Petr, Thank you for your reply, I had not spotted the changes section in the 4.3 spec. It turns out that the data file I am using is actually in 4.0 format. As I have dug deeper I have found that the differences between 4.1 and 4.2 are not important for my task, and it seems that this is likely to be true for 4.0. It would be good to have a copy of the 4.0 spec on the site to be able to check. best regards Tim -- Tim Pizey - http://tim.pizey.uk/ |
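Finding out which version a file actually declares, as happened here, can be done before choosing a parser. The sketch below assumes the file follows the convention of starting with a `##fileformat=VCFv...` line; the function name `vcf_version` is hypothetical and the sample text is made up.

```python
import io


def vcf_version(handle):
    """Return the version string declared on the first line of a VCF stream."""
    prefix = "##fileformat=VCFv"
    first_line = handle.readline().strip()
    if not first_line.startswith(prefix):
        raise ValueError("first line is not a ##fileformat header")
    return first_line[len(prefix):]


# Made-up two-line file, for illustration only.
sample = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)
print(vcf_version(sample))
```

For a file on disk, the same function can be applied to an opened text handle, for example `with open(path) as handle: vcf_version(handle)`.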
|
From: Petr D. <pd...@sa...> - 2015-11-23 22:38:59
|
Hi Tim, as far as I know there isn't a good summary of changes between 4.1 and 4.2, but there were few. A summary of changes between 4.2 and 4.3 is part of the 4.3 document; a PDF is available here: http://samtools.github.io/hts-specs/ Unless you want to write your own VCF parser in Python, you might consider using programs like bcftools, which allow you to manipulate VCF/BCF files and quickly extract all kinds of information. http://samtools.github.io/bcftools/bcftools.html#query Best wishes, Petr |
|
From: Tim P. <tim...@gm...> - 2015-11-22 23:41:54
|
Hi, I am new to this world, so I apologise in advance if these questions are ignorant. I need to parse a file in VCFv4.2 format and have been able to find a parser written in Python to parse files in VCFv4.1 format at https://github.com/jamescasbon/PyVCF There is no change log associated with the specification files, or section within the specification which gives the changes from the previous version. I have done a diff on VCFv4.1.tex and VCFv4.2.tex, and think that I can use that for my purposes but I have also done a diff between VCFv4.2.tex and VCFv4.3.tex and the changes are too numerous and complex for a diff to be the right way to tell them apart. Could someone more familiar with the specifications and the process of creating them add a changelog summarising the differences between versions? I would also appreciate comments on what the consequences are likely to be of using a 4.1 parser on a 4.2 file. best regards Tim Pizey -- Tim Pizey - http://tim.pizey.uk/ |
|
From: Susan F. <fa...@eb...> - 2015-09-21 08:37:35
|
Hi Petr, Thanks for the clarification. That is very helpful. Many thanks, Susan. |
|
From: Petr D. <pd...@sa...> - 2015-09-21 07:14:11
|
Hello Susan,

I am not sure whether this was the intention, but the examples show that the VCF representation can be ambiguous: the same variation can be expressed in multiple ways. The convention is to left-align indels and use the shortest representation possible. However, this is not a requirement. For some purposes it is practical to keep SNPs and indels as separate records; often one wants them merged into one. When comparing VCFs from different sources, the user should be aware of the possible differences.

The example is wrong in that the coordinates are not sorted: the record at position 3 must come after the records at position 2.

Petr

-- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |
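As a rough illustration of the "shortest representation" convention mentioned in the reply above, the following Python sketch trims bases shared by REF and ALT and then sorts the records by position. The function name `trim_alleles` is made up for this example and is not part of any VCF tool. It only does simple trimming; full left-alignment of indels inside repeats would also need the reference sequence, which is not handled here.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each."""
    # Drop identical trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop identical leading bases, shifting POS to the right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The three records from example 5.1.1, in the (unsorted) order of the spec.
records = [(3, "C", "G"), (2, "TC", "T"), (2, "TC", "TCA")]

# Trim each record, then sort so that position 2 comes before position 3.
trimmed = sorted(trim_alleles(*r) for r in records)
for pos, ref, alt in trimmed:
    print(20, pos, ".", ref, alt, sep="\t")
```

Under these assumptions the insertion `2 TC TCA` trims to `3 C CA`, while the deletion stays at `2 TC T` because its ALT allele is already a single base.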
|
From: Susan F. <fa...@eb...> - 2015-09-18 10:15:44
|
Hi,

I’d be grateful for clarification regarding example 5.1.1 in the 4.2 VCF spec here: http://samtools.github.io/hts-specs/VCFv4.2.pdf

I’ve also pasted the example below.

The third line of example VCF generated is:

    #CHROM POS ID REF ALT QUAL FILTER INFO
    20 2 . TC TCA . PASS DP=100

What I’m unclear on is, why this wouldn't be reported as:

    20 3 . C CA . PASS DP=100

Is there an interaction with the reporting of the deletion at the 3rd position in the previous line? If that is the case, would these typically be reported on the same line? For example:

    20 2 . TC T,TCA . PASS DP=100

Also, could the whole example be condensed to a single line:

    20 2 . TC TG,T,TCA . PASS DP=100

As I said, I’m unclear why the third entry starts at position two instead of three. I’m afraid I’m not sure if this has been broken down to separate lines for illustration or if I’m misunderstanding something, but I’d be grateful if someone could help explain. Also, if this isn’t a suitable list for this question a pointer in the right direction would be appreciated.

Many thanks,
Susan.

The example is:

5.1.1 Representing variation in VCF records

Creating VCF entries for SNPs and small indels

Example 1

For example, suppose we are looking at a locus in the genome:

    Example  Sequence  Alteration
    Ref      atCga     C is the reference base
    1        atGga     C base is a G in some individuals
    2        at-ga     C base is deleted w.r.t. the reference sequence
    3        atCAga    A base is inserted w.r.t. the reference sequence

Representing these as VCF records would be done as follows:

1. A SNP polymorphism of C/G → {C, G} → C is the reference allele
2. A single base deletion of C → {tC, t} → tC is the reference allele
3. A single base insertion of A → {tC, tCA} → tC is the reference allele

    #CHROM POS ID REF ALT QUAL FILTER INFO
    20 3 . C G . PASS DP=100
    20 2 . TC T . PASS DP=100
    20 2 . TC TCA . PASS DP=100
|
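The single-line form asked about above (`20 2 . TC TG,T,TCA`) can be derived mechanically by padding every allele to a common REF span. The Python sketch below shows one way to do this, assuming the local reference sequence is available as a string. The helper `merge_records` is hypothetical and written only for this illustration; it is not how any particular tool does it.

```python
def merge_records(reference, records):
    """Merge (pos, ref, alt) records into one multi-allelic record.

    `reference` is the local reference sequence; positions are 1-based.
    """
    start = min(pos for pos, ref, _ in records)
    end = max(pos + len(ref) for pos, ref, _ in records)  # exclusive end
    merged_ref = reference[start - 1:end - 1]
    alts = []
    for pos, ref, alt in records:
        # Pad each ALT with the reference bases its own record does not cover.
        left = reference[start - 1:pos - 1]
        right = reference[pos - 1 + len(ref):end - 1]
        alts.append(left + alt + right)
    return start, merged_ref, alts


# Locus "atCga" from the example, with the three records from the spec.
merged = merge_records(
    "ATCGA",
    [(3, "C", "G"), (2, "TC", "T"), (2, "TC", "TCA")],
)
print(merged)
```

With the example locus this gives position 2, REF `TC` and ALT alleles `TG`, `T`, `TCA`, which matches the condensed line proposed in the question.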
|
From: Shankar A. S. <sha...@gm...> - 2015-08-25 04:34:39
|
Thanks, Yossi. Shankar |
|
From: Yossi F. <fa...@br...> - 2015-08-25 03:50:59
|
Shankar,

I understand this to mean that the semi-colon is the separator of different filters (for the same genotype) so that, for example, if a genotype fails two different filters, one would have

    FORMAT  SAMPLE1
    GT:FT   0/1:FILTER1;FILTER2

but that each filter code may not have spaces or semicolons within it.

Yossi. |
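To make that reading concrete, here is a small Python sketch that parses a sample column under the interpretation described above: the FT value is split on semi-colons, and each individual code is checked for whitespace. The function names `parse_ft` and `parse_sample` are invented for this example, and returning `None` for `.` and an empty list for `PASS` is just one possible convention.

```python
import re


def parse_ft(value):
    """Split an FT value into filter codes.

    Returns None for '.', an empty list for PASS, otherwise the failed codes.
    """
    if value == ".":
        return None
    if value == "PASS":
        return []
    codes = value.split(";")
    for code in codes:
        # Each individual code must be non-empty and free of whitespace.
        if not code or re.search(r"\s", code):
            raise ValueError(f"invalid filter code: {code!r}")
    return codes


def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with sample values, expanding FT into a list."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    parsed = dict(zip(keys, values))
    if "FT" in parsed:
        parsed["FT"] = parse_ft(parsed["FT"])
    return parsed


sample = parse_sample("GT:FT", "0/1:FILTER1;FILTER2")
print(sample)
```

Splitting on the semi-colon first means a code can never contain one, so the restriction in the parenthesised part of the definition is enforced by construction.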
|
From: Shankar A. S. <sha...@gm...> - 2015-08-24 04:41:43
|
Hello,

I have a question about the FT reserved keyword that's part of the FORMAT field. The VCFv4.1 and VCFv4.2 definitions appear to make contradictory statements. The definition closes, in parentheses, by saying that semi-colons are not permitted in the string, while the body says that the list of codes is to be separated by semi-colons. Any clarification would be appreciated.

FT : sample genotype filter indicating if this genotype was “called” (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semi-colon separated list of codes for filters that fail, or ‘.’ to indicate that filters have not been applied. These values should be described in the metainformation in the same way as FILTERs (String, no white-space or semi-colons permitted)

Thanks,
Shankar |
|
From: Joachim B. <ki...@co...> - 2015-06-11 18:37:25
|
Hi Petr,

I got an off-the-list reply from Erik Garrison earlier and he thinks it is not widely used either. For now, if I encounter GLE in a file, I will simply interpret it as a string value, unless it breaks the formatting completely.

I am looking forward to the 4.3 spec by the way! Great stuff!

Thanks,
Kim

CODAMONO, Toronto, Ontario, Canada.

On June 11, 2015 at 11:03:31 AM, Petr Danecek (pd...@sa...) wrote:

Hi Joachim,

I think this was meant verbatim, otherwise the field would be defined as Type=Float, not as String. This is a bug in the specification which has gone unnoticed since v4.1; apparently no one has been using the field. I did not find any mention of GLE other than this: http://sourceforge.net/p/vcftools/mailman/message/30123755/

I've opened an issue on GitHub for this: https://github.com/samtools/hts-specs/issues/90

Perhaps a semicolon could be used as a separator instead?

Cheers,
Petr

On Thu, 2015-05-28 at 10:38 -0400, Joachim Baran wrote:
> Hello,
>
> I am working on a bioinformatics tool for bringing genomics data
> (GFF3, GTF, GVF, VCF formats) into the cloud and storing it in NoSQL
> databases (MongoDB, RethinkDB, Elasticsearch, etc.).
>
> Most of the VCF 4.2 specification is quite clear, but I am having
> trouble understanding the formatting of the GLE field. The example
> provided in the specification is
> "0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53".
>
> Does the example mean that colons can appear in the GLE field,
> despite the fact that colons are already used to separate genotype
> fields? Or, does the example mean that the likelihoods are provided
> with a GLE field value of
> "-75.22,-223.42,-323.03,-99.29,-802.53" (comma separated), which
> correspond to the genotype ordering 0, 1, 0/0, 1/0, 1/1?
>
> I would appreciate if someone could clarify this part of the
> specification. Thank you.
>
> Best wishes,
> Kim
>
> CODAMONO, Toronto, Ontario, Canada.
|
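The conflict discussed in this thread can be seen in a few lines of Python. The sketch below assumes the "verbatim" reading, in which the GLE value is the whole string from the specification's example, colons included. The function name `parse_gle` and the sample column `0/1:...` are made up for the illustration.

```python
def parse_gle(value):
    """Parse a verbatim GLE string into {genotype: likelihood value}."""
    likelihoods = {}
    for item in value.split(","):
        # Each item looks like "<genotype>:<number>", e.g. "0/0:-323.03".
        genotype, _, score = item.partition(":")
        likelihoods[genotype] = float(score)
    return likelihoods


# The example string quoted from the VCF 4.2 specification.
gle = "0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53"
parsed = parse_gle(gle)

# A naive colon split of a sample column yields more pieces than FORMAT keys,
# which is why the verbatim form clashes with the genotype field separator.
format_keys = "GT:GLE".split(":")
sample_pieces = ("0/1:" + gle).split(":")

print(parsed)
print(len(format_keys), len(sample_pieces))
```

A reader that treats GLE as an opaque string, as suggested in the reply above, would still need to know where the value ends, so this only works cleanly if GLE is the last key in FORMAT or a different separator is adopted.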