|
From: Petr D. <pd...@sa...> - 2011-05-24 10:34:01
|
Hi, the validator complains about N not being an integer, look at the tag cPd. Do you need to state this explicitly? The ploidy is implicitly present in the GT tag (x vs x/x vs x/x/x ...) ##FORMAT=<ID=cPd,Number=1,Type=Integer,Description="called Ploidy(level)"> 1 13376 . T . 120 PASS NS=1 GT:GQ:HQ:segDup:rCov:cPd 0/.:120:120,120:7:1.25:N Petr On May 23, 2011, at 2:41 PM, Shan Yang wrote: > Hi, Petr > > Thanks for the response. The VCFtools I used is vcftools_0.1.5. I > downloaded from here: > > http://sourceforge.net/projects/vcftools/files/ > > Here is an example file that contains 1st 1000 lines of my VCF file > and the error message I got from running the following command: > > perl -MVcf -e validate NA19240_example.vcf 2> err_example > > Thank you very much! > > Shan > > -----Original Message----- > From: Petr Danecek [mailto:pd...@sa...] > Sent: Monday, May 23, 2011 12:55 AM > To: Shan Yang > Cc: vcf...@li... > Subject: RE: [Vcftools-help] VCF format questions > > Hi Shan, > > I was not able to reproduce the errors you reported. Can you please > attach a sample of your VCF file? Also, what version of VCFtools are > you > running? > > Petr > > On Mon, 2011-05-23 at 07:36 +0000, Shan Yang wrote: >> Hi, >> >> I have wrote a VCF converter to convert Complete Genomics variation >> files to VCF file. However, the validation part does not go >> through. The most error message I got is related to no calls. For >> example, I got two error messages: >> >> column GS19240-1100-37-ASM at 1:11218 .. Could not validate the int >> [N] >> column GS19240-1100-37-ASM at 1:13376 .. Could not validate the int >> [N] >> >> And the corresponding VCF file lines are: >> >> 1 11218 . TGCCAGGGCGCCCCCTGCTGG >> TGCCAGGGCGCNNCCNGCNGG 0 PASS NS=1 >> GT:GQ:HQ:segDup:rCov:cPd 1/.:0:22,0:7:1.25:N >> 1 13376 . T . 120 PASS NS=1 >> GT:GQ:HQ:segDup:rCov:cPd 0/.:120:120,120:7:1.25:N >> >> So the first one is a partial nocall with some Ns in ALT and one of >> the allele has this partial nocall and the other is a complete no >> call. The second one is similar but one allele is called as >> reference and the other is nocalled. Although in your previous >> email, you said that you can represent nocalls with '.', does not >> seem to be so. >> >> So how should I represent nocalls (half or total) in VCF? Another >> problem on these cases are if the variant type is not snp, it will >> complain that the leading base is missing (i.e. Could not parse the >> allele(s) [N], first base does not match the reference.) If I would >> put a leading base in front of a nocall, should it be AN or A. or >> something else? >> >> Another thing I noticed is that although from VCF 4.1, the >> reference and alt are case insensitive, the validation program >> still complains wherever there is a lower case letter. >> >> Please comment on the no call situation. >> >> Thank you very much! >> >> Shan >> >> >> -----Original Message----- >> From: Petr Danecek [mailto:pd...@sa...] >> Sent: Wednesday, May 18, 2011 8:40 AM >> To: Shan Yang >> Cc: vcf...@li... >> Subject: RE: [Vcftools-help] VCF format questions >> >> Hi Shan, >> >> On Mon, 2011-05-16 at 19:10 +0000, Shan Yang wrote: >>> Hi, Petr >>> >>> Thanks a lot for your reply. So for question one, in cases where >>> multiple dbSNP id are mapped to one location, you would also discard >>> any of them or you will keep one in the ID space? >> >> a record can have multiple IDs, but one ID cannot be assigned to >> multiple loci. >> >> >>> Another question I forgot to mention in my previous email is that in >>> the example on the document page, there is one like this: >>> >>> #CHROM POS ID REF ALT >>> QUAL FILTER INFO FORMAT NA00001 >>> NA00002 NA00003 >>> >>> 20 1230237 . T . >>> 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0| >>> 0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 >>> >>> >>> Which is described as “a site that is called monomorphic reference >>> (i.e. with no alternate alleles)”. I don’t understand why this >>> record >>> has to be there since this a variant file. >> >> This is just an example showing how to represent records without any >> alternate alleles, it does not have to be there. Making a reference >> call >> is different from missing information (not making a call at all). >> >> >>> I don’t totally understand the “phasing” concept in the VCF file. My >>> understanding of phasing is we know certain variants are on the same >>> allele. So the phasing information is always among genotype calls at >>> different genomic locations, not the relationship between samples, >>> right? If this is the case, then it is not obvious which genotype >>> calls belong to one group if there is no identifier to assign them. >>> The “PS” in genotype field seems to serve this purpose, but I don’t >>> see this field being presented in the example file (like the one I >>> shown above). How can I interoperate the phasing information in >>> NA00001 and NA00002? >> >> In phased genotypes, the alleles are listed in the same order. Thus >> the >> first alleles of the phased records all appear on one chromosome >> and the >> second alleles on the other. As described in the spec, if PS tag is >> not >> given, all records are assumed to belong to the same group. >> >> >> >>> Another question is Complete Genomics specific (more or less). So we >>> have calls with “?” in them, meaning we are not sure the length of >>> the >>> variants given the evidence. For example, the 1st example shown here >>> is a case where we know at this location there is a snp of C->A in >>> one >>> allele, however, we are not sure what happened in the other >>> allele. We >>> are not even sure about the length of the 2nd allele (if we know the >>> length is 1 but not knowing the genotype, we will put an “N” instead >>> of a “?” there). The 2nd case, although there is a “?” in the 2nd >>> allele, there is some information in that allele that could >>> potentially be useful. >>> >>> chr1 753404 753405 half snp C A ? >>> >>> chr1 946133 946135 half sub TT G ?TT >>> >>> >>> >>> Is there a way to present this kind of case in VCF or we need to >>> discard the 2nd allele completely? Especially for the 2nd line I >>> shown >>> here, there will be quite some loss of information if we totally >>> discard the 2nd allele. I would love to hear your comments. >> >> Missing information is expressed by a dot. So these are valid GT >> records: ./. ./0 >> >> Best, >> Petr >> >> >>> >>> >>> >>> Thanks a lot! >>> >>> >>> >>> Shan >>> >>> From: Petr Danecek [mailto:pd...@sa...] >>> Sent: Friday, May 13, 2011 8:17 PM >>> To: Shan Yang >>> Cc: vcf...@li... >>> Subject: Re: [Vcftools-help] VCF format questions >>> >>> >>> >>> >>> Hello Shan, >>> >>> >>> >>> >>> >>> >>> >>> On May 13, 2011, at 7:59 PM, Shan Yang wrote: >>> >>> >>> >>> >>> Hi, >>> >>> >>> >>> >>> I am a scientist working for Complete Genomics. I am in the >>> process of >>> writing a converter that coverts CG variation file into VCF file. I >>> have read the document and have some questions: >>> >>> >>> >>> >>> 1) One of the required field is ID, which according to the >>> description: semi-colon separated list of unique identifiers where >>> available. If this is a dbSNP variant, it is encouraged to use the >>> rs >>> number(s). No identifier should be present in more than one data >>> record. >>> >>> >>> As far as I know, there could be dbSNP id that are mapped to >>> multiple >>> locations on the genome (largely due to mapping ambiguity). How do >>> you >>> deal with this kind of case? Discard them or add another field to >>> distinguish these non-unique entries? >>> >>> >>> >>> >>> >>> I remember encountering such dbSNP IDs before. There were multiple >>> IDs >>> for the same genomic locus and one of the IDs was mapped to multiple >>> locations. The correct solution was to discard the ambiguous ID. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> 2) In one of the examples shown in the document, the sequence >>> are >>> like this: >>> >>> >>> Ref: atcgcg--a >>> >>> >>> A1: atcg----a >>> >>> >>> A2: atcgcgcga >>> >>> >>> And the VCF record is >>> >>> >>> 20 2 . TCGCG TCG,TCGCGCG . >>> PASS DP=100 >>> >>> >>> >>> >>> To me, the “TC” in the “TCG” part is redundant. Could you just >>> presented as “GCG G,GCGCG”? Is there any reason why you >>> include two more letters in this case? >>> >>> >>> >>> >>> >>> >>> >>> >>> This is probably a copy-and-paste type of error and has been fixed, >>> the specification recommends that the simplest representation >>> possible >>> should be used. Thanks for noticing. >>> >>> >>> >>> >>> >>> Best, >>> >>> >>> Petr >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Thanks! >>> >>> >>> >>> >>> Shan >>> >>> >>> >>> -- The Wellcome Trust Sanger Institute is operated by Genome >>> Research >>> Limited, a charity registered in England with number 1021457 and a >>> company registered in England with number 2742969, whose registered >>> office is 215 Euston Road, London, NW1 2BE. >>> >>> >>> >>> >>> ______________________________________________________________________ >>> >>> The contents of this e-mail and any attachments are confidential and >>> only for use by the intended recipient. Any unauthorized use, >>> distribution or copying of this message is strictly prohibited. If >>> you >>> are not the intended recipient please inform the sender >>> immediately by >>> reply e-mail and delete this message from your system. Thank you for >>> your co-operation. >> >> >> >> >> -- >> The Wellcome Trust Sanger Institute is operated by Genome Research >> Limited, a charity registered in England with number 1021457 and a >> company registered in England with number 2742969, whose registered >> office is 215 Euston Road, London, NW1 2BE. >> >> ________________________________ >> >> The contents of this e-mail and any attachments are confidential >> and only for use by the intended recipient. Any unauthorized use, >> distribution or copying of this message is strictly prohibited. If >> you are not the intended recipient please inform the sender >> immediately by reply e-mail and delete this message from your >> system. Thank you for your co-operation. > > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > > ________________________________ > > The contents of this e-mail and any attachments are confidential and > only for use by the intended recipient. Any unauthorized use, > distribution or copying of this message is strictly prohibited. If > you are not the intended recipient please inform the sender > immediately by reply e-mail and delete this message from your > system. Thank you for your co-operation. > <err_example><NA19240_example.vcf> -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. |