Re: [Vcftools-help] VCF format questions

Hi,

The validator complains about N not being an integer; look at the cPd
tag, which is declared as Type=Integer. Do you need to state this
explicitly? The ploidy is implicitly present in the GT tag
(x vs x/x vs x/x/x ...).

##FORMAT=<ID=cPd,Number=1,Type=Integer,Description="called Ploidy(level)">
1       13376   .       T       .       120     PASS    NS=1    GT:GQ:HQ:segDup:rCov:cPd        0/.:120:120,120:7:1.25:N
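
To illustrate the point, here is a minimal Python sketch of a type check on FORMAT values. It is not the Vcf.pm validator itself. It assumes the declared types are already available as a dict; the cPd entry comes from the header line above, while the GQ entry and the function names check_sample and ploidy_from_gt are illustrative assumptions. The sketch accepts '.' as the missing-value marker and also shows how the ploidy could be read off the GT string instead of being stored in cPd.

```python
import re

def check_sample(format_types, format_keys, sample):
    """Return (tag, value) pairs whose value does not match a declared Integer type."""
    problems = []
    for tag, value in zip(format_keys.split(":"), sample.split(":")):
        if format_types.get(tag) != "Integer":
            continue  # only Integer tags are checked in this sketch
        for item in value.split(","):
            # "." is the VCF missing-value marker and is always accepted
            if item != "." and not re.fullmatch(r"[+-]?\d+", item):
                problems.append((tag, item))
    return problems

def ploidy_from_gt(gt):
    """Count the alleles in a GT string such as 0, 0/1 or 0|1|1."""
    return len(re.split(r"[/|]", gt))

# Types as they might be declared in ##FORMAT header lines (assumed here).
types = {"GQ": "Integer", "cPd": "Integer"}
keys = "GT:GQ:HQ:segDup:rCov:cPd"

print(check_sample(types, keys, "0/.:120:120,120:7:1.25:N"))  # [('cPd', 'N')]
print(check_sample(types, keys, "0/.:120:120,120:7:1.25:."))  # []
print(ploidy_from_gt("0/."))                                   # 2
```

Under these assumptions, writing '.' instead of N in the cPd column would pass an integer check, and dropping cPd altogether would lose nothing that GT does not already carry.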

Petr

On May 23, 2011, at 2:41 PM, Shan Yang wrote:

> Hi, Petr
>
> Thanks for the response. The VCFtools version I used is
> vcftools_0.1.5, which I downloaded from here:
>
> http://sourceforge.net/projects/vcftools/files/
>
> Here is an example file that contains the first 1000 lines of my VCF
> file and the error messages I got from running the following command:
>
> perl -MVcf -e validate NA19240_example.vcf 2> err_example
>
> Thank you very much!
>
> Shan
>
> -----Original Message-----
> From: Petr Danecek [mailto:pd...@sa...]
> Sent: Monday, May 23, 2011 12:55 AM
> To: Shan Yang
> Cc: vcf...@li...
> Subject: RE: [Vcftools-help] VCF format questions
>
> Hi Shan,
>
> I was not able to reproduce the errors you reported. Can you please
> attach a sample of your VCF file? Also, what version of VCFtools are
> you running?
>
> Petr
>
> On Mon, 2011-05-23 at 07:36 +0000, Shan Yang wrote:
>> Hi,
>>
>> I have written a VCF converter to convert Complete Genomics variation
>> files to VCF. However, the output does not pass validation. Most of
>> the error messages I get are related to no-calls. For example, I got
>> these two error messages:
>>
>> column GS19240-1100-37-ASM at 1:11218 .. Could not validate the int [N]
>> column GS19240-1100-37-ASM at 1:13376 .. Could not validate the int [N]
>>
>> And the corresponding VCF file lines are:
>>
>> 1       11218   .       TGCCAGGGCGCCCCCTGCTGG   TGCCAGGGCGCNNCCNGCNGG   0       PASS    NS=1    GT:GQ:HQ:segDup:rCov:cPd        1/.:0:22,0:7:1.25:N
>> 1       13376   .       T       .       120     PASS    NS=1    GT:GQ:HQ:segDup:rCov:cPd        0/.:120:120,120:7:1.25:N
>>
>> In the first record, one allele is a partial no-call (hence the Ns in
>> ALT) and the other allele is a complete no-call. The second record is
>> similar, but one allele is called as reference and the other is a
>> no-call. In your previous email you said that no-calls can be
>> represented with '.', but that does not seem to work here.
>>
>> So how should I represent no-calls (half or total) in VCF? Another
>> problem in these cases is that if the variant type is not a SNP, the
>> validator complains that the leading base is missing (i.e. Could not
>> parse the allele(s) [N], first base does not match the reference.) If
>> I put a leading base in front of a no-call, should it be AN or A. or
>> something else?
>>
>> Another thing I noticed is that although the reference and alt
>> alleles are case-insensitive as of VCF 4.1, the validation program
>> still complains wherever there is a lower-case letter.
>>
>> Please comment on the no call situation.
>>
>> Thank you very much!
>>
>> Shan
>>
>>
>> -----Original Message-----
>> From: Petr Danecek [mailto:pd...@sa...]
>> Sent: Wednesday, May 18, 2011 8:40 AM
>> To: Shan Yang
>> Cc: vcf...@li...
>> Subject: RE: [Vcftools-help] VCF format questions
>>
>> Hi Shan,
>>
>> On Mon, 2011-05-16 at 19:10 +0000, Shan Yang wrote:
>>> Hi, Petr
>>>
>>> Thanks a lot for your reply. So for question one, in cases where
>>> multiple dbSNP IDs are mapped to one location, would you also discard
>>> them, or would you keep one in the ID field?
>>
>> A record can have multiple IDs, but one ID cannot be assigned to
>> multiple loci.
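
A minimal sketch of that rule as a check, assuming the records have already been parsed into (CHROM, POS, ID) tuples. The function name ids_at_multiple_loci and the rsA/rsB/rsC identifiers are made up for illustration.

```python
from collections import defaultdict

def ids_at_multiple_loci(records):
    """Return {id: sorted list of loci} for every ID seen at more than one locus."""
    loci = defaultdict(set)
    for chrom, pos, id_field in records:
        if id_field == ".":
            continue  # "." means the record has no ID
        for ident in id_field.split(";"):  # a record may list several IDs
            loci[ident].add((chrom, pos))
    return {ident: sorted(seen) for ident, seen in loci.items() if len(seen) > 1}

# Hypothetical records: (CHROM, POS, ID)
records = [
    ("1", 100, "rsA;rsB"),  # two IDs on one record: allowed
    ("1", 200, "rsC"),
    ("2", 300, "rsB"),      # rsB reused at a second locus: not allowed
]
print(ids_at_multiple_loci(records))  # {'rsB': [('1', 100), ('2', 300)]}
```

A converter could run such a check before writing the ID column and drop (or replace with '.') any identifier it reports.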
>>
>>
>>> Another question I forgot to mention in my previous email is that in
>>> the example on the document page, there is one like this:
>>>
>>> #CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO             FORMAT       NA00001         NA00002         NA00003
>>>
>>> 20      1230237  .   T    .    47    PASS    NS=3;DP=13;AA=T  GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
>>>
>>>
>>> Which is described as “a site that is called monomorphic reference
>>> (i.e. with no alternate alleles)”. I don’t understand why this record
>>> has to be there, since this is a variant file.
>>
>> This is just an example showing how to represent records without any
>> alternate alleles; it does not have to be there. Making a reference
>> call is different from missing information (not making a call at all).
>>
>>
>>> I don’t totally understand the “phasing” concept in the VCF file. My
>>> understanding of phasing is that we know certain variants are on the
>>> same allele. So the phasing information always relates genotype calls
>>> at different genomic locations, not different samples, right? If this
>>> is the case, then it is not obvious which genotype calls belong to one
>>> group if there is no identifier to assign them. The “PS” genotype
>>> field seems to serve this purpose, but I don’t see this field in the
>>> example file (like the one I showed above). How can I interpret the
>>> phasing information in NA00001 and NA00002?
>>
>> In phased genotypes, the alleles are listed in the same order. Thus
>> the first alleles of the phased records all appear on one chromosome
>> and the second alleles on the other. As described in the spec, if the
>> PS tag is not given, all records are assumed to belong to the same
>> group.
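
The rule can be sketched in a few lines of Python. This assumes one sample, no PS tag (so every record is in the same group), and records already parsed into (REF, list of ALT, GT) tuples. The function name haplotypes and the three records are invented for illustration.

```python
def haplotypes(records):
    """Collect the alleles of phased records into one allele list per chromosome copy."""
    haps = []
    for ref, alts, gt in records:
        if "/" in gt:
            raise ValueError("unphased genotype: " + gt)
        alleles = [ref] + alts  # index 0 is REF, 1.. are the ALT alleles
        calls = [alleles[int(i)] if i != "." else "." for i in gt.split("|")]
        if not haps:
            haps = [[] for _ in calls]
        # the n-th allele of every phased GT goes onto the n-th chromosome copy
        for hap, allele in zip(haps, calls):
            hap.append(allele)
    return haps

# Hypothetical phased records: (REF, [ALT...], GT)
records = [
    ("A", ["G"], "0|1"),
    ("C", ["T"], "1|0"),
    ("T", ["A"], "1|1"),
]
print(haplotypes(records))  # [['A', 'T', 'A'], ['G', 'C', 'A']]
```

With PS tags present, the same loop would be run once per phase set rather than over all records together.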
>>
>>
>>
>>> Another question is Complete Genomics specific (more or less). We
>>> have calls with “?” in them, meaning we are not sure of the length of
>>> the variant given the evidence. In the first example shown here, we
>>> know that at this location there is a SNP of C->A in one allele, but
>>> we are not sure what happened in the other allele. We are not even
>>> sure about the length of the 2nd allele (if we knew the length was 1
>>> but did not know the genotype, we would put an “N” instead of a “?”
>>> there). In the 2nd case, although there is a “?” in the 2nd allele,
>>> there is some information in that allele that could potentially be
>>> useful.
>>>
>>> chr1    753404  753405  half    snp     C       A       ?
>>>
>>> chr1    946133  946135  half    sub     TT      G       ?TT
>>>
>>>
>>>
>>> Is there a way to represent this kind of case in VCF, or do we need
>>> to discard the 2nd allele completely? Especially for the 2nd line
>>> shown here, there would be quite some loss of information if we
>>> discarded the 2nd allele entirely. I would love to hear your comments.
>>
>> Missing information is expressed by a dot. So these are valid GT
>> records: ./. ./0
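
As a small illustration, a GT string can be classified by how many of its alleles are dots. This is only a sketch; the function name call_status and the three labels are hypothetical and not part of the VCF specification.

```python
import re

def call_status(gt):
    """Classify a GT string as 'called', 'partial' or 'no-call' by counting '.' alleles."""
    alleles = re.split(r"[/|]", gt)  # "/" separates unphased, "|" phased alleles
    missing = sum(allele == "." for allele in alleles)
    if missing == 0:
        return "called"
    if missing == len(alleles):
        return "no-call"
    return "partial"

for gt in ("0/1", "./0", "./."):
    print(gt, call_status(gt))
```

A converter could use the same idea in reverse: emit '.' for every allele that has no call, and a numeric index only for the alleles that were actually called.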
>>
>> Best,
>> Petr
>>
>>
>>>
>>>
>>>
>>> Thanks a lot!
>>>
>>>
>>>
>>> Shan
>>>
>>> From: Petr Danecek [mailto:pd...@sa...]
>>> Sent: Friday, May 13, 2011 8:17 PM
>>> To: Shan Yang
>>> Cc: vcf...@li...
>>> Subject: Re: [Vcftools-help] VCF format questions
>>>
>>>
>>>
>>>
>>> Hello Shan,
>>>
>>> On May 13, 2011, at 7:59 PM, Shan Yang wrote:
>>>
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>>
>>> I am a scientist working for Complete Genomics. I am in the process
>>> of writing a converter that converts CG variation files into VCF
>>> files. I have read the document and have some questions:
>>>
>>>
>>>
>>>
>>> 1)      One of the required fields is ID, which according to the
>>> description is a: semi-colon separated list of unique identifiers
>>> where available. If this is a dbSNP variant, it is encouraged to use
>>> the rs number(s). No identifier should be present in more than one
>>> data record.
>>>
>>>
>>> As far as I know, there can be dbSNP IDs that are mapped to multiple
>>> locations on the genome (largely due to mapping ambiguity). How do
>>> you deal with this kind of case? Discard them, or add another field
>>> to distinguish these non-unique entries?
>>>
>>>
>>>
>>>
>>>
>>> I remember encountering such dbSNP IDs before. There were multiple
>>> IDs for the same genomic locus and one of the IDs was mapped to
>>> multiple locations. The correct solution was to discard the ambiguous
>>> ID.
>>>
>>> 2)      In one of the examples shown in the document, the sequences
>>> are like this:
>>>
>>>
>>> Ref:        atcgcg--a
>>>
>>>
>>> A1:         atcg----a
>>>
>>>
>>> A2:         atcgcgcga
>>>
>>>
>>> And the VCF record is
>>>
>>>
>>> 20           2              .               TCGCG  TCG,TCGCGCG  .  PASS      DP=100
>>>
>>>
>>>
>>>
>>> To me, the “TC” in the “TCG” part is redundant. Could it just be
>>> presented as “GCG           G,GCGCG”? Is there any reason why you
>>> include two more letters in this case?
>>>
>>> This is probably a copy-and-paste type of error and has been fixed;
>>> the specification recommends that the simplest representation
>>> possible be used. Thanks for noticing.
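
One way to arrive at the shorter record suggested in the question is to strip bases that all alleles share at their start, keeping at least one base in each allele and moving POS forward by the number of bases removed. The Python sketch below does only that; the function name trim_leading is an assumption, and it is not taken from VCFtools. Trimming shared trailing bases first would give a different record for the same alleles.

```python
def trim_leading(pos, ref, alts):
    """Strip bases shared at the start of all alleles, keeping at least one base each."""
    alleles = [ref] + list(alts)
    while (min(len(a) for a in alleles) > 1
           and len({a[0] for a in alleles}) == 1):
        alleles = [a[1:] for a in alleles]
        pos += 1  # every removed leading base shifts the record one position right
    return pos, alleles[0], alleles[1:]

# The record from the question: POS 2, REF TCGCG, ALT TCG,TCGCGCG
print(trim_leading(2, "TCGCG", ["TCG", "TCGCGCG"]))  # (4, 'GCG', ['G', 'GCGCG'])
```

Note that POS changes from 2 to 4 here, so the trimmed REF still starts at the base it describes in the reference sequence.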
>>>
>>>
>>>
>>>
>>>
>>> Best,
>>>
>>>
>>> Petr
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>> Shan
>>>
>>>
>>>
>>> -- The Wellcome Trust Sanger Institute is operated by Genome  
>>> Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose registered
>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>>
>>>
>>>
>>> ______________________________________________________________________
>>>
>>> The contents of this e-mail and any attachments are confidential and
>>> only for use by the intended recipient. Any unauthorized use,
>>> distribution or copying of this message is strictly prohibited. If  
>>> you
>>> are not the intended recipient please inform the sender  
>>> immediately by
>>> reply e-mail and delete this message from your system. Thank you for
>>> your co-operation.
> <err_example><NA19240_example.vcf>
