Re: [Vcftools-help] VCF format questions

Hi Shan,

On Mon, 2011-05-16 at 19:10 +0000, Shan Yang wrote:
> Hi, Petr
> 
> Thanks a lot for your reply. So for question one, in cases where
> multiple dbSNP IDs are mapped to one location, would you also discard
> them, or would you keep one in the ID field?

A record can have multiple IDs, but one ID cannot be assigned to
multiple loci.
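
For a converter, this rule can be checked before writing the file. Below is a minimal sketch in Python, assuming the records are already available as (chrom, pos, ID column) tuples; the function name and the rs numbers are made up for illustration and are not part of vcftools.

```python
from collections import defaultdict

def find_ambiguous_ids(records):
    """Return IDs that are assigned to more than one locus.

    `records` is an iterable of (chrom, pos, id_field) tuples, where
    id_field is the raw ID column: "." or a semicolon-separated list.
    """
    seen = defaultdict(set)
    for chrom, pos, id_field in records:
        if id_field == ".":          # no ID given for this record
            continue
        for vid in id_field.split(";"):
            seen[vid].add((chrom, pos))
    return {vid: sorted(loci) for vid, loci in seen.items() if len(loci) > 1}

# Hypothetical input: rs2 is attached to two different loci.
records = [
    ("20", 14370, "rs1;rs2"),
    ("20", 17330, "."),
    ("21", 500, "rs2"),
]
ambiguous = find_ambiguous_ids(records)
print(ambiguous)  # {'rs2': [('20', 14370), ('21', 500)]}
```

Any ID reported here would then be dropped from the ID column of the affected records, while the remaining IDs stay.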

> Another question I forgot to mention in my previous email is that in
> the example on the document page, there is one like this:
> 
> #CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO             FORMAT       NA00001         NA00002         NA00003
> 20      1230237  .   T    .    47    PASS    NS=3;DP=13;AA=T  GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
>
> 
>  
> Which is described as “a site that is called monomorphic reference
> (i.e. with no alternate alleles)”. I don’t understand why this record
> has to be there since this is a variant file.

This is just an example showing how to represent records without any
alternate alleles, it does not have to be there. Making a reference call
is different from missing information (not making a call at all).

> I don’t totally understand the “phasing” concept in the VCF file. My
> understanding of phasing is we know certain variants are on the same
> allele. So the phasing information is always among genotype calls at
> different genomic locations, not the relationship between samples,
> right? If this is the case, then it is not obvious which genotype
> calls belong to one group if there is no identifier to assign them.
> The “PS” genotype field seems to serve this purpose, but I don’t
> see it in the example file (like the one I showed above). How can I
> interpret the phasing information in NA00001 and NA00002?

In phased genotypes, the alleles are listed in the same haplotype order
in every record. Thus the first alleles of the phased records all appear
on one chromosome and the second alleles on the other. As described in
the spec, if the PS tag is not given, all records are assumed to belong
to the same group.
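
To illustrate, here is a minimal Python sketch that reads the two haplotypes off a series of phased genotypes. It assumes one diploid sample, no missing alleles, and no PS tag (so all records form a single group); the function name and the example sites are made up.

```python
def haplotypes(sites):
    """Build the two haplotypes of one sample from phased genotypes.

    `sites` is a list of (ref, alts, gt) tuples, where alts is the list
    of ALT alleles and gt is a phased GT string such as "0|1".
    """
    hap_a, hap_b = [], []
    for ref, alts, gt in sites:
        if "|" not in gt:
            raise ValueError(f"genotype {gt!r} is not phased")
        alleles = [ref] + alts       # index 0 is REF, 1.. are the ALTs
        first, second = (int(i) for i in gt.split("|"))
        hap_a.append(alleles[first])   # allele left of "|"
        hap_b.append(alleles[second])  # allele right of "|"
    return hap_a, hap_b

# Hypothetical phased calls at three consecutive sites.
sites = [
    ("G", ["A"], "1|0"),
    ("T", ["A"], "0|1"),
    ("A", ["G", "T"], "1|2"),
]
hap_a, hap_b = haplotypes(sites)
print(hap_a)  # ['A', 'T', 'G']
print(hap_b)  # ['G', 'A', 'T']
```

With PS present, the same loop would be run separately for each PS value.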

> Another question is Complete Genomics specific (more or less). We
> have calls with “?” in them, meaning we are not sure of the length of
> the variants given the evidence. For example, the 1st example shown
> here is a case where we know that at this location there is a SNP of
> C->A in one allele, but we are not sure what happened in the other
> allele. We are not even sure about the length of the 2nd allele (if we
> knew the length was 1 but not the genotype, we would put an “N”
> instead of a “?” there). In the 2nd case, although there is a “?” in
> the 2nd allele, there is some information in that allele that could
> potentially be useful.
> 
> chr1    753404  753405  half    snp     C       A       ?
> chr1    946133  946135  half    sub     TT      G       ?TT
>
> Is there a way to represent this kind of case in VCF, or do we need to
> discard the 2nd allele completely? Especially for the 2nd line I showed
> here, there would be quite some loss of information if we totally
> discarded the 2nd allele. I would love to hear your comments.

Missing information is expressed by a dot. So these are valid GT
values: ./. and ./0
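
As a rough Python sketch of how a reader of the file might tell these cases apart (the helper names and the category labels are my own, not taken from the spec or from vcftools):

```python
import re

def parse_gt(gt):
    """Split a GT string into (alleles, phased); a "." becomes None."""
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased

def describe(gt):
    """Label a genotype as no call, partial call, reference or variant call."""
    alleles, _ = parse_gt(gt)
    if all(a is None for a in alleles):
        return "no call"
    if any(a is None for a in alleles):
        return "partial call"
    if all(a == 0 for a in alleles):
        return "reference call"
    return "variant call"

for gt in ["./.", "./0", "0/0", "0|1"]:
    print(gt, describe(gt))
# ./. no call
# ./0 partial call
# 0/0 reference call
# 0|1 variant call
```

This also shows the difference between a reference call (0/0) and not making a call at all (./.).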

Best,
Petr

>
> Thanks a lot!
>
> Shan
>
> From: Petr Danecek [mailto:pd...@sa...] 
> Sent: Friday, May 13, 2011 8:17 PM
> To: Shan Yang
> Cc: vcf...@li...
> Subject: Re: [Vcftools-help] VCF format questions
>
> Hello Shan,
>
> On May 13, 2011, at 7:59 PM, Shan Yang wrote:
>
> Hi,
>
> I am a scientist working for Complete Genomics. I am in the process of
> writing a converter that converts a CG variation file into a VCF file.
> I have read the document and have some questions:
>
> 1) One of the required fields is ID, which according to the
> description is a: semi-colon separated list of unique identifiers where
> available. If this is a dbSNP variant, it is encouraged to use the rs
> number(s). No identifier should be present in more than one data
> record.
>
> As far as I know, there could be dbSNP IDs that are mapped to multiple
> locations on the genome (largely due to mapping ambiguity). How do you
> deal with this kind of case? Discard them or add another field to
> distinguish these non-unique entries?
>
> I remember encountering such dbSNP IDs before. There were multiple IDs
> for the same genomic locus and one of the IDs was mapped to multiple
> locations. The correct solution was to discard the ambiguous ID.
>
> 2) In one of the examples shown in the document, the sequences are
> like this:
>
> Ref:        atcgcg--a
> A1:         atcg----a
> A2:         atcgcgcga
>
> And the VCF record is
>
> 20  2  .  TCGCG  TCG,TCGCGCG  .  PASS  DP=100
>
> To me, the “TC” in the “TCG” part is redundant. Could it just be
> presented as “GCG  G,GCGCG”? Is there any reason why you
> include two more letters in this case?
>
> This is probably a copy-and-paste error and has been fixed. The
> specification recommends that the simplest representation possible
> should be used. Thanks for noticing.
>
> Best,
> Petr
>
> Thanks!
>
> Shan
>
> -- The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
