Re: [Vcftools-help] VCF format questions

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Thank you very much, Petr!! That is very helpful. Now my files all passed validation. You give great help!

However, I still have a few questions on VCF file format. Hope you can give some guidance on that.

1) The biggest problem of converting CG variant file into VCF format is how to distinguish called homozygous reference position and no called position.

Certainly, we can totally ignore both classes and just put confident variant calls into VCF file. However, doing this means for anything that are not in the file, the user won't know if that is because it is called hom-ref (confidently) or there is not enough information on that position. This information is especially important when people are doing genome comparison (like between tumor and normal samples). We can, of course, include one class (either hom-ref or nocalls) in the VCF file, but either way, it will make the file huge because VCF requires reference sequence to be presented in each line. I am wondering if there is any way to present this information in a more efficient way, like be able to say a block (can be quite long) is the same as reference without representing the actual sequence itself.

2) How to present something in between two adjacent bases in the reference.

In CG data, there are cases where there may be something (like an insertion) in between two adjacent bases on the reference. If it is a clear insertion, it is easy to represent in VCF file. However, sometimes, the possible insertion does not have enough supporting evidences, and then it ended up being no called. For example, say at chr1 between the 1000th base (say, "A") and 1001th base (say "G"), there is no-called "insertion". Since VCF does not quite have the concept of in between two adjacent bases, it has to be presented as something like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO            FORMAT      NA00001
1               1000    .       A       .       50      PASS            NS=1;DP=9       GT:GQ:DP    0/.:35:4

However, this is not correct. This is basically saying the 1000th position "A" is no called in one allele, not between "A" and "G" is no called. Same thing if I put 1001 and "G" in. It will be wrongly interpreted as "G" being no called. So how do I handle this kind of case where we no called nothing (on the reference).

3) About the separator ":".
So VCF uses ":" as the separator of fields in FORMAT and actual genome columns. In CG, we also use this symbol as separator in some of our fields. However, when trying to lift CG annotation to VCF file, I have to use some other symbol as separator to not confuse any program. Do you have any suggestions on what symbol to use for this purpose in cases where one field of FORMAT have multiple mini-fields.

Thank you very much!

Shan

-----Original Message-----
From: Petr Danecek [mailto:pd...@sa...]
Sent: Tuesday, May 24, 2011 3:31 AM
To: Shan Yang
Cc: vcf...@li...
Subject: Re: [Vcftools-help] VCF format questions

Hi,

the validator complains about N not being an integer, look at the tag
cPd. Do you need to state this explicitly? The ploidy is implicitly
present in the GT tag (x vs x/x vs x/x/x ...)

##FORMAT=<ID=cPd,Number=1,Type=Integer,Description="called
Ploidy(level)">
1       13376   .       T       .       120     PASS    NS=1
GT:GQ:HQ:segDup:rCov:cPd        0/.:120:120,120:7:1.25:N

Petr

On May 23, 2011, at 2:41 PM, Shan Yang wrote:

> Hi, Petr
>
> Thanks for the response. The VCFtools I used is vcftools_0.1.5. I
> downloaded from here:
>
> http://sourceforge.net/projects/vcftools/files/
>
> Here is an example file that contains 1st 1000 lines of my VCF file
> and the error message I got from running the following command:
>
> perl -MVcf -e validate NA19240_example.vcf 2> err_example
>
> Thank you very much!
>
> Shan
>
> -----Original Message-----
> From: Petr Danecek [mailto:pd...@sa...]
> Sent: Monday, May 23, 2011 12:55 AM
> To: Shan Yang
> Cc: vcf...@li...
> Subject: RE: [Vcftools-help] VCF format questions
>
> Hi Shan,
>
> I was not able to reproduce the errors you reported. Can you please
> attach a sample of your VCF file? Also, what version of VCFtools are
> you
> running?
>
> Petr
>
> On Mon, 2011-05-23 at 07:36 +0000, Shan Yang wrote:
>> Hi,
>>
>> I have wrote a VCF converter to convert Complete Genomics variation
>> files to VCF file. However, the validation part does not go
>> through. The most error message I got is related to no calls. For
>> example, I got two error messages:
>>
>> column GS19240-1100-37-ASM at 1:11218 .. Could not validate the int
>> [N]
>> column GS19240-1100-37-ASM at 1:13376 .. Could not validate the int
>> [N]
>>
>> And the corresponding VCF file lines are:
>>
>> 1       11218   .       TGCCAGGGCGCCCCCTGCTGG
>> TGCCAGGGCGCNNCCNGCNGG   0       PASS    NS=1
>> GT:GQ:HQ:segDup:rCov:cPd        1/.:0:22,0:7:1.25:N
>> 1       13376   .       T       .       120     PASS    NS=1
>> GT:GQ:HQ:segDup:rCov:cPd        0/.:120:120,120:7:1.25:N
>>
>> So the first one is a partial nocall with some Ns in ALT and one of
>> the allele has this partial nocall and the other is a complete no
>> call. The second one is similar but one allele is called as
>> reference and the other is nocalled. Although in your previous
>> email, you said that you can represent nocalls with '.', does not
>> seem to be so.
>>
>> So how should I represent nocalls (half or total) in VCF? Another
>> problem on these cases are if the variant type is not snp, it will
>> complain that the leading base is missing (i.e. Could not parse the
>> allele(s) [N], first base does not match the reference.) If I would
>> put a leading base in front of a nocall, should it be AN or A. or
>> something else?
>>
>> Another thing I noticed is that although from VCF 4.1, the
>> reference and alt are case insensitive, the validation program
>> still complains wherever there is a lower case letter.
>>
>> Please comment on the no call situation.
>>
>> Thank you very much!
>>
>> Shan
>>
>>
>> -----Original Message-----
>> From: Petr Danecek [mailto:pd...@sa...]
>> Sent: Wednesday, May 18, 2011 8:40 AM
>> To: Shan Yang
>> Cc: vcf...@li...
>> Subject: RE: [Vcftools-help] VCF format questions
>>
>> Hi Shan,
>>
>> On Mon, 2011-05-16 at 19:10 +0000, Shan Yang wrote:
>>> Hi, Petr
>>>
>>> Thanks a lot for your reply. So for question one, in cases where
>>> multiple dbSNP id are mapped to one location, you would also discard
>>> any of them or you will keep one in the ID space?
>>
>> a record can have multiple IDs, but one ID cannot be assigned to
>> multiple loci.
>>
>>
>>> Another question I forgot to mention in my previous email is that in
>>> the example on the document page, there is one like this:
>>>
>>> #CHROM             POS        ID            REF         ALT
>>> QUAL    FILTER   INFO      FORMAT              NA00001
>>> NA00002                NA00003
>>>
>>> 20           1230237                .               T              .
>>> 47           PASS      NS=3;DP=13;AA=T           GT:GQ:DP:HQ    0|
>>> 0:54:7:56,60                0|0:48:4:51,51   0/0:61:2
>>>
>>>
>>> Which is described as "a site that is called monomorphic reference
>>> (i.e. with no alternate alleles)". I don't understand why this
>>> record
>>> has to be there since this a variant file.
>>
>> This is just an example showing how to represent records without any
>> alternate alleles, it does not have to be there. Making a reference
>> call
>> is different from missing information (not making a call at all).
>>
>>
>>> I don't totally understand the "phasing" concept in the VCF file. My
>>> understanding of phasing is we know certain variants are on the same
>>> allele. So the phasing information is always among genotype calls at
>>> different genomic locations, not the relationship between samples,
>>> right? If this is the case, then it is not obvious which genotype
>>> calls belong to one group if there is no identifier to assign them.
>>> The "PS" in genotype field seems to serve this purpose, but I don't
>>> see this field being presented in the example file (like the one I
>>> shown above). How can I interoperate the phasing information in
>>> NA00001 and NA00002?
>>
>> In phased genotypes, the alleles are listed in the same order. Thus
>> the
>> first alleles of the phased records all appear on one chromosome
>> and the
>> second alleles on the other. As described in the spec, if PS tag is
>> not
>> given, all records are assumed to belong to the same group.
>>
>>
>>
>>> Another question is Complete Genomics specific (more or less). So we
>>> have calls with "?" in them, meaning we are not sure the length of
>>> the
>>> variants given the evidence. For example, the 1st example shown here
>>> is a case where we know at this location there is a snp of C->A in
>>> one
>>> allele, however, we are not sure what happened in the other
>>> allele. We
>>> are not even sure about the length of the 2nd allele (if we know the
>>> length is 1 but not knowing the genotype, we will put an "N" instead
>>> of a "?" there).  The 2nd case, although there is a  "?" in the 2nd
>>> allele, there is some information in that allele that could
>>> potentially be useful.
>>>
>>> chr1    753404  753405  half    snp     C       A       ?
>>>
>>> chr1    946133  946135  half    sub     TT      G       ?TT
>>>
>>>
>>>
>>> Is there a way to present this kind of case in VCF or we need to
>>> discard the 2nd allele completely? Especially for the 2nd line I
>>> shown
>>> here, there will be quite some loss of information if we totally
>>> discard the 2nd allele. I would love to hear your comments.
>>
>> Missing information is expressed by a dot. So these are valid GT
>> records: ./. ./0
>>
>> Best,
>> Petr
>>
>>
>>>
>>>
>>>
>>> Thanks a lot!
>>>
>>>
>>>
>>> Shan
>>>
>>> From: Petr Danecek [mailto:pd...@sa...]
>>> Sent: Friday, May 13, 2011 8:17 PM
>>> To: Shan Yang
>>> Cc: vcf...@li...
>>> Subject: Re: [Vcftools-help] VCF format questions
>>>
>>>
>>>
>>>
>>> Hello Shan,
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On May 13, 2011, at 7:59 PM, Shan Yang wrote:
>>>
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>>
>>> I am a scientist working for Complete Genomics. I am in the
>>> process of
>>> writing a converter that coverts CG variation file into VCF file. I
>>> have read the document  and have some questions:
>>>
>>>
>>>
>>>
>>> 1)      One of the required field is ID, which according to the
>>> description: semi-colon separated list of unique identifiers where
>>> available. If this is a dbSNP variant, it is encouraged to use the
>>> rs
>>> number(s). No identifier should be present in more than one data
>>> record.
>>>
>>>
>>> As far as I know, there could be dbSNP id that are mapped to
>>> multiple
>>> locations on the genome (largely due to mapping ambiguity). How do
>>> you
>>> deal with this kind of case? Discard them or add another field to
>>> distinguish these non-unique entries?
>>>
>>>
>>>
>>>
>>>
>>> I remember encountering such dbSNP IDs before. There were multiple
>>> IDs
>>> for the same genomic locus and one of the IDs was mapped to multiple
>>> locations. The correct solution was to discard the ambiguous ID.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 2)      In one of the examples shown in the document, the sequence
>>> are
>>> like this:
>>>
>>>
>>> Ref:        atcgcg--a
>>>
>>>
>>> A1:         atcg----a
>>>
>>>
>>> A2:         atcgcgcga
>>>
>>>
>>> And the VCF record is
>>>
>>>
>>> 20           2              .               TCGCG  TCG,TCGCGCG  .
>>> PASS      DP=100
>>>
>>>
>>>
>>>
>>> To me, the "TC" in the "TCG" part is redundant. Could you just
>>> presented as "GCG           G,GCGCG"? Is there any reason why you
>>> include two more letters in this case?
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> This is probably a copy-and-paste type of error and has been fixed,
>>> the specification recommends that the simplest representation
>>> possible
>>> should be used. Thanks for noticing.
>>>
>>>
>>>
>>>
>>>
>>> Best,
>>>
>>>
>>> Petr
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>> Shan
>>>
>>>
>>>
>>> -- The Wellcome Trust Sanger Institute is operated by Genome
>>> Research
>>> Limited, a charity registered in England with number 1021457 and a
>>> company registered in England with number 2742969, whose registered
>>> office is 215 Euston Road, London, NW1 2BE.
>>>
>>>
>>>
>>>
>>> ______________________________________________________________________
>>>
>>> The contents of this e-mail and any attachments are confidential and
>>> only for use by the intended recipient. Any unauthorized use,
>>> distribution or copying of this message is strictly prohibited. If
>>> you
>>> are not the intended recipient please inform the sender
>>> immediately by
>>> reply e-mail and delete this message from your system. Thank you for
>>> your co-operation.
>>
>>
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research
>> Limited, a charity registered in England with number 1021457 and a
>> company registered in England with number 2742969, whose registered
>> office is 215 Euston Road, London, NW1 2BE.
>>
>> ________________________________
>>
>> The contents of this e-mail and any attachments are confidential
>> and only for use by the intended recipient. Any unauthorized use,
>> distribution or copying of this message is strictly prohibited. If
>> you are not the intended recipient please inform the sender
>> immediately by reply e-mail and delete this message from your
>> system. Thank you for your co-operation.
>
>
>
>
> --
> The Wellcome Trust Sanger Institute is operated by Genome Research
> Limited, a charity registered in England with number 1021457 and a
> company registered in England with number 2742969, whose registered
> office is 215 Euston Road, London, NW1 2BE.
>
> ________________________________
>
> The contents of this e-mail and any attachments are confidential and
> only for use by the intended recipient. Any unauthorized use,
> distribution or copying of this message is strictly prohibited. If
> you are not the intended recipient please inform the sender
> immediately by reply e-mail and delete this message from your
> system. Thank you for your co-operation.
> <err_example><NA19240_example.vcf>

--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.

________________________________

The contents of this e-mail and any attachments are confidential and only for use by the intended recipient. Any unauthorized use, distribution or copying of this message is strictly prohibited. If you are not the intended recipient please inform the sender immediately by reply e-mail and delete this message from your system. Thank you for your co-operation.