#51 --max-missing yields wrong proportion

Group: v1.0_(example)
Status: closed
Owner: nobody
Labels: None
Priority: 1
Updated: 2014-12-04
Created: 2014-12-02
Private: No

I am using vcftools v0.1.12b.
When I use the option --max-missing 0.85, I expect my file to be filtered so that only the positions with less than 15% missing data remain.
However, there seems to be an error in the math.
When filtering a dataset with 36 individuals (72 chromosomes), I find positions in the "recoded" file with up to 10 missing genotypes. From my testing, I figured out that vcftools counts each individual with missing data as a single genotype, whereas individuals with data are counted as two genotypes. This will, of course, ruin the proportion.
I have tracked the issue in the source code to the way "get_N_chr()" is calculated (cpp/entry_getters.cpp file). However, I have not been able to fix it in the code because I lack "C" skills. Hardcoding the output of this function to the correct number for my dataset does yield the expected results with the correct proportion.


Discussion

  • Francisco Pina Martins

    Here is an example line, along with the relevant part of the header, from the VCF file I am using. Yes, it is a rather old file.

    ##fileformat=VCFv4.0
    ##fileDate=20130524
    ##source="Stacks v0.999991"
    un      121     2       G       T       .       PASS    NS=23;AF=0.978:0.022;   GT:DP:GL        0/0:61:.,.,.    .:0:.,.,.       0/0:4:.,.,.     0/0:38:.,.,.    0/0:19:.,.,.    0/0:39:.,.,.    .:0:.,.,.       0/0:3:.,.,.     0/0:101:.,.,. 0/0:75:.,.,.     0/0:56:.,.,.    0/0:70:.,.,.    .:0:.,.,.       0/0:17:.,.,.    0/0:46:.,.,.    .:0:.,.,.       0/1:12:.,-7.42971,.     0/0:5:.,.,.     .:0:.,.,.       .:0:.,.,.       0/0:50:.,.,.    .:0:.,.,.       0/0:10:.,.,.    0/0:53:.,.,.   .:0:.,.,.       .:0:.,.,.       0/0:75:.,.,.    0/0:26:.,.,.    .:0:.,.,.       0/0:9:.,.,.     0/0:111:.,.,.   0/0:81:.,.,.    .:0:.,.,.       .:0:.,.,.       0/0:20:.,.,.    .:0:.,.,.
    

    It is indeed quite different from the VCF files generated by pyRAD. Could the issue simply be a poorly written VCF file?

    Thanks.

    PS - Edited for readability.

     

    Last edit: Francisco Pina Martins 2014-12-03
    • Anthony Marcketta

      The problem here is that your missing data is encoded as a single period '.' rather than './.', and these mean two different things in VCF. When there is just a single period, vcftools assumes that the individual is haploid at this position, which is why you are getting an incorrect number of missing chromosomes.
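
To make the arithmetic concrete, here is a small Python sketch. It is a simplified model of the counting described above, not vcftools' own code: the function name `call_rate` and the 27-called / 9-missing split are made up for illustration, and the assumption is that each sample contributes as many chromosomes as its GT field has alleles.

```python
import re


def call_rate(genotypes):
    """Fraction of chromosomes with a called allele.

    Assumption for illustration: the ploidy of a sample is the number of
    alleles in its GT string, so "." counts as 1 chromosome and "./." as 2.
    """
    total = 0
    called = 0
    for gt in genotypes:
        alleles = re.split(r"[/|]", gt)  # handle unphased "/" and phased "|"
        total += len(alleles)
        called += sum(1 for allele in alleles if allele != ".")
    return called / total


# Hypothetical site: 36 samples, 27 called as diploid and 9 missing.
bare_dot = ["0/0"] * 27 + ["."] * 9    # missing written as "."
diploid = ["0/0"] * 27 + ["./."] * 9   # missing written as "./."

rate_bare = call_rate(bare_dot)   # 54 called out of 63 chromosomes
rate_full = call_rate(diploid)    # 54 called out of 72 chromosomes
```

With these made-up numbers the same site scores about 0.857 when missing genotypes are written as '.', but 0.75 when they are written as './.', so it would sit on opposite sides of a 0.85 threshold depending only on the encoding.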


      From: Francisco Pina Martins [stuntspt@users.sf.net]
      Sent: Wednesday, December 03, 2014 5:58 AM
      To: [vcftools:bugs]
      Subject: [vcftools:bugs] #51 --max-missing yields wrong proportion

      Here is an example line, along with the relevant part of the header, from the VCF file I am using. Yes, it is a rather old file.

      ##fileformat=VCFv4.0
      ##fileDate=20130524
      ##source="Stacks v0.999991"

      un 121 2 G T . PASS NS=23;AF=0.978:0.022; GT:DP:GL 0/0:61:.,.,. .:0:.,.,. 0/0:4:.,.,. 0/0:38:.,.,. 0/0:19:.,.,. 0/0:39:.,.,. .:0:.,.,. 0/0:3:.,.,. 0/0:101:.,.,. 0/0:75:.,.,. 0/0:56:.,.,. 0/0:70:.,.,. .:0:.,.,. 0/0:17:.,.,. 0/0:46:.,.,. .:0:.,.,. 0/1:12:.,-7.42971,. 0/0:5:.,.,. .:0:.,.,. .:0:.,.,. 0/0:50:.,.,. .:0:.,.,. 0/0:10:.,.,. 0/0:53:.,.,. .:0:.,.,. .:0:.,.,. 0/0:75:.,.,. 0/0:26:.,.,. .:0:.,.,. 0/0:9:.,.,. 0/0:111:.,.,. 0/0:81:.,.,. .:0:.,.,. .:0:.,.,. 0/0:20:.,.,. .:0:.,.,.
      un 437 5 A T . PASS NS=14;AF=0.071:0.929; GT:DP:GL 1/1:4:.,.,. 1/1:67:.,.,. .:0:.,.,. 1/1:61:.,.,. .:0:.,.,. 1/1:31:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 1/1:32:.,.,. 1/1:26:.,.,. .:0:.,.,. 1/1:52:.,.,. .:0:.,.,. 0/0:44:.,.,. .:0:.,.,. 1/1:25:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 1/1:44:.,.,. .:0:.,.,. .:0:.,.,. 1/1:9:.,.,. 1/1:18:.,.,. 1/1:84:.,.,. .:0:.,.,. 1/1:40:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,.
      un 450 5 C G . PASS NS=14;AF=0.929:0.071; GT:DP:GL 0/0:4:.,.,. 0/0:67:.,.,. .:0:.,.,. 1/1:61:.,.,. .:0:.,.,. 0/0:31:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 0/0:32:.,.,. 0/0:26:.,.,. .:0:.,.,. 0/0:52:.,.,. .:0:.,.,. 0/0:44:.,.,. .:0:.,.,. 0/0:25:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 0/0:44:.,.,. .:0:.,.,. .:0:.,.,. 0/0:9:.,.,. 0/0:18:.,.,. 0/0:84:.,.,. .:0:.,.,. 0/0:40:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,.
      un 453 5 A T . PASS NS=14;AF=0.857:0.143; GT:DP:GL 0/0:4:.,.,. 0/0:67:.,.,. .:0:.,.,. 1/1:61:.,.,. .:0:.,.,. 0/0:31:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 0/0:32:.,.,. 0/0:26:.,.,. .:0:.,.,. 0/0:52:.,.,. .:0:.,.,. 1/1:44:.,.,. .:0:.,.,. 0/0:25:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 0/0:44:.,.,. .:0:.,.,. .:0:.,.,. 0/0:9:.,.,. 0/0:18:.,.,. 0/0:84:.,.,. .:0:.,.,. 0/0:40:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,.
      un 457 5 G T . PASS NS=14;AF=0.071:0.929; GT:DP:GL 1/1:4:.,.,. 1/1:67:.,.,. .:0:.,.,. 1/1:61:.,.,. .:0:.,.,. 1/1:31:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 1/1:32:.,.,. 1/1:26:.,.,. .:0:.,.,. 1/1:52:.,.,. .:0:.,.,. 0/0:44:.,.,. .:0:.,.,. 1/1:25:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,. 1/1:44:.,.,. .:0:.,.,. .:0:.,.,. 1/1:9:.,.,. 1/1:18:.,.,. 1/1:84:.,.,. .:0:.,.,. 1/1:40:.,.,. .:0:.,.,. .:0:.,.,. .:0:.,.,.

      It is indeed quite different from the VCF files generated by pyRAD. Could the issue simply be a poorly written VCF file?

      Thanks.


      [bugs:#51] --max-missing yields wrong proportion (http://sourceforge.net/p/vcftools/bugs/51)

      Status: open
      Group: v1.0_(example)
      Created: Tue Dec 02, 2014 02:29 PM UTC by Francisco Pina Martins
      Last Updated: Tue Dec 02, 2014 02:29 PM UTC
      Owner: nobody


  • Francisco Pina Martins

    Ok, thanks for the clarification.
    You can close this as invalid. I'll just modify the VCF file to comply.

    Cheers!
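
For anyone else with a similar file, one possible way to do that rewrite is sketched below in Python. It is only an illustration under a few assumptions: the helper name `pad_missing_genotypes` is hypothetical, the input is a standard tab-delimited VCF, and every sample is meant to be diploid, so a bare '.' genotype should become './.'.

```python
def pad_missing_genotypes(line, missing="./."):
    """Rewrite bare "." genotypes as "./." in one VCF line.

    Header lines, lines without sample columns, and lines whose FORMAT
    column has no GT key are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:  # 8 fixed columns + FORMAT + at least one sample
        return line
    format_keys = fields[8].split(":")
    if "GT" not in format_keys:
        return line
    gt_index = format_keys.index("GT")
    for i in range(9, len(fields)):
        parts = fields[i].split(":")
        # Only the GT subfield is touched; DP, GL, etc. stay as they are.
        if len(parts) > gt_index and parts[gt_index] == ".":
            parts[gt_index] = missing
            fields[i] = ":".join(parts)
    return "\t".join(fields) + newline


# Shortened, tab-delimited version of the record at position 121.
record = "un\t121\t2\tG\tT\t.\tPASS\tNS=23\tGT:DP:GL\t0/0:61:.,.,.\t.:0:.,.,.\n"
fixed = pad_missing_genotypes(record)
```

Applying the function to every line of the input and writing the results to a new file leaves the header and all called genotypes as they were.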

     
  • Adam Auton

    Adam Auton - 2014-12-04
    • status: open --> closed
     
