#19 vcf2fq -- use INDELs in consensus sequence

open
1
2015-05-04
2013-02-20
No

The current behaviour of the vcf2fq procedure is to identify regions surrounding INDELs and convert each INDEL plus its surrounding region to lowercase. This means that the generated FASTA sequence is exactly the same length as the input sequence, and differs from the original sequence only at SNPs (rather than at both SNPs and INDELs).
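
As a rough illustration of the masking behaviour described above, here is a minimal Python sketch. It is not the vcfutils.pl code: the function name `mask_indel_regions`, the `(pos, length)` input format and the `window` parameter are all assumptions made for the example.

```python
def mask_indel_regions(seq, indels, window=5):
    """Lowercase each INDEL span plus `window` bases on either side.

    seq    -- consensus sequence as a string
    indels -- list of (pos, length) pairs; pos is a 0-based offset into seq
              (hypothetical input format, chosen for this sketch)
    window -- number of flanking bases to mask (assumed parameter)
    """
    chars = list(seq.upper())
    for pos, length in indels:
        start = max(0, pos - window)
        end = min(len(chars), pos + length + window)
        for i in range(start, end):
            chars[i] = chars[i].lower()
    return "".join(chars)


original = "ACGTACGTACGTACGT"
masked = mask_indel_regions(original, [(8, 1)], window=2)
print(masked)  # ACGTACgtacgTACGT
```

The output is always the same length as the input, which is why INDELs never change the sequence under this approach.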

The attached patch changes this behaviour to include INDELs with a high likelihood in the final reference sequence (likelihood modified by the '-L' option).

In addition, the patch allows for a reference FASTA sequence to be provided, so that the VCF file only needs to have non-reference information included -- previously, the VCF file needed to have one line of information for every base.
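
The idea of building a consensus from a reference sequence plus only the non-reference VCF records can be sketched as follows. This is an illustrative Python outline rather than the patch itself: `apply_variants` and its `(pos, ref_allele, alt_allele)` tuples are hypothetical, and the variants are assumed to be non-overlapping and already filtered by likelihood.

```python
def apply_variants(ref, variants):
    """Return `ref` with SNPs and INDELs applied.

    variants -- iterable of (pos, ref_allele, alt_allele); pos is 1-based,
                as in VCF. Assumed non-overlapping and pre-filtered.
    """
    out = []
    cursor = 0  # 0-based index of the next unconsumed reference base
    for pos, ref_allele, alt_allele in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variant at position %d" % pos)
        if ref[start:start + len(ref_allele)] != ref_allele:
            raise ValueError("REF allele mismatch at position %d" % pos)
        out.append(ref[cursor:start])  # unchanged bases before the variant
        out.append(alt_allele)         # substitute the alternate allele
        cursor = start + len(ref_allele)
    out.append(ref[cursor:])           # unchanged tail of the reference
    return "".join(out)


reference = "ACGTACGTAC"
variants = [
    (2, "C", "T"),    # SNP
    (4, "TA", "T"),   # 1-base deletion
    (8, "T", "TGG"),  # 2-base insertion
]
consensus = apply_variants(reference, variants)
print(consensus)  # ATGTCGTGGAC
```

Positions with no VCF record are copied straight from the reference, so the VCF does not need a line for every base, and the result can differ in length from the input once INDELs are applied.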

While it should work correctly for multi-fasta files, this code has only been tested on small single-chromosome sequences (mitochondria, enterococci).

1 Attachments

Discussion

  • David Eccles (gringer)

    I've modified the diff file to account for additional fields in the FASTA file headers that are not part of the sequence name.

     
  • Adam Auton

    Adam Auton - 2013-02-21
    • assigned_to: Petr Danecek
     
  • Gabriel

    Gabriel - 2015-05-04

    Dear David,
    Thanks for this useful script. I would like to use it, but I am having problems applying the changes in "vcf2fq_indels_v2.diff" to the "vcfutils.pl" file. I would be grateful if you could send me your modified version of vcfutils.pl.
    Thank you very much.
    Gabriel

     
