virHap Wiki

A Software package for Inferring the haplotype structure of a mixed vi

Brought to you by: chedonat

Getting Started

Program overview

VirHap uses bowtie2 for read alignment and consists of several programs corresponding to
the different steps stated in the first section. The alignment step can be skipped if the user provides SAM files of aligned reads. The most important commands are:

’filterSam’: for filtering the reads according to their quality and their number of mismatches.

’removeDup’: for virion extraction (duplicate removal).

’coverSam’: for computing the coverage along the genome (and other information such as the mismatch
frequencies, the insert size distribution and the consensus genome).

’bayesErrorModel’: for error detection using a Bayesian approach, estimation of the per-base mutation
rates and detection of mutation types.

’MutantSelection’: for mutation selection.

’haploCount’: identifies the local haplotypes and their counts, and also reports the linkage association
between pairs of mutations.

’treeConstruct’: for the reconstruction of the final evolutionary trees and the generation of a list of
haplotypes along with their rates.

Note that each of the programs listed above can be run separately. Most of the time, the main input of a
program is the main output of the previous one.

The package also includes two pipelines that allow the user to run the whole process in one go: the shell
script ’virhap.sh’ and the program ’pipeHap’. The principal command remains treeConstruct.

The pipeline: pipeHap

pipeHap allows the user to run the whole process in one go. See section 4.8 for a detailed description.
pipeHap takes as input:

A reference genome

A collection of read samples (or SAM files for a collection of aligned read samples)

pipeHap outputs:

The generated evolutionary trees in Newick format (file suffix: supertree.newick), so that they
can be visualized in most tree visualisation software.

The distinct haplotypes (represented as lists of mutations) and their rates (file suffix: hapdistinct.csv).

The inferred haplotype genomes (file suffix: hapgen.fasta) if the option ”-r” is provided.

The list of mutations and their rates (file suffix: newmutants.csv).

The command syntax is:
./pipeHap [options] reference_file reads_files

reference_file is the reference genome filename and reads_files is one or more (separated by commas)
read sample files.

The main options are:
-S: specify this option if the reads files provided are SAM files (the alignment process will be
skipped).
-g: output the genomes of the reconstructed haplotypes. If not specified, the haplotypes are represented
only by their lists of mutations.
-m/-M: specify the minimum and maximum mutation prevalences to consider (default: m = 0.05, M
= 1.0).
-u: specify this option if the reads are unpaired.
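
As a hedged illustration of the syntax and options above, the following Python sketch assembles a pipeHap argument list. The helper name build_pipehap_command and its keyword arguments are hypothetical and not part of the package. The sketch also assumes that -m and -M take their value as the following argument, which the text above does not state.

```python
def build_pipehap_command(reference_file, reads_files, sam_input=False,
                          output_genomes=False, min_prev=None, max_prev=None,
                          unpaired=False, executable="./pipeHap"):
    """Return the argument list for a pipeHap call (hypothetical helper)."""
    cmd = [executable]
    if sam_input:
        cmd.append("-S")          # reads files are SAM files, alignment is skipped
    if output_genomes:
        cmd.append("-g")          # also output the haplotype genomes
    if min_prev is not None:
        # Assumption: the value follows the flag as a separate argument.
        cmd += ["-m", str(min_prev)]
    if max_prev is not None:
        cmd += ["-M", str(max_prev)]
    if unpaired:
        cmd.append("-u")          # reads are unpaired
    cmd.append(reference_file)
    # Several read sample files are passed as one comma-separated argument.
    cmd.append(",".join(reads_files))
    return cmd


if __name__ == "__main__":
    print(" ".join(build_pipehap_command(
        "ref.fasta", ["a.sam", "b.sam"], sam_input=True,
        min_prev=0.05, max_prev=1.0)))
```

The resulting list can be passed to subprocess.run once the executable is available. Building a list rather than a single string avoids shell quoting problems with file names.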

The shell script: virhap.sh

The package also contains a shell script that is simpler to use, takes fewer parameters and also runs the
whole process. Its syntax is:

./virhap.sh reference_file sam_files max_rate min_rate

It takes as input:
A reference genome: reference_file
A collection of aligned read samples: sam_files
The maximum mutation prevalence to consider: max_rate
The minimum mutation prevalence to consider: min_rate

It also outputs the list of haplotypes, their genomes and their prevalences, the constructed evolutionary trees, and the list of mutations and their prevalences.
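
Because the script takes the maximum prevalence before the minimum, the two values are easy to swap. The Python sketch below builds the argument list in the documented order and checks the bounds first. The helper name build_virhap_command is hypothetical, and the check that both values lie between 0 and 1 is an assumption based on prevalences being fractions.

```python
def build_virhap_command(reference_file, sam_files, max_rate, min_rate,
                         script="./virhap.sh"):
    """Return the argument list for virhap.sh (hypothetical helper).

    sam_files is passed through unchanged as a single argument.
    """
    # Assumption: prevalences are fractions in [0, 1] with min_rate <= max_rate.
    if not 0.0 <= min_rate <= max_rate <= 1.0:
        raise ValueError("expected 0 <= min_rate <= max_rate <= 1")
    # Documented order: reference, SAM files, max rate, then min rate.
    return [script, reference_file, sam_files, str(max_rate), str(min_rate)]


if __name__ == "__main__":
    print(" ".join(build_virhap_command("ref.fasta", "sample.sam", 1.0, 0.05)))
```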

The main command : treeConstruct

The principal command of the whole package is treeConstruct; it is the command that performs the
viral quasispecies reconstruction. As illustrated in figure 2, it takes as input tables of local haplotype counts and constructs an evolutionary supertree showing the evolutionary history, with the haplotypes and their prevalences at the leaves.

treeConstruct takes as input a CSV or text file containing the local haplotype counts for each
set of close mutations. Each entry represents a table and is given in the following format:

Name set 1
Clones, mutations 1, mutations 2, ..., mutations m, Counts
clone 1,c11, c12,..,c1m,n1
clone 2,c21, c22,..,c2m,n2
...
clone N,cN1, cN2,..,cNm,nN
[Mutbase,mutbase 1,mutbase 2,...,mutbase m]
. . . .
Name set i
Clones, mutations 1, mutations 2, ..., mutations m, Counts
clone 1,c11, c12,..,c1m,n1
clone 2,c21, c22,..,c2m,n2
...
clone N,cN1, cN2,..,cNm,nN
[Mutbase,mutbase 1,mutbase 2,...,mutbase m]
. . . .

Clones, Counts and Mutbase are constant strings in the file.
- Name set k: the name of the kth set of close mutations.
- mutations j: the jth mutation of the set.
- clone i: the ith observed clone, and ni its count.
- cij = 1 if the observed clone i has mutation j, and 0 otherwise.
- mutbase j: the mutant base (A, G, C or T) at the site of mutation j.
- The last line of each entry (Mutbase...) is optional, but it is useful if the program should output the haplotype genomes.
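
To make the table layout concrete, here is a minimal Python sketch that reads text in this format and computes, for each mutation, the fraction of counted clones carrying it. It is not the parser used by treeConstruct. The function names and the example data are hypothetical. The sketch assumes that a set name contains no comma and that the square brackets around the Mutbase line only mark it as optional.

```python
def parse_haplotype_tables(text):
    """Parse local haplotype count tables into a list of dictionaries."""
    tables = []
    current = None
    for raw in text.splitlines():
        line = raw.strip().strip("[]")   # tolerate a bracketed Mutbase line
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) == 1:
            # Assumption: a line without commas is the name of a new set.
            current = {"name": fields[0], "mutations": [],
                       "clones": [], "mutbase": None}
            tables.append(current)
        elif current is None:
            raise ValueError("table row found before any set name")
        elif fields[0] == "Clones":
            current["mutations"] = fields[1:-1]      # drop 'Clones' and 'Counts'
        elif fields[0] == "Mutbase":
            current["mutbase"] = fields[1:]
        else:
            presence = [int(v) for v in fields[1:-1]]  # the cij values (0 or 1)
            current["clones"].append((fields[0], presence, int(fields[-1])))
    return tables


def mutation_frequencies(table):
    """Fraction of the total clone count that carries each mutation."""
    total = sum(count for _, _, count in table["clones"])
    if total == 0:
        raise ValueError("table has no counted clones")
    freqs = {}
    for j, name in enumerate(table["mutations"]):
        carrying = sum(count for _, presence, count in table["clones"]
                       if presence[j])
        freqs[name] = carrying / total
    return freqs


EXAMPLE = """Set1
Clones,S1 145,S1 160,Counts
clone 1,0,0,120
clone 2,1,0,30
clone 3,1,1,50
Mutbase,A,G
"""

if __name__ == "__main__":
    for table in parse_haplotype_tables(EXAMPLE):
        print(table["name"], mutation_frequencies(table))
```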

Ideally, mutation names should follow the nomenclature S’segment’ ’position’, giving the
segment and the position of the mutation on the genome, e.g. S1 145 for the mutation at position 145 of
the first segment, S4 468 for the mutation at position 468 of the fourth segment.

Remark. If the mutation names do not match the previous nomenclature, or if the optional last line of
the tables (Mutbase...) is not provided, the program won't output the genomes of the haplotypes. Each
reconstructed haplotype will then be represented by the list of mutations that characterizes it.
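
The naming rule and the remark above can be checked before running treeConstruct. The Python sketch below is only an illustration; the function names are hypothetical. Since the separator between the segment and the position is not certain from the text, the pattern accepts either a space or an underscore, which is an assumption.

```python
import re

# Assumption: 'S', segment number, one space or underscore, then the position.
_MUTATION_NAME = re.compile(r"^S(\d+)[ _](\d+)$")


def parse_mutation_name(name):
    """Return (segment, position) or None if the name does not match."""
    match = _MUTATION_NAME.match(name.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def can_output_genomes(mutation_names, mutbase):
    """Check the two conditions of the remark for one table.

    mutbase is the list of mutant bases, or None if the Mutbase line is absent.
    """
    if mutbase is None or len(mutbase) != len(mutation_names):
        return False
    if any(base not in {"A", "C", "G", "T"} for base in mutbase):
        return False
    return all(parse_mutation_name(n) is not None for n in mutation_names)


if __name__ == "__main__":
    print(parse_mutation_name("S1 145"))
    print(can_output_genomes(["S1 145", "S4 468"], ["A", "G"]))
    print(can_output_genomes(["S1 145", "mut7"], ["A", "G"]))
```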

Like the two pipelines, treeConstruct outputs the following files:

- The generated subtrees in Newick format (file suffix: subtrees.newick)
This file contains all the generated subtrees in the standard Newick format. The subtrees can then be
visualised using most tree visualisation software.

- The generated supertree in Newick format (file suffix: supertree.newick)
This file contains all the generated evolutionary trees in the standard Newick format. The trees
can then be visualised using most tree visualisation software.

- The distinct haplotypes and their rates (file suffix: hapdistinct.csv)
This file contains the distinct reconstructed haplotypes. For each haplotype, it also provides the rate
of the haplotype, its frequency of occurrence in the evolutionary trees. The file also gives, at its
beginning, the total number of generated evolutionary trees.

- The generated haplotype genomes (file suffix: hapgen.fasta)
This file provides the genome sequence of each haplotype (viral quasispecies) in FASTA format. The
identifier of the sequence i has the form ’>haplo ik rate i’ where rate i is its prevalence.
In the case of a multi-segment virus, the program generates one FASTA file per reconstructed
sequence, each having the name ’haplo i ratei hapgen.fasta’.

- The final list of mutations and their rates (file suffix: newmutants.csv)
This file contains the final list of mutations (SNVs) considered in the construction
of the evolutionary trees, with their estimated rates.
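
For a quick look at the Newick output without a tree viewer, the following Python sketch lists the leaf labels of each tree in a Newick text. It is a simplified reader, not a full Newick parser. It assumes that trees are terminated by semicolons and that labels are unquoted. The function name and the example trees are hypothetical, and nothing is assumed about how the package names its leaves.

```python
import re

# A leaf label directly follows '(' or ',' and is not itself an opening '('.
_LEAF = re.compile(r"[(,]\s*([^(),:;\s][^(),:;]*)")


def leaf_labels_per_tree(newick_text):
    """Return one list of leaf labels per semicolon-terminated tree."""
    trees = []
    for chunk in newick_text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        trees.append([label.strip() for label in _LEAF.findall(chunk)])
    return trees


if __name__ == "__main__":
    example = "((A:0.1,B:0.2):0.3,C:0.4);\n(A,(B,C));"
    for labels in leaf_labels_per_tree(example):
        print(labels)
```

To apply it to a file produced by treeConstruct, read the file contents into a string first and pass that string to leaf_labels_per_tree.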
