SVPhylA (Sequence Vectorization for Phylogenetics Analysis) version 1.0

Authors: Reinaldo Molina, Guillermin Aguero-Chapin and Aminael Sanchez-Rodriguez  


SVPhylA is a Python tool that calculates several alignment-free distances for phylogenetic analysis, drawing on the most popular alignment-free approaches. These methods encode DNA and protein sequences (FASTA files) into numerical vectors, from which alignment-free distances are calculated. The resulting distance matrices can be combined into a consensus/compromise matrix using algorithms such as DISTATIS (based on Multidimensional Scaling, MDS), linear Principal Component Analysis (PCA) and kernel PCA (nonlinear).
In addition, the derived genetic distances can be combined either with each other or with the alignment-free distances. SVPhylA currently contains a module that compares tree topologies using different distance measures as a validation procedure. Statistical validation (bootstrap and jackknife) of the alignment-free trees is under development.

SVPhylA is mostly designed to interact with MEGA (Molecular Evolutionary Genetics Analysis). Alignment-free distances calculated by SVPhylA, including the consensus/compromise matrices, can be imported into MEGA (.meg) to run distance-based phylogenetic methods. Conversely, several genetic distances calculated by MEGA can be combined in SVPhylA into a consensus genetic matrix that represents the best aggregate of the individual ones for phylogenetic purposes.

SVPhylA Procedure

Run: 1- cd to the main folder
     2- Launch the graphical interface: python3 SVPhylA.py

Installation prerequisites:
import sys
import csv
import re, os
import numpy as np
import FastaProc
import Data as dts
import distatis as dis
import TI2BioPp.TI2BioPmain
import scipy.cluster.hierarchy as sch
from pylab import savefig
from svphylaQt import *
from manifold_mds import Ui_smacof_mds
from sklearn import manifold
import multiprocessing
from utils import *
from dlgproperties import Ui_DialogProp
from sklearn.decomposition import PCA
import FuseMatPCA as FCA
import dendropy
from dendropy.calculate import treecompare
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

FastaProc.py needs these packages:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance
import scipy.cluster.hierarchy as sch
from pylab import savefig
from Bio import SeqIO
import Bio
from itertools import product
import warnings
import re
from  multiprocessing import Pool, Process
from propy import PseudoAAC as PAAC
from propy import Autocorrelation as AC
from propy import CTD as CTD
from propy import QuasiSequenceOrder as QSO
import pseknc_mod as pse
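
Before launching the interface it can help to check that the third-party packages in the two import lists are installed. The following is only a minimal sketch: it assumes that numpy, scipy, matplotlib (which provides pylab), sklearn, Bio, dendropy, rpy2 and propy are the external dependencies, and that the remaining names (FastaProc, Data, distatis, pseknc_mod and so on) are modules shipped inside the SVPhylA folder. The helper name missing_packages is hypothetical and not part of SVPhylA.

```python
from importlib.util import find_spec

# Assumed third-party import names taken from the lists above.
# Modules that appear to ship with SVPhylA itself are not checked.
THIRD_PARTY = ["numpy", "scipy", "matplotlib", "sklearn", "Bio",
               "dendropy", "rpy2", "propy"]


def missing_packages(names=THIRD_PARTY):
    """Return the top-level package names that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any name that is printed would need to be installed before running python3 SVPhylA.py.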


Module 1--To load TI2BioP (Topological Indices to BioPolymers) descriptors.  
Module 2--To load DNA, RNA and protein FASTA files. You can select the dataset type to vectorize and the sequence vectorization algorithm. For example, select "All Aminoacids" to apply all amino acid vectorization algorithms, or "All nucleotides" to do the same for nucleotide sequences. You can also select the type of distance matrix you want, or select all of them. The program then encodes the sequences and calculates the selected distance matrices using all the PC processors in parallel (be patient). The results are placed in the same folder as your dataset (FASTA files) using the prefix output_Distance_Type_Vectorization_Algorithm.meg. In addition, a log file (you can enter its name) is generated with all the vectors corresponding to the sequences and the selected method. Such vectors are useful for protein functional and structural classification.
To calculate contiguous K-mers, set the K-mer value and set the spaced K-mers value to zero. The recommended K-mer value according to Bolden (1998) is K=4 for proteins and K=10 for genomic sequences (benchmark datasets). To calculate spaced K-mers, select the pattern/number of contiguous K-mers and the number of spaces with the "Spaced" button. For example, Bolden concludes that for proteins the spaced K-mer length should be 7 or less, so you can set K-mers patterns=4 and Spaced=3, 2 or 1.
For spaced K-mers on genomic sequences, the recommended K value is 10 and the recommended spaced K-mer length is 11-12. All patterns of length 11 and most patterns of length 12 produced better results than the corresponding approach with the contiguous 10-mer (Bolden 1998).
To calculate the PseAAC (Pseudo Amino Acid Composition), the recommended value is lambda=8 according to Chou KC (2001) for the prediction of protein attributes.
To vectorize DNA sequences you can use K-mers, spaced K-mers and the Pseudo K-Nucleotide Composition (PseKNC) type I or type II and the properties of N-tuples of nucleotides.
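
As an illustration of the K-mer and spaced K-mer vectorization described above, here is a minimal sketch and not the FastaProc.py implementation. It assumes that a spaced K-mer is described by a 0/1 pattern string in which '1' marks a position that is kept and '0' marks a spacer, so that K-mers patterns=4 with Spaced=3 corresponds to a pattern such as "1101011". The function names and the toy sequences are hypothetical.

```python
from itertools import product

import numpy as np
from scipy.spatial.distance import pdist, squareform


def spaced_kmer_vector(seq, pattern, alphabet="ACGT"):
    """Relative frequencies of the words read through a 0/1 pattern.

    '1' keeps a position and '0' skips it, so a pattern made only of '1's
    yields ordinary contiguous K-mers.
    """
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    words = ("".join(p) for p in product(alphabet, repeat=len(keep)))
    index = {word: i for i, word in enumerate(words)}
    counts = np.zeros(len(index))
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        word = "".join(window[i] for i in keep)
        if word in index:  # words with symbols outside the alphabet are skipped
            counts[index[word]] += 1
    total = counts.sum()
    return counts / total if total else counts


def kmer_vector(seq, k, alphabet="ACGT"):
    """Contiguous K-mer frequencies (a spaced K-mer with no spacers)."""
    return spaced_kmer_vector(seq, "1" * k, alphabet)


def distance_matrix(vectors, metric="euclidean"):
    """Square matrix of pairwise distances between the sequence vectors."""
    return squareform(pdist(np.vstack(vectors), metric=metric))


if __name__ == "__main__":
    # Toy sequences, for illustration only.
    seqs = ["ACGTACGTAC", "ACGTTCGTAC", "GGGTACCCAA"]
    vectors = [spaced_kmer_vector(s, "1101") for s in seqs]
    print(distance_matrix(vectors))
```

The dense vector has one entry per possible word, so its length grows as the alphabet size raised to the number of kept positions (about one million entries for K=10 on DNA). Other metric names accepted by scipy's pdist, such as "cityblock" or "cosine", can be passed through the metric argument.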
Module 3--DISTATIS to combine different distance matrices using MDS. You can choose which distance matrices to combine. The output matrix (compromise matrix) can be saved in MEGA or CSV format.
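
To make the idea of a compromise matrix concrete, the usual DISTATIS steps can be sketched as follows. This is an illustration under stated assumptions, not the code in SVPhylA's distatis module: the input matrices are treated as squared distances, all sequences get equal mass, and the function name distatis_compromise is hypothetical.

```python
import numpy as np


def distatis_compromise(distance_matrices):
    """Combine square distance matrices into one compromise matrix.

    Returns (weights, compromise_cross_product, compromise_distance).
    """
    mats = [np.asarray(d, dtype=float) for d in distance_matrices]
    n = mats[0].shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n

    cross_products = []
    for d in mats:
        s = -0.5 * centering @ d @ centering  # double-centre the distances
        s = s / np.linalg.eigvalsh(s)[-1]     # scale by the largest eigenvalue
        cross_products.append(s)

    # RV-type similarity: cosine between the flattened cross-product matrices.
    flat = np.array([s.ravel() for s in cross_products])
    norms = np.linalg.norm(flat, axis=1)
    rv = (flat @ flat.T) / np.outer(norms, norms)

    # Weights come from the first eigenvector of the similarity matrix,
    # oriented to be positive and rescaled to sum to one.
    _, vectors = np.linalg.eigh(rv)
    first = vectors[:, -1]
    if first.sum() < 0:
        first = -first
    weights = first / first.sum()

    compromise = sum(w * s for w, s in zip(weights, cross_products))

    # Convert the compromise cross-product back to a distance-like matrix.
    diag = np.diag(compromise)
    distance = diag[:, None] + diag[None, :] - 2.0 * compromise
    return weights, compromise, distance
```

Matrices that agree with the majority receive larger weights, so an outlying distance measure contributes less to the compromise. The returned distance matrix is symmetric with a zero diagonal and could then be written out in whatever format the downstream tool expects.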
Module 4--PCA to combine different distance matrices using linear PCA and kernel PCA. You can choose which distance matrices to combine. The output matrix (compromise matrix) can be saved in MEGA or CSV format.
Module 5--To compare the topology of the obtained trees against a reference tree using several distance measures (Euclidean, Robinson-Foulds, etc.). The reference and subject trees must be provided in Newick format.

Source: SVPhyLA_README.txt, updated 2020-03-13