Name | Modified | Size | Downloads / Week |
---|---|---|---|
SVPhyLA_README.txt | 2020-03-13 | 5.6 kB | |
SVPhYlA_ver_1.0.tar.gz | 2020-03-13 | 39.2 MB | |
Totals: 2 Items | 39.2 MB | 0 |
SVPhylA (Sequence Vectorization for Phylogenetics Analysis) version 1.0

Authors: Reinaldo Molina, Guillermin Aguero-Chapin and Aminael Sanchez-Rodriguez

SVPhylA is a Python tool that calculates several alignment-free distances for phylogenetic analysis, drawn from the most popular alignment-free approaches. These methods encode DNA and protein sequences (FASTA files) into numerical vectors, from which alignment-free distances are calculated. The distances can then be combined into a consensus/compromise matrix using algorithms such as DISTATIS, which is based on Multidimensional Scaling (MDS), linear Principal Component Analysis (PCA) and kernel PCA (nonlinear). In addition, genetic distances can be combined either with each other or with the alignment-free distances.

So far, SVPhylA contains a module to compare tree topologies using different distance measures as a validation procedure. The statistical validation (bootstrap and jackknife) of the alignment-free trees is being developed.

SVPhylA is mostly designed to interact with MEGA (Molecular Evolutionary Genetics Analysis).
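The vectorization idea described in the introduction can be illustrated with a minimal sketch. It assumes contiguous K-mer relative frequencies as the encoding and the Euclidean distance as the alignment-free distance. The function names (kmer_vector, euclidean, distance_matrix) are hypothetical and are not taken from SVPhylA's code.

```python
from itertools import product
from math import sqrt

def kmer_vector(seq, k, alphabet="ACGT"):
    """Relative frequencies of all contiguous k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:          # skip words with symbols outside the alphabet (e.g. N)
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(seqs, k, alphabet="ACGT"):
    """Pairwise distances between the k-mer vectors of a {name: sequence} dict."""
    names = list(seqs)
    vecs = {n: kmer_vector(seqs[n], k, alphabet) for n in names}
    matrix = [[euclidean(vecs[a], vecs[b]) for b in names] for a in names]
    return names, matrix

names, matrix = distance_matrix(
    {"s1": "ACGTACGT", "s2": "ACGTACGT", "s3": "AAAAAAAA"}, k=2
)
```

For protein sequences the same sketch should work by passing the 20-letter amino acid alphabet, although the vector length grows as 20**k.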
Alignment-free distances calculated by SVPhylA, including the consensus/compromise matrices, can be imported into MEGA (.meg) to run distance-based phylogenetic methods. In the other direction, several genetic distances calculated by MEGA can be combined in SVPhylA to provide a consensus genetic matrix representing the best aggregate of the individual ones for phylogenetic purposes.

SVPhylA Procedure

Run:
1- CD to the main folder
2- Launch the graphical interface: python3 SVPhylA.py

Installation prerequisites:

import sys
import csv
import re, os
import numpy as np
import FastaProc
import Data as dts
import distatis as dis
import TI2BioPp.TI2BioPmain
import scipy.cluster.hierarchy as sch
from pylab import savefig
from svphylaQt import *
from manifold_mds import Ui_smacof_mds
from sklearn import manifold
import multiprocessing
from utils import *
from dlgproperties import Ui_DialogProp
from sklearn.decomposition import PCA
import FuseMatPCA as FCA
import dendropy
from dendropy.calculate import treecompare
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

FastaProc.py needs these packages:

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance
import scipy.cluster.hierarchy as sch
from pylab import savefig
from Bio import SeqIO
import Bio
from itertools import product
import warnings
import re
from multiprocessing import Pool, Process
from propy import PseudoAAC as PAAC
from propy import Autocorrelation as AC
from propy import CTD as CTD
from propy import QuasiSequenceOrder as QSO
import pseknc_mod as pse

Module 1--To load TI2BioP (Topological Indices to BioPolymers) descriptors.

Module 2--To load DNA, RNA and protein FASTA files. You can select the dataset type to vectorize and the sequence vectorization algorithm. For example, you can select "All Aminoacids" to apply all vectorization algorithms to protein sequences, or "All nucleotides" to do the same for nucleotide sequences.
At the same time you can select the type of distance matrix you want, or select all of them. The program then encodes the sequences and calculates the selected distance matrices, using all the PC processors in parallel (be patient).

The results are placed in the same folder as your dataset (FASTA files), with file names of the form output_Distance_Type_Vectorization_Algorithm.meg. In addition, a log file (you can enter its name) is generated with all the vectors corresponding to the sequences and the selected method. These vectors are useful for protein functional and structural classification.

To calculate contiguous K-mers, set the K-mer value and set the Spaced K-mers value to ZERO. The recommended K-mer value according to the paper of Bolden, 1998 is K=4 for proteins and K=10 for genomic sequences (benchmark datasets).

To calculate Spaced K-mers, select the pattern/number of contiguous K-mers and the number of spaces with the "Spaced" button. For example, the paper of Bolden concludes that for proteins the spaced K-mer length should be 7 or less. You can therefore set K-mers patterns=4 and Spaced=3, 2 or 1. For genomic sequences, the recommended K value is 10 and the length of the spaced K-mers is 11-12. All patterns of length 11 and most patterns of length 12 produced better results than the corresponding approach with the contiguous 10-mer (Bolden 1998).

To calculate the PseAAC (Pseudo AminoAcid Composition), the recommended value is lambda=8, following Chou KC (2001) for the prediction of protein attributes.

To vectorize DNA sequences you can use K-mers, Spaced K-mers, the Pseudo K-Nucleotide Composition (PseKNC) type I or type II, and the properties of N-tuples of nucleotides.

Module 3--DISTATIS to combine different distance matrices using MDS. You can choose which distance matrices to combine.
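As a rough illustration of how a DISTATIS-style compromise can be computed, the sketch below follows the commonly described steps: double-center each matrix into a cross-product matrix, scale it by its first eigenvalue, compute the RV coefficients between matrices, and use the first eigenvector of the RV matrix as weights. It assumes NumPy and assumes that the inputs are plain (non-squared) distance matrices that are squared before centering. The function names are hypothetical, and this is not the code of SVPhylA's distatis module.

```python
import numpy as np

def cross_product(D2):
    """Double-center a squared-distance matrix and scale by its largest eigenvalue."""
    n = D2.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)      # centering matrix
    S = -0.5 * J @ D2 @ J
    return S / np.linalg.eigvalsh(S)[-1]          # eigvalsh sorts eigenvalues ascending

def distatis_compromise(distance_matrices):
    """Return (weights, compromise cross-product matrix) for a list of distance matrices."""
    S = [cross_product(np.asarray(D, dtype=float) ** 2) for D in distance_matrices]
    m = len(S)
    # Inner products between matrices, then RV coefficients (a matrix cosine).
    C = np.array([[np.sum(S[i] * S[j]) for j in range(m)] for i in range(m)])
    norms = np.sqrt(np.diag(C))
    RV = C / np.outer(norms, norms)
    # The first eigenvector of RV, rescaled to sum to one, gives the weights.
    _, vecs = np.linalg.eigh(RV)
    v = vecs[:, -1]
    v = v if v.sum() >= 0 else -v
    weights = v / v.sum()
    compromise = sum(w * s for w, s in zip(weights, S))
    return weights, compromise

def to_distances(S):
    """Convert a cross-product matrix back to distances: d_ij^2 = s_ii + s_jj - 2 s_ij."""
    d = np.diag(S)
    squared = d[:, None] + d[None, :] - 2.0 * S
    return np.sqrt(np.clip(squared, 0.0, None))

D1 = np.array([[0.0, 1.0, 3.0], [1.0, 0.0, 2.0], [3.0, 2.0, 0.0]])
D2 = 2.0 * D1                                     # same configuration, different scale
weights, compromise = distatis_compromise([D1, D2])
```

Because each cross-product matrix is scaled by its first eigenvalue, two matrices that differ only by a constant factor receive equal weights in this sketch.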
The output matrix (compromise matrix) can be saved in MEGA or CSV format.

Module 4--PCA to combine different distance matrices using linear PCA and kernel PCA. You can choose which distance matrices to combine. The output matrix (compromise matrix) can be saved in MEGA or CSV format.

Module 5--To compare the topology of the obtained trees against a reference tree using several distance measures (Euclidean, Robinson-Foulds, etc.). The reference and subject trees must be supplied in Newick format.
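To make the Robinson-Foulds comparison in Module 5 more concrete, here is a minimal standard-library sketch. It assumes rooted trees and counts the clades found in one tree but not in the other. The usual unrooted definition compares bipartitions instead, so values may differ from those reported by SVPhylA, which lists dendropy's treecompare module among its imports. The function names are hypothetical, and the small parser handles only simple Newick strings without quoted labels.

```python
import re

def clades(newick):
    """Collect the clades (frozensets of leaf names) of a rooted Newick tree."""
    # Tokens: structural characters, branch lengths (":0.1"), and labels.
    tokens = re.findall(r"[(),;]|:[^(),;]+|[^(),;:\s]+", newick)
    found = set()
    stack = []                       # one set of leaf names per open parenthesis
    previous = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            leaves = frozenset(stack.pop())
            found.add(leaves)
            if stack:
                stack[-1].update(leaves)
        elif tok not in (",", ";") and not tok.startswith(":"):
            # A label right after ")" names an internal node (e.g. a support value).
            if stack and previous != ")":
                stack[-1].add(tok)
        previous = tok
    return found

def robinson_foulds(reference, subject):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(reference) ^ clades(subject))

same = robinson_foulds("((A,B),(C,D));", "((B,A),(D,C));")
different = robinson_foulds("((A,B),(C,D));", "((A,C),(B,D));")
```

Branch lengths and internal node labels are ignored, so only the topology contributes to the count.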