FFP 3.19 - Feature Frequency Profile Phylogenetics Package
Mar 01, 2012
Author: Gregory E. Sims
This is a collection of programs / utilities for implementing
the FFP (Feature Frequency Profile) method of phylogenetic
comparison. FFP is a class of alignment-free methods suitable
for (whole genome) comparisons from viral to mammalian scale
genomes.
This method has been used to perform various phylogenetic
analyses:
Sims GE and Kim SH (2011) Whole-genome phylogeny of Escherichia
coli/Shigella group by feature frequency profiles (FFPs). PNAS, 108,
8329-34.
Jun SR, Sims GE, Wu GA, Kim SH (2010) Whole-proteome phylogeny of
prokaryotes by feature frequency profiles: An alignment-free method
with optimal feature resolution. PNAS, 107, 133-8.
Sims GE, Jun SR, Wu GA, Kim SH (2009) Alignment-free genome comparison
with feature frequency profiles (FFP) and optimal resolutions.
PNAS, 106, 2677-82.
Sims GE, Jun SR, Wu GA, Kim SH (2009) Whole-genome phylogeny
of mammals: evolutionary information in genic and nongenic regions.
PNAS, 106, 17077-82.
The utilities are designed to be implemented using unix
command pipes. In other words the output of programs
can be linked to the input of other programs. Therefore
many of the scripts are acceptable as filters to be used
in intermediate steps.
This package contains the following programs/scripts:
ffpgui [Experimental] A perl/Tk based GUI
interface for performing some of the
basic FFP operations. This utility
doesn't support grid-based/
multiprocessor job flow. Also ffpgui is
in the beta stage, but it should give an
example of what can be done with the
utilities listed below. Note that ffpgui
currently works properly in Cygwin only
if you download and compile the v804.029
perl/Tk module from CPAN. Automatic installation
of perl/Tk by Cygwin setup.exe will not
provide proper functionality (it uses a
much older version of perl/Tk).
See INSTALL for further instructions.
Use --disable-gui to bypass installation.
ffpry Constructs an FFP profile from nucleic acid
sequences in FASTA format (.fna).
ffpaa Constructs an FFP profile from amino acid
sequences in FASTA format (.faa).
ffprwn This performs row normalization of the raw
FFP matrix.
ffpjsd This calculates the Jensen Shannon Divergence
between FFPs and outputs a Divergence
(Distance) matrix. A variety of other
distances/similarity metrics are available as
well.
ffpboot This performs bootstrapping or jackknifing
permutation of a raw FFP profile produced
by ffpry or ffpaa.
ffpvocab This utility counts the number of words which
are used more often than a particular threshold
in the FFP profile. This utility is used
to determine the best range of word
lengths to use for a genome collection.
ffpre This utility calculates the Relative entropy
between the expected and observed frequencies
of features of length l (specified on the
command line) using an L-2 Markov Model.
ffpvprof Script which calculates the word usage
for a range of l. Runs ffpvocab.
ffpreprof Script which calculates the Relative entropy
between observed and predicted frequencies for
a range of l. Runs ffpre.
ffpmerge This utility merges all rows of an FFP into
a single row. Use this for merging segments
of an FFP, for example different chromosomes
of a larger genome.
ffpcol This utility converts a FFP which has been
written out in key/value format to a columnar
format, so that each column corresponds to
the same feature in each row of the FFP.
ffptxt This utility creates a key/value FFP of text
data. This is useful for performing an FFP
analysis of human language texts. All non-
alphanumeric characters are ignored.
ffpfilt Eliminate high/low frequency features using
frequency cutoffs or probability based cutoffs
assuming a normal or extreme value distribution.
ffpcomplex Eliminate high/low complexity features using
a complexity cutoff or probability based cutoff
assuming a normal distribution.
ffpdf Finds clade distinguishing (diagnostic) features.
See Sims GE and Kim SH (2011) PNAS 108.
ffptree Build neighbor joining and UPGMA trees from ffpjsd
output.
OTHER REQUISITE PROGRAMS
We suggest that you obtain a copy of PHYLIP
(http://evolution.genetics.washington.edu/phylip.html)
for building trees, however you can use any tree building program
which will accept distance matrix input. The utility ffpjsd
will produce Phylip style 'infile's as well as raw distance
matrices. As of version 3.06, a tree building utility, ffptree,
is included, which will allow you to build Newick style tree output
directly as part of an ffp pipeline, and which is compatible with the
Phylip (3.69) utilities.
QUICK START
The best way to get a quick start is to read through the
simple tutorial PDF file located in the ./doc dir of this
distribution. It is also installed during the make install
process in the /usr/local/share/doc/ffp directory (unless
you have specified a different base directory using
./configure --prefix during the build process).
EXAMPLES
**** How do I perform an FFP comparison on a collection of nucleic
acid sequences, using a particular length of feature?
Assuming your nucleic acid .fna files are all in the current
working directory and are named with the .fna extension:
ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd > matrix
for just two files (test1.fna and test2.fna):
ffpry -l 5 test1.fna test2.fna | ffpcol | ffprwn | ffpjsd > matrix
The above example uses all features of length
5. The output of ffpry will be in key-value form,
i.e. pairs of feature sequence followed by the raw
count. Each row corresponds to the features from
one sequence file, in the order of input. The output of
ffpry is piped to ffpcol, which converts the key-value
form into a columnar form, so that each column of raw
counts corresponds to the same feature across rows. The
utility ffprwn normalizes each row of the FFP
feature matrix (output by ffpcol), so that each
element of that row is a relative frequency.
The output is then piped to ffpjsd, which calculates a
Jensen Shannon Divergence matrix.
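To make the last two steps concrete, the sketch below shows in plain awk what row normalization and a Jensen Shannon Divergence between two rows compute. It is an illustration only and not the ffprwn or ffpjsd implementation: the function names row_normalize and jsd_pair are made up here, a base-2 logarithm is assumed, and the input is assumed to be whitespace-separated columnar counts.

```shell
# Illustration only: NOT the ffprwn/ffpjsd implementation.

# row_normalize: divide every count in a row by that row's total,
# so each element becomes a relative frequency.
row_normalize() {
    awk '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        for (i = 1; i <= NF; i++)
            printf "%s%.6f", (i > 1 ? " " : ""), (total > 0 ? $i / total : 0)
        printf "\n"
    }'
}

# jsd_pair: Jensen Shannon Divergence (log base 2, an assumption)
# between the first two rows of relative frequencies on standard input.
jsd_pair() {
    awk '
    NR == 1 { for (i = 1; i <= NF; i++) p[i] = $i; n = NF }
    NR == 2 { for (i = 1; i <= NF; i++) q[i] = $i }
    END {
        d = 0
        for (i = 1; i <= n; i++) {
            m = (p[i] + q[i]) / 2
            if (p[i] > 0) d += 0.5 * p[i] * log(p[i] / m) / log(2)
            if (q[i] > 0) d += 0.5 * q[i] * log(q[i] / m) / log(2)
        }
        printf "%.6f\n", d
    }'
}

# Two toy rows of raw counts over the same four features.
printf '2 3 0 5\n1 1 4 4\n' | row_normalize | jsd_pair
```

With this definition two identical profiles give 0 and two profiles that share no features give 1.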
Alternatively you can save the output at each step in
intermediate files in the following form:
ffpry -l 5 *.fna > vectors
ffpcol vectors > vectors.col
ffprwn vectors.col > vectors.row
ffpjsd vectors.row > matrix
This may be useful if you want to perform multiple analyses
on some intermediate file (For example bootstrapping -- see
below).
**** How do I script and run commands?
All of these commands can be placed in a shell
script file. For example, in a file named
ffptest.sh:
#!/bin/sh
ffpry -l 5 *.fna > vectors
ffpcol vectors > vectors.col
ffprwn vectors.col > vectors.row
ffpjsd vectors.row > matrix
Save the script and make it executable using:
chmod +x ffptest.sh
then run the script from the command line
./ffptest.sh
**** How do I perform bootstrapping?
You can use the utility ffpboot to perform bootstrapping on
the output of ffpry or ffpaa.
ffpry -l 5 *.fna | ffpcol > vectors
ffpboot vectors | ffprwn | ffpjsd > matrix
**** How do I create multiple bootstrap sets?
The example below creates 100 bootstrap pseudoreplicate
JSD matrices.
ffpry -l 5 *.fna | ffpcol > vectors
for i in $(seq 1 1 100)
do
ffpboot vectors | ffprwn | ffpjsd > matrix.$i
done
**** How do I create phylip format infiles?
To create phylip format infiles to use with programs
such as NEIGHBOR, use the command ffpjsd -p [FILE], which
will generate a phylip format infile. FILE names a file
listing the taxa in the fna files you are using.
The taxa names in this file should be delimited by
whitespace or newlines.
i.e.
Taxa_1
Taxa_2
...
Taxa_N
Note the taxa should be in the order that they
are read into ffpry (use ls *.fna to get that
ordering).
For example:
ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd -p names.txt > infile
Or without the wildcards (*.fna)
ffpry -l 5 1.fna 2.fna 3.fna | ffpcol | ffprwn | ffpjsd -p names.txt > infile
The file names.txt should contain the taxa names of 1.fna, 2.fna and 3.fna
in that order.
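If your taxa names can simply be the file names without directory and extension (an assumption, you may prefer other names), the names file can be generated in the same order that the shell expands the wildcard. The helper name make_names is hypothetical and not part of this package; the demonstration uses empty toy files in a temporary directory.

```shell
# make_names FILE...: print one taxon name per line, in the order
# the files are given, using each file's base name without extension.
make_names() {
    for f in "$@"; do
        name=${f##*/}                # strip any leading directory
        printf '%s\n' "${name%.*}"   # strip the last extension
    done
}

# Demonstration with empty toy files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/ecoli.fna" "$demo_dir/shigella.fna" "$demo_dir/salmonella.fna"
make_names "$demo_dir"/*.fna > "$demo_dir/names.txt"
cat "$demo_dir/names.txt"
rm -rf "$demo_dir"
```

Because the same wildcard produces both the ffpry arguments and the names, the two orderings cannot drift apart.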
**** How do I build a neighbor joining tree?
In the spirit of the pipeline concept you can pipe output directly from
ffpjsd into ffptree, provided it is in phylip infile format. Continuing
the example from above:
ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd -p names.txt | ffptree -q > tree
This produces a tree in Newick format. If you want to see the
human readable tree too, remove the -q switch. Note that a lot of
output will be produced, but it is written to standard error, so you
should redirect standard error to a file to save it for later.
ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd -p names.txt | ffptree 2> progress > tree
**** How do I do bootstrapping for use with Phylip?
ffpry -l 5 *.fna | ffpcol > vectors
for i in $(seq 1 1 20)
do
ffpboot vectors | ffprwn | ffpjsd -p species.txt >> infile
done
This will create a multiple dataset file for use with phylip.
Use the 'multiple datasets' option in the neighbor program.
**** How do I perform FFP on large genomes?
The most effective way to do this is to calculate FFPs of
segments or units of the genome, for instance by chromosome
or by contig. The FFPs of the individual units can then be
merged together using the ffpmerge utility.
Say you have 10 chromosomes. Calculate the FFP of
each as a separate process and merge at the end of
calculations.
This is especially effective for multiprocessor machines.
For example:
for fna_file in *.fna
do
ffpry -l 10 "$fna_file" > "$fna_file.vector" &
done
# Wait for all background ffpry jobs to finish before merging.
wait
ffpmerge *.vector > merged.vector
If your cluster uses a queueing system (e.g. Grid
Engine) then you can create individual shell scripts to
give to the scheduler and then merge the unit vector files after
all scheduled jobs have completed. A simple example using
Grid Engine employs the $SGE_TASK_ID variable.
Save a file containing the paths to your sequences, one per
line (here there are 10 files in total):
$ cat > sequences.txt
seq.fna
seq2.fna
seq3.fna
....
Ctrl-D
$ cat > submit.sh
#!/bin/bash
#submit.sh
# Select the line of the list file ($1) that matches this task's ID.
FILE=`head -n $SGE_TASK_ID < "$1" | tail -n 1`
# Run in the foreground so the job does not exit before ffpry finishes.
ffpry -l 10 "$FILE" > "$FILE.vector"
Ctrl-D
In your shell:
chmod +x submit.sh
qsub -t 1-10 submit.sh sequences.txt
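The way each array task picks its own input file can be tried locally without a scheduler. The sketch below sets SGE_TASK_ID by hand and uses sed to print a single line, which gives the same result as the head/tail idiom in submit.sh. The helper name pick_line is hypothetical.

```shell
# pick_line N FILE: print only line N of FILE.
pick_line() {
    sed -n "${1}p" "$2"
}

# Simulate three array tasks locally with a toy list of file names.
list=$(mktemp)
printf 'seq.fna\nseq2.fna\nseq3.fna\n' > "$list"
for SGE_TASK_ID in 1 2 3; do
    echo "task $SGE_TASK_ID -> $(pick_line "$SGE_TASK_ID" "$list")"
done
rm -f "$list"
```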
**** What if the FFP I get from ffprwn is very large and
ffpjsd takes a long time?
In this case you may want to break up the calculation of
the JSD matrix, by assigning specific rows to different
CPUs using the -r option.
Starting with a normalized ffp with 10 rows:
ffpjsd -r 1 vector.row > row.1
The other 9 rows can be calculated by other CPUs, once
again using a cluster machine and the SGE_TASK_ID variable.
The results can be merged again using shell scripting:
for i in $(seq 1 1 10)
do
cat row.$i >> matrix
done
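When the rows are computed by separate jobs it is easy to merge before every job has finished, or to append to a matrix file left over from an earlier run. The following sketch, using a hypothetical helper named merge_rows, checks that every row.N file exists and is non-empty and starts from an empty output file. The demonstration uses toy row files, not real ffpjsd output.

```shell
# merge_rows N OUT: concatenate row.1 .. row.N (in the current
# directory) into OUT, in order. Fails if any row file is missing or
# empty, and truncates OUT first so stale content is never kept.
merge_rows() {
    n=$1
    out=$2
    i=1
    while [ "$i" -le "$n" ]; do
        if [ ! -s "row.$i" ]; then
            echo "merge_rows: row.$i is missing or empty" >&2
            return 1
        fi
        i=$((i + 1))
    done
    : > "$out"
    i=1
    while [ "$i" -le "$n" ]; do
        cat "row.$i" >> "$out"
        i=$((i + 1))
    done
}

# Demonstration with two toy row files in a temporary directory.
(
    cd "$(mktemp -d)" || exit 1
    printf '0.0 0.1\n' > row.1
    printf '0.1 0.0\n' > row.2
    merge_rows 2 matrix && cat matrix
)
```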
**** What is the difference between key/value FFPs and
columnar FFPs?
A key/value FFP is an FFP form which is generated by
default from the programs ffpry and ffpaa. For instance
the following command will generate a FFP of this form:
ffpry -l 5 test*.fna
The format will resemble:
RYRRR 2 RRRRY 3 RRRRR 0 ....
YRRRR 1 YYYRR 1 YRRRY 2 ...
...
Each row of the file is a FFP derived from a different
sequence file.
Columnar formats are required for input to ffprwn and
ffpboot.
In this format no feature keys are printed and each column
in the file holds the counts of one feature across
the sequence files. For very sparse FFPs the key/value
form can produce smaller files.
To convert a key/value FFP into a columnar format for input
to the other utilities, use the ffpcol utility
as a filter:
ffpry -l 5 test*.fna | ffpcol
The output from ffpcol can be used in ffprwn and ffpjsd:
ffpry -l 5 test*.fna | ffpcol | ffprwn | ffpjsd
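The relationship between the two formats can be illustrated with a small awk sketch. This is not how ffpcol is implemented, and the column order used here (features sorted alphabetically) is an assumption made for the example; ffpcol may order its columns differently. The function name kv_to_columns is hypothetical.

```shell
# kv_to_columns: read key/value rows ("FEATURE COUNT FEATURE COUNT ...")
# and print one row of counts per input row, with one column per
# feature seen anywhere in the input (0 where a row lacks the feature).
kv_to_columns() {
    awk '
    {
        for (i = 1; i < NF; i += 2) {
            count[NR, $i] = $(i + 1)
            seen[$i] = 1
        }
        rows = NR
    }
    END {
        # Collect the feature keys and sort them (insertion sort).
        n = 0
        for (k in seen) keys[++n] = k
        for (i = 2; i <= n; i++) {
            v = keys[i]
            for (j = i - 1; j >= 1 && keys[j] > v; j--) keys[j + 1] = keys[j]
            keys[j + 1] = v
        }
        for (r = 1; r <= rows; r++) {
            line = ""
            for (i = 1; i <= n; i++) {
                if ((r, keys[i]) in count) c = count[r, keys[i]]
                else c = 0
                line = line (i > 1 ? " " : "") c
            }
            print line
        }
    }'
}

# Two toy key/value rows; columns come out as RRRRY, RYRRR, YRRRR.
printf 'RYRRR 2 RRRRY 3\nYRRRR 1 RRRRY 4\n' | kv_to_columns
```

For the toy input the output rows are "3 2 0" and "4 0 1": every row now has the same number of columns, and a feature absent from a row is written as an explicit 0, which is why sparse profiles are smaller in key/value form.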
**** How do I use a full 4 letter Nucleotide or 20 letter
amino acid alphabet?
By default, character classing is used both in ffpry (RY
coding of nucleotides) and in ffpaa (amino acid classes).
To disable this classing, specify option -d for ffpry or
ffpaa. Please also take note that when you disable classing
with the -d option you may need to add the -d option to
subsequent filters such as ffpcol and ffpmerge.
For example:
ffpry -l 5 -d test*.fna | ffpcol -d | ffprwn | ffpjsd
For amino acids:
ffpaa -l 5 -d test*.faa | ffpcol -d -a | ffprwn | ffpjsd
**** How do I use a spaced seed hash with FFP?
FFP refers to spaced seeds as masks - from the
manner in which masks are used in computer programming
to 'mask out' certain bit positions in low level bit
manipulations of numbers stored in binary format.
A spaced seed or mask of '01110' will allow both
CAAAG and TAAAA to match each other, as well as to
match any 5 letter word with AAA in the middle.
A mask can be specified using -w, for example:
ffpaa -l 5 -d -w "01110" test*.faa | ffpcol -d -a | ffprwn | ffpjsd
If you don't want to explicitly supply a mask string
but want to allow a certain number of mismatches, use
-z. Here for example is how to create a random mask
with two mismatches allowed.
ffpaa -l 5 -d -z 2 test*.faa | ffpcol -d -a | ffprwn | ffpjsd
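The effect of a mask on individual words can be seen with a small sketch that is independent of the ffp utilities. Following the '01110' example above, it assumes that a 1 keeps the letter at that position and a 0 masks it out (shown here as '-'). The function name apply_mask is hypothetical.

```shell
# apply_mask MASK WORD: keep the letters of WORD where MASK has a 1
# and replace the letters where MASK has a 0 with '-'.
apply_mask() {
    awk -v mask="$1" -v word="$2" 'BEGIN {
        out = ""
        for (i = 1; i <= length(mask); i++)
            out = out (substr(mask, i, 1) == "1" ? substr(word, i, 1) : "-")
        print out
    }'
}

apply_mask 01110 CAAAG   # prints -AAA-
apply_mask 01110 TAAAA   # prints -AAA-
```

Both words reduce to the same masked form, so they would be counted as the same feature.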
**** How do I compare text files with FFP?
FFP has the ability to compare text files with the utility
ffptxt. The procedure is much the same as with nucleic acid
and amino acid sequences. Specify text with the -t option
when using ffpcol.
ffptxt -l 4 file*.txt | ffpcol -t | ffprwn | ffpjsd
**** How do I get help?
All the utilities come with their own manual page which is installed
by default when using 'make install'.
man ffpry
will retrieve the manual for the ffpry utility.
If you have installed ffp in some alternate location (e.g.
by using ./configure --prefix), then the location of the
ffp manuals may not be in your MANPATH environment variable.
You can add the manual directory to MANPATH, or simply
read the manuals directly using
man /pathtomanual/ffpry.1
(Note the manual section extension '1').
**** Example trees
Some published examples are shown in the distribution
'examples' directory.
**** How do I run FFP in Windows?
Use Cygwin (www.cygwin.com). This package has been developed
and tested to perform in both Linux and Cygwin.
Cygwin is designed as an emulated Linux/Unix
environment which runs on the Windows operating
system -- it includes the majority of the GNU compilers
and utilities that are part of a standard Linux
distribution and performs superbly (albeit with
a small performance loss because of emulation).
**** Multiple fastas in a single file
By default the ffpry and ffpaa programs assume that a single
file, regardless of the number of fasta records contained in
that file, represents a single species/genome/proteome. Therefore
the l-mer frequencies represent the counts for all fasta records.
If you specify multiple files on the command line i.e.
ffpry *.fna
which your shell might expand to:
ffpry test1.fna test2.fna test3.fna
then 3 separate ffp lines will be printed in the output. If in fact
you want all of these results to be merged together into one FFP you
can use the ffpmerge utility, or simply use the cat command, both
of which should produce equivalent results.
cat *.fna | ffpry
or
ffpry *.fna | ffpmerge --keys
If however you have one or more fasta files with multiple
records which you want to be individual FFPs, then you must specify
the -m option.
ffpry -m *.fna
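Before choosing between the default behaviour and -m, it can help to check how many fasta records each file holds. A minimal sketch using grep, assuming every record starts with a '>' header line (the helper name count_records is hypothetical):

```shell
# count_records FILE: number of fasta records, i.e. header lines
# beginning with '>'.
count_records() {
    grep -c '^>' "$1"
}

# Demonstration with a toy two-record file.
toy=$(mktemp)
printf '>rec1\nACGT\n>rec2\nGGCC\nTTAA\n' > "$toy"
echo "records: $(count_records "$toy")"
rm -f "$toy"
```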
**** How do I implement the alternate Hamming based distance referred to
as the Evolutionary FFP distance in Sims and Kim (2011), PNAS, 108
8329-34?
Trees presented in this paper are included in the examples subdirectory.
To implement this type of analysis on your own use the following
snippet of shell script code (assuming you are using the bash shell).
From your working directory containing all your genome fasta files
(assuming small single fasta genomes).
ffpry -l 20 *.fasta | ffpcol | ffpfilt -l 0.05 -u 0.95 -e > ffp.filtered
for i in {1..100} ; do
ffpboot -j -p 0.1 ffp.filtered | ffpjsd -H -p species.txt
done > infile
The features are filtered to remove high and low frequency features.
Then 100 pseudoreplicates are created using 10% jackknife sampling.
The output is a PHYLIP style infile which can be used directly as
input to the PHYLIP tool NEIGHBOR. The option argument to -p is a
tab or newline delimited file containing the names of the taxa in the
original order which was specified on the command line to the original
ffpry invocation. To confirm you have taxa named in the right order
in your 'species.txt' taxa name file, execute this shell expansion.
echo *.fasta
This will show you the order (which will be identical to the ordering
observed using ls *.fasta), in which you need to specify the names in
species.txt.
Rather than relying on the order from shell expansion you can specify
the genome file arguments explicitly
ffpry -l 20 genome1.fasta genome2.fasta genome3.fasta ...
See the examples subdirectory for more information, including sample
trees generated using FFP and Phylip.
**** How do I implement the block-FFP method mentioned in
Sims GE, Jun SR, Wu GA, Kim SH. (2009) Alignment-free genome comparison
with feature frequency profiles (FFP) and optimal resolutions.
PNAS, 106,2677-82.
There is currently no script included in this distribution which will
implement this method of genome comparison. Future releases will
contain executables which implement Block-FFP.
The main point to keep in mind is that FFP works best when you
are comparing genomes/sequences of similar length -- and a good
guideline is to make sure that your genomes are within A-fold
the size of each other where A is the number of symbols in your
alphabet.
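The size guideline can be checked before running a comparison. The sketch below measures sequence length as the number of non-header, non-whitespace characters in each fasta file and tests whether the longest is at most A times the shortest. The helper names seq_length and within_fold are hypothetical, and treating "within A-fold" as max <= A * min is an interpretation of the guideline above.

```shell
# seq_length FILE: number of sequence characters, ignoring header
# lines (those starting with '>') and whitespace.
seq_length() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# within_fold A FILE...: succeed if the longest sequence is at most
# A times the length of the shortest one.
within_fold() {
    a=$1
    shift
    min=
    max=
    for f in "$@"; do
        n=$(seq_length "$f")
        if [ -z "$min" ] || [ "$n" -lt "$min" ]; then min=$n; fi
        if [ -z "$max" ] || [ "$n" -gt "$max" ]; then max=$n; fi
    done
    [ "$min" -gt 0 ] && [ "$max" -le $((a * min)) ]
}

# Demonstration: toy genomes of 8 and 20 letters.
demo=$(mktemp -d)
printf '>a\nACGTACGT\n' > "$demo/a.fna"
printf '>b\nACGTACGTAC\nGTACGTACGT\n' > "$demo/b.fna"
if within_fold 4 "$demo"/*.fna; then
    echo "sizes are within 4-fold"
fi
if ! within_fold 2 "$demo"/*.fna; then
    echo "sizes differ by more than 2-fold"
fi
rm -rf "$demo"
```

Use A = 2 for the default RY alphabet and A = 4 when the full nucleotide alphabet is enabled with -d.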
********
Copyright (C) 2009-2012
Author: Gregory E. Sims
Report Bugs to gsims1997@yahoo.com