-------------------------------------------------------------------------------
,
Ferox 1.0 <'))><
`
- Sequence Alignment with Fuzzy k-mers -
-------------------------------------------------------------------------------
Ferox is a sequence aligner, developed by John Healy at the Galway-Mayo
Institute of Technology. The Ferox aligner uses one or more fuzzy seeds to
perform approximate string matching, allowing for both fast and sensitive
alignments.
-------------------------------------------------------------------------------
1. System Requirements & Memory
-------------------------------------------------------------------------------
As Ferox is written in Java, you will need a Java Virtual Machine that supports
release 1.6.0_37 of Java or later to run the software. If you already have Java
installed on your system, you can check compatibility by typing the following
at a command prompt:
java -version
Any version of Java >= 1.6.0_37 will suffice. If you don't have the required
version of Java installed, you can download it for free from the Oracle Java
portal at http://www.oracle.com/technetwork/java/index.html.
When aligning large FASTA files, you should increase the memory available to the
Java Virtual Machine (JVM) by specifying the maximum heap size with the -Xmx
argument. Give Ferox at least 1 GB of memory and, ideally, as much of the
available memory as you can. Please note that no special configuration is
required to run a JVM in 64-bit mode. A maximum heap size greater than 4.5 GB
is handled transparently by the Java environment.
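To confirm that an -Xmx setting has taken effect, you can ask the JVM for its
heap ceiling. The following is a minimal sketch; the class name HeapCheck is
hypothetical and is not part of Ferox. It relies only on the standard
Runtime.maxMemory() call.

```java
// Hypothetical helper, not part of Ferox: reports the heap ceiling of the running JVM.
final class HeapCheck {

    /** Returns the maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        // Runtime.maxMemory() reflects the -Xmx setting, or the JVM default if -Xmx is unset.
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Maximum heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it with the same -Xmx value you intend to give Ferox (for example,
java -Xmx1G HeapCheck) shows what the JVM actually granted. The reported figure
may be slightly lower than the requested value.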
-------------------------------------------------------------------------------
2. Running Ferox
-------------------------------------------------------------------------------
Ferox has been designed to be easy to install and run. After downloading the
software and inflating the Zip archive, you should have a "ferox" directory
with the following structure:
-ferox.jar
-conf/ferox.xml
Editing and configuring fuzzy seeds is done declaratively using the file
ferox.xml in the conf directory (see section 3 - Creating and Configuring
Fuzzy Seeds). At a minimum, all that is required to run the aligner is the
following:
java -cp ./ferox.jar ferox.Align <reference-file> <query-file> <output-file>
For example, assuming that ref.fa and query.fa are FASTA files containing
the reference and query sequences respectively, Ferox can align the query to
the reference and write the results using the output name "out" with the
following command:
java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out
This generates a file called out.bed. For compatibility with mummerplot, you
can output the result in .mums format by specifying the -mums switch:
java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out -mums
When aligning a whole genome, as opposed to a set of sequence reads, against a
reference genome, you should use the -genome switch. For example, the following
aligns the genome contained in query.fa against the reference ref.fa:
java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out -genome
If you only want to align the query sequence(s) in the forward orientation,
use the -forward switch as follows:
java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out -genome -forward
Please note that the default amount of heap space available to a JVM is very
limited, so use the -Xmx switch to increase it. For example, the following
command runs Ferox with 8 GB of memory available to the JVM:
java -Xmx8G -cp ./ferox.jar ferox.Align ref.fa query.fa out
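If you drive Ferox from another Java program, the command lines above can be
assembled programmatically. The sketch below is an illustration only: the
FeroxCommand class and its build method are hypothetical, and it uses only the
switches described in this section. Whether -mums may be combined with the
other switches is not stated here, so treat that combination as an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper, not shipped with Ferox: assembles the command lines shown above.
final class FeroxCommand {

    /** Builds the argument list for one Ferox run from the switches in this README. */
    static List<String> build(int heapGigabytes, String reference, String query,
                              String output, boolean genome, boolean forwardOnly,
                              boolean mums) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-Xmx" + heapGigabytes + "G");   // maximum heap size for the JVM
        cmd.add("-cp");
        cmd.add("./ferox.jar");
        cmd.add("ferox.Align");
        cmd.add(reference);
        cmd.add(query);
        cmd.add(output);
        if (genome) cmd.add("-genome");          // whole-genome alignment
        if (forwardOnly) cmd.add("-forward");    // forward orientation only
        if (mums) cmd.add("-mums");              // mummerplot-compatible output
        return cmd;
    }

    public static void main(String[] args) {
        // Prints the argument list; nothing is executed here.
        System.out.println(build(8, "ref.fa", "query.fa", "out", true, true, false));
    }
}
```

The returned list can be passed to the standard java.lang.ProcessBuilder
constructor to launch the aligner from your own code.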
-------------------------------------------------------------------------------
3. Creating and Configuring Fuzzy Seeds
-------------------------------------------------------------------------------
At the heart of Ferox is the concept of a fuzzy k-mer or fuzzy seed. Fuzzy
seeds are managed declaratively in the file conf/ferox.xml, which is parsed
when the aligner is initialised during start-up. The configuration file
contains a number of different parameters that can be customised:
k-mer-size:          The k-mer or word size to use. Larger values of k do not
                     necessarily increase alignment speed at the expense of
                     sensitivity, as both are controlled by the fuzzy seed.
                     Larger values of k will, however, slow the aligner down
                     slightly, as there are more characters to read into each
                     fuzzy k-mer. The default value of k is 24.
fuzzy-seed-class:    The type of fuzzy seed to use. In this initial release of
                     Ferox, this parameter must be set to DefaultFuzzySeed.
fuzzy-hashkey-class: The type of fuzzy hash key to use. This must be set to
                     DefaultFuzzyHashKey for this release.
The <fuzzy-aligner> element encapsulates the match criteria and fuzzy seeds to
use during alignment. Its minimum-alignment-match attribute specifies the
minimum percentage identity, that is, the percentage of characters in a query
sequence that must match the reference genome, and is used to filter out weak
matches. It is configured as a fuzzy value in the interval [0..1]. For
example, for two highly homologous sequences, setting this value to 0.95
requires 95% of the characters in a sequence read to match before the read is
considered a candidate alignment. For weak homology, set this value lower. Note
that the minimum-alignment-match attribute is ignored if the -genome switch is
used.
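To make the percentage identity filter concrete, here is a minimal sketch of
how such a threshold behaves. It is an illustration only: the IdentityFilter
class is hypothetical, it assumes gap-free sequences of equal length, and it
does not reproduce Ferox's internal calculation, which is not described here.

```java
// Illustrative only: Ferox's internal identity calculation is not documented in this README.
final class IdentityFilter {

    /** Fraction of positions at which two equal-length, gap-free sequences agree, in [0..1]. */
    static double identity(String read, String referenceWindow) {
        if (read.isEmpty() || read.length() != referenceWindow.length()) {
            throw new IllegalArgumentException("sequences must be non-empty and of equal length");
        }
        int matches = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) == referenceWindow.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / read.length();
    }

    /** True when the read reaches the given minimum identity (assumes an inclusive bound). */
    static boolean passes(String read, String referenceWindow, double minimumMatch) {
        return identity(read, referenceWindow) >= minimumMatch;
    }

    public static void main(String[] args) {
        String reference = "ACGTACGTACGTACGTACGT";
        String read      = "ACGTACGTACCTACGTACGT"; // one mismatch in 20 bases
        System.out.println(identity(read, reference));
        System.out.println(passes(read, reference, 0.95));
    }
}
```

With one mismatch in 20 bases the identity is 19/20, so this read is kept at a
threshold of 0.95 and would be rejected at any higher value.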
You can define your own fuzzy seeds in Ferox by creating a new <seed> element,
with its required child elements, for each seed you want to use. A single seed
of 13h+11f (13 hash positions followed by 11 wildcard positions) is used by
default. Seeds are declared as a sequence of hash and asterisk characters in a
<pattern> element, with hash characters denoting positions that require an
exact match and asterisks denoting wildcard positions. A 13h+11f seed can be
declared as follows:
<seed>
<pattern>#############***********</pattern>
<key-comparator>ferox.fuzzy.FuzzyDamerauLevenshtein</key-comparator>
<fuzzy-threshold>0.8</fuzzy-threshold>
</seed>
Seeds are not required to start with a hash character, as the following
11f+13h example shows:
<seed>
<pattern>***********#############</pattern>
<key-comparator>ferox.fuzzy.FuzzyDamerauLevenshtein</key-comparator>
<fuzzy-threshold>0.8</fuzzy-threshold>
</seed>
In addition, hash characters are not required to be consecutive, as the following
fuzzy seed declaration demonstrates:
<seed>
<pattern>#####*************######</pattern>
<key-comparator>ferox.fuzzy.FuzzyHammingDistance</key-comparator>
<fuzzy-threshold>0.5</fuzzy-threshold>
</seed>
In practice, a single fuzzy seed will suffice for most purposes, especially with
longer reads. While there is no limit on the number of seeds you can use, each
additional seed increases the memory overhead of the aligner.
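The shorthand used above (13h+11f, 11f+13h) appears to count runs of hash and
wildcard positions. As a sketch under that assumption, the following
hypothetical helper, which is not part of Ferox, converts a <pattern> string
to the shorthand and counts its hash positions.

```java
// Hypothetical helper, not part of Ferox: summarises a <pattern> string.
final class SeedPattern {

    /** Counts how many times the given symbol ('#' or '*') occurs in the pattern. */
    static int count(String pattern, char symbol) {
        int total = 0;
        for (int i = 0; i < pattern.length(); i++) {
            if (pattern.charAt(i) == symbol) {
                total++;
            }
        }
        return total;
    }

    /** Run-length summary with h for '#' and f for '*', e.g. "13h+11f". */
    static String shorthand(String pattern) {
        StringBuilder summary = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            if (c != '#' && c != '*') {
                throw new IllegalArgumentException("unexpected character '" + c + "' at position " + i);
            }
            int runEnd = i;
            while (runEnd < pattern.length() && pattern.charAt(runEnd) == c) {
                runEnd++; // extend the run of identical symbols
            }
            if (summary.length() > 0) {
                summary.append('+');
            }
            summary.append(runEnd - i).append(c == '#' ? 'h' : 'f');
            i = runEnd;
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        System.out.println(shorthand("#####*************######"));
        System.out.println(count("#####*************######", '#'));
    }
}
```

All three example patterns above are 24 characters long, which matches the
default k-mer size. Whether Ferox requires the pattern length to equal
k-mer-size is not stated in this document.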
Each <seed> element can be configured with its own approximate string matching
algorithm by specifying a class that implements KeyComparator in the
<key-comparator> element. The following fuzzy string matching algorithms are
available with Ferox:
ferox.fuzzy.FuzzyLevenshtein
ferox.fuzzy.FuzzyDamerauLevenshtein
ferox.fuzzy.FuzzyHammingDistance
ferox.fuzzy.FuzzySmithWaterman
The final component of the fuzzy seed that must be configured is the
beta-cutoff threshold. This is specified in the <fuzzy-threshold> element. The
beta-cutoff threshold is a floating point fuzzy value in the interval [0..1].
Please note that the alignment speed of Ferox is controlled primarily by the
number of hash positions in a seed. A minimum of 11 hashes should be used, as
values below this threshold give rise to an escalating number of collisions
in the underlying hash map used by Ferox. The sensitivity of Ferox is
controlled by the beta-cutoff threshold. Lower beta values increase alignment
sensitivity with little impact on running time. If beta is set too low,
however, the likelihood of detecting and reporting spurious matches
increases.
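The two tuning rules above can be sketched in a few lines. This is an
illustration only: the SeedTuning class is hypothetical, the scores are
made-up comparator outputs, and the assumption that a candidate is kept when
its score is greater than or equal to beta is not confirmed by this document.

```java
// Illustrative sketch; Ferox's real seed-matching code is not shown in this README.
final class SeedTuning {

    /** Minimum number of '#' positions recommended above. */
    static final int MIN_HASHES = 11;

    /** True if the pattern has at least the recommended number of exact-match positions. */
    static boolean hasEnoughHashes(String pattern) {
        int hashes = 0;
        for (int i = 0; i < pattern.length(); i++) {
            if (pattern.charAt(i) == '#') {
                hashes++;
            }
        }
        return hashes >= MIN_HASHES;
    }

    /** Counts the scores that survive the cutoff (assumes score >= beta is a hit). */
    static int countAccepted(float[] scores, float beta) {
        if (beta < 0f || beta > 1f) {
            throw new IllegalArgumentException("beta must lie in [0..1]");
        }
        int accepted = 0;
        for (float score : scores) {
            if (score >= beta) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        float[] scores = {0.95f, 0.85f, 0.70f, 0.55f, 0.30f}; // made-up comparator scores
        System.out.println("beta 0.8 keeps " + countAccepted(scores, 0.8f));
        System.out.println("beta 0.5 keeps " + countAccepted(scores, 0.5f));
        System.out.println(hasEnoughHashes("#####*************######"));
    }
}
```

Lowering beta from 0.8 to 0.5 doubles the number of candidates kept in this toy
example, which is the trade-off described above: more sensitivity, but more
weak candidates that may turn out to be spurious.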
-------------------------------------------------------------------------------
4. Extending Ferox
-------------------------------------------------------------------------------
You can extend Ferox to use any approximate string matching algorithm by
implementing the ferox.fuzzy.KeyComparator interface and placing the compiled
class file on your Java CLASSPATH. The one method in KeyComparator that you
must implement has the following signature:
public float fuzzyCompare(String s, String t);
Given a String s and a String t, compare the two strings using ANY string
matching algorithm. The only requirement is that the method returns a float
fuzzy value in the interval [0..1]. Once the class file has been added to your
CLASSPATH, you can configure a fuzzy seed with the custom string matching
algorithm by giving its fully qualified class name in the <key-comparator>
element in the file conf/ferox.xml. Your class does not have to be in the
"ferox" package.
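As an illustration of the method signature above, the following sketch scores
two strings by the length of their longest common subsequence divided by the
length of the longer string. The class name LcsRatioComparator is hypothetical.
So that the example compiles without ferox.jar, it declares a local stand-in
for the KeyComparator interface; a real extension would use
ferox.fuzzy.KeyComparator instead.

```java
// Stand-in for ferox.fuzzy.KeyComparator so that this sketch compiles on its own.
// In a real extension, remove this interface and import ferox.fuzzy.KeyComparator.
interface KeyComparator {
    float fuzzyCompare(String s, String t);
}

/** Hypothetical comparator: longest common subsequence length over the longer string's length. */
final class LcsRatioComparator implements KeyComparator {

    public float fuzzyCompare(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) {
            return 1.0f; // two empty strings are treated as identical
        }
        // Two-row dynamic programme: prev holds the finished row, curr the row being filled.
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                if (s.charAt(i - 1) == t.charAt(j - 1)) {
                    curr[j] = prev[j - 1] + 1;
                } else {
                    curr[j] = Math.max(prev[j], curr[j - 1]);
                }
            }
            int[] swap = prev; // reuse the old row as scratch space for the next pass
            prev = curr;
            curr = swap;
        }
        int lcs = prev[t.length()];
        // The LCS cannot exceed the shorter string, so the ratio stays within [0..1].
        return (float) lcs / Math.max(s.length(), t.length());
    }
}
```

For a real extension you would probably need to declare the class public, in
its own source file, so that Ferox can load it by name; that loading mechanism
is an assumption, as it is not described in this document. The class name would
then go in the <key-comparator> element of a seed.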
Enjoy!