-------------------------------------------------------------------------------
                                      ,          
                       Ferox 1.0   <'))><
                                      `
           - Sequence Alignment with Fuzzy k-mers - 
-------------------------------------------------------------------------------

Ferox is a sequence aligner, developed by John Healy at the Galway-Mayo
Institute of Technology. The Ferox aligner uses one or more fuzzy seeds to 
perform approximate string matching, allowing for both fast and sensitive
alignments.


-------------------------------------------------------------------------------
1. System Requirements & Memory
-------------------------------------------------------------------------------
As Ferox is written in Java, you will need a Java Virtual Machine that supports
Java 1.6.0_37 or later to run the software. If you already have Java installed
on your system, you can check which version you have by typing the following at
a command prompt:

		java -version
		
Any version of Java >= 1.6.0_37 will suffice. If you don't have the required
version of Java installed, you can download it for free from the Oracle Java
portal at http://www.oracle.com/technetwork/java/index.html.		

When aligning large FASTA files, you should increase the memory available to the
Java Virtual Machine (JVM) by specifying the maximum heap size with the -Xmx
argument. It is recommended to give Ferox at least 1 GB of memory and, ideally,
all the memory at your disposal. Please note that no special configuration is
required to run a JVM in 64-bit mode. Specifying a maximum heap size greater
than 4.5 GB will be handled transparently by the Java environment.
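
To confirm that a -Xmx setting has taken effect before starting a long
alignment, you can ask the JVM for its maximum heap size. The following is a
minimal sketch that uses only the standard Runtime API. The class name HeapCheck
is hypothetical and is not part of Ferox.

```java
// Hypothetical helper (not part of Ferox): reports the maximum heap size
// that the running JVM will attempt to use.
class HeapCheck {

    /** Maximum heap size in megabytes, as reported by the JVM. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Compile it with javac and run it with the same heap argument you intend to give
Ferox, for example "java -Xmx1G HeapCheck". The figure reported may differ
slightly from the value requested, depending on how the JVM lays out its heap.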


-------------------------------------------------------------------------------
2. Running Ferox
-------------------------------------------------------------------------------
Ferox has been designed to be easy to install and run. After downloading the
software and inflating the Zip archive, you should have a "ferox" directory 
with the following structure:

-ferox.jar
-conf/ferox.xml

Editing and configuring fuzzy seeds is done declaratively using the file 
ferox.xml in the conf directory (see section 3 - Creating and Configuring 
Fuzzy Seeds). At a minimum, all that is required to run the aligner is the 
following:

   java -cp ./ferox.jar ferox.Align <reference-file> <query-file> <output-file>


For example, assuming that ref.fa and query.fa are FASTA files containing
the reference and query sequences respectively, Ferox can align the query to
the reference and write the results using the output name "out" with the
following command:

	java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out


This generates a file called out.bed. For compatibility with mummerplot, you
can output the results in .mums format by specifying the -mums switch:

	java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out -mums


When aligning a whole genome, as opposed to a set of sequence reads, against a 
reference genome, you should use the -genome switch. For example, the following
aligns the genome contained in query.fa against the reference ref.fa:

	java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out -genome



If you only want to align the query sequence(s) in the forward orientation, 
use the -forward switch as follows:

	java -Xmx1G -cp ./ferox.jar ferox.Align ref.fa query.fa out -genome -forward
	
	
Please note that the default amount of heap space available to a JVM is very
limited. Use the -Xmx switch to increase it. For example, the following command
runs Ferox with 8 GB of memory available to the JVM:

	java -Xmx8G -cp ./ferox.jar ferox.Align ref.fa query.fa out


	

-------------------------------------------------------------------------------
3. Creating and Configuring Fuzzy Seeds
-------------------------------------------------------------------------------
At the heart of Ferox is the concept of a fuzzy k-mer, or fuzzy seed. Fuzzy
seeds are managed declaratively in the file conf/ferox.xml, which is read and
parsed when the aligner starts up. The configuration file contains a number of
parameters that can be customised:

	k-mer-size: The k-mer or word size to use. A larger value of k does not
	            necessarily increase alignment speed at the expense of
	            sensitivity, as both are controlled by the fuzzy seed. A
	            larger k will slow the aligner down slightly, as there are
	            more characters to read into each fuzzy k-mer. The default
	            value of k is 24.
	            
	fuzzy-seed-class: The type of fuzzy seed to use. In this initial release
	            of Ferox, this parameter must be set to DefaultFuzzySeed.

	fuzzy-hashkey-class: The type of fuzzy hash key to use. This must be set to
	            DefaultFuzzyHashKey for this release.


The <fuzzy-aligner> element encapsulates the match criteria and fuzzy seeds to
use during alignment. The minimum-alignment-match attribute specifies the
minimum percentage identity required of an alignment and is used to filter out
weak matches. The percentage identity is the percentage of characters in a
query sequence that match the reference genome. It is configured as a fuzzy
value in the interval [0..1]. For example, for two highly homologous sequences,
setting this value to 0.95 requires 95% of the characters in a sequence read to
match before the read is considered a candidate alignment. For weak homology,
set this value to a lower level. Note that the value set by the
minimum-alignment-match attribute is ignored if the -genome switch is used.
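
As an illustration of the filter described above, the following sketch computes
a position-by-position identity between a read and an equal-length window of
the reference and compares it with a minimum-alignment-match value. This is a
simplification for illustration only: the class IdentityFilter and its methods
are hypothetical, and the way Ferox itself computes percentage identity is not
documented here and may differ.

```java
// Hypothetical illustration (not Ferox code): position-wise identity between
// a read and a reference window of the same length.
class IdentityFilter {

    /** Fraction of positions at which the two strings agree, in [0..1]. */
    static float identity(String read, String window) {
        if (read.isEmpty() || read.length() != window.length()) {
            throw new IllegalArgumentException(
                "read and window must be non-empty and of equal length");
        }
        int matches = 0;
        for (int i = 0; i < read.length(); i++) {
            if (read.charAt(i) == window.charAt(i)) {
                matches++;
            }
        }
        return (float) matches / read.length();
    }

    /** True if the identity reaches the configured minimum. */
    static boolean passes(String read, String window, float minimumAlignmentMatch) {
        return identity(read, window) >= minimumAlignmentMatch;
    }

    public static void main(String[] args) {
        String window = "ACGTACGTACGTACGTACGT"; // 20 reference characters
        String read   = "ACGTACGTACGTACGTACGA"; // differs in the last position
        System.out.println("identity = " + identity(read, window));
        System.out.println("passes 0.95? " + passes(read, window, 0.95f));
    }
}
```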

You can define your own fuzzy seeds in Ferox by creating a new <seed> element,
with its required child elements, for each seed you want to use. A single seed
of 13h+11f (13 hash positions followed by 11 fuzzy positions) is used by
default. Seeds are declared as a sequence of hash and asterisk characters in a
<pattern> element, with hash characters denoting positions that require an
exact match and asterisks denoting wildcard positions.

<seed>
	<pattern>#############***********</pattern>
	<key-comparator>ferox.fuzzy.FuzzyDamerauLevenshtein</key-comparator>
	<fuzzy-threshold>0.8</fuzzy-threshold>
</seed>


Seeds are not required to start with a hash character, as the following
11f+13h example shows:

<seed>
	<pattern>***********#############</pattern>
	<key-comparator>ferox.fuzzy.FuzzyDamerauLevenshtein</key-comparator>
	<fuzzy-threshold>0.8</fuzzy-threshold>
</seed>


In addition, hash characters are not required to be consecutive, as the following
fuzzy seed declaration demonstrates:

<seed>
	<pattern>#####*************######</pattern>
	<key-comparator>ferox.fuzzy.FuzzyHammingDistance</key-comparator>
	<fuzzy-threshold>0.5</fuzzy-threshold>
</seed>
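
The shorthand used in this section (13h+11f, 11f+13h) can be derived from a
pattern by counting runs of identical characters. The following sketch shows
one way to do this. SeedNotation is a hypothetical helper, not a Ferox class,
and extending the shorthand to patterns with more than two runs is an
assumption.

```java
// Hypothetical helper (not part of Ferox): converts a seed pattern into the
// run-length shorthand used in this README, where 'h' marks a run of hash
// (exact-match) positions and 'f' a run of asterisk (fuzzy) positions.
class SeedNotation {

    static String describe(String pattern) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            char c = pattern.charAt(i);
            if (c != '#' && c != '*') {
                throw new IllegalArgumentException(
                    "Unexpected character '" + c + "' at position " + i);
            }
            // Count how many times the current character repeats.
            int run = 0;
            while (i < pattern.length() && pattern.charAt(i) == c) {
                run++;
                i++;
            }
            if (out.length() > 0) {
                out.append('+');
            }
            out.append(run).append(c == '#' ? 'h' : 'f');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Patterns copied from the <seed> examples in this section.
        System.out.println(describe("#############***********"));
        System.out.println(describe("***********#############"));
        System.out.println(describe("#####*************######"));
    }
}
```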


In practice, a single fuzzy seed will suffice for most purposes, especially with
longer reads. While there is no limit on the number of seeds you can use, adding
multiple seeds will increase the memory overhead of the aligner.

Each <seed> element can be configured with its own approximate string matching
algorithm by naming a class that implements the KeyComparator interface in the
<key-comparator> element. The following fuzzy string matching algorithms are
available with Ferox:

		ferox.fuzzy.FuzzyLevenshtein
		ferox.fuzzy.FuzzyDamerauLevenshtein
		ferox.fuzzy.FuzzyHammingDistance
		ferox.fuzzy.FuzzySmithWaterman

The final component of the fuzzy seed that must be configured is the
beta-cutoff threshold. This is specified in the <fuzzy-threshold> element. The
beta-cutoff threshold is a floating point fuzzy value in the interval [0..1].

Please note that the alignment speed of Ferox is controlled primarily by the
number of hash positions in a seed. A minimum of 11 hashes should be used, as
values below this threshold give rise to an escalating number of collisions
in the underlying hash map used by Ferox. The sensitivity of Ferox is
controlled by the beta-cutoff threshold. Lower beta values increase alignment
sensitivity with little impact on running time. If beta is set too low,
however, the likelihood of detecting and reporting spurious matches increases.
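
The two guidelines above (at least 11 hash positions, and a beta-cutoff
threshold in [0..1]) can be checked before a seed is added to conf/ferox.xml.
The following is a minimal sketch of such a check. SeedCheck is a hypothetical
helper and not part of Ferox, and Ferox may apply its own validation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Ferox): checks a seed pattern and its
// beta-cutoff threshold against the guidelines given in this README.
class SeedCheck {

    // Minimum number of hash positions recommended in this README.
    static final int MIN_HASHES = 11;

    /** Number of '#' (exact-match) positions in the pattern. */
    static int countHashes(String pattern) {
        int hashes = 0;
        for (int i = 0; i < pattern.length(); i++) {
            if (pattern.charAt(i) == '#') {
                hashes++;
            }
        }
        return hashes;
    }

    /** Returns a list of problems found; an empty list means none. */
    static List<String> problems(String pattern, float beta) {
        List<String> found = new ArrayList<String>();
        int hashes = countHashes(pattern);
        if (hashes < MIN_HASHES) {
            found.add("only " + hashes + " hash positions; at least "
                + MIN_HASHES + " are recommended");
        }
        // Written this way so that NaN is also rejected.
        if (!(beta >= 0.0f && beta <= 1.0f)) {
            found.add("fuzzy-threshold " + beta + " is outside [0..1]");
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(problems("#####*************######", 0.5f));
        System.out.println(problems("#####*****", 1.5f));
    }
}
```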


-------------------------------------------------------------------------------
4. Extending Ferox
-------------------------------------------------------------------------------
You can extend Ferox to use any approximate string matching algorithm by
implementing the ferox.fuzzy.KeyComparator interface and making the compiled
class file available on your Java CLASSPATH. The method in KeyComparator that
you must implement has the following signature:

	public float fuzzyCompare(String s, String t);

Given a String s and a String t, compute the alignment of both strings using any
string matching algorithm. The only caveat is that the method must return a
float with a fuzzy value in the interval [0..1]. Once the class file has been
added to your CLASSPATH environment variable, you can configure a fuzzy seed
with the custom string matching algorithm using the <key-comparator> element in
the file conf/ferox.xml. You are not limited to using the "ferox" package.
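
As a minimal sketch of such an extension, the following comparator scores two
strings by the fraction of positions at which they agree. So that the example
compiles on its own, it declares a local stand-in for the KeyComparator
interface with the signature shown above. The class name
PositionalMatchComparator is hypothetical, and the convention that a higher
value means a closer match is an assumption.

```java
// Stand-in for ferox.fuzzy.KeyComparator so that this sketch compiles on its
// own. With ferox.jar on the classpath, use the real interface instead.
interface KeyComparator {
    float fuzzyCompare(String s, String t);
}

// Hypothetical comparator: fraction of positions at which s and t agree,
// measured against the longer string. Assumes higher means more similar.
class PositionalMatchComparator implements KeyComparator {

    public float fuzzyCompare(String s, String t) {
        int longest = Math.max(s.length(), t.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are treated as identical
        }
        int shortest = Math.min(s.length(), t.length());
        int matches = 0;
        for (int i = 0; i < shortest; i++) {
            if (s.charAt(i) == t.charAt(i)) {
                matches++;
            }
        }
        // Dividing by the longer length keeps the result in [0..1] and
        // counts unpaired trailing characters as mismatches.
        return (float) matches / longest;
    }

    public static void main(String[] args) {
        KeyComparator comparator = new PositionalMatchComparator();
        System.out.println(comparator.fuzzyCompare("ACGT", "ACGA"));
    }
}
```

To use a class like this with Ferox, you would remove the stand-in interface,
implement ferox.fuzzy.KeyComparator instead, and name the class in the
<key-comparator> element of a seed. The class would probably also need to be
declared public so that Ferox can load it.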

Enjoy!
Source: README, updated 2012-12-05