Read Me
***********
*COMPILING*
***********
# gcc
make
# intel
make CC=intel
# gcc, 32-bit build on a 64-bit system (a 32-bit system builds 32-bit by default)
make CC=gcc32
************
*PARAMETERS*
************
HAXAT takes the following parameters (run with --help to print this help):
Usage: haxat [OPTIONS] DATABASE_FILE SEARCH_FILE
Searches a protein FASTA database file for any significant matches, given
the SEARCH_FILE which can be in SFF format or (F)FASTA format.
Example: haxat prot.fa query.fa
General options:
-l Maximum query length (default=4000)
-B Query both strands (default=false)
-F Assume 454 data input (default=true)
Alignment options:
-m Substitution matrix (default=BLOSUM62)
-g Cost to open a gap (nucleotide) (default=8)
-e Cost to extend a gap (nucleotide) (default=2)
-G Cost to open a gap (protein) (default=8)
-E Cost to extend a gap (protein) (default=2)
-S Single gap frame-shift penalty (default=25)
-D Double gap frame-shift penalty (default=45)
-k Flowpeak correction range (default=0.50)
-h Homopolymer penalty drop-off fraction (default=0.10)
-V Do peak insertion validation (default=true)
Result options:
-v Number of results to show for each query (default=50)
-s Minimum score to write (default=15)
-W Write queries without results in output (default=true)
-P Print-width (default=80, <=0 to print on one row)
File name options:
-O Output file (default=stdout)
--Sh
Homopolymer single gap frame-shift penalty (by default derived from
-h; setting this overrides the value derived from -h)
--Dh
Homopolymer double gap frame-shift penalty (by default derived from
-h; setting this overrides the value derived from -h)
--help
Print help
--version
Print current version
--print-gap-penalties
Print all parameter and homopolymer specific gap penalties employed
--print-verbose-gap-penalties
Print all insert/delete penalties for all flow-peak values
--dump-smatrix
Prints the values stored in the currently loaded substitution matrix
--force-parameters
Skip parameter evaluation
--quiet
Do not write progress to stderr
Substitution matrices:
The -m flag takes either a file parameter or a reference to one of
the built-in substitution matrices. Built-in matrices are:
BLOSUM30, BLOSUM40, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90,
GONNET, PAM30, PAM70 and PAM250
***************
*INPUT FORMATS*
***************
*SFF file format*
The SFF file format is a Roche 454 format that HAXAT can read as query input (SEARCH_FILE in the usage line above). It is documented in the GS FLX documentation, pages 445-448.
*FASTA format*
The FASTA format is the widely used format for both nucleotide and protein data, further described at http://en.wikipedia.org/wiki/FASTA. FASTA is currently the only accepted format for DATABASE_FILE input to HAXAT, and it is also accepted as query input (SEARCH_FILE).
*FFASTA format*
FFASTA is a non-standard text format for expressing flowgrams. It is FASTA-like: each record starts with a header line marked by a leading “>” sign, followed by data. In FFASTA the data consists of flowpeak values separated by spaces. FFASTA expects the flow order “TACG”. Any line starting with “#” is ignored. As an example, an FFASTA file could look like this:
>flowgram-1
1.57 0.11 1.69 0.12 0.08 3.62 0.07 …
>flowgram-2
0.08 0.15 0.98 1.89 0.06 5.13 0.13 …
…
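The FFASTA rules above (header lines starting with “>”, space-separated flowpeak values, “#” lines ignored, flow order “TACG”) can be sketched in a few lines of Python. This is not HAXAT code. The function names `parse_ffasta` and `naive_basecall` are hypothetical, and two behaviours are assumptions rather than documented facts: that blank lines are skipped, and that the values of one record may continue over several lines as in FASTA. Rounding each flowpeak to the nearest integer is only a naive way to show what a flowgram encodes; it is not the correction HAXAT performs.

```python
from typing import Dict, Iterable, List

FLOW_ORDER = "TACG"  # flow order stated in this README


def parse_ffasta(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Map each header (without the leading '>') to its flowpeak values."""
    records: Dict[str, List[float]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            # '#' lines are ignored per the format; skipping blank
            # lines is an assumption of this sketch.
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            records[current] = []
        elif current is not None:
            # Assumption: values may continue over several lines.
            records[current].extend(float(tok) for tok in line.split())
    return records


def naive_basecall(flows: List[float]) -> str:
    """Round each flowpeak to a homopolymer length (illustration only)."""
    bases = []
    for i, value in enumerate(flows):
        length = int(value + 0.5)  # round half up
        bases.append(FLOW_ORDER[i % len(FLOW_ORDER)] * length)
    return "".join(bases)


# Only the first seven flows of flowgram-1; the rest is elided in the README.
example = """\
# shortened example record
>flowgram-1
1.57 0.11 1.69 0.12 0.08 3.62 0.07
"""
records = parse_ffasta(example.splitlines())
print(naive_basecall(records["flowgram-1"]))  # prints TTCCAAAA
```

Reading the flows cyclically as T, A, C, G, the seven values round to 2, 0, 2, 0, 0, 4 and 0 bases, which gives TT, CC and AAAA.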
****************
*OUTPUT FORMATS*
****************
HAXAT produces output in a new alignment format that incorporates protein-nucleotide alignments as well as annotations for frame-shifting positions. Apart from the usual score, identities, positives, coverage and gaps, the output also reports the number of frame-shifting gaps (nucleotide-space gaps, nGaps).
Score = 593
Identities = 117/157 (75%), Positives = 135/157 (86%), Coverage = 473/478 (99%)
Gaps = 2, nGaps = 5
Strand = Plus / Plus
Query: 2 AGACGACGCAGATTTGGCCGCCGCCGCCGATACTACCGAAAGAGGCGAGGGGGATGGAGACGTCGATATGGGAGGCGT 79
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||+++ |||
Sbjct: 4 ArgArgArgArgPheGlyArgArgArgArgTyrTyrArgLysArgArgGlyGlyTrpArgArgArgPheArgIleArg 29
Query: 80 TGGAGACGGACTGCCTGGAGACGTCGGCGGGTAAGGAGATGGCGGCGTTCCGTCTTCCGTAGAGGGGGACGTAGAGCG 157
|||||| ||||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 30 ---ArgArgArgProTrpArgArgTrpArgValArgArgTrpArgArgSerValPheArgArgGlyGlyArgArgAla 54
Query: 158 CGCCCCTACCGCATTTCAGCTTGGAACCCAAGAGTAGTGAGGAGAGTAGTAATTAAGGGGTGGTGGCcACTGATCCAA 234
||||||||||||||||||||||||||||||+++|||+++||| ||| ||||||||||-|+++||||||
Sbjct: 55 ArgProTyrArgIleSerAlaTrpAsnProLysValLeuArgAsnCysArgIleThrGlyTrpTrpProValIleGln 80
Query: 235 TGCATGGAGGGGTATGAGTACCTGAGATACAGACCGATGTACAAT---ATTGAGAAAAGATGGATATATAAAAAACAG 309
||||||+++||| |||+++++++++|||+++|||||| +++||| ||||||+++ ||||||
Sbjct: 81 CysMetAspGlyMetGluTrpIleLysTyrLysProMetAspLeuArgValGluAlaAsnTrpIlePheAsnLysGln 106
Query: 310 AGCAGTAGAGTGTTCAGTGAGGACATGGGCTACCTGATGCAGgTACGGTGGAGGATGGGCCTCAGgAACAATTTCCTT 386
|||++++++ +++||| ||||||||||||||||||-|||||||||||||||+++||||-| ||||||||
Sbjct: 107 AspSerLysIleGluThrGluGlnMetGlyTyrLeuMetGln-TyrGlyGlyGlyTrpSerSerGlyValIleSerLe 132
Query: 387 AAGGGgTTTATTCAATGAGCACGAACTGTGGAGAAATGTATGGTCCAGGTCTAACGACGGAATGGACCTAGCCcAGAT 463
| |-|||||||||||||+++ ||||||||||||+++||||||+++||||||||||||||||||||| -||||
Sbjct: 132 uGluGlyLeuPheAsnGluAsnArgLeuTrpArgAsnIleTrpSerLysSerAsnAspGlyMetAspLeuVal-ArgT 158
Query: 464 ACTTCGGGTGC 474
|||||||||||
Sbjct: 158 yrPheGlyCys 161
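The summary lines that precede each alignment (Score, Identities, Positives, Coverage, Gaps, nGaps, Strand) are regular enough to be read with a few regular expressions. The Python sketch below is not part of HAXAT. The function name `parse_hit_summary` and the dictionary keys are hypothetical, and the sketch assumes the layout is exactly the one shown in the example above, with integer scores.

```python
import re


def parse_hit_summary(text):
    """Extract the statistics from the summary lines of one alignment."""
    summary = {}
    # Plain integer fields; \b keeps "Gaps" from matching inside "nGaps".
    for key, label in (("score", "Score"), ("gaps", r"\bGaps"), ("ngaps", "nGaps")):
        match = re.search(label + r"\s*=\s*(\d+)", text)
        if match:
            summary[key] = int(match.group(1))
    # Fraction fields such as "Identities = 117/157 (75%)".
    for key, label in (("identities", "Identities"),
                       ("positives", "Positives"),
                       ("coverage", "Coverage")):
        match = re.search(label + r"\s*=\s*(\d+)/(\d+)", text)
        if match:
            summary[key] = (int(match.group(1)), int(match.group(2)))
    # Strand is reported as a pair, e.g. "Plus / Plus".
    match = re.search(r"Strand\s*=\s*(\w+)\s*/\s*(\w+)", text)
    if match:
        summary["strand"] = (match.group(1), match.group(2))
    return summary


# Summary lines copied from the example output above.
header = """\
Score = 593
Identities = 117/157 (75%), Positives = 135/157 (86%), Coverage = 473/478 (99%)
Gaps = 2, nGaps = 5
Strand = Plus / Plus
"""
summary = parse_hit_summary(header)
print(summary["score"], summary["identities"], summary["ngaps"])
```

Fields that are missing from the text are simply left out of the dictionary, so a caller can tell them apart from a reported value of zero.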
HAXAT can also produce FASTA output containing corrected query sequences, based on the top hit. The output can be either the part covered by the alignment (output format 2) or the complete corrected sequence (output format 3), in which only the part covered by the alignment has been corrected.
***************************
*HAXAT w. BLAST heuristics*
***************************
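# 1. run blastx and write tabular output with comment lines
blastx -db $BLASTDB -query $Q -num_alignments 100 -outfmt 7 -show_gis -seg no > $Q.bres
# 2. collect the subject IDs (second column), skipping the comment lines
grep '^#' -v $Q.bres | awk '{print $2}' > $Q.hits
# 3. extract the hit sequences into a reduced FASTA database
blastdbcmd -db $BLASTDB -entry_batch $Q.hits > $Q.db
# 4. run HAXAT against the reduced database
haxat $Q.db $Q $HAXAT_PARAMETERS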
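The grep/awk step of the pipeline above can also be written in Python, which makes it easy to drop repeated subject IDs before they are passed to blastdbcmd. This is a sketch only. The function name `subject_ids` is hypothetical, the de-duplication is an addition that the shell commands do not perform, and the sample lines are made up for illustration rather than taken from a real BLAST run.

```python
def subject_ids(blast_lines):
    """Return the unique subject IDs (second column) in first-seen order."""
    seen = set()
    ids = []
    for line in blast_lines:
        if line.startswith("#"):
            continue  # comment lines, same filter as grep '^#' -v
        fields = line.split()  # whitespace split, like awk's default
        if len(fields) < 2:
            continue  # blank or malformed line
        subject = fields[1]
        if subject not in seen:
            seen.add(subject)
            ids.append(subject)
    return ids


# Made-up tabular lines: query id, subject id, then further columns.
sample = [
    "# made-up comment line",
    "read1\tprotA\t75.0\t157",
    "read1\tprotB\t60.2\t140",
    "read2\tprotA\t81.3\t120",
]
print("\n".join(subject_ids(sample)))
```

Written to `$Q.hits`, one ID per line, the result can be used with `-entry_batch` in the same way as the file produced by the shell version.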