File      Date        Author           Commit
examples  2012-03-08  Fredrik Lysholm  [3dbac5] Initial add
src       2012-11-19  Fredrik Lysholm  [00e6ff] suppress warning by default initialization of r...
AUTHOR    2012-03-08  Fredrik Lysholm  [3dbac5] Initial add
LICENSE   2012-03-08  Fredrik Lysholm  [3dbac5] Initial add
README    2012-11-19  Fredrik Lysholm  [92e9a4] Added the FASTA-output format!
makefile  2012-03-08  Fredrik Lysholm  [4cf359] Removed threads declaration from mingw compiler...

Read Me

***********
*Compiling*
***********

# gcc
make

# intel
make CC=intel

# gcc 32-bit (for building on a 64-bit system; on a 32-bit system, 32-bit is already the default)
make CC=gcc32

************
*PARAMETERS*
************

HAXAT takes the following parameters (run with --help to print this help):

Usage: haxat [OPTIONS]  DATABASE_FILE  SEARCH_FILE 
Searches a protein FASTA database file for any significant matches, given
the SEARCH_FILE which can be in SFF format or (F)FASTA format.

Example: haxat prot.fa query.fa 

General options:
-l   Maximum query length (default=4000)
-B   Query both strands (default=false)
-F   Assume 454 data input (default=true)

Alignment options:
-m   Substitution matrix (default=BLOSUM62)
-g   Cost to open a gap (nucleotide) (default=8)
-e   Cost to extend a gap (nucleotide) (default=2)
-G   Cost to open a gap (protein) (default=8)
-E   Cost to extend a gap (protein) (default=2)
-S   Single gap frame-shift penalty (default=25)
-D   Double gap frame-shift penalty (default=45)
-k   Flowpeak correction range (default=0.50)
-h   Homopolymer penalty drop-off fraction (default=0.10)
-V   Do peak insertion validation (default=true)

Result options:
-v   Number of results to show for each query (default=50)
-s   Minimum score to write (default=15)
-W   Write queries without results in output (default=true)
-P   Print-width (default=80, <=0 prints each alignment on one row)

File name options:
-O   Output file (default=stdout)



--Sh
     Homopolymer single gap frame-shift penalty (by default derived from
     -h; setting this overrides the value derived from -h)
--Dh
     Homopolymer double gap frame-shift penalty (by default derived from
     -h; setting this overrides the value derived from -h)


--help
     Print help

--version
     Print current version
     
--print-gap-penalties
     Print all parameter- and homopolymer-specific gap penalties employed

--print-verbose-gap-penalties
     Print all insert/delete penalties for all flow-peak values
     
--dump-smatrix
     Prints the values stored in the currently loaded substitution matrix

--force-parameters     
     Skip parameter evaluation
     
--quiet
     Do not write progress to stderr
     

Substitution matrices:
     The -m flag takes either a file parameter or a reference to one of
     the built-in substitution matrices. Built-in matrices are:
     BLOSUM30, BLOSUM40, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90, 
     GONNET, PAM30, PAM70 and PAM250
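
A minimal sketch of how the usage line above could be assembled from a script. The helper name build_haxat_command is hypothetical, and the sketch assumes that each option is followed by its value as a separate argument, which the help text does not state explicitly. It only builds the argument list; it does not run HAXAT.

```python
import shlex


def build_haxat_command(database, query, options=None, executable="haxat"):
    """Return an argv list of the form: haxat [OPTIONS] DATABASE_FILE SEARCH_FILE."""
    argv = [executable]
    for flag, value in (options or {}).items():
        # Assumption: an option and its value are passed as two separate arguments.
        argv += [flag, str(value)]
    argv += [database, query]
    return argv


cmd = build_haxat_command(
    "prot.fa",
    "query.fa",
    {"-m": "BLOSUM80", "-v": 10, "-s": 30, "-O": "hits.txt"},
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the haxat binary is on the PATH.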

***************
*INPUT FORMATS*
***************

*SFF file format*
The SFF file format is a Roche 454 format which HAXAT can read as SEARCH_FILE input. The format is documented in the GS FLX documentation, pages 445-448.

*FASTA format*
The FASTA format is the widely used format for both nucleotide and protein data, described further at http://en.wikipedia.org/wiki/FASTA. FASTA is currently the only accepted format for DATABASE_FILE input in HAXAT, and it is also accepted for SEARCH_FILE input.

*FFASTA format*
FFASTA is a non-standard format for expressing flowgrams as text. It is a FASTA-like format where each record starts with a header line marked by a leading “>” sign, followed by the data. In FFASTA the data consists of flowpeak values separated by spaces. FFASTA expects the flow order “TACG”. Any line starting with “#” is ignored. As an example, FFASTA could look like this:

>flowgram-1
1.57 0.11 1.69 0.12 0.08 3.62 0.07 …
>flowgram-2
0.08 0.15 0.98 1.89 0.06 5.13 0.13 …
…
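
The format rules above can be sketched as a small reader. This is an illustration only, not HAXAT's own parser: the function names are hypothetical, flow values are assumed to be allowed to span several lines, and the base call simply rounds each flow value to the nearest homopolymer length, which is a naive assumption rather than the correction HAXAT performs.

```python
FLOW_ORDER = "TACG"  # flow order expected by FFASTA according to this README


def parse_ffasta(lines):
    """Yield (name, [flow values]) for each record in an iterable of text lines."""
    name, flows = None, []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and "#" comment lines are skipped
        if line.startswith(">"):
            if name is not None:
                yield name, flows
            name, flows = line[1:].strip(), []
        else:
            flows.extend(float(token) for token in line.split())
    if name is not None:
        yield name, flows


def call_bases(flows, flow_order=FLOW_ORDER):
    """Naive base call: round every flow value to a homopolymer length."""
    bases = []
    for i, value in enumerate(flows):
        count = int(value + 0.5)  # round half up; values near 0 give no base
        bases.append(flow_order[i % len(flow_order)] * count)
    return "".join(bases)


example = """\
# comment lines are skipped
>flowgram-1
1.57 0.11 1.69 0.12 0.08 3.62 0.07
>flowgram-2
0.08 0.15 0.98 1.89 0.06 5.13 0.13
"""

records = list(parse_ffasta(example.splitlines()))
for name, flows in records:
    print(name, call_bases(flows))
```

With the seven flow values shown for each record, the first flowgram is called as TTCCAAAA and the second as CGGAAAAA.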


****************
*OUTPUT FORMATS*
****************
HAXAT writes its results in a new alignment format that incorporates the protein-nucleotide alignments as well as annotations for the frame-shifting positions. In addition to the usual score, identities, positives, coverage and gaps, the output reports the number of frame-shifting gaps (gaps in nucleotide space, nGaps).


Score = 593
Identities = 117/157 (75%), Positives = 135/157 (86%), Coverage = 473/478 (99%)
Gaps = 2, nGaps = 5
Strand = Plus / Plus


Query:     2 AGACGACGCAGATTTGGCCGCCGCCGCCGATACTACCGAAAGAGGCGAGGGGGATGGAGACGTCGATATGGGAGGCGT 79
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||+++      |||
Sbjct:     4 ArgArgArgArgPheGlyArgArgArgArgTyrTyrArgLysArgArgGlyGlyTrpArgArgArgPheArgIleArg 29


Query:    80 TGGAGACGGACTGCCTGGAGACGTCGGCGGGTAAGGAGATGGCGGCGTTCCGTCTTCCGTAGAGGGGGACGTAGAGCG 157
                ||||||      |||||||||   |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct:    30 ---ArgArgArgProTrpArgArgTrpArgValArgArgTrpArgArgSerValPheArgArgGlyGlyArgArgAla 54


Query:   158 CGCCCCTACCGCATTTCAGCTTGGAACCCAAGAGTAGTGAGGAGAGTAGTAATTAAGGGGTGGTGGCcACTGATCCAA 234
             ||||||||||||||||||||||||||||||+++|||+++|||         |||   ||||||||||-|+++||||||
Sbjct:    55 ArgProTyrArgIleSerAlaTrpAsnProLysValLeuArgAsnCysArgIleThrGlyTrpTrpProValIleGln 80


Query:   235 TGCATGGAGGGGTATGAGTACCTGAGATACAGACCGATGTACAAT---ATTGAGAAAAGATGGATATATAAAAAACAG 309
             ||||||+++|||   |||+++++++++|||+++||||||         +++|||      ||||||+++   ||||||
Sbjct:    81 CysMetAspGlyMetGluTrpIleLysTyrLysProMetAspLeuArgValGluAlaAsnTrpIlePheAsnLysGln 106


Query:   310 AGCAGTAGAGTGTTCAGTGAGGACATGGGCTACCTGATGCAGgTACGGTGGAGGATGGGCCTCAGgAACAATTTCCTT 386
                |||++++++   +++|||   ||||||||||||||||||-|||||||||||||||+++||||-|   ||||||||
Sbjct:   107 AspSerLysIleGluThrGluGlnMetGlyTyrLeuMetGln-TyrGlyGlyGlyTrpSerSerGlyValIleSerLe 132


Query:   387 AAGGGgTTTATTCAATGAGCACGAACTGTGGAGAAATGTATGGTCCAGGTCTAACGACGGAATGGACCTAGCCcAGAT 463
             |   |-|||||||||||||+++   ||||||||||||+++||||||+++|||||||||||||||||||||   -||||
Sbjct:   132 uGluGlyLeuPheAsnGluAsnArgLeuTrpArgAsnIleTrpSerLysSerAsnAspGlyMetAspLeuVal-ArgT 158


Query:   464 ACTTCGGGTGC 474
             |||||||||||
Sbjct:   158 yrPheGlyCys 161
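
The summary lines at the top of an alignment are easy to pick apart in a script. The sketch below is based only on the example shown here, so the exact layout is an assumption and parse_summary is a hypothetical helper. It also recomputes the percentages from the reported fractions.

```python
import re

SUMMARY = """\
Score = 593
Identities = 117/157 (75%), Positives = 135/157 (86%), Coverage = 473/478 (99%)
Gaps = 2, nGaps = 5
Strand = Plus / Plus
"""


def parse_summary(text):
    """Collect 'Key = a/b' fractions and 'Key = n' integers from the summary."""
    stats = {}
    # Fractions such as "Identities = 117/157 (75%)"
    for key, num, den in re.findall(r"(\w+) = (\d+)/(\d+)", text):
        stats[key] = (int(num), int(den))
    # Plain integers such as "Score = 593" or "nGaps = 5"
    for key, value in re.findall(r"(\w+) = (\d+)(?![\d/])", text):
        stats[key] = int(value)
    return stats


stats = parse_summary(SUMMARY)
for key in ("Identities", "Positives", "Coverage"):
    num, den = stats[key]
    print(f"{key}: {num}/{den} = {round(100 * num / den)}%")
print("Score:", stats["Score"], "Gaps:", stats["Gaps"], "nGaps:", stats["nGaps"])
```

The recomputed values (75%, 86% and 99%) agree with the percentages printed in the example.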



HAXAT can also produce FASTA output containing corrected query sequences, based on the top hit. The output is either only the part of the query covered by the alignment (output format 2) or the complete query sequence in which the part covered by the alignment has been corrected (output format 3).



***************************
*HAXAT w. BLAST heuristics*
***************************

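# Preselect candidate proteins with blastx, then run HAXAT against only those.
# $BLASTDB = BLAST protein database, $Q = nucleotide query file in FASTA format.
blastx -db "$BLASTDB" -query "$Q" -num_alignments 100 -outfmt 7 -show_gis -seg no > "$Q.bres"
# Collect the subject IDs (column 2) of all hits, skipping comment lines.
grep -v '^#' "$Q.bres" | awk '{print $2}' | sort -u > "$Q.hits"
blastdbcmd -db "$BLASTDB" -entry_batch "$Q.hits" > "$Q.db"
haxat "$Q.db" "$Q" $HAXAT_PARAMETERS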
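
The hit-extraction step of this pipeline (drop comment lines, keep the second column) can also be written in Python, for instance when each subject ID should be listed only once in its original order. This is a sketch under assumptions: unique_hit_ids is a hypothetical helper, and the sample rows are made up to resemble tabular output with comment lines rather than taken from a real blastx run.

```python
def unique_hit_ids(tabular_lines):
    """Return subject IDs (column 2) in first-seen order, without duplicates."""
    seen, ordered = set(), []
    for line in tabular_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.split()
        if len(fields) < 2:
            continue  # not a hit row
        subject = fields[1]
        if subject not in seen:
            seen.add(subject)
            ordered.append(subject)
    return ordered


# Made-up sample rows; only the first two columns matter here.
sample = [
    "# Query: read1",
    "# 3 hits found",
    "read1\tprotA\t75.00\t157",
    "read1\tprotB\t60.00\t120",
    "read1\tprotA\t58.00\t40",
]
hits = unique_hit_ids(sample)
print("\n".join(hits))
```

Writing the returned IDs one per line gives a file that can be passed to blastdbcmd through -entry_batch, as in the commands above.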

