######################################
README for HaMStR v. 13.2.6
###### License Information ###########
# Copyright (C) 2009 INGO EBERSBERGER, ebersberger@bio.uni-frankfurt.de
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published
# by the Free Software Foundation; either version 3 of the License
# or any later version.
# This program is distributed in the hope that it will be useful
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; If not, see http://www.gnu.org/licenses
######################################
1) Installation
1.1 Before you can use hamstrsearch_local, you need to install the following programs,
which are required to run the HaMStR search. Installation on Unix and MacOS X should
be straightforward. It may be more involved on Windows systems.
a) hmmsearch version 3 from http://hmmer.org/.
b) blastall from ftp://ftp.ncbi.nih.gov/blast/executables/release/. Alternatively,
you can use the blast+ suite.
c) genewise version 2.4.1 from http://www.ebi.ac.uk/~birney/wise2/
d) clustalw2 from http://www.clustal.org/download/current/
e) mafft-linsi from http://mafft.cbrc.jp/alignment/software/
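To see which of these programs are already reachable, a small check like the following may help. It is only a sketch: it assumes the executables carry the names listed above, so pass other names (for example blastp) if yours differ.

```shell
#!/bin/sh
# Print the names of all given programs that are not found in PATH.
check_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading blank and print the result on one line.
    echo "${missing# }"
}

missing_tools=$(check_tools hmmsearch blastall genewise clustalw2 mafft-linsi)
if [ -n "$missing_tools" ]; then
    echo "Not found in PATH: $missing_tools"
else
    echo "All required programs were found."
fi
```

The check only tests whether the names resolve; it does not verify program versions.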
MAC Users: By default, MacOS X ships with BSD grep and sed.
I strongly suggest switching to the GNU versions of these programs, as HaMStR
does not run with BSD sed and may have issues with BSD grep.
If you have MacPorts installed (recommended):
sudo port install grep
sudo port install gsed
Mavericks users may have issues with libgcc. In this case, try running
xcode-select --install
first.
Alternatively, you can obtain the programs from the following URLs
gnugrep:
http://code.google.com/p/rudix/downloads/detail?name=grep-2.11-0.pkg
gnused:
http://code.google.com/p/rudix/downloads/detail?name=sed-4.2.1-1.dmg&can=2&q=label%3ARudix-2011
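To find out whether a given sed or grep is the GNU version, you can inspect its version string. The sketch below assumes that GNU tools print a first --version line of the form "sed (GNU sed) 4.8"; the program names in the loop are examples and may differ on your system.

```shell
#!/bin/sh
# Return success if the named program reports itself as a GNU tool.
# Assumption: GNU sed and GNU grep print "(GNU sed)" or "(GNU grep)" in the
# first line of their --version output; other programs do not.
is_gnu() {
    first_line=$("$1" --version 2>/dev/null | head -n 1)
    case "$first_line" in
        *"(GNU "*) return 0 ;;
        *) return 1 ;;
    esac
}

# gsed is the name mentioned above; add other names if yours differ.
for prog in sed grep gsed; do
    if is_gnu "$prog"; then
        echo "$prog: GNU version"
    else
        echo "$prog: not found or not a GNU version"
    fi
done
```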
1.2 Adaptation of the HaMStR perl script
LINUX: This should be rather simple: just go to the bin directory of your HaMStR installation
and run ./configure
MAC: The procedure is simple too: just go to the bin directory of your HaMStR installation
and run ./configure_mac
NOTE: by default this will change all sed and grep commands in the perl script to gsed
and grep. If your versions of sed or grep are named differently, e.g. gnugrep and gnused,
just edit the configure_mac script accordingly.
If required, e.g. when using non-standard program names, you can also adjust the default values of
the following variables manually in the perl script:
my $prog = 'hmmsearch'; This is the name of the hmmsearch program
my $blast_prog = 'blastall'; if you use the blastall program. (Default)
my $blast_prog = 'blastp'; if you use the blastp program from the blast+ suite.
my $alignmentprog = 'clustalw2'; This is the name of the clustalw-executable.
Once you have completed the above steps, you should be able to run hamstrsearch_local. It is convenient to add the
path to the HaMStR script to the paths where your system looks for executables. Once you have done so,
you can omit the path to hamstr in the examples below and directly issue the command 'hamstr'.
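One way to add the script directory to your search path in a Bourne-type shell is sketched below. The value of HAMSTR_BIN is a placeholder and not a fixed install location; replace it with the bin directory of your own installation.

```shell
#!/bin/sh
# HAMSTR_BIN is a placeholder; point it at the bin directory of your installation.
HAMSTR_BIN="$HOME/hamstr.v13/bin"

# Append the directory only if it is not already part of PATH.
case ":$PATH:" in
    *":$HAMSTR_BIN:"*) ;;
    *) PATH="$PATH:$HAMSTR_BIN"; export PATH ;;
esac
```

To make the change permanent, these lines can be placed in your shell start-up file.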
2) Directory structure
Once you have unpacked the tar-file the following directory structure should be available:
hamstr.v13
    bin              #contains the perl script and the perlmodules
    blast_dir        #contains the blast dbs for the individual species
    core_orthologs   #contains the directories for the individual core_ortholog sets
    data             #contains the data in which orthologs should be searched for
    tmp              #a tmp directory to store metadata
By default the paths in the hamstrsearch_local.pl script match this directory structure. If you want to use
a different layout, you need to change the paths accordingly. In particular, you can specify the
location of the core_orthologs directory and of the blast_dir using the appropriate command line flags listed below.
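A quick way to confirm that the unpacked directory is complete is sketched below. The default value of HAMSTR_ROOT is an assumption about where you unpacked the tar-file.

```shell
#!/bin/sh
# Report every expected subdirectory that is missing below the given root.
check_layout() {
    root=$1
    status=0
    for dir in bin blast_dir core_orthologs data tmp; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $root/$dir"
            status=1
        fi
    done
    return $status
}

# HAMSTR_ROOT is a placeholder for the location of your installation.
HAMSTR_ROOT="${HAMSTR_ROOT:-hamstr.v13}"
if check_layout "$HAMSTR_ROOT"; then
    echo "directory structure looks complete"
fi
```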
3) Testing hamstr (some demos included as well)
I have provided a small test set of ESTs and protein sequences that you can use to test your local set up of hamstr.
To run the tests, change to the data directory in the hamstrsearch_local directory and issue the following commands:
a) ../bin/hamstr -h
If everything works correctly, you should obtain a help message explaining the different options of HaMStR.
In this case, you can proceed to the next testing step.
b) ../bin/hamstr -sequence_file=testset_cDNA.fa -taxon=TEST -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=317.hmm -central
The HaMStR search with 317.hmm should obtain 2 hits among the EST data. The results are stored as fasta in the
file fa_dir_testset_modelorganisms_hmmer3_DROME/317.fa (translated) and 317.cds.fa (coding sequence).
The hit sequences are also written to the file hamstrsearch_testset_cDNA__modelorganisms_hmmer3.out (translated)
and hamstrsearch_testset_cDNA_modelorganisms_hmmer3_cds.out (coding sequence).
c) ../bin/hamstr -sequence_file=testset_cDNA.fa -taxon=test -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=317.hmm -representative -central
The HaMStR search with 317.hmm will obtain the same hits as in the previous search, however, the program will
output only the hit that is most similar to the reference sequence. If two or more hits match to non-overlapping
parts of the reference protein, these hits will be kept and subsequently concatenated. The Fasta-header of
the hit-sequence will then contain information which sequences have been concatenated, and how long they are.
d) ../bin/hamstr -sequence_file=testset-prot.fa -taxon=test2 -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=239.hmm -central
The HaMStR search will result in 2 co-orthologs to the Drosophila protein.
e) ../bin/hamstr -sequence_file=testset-prot.fa -taxon=test2 -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=239.hmm -representative -central
HaMStR will output only the sequence that is most similar to the reference protein.
f) ../bin/hamstr -sequence_file=testset-prot.fa -taxon=test2 -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=239.hmm -representative -concat -central
HaMStR will check whether the ortholog candidates align to non-overlapping parts of the reference sequence.
If so, the option '-concat' will result in the concatenation of such sequences.
If all tests succeed, you are ready to use hamstrsearch_local for your analyses.
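After test b) you may want to confirm that the result files were written. The sketch below uses the file names given for test b) and assumes that both files end up in the same fa_dir directory below the data directory; adjust the paths if your output is stored elsewhere.

```shell
#!/bin/sh
# Report for each given file whether it exists and is non-empty.
report_outputs() {
    status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            status=1
        fi
    done
    return $status
}

# Directory name as given for test b); run this from the data directory.
RESULT_DIR=fa_dir_testset_modelorganisms_hmmer3_DROME
if [ -d "$RESULT_DIR" ]; then
    report_outputs "$RESULT_DIR/317.fa" "$RESULT_DIR/317.cds.fa" ||
        echo "test b) may not have completed"
fi
```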
4) Options to hamstr
There are a number of options to hamstr that can be set on the command line. You can also list them
with the command ../bin/hamstr -h
-sequence_file=<>
path and name of the file containing the sequences hmmer is run against.
-hmmset=<>
specifies the name of the core-ortholog set.
The program will look for the files in the default directory 'core_orthologs' unless you specify
a different path via the option -hmmpath.
-refspec=<>
sets the reference species. Note, it has to be a species that contributed sequences
to the hmms you are using. NO DEFAULT IS SET! For a list of possible reference
taxa you can have a look at the speclist.txt file in the default core-ortholog sets
that come with this distribution. Please use the abbreviations in this list. If you choose
to use core-orthologs where not every taxon is represented in all core-orthologs, you
can provide a comma-separated list with the preferred refspec first. The lower-ranking
reference species will only be used if a certain gene is not present in the preferred
reference species due to alternative paths in the transitive closure to define the core-orthologs.
CURRENTLY NO CHECK IS IMPLEMENTED!
NOTE: A BLAST-DB FOR THE REFERENCE SPECIES IS REQUIRED!
-taxon
You need to specify a default taxon name from which your ESTs or protein sequences are derived.
-est
set this flag if you are searching in ESTs. Note, if neither the -est nor the -protein flag is set, HaMStR will
guess the sequence type. If you select this flag, make sure to specify how to deal with introns retained in the
ESTs. Check option -intron!
-protein
set this flag if you are searching in protein sequences. Note, if neither the -est nor the -protein flag is set, HaMStR will
guess the sequence type.
USING NON-DEFAULT PATHS
-blastpath=<>
Lets you specify the absolute or relative path to the blast databases. DEFAULT: the blast_dir directory of the HaMStR installation
-hmmpath=<>
Lets you specify the absolute or relative path to the core ortholog set. DEFAULT: the core_orthologs directory of the HaMStR installation
-outpath=<>
You can determine the path to the HaMStR output. Default: current directory.
ADDITIONAL OPTIONS
-append
set this flag if the output should be appended to the files *.out and *_cds.out. This becomes relevant when running
hamstrsearch with individual hmms and you want to combine the results.
-central
set this flag to store the modified infile in the same directory as the infile rather than in the output dir.
-checkCoorthologsRef
If the re-blast does not identify the original reference protein sequence as best hit, HaMStR will check whether the best blast
hit is likely a co-ortholog of the reference protein relative to the search taxon. NOTE: Setting this flag will substantially increase
the sensitivity of HaMStR but most likely affect also the specificity, especially when the search taxon is evolutionarily only very
distantly related to the reference taxon.
-cleartmp
set this flag to remove existing tmp dir in the HaMStR output directory.
-concat
set this flag if you want hamstr to concatenate sequences that align to non-overlapping parts of the reference protein.
If you choose this flag, no co-orthologs will be predicted.
-cpu
You can specify the number of parallel jobs in the HaMStR search. HaMStR uses the Parallel::ForkManager module for this purpose.
-eval_blast=<>
This option sets the e-value cut-off for the Blast search. Default: 10
-eval_hmmer=<>
This option sets the e-value cut-off for the HMM search. Default: 1
-filter=<T|F>
Set this flag to F if the re-blast should be performed without low-complexity filtering. Default is T.
-force
Setting this flag forces hamstr to overwrite existing output files (files ending with .out) without further asking.
-hit_limit=<>
By default, HaMStR will re-blast all hmmsearch hits against the reference proteome. Reduce the number
of hits for reblast with this option.
-hmm
Option to provide only a single hmm to be used for the search.
Note, this file has to end with .hmm
-intron=<keep|mask|remove>
Specify how to deal with introns that may occur in transcript sequences. Default: keep - Introns will be retained in the transcript
but will be identified by lower case letters.
-longhead
Set this flag if your sequence identifiers contain whitespaces and you wish to keep
the entire sequence identifier throughout your analysis. HaMStR will then replace the whitespaces with
a '__'. If this flag is not set, HaMStR will truncate the sequence
identifier at the first whitespace, but only if the truncated sequence identifiers remain unique.
NOTE: overly long sequence headers (more than roughly 30 characters) will cause trouble in the hmmsearch, as the program will truncate
the output!
-nonoverlapping_cos
If you set this flag, non-overlapping co-orthologs will be reported as well. NOTE: this flag is still experimental
-rbh
set this flag if you want to use a reciprocal best hit criterion. Only the highest scoring
hit from the hmmer search will be used for re-blast.
-relaxed
set this flag if the reciprocity criterion should count as fulfilled when the re-blast against
any of the primer taxa was successful. Note that setting this flag will substantially decrease the
stringency of the ortholog assignment, with the consequence of an increased number of false positives.
-representative
From all sequences that fulfill the reciprocity criterion the one showing the highest similarity to the
core ortholog sequence in the reference species is identified and selected as representative.
-reuse
Set this flag if you want to prevent HaMStR from overwriting previous results.
-show_hmmsets
setting this flag will list all available core ortholog sets in the specified path. Can be combined with -hmmpath.
-silent
Suppresses (almost) all print statements to the screen
-sort_global_align
setting this flag will tell hamstr to sort ortholog candidates according to their global alignment score to the reference
sequence rather than according to the score they have achieved in the hmmer search (local). NOTE: In the case of searching
EST data this flag is automatically set.
-strict
set this flag if the reciprocity criterion should only count as fulfilled when the re-blast against
all primer taxa was successful
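As an illustration of how -hmm and -append can be combined, the following sketch prints one hamstr command per hmm file so that all results end up in the same *.out files. It is a dry run that only prints the commands. The location of the hmm files (a directory hmm_dir inside the core-ortholog set) is an assumption based on the layout described for custom sets in section 5.

```shell
#!/bin/sh
# Print one hamstr command per .hmm file in the given directory (dry run).
# Arguments: the hmm directory, followed by the options shared by all runs.
print_hamstr_cmds() {
    hmm_dir=$1
    shift
    for hmm in "$hmm_dir"/*.hmm; do
        [ -e "$hmm" ] || continue      # no .hmm files: print nothing
        echo "../bin/hamstr $* -hmm=$(basename "$hmm") -append"
    done
}

# Example call from the data directory, using the test set shipped with HaMStR.
print_hamstr_cmds ../core_orthologs/modelorganisms_hmmer3/hmm_dir \
    -sequence_file=testset_cDNA.fa -taxon=TEST \
    -hmmset=modelorganisms_hmmer3 -refspec=DROME -est
```

Once the printed commands look right, the output can be piped to sh to run them one after another.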
5) Generation of new core-ortholog sets (will be solved soon!)
This distribution comes with a couple of core-ortholog sets. However, you are free to
generate and use your own core-orthologs. There are a couple of conventions, however,
that should be obeyed. The easiest way may be to look at the provided
files, but here are some general guidelines:
a) give your core-ortholog set a name, e.g. custom1
b) create a directory called 'custom1' in the core_orthologs directory
c) create your ortholog cluster from your taxon set of interest and your
favorite orthology prediction program.
d) the sequences in the individual ortholog clusters must be in fasta format
where the header should look like the following:
>core-ortholog-name|taxon_name|protein-id
The core-ortholog-name should also be the file name. The taxon_name should
of course be the name of the individual taxa used for the orthology prediction.
Please avoid blanks.
e) align the sequences in the core-ortholog cluster.
f) build and calibrate the hmms for the individual core-orthologs. The
hmm file names must be 'core-ortholog-name.hmm'.
g) put the hmms into a directory hmm_dir in the custom1 directory
h) enter all sequences for the core-orthologs into a single file called
custom1.fa and place this file in the directory custom1.
i) for each of the taxa you wish to use as reference species in the hamstrsearch,
generate a file containing all protein sequences that were used in the initial
orthology prediction. Name this file taxon_name_prot.fa and make sure that
no linebreaks interrupt a sequence. You can use the script nentferner.pl in the
bin directory to remove newlines.
j) generate a directory taxon_name in the directory blast_dir and place taxon_name_prot.fa
into this directory.
k) run formatdb -n taxon_name_prot -t taxon_name_prot -i taxon_name_prot.fa
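The header convention from step d) can be checked automatically before you build the hmms. The sketch below reports every fasta header that does not consist of exactly three non-empty fields separated by '|' or that contains blanks. It is an illustration and not part of the HaMStR distribution; the file path in the example call is a placeholder.

```shell
#!/bin/sh
# Print every FASTA header that violates the convention
# >core-ortholog-name|taxon_name|protein-id (three fields, no blanks).
# The exit status is non-zero if at least one header is reported.
check_headers() {
    awk -F'|' '
        /^>/ {
            if (NF != 3 || $0 ~ /[ \t]/ || $1 == ">" || $2 == "" || $3 == "") {
                print FILENAME ":" FNR ": " $0
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$@"
}

# Example call; the path is a placeholder for your own core-ortholog set.
SET_FASTA=../core_orthologs/custom1/custom1.fa
if [ -f "$SET_FASTA" ]; then
    check_headers "$SET_FASTA" || echo "please fix the headers listed above"
fi
```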
When you have completed all the above steps, which is admittedly a bit tedious, you should
be able to run the hamstrsearch with your own core-orthologs. Good luck with it!
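Steps i) to k) can be scripted roughly as follows. This is a sketch under several assumptions: TAXON and INPUT are placeholders, the script is run from the data directory, and the awk function is a stand-in for illustration rather than a description of what nentferner.pl does.

```shell
#!/bin/sh
# Write each FASTA record as one header line plus one sequence line.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1"
}

# Placeholders: set TAXON to your taxon_name and INPUT to your own file.
TAXON=taxon_name
INPUT=${TAXON}_prot_wrapped.fa      # hypothetical name of the wrapped input
BLAST_DIR=../blast_dir              # assumes you run this from the data directory

if [ -f "$INPUT" ]; then
    mkdir -p "$BLAST_DIR/$TAXON"                                  # step j
    unwrap_fasta "$INPUT" > "$BLAST_DIR/$TAXON/${TAXON}_prot.fa"  # step i
    if command -v formatdb >/dev/null 2>&1; then
        # step k, run inside the taxon directory
        (cd "$BLAST_DIR/$TAXON" &&
            formatdb -n "${TAXON}_prot" -t "${TAXON}_prot" -i "${TAXON}_prot.fa")
    else
        echo "formatdb not found; run step k) manually"
    fi
fi
```

If the input file does not exist, the script does nothing, so it is safe to run it before adjusting the placeholders.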