######################################
README for HaMStR v. 13.2.6
###### License Information ###########
# Copyright (C) 2009 INGO EBERSBERGER, ebersberger@bio.uni-frankfurt.de
# This program is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published
# by the Free Software Foundation; either version 3 of the License
# or any later version.

# This program is distributed in the hope that it will be useful
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; If not, see http://www.gnu.org/licenses
######################################
1) Installation
	1.1 To use hamstrsearch_local you need first to install a number of programs that 
		are required to run the HaMStR search. Installation on Unix and MacOS X should 
		be straightforward. It may be a bit more interesting on Windows systems...
   
		a) hmmsearch version 3 from http://hmmer.org/.
		b) blastall from ftp://ftp.ncbi.nih.gov/blast/executables/release/. Alternatively, 
		you can use the blast+ suite.
 		c) genewise version 2.4.1 from http://www.ebi.ac.uk/~birney/wise2/
 		d) clustalw2 from http://www.clustal.org/download/current/
		e) mafft-linsi from http://mafft.cbrc.jp/alignment/software/
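Before continuing, it can help to confirm that the programs listed above are visible on your PATH. The following is a minimal sketch and not part of the HaMStR distribution; the function name missing_tools is made up for illustration, and the program names are the defaults mentioned in this README (use blastp instead of blastall if you work with the blast+ suite).

```shell
# Print every given program name that cannot be found on the PATH.
# (missing_tools is a hypothetical helper, not shipped with HaMStR.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Program names as listed in this README; adjust if yours are named differently.
missing=$(missing_tools hmmsearch blastall genewise clustalw2 mafft-linsi)
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All required programs were found."
fi
```

An empty result only shows that the programs can be found; it does not check their versions.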

		MAC users: By default, Mac OS X ships with BSD grep and sed. 
		I strongly suggest switching to the GNU versions of these programs, as HaMStR 
		does not run with BSD sed and may have issues with BSD grep.

		If you have MacPorts installed (recommended):
			sudo port install grep
			sudo port install gsed
		Mavericks users may have issues with the libgcc. In these cases try running 
			xcode-select --install 
		first.
	
		Alternatively, you can obtain the programs from the following URLs 
		gnugrep:	
		http://code.google.com/p/rudix/downloads/detail?name=grep-2.11-0.pkg
		
		gnused:	
		http://code.google.com/p/rudix/downloads/detail?name=sed-4.2.1-1.dmg&can=2&q=label%3ARudix-2011
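To find out whether the sed and grep on your system are the GNU versions, you can inspect their version banner. This is a rough sketch under the assumption that GNU tools report "(GNU sed)" or "(GNU grep)" when called with --version, while BSD sed rejects that option; the helper name is_gnu_tool is hypothetical.

```shell
# Succeeds if the given command identifies itself as GNU sed or GNU grep.
# A tool that rejects --version produces no output and therefore no match.
is_gnu_tool() {
    "$1" --version 2>/dev/null | grep -Eq '\(GNU (sed|grep)\)'
}

for tool in sed grep; do
    if is_gnu_tool "$tool"; then
        echo "$tool: GNU version"
    else
        echo "$tool: not the GNU version"
    fi
done
```

The same function can be pointed at differently named executables, e.g. is_gnu_tool gsed.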


	1.2 Adaptation of the HaMStR perl script
		LINUX: 	This should be rather simple: just go to the bin directory of your HaMStR installation 
			and run ./configure
		MAC:	The procedure is simple too: just go to the bin directory of your HaMStR installation 
			and run ./configure_mac
			NOTE: per default this will change all sed and grep commands in the perl script to gsed 
			and grep. If your version of sed or grep is named differently, e.g. gnugrep and gnused, 
			just edit the configure_mac script accordingly.		


		If required, e.g. when using non-standard program names, you can also adjust the default values of 
		the following variables manually in the perl script:

		my $prog = 'hmmsearch';          # name of the hmmsearch program
		my $blast_prog = 'blastall';     # if you use the blastall program (default)
		my $blast_prog = 'blastp';       # if you use the blastp program from the blast+ suite
		my $alignmentprog = 'clustalw2'; # name of the clustalw executable

If you have completed the above steps, you should be able to run hamstrsearch_local. It is convenient to add the 
path to the HaMStR script to the paths where your system looks for executables. If you have managed to do so, 
you can omit the path to hamstr in the examples below and directly issue the command 'hamstr'.
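One way to add the HaMStR bin directory to the search path is sketched below. The install location $HOME/hamstr.v13/bin is only an assumption for illustration, and add_to_path is a hypothetical helper; place the lines in your shell start-up file if you want the change to persist.

```shell
# Append a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                           # already present, nothing to do
        *) PATH="$PATH:$1"; export PATH ;;
    esac
}

# Assumed install location; replace it with your own.
add_to_path "$HOME/hamstr.v13/bin"
```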

2) Directory structure
Once you have unpacked the tar-file the following directory structure should be available:

	hamstr.v13
		bin		  #contains the perl script and the perlmodules
		blast_dir 	  #contains the blast dbs for the individual species
		core_orthologs 	  #contains the directories for the individual core_ortholog sets
		data 		  #contains the data in which orthologs should be searched for
		tmp 		  #a tmp directory to store metadata
	  
By default, the paths in the hamstrsearch_local.pl script match this directory structure. If you rearrange the 
directories, you need to change the paths accordingly. In particular, you can specify the 
location of the core_orthologs directory and of the blast_dir using the appropriate command line flags listed below.
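A quick way to verify that unpacking produced the layout shown above is sketched here. The helper check_layout is hypothetical; the directory names are taken from the listing in this section.

```shell
# Report every expected subdirectory that is missing below the given root.
# Returns a non-zero status if at least one directory is missing.
check_layout() {
    root=$1
    status=0
    for dir in bin blast_dir core_orthologs data tmp; do
        if [ ! -d "$root/$dir" ]; then
            echo "missing: $root/$dir"
            status=1
        fi
    done
    return $status
}

# Example call (path is an assumption):
#   check_layout "$HOME/hamstr.v13"
```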

3) Testing hamstr (some demos included as well)

I have provided a small test set of ESTs and protein sequences that you can use to test your local setup of hamstr. 
To run the tests, change to the data directory of your HaMStR installation and issue the following commands:

	a) ../bin/hamstr -h

	If everything works correctly, you should obtain a help message explaining the different options of HaMStR. 
	In this case, you can proceed to the next testing step.

	b) ../bin/hamstr -sequence_file=testset_cDNA.fa -taxon=TEST -hmmset=modelorganisms_hmmer3 -refspec=DROME  -hmm=317.hmm -central
  
	The HaMStR search with 317.hmm should obtain 2 hits among the EST data. The results are stored as fasta in the 
	file fa_dir_testset_modelorganisms_hmmer3_DROME/317.fa (translated) and 317.cds.fa (coding sequence). 
	The hit sequences are also written to the file hamstrsearch_testset_cDNA__modelorganisms_hmmer3.out (translated)
	and  hamstrsearch_testset_cDNA_modelorganisms_hmmer3_cds.out (coding sequence).

	c) ../bin/hamstr -sequence_file=testset_cDNA.fa -taxon=test -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=317.hmm -representative -central

	The HaMStR search with 317.hmm will obtain the same hits as in the previous search. However, the program will
	output only the hit that is most similar to the reference sequence. If two or more hits match non-overlapping 
	parts of the reference protein, these hits will be kept and subsequently concatenated. The Fasta header of 
	the hit sequence will then state which sequences have been concatenated and how long they are.

	d) ../bin/hamstr -sequence_file=testset-prot.fa -taxon=test2 -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=239.hmm -central
	The HaMStR search will result in 2 co-orthologs to the Drosophila protein.

	e) ../bin/hamstr -sequence_file=testset-prot.fa -taxon=test2 -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=239.hmm -representative -central
	HaMStR will output only the sequence that is most similar to the reference protein

	f) ../bin/hamstr -sequence_file=testset-prot.fa -taxon=test2 -hmmset=modelorganisms_hmmer3 -refspec=DROME -hmm=239.hmm -representative -concat -central
	HaMStR will check whether the ortholog candidates align to non-overlapping parts of the reference sequence. 
	If so, the option '-concat' will result in the concatenation of such sequences.
	
If all tests succeed everything should be fine and you are ready to use hamstrsearch_local for your analyses.
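After a test run you may want to confirm that the result files named above were actually written. The sketch below only checks that files exist and are not empty; it makes no assumption about their content. The helper missing_or_empty is hypothetical.

```shell
# Print every given file that does not exist or has zero size.
missing_or_empty() {
    for f in "$@"; do
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}

# Example for test b), using the file names given in the text above:
#   missing_or_empty fa_dir_testset_modelorganisms_hmmer3_DROME/317.fa \
#                    fa_dir_testset_modelorganisms_hmmer3_DROME/317.cds.fa
```

No output means that all listed files are present and non-empty.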

4) Options to hamstr

There are a number of options to hamstr that can be set on the command line. You will get the list of options also 
when you issue the command ../bin/hamstr -h

-sequence_file=<>
		path and name of the file containing the sequences hmmer is run against.
-hmmset=<>
		specifies the name of the core-ortholog set.
		The program will look for the files in the default directory 'core_orthologs' unless you specify
		a different path via the option -hmmpath.
-refspec=<>
		sets the reference species. Note, it has to be a species that contributed sequences 
		to the hmms you are using. NO DEFAULT IS SET! For a list of possible reference
		taxa you can have a look at the speclist.txt file in the default core-ortholog sets
		that come with this distribution. Please use the abbreviations in this list. If you choose
		to use core-orthologs where not every taxon is represented in all core-orthologs, you
		can provide a comma-separated list with the preferred refspec first. The lower-ranking 
		reference species will only be used if a certain gene is not present in the preferred 
		refspecies due to alternative paths in the transitive closure to define the core-orthologs.
		CURRENTLY NO CHECK IS IMPLEMENTED!
		NOTE: A BLAST-DB FOR THE REFERENCE SPECIES IS REQUIRED!
-taxon
		You need to specify a default taxon name from which your ESTs or protein sequences are derived.
-est
		set this flag if you are searching in ESTs. Note, if neither the -est nor the -protein flag is set, HaMStR will
		guess the sequence type. If you select this flag, make sure to specify how to deal with introns retained in the 
		ESTs. Check option -intron!
-protein
		set this flag if you are searching in protein sequences. Note, if neither the -est nor the -protein flag is set, HaMStR will
		guess the sequence type.
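Because no check of the -refspec value is implemented and a BLAST database is required for each reference species, it may be worth testing a refspec list before starting a search. The sketch below assumes that every reference species has its own directory named after the taxon inside blast_dir, as described for custom core-ortholog sets in section 5. The helper name and the abbreviation HUMAN in the example are illustrative assumptions.

```shell
# Print each species of a comma-separated refspec list that has no
# directory of the same name below the given blast_dir.
refspecs_without_blastdb() {
    blast_dir=$1
    refspec_list=$2
    printf '%s\n' "$refspec_list" | tr ',' '\n' | while read -r spec; do
        [ -d "$blast_dir/$spec" ] || printf '%s\n' "$spec"
    done
}

# Example call, preferred reference species first (HUMAN is an assumed abbreviation):
#   refspecs_without_blastdb ../blast_dir DROME,HUMAN
```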

USING NON-DEFAULT PATHS

-blastpath=<>
		Lets you specify the absolute or relative path to the blast databases. DEFAULT: the blast_dir directory of the HaMStR installation
-hmmpath=<>
		Lets you specify the absolute or relative path to the core ortholog set. DEFAULT: the core_orthologs directory of the HaMStR installation
-outpath=<>
		You can determine the path to the HaMStR output. Default: current directory.
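The three path options can be combined in a single call. The sketch below merely prints such a command line so that it can be inspected before it is run; the helper print_hamstr_cmd is hypothetical, and the remaining options are copied from test d) in section 3.

```shell
# Print (not run) a hamstr call that uses non-default paths.
# Arguments: 1 = path to core ortholog sets, 2 = path to blast dbs, 3 = output path.
print_hamstr_cmd() {
    echo "../bin/hamstr -sequence_file=testset-prot.fa -taxon=test2" \
         "-hmmset=modelorganisms_hmmer3 -refspec=DROME" \
         "-hmmpath=$1 -blastpath=$2 -outpath=$3"
}

# Example with assumed locations:
#   print_hamstr_cmd /data/core_orthologs /data/blast_dir ./hamstr_out
```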
        
ADDITIONAL OPTIONS

-append
		set this flag if the output should be appended to the files *.out and *_cds.out. This becomes relevant when running
		hamstrsearch with individual hmms and you want to combine the results.
-central
		set this flag to store the modified infile in the same directory as the infile rather than in the output dir.
-checkCoorthologsRef
		If the re-blast does not identify the original reference protein sequence as best hit, HaMStR will check whether the best blast 
		hit is likely a co-ortholog of the reference protein relative to the search taxon. NOTE: Setting this flag will substantially increase
		the sensitivity of HaMStR but most likely affect also the specificity, especially when the search taxon is evolutionarily only very 
		distantly related to the reference taxon.
-cleartmp
		set this flag to remove existing tmp dir in the HaMStR output directory.
-concat
		set this flag if you want hamstr to concatenate sequences that align to non-overlapping parts of the reference protein.
		If you choose this flag, no co-orthologs will be predicted.
-cpu
		You can specify the number of parallel jobs in the HaMStR search. HaMStR uses the Parallel::ForkManager module for this purpose.
-eval_blast=<>
		This option sets the e-value cut-off for the Blast search. Default: 10
-eval_hmmer=<>
		This option sets the e-value cut-off for the HMM search. Default: 1
-filter=<T|F>
		Set this flag to F if the re-blast should be performed without low-complexity filtering. Default is T.
-force
		Setting this flag forces hamstr to overwrite existing output files (files ending with .out) without further asking.
-hit_limit=<>
		By default, HaMStR will re-blast all hmmsearch hits against the reference proteome. Reduce the number
		of hits for reblast with this option.
-hmm
		Option to provide only a single hmm to be used for the search. 
		Note, this file has to end with .hmm 
-intron=<keep|mask|remove>
		Specify how to deal with introns that may occur in transcript sequences. Default: keep - Introns will be retained in the transcript
		but will be identified by lower case letters.
-longhead
		Set this flag if your sequence identifiers contain whitespace and you wish to keep
		the entire sequence identifier throughout your analysis. HaMStR will then replace the whitespace with 
		a '__'. If this flag is not set, HaMStR will truncate the sequence
		identifier at the first whitespace, but only if the truncated identifiers remain unique.
		NOTE: overly long sequence headers (more than roughly 30 characters) will cause trouble in the hmmsearch as the program will truncate
		the output!
-nonoverlapping_cos
		If you set this flag, non-overlapping co-orthologs will be reported as well. NOTE: this flag is still experimental
-rbh
		set this flag if you want to use a reciprocal best hit criterion. Only the highest scoring
		hit from the hmmer search will be used for re-blast.
-relaxed
		set this flag if the reciprocity criterion should count as fulfilled when the re-blast against
		any of the primer taxa was successful. Note that setting this flag will substantially decrease the
		stringency of the ortholog assignment with the consequence of an increased number of false positives.
-representative
		From all sequences that fulfill the reciprocity criterion the one showing the highest similarity to the
		core ortholog sequence in the reference species is identified and selected as representative.
-reuse
		Set this flag if you want to prevent HaMStR from overwriting previous results. 
-show_hmmsets
		setting this flag will list all available core ortholog sets in the specified path. Can be combined with -hmmpath.
-silent
		Suppresses (almost) all print statements to the screen
-sort_global_align
		setting this flag will tell hamstr to sort ortholog candidates according to their global alignment score to the reference
		sequence rather than according to the score they have achieved in the hmmer search (local). NOTE: In the case of searching
		EST data this flag is automatically set.  
-strict
		set this flag if the reciprocity criterion should count as fulfilled only when the re-blast against
		all primer taxa was successful
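Since long sequence headers can cause trouble in the hmmsearch (see the note for -longhead above), it may be useful to list them before starting a search. The following awk-based sketch prints every header that exceeds a given length; the helper long_headers is hypothetical, and the default limit of 30 characters simply follows the rough figure given in that note.

```shell
# Print all FASTA headers (without the leading '>') that are longer than
# the limit given as second argument (default: 30 characters).
long_headers() {
    awk -v max="${2:-30}" '
        /^>/ { id = substr($0, 2); if (length(id) > max) print id }
    ' "$1"
}

# Example call on the test data:
#   long_headers testset_cDNA.fa
```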

5) Generation of new core-ortholog sets (will be solved soon!)
This distribution comes with a couple of core-orthologs sets. However, you are free to
generate and use your own core-orthologs. There are a couple of conventions, however,
that should be obeyed. It may be the easiest way to have a look at the provided
files, but here are some general guidelines:
       a) give your core-ortholog set a name, e.g. custom1
       b) create a directory called 'custom1' in the core_orthologs directory
       c) create your ortholog cluster from your taxon set of interest and your
       	  favorite orthology prediction program.
       d) the sequences in the individual ortholog cluster must be in fasta format
       	  where the header should look like the following:
	  	>core-ortholog-name|taxon_name|protein-id
          The core-ortholog-name should also be the file name. The taxon_name should
	  of course be the name of the individual taxa used for the orthology prediction.
	  Please avoid blanks.
       e) align the sequences in the core-ortholog cluster.
       f) build and calibrate the hmms for the individual core-orthologs. The
       	  hmm file names must be 'core-ortholog-name.hmm'.
       g) put the hmms into a directory hmm_dir in the custom1 directory
       h) enter all sequences for the core-orthologs into a single file called
       	  custom1.fa and place this file in the directory custom1.
       i) for each of the taxa you wish to use as reference species in the hamstrsearch,
       	  generate a file containing all protein sequences that were used in the initial
	  orthology prediction. Name this file taxon_name_prot.fa and make sure that
	  no linebreaks interrupt a sequence. You can use the script nentferner.pl in the
	  bin directory to remove newlines.
       j) generate a directory taxon_name in the directory blast_dir and place taxon_name_prot.fa
          into this directory.
       k) run formatdb -n taxon_name_prot -t taxon_name_prot -i taxon_name_prot.fa
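Step i) requires that no line breaks interrupt a sequence. The script nentferner.pl in the bin directory is intended for this; the awk sketch below is not that script but a stand-in that illustrates the same idea. The helper name unwrap_fasta is hypothetical.

```shell
# Join the lines of each sequence so that every record consists of
# exactly one header line followed by one sequence line.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }
             { seq = seq $0 }
        END  { if (seq != "") print seq }
    ' "$1"
}

# Example call, writing to a new file:
#   unwrap_fasta taxon_name_prot.fa > taxon_name_prot.unwrapped.fa
```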
When you have completed all the above steps, which is admittedly a bit tedious, you should
be able to run the hamstrsearch with your own core-orthologs. Good luck with it!
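The header convention from step d) can be checked automatically. The sketch below lists all headers that do not consist of exactly three non-empty fields separated by '|' and free of blanks. It is a hypothetical helper, and it is stricter than the text requires in one respect: protein ids that themselves contain a '|' are reported as well.

```shell
# Print every FASTA header that does not follow the pattern
# >core-ortholog-name|taxon_name|protein-id (no blanks, exactly two '|').
bad_headers() {
    grep '^>' "$1" |
        grep -Ev '^>[^|[:space:]]+\|[^|[:space:]]+\|[^|[:space:]]+$' || true
}

# Example call on the combined sequence file from step h):
#   bad_headers custom1.fa
```

No output means that all headers follow the convention.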
Source: README.txt, updated 2017-03-07