Read Me
================================================================================
THE HITSA PIPELINE DOCUMENTATION
================================================================================
================================================================================
ABOUT
================================================================================
The HiTSA Pipeline was developed by researchers at the University of Idaho. Its homepage is here:
http://www.ibest.uidaho.edu/tools/hitsa/index.php
It is licensed under the Mozilla Public License (MPL) version 1.1:
http://www.mozilla.org/MPL/MPL-1.1.html
================================================================================
REQUIREMENTS
================================================================================
HiTSA is implemented as Bash and Perl scripts for Unix systems and uses the following bioinformatics programs:
NCBI -- http://www.ncbi.nlm.nih.gov/Class/BLAST/
    blastall
    formatdb
PHYLIP -- http://evolution.genetics.washington.edu/phylip.html
    dnadist
    neighbor
EMBOSS -- http://emboss.sourceforge.net/
    seqret
clustalw -- http://www.ciri.upc.es/cela_pblade/CLUSTALW.htm
BioPerl -- http://bio.perl.org/
It also uses the commonly available Unix tools awk, cut, gsed (GNU sed), and grep.
To run the pipeline in parallel, the following are also needed:
Sun Grid Engine (qsub and qstat) -- http://gridengine.sunsource.net/
clustalw-mpi -- http://web.bii.a-star.edu.sg/~kuobin/clustalw-mpi/
mpirun -- use the implementation that clustalw-mpi was compiled with. The most popular are LAM and MPICH:
    LAM: http://www.lam-mpi.org/
    MPICH: http://www-unix.mcs.anl.gov/mpi/mpich/
================================================================================
INSTALLATION
================================================================================
To install the HiTSA pipeline, go to the scripts directory.
Edit the scriptdir variable at the beginning of pipeline.bash so that it holds the path to the scripts folder (i.e., the folder containing the install_variables file).
Open the "install_variables" file in the scripts directory and edit the MPIRUN variable to give the location of the mpirun to use. (This is needed because many sites install both LAM and MPICH, and the mpirun found in the path may not be the one used to run the clustalw-mpi program.)
Edit the accompanying preference defaults to the desired values, and comment out any that will not be used.
It is also worth editing the default prefs file in the references folder so that researchers using it as a template will have less to change.
================================================================================
RUNNING
================================================================================
The pipeline needs two things to run:
1. A directory containing sequence files that have the suffix specified in the preferences file and are in one of the input formats accepted by seqret (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/seqret.html).
2. A preferences file in tab-delimited format. An example, with comments for each variable, is in the references folder as "prefs".
To run the pipeline:
/path/to/script/pipeline.bash /path/to/directory /path/to/preferencesfile
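Checking both inputs before a long run can save time. The following is a minimal sketch and is not part of HiTSA: check_inputs is a hypothetical helper, and the suffix is passed by hand because the name of the suffix variable in the preferences file is not given here.

```shell
# Pre-flight check for the two pipeline inputs (hypothetical helper).
# Usage: check_inputs <sequence directory> <preferences file> <suffix>
check_inputs() {
    local dir="$1" prefs="$2" suffix="$3" count=0 f
    if [ ! -d "$dir" ]; then
        echo "error: sequence directory not found: $dir" >&2
        return 1
    fi
    if [ ! -r "$prefs" ]; then
        echo "error: preferences file not readable: $prefs" >&2
        return 1
    fi
    # Count regular files in the directory whose names end with the suffix.
    for f in "$dir"/*"$suffix"; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "error: no files ending in $suffix in $dir" >&2
        return 1
    fi
    echo "ok: $count sequence file(s) found"
}
```

If the check succeeds, the pipeline can be started with the same directory and preferences file.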
================================================================================
DATABASES
================================================================================
We have been using three databases derived from the RDP (Ribosomal Database Project), which can be found at:
http://rdp.cme.msu.edu/index.jsp
Instructions to make these databases:
MAKING THE RDP DATABASE
* Strain: both
* Source: both
* Size >= 1200
* Taxonomy: Bergey's
* Sequence formats: GenBank
* Alignment gaps: remove all gaps
* Choose domain bacteria
MAKING THE TYPESTRAIN DATABASE
* Strain: type
* Source: isolates
* Size >= 1200
* Taxonomy: Bergey's
* Sequence formats: GenBank
* Alignment gaps: remove all gaps
* Choose domain bacteria
MAKING THE RDP_SPECIES DATABASE
* Strain: both
* Source: isolates
* Size >= 1200
* Taxonomy: Bergey's
* Sequence formats: GenBank
* Alignment gaps: remove all gaps
* Choose domain bacteria; deselect Putative Chimera and unclassified_bacteria
--------------
Use seqret to make the FASTA file from the GenBank file, giving the name
of the database as the output sequence:
seqret <genbank file> -osf fasta -outseq <name of database>
Use formatdb to build the BLAST database:
formatdb -i <name fasta file> -n <name of database> -p F -o T
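The two steps can be generated together for each database. This is a sketch only: db_build_commands is a hypothetical helper that prints the commands instead of running them, and it assumes the FASTA file written by seqret has the same name as the database, as described above.

```shell
# Print the seqret and formatdb commands for one database (dry run).
# Usage: db_build_commands <genbank file> <name of database>
db_build_commands() {
    local genbank="$1" db="$2"
    # Step 1: convert GenBank to FASTA, naming the output after the database.
    printf 'seqret %s -osf fasta -outseq %s\n' "$genbank" "$db"
    # Step 2: format that FASTA file as a nucleotide BLAST database.
    printf 'formatdb -i %s -n %s -p F -o T\n' "$db" "$db"
}
```

For example, "db_build_commands typestrain.gb TYPESTRAIN" prints both commands for review (the file and database names here are made up); piping the output to sh would run them.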
================================================================================
SCRIPTS
================================================================================
pipeline.bash -- the main script that manages all the other scripts.
Scripts used by pipeline.bash:
install_variables
Not so much a script as a repository for installation variables.
namechange.pl
Changes the FASTA sequence names in all files in the current working directory that have the given suffix to the name of the file.
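The renaming step can be illustrated for a single file. This is not the code of namechange.pl; rename_to_filename is a hypothetical function, and it assumes the new name is the filename with the suffix removed.

```shell
# Replace every FASTA header in a file with the file's own name.
# Usage: rename_to_filename <fasta file> <suffix>
rename_to_filename() {
    local file="$1" suffix="$2" name
    # Assumption: the sequence name is the filename without its suffix.
    name=$(basename "$file" "$suffix")
    awk -v name="$name" '
        /^>/ { print ">" name; next }   # header line: substitute the name
        { print }                       # sequence line: pass through
    ' "$file"
}
```

The result is written to standard output, so the original file is left untouched.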
countN2.pl
Cuts off the primer ends of the sequences and checks for percentages of Ns and minimum length. Only outputs sequences meeting those specifications.
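The quality filter can be sketched with awk. This is an illustration, not the code of countN2.pl: filter_fasta is a hypothetical function, primer trimming is left out because the trimming rule is not described here, and a sequence is kept when its share of Ns is at most the given percentage and its length is at least the minimum.

```shell
# Keep only sequences with few enough Ns and sufficient length.
# Usage: filter_fasta <max percent N> <min length> < input.fasta
filter_fasta() {
    awk -v maxn="$1" -v minlen="$2" '
        function emit(   n, tmp) {
            if (name == "") return
            tmp = seq
            n = gsub(/[Nn]/, "", tmp)        # number of N characters
            if (length(seq) >= minlen && n * 100 <= maxn * length(seq))
                print name "\n" seq
        }
        /^>/ { emit(); name = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                             # join wrapped lines
        END { emit() }
    '
}
```

Sequences that pass are printed with the sequence on a single line.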
blastpicks.pl
Prints the names and lengths of BLAST hits from a BLAST output file to standard output.
blastadd.pl
Adds up all the lengths for a given name in a file formatted with "NAME\tLENGTH" and prints out only the names whose total is above the minimum.
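The summing step maps directly onto an awk associative array. This is a sketch, not the code of blastadd.pl: sum_hit_lengths is a hypothetical function, "above the minimum" is read as strictly greater than, and the output is sorted only to make it reproducible.

```shell
# Sum the LENGTH column per NAME and print names whose total exceeds a minimum.
# Usage: sum_hit_lengths <minimum total> < name_length.tsv
sum_hit_lengths() {
    awk -F '\t' -v min="$1" '
        { total[$1] += $2 }                       # accumulate per name
        END {
            for (name in total)
                if (total[name] > min) print name
        }
    ' | sort
}
```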
splitgood.pl
Splits up a sequence file (in this case, the good sequences file) into separate files whose filenames are determined by the first word of the FASTA name.
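The splitting step can be sketched as follows. This is not the code of splitgood.pl: split_fasta is a hypothetical function, the ".fasta" extension on the output files is an assumption, and the first word of each name is assumed to be safe to use as a filename.

```shell
# Write each FASTA record to <output dir>/<first word of name>.fasta.
# Usage: split_fasta <output dir> < good.fasta
split_fasta() {
    awk -v dir="$1" '
        /^>/ {
            if (out != "") close(out)       # finish the previous file
            word = substr($1, 2)            # first word without ">"
            out = dir "/" word ".fasta"
        }
        out != "" { print $0 >> out }       # append header and sequence lines
    '
}
```

Because the files are opened in append mode, the output directory should be empty before each run.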
blastdata.bash
Runs BLAST for each good sequence against the specified database.
blastcheck.bash
If the number of blast files doesn't equal the number of sequences, something has gone wrong. This script checks to see which blast results are missing.
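The check amounts to comparing two sets of filenames. This is a sketch, not the code of blastcheck.bash: missing_blast is a hypothetical function, and the directory layout and suffixes are assumptions supplied by the caller.

```shell
# List sequences that have no (or an empty) BLAST result file.
# Usage: missing_blast <sequence dir> <sequence suffix> <blast dir> <blast suffix>
missing_blast() {
    local seqdir="$1" seqsuf="$2" blastdir="$3" blastsuf="$4" f base
    for f in "$seqdir"/*"$seqsuf"; do
        [ -f "$f" ] || continue
        base=$(basename "$f" "$seqsuf")
        # -s is true only if the result file exists and is not empty.
        [ -s "$blastdir/$base$blastsuf" ] || echo "$base"
    done
}
```

An empty output means every sequence has a result file.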
blastparser.pl
Parses the BLAST output into a file of tab-delimited lines containing the query name, query length, hit name, description, significance, percent identity, start, end, and length.
blastcull.pl
Takes BLAST search results made by blastparser.pl and culls the top result for each sequence.
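Culling can be sketched in one awk rule. This is not the code of blastcull.pl: top_hits is a hypothetical function, and it assumes the query name is the first tab-delimited column (as in the blastparser.pl output) and that the hits for each query are already listed best first.

```shell
# Keep only the first line seen for each query name (column 1).
# Usage: top_hits < parsed_blast.tsv
top_hits() {
    # seen[$1]++ is 0 (false) the first time a query name appears,
    # so the negation prints that line and skips later ones.
    awk -F '\t' '!seen[$1]++'
}
```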
nameshort.pl
Many programs only accept sequence IDs of at most 10 characters. This script rewrites the sequence IDs of the given FASTA input in the form SEQ####### and writes a name report recording which sequence ID belongs to which original name. The results can then pass through those programs with a 10-character ID that is changed back to the original name afterwards.
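The ID scheme can be sketched as follows. This is not the code of nameshort.pl: shorten_names is a hypothetical function, and the report layout (short ID, a tab, then the original name) is an assumption.

```shell
# Replace FASTA names with SEQ plus a 7-digit counter and record the mapping.
# Usage: shorten_names <name report file> < in.fasta > out.fasta
shorten_names() {
    awk -v report="$1" '
        /^>/ {
            n++
            id = sprintf("SEQ%07d", n)                 # always 10 characters
            print id "\t" substr($0, 2) > report       # ID, tab, original name
            print ">" id
            next
        }
        { print }                                      # sequence lines unchanged
    '
}
```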
qstatcheck.pl
Given a process list file, uses qstat to check the status of each process every 5 seconds. Returns when all processes have finished.
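The polling loop can be sketched in shell. This is not the code of qstatcheck.pl: wait_for_jobs and job_running are hypothetical functions, the list file is assumed to hold one job ID per line, and "qstat -j <id>" is assumed to exit with a non-zero status once the job has left the queue.

```shell
# True while the scheduler still knows the job.
# Assumption: "qstat -j <id>" fails once the job is no longer queued or running.
job_running() {
    qstat -j "$1" > /dev/null 2>&1 < /dev/null
}

# Block until no job in the list is running.
# Usage: wait_for_jobs <job list file> [poll interval in seconds]
wait_for_jobs() {
    local list="$1" interval="${2:-5}" id pending
    while :; do
        pending=0
        while read -r id; do
            [ -n "$id" ] || continue               # skip blank lines
            if job_running "$id"; then
                pending=1
                break                              # one live job is enough
            fi
        done < "$list"
        [ "$pending" -eq 1 ] || return 0
        sleep "$interval"
    done
}
```

Keeping the status test in its own function makes it easy to replace if a site's qstat behaves differently.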
findalign.pl
Finds the position where the last sequence starts and the position where the first sequence ends in an alignment.
makeneighbor.pl
Makes an "input script" for the neighbor program, given the species to root the tree on. This must be done on the fly because there is no way of knowing ahead of time which position the root species will occupy.
namesback.pl
Given a name report, changes the short ten-character IDs back to their original names as recorded in the name report.
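Restoring the names is a lookup of each short ID in the report. This is a sketch, not the code of namesback.pl: restore_names is a hypothetical function, and the report is assumed to hold the short ID, a tab, and the original name on each line.

```shell
# Replace every SEQ####### ID in the input with its original name.
# Usage: restore_names <name report file> < tree_or_fasta
restore_names() {
    awk -v report="$1" '
        BEGIN {
            # Load the ID-to-name table from the report.
            while ((getline line < report) > 0) {
                split(line, f, "\t")
                name[f[1]] = f[2]
            }
            close(report)
        }
        {
            rest = $0; out = ""
            # Walk the line left to right, swapping each ID that is in the table.
            while (match(rest, /SEQ[0-9][0-9][0-9][0-9][0-9][0-9][0-9]/)) {
                id = substr(rest, RSTART, RLENGTH)
                out = out substr(rest, 1, RSTART - 1) ((id in name) ? name[id] : id)
                rest = substr(rest, RSTART + RLENGTH)
            }
            print out rest
        }
    '
}
```

IDs that do not appear in the report are left as they are.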
run_clustal
Used to submit a clustalw-MPI job to the Sun Grid Engine.
run_blast
Used to submit blastall jobs to the Sun Grid Engine.
neighbor_script
Used as a template for controlling neighbor.
dnadist_script
Used as a template for controlling dnadist.
Libraries
searchnames.pl
Made to search through a distances file and match up the sequence