File        Date        Author           Commit
examples    2006-05-01  skittisheclipse  [r2] * Fixed an error where pipeline.bash tried to r...
references  2006-03-14  skittisheclipse  [r1] Initial testing of Subversion commission. Remo...
scripts     2006-05-01  skittisheclipse  [r2] * Fixed an error where pipeline.bash tried to r...
README      2006-03-14  skittisheclipse  [r1] Initial testing of Subversion commission. Remo...

================================================================================
THE HITSA PIPELINE DOCUMENTATION
================================================================================

================================================================================
ABOUT
================================================================================

The HiTSA Pipeline was developed by researchers at the University of Idaho.  Its homepage is here:

http://www.ibest.uidaho.edu/tools/hitsa/index.php

It is licensed under the MPL v 1.1:

http://www.mozilla.org/MPL/MPL-1.1.html

================================================================================
REQUIREMENTS
================================================================================

HiTSA is implemented as Bash and Perl scripts for Unix systems and uses the following bioinformatics programs:

NCBI -- http://www.ncbi.nlm.nih.gov/Class/BLAST/

blastall
formatdb

PHYLIP -- http://evolution.genetics.washington.edu/phylip.html

dnadist
neighbor

EMBOSS -- http://emboss.sourceforge.net/

seqret

clustalw -- http://www.ciri.upc.es/cela_pblade/CLUSTALW.htm

BioPerl -- http://bio.perl.org/

It also uses the commonly available Unix tools awk, cut, gsed (GNU sed), and grep.

In order to run the pipeline in parallel, the following are also needed:

Sun Grid Engine -- qsub and qstat -- http://gridengine.sunsource.net/
clustalw-mpi -- http://web.bii.a-star.edu.sg/~kuobin/clustalw-mpi/
mpirun -- from whichever MPI implementation clustalw-mpi was compiled with. The most popular are LAM and MPICH:
  LAM: http://www.lam-mpi.org/
  MPICH: http://www-unix.mcs.anl.gov/mpi/mpich/
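
A quick way to confirm that the programs listed above are installed is to look each one up on the PATH. The sketch below is not part of HiTSA: the function name missing_tools is made up for illustration, and it only checks that a command can be found, not which version it is. The binary name clustalw-mpi is taken from the list above and may differ on a given system.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Hypothetical helper, not shipped with HiTSA.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools needed for a serial run, as listed in this section:
#   missing_tools blastall formatdb dnadist neighbor seqret clustalw awk cut gsed grep
# Additional tools for a parallel run:
#   missing_tools qsub qstat clustalw-mpi mpirun
# BioPerl is a Perl library rather than a command; one way to test for it:
#   perl -MBio::SeqIO -e 1
```

If the function prints nothing, every name was found.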

================================================================================
INSTALLATION
================================================================================

To install the HiTSA pipeline, go to the scripts directory.

Edit the scriptdir variable at the beginning of pipeline.bash to indicate the path to the scripts folder (i.e., the folder containing the install_variables file).

Go to the "install_variables" file in the scripts directory.  Edit the MPIRUN variable to indicate the location of the version of mpirun to use.  (This was done because many people install both LAM and MPICH, and the mpirun in the path may not be the mpirun used to run the clustalw-mpi program.)

Edit the accompanying preference defaults to the desired values, and comment out any that will not be used.

It might also be good to edit the default prefs file in the references folder so that researchers using it as a template will have less to change.
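
Because more than one mpirun may be installed, it can help to list every copy on the PATH before choosing a value for MPIRUN. The following is a minimal sketch and not part of HiTSA; find_on_path is a hypothetical helper name. It cannot tell which copy clustalw-mpi was compiled against, so that still has to be checked by hand.

```shell
# List every executable called "$1" found on PATH, in search order.
# Hypothetical helper for choosing a value for MPIRUN; not part of HiTSA.
find_on_path() (
    name=$1
    set -f      # no filename expansion while splitting PATH
    IFS=:       # split on colons; the subshell body keeps this change local
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.   # an empty PATH entry means the current directory
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
)

# Example:
#   find_on_path mpirun
```

The first line printed is the mpirun a plain "mpirun" command would run; any later lines are other installed copies.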

================================================================================
RUNNING
================================================================================

The pipeline needs two things to run:

1.  A directory containing sequence files with the suffix specified in the preferences file, in a format compatible with one of the input formats accepted by seqret (http://www.hgmp.mrc.ac.uk/Software/EMBOSS/Apps/seqret.html).
2.  A preferences file in tab-delimited format. An example, with comments for each variable, is in the references folder as "prefs".

To run the pipeline:

/path/to/script/pipeline.bash /path/to/directory /path/to/preferencesfile
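
Before launching a run, it can help to confirm that both inputs are in place. The sketch below is an assumption-laden illustration rather than part of HiTSA: preflight is a hypothetical name, and the suffix is passed by hand because the sketch does not parse the preferences file.

```shell
# Check the two inputs pipeline.bash needs before starting a run.
# Usage: preflight <sequence directory> <preferences file> <suffix>
# Hypothetical helper, not shipped with HiTSA.
preflight() (
    seqdir=$1 prefs=$2 suffix=$3
    [ -d "$seqdir" ] || { echo "not a directory: $seqdir" >&2; exit 1; }
    [ -r "$prefs" ]  || { echo "cannot read preferences file: $prefs" >&2; exit 1; }
    # Succeed as soon as one regular file with the given suffix is found.
    for f in "$seqdir"/*"$suffix"; do
        [ -f "$f" ] && exit 0
    done
    echo "no files ending in '$suffix' in $seqdir" >&2
    exit 1
)

# Example (".seq" is a made-up suffix; use the one from the preferences file):
#   preflight /path/to/directory /path/to/preferencesfile .seq &&
#       /path/to/script/pipeline.bash /path/to/directory /path/to/preferencesfile
```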

================================================================================
DATABASES
================================================================================

We have been using three flavors of the RDP database, which can be found at:

http://rdp.cme.msu.edu/index.jsp

Instructions to make these databases:

MAKING THE RDP DATABASE

* Strain: both
* Source: both
* Size >= 1200
* Taxonomy: Bergey's
* Sequence formats: GenBank
* Alignment gaps: remove all gaps

* Choose domain bacteria

MAKING THE TYPESTRAIN DATABASE

* Strain: type
* Source: isolates
* Size >= 1200
* Taxonomy: Bergey's
* Sequence formats: GenBank
* Alignment gaps: remove all gaps

* Choose domain bacteria

MAKING THE RDP_SPECIES DATABASE

* Strain: both
* Source: isolates
* Size >= 1200
* Taxonomy: Bergey's
* Sequence formats: GenBank
* Alignment gaps: remove all gaps

* Choose domain bacteria; deselect Putative Chimera and unclassified_bacteria

--------------

Use seqret to make the FASTA file from the GenBank file. Use the name of the
database when it asks for the output sequence:
  
  seqret <genbank file> -osf fasta -outseq <name of database>

Use formatdb to build the BLAST database from the FASTA file:

  formatdb -i <name of fasta file> -n <name of database> -p F -o T
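
The two commands above can be wrapped in a small function so that each of the three databases is built the same way. This is only a sketch: make_blast_db and the DRY_RUN switch are hypothetical names, the flags are copied from the commands above, and the FASTA file is given the same name as the database, as the instructions suggest.

```shell
# Convert a GenBank download to FASTA and index it as a nucleotide BLAST database.
# Usage: make_blast_db <genbank file> <name of database>
# Set DRY_RUN=1 to print the commands instead of running them.
# Hypothetical helper, not shipped with HiTSA.
make_blast_db() (
    genbank=$1 dbname=$2
    run() {
        if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
    }
    # Step 1: GenBank -> FASTA, written to a file named after the database.
    run seqret "$genbank" -osf fasta -outseq "$dbname" || exit 1
    # Step 2: index the FASTA file (-p F marks the input as nucleotide).
    run formatdb -i "$dbname" -n "$dbname" -p F -o T
)

# Example (file and database names are placeholders):
#   DRY_RUN=1 make_blast_db download.gb RDP
```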

================================================================================
SCRIPTS
================================================================================

pipeline.bash -- the main script that manages all the other scripts.

Scripts used by pipeline.bash:

  install_variables
    Not so much a script as a repository for installation variables.

  namechange.pl
    Changes the FASTA sequence names in all files in the current working directory that have the given name suffix to the name of the file.
    
  countN2.pl
    Cuts off the primer ends of the sequences and checks for percentages of Ns and minimum length.  Only outputs sequences meeting those specifications.
    
  blastpicks.pl
    Prints the names and lengths of the BLAST hits in a BLAST output file to standard out.
    
  blastadd.pl
    Adds up all the lengths for a given name in a file formatted with "NAME\tLENGTH" and prints out only the names whose total is above the minimum.
  
  splitgood.pl
    Splits up a sequence file (in this case, the good sequences file) into separate files whose filenames are determined by the first word of the FASTA name.
  
  blastdata.bash
    Blasts each good sequence against the specified database.
  
  blastcheck.bash
    If the number of blast files doesn't equal the number of sequences, something has gone wrong.  This script checks to see which blast results are missing.
  
  blastparser.pl
    Parses the BLAST output into a file with tab-delimited lines containing the query name, query length, hit name, description, significance, percent identity, start, end, and length.
    
  blastcull.pl
    Takes BLAST search results made by blastparser.pl and culls the top result for each sequence.
  
  nameshort.pl
    Many programs will only accept sequence IDs 10 characters in length.  This script replaces the sequence IDs in the given FASTA input with IDs of the form SEQ####### and writes a name report detailing which sequence ID belongs to which name.  This way, the results can pass through the programs with a 10-character ID that can be changed back to the original name afterwards.
  
  qstatcheck.pl
    Given a process list file, uses qstat every 5 seconds to check the status of each process.  Returns when all processes are finished.
  
  findalign.pl
    Finds where the last sequence starts and the first sequence ends from an alignment.
  
  makeneighbor.pl
    Makes an "input script" for the neighbor program, given the root species to root the tree around.  This must be done on the fly; there is no way of knowing ahead of time which position the root species will occupy.
  
  namesback.pl
    Given a name report, changes the short ten-character IDs back to their original names as recorded in the name report.
    
  run_clustal
    Used to submit a clustalw-MPI job to the Sun Grid Engine.
  
  run_blast
    Used to submit blastall jobs to the Sun Grid Engine.
  
  neighbor_script
    Used as a template for controlling neighbor.
  
  dnadist_script
    Used as a template for controlling dnadist.
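
The round trip performed by nameshort.pl and namesback.pl can be illustrated with awk. This is a sketch of the idea only, not the Perl scripts themselves: shorten_names and restore_names are hypothetical names, and the report layout (short ID, a tab, then the original name) is an assumption about what a name report could look like.

```shell
# Replace each FASTA header with SEQ0000001, SEQ0000002, ... and write a
# "short ID<TAB>original name" report to the file named in $1.
# Reads FASTA on stdin, writes the renamed FASTA on stdout.
shorten_names() {
    awk -v report="$1" '
        /^>/ {
            id = sprintf("SEQ%07d", ++n)            # 3 + 7 = 10 characters
            print id "\t" substr($0, 2) > report    # record the original name
            print ">" id
            next
        }
        { print }
    '
}

# Replace every SEQ####### found on stdin with its original name from the
# report file named in $1. IDs that are not in the report are left alone.
restore_names() {
    awk -v report="$1" '
        BEGIN {
            while ((getline line < report) > 0) {
                tab = index(line, "\t")
                name[substr(line, 1, tab - 1)] = substr(line, tab + 1)
            }
        }
        {
            out = ""
            while (match($0, /SEQ[0-9][0-9][0-9][0-9][0-9][0-9][0-9]/)) {
                id = substr($0, RSTART, RLENGTH)
                out = out substr($0, 1, RSTART - 1) (id in name ? name[id] : id)
                $0 = substr($0, RSTART + RLENGTH)
            }
            print out $0
        }
    '
}

# Example (file names are placeholders):
#   shorten_names names.report < good.fasta > short.fasta
#   restore_names names.report < outtree > outtree.named
```

Because restore_names works on any text, the same report can be applied to alignments, distance files, or trees produced from the shortened sequences.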
    
Libraries

  searchnames.pl
    Made to search through a distances file and match up the sequence