Copyright (C) 2011-2015  The J. Craig Venter Institute (JCVI).  All rights reserved.

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

PATCHES
-------

Suggested updates should be directed to Derrick Fouts (dfouts@jcvi.org) or Granger Sutton (gsutton@jcvi.org) for consideration.

INTRODUCTION
------------

PanOCT, Pan-genome Ortholog Clustering Tool, a heuristic computer program, was created as a tool for pan-genomic analysis of closely related prokaryotic species or strains.  For more information please visit the PanOCT website at http://panoct.sourceforge.net/

PanOCT was written by: 

Derrick E. Fouts, Ph.D.
Associate Professor
Genomic Medicine

and 

Granger Sutton, Ph.D.
Professor
Informatics

The J. Craig Venter Institute (JCVI)
9704 Medical Center Drive
Rockville, MD  20850
(301) 795-7874
dfouts@jcvi.org

SYSTEM REQUIREMENTS
-------------------

The programs should run on all Unix platforms.  They have been tested on CentOS Linux and Mac OS X 10.6 only; other Unix platforms are untested.

Memory usage guide (based on 4 Mbp bacterial genomes, estimations not guaranteed and based on version 1 of the code):
1 GB RAM	1-5 genomes
2 GB RAM	1-8 genomes
4 GB RAM	1-12 genomes
8 GB RAM	1-18 genomes
14 GB RAM	1-25 genomes

SOFTWARE REQUIREMENTS/DEPENDENCIES
----------------------------------

PanOCT.pl requires the following programs or packages for full functionality:

PERL version 5.10.0 or later (http://www.perl.org)

NCBI BLAST+ version 2.218 or later [Linux] (ftp://ftp.ncbi.nih.gov/blast/executables/blast+/) or NCBI BLASTALL version 2.2.10 or later [Linux] (ftp://ftp.ncbi.nih.gov/blast/executables/release/) or WUBLAST 2.0 (now AB-BLAST at http://blast.advbiocomp.com/)

Getopt::Std [should be installed with PERL]

Cwd to determine the current working directory (available from cpan.org).

Data::Dumper PERL module version 2.131 or later (available from cpan.org).  Used for debugging.

Scalar::Util PERL module version 1.41 (available from cpan.org).  Used to test whether cluster levels are numbers.

INCLUDED IN DISTRIBUTION
------------------------

-PERL SCRIPTS AND MODULES- 

/panoct/bin/panoct.pl:  The main PERL script for finding orthologous proteins.

INSTALLATION
------------

First, place the distribution tarball into a destination directory.

Second, uncompress the distribution tarball by typing:

% tar -xvzf panoct_v<ver#>.tar.gz

REQUIRED INPUT FILES
--------------------

1) NCBI BLAST+ or BLASTALL(-m 8 or 9 option) or WU-BLAST tab-delimited input file
2) A text file containing unique genome identifiers, one identifier per line, to determine which genome is to be treated as the reference genome in the output files and which genomes to include in the analysis.  The genome identifier can be associated with specific proteins in two ways:  a) by placing the identifier after the protein identifier (e.g. NT08AB0001-GENOME_IDENTIFIER) or b) in the gene attribute file.
3) The gene attribute file is a tab-delimited file containing the following data:  contig id, protein identifier (e.g. locus), 5' coordinate, 3' coordinate, annotation, and genome identifier.
4) The protein fasta file used in the all-versus-all BLASTP searches.  The protein fasta file is used by PanOCT to calculate the length of each protein, which is necessary in order to compute the BSR. 

INVOCATION
----------

  Usage: panoct.pl <options>
Example: panoct.pl -t example_blast.txt -f example_tags.txt -g example.gene_att -P example.pep -S Y -L 1 -M Y -H Y -V Y -N Y -F 1.33 -G y -c 0,25,50,75,100 -T
Version: <ver#>
 Option:
     -h: print this help page
     -c: argument is a comma separated list of numbers between 0-100 (0,50,75,100 might be a good choice) each number represents the percent of genomes
         needed for a cluster to be considered a core cluster. For each number two files are generated: one for core clusters and one for noncore
         clusters. The core file shows the order and orientation of core clusters with respect to each other. The noncore file shows the order and
         orientation of noncore clusters with respect to core clusters.
     -e: argument is a comma separated list of numbers between 0-100 each number represents the percent of genomes needed for a cluster to be considered
         a core cluster. For each number BSR distance matrices are computed using only the "core" clusters. [DEFAULT = 0,100]
     -d: no argument, generates a multifasta file of cluster centroids [on by DEFAULT]
     -T: no argument, prints cluster numbers as the first column in all table files [off by DEFAULT]
     -W: window size on either side of match to use CGN [DEFAULT = 5] must be between [1,20]
     -b: base directory path [DEFAULT = PWD]
     -p: path to btab file [DEFAULT = base directory]
     -t: name of btab (wublast-style or ncbi -m8 or -m9) input file [REQUIRED]
     -f: file containing unique genome identifier tags [REQUIRED]
     -g: gene attribute file (asmbl_id<tab>protein_identifier<tab>end5<tab>end3<tab>annotation<tab>genome_tag)
     -P: name of concatenated .pep file [REQUIRED to calc protein lengths]
     -Q: path to .pep file [DEFAULT = base directory]
     -i: aa % identity cut-off [DEFAULT = 35.0]
     -I: aa % identity cut-off for frameshift detection [DEFAULT = 35.0]
     -E: E-value [DEFAULT = 0.00001]
     -L: Minimum % match length [DEFAULT = 1]
     -H: Want to create a file with the id and annotation of each protein in a cluster?  (y)es or (n)o [DEFAULT = NO]
     -V: Want to create an ortholog matchtable?  (y)es or (n)o [DEFAULT = YES]
     -N: Want to create a normalized BLAST score file?  (y)es or (n)o [DEFAULT = NO]
     -M: Want to create microarray-like data for normalized BLAST scores?  (y)es or (n)o [DEFAULT = NO]
     -H: Want to create a table of hits (y)es or (n)o [DEFAULT = NO]
     -G: Want to create a normalized BLAST score histogram file containing a row for each genome and two rows for each pair of genomes?  (y)es or (n)o [DEFAULT = NO]
     -C: Want to create a file with the number of matches within clusters and statistics on protein length per cluster?  (y)es or (n)o [DEFAULT = YES]
     -A: Want to create a file grouping ortholog clusters which appear to be close paralogs?  (y)es or (n)o [DEFAULT = YES]
     -U: Want to create a file with ids of proteins which appear to be fragments or fusions based on clustering?  (y)es or (n)o [DEFAULT = YES]
     -B: Want to create a file with pairwise similarity scores used for clustering?  (y)es or (n)o [DEFAULT = YES]
     -S: Want to use strict criteria for ortholog determination?  (y)es, (m)intermediate or (n)o [DEFAULT = YES]
     -F: Deprecate shorter protein fragments when protein is split due to frameshift or other reason
         Takes an argument between 1.0 and 2.0 as a length ratio test - recommended value is 1.33 [DEFAULT = off]
     -a: Number of amino acids at the beginning or end of a match that can be missing and still be
         considered a full length match - must be between 0 and 100 - [DEFAULT = 20]
     -s: Number of blast matches needed to confirm a protein fragment/frameshift [DEFAULT = 1]
     -D: DEBUG MODE (DEFAULT = off)
 Output: All stored within a subdirectory of the current working directory (PWD)
          1) panoct_report.txt:  a file containing runtime parameters used (e-value, %id, match length, and blast file used)
          2) panoct_matchtable.txt:  a tab-delimited file containing PanOCT clusters, one cluster per line.
                                     The first column is the reference genome and all subsequent columns are the remaining genomes
                                     in the order specified in the genome identifier "tags" file (specified with option f).
                                     e.g. NT08AB0001	NT16AB0001	NT17AB0001	NT20ABA0020
          3) panoct_matchtable_id.txt:  a tab-delimited file similar to panoct_matchtable.txt, but also containing the percent identity of each target protein in parentheses.
                                        e.g. NT08AB0001      NT16AB0001 (99.78%)     NT17AB0001 (99.26%)     NT20ABA0020 (99.78%) 
          4) panoct_id.txt:  a tab-delimited file containing reference protein annotation and percent identities to orthologs in each target genome.
                             e.g. NT08AB0001	chromosomal replication initiator protein DnaA	99.78	99.26	99.78
          5) panoct_frameshifts.txt:  a tab-delimited file containing proteins that are likely split due to frame-shifts.  It is organized by genome and assembly/contig 
                                      e.g. >genome ntab08
                                           >asmbl_id 1
                                           NT08AB3019	NT08AB3018	NT08AB3020

 Authors: Derrick E. Fouts, Ph.D. and Granger Sutton, Ph.D.
 Date: December 21, 2004; last revised March 25, 2015

=> Detailed description of input and output files:

PanOCT requires several input files to specify the genomes, genomic features, and feature attributes for PanOCT to cluster.

The input files are:

A genome tag file. The path to this file is a concatenation of the base directory specified by the -b option (default is
the working directory) and the genome tag file name specified by the -f option.

A feature attribute file. The path to this file is a concatenation of the base directory specified by the -b option (default
is the working directory) and the attribute file name specified by the -g option.

A tabular Blast output file comparing all features to be clustered against all features. The path to this file is a concatenation
of the Blast directory specified by the -p option (default is the base directory) and the Blast file name specified by the -t option.

A multi-fasta file of the protein sequences for the features. The path to this file is a concatenation of the protein directory
specified by the -Q option (default is the base directory) and the protein file name specified by the -P option.

The genome tag file is a list of unique identifiers (tags) for each genome to be clustered. The tags are strings with no whitespace
characters. There is one tag per line. The order of the tags in the file is used to order the columns in some output files and
for other ordering purposes. The genome named by the first tag in the file is called the reference genome. Some outputs give more
information about the reference genome, and some files use it to determine row ordering.
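
As an illustration only, the rules above for the genome tag file can be checked with a short Python sketch. The function name
read_genome_tags is hypothetical and is not part of PanOCT; the duplicate check reflects the requirement that tags be unique.

```python
def read_genome_tags(lines):
    """Return (reference_tag, ordered_tags) from the lines of a genome tag file.

    Hypothetical helper, not part of PanOCT: it only mirrors the documented
    rules (one tag per line, no whitespace inside a tag, tags unique).
    """
    tags = []
    for line_no, line in enumerate(lines, start=1):
        tag = line.strip()
        if not tag:
            continue  # tolerate blank lines in this sketch
        if any(ch.isspace() for ch in tag):
            raise ValueError(f"line {line_no}: tag contains whitespace: {tag!r}")
        if tag in tags:
            raise ValueError(f"line {line_no}: duplicate tag: {tag!r}")
        tags.append(tag)
    if not tags:
        raise ValueError("genome tag file is empty")
    # The first tag names the reference genome; file order is preserved.
    return tags[0], tags
```

Calling read_genome_tags(open("example_tags.txt")) would return the reference tag together with the full ordered list.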

The feature attribute file specifies the needed attributes of a feature one per line. Currently the only types of features supported
are proteins. The columns are tab delimited. The first column must be a positive integer which represents the molecule number for
finished genomes or the contig/scaffold number for unfinished genomes. Typically the numbers go from 1 to the number of molecules. The
second column is the feature identifier (feat_id) which must be unique across all features not just within a genome. The third column
is the start coordinate on the molecule for the feature. The fourth column is the end coordinate for the feature on the molecule. If
the feature is on the reverse strand column 3 should be greater than column 4. Column 5 is the name/annotation for the feature. Column 6
is the genome tag for the feature (must correspond to a tag in the genome tag file).
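
The six-column layout described above can be sketched in Python as follows. This is a minimal reader written for illustration, not
PanOCT's own parser; the names Feature and parse_attributes are assumptions, and the checks only restate the rules in the text.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    molecule: int     # column 1: molecule or contig/scaffold number (positive)
    feat_id: str      # column 2: identifier, unique across all genomes
    start: int        # column 3: start coordinate
    end: int          # column 4: end coordinate
    annotation: str   # column 5: name/annotation
    genome_tag: str   # column 6: must appear in the genome tag file

    @property
    def strand(self):
        # Reverse-strand features are written with start > end.
        return "-" if self.start > self.end else "+"


def parse_attributes(lines, known_tags):
    """Parse feature attribute lines, enforcing the documented constraints."""
    features, seen = [], set()
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        cols = line.split("\t")
        if len(cols) != 6:
            raise ValueError(f"line {line_no}: expected 6 columns, got {len(cols)}")
        molecule, feat_id, start, end, annotation, tag = cols
        if int(molecule) <= 0:
            raise ValueError(f"line {line_no}: molecule number must be positive")
        if feat_id in seen:
            raise ValueError(f"line {line_no}: duplicate feat_id {feat_id}")
        if tag not in known_tags:
            raise ValueError(f"line {line_no}: unknown genome tag {tag}")
        seen.add(feat_id)
        features.append(
            Feature(int(molecule), feat_id, int(start), int(end), annotation, tag)
        )
    return features
```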

The Blast file is a Blast tabular output file in either WU-BLAST or NCBI-BLAST tabular format.

The protein file is a multi-fasta formatted file with a protein sequence for every feature in the feature attribute file. Each fasta
header line must have the unique feat_id from the feature attribute file immediately after the >. Free text is allowed after the
feat_id which is treated as the protein name/annotation (same as column 5 in the feature attribute file).
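
Since the protein file is used to obtain the length of each protein, the following Python sketch shows one way such lengths could be
derived from a multi-fasta file. It is an illustration under the header convention described above, not the code PanOCT uses, and
protein_lengths is a hypothetical name. It counts every sequence character, so a trailing stop symbol, if present, would be included.

```python
def protein_lengths(lines):
    """Map each feat_id to its sequence length from multi-fasta lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            if not fields:
                raise ValueError("fasta header without a feat_id")
            # The feat_id is the first token after '>'; the rest is annotation.
            current = fields[0]
            lengths[current] = 0
        elif current is not None:
            # Sequence may span several lines; accumulate the residue count.
            lengths[current] += len(line)
    return lengths
```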

****************************************************************************************

PanOCT generates a number of output files by default or specified by optional parameters.

A group of related files are known as the table files. These files all output one row of output per cluster and use tabs to delimit the
columns. The columns correspond to the genomes input to PanOCT. The columns are ordered in the same order of appearance as the genome
tags in the genome tag file with the reference genome first. The -T option specifies printing the cluster number as the first column of
all of the table files (default is not to print the cluster number). The clusters (rows in the file) are ordered by appearance of the
feat_id in the reference genome ordered first by molecule number and then by start coordinate. If there is no representative from the
reference genome in the cluster then the ordering by the second genome in the genome tag file is used and so on. The clusters are
implicitly numbered from 1 to the number of clusters as determined by this ordering. The table files are:

matchtable.txt has the feat_id for each member of a cluster. Columns corresponding to genomes which are not represented in a cluster
contain ---------- instead. Outputting this file is controlled by the -V parameter (YES or NO, default is YES).

matchtable_0_1.txt uses 0s and 1s for membership in a cluster. Columns corresponding to genomes which are represented in a cluster contain
a 1 while genomes not in the cluster contain 0 instead. Outputting this file is controlled by the -V parameter (YES or NO, default is YES).
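
The relationship between matchtable.txt and matchtable_0_1.txt can be illustrated with a small Python sketch. It assumes that a
missing member is a cell consisting only of hyphens (as in the placeholder shown above) and that, when -T is used, the first column
is the cluster number. The function name to_presence_row is hypothetical.

```python
def to_presence_row(row, has_cluster_number=False):
    """Convert one matchtable.txt row into 0/1 membership values."""
    cols = row.rstrip("\n").split("\t")
    # With -T the first column is the cluster number and is passed through.
    prefix = cols[:1] if has_cluster_number else []
    members = cols[1:] if has_cluster_number else cols
    # A cell made only of hyphens (or empty) means the genome is absent.
    return prefix + ["0" if not m.strip("-") else "1" for m in members]
```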

matchtable_id.txt has the feat_id for each member of a cluster as well as the percent identity of the match to the reference genome
protein in parentheses following the feat_id. If the cluster does not contain a protein from the reference genome percent identity is given
to the first feat_id in the row. If the percent identity is 0.00 this means that the match between the proteins either did not exist in the
Blast file or fell below some cutoff. Columns corresponding to genomes which are not represented in a cluster are left blank instead.
Outputting this file is controlled by the -V parameter (YES or NO, default is YES).

id.txt has the feat_id and the feature's name/annotation for the reference genome as the first two columns followed by the percent identity
of the match to the reference genome protein. If the cluster does not contain a protein from the reference genome a row for that cluster is
not output. If the percent identity is 0.00 this means that the match between the proteins either did not exist in the Blast file or fell
below some cutoff. Columns corresponding to genomes which are not represented in a cluster are left blank instead. Outputting this file is
controlled by the -V parameter (YES or NO, default is YES).

micro.txt has the feat_id and the feature's name/annotation for the reference genome as the first two columns followed by a score from
1 (perfect match) to 100 (no match) of the match to the reference genome protein. This file is meant to mimic microarray hybridization data
to a reference genome based microarray. If the cluster does not contain a protein from the reference genome a row for that cluster is not
output. Columns corresponding to genomes which are not represented in a cluster are left blank instead. Outputting this file is controlled
by the -M parameter (YES or NO, default is NO).

BSR.txt has the feat_id and the feature's name/annotation for the reference genome as the first two columns followed by a Blast Score Ratio
(BSR) from 1 (perfect match) to 0 (no match) of the match to the reference genome protein. If the cluster does not contain a protein from
the reference genome a row for that cluster is not output. Columns corresponding to genomes which are not represented in a cluster are left
blank instead. Outputting this file is controlled by the -N parameter (YES or NO, default is NO).

hits.txt has the feat_id and the feature's name/annotation in square brackets for the reference genome as the first column followed by the
feat_id and the feature's name/annotation in square brackets for other members of the cluster. If the cluster does not contain a protein from
the reference genome a row for that cluster is not output. Columns corresponding to genomes which are not represented in a cluster are left
blank instead. Outputting this file is controlled by the -H parameter (YES or NO, default is NO).

****************************************************************************************

Another class of PanOCT output files are the matrix files.

Matrix files have entries for a square matrix where the rows and columns correspond to the genomes given to PanOCT representing some pairwise
measure between the genomes. Labels for the genomes are the last 7 characters of each genome tag.

The 8 matrix files are:

pairwise_identity_matrix.txt is a similarity matrix where each pairwise entry is the mean percent identity of matches between the genomes
with high CGN.

pairwise_BSR_matrix.txt is a similarity matrix where each pairwise entry is the mean BSR score of matches between the genomes with high CGN
multiplied by 100 to scale the scores between 0 to 100.

pairwise_BSR_distance_matrix.txt is a distance matrix where each entry in the BSR similarity matrix is subtracted from 100.

pairwise_BSR_distance_matrix_phylip.txt is a distance matrix where each entry in the BSR similarity matrix is subtracted from 100. The format is
modified to be compatible with the Phylip tree building tool.
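
The arithmetic behind the BSR similarity and distance matrices can be sketched in Python as below. The selection of "matches with
high CGN" is not reproduced here; the sketch assumes the BSR values for one genome pair have already been collected. The function
names are hypothetical, and returning 0.0 for a pair with no matches is an assumption of this sketch.

```python
from statistics import mean


def bsr_similarity(bsr_values):
    """Mean BSR (0 to 1) of the selected matches, scaled to 0 to 100."""
    # Assumption: a genome pair with no qualifying matches scores 0.0.
    return 100.0 * mean(bsr_values) if bsr_values else 0.0


def to_distance(similarity_matrix):
    """Subtract every similarity entry from 100 to get a distance matrix."""
    return [[100.0 - value for value in row] for row in similarity_matrix]
```

For example, two matches with BSR 0.5 and 1.0 give a similarity of 75 and therefore a distance of 25.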

pairwise_cluster_similarity_matrix.txt is a similarity matrix where each pairwise entry is a measure of shared gene content between two
genomes A and B. The measure used is (number of clusters in common between A and B) / ((number of clusters in A + number of clusters in B) / 2).
This measure is then multiplied by 100 to scale the values to be from 0 to 100.

pairwise_cluster_distance_matrix.txt is a distance matrix for shared gene content where we subtract every element of the above similarity
matrix from 100.

pairwise_cluster_distance_matrix_phylip.txt is a distance matrix for shared gene content where we subtract every element of the above similarity
matrix from 100. The format is modified to be compatible with the Phylip tree building tool.
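
The shared gene content measure given above can be written out as a short Python sketch. Each genome is represented here by the set
of cluster numbers it appears in; that representation and the function names are assumptions made for illustration.

```python
def cluster_similarity(clusters_a, clusters_b):
    """Shared gene content of genomes A and B, scaled to 0 to 100.

    Implements (clusters in common) / ((|A| + |B|) / 2) * 100.
    """
    shared = len(clusters_a & clusters_b)
    mean_size = (len(clusters_a) + len(clusters_b)) / 2
    if mean_size == 0:
        return 0.0  # assumption: two genomes with no clusters score 0
    return 100.0 * shared / mean_size


def cluster_distance(clusters_a, clusters_b):
    """Distance form: the similarity subtracted from 100."""
    return 100.0 - cluster_similarity(clusters_a, clusters_b)
```

With A in clusters {1,2,3,4} and B in clusters {3,4,5,6,7,8}, two clusters are shared and the mean size is 5, giving a similarity of
40 and a distance of 60.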

pairwise_cutoffs_matrix.txt contains the BSR cutoffs determined for any pair of genomes when the strict orthologs option is used (-S YES,
default is NO). If a potential ortholog match does not have very much CGN (conserved gene neighbors) then it must exceed the pairwise BSR
cutoff.

Under control of the -e option, four additional matrix files are output for each threshold given [DEFAULT = 0,100]. The name and format of
the four files are the same as for the first four matrix files except the name is preceded by threshold_.

****************************************************************************************

missing_blast_results.txt is a file of one feat_id per line, listing each feat_id that was specified in the feature attribute file but either
did not appear at all in the tabular Blast file or appeared only as a search result and never as a query. This should never happen for an
all-against-all Blast search. feat_ids which do not appear at all in the tabular Blast file are ignored.

****************************************************************************************

histograms.txt is a file of histograms, one per line, that are used by PanOCT to determine pairwise BSR cutoffs for separating paralogs from
orthologs under the -S YES option. These are histograms of 101 bins: the first 100 bins are evenly divided from >= 0 to < 1 and the last bin
is = 1. Remember BSR scores are normalized Blast scores from 0 to 1. If no match of a given type exists for a feat_id then the 0 bin is
incremented. Four types of histograms are labeled and output. Self histograms capture the best match within a genome to a feat_id that is not
the query feat_id (a paralog). Good_CGN histograms capture the best match between a pair of genomes that has good CGN support (more than half
of maximum possible) for each feat_id. Best histograms capture the best match between a pair of genomes for each feat_id. Second histograms
capture the second best match between a pair of genomes for each feat_id (presumably a paralog).
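
The 101-bin layout described above can be illustrated with the following Python sketch. It is not PanOCT's implementation; the names
bsr_bin and build_histogram are hypothetical, and a missing match is represented by None. Note that multiplying a floating point BSR
by 100 can land just below a bin boundary, so an exact reimplementation may need to round differently.

```python
def bsr_bin(bsr):
    """Return the bin index (0 to 100) for a BSR value in [0, 1]."""
    if not 0.0 <= bsr <= 1.0:
        raise ValueError(f"BSR out of range: {bsr}")
    if bsr == 1.0:
        return 100  # the last bin holds exact matches only
    # Bins 0..99 evenly divide the interval [0, 1).
    return min(int(bsr * 100), 99)


def build_histogram(best_bsr_per_feature):
    """Count one entry per feat_id; None (no match) increments bin 0."""
    bins = [0] * 101
    for bsr in best_bsr_per_feature:
        bins[0 if bsr is None else bsr_bin(bsr)] += 1
    return bins
```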

****************************************************************************************

frameshifts.txt is a file of probable protein fragments that are adjacent or on the ends of contigs. This fragment detection and file generation
only occurs under the strongly recommended -F option. If a pair of nearly adjacent proteins or nearly at the ends of contigs (the "nearly" is to
allow for spurious proteins such as bad gene calls or transposon insertions) have mostly nonoverlapping matches to other proteins and fewer full
length matches to other proteins then the proteins are flagged as fragments of the same gene. The longest matching fragment is "retained" and
treated as the sole protein of this gene for further analysis. This feat_id is output first on the line and then a tab delimited list of other
fragments from the same gene is output on the same line.
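
A reader for this file might look like the Python sketch below. It assumes the layout shown in the usage example for
panoct_frameshifts.txt (a ">genome" line, an ">asmbl_id" line, then tab-delimited feat_id lines); parse_frameshifts is a hypothetical
name and the file is not guaranteed to follow this layout in every version.

```python
def parse_frameshifts(lines):
    """Return (genome, asmbl_id, retained_feat_id, other_fragments) tuples."""
    genome = asmbl_id = None
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">genome"):
            genome = line.partition(" ")[2].strip()
        elif line.startswith(">asmbl_id"):
            asmbl_id = line.partition(" ")[2].strip()
        else:
            # First feat_id is the retained fragment; the rest are the
            # other fragments assigned to the same gene.
            retained, *others = line.split("\t")
            records.append((genome, asmbl_id, retained, others))
    return records
```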

****************************************************************************************

fragments_fusions.txt is a file containing labeled feat_ids that have been determined to be likely fragments of a single protein
or fusions of multiple proteins. This file is only output if the -U YES option is specified. The determination of probable fragments and fusions
is based on the length of high quality matches within or between clusters where one protein is significantly shorter than another. The fragments
identified during frameshift detection and output in the _frameshifts.txt file are not output again here.

****************************************************************************************

below_cutoff_clusters.txt is a file containing feat_ids of proteins in clusters where the proteins have significantly lower BSR matches than is
typical as measured by the BSR pairwise cutoffs. This file is only output when -S YES is used. The first column is the cluster number and the second
column is the feat_id.

****************************************************************************************

paralog_weights.txt is a file showing the level of paralogous matches between pairs of clusters. For each pair of clusters where at least one high
quality match exists between members of the different clusters a line is output (first two columns are the cluster numbers, third column is the
number of matches). The quality of the match is stricter under the recommended -S YES option.

****************************************************************************************

paralogs.txt is a file containing the single linkage clustering using the links from the _paralog_weights.txt file. Each line is a tab delimited
set of cluster numbers. Each transposon tends to be a singleton cluster and very large sets of paralogous clusters tend to be transposons.
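
Single linkage clustering over the cluster pairs in the paralog weights file can be sketched with a union-find structure, as in the
Python example below. This is a generic illustration of the technique, not PanOCT's code; the function name paralog_groups is
hypothetical, and only clusters that appear in at least one link are grouped.

```python
def paralog_groups(links):
    """Group cluster numbers connected by any chain of links.

    links is an iterable of (cluster_1, cluster_2) pairs, e.g. the first
    two columns of each line in the paralog weights file.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())
```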

****************************************************************************************

cluster_weights.txt contains some basic statistics on each cluster under the -C YES option. Each line has 6 tab delimited columns: cluster number,
number of high quality matches within the cluster, minimum protein length, maximum protein length, mean protein length, and standard deviation of
protein length.
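
The four length statistics in this file can be reproduced for a single cluster with the Python sketch below. Whether PanOCT reports
the population or the sample standard deviation is not stated above; the sketch assumes the population form, and length_stats is a
hypothetical name.

```python
from statistics import mean, pstdev


def length_stats(protein_lengths):
    """Return (min, max, mean, standard deviation) of protein lengths.

    Assumption: population standard deviation (pstdev); swap in
    statistics.stdev if the sample form is wanted.
    """
    if not protein_lengths:
        raise ValueError("cluster has no proteins")
    return (
        min(protein_lengths),
        max(protein_lengths),
        mean(protein_lengths),
        pstdev(protein_lengths),
    )
```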

****************************************************************************************

There is a series of paired files which can be output to show the layout of clusters in the genomes. A series of percentage thresholds can be
specified using the -c option. For each threshold a pair of files is output (where "threshold" below is replaced by a number between [1,100]):

threshold_core_adjacency_matrix.txt (where threshold is replaced by an actual number) shows the layout of "core" clusters where core is defined
to be clusters with a percentage of genomes present in the cluster >= threshold. The file has two lines per cluster. The first line has the cluster
number followed by a + or - indicating the relative orientation of the cluster with respect to the preceding cluster, a : then the feat_id of the
reference genome or if the reference genome is not in the cluster the first genome in the genome tag file, and the corresponding gene
name/annotation. The second line shows all core clusters that are adjacent in any genome to the current cluster. Adjacency is defined as the first
core cluster beside the current core cluster but will skip over noncore clusters. Each adjacency is shown as a triplet inside parentheses and
separated by commas. The triplet is: cluster number underscore 5 or 3 to represent the 5' or 3' end of the cluster/gene, cluster number
underscore 5 or 3, and the number of genomes with this adjacency. The underscore 5 or 3 shows the relative orientation of the clusters.
Three ||| separate adjacency for the 5' versus 3' triplets for the given cluster.

threshold_noncore_adjacency_matrix.txt shows the same kind of adjacency information but for adjacency from a noncore cluster to a core cluster.

****************************************************************************************

pairwise_in_cluster.txt shows all of the pairwise matches between members of the same cluster. Each line has 20 columns: cluster number, genome tag 1, feat_id 1,
genome tag 2, feat_id 2, percent identity, e-value, normalized score, BSR, best match bit, bidirectional best bit, syntenic best bit, syntenic bidirectional best bit,
CGN bidirectional best bit, full length match bit, anchor bit, extend bit, size of clique, and size of clique all. If there is no match above cutoff then 0 is output
for percent identity which is not normally a valid value. This is only output for -B YES option.

pairwise_out_cluster.txt shows all of the pairwise matches between members of two different clusters. Each line has 21 columns: cluster #1, cluster #2, genome tag 1,
feat_id 1, genome tag 2, feat_id 2, percent identity, e-value, normalized score, BSR, best match bit, bidirectional best bit, syntenic best bit,
syntenic bidirectional best bit, CGN bidirectional best bit, full length match bit, anchor bit, extend bit, size of clique, and size of clique all.
If there is no match above cutoff then 0 is output for percent identity which is not normally a valid value. This is only output for -B YES option.

****************************************************************************************

centroids.fasta is a multi-fasta file containing the protein sequences for the centroids of the clusters. The fasta header line contains:
>centroid_cluster number feat_id protein name/annotation. This is output using the -d option.

****************************************************************************************

report.txt contains a few of the parameters PanOCT was called with and the feature counts per genome. "Raw" feature counts are determined by feat_ids
seen in both the feature attribute file and as query sequences in the tabular Blast file before cutoffs are applied. "Used" feature counts are for
both query and subject sequences in the tabular Blast file after cutoffs have been applied.

****************************************************************************************

parameters.txt contains a complete list of PanOCT's parameters that PanOCT was called with and any derived values.
 
          
EXAMPLE DATA
------------

Sample data can be found in /panoct_v<ver#>/example_dir