hreFinder 1.0
This program package hreFinder by Wei-Bung Wang and Shea Gardner includes several
executables and some python scripts.
__________________________
Copyright (c) 2012, Lawrence Livermore National Security, LLC.
Produced at the Lawrence Livermore National Laboratory
Written by Wei-Bung Wang weibung.wang@gmail.com and Shea Gardner, gardner26@llnl.gov.
CODE LLNL-CODE-576353.
All rights reserved.
This file is part of HgtFinder, Version: 1.0. For details, see HgtFinder at sourceforge.org . Please also read this link - Additional BSD Notice.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the disclaimer below.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the disclaimer (as noted below) in the documentation and/or other materials provided with the distribution.
- Neither the name of the LLNS/LLNL nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL LAWRENCE LIVERMORE NATIONAL SECURITY, LLC, THE U.S. DEPARTMENT OF ENERGY OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Additional BSD Notice
1. This notice is required to be provided under our contract with the U.S. Department of Energy (DOE). This work was produced at Lawrence Livermore National Laboratory under Contract No. DE-AC52-07NA27344 with the DOE.
2. Neither the United States Government nor Lawrence Livermore National Security, LLC nor any of their employees, makes any warranty, express or implied, or assumes any liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately-owned rights.
3. Also, reference herein to any specific commercial products, process, or services by trade name, trademark, manufacturer or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.
----------------------------------------------------------------
GET SOURCE CODE READY
To untar the file and get things ready, simple enter:
$ tar zxvf hreFinder.tar.gz
----------------------------------------------------------------
COMPILE
It is easy to compile. Simply get into the src/ directory and
type 'make':
$ cd hreFinder (if necessary)
$ cd src
$ make
This will create executables in bin/ directory.
You need the bioperl module Bio::TreeIO in your PERL5LIB for the script labelTree_HGT.pl
----------------------------------------------------------------
USAGE
There are many parameters of the main program 'hgtWithBlocks'.
It is fine to run hgtWithBlocks without any parameter, and it
will output the usage. However, the best way to run it is via
'run_config.py [input_tree] [genomes.fasta] [SNPs_all]'
or if you want to use a reference genome other than the first for sorting the SNPs by their order in the reference:
'run_config.py [input_tree] [genomes.fasta] [SNPs_all] [reference_genome_name]'
Read carefully through the first 100 lines in
run_config.py. The comments in run_config.py should outline how
to setup parameters.
If you use kSNP (https://sourceforge.net/projects/ksnp), at the bottom of the kSNP shell script, uncomment the exit line before the HRE commands, correct your path to HREfinder, and it will be run after kSNP. Output will be placed in the HRE.ML, HRE.parsimony, and HRE.NJ subdirectories within the directory containing the kSNP output.
When including a draft genome in contigs, concatenate the contigs into a single sequence for that genome with >=200 N's separating each contig. If you use kSNP, simply follow the instructions in the kSNP User Guide for merge_fasta_contigs.
If there are genomes in the fastafile that do not have SNPs or which are not in the tree file, the program will fail. Make sure that all genomes occur in all three input files, i.e. the SNPs_all file, the tree file, and the input fasta file. If you run kSNP, you must have positional information for all the genomes, so all of them must be listed in the -p finished_genomes input file, even if they are draft contigs. Do not run HREFinder with unassembled raw read genomes, since positional information is crucial to HREfinder.
----------------------------------------------------------------
EXAMPLE
There is a B.mallei/ directory with sample input files. To run
the example, simply locate the directory and run run_config.py:
$ cd hreFinder (if necessary)
$ cd B.mallei
$ ../run_config.py tree.ML.tre fastainput SNPs_all
The last command can be replaced by
$ ../run_config.py tree.ML.tre fastainput SNPs_all ATCC_23344
where ATCC_23344 is the master strain name. The SNPs are sorted
by their order in the master genome, so it is better not to
choose a genome in contig form to be the master genome.
----------------------------------------------------------------
EXPLANATION OF FILE NAMES
Here are the explanation of most file names in run_config.py.
The default values are in the parentheses.
tree_name (tree_input)
This is the input tree file name in newick format
fasta_file (fastainput)
This is the input file that contains the fasta data of all
strains.
SNP_all (SNPs_all)
This is the input file that lists that surrounding sequences of
all SNPs, and their positions/orientations in all strains.
SNP_all_matrix (SNPs_all_matrix)
This is the file of concatenated SNPs created from
SNPs_all that summarizes the alleles at all loci
of all strains.
tree_out (tree_out_c)
If there is no labels for internal nodes, this output file will
have a tree in newick form with internal nodes named.
tree_hre_label (tree_hre_labels.tre)
This is the tree with nodes and leaves labeled with the number of HRE events (x#) and mutations or errors (m#)
snp_map (snp_map_c)
This output file contains information to map SNP IDs to SNP
surrounding sequences.
hgtSNP (hreSNPs)
This contains a listing of SNPs predicted to be involved in HRE events, in the format of SNP_all input but with an extra first column numbering the loci.
block_out (block_out_c)
In this output file, each line is a block, and the numbers in
each line is the SNP IDs of the block.
event_out (event_out_c)
This output file contains the detected events in English. We do
not recommand parsing this file to get information.
snp_out (snp_out_c)
This output file contains inferred SNPs. HreFinder infers SNPs
for missing data, and "corrects" SNP alleles if HreFinder thinks
the SNPs are better explained by errors. Note that the accuracy
is data dependent.
hgt_only (hre_result_c)
This output file is essentially the same as event_out, but it
contains only HRE events.
event_summ (event_summ_c)
This output file is essentially the same as event_out, but not
in English. This file is good for parsing to extract more
information. See the first few lines of the file for the format
explanation.
hgt_from_to (hre_from_to_c)
This output file lists the numbers of HREs between each pair of
nodes in the tree if there is at least one HRE.
node_summ (node_summ_c)
This output file lists the number of mutations/errors/HRE of
each node, and their branch length.
picture (picture_c.png)
This output png file draws the events. SNPs are sorted by their
order in the master genome and separated by blocks. Red for
mutations, blue for errors, and cyan for HREs.
seq_out (seq_out_c)
This output file contains the HRE sequences. If an HRE occurs on
a leaf node, then the corresponding sequence is extracted from
fasta_file.
tree_name_int (tree_int)
This input/output file is for simulation. The program sib.py
that generates inversion doesn't read branch length as floating
point number, so we round the branch length to integers for
sib.py.
hre_from_to_by_locus
A table of all loci predicted to be involved in HRE and the donor
(from) and recipient (to) strains. For HRE's from or to internal
nodes, all the genomes under that node are listed in from/to columns
tanglegram.eps
Tanglegram picture of with lines connecting the nodes between which HRE's are predicted.
HRE's are plotted from the tree on the left to the tree on the right. "_" is pre-pended
to genome names of tree on the right so bioperl module will not automatically
connect taxa with same name.
----------------------------------------------------------------
In the run_config.py lines 16-21, you can change the costs to make
HRE more or less likely.
# Cost of events, feel free to change them
cost_error = 2
cost_mutation = 3
cost_HGT = 5
cost_out_group_open = 7
cost_out_group_extend = 1
----------------------------------------------------------------
RUNNING SIMULATION
Simply try to run single_run_simulation.py, read and run
simulation_for_paper.py in simulation/ directory.
----------------------------------------------------------------
SIMULATION WITHOUT INVERSIONS
In the src/ directory, open the file readSib.h, uncomment the
line of #define NO_INVERSION, re-compile, and it will be fine.
Note that the numbers in the paper are obtained by simulation
without inversions.
----------------------------------------------------------------
Some places we refer to HRE as HGT (horizontal gene transfer), the original
name for our method.
HRE is more accurate for what we are trying to predict, but to avoid introducing
unanticipated bugs we have left some of the original HGT's in variable names.