Copyright (c) 2013, The Developers
All rights reserved.
This directory contains the SnowyOwl gene prediction package.
Download and decompress the package in any location where you have write
privileges. Then make sure that the programs used by SnowyOwl (see SOFTWARE) are
available, edit CONFIG.template to suit your system, and save it as CONFIG.
===========
DIRECTORIES
===========
Projects:
By default, SnowyOwl creates a directory under Projects for storing all
intermediate and final results files for each genome data set.
Alternatively, the -o option can be used to specify a different root directory
for project files.
bin:
Contains scripts and programs used by SnowyOwl.
========
HARDWARE
========
SnowyOwl is designed to run on multi-processor workstations or servers. At least
3 processors are required, 12 are recommended, and more will shorten run time.
SnowyOwl is not designed for clusters.
24 GB of RAM is adequate for fungal genomes.
SnowyOwl will use temporary disk space approximately equal to the size of the
input files, and leave about 200 MB of output files.
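The processor and disk figures above can be turned into a quick pre-flight
check. The following Python 3 sketch is not part of SnowyOwl: the function names
are hypothetical, and the disk estimate simply assumes temporary space equal to
the input size plus roughly 200 MB of output.

```python
import os

MIN_CPUS = 3                    # minimum stated in this README
RECOMMENDED_CPUS = 12           # recommended count stated in this README
OUTPUT_BYTES = 200 * 1024 ** 2  # "about 200 MB" of output files (approximate)


def cpu_status(n_cpus):
    """Classify a processor count against the figures in this README."""
    if n_cpus < MIN_CPUS:
        return "insufficient"
    if n_cpus < RECOMMENDED_CPUS:
        return "minimum"
    return "recommended"


def estimated_disk_bytes(input_paths):
    """Rough disk need: temporary space ~ input size, plus ~200 MB of output."""
    input_bytes = sum(os.path.getsize(path) for path in input_paths)
    return input_bytes + OUTPUT_BYTES


if __name__ == "__main__":
    n_cpus = os.cpu_count() or 0  # os.cpu_count() may return None
    print("processors:", n_cpus, "->", cpu_status(n_cpus))
```

Pass the paths of your genome, reads, and transcript files to
estimated_disk_bytes() to get an approximate figure before starting a run.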
SnowyOwl will optionally use TimeLogic boards and DeCypher software
for accelerated BLAST searching.
========
SOFTWARE
========
The following program packages, at the indicated versions or newer, are required
to run SnowyOwl. All of these programs must be accessible through your system
PATH variable.
- UNIX, with both bash and tcsh shells
- Perl 5
- Python 2.7, with modules Biopython 1.59, pysam 0.6, paramiko 1.7.7.1,
doit 0.21, PyGTK 2.20
- Augustus 2.5.5 (http://bioinf.uni-greifswald.de/augustus/binaries/)
- GeneMark-ES 2.3e (http://exon.gatech.edu/license_download.cgi)
- NCBI Blast+ 2.2.25 (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
- Exonerate (http://www.ebi.ac.uk/~guy/exonerate/)
- Blat (http://hgdownload.cse.ucsc.edu/admin/exe/)
- samtools (https://sourceforge.net/projects/samtools/files/samtools/)
- tabix (https://sourceforge.net/projects/samtools/files/tabix/)
- Cd-hit (http://weizhong-lab.ucsd.edu/cd-hit/download.php)
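One way to confirm that the required programs are reachable through PATH is a
small Python 3 check like the sketch below. It is not part of SnowyOwl, and the
executable names in the list are assumptions that may differ on your system
(the list is also incomplete, for example it omits GeneMark-ES); adjust it to
match your installation.

```python
import shutil

# Executable names are assumptions; check them against your installation.
ASSUMED_EXECUTABLES = [
    "bash", "tcsh", "perl", "augustus", "blastp", "blastx",
    "exonerate", "blat", "samtools", "tabix", "cd-hit",
]


def missing_programs(names):
    """Return the names that cannot be found on the system PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for name in missing_programs(ASSUMED_EXECUTABLES):
        print("not found on PATH:", name)
```

The script prints nothing when every listed name resolves to an executable.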
SnowyOwl uses one or two protein databases for BLAST searching. During
development we used the UniProt/Swiss-Prot database for blastx searching and the
NCBI RefSeq Fungi database for blastp searching.
==================
CONFIGURATION FILE
==================
The values set in the file "CONFIG" are used as defaults. Any of these can be
left empty and set on the command line using the same name. Values given on the
command line override values in CONFIG. If a required value (marked "required"
on its comment line in CONFIG) is set neither in CONFIG nor on the command line,
SnowyOwl reports an error and exits.
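The precedence rules above can be illustrated with a short Python 3 sketch. This
is not SnowyOwl's own code: the function name is hypothetical, settings are
modelled as plain dictionaries rather than parsed from CONFIG, and an empty
string stands for a value left empty.

```python
def resolve_settings(config_values, cli_values, required):
    """Merge CONFIG defaults with command-line values; the command line wins."""
    settings = dict(config_values)
    # Only command-line values that were actually given override CONFIG.
    given = {k: v for k, v in cli_values.items() if v not in (None, "")}
    settings.update(given)
    # A required value must be set in at least one of the two places.
    missing = [name for name in required if settings.get(name) in (None, "")]
    if missing:
        raise SystemExit("missing required value(s): " + ", ".join(missing))
    return settings
```

For example, a ProjectName left empty in the defaults and supplied on the
command line resolves to the command-line value, while a required name that is
empty in both places stops the program.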
When installing SnowyOwl, edit CONFIG.template to suit your system and your
preferences, and save as CONFIG.
In place of the default CONFIG file, you can specify a personalized configuration
file with the -c option on the command line.
SnowyOwl saves a CONFIG file with the values for each project in the project root
directory. Besides providing a record, this file allows you to restart a project
run without re-entering any parameters.
================
RUNNING SNOWYOWL
================
An example data set containing all the input files needed to predict the genes
on chr_5_1 of Aspergillus niger with SnowyOwl is available for download at
http://sourceforge.net/projects/snowyowl/files/.
To run SnowyOwl, you must specify the project parameters through the CONFIG file
or on the command line. You only need to enter values for parameters that differ
from the CONFIG file values. A GUI is available to help you enter the project
parameters.
If you enter
<path to SnowyOwl>/SnowyOwl --gui
a dialog will appear pre-populated with the values in the default CONFIG file.
Make appropriate changes and press the OK button at the bottom of the dialog,
and SnowyOwl will carry out a sanity check and then start its gene prediction run.
If you already have a CONFIG file containing all the parameter values for your
project, or you want to restart a project run using the project CONFIG file
generated by SnowyOwl, enter
<path to SnowyOwl>/SnowyOwl -c <path to custom CONFIG file>
Of course you can combine the two approaches if you want to change a few values
in a custom configuration:
<path to SnowyOwl>/SnowyOwl --gui -c <path to custom CONFIG file>
To avoid using the GUI (e.g. when running SnowyOwl from a script), you can enter
all parameters on the command line. Any parameter in the CONFIG file can be set
by prefixing its name with '--' and following it with a space and the new value.
The most frequently changed parameters also have short tags:
ProjectName : -p
ProjectDir  : -o
Genome      : -g
MaskedGenome: -n
Reads       : -r
MappedReads : -m
Transcripts : -t
config_file : -c
label       : -l
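When SnowyOwl is driven from a script, the command line can be assembled from a
dictionary of parameters. The Python 3 sketch below uses the short tags listed
above and falls back to the '--<name> <value>' form for everything else. The
function name and the example path are hypothetical and not part of SnowyOwl.

```python
# Short tags listed in this README; other parameters use "--<name> <value>".
SHORT_TAGS = {
    "ProjectName": "-p",
    "ProjectDir": "-o",
    "Genome": "-g",
    "MaskedGenome": "-n",
    "Reads": "-r",
    "MappedReads": "-m",
    "Transcripts": "-t",
    "config_file": "-c",
    "label": "-l",
}


def build_command(snowyowl_path, params):
    """Return an argument list suitable for subprocess.run()."""
    command = [snowyowl_path]
    for name, value in params.items():
        command += [SHORT_TAGS.get(name, "--" + name), str(value)]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True), for
instance with build_command("/opt/SnowyOwl/SnowyOwl", {"ProjectName": "demo"}),
where the installation path is only an illustration.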
During the SnowyOwl run, the starting and finishing time for each step, and any
fatal error messages, are output to <ProjectDir>/logs/SnowyOwl.log. Detailed
progress and error output from the programs run by SnowyOwl are saved in
individual logs in the logs directory; consult these logs to troubleshoot any
problems that arise.
When the run finishes, the high-quality gene models predicted by SnowyOwl can be
found in <ProjectDir>/accepted.gff3.
More results are available in the <ProjectDir>/Predictions directory, and a
summary of all the models generated is in <ProjectDir>/logs/Prediction.log.
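A quick way to inspect accepted.gff3 after a run is to count its features by
type. The Python 3 sketch below relies only on the standard GFF3 layout (nine
tab-separated columns, with the feature type in column 3); which feature types
SnowyOwl writes is not stated here, so the function simply reports whatever it
finds. The function name is hypothetical.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count features by the type column (column 3) of GFF3 records."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section ends the feature records
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts
```

To use it, open <ProjectDir>/accepted.gff3 and pass the file handle to
count_feature_types().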
SnowyOwl keeps all its intermediate files, and will use them when a run is
restarted. Once you are satisfied with the results, you can delete any of the
intermediate files to free up disk space; you will want to keep at least
accepted.gff3 and CONFIG.
===========
INPUT FILES
===========
- Genome sequence, in FASTA format.
- [Optional] Masked genome sequence, in FASTA format. Positions where no gene
predictions are wanted, such as repetitive sequence or ribosomal DNA, can be
masked with N.
- RNA-Seq reads, in FASTA or FASTQ format.
- A directory containing classified_juncs.gz, a tabix-indexed list of splice
junction positions, and tuque.coverage.wig.gz, a tabix-indexed file of read
coverage depth profiles in bedGraph format. These files are generated when
tuqueSplice [http://sourceforge.net/projects/tuque/] is used to map RNA-Seq
reads or can be created from a .BAM read mapping file with the script
BAM_to_juncs_and_coverage.sh (see below).
- A file of likely transcript sequences, assembled from RNA-Seq reads, in FASTA
format.
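For the optional masked genome, masking means replacing the unwanted positions
with N. The Python 3 sketch below shows the idea on a single sequence string. It
is an illustration rather than a SnowyOwl tool: the function name is
hypothetical, and it assumes regions are given as 1-based, inclusive
(start, end) pairs.

```python
def mask_regions(sequence, regions):
    """Replace each 1-based, inclusive (start, end) region with N."""
    bases = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(bases) or start > end:
            raise ValueError("region out of range: %d-%d" % (start, end))
        # The slice and its replacement have the same length.
        bases[start - 1:end] = "N" * (end - start + 1)
    return "".join(bases)
```

For example, mask_regions("ACGTACGT", [(3, 5)]) returns "ACNNNCGT".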
=================
AUXILIARY SCRIPTS
=================
SnowyOwl/bin/scripts/BAM_to_juncs_and_coverage.sh reads.bam genome.fasta
can be used to generate the needed classified.juncs.gz and tuque.coverage.wig.gz
files from a set of mapped RNA-Seq reads in BAM format. The program 'bedtools'
(available from http://code.google.com/p/bedtools/) must be on the system PATH.
SnowyOwl/bin/scripts/combine_new_and_old_predictions.sh old.models.gff3 new.models.gff3 start_num genome.fasta
can be used to conservatively merge new predictions with existing predictions,
preserving the names of any existing models that are the same as new models. It
produces a non-redundant combined set of old and new models named combined.gff3,
and a list of the old models that have been replaced along with their
replacements.
SnowyOwl/bin/scripts/get_accepted_representatives.sh models.gff3 accepted.gff3
can be used to filter imperfect models from a set of scored gene models. It will
output files named accepted.gff3 and imperfect.gff3 and a list of the frequencies
of various flaws in the input models.
=====
HELP!
=====
Questions about the package can be posted on the discussion forum at
http://sourceforge.net/p/snowyowl/discussion/.