SnowyOwl Code

RNA-Seq based gene prediction pipeline for fungal genomes

Brought to you by: ian-d-reid

File	Date	Author	Commit
bin	2014-05-01	Ian Reid	[6a53b4] Handle models without a numeric score
CONFIG	2013-08-02	ian_reid	[6ccac3] Merged python branch into master
CONFIG.template	2014-02-05	Ian Reid	[aafe6d] Added docstrings for functions
LICENSE	2013-08-05	Ian Reid	[3bed96] Added license and copyright notices
README.txt	2013-08-14	Ian Reid	[2c1324] Moved example data into a separate dwnload file
SnowyOwl	2013-12-11	Ian Reid	[f5d654] Modified islands2genes and score_models to allo...
runSnowyOwl.py	2014-02-05	Ian Reid	[c3746a] Added docstrings for functions

Read Me

Copyright (c) 2013, The Developers
All rights reserved.

This directory contains the SnowyOwl gene prediction package.

Once the package is downloaded and decompressed in any location where you have 
write privileges, make sure that the programs used by SnowyOwl are available and 
edit CONFIG.template to suit your system.


===========
DIRECTORIES
===========

Projects: 
By default, SnowyOwl creates a directory under Projects for storing all 
intermediate and final results files for each genome data set.
Alternatively, the -o option can be used to specify a different root directory 
for project files.

bin:
Contains scripts and programs used by SnowyOwl.


========
HARDWARE
========

SnowyOwl is designed to run on multi-processor workstations or servers. At least 
3 processors are required, 12 are recommended, and more will shorten run time. 
SnowyOwl is not designed for clusters.

24 GB of RAM is adequate for fungal genomes.

SnowyOwl will use temporary disk space approximately equal to the size of the 
input files, and leave about 200 MB of output files.

SnowyOwl will optionally use TimeLogic boards and DeCypher software 
for accelerated BLAST searching.

========
SOFTWARE
========

The following program packages, with the indicated or newer versions, are 
required to run SnowyOwl; all these programs should be accessible through your 
system PATH variable.
  - UNIX, with both bash and tcsh shells
  - Perl 5
  - Python 2.7, with modules Biopython 1.59, pysam 0.6, paramiko 1.7.7.1, 
    doit 0.21, PyGTK 2.20
  - Augustus 2.5.5      (http://bioinf.uni-greifswald.de/augustus/binaries/)
  - GeneMark-ES 2.3e    (http://exon.gatech.edu/license_download.cgi)
  - NCBI Blast+ 2.2.25  (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
  - Exonerate           (http://www.ebi.ac.uk/~guy/exonerate/)
  - Blat                (http://hgdownload.cse.ucsc.edu/admin/exe/)
  - samtools            (https://sourceforge.net/projects/samtools/files/samtools/)
  - tabix               (https://sourceforge.net/projects/samtools/files/tabix/)
  - Cd-hit              (http://weizhong-lab.ucsd.edu/cd-hit/download.php)
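
A quick pre-flight check can confirm that the external programs are visible on 
PATH before a run is started. The sketch below is not part of SnowyOwl. It uses 
the Python 3 standard library (shutil.which is not available in Python 2.7), and 
the executable names in the list are assumptions based on the packages named 
above; GeneMark-ES is left out because its executable name is not given here. 
Adjust the list to match your installation.

```python
import shutil

# Executable names are assumptions; they can differ between installations
# (for example, cd-hit may be installed under another name).
ASSUMED_EXECUTABLES = [
    "bash", "tcsh", "perl", "augustus", "blastp", "blastx",
    "exonerate", "blat", "samtools", "tabix", "cd-hit",
]


def find_missing(executables):
    """Return the names in `executables` that cannot be found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed programs were found on PATH.")
```

This check only confirms that a program can be found; it does not verify the 
version numbers listed above.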

SnowyOwl uses 1 or 2 protein databases for Blast searching. During development 
we have used the Uniprot/Swissprot database for blastx searching and the NCBI 
Refseq Fungi database for blastp searching.

==================
CONFIGURATION FILE
==================

The values set in the file "CONFIG" are used as defaults. Any of these can be 
left empty and set on the command line using the same name. Values given on the 
command line override values in CONFIG. If a required value (indicated by
"required" on the comment line in CONFIG) is set neither in CONFIG nor on the 
command line, SnowyOwl will complain and exit.
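
The precedence rule described above can be sketched in a few lines of Python. 
This is an illustration only, not SnowyOwl's own code: the function name, the 
dictionary representation of the settings, and the choice of which parameters 
are treated as required are all assumptions made for the example.

```python
def resolve_settings(config_defaults, command_line, required):
    """Merge settings so that command-line values override CONFIG defaults.

    Values of None or "" are treated as "not set".
    """
    merged = {}
    # Later sources win, so the command line is applied after the defaults.
    for source in (config_defaults, command_line):
        for name, value in source.items():
            if value not in (None, ""):
                merged[name] = value
    missing = sorted(name for name in required if name not in merged)
    if missing:
        raise SystemExit("Missing required value(s): " + ", ".join(missing))
    return merged


# Hypothetical values for illustration only.
defaults = {"ProjectName": "", "Genome": "genome.fasta"}
overrides = {"ProjectName": "demo", "Genome": "other.fasta"}
settings = resolve_settings(defaults, overrides, required=["ProjectName", "Genome"])
print(settings)
```

Here ProjectName is empty in the defaults and supplied on the command line, 
while Genome is set in both places and the command-line value is kept.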

When installing SnowyOwl, edit CONFIG.template to suit your system and your 
preferences, and save as CONFIG.

In place of the default CONFIG file, you can specify a personalized configuration 
file with the -c option on the command line.

SnowyOwl saves a CONFIG file with the values for each project in the project root 
directory. Besides providing a record, this file allows you to restart a project 
run without re-entering any parameters.

=================
RUNNING SNOWY OWL
=================

An example data set containing all the input files needed to predict the genes 
on chr_5_1 of Aspergillus niger with SnowyOwl is available for download at 
http://sourceforge.net/projects/snowyowl/files/.

To run SnowyOwl, you must specify the project parameters through the CONFIG file 
or on the command line. You only need to enter values for parameters that differ 
from the CONFIG file values.  A GUI is available to help you enter the project
parameters.

If you enter
    <path to SnowyOwl>/SnowyOwl --gui
a dialog will appear pre-populated with the values in the default CONFIG file.  
Make appropriate changes and press the OK button at the bottom of the dialog, 
and SnowyOwl will carry out a sanity check and then start its gene prediction run.

If you already have a CONFIG file containing all the parameter values for your 
project, or you want to restart a project run using the project CONFIG file 
generated by SnowyOwl, enter
    <path to SnowyOwl>/SnowyOwl -c <path to custom CONFIG file>

Of course you can combine the two approaches if you want to change a few values 
in a custom configuration:
    <path to SnowyOwl>/SnowyOwl --gui -c <path to custom CONFIG file>

To avoid using the GUI (e.g. when running SnowyOwl from a script), you can enter 
all parameters on the command line. Any parameter in the CONFIG file can be set 
by giving its name prefixed with '--', followed by a space and the new value. 
The parameters changed most often also have short tags:
  ProjectName : -p
  ProjectDir  : -o
  Genome      : -g
  MaskedGenome: -n
  Reads       : -r
  MappedReads : -m
  Transcripts : -t
  config_file : -c
  label       : -l
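
When SnowyOwl is launched from a script, the argument list can be assembled 
from a dictionary of parameters. The sketch below is a hedged illustration: it 
uses the short tags listed above and falls back to the '--name value' form for 
everything else. The install path, the parameter values, and the function name 
are hypothetical.

```python
# Short tags as listed above; other parameters use the "--name value" form.
SHORT_TAGS = {
    "ProjectName": "-p", "ProjectDir": "-o", "Genome": "-g",
    "MaskedGenome": "-n", "Reads": "-r", "MappedReads": "-m",
    "Transcripts": "-t", "config_file": "-c", "label": "-l",
}


def build_command(executable, parameters):
    """Return an argument list for running SnowyOwl with `parameters`."""
    argv = [executable]
    for name, value in parameters.items():
        argv.append(SHORT_TAGS.get(name, "--" + name))
        argv.append(str(value))
    return argv


# Hypothetical install path and input files.
command = build_command(
    "/opt/SnowyOwl/SnowyOwl",
    {"ProjectName": "demo", "Genome": "genome.fasta", "Reads": "reads.fastq"},
)
print(" ".join(command))
# prints: /opt/SnowyOwl/SnowyOwl -p demo -g genome.fasta -r reads.fastq
```

The resulting list can be passed to subprocess.run(command, check=True) to 
start the run and raise an error if SnowyOwl exits with a non-zero status.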


During the SnowyOwl run, the starting and finishing time for each step, and any 
fatal error messages, are output to <ProjectDir>/logs/SnowyOwl.log. Detailed 
progress and error output from the programs run by SnowyOwl are saved in 
individual logs in the logs directory; consult these logs to troubleshoot any 
problems that arise.

When the run finishes, the high-quality gene models predicted by SnowyOwl can be 
found in <ProjectDir>/accepted.gff3.
More results are available in the <ProjectDir>/Predictions directory, and a 
summary of all the models generated is in <ProjectDir>/logs/Prediction.log.
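
For a quick look at the accepted models, the features in a GFF3 file can be 
tallied by type. The sketch below relies only on the general GFF3 layout 
(tab-separated columns, with the feature type in column 3); the records in the 
example are made up for illustration, and which feature types SnowyOwl actually 
writes is not assumed here.

```python
from collections import Counter


def count_feature_types(gff3_lines):
    """Count GFF3 features by type (column 3), skipping comments and blanks."""
    counts = Counter()
    for line in gff3_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) >= 3:
            counts[columns[2]] += 1
    return counts


# Made-up records for illustration, not SnowyOwl output.
example = [
    "##gff-version 3",
    "chr_5_1\texample\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr_5_1\texample\tmRNA\t100\t900\t.\t+\t.\tID=m1;Parent=g1",
    "chr_5_1\texample\tCDS\t100\t400\t.\t+\t0\tID=c1;Parent=m1",
    "chr_5_1\texample\tCDS\t500\t900\t.\t+\t2\tID=c2;Parent=m1",
]
counts = count_feature_types(example)
print(dict(counts))
```

To summarize a real file, pass an open file handle instead of the list, for 
example count_feature_types(open("accepted.gff3")) from within the project 
directory.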

SnowyOwl keeps all its intermediate files, and will use them when a run is 
restarted. Once you are satisfied with the results, you can delete any of the 
intermediate files to free up disk space; you will want to keep at least 
accepted.gff3 and CONFIG.

===========
INPUT FILES
===========
Genome sequence, in FASTA format.

[Optional] Masked genome sequence, in FASTA format. Positions where no gene 
predictions are wanted, such as repetitive sequence or ribosomal DNA, can be 
masked with N.
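
As a minimal illustration of what masking with N means, the sketch below 
replaces given regions of a sequence string with N. It is not part of SnowyOwl, 
and the function name and the 1-based, inclusive coordinate convention are 
assumptions made for the example. In practice a masked genome is usually 
produced by a dedicated repeat-masking tool rather than by hand.

```python
def mask_regions(sequence, regions):
    """Replace 1-based, inclusive (start, end) regions of `sequence` with N."""
    bases = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(bases) or start > end:
            raise ValueError("region out of range: %d-%d" % (start, end))
        bases[start - 1:end] = "N" * (end - start + 1)
    return "".join(bases)


masked = mask_regions("ACGTACGTAC", [(3, 5), (9, 10)])
print(masked)  # prints: ACNNNCGTNN
```

The masked sequence keeps its original length, so coordinates of predictions on 
the masked genome still correspond to the unmasked genome.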

RNA-Seq reads, in FASTA or FASTQ format.

A directory containing classified_juncs.gz, a tabix-indexed list of splice 
junction positions, and tuque.coverage.wig.gz, a tabix-indexed file of read 
coverage depth profiles in bedGraph format. These files are generated when 
tuqueSplice [http://sourceforge.net/projects/tuque/] is used to map RNA-Seq 
reads, or they can be created from a BAM read-mapping file with the script 
BAM_to_juncs_and_coverage.sh (see AUXILIARY SCRIPTS below).

A file of likely transcript sequences, assembled from RNA-Seq reads, in FASTA 
format.


=================
AUXILIARY SCRIPTS
=================
SnowyOwl/bin/scripts/BAM_to_juncs_and_coverage.sh reads.bam genome.fasta
can be used to generate the needed classified_juncs.gz and tuque.coverage.wig.gz 
files from a set of mapped RNA-Seq reads in BAM format. The program 'bedtools' 
(available from http://code.google.com/p/bedtools/) must be on the system PATH.

SnowyOwl/bin/scripts/combine_new_and_old_predictions.sh old.models.gff3 new.models.gff3 start_num genome.fasta
can be used to conservatively merge new predictions with existing predictions, 
preserving the names of any existing models that are the same as new models. It 
produces a non-redundant combined set of old and new models, named 
combined.gff3, and a list of the old models that have been replaced along with 
their replacements.

SnowyOwl/bin/scripts/get_accepted_representatives.sh  models.gff3  accepted.gff3
can be used to filter imperfect models from a set of scored gene models. It will 
output files named accepted.gff3 and imperfect.gff3 and a list of the frequencies 
of various flaws in the input models.


=====
HELP!
=====
Questions on the package can be posted on the discussion forum at 
http://sourceforge.net/p/snowyowl/discussion/.
