# Copyright (c) 2017, Ian Reid
# All rights reserved.
# Version 1.1

This directory contains the Harfang transcript and gene prediction package.

Once the package is downloaded and decompressed in any location where you have 
write privileges, make sure that the programs used by Harfang are available and 
edit CONFIG.template to suit your system.


===========
DIRECTORIES
===========

bin:
Contains scripts and programs used by Harfang.


========
HARDWARE
========

Harfang is designed to run on multi-processor workstations or servers. At least 
3 processors are required, 12 are recommended, and more will shorten run time. 
Harfang is not designed for clusters.
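
As a quick pre-flight check, the processor guidance above can be compared with
the number of processors on the machine. The sketch below is not part of
Harfang: the helper name cpu_verdict is made up for illustration, and it assumes
that getconf _NPROCESSORS_ONLN is supported on your system (it is on Linux and
macOS).

```shell
# cpu_verdict N: classify a processor count against the guidance above
# (at least 3 required, 12 recommended). Hypothetical helper, not part of Harfang.
cpu_verdict() {
    if [ "$1" -lt 3 ]; then
        echo "too few"
    elif [ "$1" -lt 12 ]; then
        echo "sufficient"
    else
        echo "recommended"
    fi
}

# Number of processors currently online; fall back to 1 if it cannot be read.
ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case "$ncpu" in
    ''|*[!0-9]*) ncpu=1 ;;   # getconf printed something that is not a number
esac
echo "$ncpu processors: $(cpu_verdict "$ncpu")"
```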

24 GB of RAM is adequate for fungal genomes.

Harfang will use temporary disk space approximately equal to the size of the 
input files, and leave about 200 MB of output files.

========
SOFTWARE
========

The following program packages, with the indicated or newer versions, are 
required to run Harfang; all these programs should be accessible through your 
system PATH variable.
  - UNIX, with bash shell
  - Perl 5
  - Python 2.7, with modules Biopython 1.59, pysam 0.6, doit 0.29.0
  - Augustus 2.5.5        (http://bioinf.uni-greifswald.de/augustus/binaries/)
  - NCBI Blast+ 2.2.25    (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
  - samtools, bgzip, tabix (https://github.com/samtools/samtools/releases/download/)
  - Cd-hit                (http://weizhong-lab.ucsd.edu/cd-hit/download.php)

Harfang uses a protein database for Blast searching. During development
we have used the NCBI Refseq Fungi database.
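
One way to confirm that the required programs are reachable through PATH is to
look each one up with the shell's command -v. This is only a sketch: the
executable names in the list (for example python, blastp and cd-hit) are
assumptions about a typical installation, and the helper missing_tools is
hypothetical. Adjust the list to the names used on your system.

```shell
# missing_tools NAME...: print every program in the list that is not on PATH.
# Hypothetical helper, not part of Harfang.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names are assumptions; edit them to match your installation.
missing=$(missing_tools bash perl python augustus blastp samtools bgzip tabix cd-hit)
if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All listed programs were found on PATH."
fi
```

This checks only that the programs can be found, not that their versions meet
the minimums listed above.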

==================
CONFIGURATION FILE
==================

The values set in the file "CONFIG" are used as defaults. Any of these can be 
left empty and set on the command line using the same name. Values given on the 
command line override values in CONFIG. If a required value (indicated by
"required" on the comment line in CONFIG) is set neither in CONFIG nor on the 
command line, Harfang will complain and exit.
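
The precedence rule can be pictured with a small shell function. This is an
illustration of the behaviour described above, not Harfang's own code: the
function name resolve and the example values are invented, and for simplicity
every parameter is treated as required.

```shell
# resolve NAME CONFIG_VALUE CLI_VALUE
# Prints the effective value: the command-line value if given, otherwise the
# CONFIG value. Fails with a message on stderr if neither is set.
# Hypothetical illustration; Harfang's internal logic may differ.
resolve() {
    local name="$1" config_value="$2" cli_value="$3"
    local value="${cli_value:-$config_value}"
    if [ -z "$value" ]; then
        echo "required parameter $name is not set" >&2
        return 1
    fi
    printf '%s\n' "$value"
}

resolve Species "species_from_config" ""                  # CONFIG value is used
resolve Species "species_from_config" "species_from_cli"  # command line wins
```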

When installing Harfang, edit CONFIG.template to suit your system and your 
preferences, and save as CONFIG.

In place of the default CONFIG file, you can specify a personalized configuration 
file with the -c option on the command line.

Harfang saves a CONFIG file with the values for each project in the project root 
directory. Besides providing a record, this file allows you to restart a project 
run without re-entering any parameters.

===============
RUNNING HARFANG
===============

Harfang requires that Augustus has already been trained for the target genome. A
convenient way to achieve this is to run Braker with the same genome and RNA-Seq
BAM file that will be used for Harfang. The value given to Braker's --species
parameter should then be used as the Species parameter for Harfang.

To run Harfang, you must specify the project parameters through the CONFIG file
or on the command line. You only need to enter values for parameters that differ 
from the CONFIG file values.

If you already have a CONFIG file containing all the parameter values for your 
project, or you want to restart a project run using the project CONFIG file 
generated by Harfang, enter
    <path to Harfang>/Harfang -c <path to custom CONFIG file>

The value of any parameter in the CONFIG file can be overridden on the command
line by prefixing the parameter name with '--' and following it with a space and
the new value. The parameters changed most often also have short tags:
  ProjectName : -p
  ProjectDir  : -o
  Genome      : -g
  MaskedGenome: -n
  MappedReads : -m
  config_file : -c
  label       : -l
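
A command line that combines short tags with a '--' option might look like the
sketch below. The installation directory, project name, file names and species
name are all placeholders. The show helper is hypothetical and only prints the
command for review. To start a real run, type the Harfang command directly,
without show.

```shell
# Hypothetical installation directory; replace with your own.
HARFANG_DIR=/opt/Harfang

# show: print a command line instead of running it (dry run).
show() { printf '%s\n' "$*"; }

# All values below are placeholders. --Species overrides the Species
# parameter from CONFIG, using the '--' form described above.
show "$HARFANG_DIR/Harfang" \
    -p my_fungus_v1 \
    -o /data/projects/my_fungus_v1 \
    -g genome.fasta \
    -n genome.masked.fasta \
    -m rnaseq_tracks \
    --Species my_fungus
```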


During the Harfang run, the starting and finishing time for each step, and any 
fatal error messages, are output to <ProjectDir>/logs/Harfang.log. Detailed 
progress and error output from the programs run by Harfang are saved in 
individual logs in the logs directory; consult these logs to troubleshoot any 
problems that arise.
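
When troubleshooting, it can help to look at the end of the main log and then
find which individual logs mention an error. The sketch below assumes that the
failing program wrote the word "error" (in any letter case) to its log, which
may not hold for every program. The project path and the helper name
logs_with_errors are hypothetical.

```shell
# logs_with_errors PROJECT_DIR: list log files that mention "error",
# ignoring letter case. Hypothetical helper, not part of Harfang.
logs_with_errors() {
    grep -il 'error' "$1"/logs/* 2>/dev/null
}

PROJECT_DIR=/data/projects/my_fungus_v1   # hypothetical project directory
if [ -d "$PROJECT_DIR/logs" ]; then
    # Last steps started or finished, and any fatal error messages.
    tail -n 20 "$PROJECT_DIR/logs/Harfang.log"
    logs_with_errors "$PROJECT_DIR"
fi
```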

When the run finishes, the high-quality gene models predicted by Harfang can be 
found in <ProjectDir>/accepted.gff3.
More results are available in the <ProjectDir>/Predictions directory, and a 
summary of all the models generated is in <ProjectDir>/logs/Prediction.log.
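
For a quick look at the result, the number of features of a given type in
accepted.gff3 can be counted from the third column of the GFF3 records. This is
a sketch only: it assumes that the accepted models are written as features of
type "gene", which should be checked against your own output. The project path
and the helper name count_features are hypothetical.

```shell
# count_features FILE TYPE: count GFF3 records whose third column equals TYPE.
# Lines starting with '#' (directives and comments) are skipped.
count_features() {
    awk -F '\t' -v type="$2" '!/^#/ && $3 == type { n++ } END { print n + 0 }' "$1"
}

PROJECT_DIR=/data/projects/my_fungus_v1   # hypothetical project directory
if [ -f "$PROJECT_DIR/accepted.gff3" ]; then
    echo "gene features: $(count_features "$PROJECT_DIR/accepted.gff3" gene)"
fi
```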

Harfang keeps all its intermediate files, and will use them when a run is 
restarted. Once you are satisfied with the results, you can delete any of the 
intermediate files to free up disk space; you will want to keep at least 
accepted.gff3 and CONFIG.

===========
INPUT FILES
===========
Genome sequence, in FASTA format.

[Optional] Masked genome sequence, in FASTA format. Positions where no gene 
predictions are wanted, such as repetitive sequence or ribosomal DNA, can be 
masked with N.

A directory with subdirectories plus/ and minus/, each containing two
tabix-indexed files:
  - classified_juncs.gz: a list of splice junction positions
  - tuque.coverage.wig.gz: read coverage depth profiles, in bedGraph format
These files can be created with the script
strand-specific_BAM_to_juncs_and_coverage.sh (see AUXILIARY SCRIPTS below) from
coordinate-sorted BAM files mapping RNA-Seq reads to the forward and reverse
strands of the genome.
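
Before starting a run, the layout described above can be verified with a short
check. This sketch assumes that each tabix index sits next to its data file with
the default .tbi suffix. The directory name rnaseq_tracks and the helper name
missing_track_files are hypothetical.

```shell
# missing_track_files DIR: print each expected file that is absent from DIR.
# File names follow the description above; the .tbi suffix assumes tabix's
# default index naming. Hypothetical helper, not part of Harfang.
missing_track_files() {
    local strand name
    for strand in plus minus; do
        for name in classified_juncs.gz tuque.coverage.wig.gz; do
            [ -f "$1/$strand/$name" ]     || echo "$strand/$name"
            [ -f "$1/$strand/$name.tbi" ] || echo "$strand/$name.tbi"
        done
    done
}

# Prints nothing when the directory is complete.
missing_track_files rnaseq_tracks   # hypothetical directory name
```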

The name of an Augustus species for which Augustus has already been trained on
the target genome (see RUNNING HARFANG).

[Optional] Existing transcript or gene models in GFF3 format.


=================
AUXILIARY SCRIPTS
=================
The script
    Harfang/bin/scripts/strand-specific_BAM_to_juncs_and_coverage.sh map.BAM genome.fasta output_directory
can be used to generate the needed classified.juncs.gz and tuque.coverage.wig.gz 
files from a set of mapped RNA-Seq reads in BAM format.

=====
HELP!
=====
Questions on the package can be posted on the discussion forum at 
http://sourceforge.net/p/harfang/discussion/.
Source: README.txt, updated 2017-05-10