Name                   Modified    Size     Downloads / Week
ContamFinder1.1.1.zip  2019-09-17  23.2 kB
README.txt             2019-09-17  4.6 kB
ContamFinder1.0.1.zip  2016-10-07  21.1 kB
Totals: 3 Items                    48.9 kB  0
######################################
README for ContamFinder v. 1.1.1
###### License Information ###########
# Copyright (C) 2016 JANUS BORNER, janusborner@gmail.com
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation; either version 3 of
# the License or any later version.
# This program is distributed in the hope that it will be useful
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; If not, see http://www.gnu.org/licenses
######################################

ContamFinder requires Ruby 1.9 or newer. It also requires the following
software packages:

BLAST+
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST

exonerate
https://github.com/nathanweeks/exonerate

The GHOSTX-based version additionally requires:

GHOSTX
http://www.bi.cs.titech.ac.jp/ghostx/

The RAPSearch2-based version additionally requires:

RAPSearch2
https://sourceforge.net/projects/rapsearch2/

If you cannot install these applications system-wide (e.g. because you
don't have administrator privileges), you can also copy them into the
ContamFinder directory. In this case you need to set the relative paths
to the executables in the contamfinder_ghostx.rb file (or whichever
version you want to use).
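
As a rough illustration of how a local copy could be preferred over a
system-wide installation, the following Ruby sketch searches a list of extra
directories before the system PATH. The helper name find_executable is
hypothetical and not part of ContamFinder; the scripts themselves rely on
paths that are set by hand.

```ruby
# Hypothetical helper (not part of ContamFinder): return the full path of
# the first executable called `name`, looking in `extra_dirs` first and
# then in the directories listed in PATH. Returns nil if nothing is found.
def find_executable(name, extra_dirs = [])
  dirs = extra_dirs + ENV.fetch("PATH", "").split(File::PATH_SEPARATOR)
  dirs.map { |dir| File.join(dir, name) }
      .find { |path| File.file?(path) && File.executable?(path) }
end
```

For example, find_executable("exonerate", ["."]) would return "./exonerate"
if a copy sits in the current directory, and otherwise fall back to PATH.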

To run ContamFinder, open a command line in the ContamFinder directory
and type:

ruby contamfinder_ghostx.rb

ContamFinder will then process all fasta files in the source_fasta
directory one after another. Please note that the current version of
ContamFinder uses relative file paths and therefore has to be executed
from the directory which contains the ruby scripts.
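
The batch behaviour can be pictured with a short Ruby sketch that collects
the fasta files of a directory in a stable order. The helper name and the
list of accepted file extensions are assumptions made for illustration; the
actual script may select its input files differently.

```ruby
# Assumed set of extensions; ContamFinder's own selection rule may differ.
FASTA_EXTENSIONS = %w[.fa .fas .fasta].freeze

# Hypothetical helper: return the paths of all fasta files in `dir`,
# sorted by file name. The path is built relative to `dir`, so a relative
# `dir` only works when the script is started from the right directory.
def fasta_files(dir)
  Dir.entries(dir)
     .select { |name| FASTA_EXTENSIONS.include?(File.extname(name).downcase) }
     .sort
     .map { |name| File.join(dir, name) }
end
```

Calling fasta_files("source_fasta") from any other working directory would
fail or pick up the wrong files, which is the reason for the restriction on
where the scripts are started.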

In the contamfinder_ghostx.rb file, you can set the E-value cut-off for
the sequence similarity searches (default 1e-10) and the minimum number
of parasite hits required for a positive match (default 3).

ContamFinder saves all search results and intermediary results in the
temp directory. To facilitate debugging and to allow for more detailed
analyses of the results, ContamFinder does not delete these files after
a successful run. For large input datasets, these temporary files may
become very large (several GB).

Databases

Due to the massive size of the databases and datasets used in Borner &
Burmester (2016), we have only included small sample fragments of the
fasta files that were used to create the full databases (located in the
example_database_files folder). If you would like to obtain the databases
used in Borner & Burmester (2016) or if you require assistance in
generating your own, please contact janusborner@gmail.com

Currently, ContamFinder looks for the string 'OS=PARASITE' in the fasta
header of the database sequences. A query is considered a positive hit only
if the headers of its best hitting sequences contain this string. This
is, admittedly, a bit of a dirty hack. Future versions of ContamFinder will
allow for user input to specify which organisms constitute a positive hit.
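
The following Ruby sketch shows one plausible reading of this rule combined
with the two settings mentioned earlier (E-value cut-off 1e-10, minimum of 3
parasite hits). It is an assumption, not ContamFinder's actual code: the Hit
struct, the function name and the exact decision rule (the best hits that
pass the cut-off must all carry the marker) are made up for illustration.

```ruby
PARASITE_MARKER = "OS=PARASITE"

# Hypothetical record for one search hit: database header and E-value.
Hit = Struct.new(:header, :evalue)

# Assumed rule: keep hits at or below the E-value cut-off, take the
# `min_parasite_hits` best ones, and call the query positive only if
# there are enough of them and every header contains the marker.
def parasite_hit?(hits, evalue_cutoff = 1e-10, min_parasite_hits = 3)
  best = hits.select { |hit| hit.evalue <= evalue_cutoff }
             .sort_by { |hit| hit.evalue }
             .first(min_parasite_hits)
  best.size == min_parasite_hits &&
    best.all? { |hit| hit.header.include?(PARASITE_MARKER) }
end
```

Under this reading, a query whose single best hit comes from a non-parasite
sequence is rejected even if several parasite hits follow it.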

ContamFinder requires the fasta file that was used to create the
parasite-only sequence database to be present in the database directory. It
should have the same name as the database, with the file extension ".fa".
Because this file is used for the gene prediction step, low complexity
regions should not be masked in it.
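
A quick check for this file before starting a run might look as follows. The
function name and its arguments are hypothetical; only the ".fa" naming rule
is taken from the paragraph above.

```ruby
# Hypothetical pre-flight check: return the path of the unmasked fasta
# file that belongs to database `db_name` in `db_dir`, or nil if the
# file is missing.
def companion_fasta(db_dir, db_name)
  path = File.join(db_dir, "#{db_name}.fa")
  File.file?(path) ? path : nil
end
```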

This is how the databases for Borner & Burmester (2016) were generated:

For the parasite-only sequence database used in the first step of the
pipeline, we downloaded all available apicomplexan proteomes from EuPathDB.
Low complexity regions were masked using the seg filter from the BLAST+
package. At the beginning of each fasta header we inserted the string
'OS=PARASITE ' to facilitate identification of parasite hits in the
search results. This fasta file was used to generate the EuPathDB search
database.
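
The header tagging step can be reproduced with a few lines of Ruby. This is
a minimal sketch, not the script used for the publication: the function name
is made up, and it assumes that the marker goes directly after the leading
'>' of each header line.

```ruby
# Insert `marker` at the beginning of every fasta header (directly after
# the '>'); sequence lines are passed through unchanged.
def tag_parasite_headers(fasta_text, marker = "OS=PARASITE ")
  fasta_text.each_line.map do |line|
    line.start_with?(">") ? ">#{marker}#{line[1..-1]}" : line
  end.join
end
```

A header such as ">seq1 some protein" becomes ">OS=PARASITE seq1 some
protein", so the marker is found wherever the full header is reported.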

For the full database containing all available proteome data, we
downloaded all protein sequences annotated with the keyword
'complete proteome' from the UniProt database. Sequences from apicomplexan
proteomes were removed and low complexity regions were masked using the
seg filter. The resulting fasta file and the fasta file containing the
EuPathDB sequences were concatenated and the UniProt + EuPathDB search
database was generated.
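
The removal and concatenation steps could be sketched in Ruby as shown here.
This is an illustration under stated assumptions, not the procedure actually
used: the function names are hypothetical, and sequences are assumed to be
recognisable by an 'OS=<organism name>' field in the UniProt header. Masking
is not covered.

```ruby
# Drop every fasta record whose header contains "OS=<name>" for one of
# the given organism (or genus) names; all other records are kept as is.
def drop_organisms(fasta_text, organisms)
  keep = true
  fasta_text.each_line.select do |line|
    if line.start_with?(">")
      keep = organisms.none? { |name| line.include?("OS=#{name}") }
    end
    keep
  end.join
end

# Concatenate the filtered UniProt records with the EuPathDB records.
def combined_fasta(uniprot_text, eupathdb_text, excluded_organisms)
  drop_organisms(uniprot_text, excluded_organisms) + eupathdb_text
end
```

Because the match is a prefix match, passing a genus name such as
"Plasmodium" removes all species of that genus.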

Changelog

1.1.1
 Fixed a bug in CDS prediction when exonerate found multiple matches.
1.1.0
 Added CDS prediction (based on exonerate alignments).

Source: README.txt, updated 2019-09-17