Name | Modified | Size | Downloads / Week |
---|---|---|---|
ContamFinder1.1.1.zip | 2019-09-17 | 23.2 kB | |
README.txt | 2019-09-17 | 4.6 kB | |
ContamFinder1.0.1.zip | 2016-10-07 | 21.1 kB | |
Totals: 3 Items | | 48.9 kB | 0 |
######################################
README for ContamFinder v. 1.1.1
###### License Information ###########
# Copyright (C) 2016 JANUS BORNER, janusborner@gmail.com
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation; either version 3 of
# the License or any later version.
# This program is distributed in the hope that it will be useful
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; If not, see http://www.gnu.org/licenses
######################################

ContamFinder requires Ruby 1.9 or newer to run.
It also requires these software packages:

BLAST+
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST

exonerate
https://github.com/nathanweeks/exonerate

The GHOSTX-based version additionally requires:

GHOSTX
http://www.bi.cs.titech.ac.jp/ghostx/

and the RAPSearch2-based version additionally requires:

RAPSearch2
https://sourceforge.net/projects/rapsearch2/

If you cannot install these applications system-wide (e.g. because
you don't have administrator privileges), you can also copy them into
the ContamFinder directory. In this case you need to set the relative
paths to the executables in the contamfinder_ghostx.rb file (or in the
script for whichever version you want to use).

To run ContamFinder, open a command line in the ContamFinder directory
and type:

    ruby contamfinder_ghostx.rb

ContamFinder will then process all fasta files in the source_fasta
directory, one after another.

Please note that the current version of ContamFinder uses relative
file paths and therefore has to be executed from the directory that
contains the Ruby scripts.

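Before a first run it can help to check that the external tools are
reachable, either on the PATH or in the ContamFinder directory. The
following is a minimal sketch and not part of ContamFinder itself. The
helper names (find_executable, missing_tools) are hypothetical, and the
binary names in the example list are assumptions that may differ from
the names used by your installation.

```ruby
# Sketch: look for an executable in some extra directories first,
# then in the directories listed in PATH. Returns the full path or nil.
def find_executable(name, extra_dirs = [], path_env = ENV.fetch("PATH", ""))
  dirs = extra_dirs + path_env.split(File::PATH_SEPARATOR)
  dirs.each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Returns the subset of names for which no executable was found.
def missing_tools(names, extra_dirs = [], path_env = ENV.fetch("PATH", ""))
  names.reject { |name| find_executable(name, extra_dirs, path_env) }
end

if __FILE__ == $PROGRAM_NAME
  # Assumed binary names; adjust them to match your installation.
  tools = %w[blastp exonerate ghostx]
  missing = missing_tools(tools, [Dir.pwd])
  if missing.empty?
    puts "All tools found."
  else
    puts "Not found: #{missing.join(', ')}"
  end
  # ContamFinder uses relative paths, so source_fasta must be reachable
  # from the current working directory.
  puts "Warning: no source_fasta directory here." unless File.directory?("source_fasta")
end
```

Saved under a name of your choice in the ContamFinder directory and
run with ruby, the script only reports what it finds. It does not
change any paths in the ContamFinder scripts.
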
In the contamfinder_ghostx.rb file, you can set the E-value cut-off
for the sequence similarity searches (default 1e-10) and the minimum
number of parasite hits for a positive match (default 3).

ContamFinder saves all search results and intermediary results in the
temp directory. To facilitate debugging and to allow for more detailed
analyses of the results, ContamFinder does not delete these files
after a successful run. For large input datasets, these temporary
files may become very large (several GB).

Databases

Due to the massive size of the databases and datasets used in
Borner & Burmester (2016), we have only included small sample
fragments of the fasta files that were used to create the full
databases (located in the example_database_files folder). If you would
like to obtain the databases used in Borner & Burmester (2016) or if
you require assistance in generating your own, please contact
janusborner@gmail.com.

Currently, ContamFinder looks for the string 'OS=PARASITE' in the
fasta headers of the database sequences. A query is considered a
positive hit only if the headers of the best-hitting sequences contain
this string. This is, admittedly, a bit of a dirty hack. Future
versions of ContamFinder will allow for user input to specify which
organisms constitute a positive hit.

ContamFinder requires the fasta file that was used for the creation of
the parasite-only sequence database to be present in the database
directory. It should have the same name as the database, with the file
extension ".fa". Since this file is used for the gene prediction step,
low complexity regions should not be masked in this file.

This is how the databases for Borner & Burmester (2016) were
generated:

For the parasite-only sequence database used in the first step of the
pipeline, we downloaded all available apicomplexan proteomes from
EuPathDB. Low complexity regions were masked using the seg filter from
the BLAST+ package.
At the beginning of each fasta header we inserted the string
'OS=PARASITE ' to facilitate identification of parasite hits in the
search results. This fasta file was used to generate the EuPathDB
search database.

For the full database containing all available proteome data, we
downloaded all protein sequences annotated with the keyword 'complete
proteome' from the UniProt database. Sequences from apicomplexan
proteomes were removed and low complexity regions were masked using
the seg filter. The resulting fasta file and the fasta file containing
the EuPathDB sequences were concatenated and the UniProt + EuPathDB
search database was generated.

Changelog

1.1.1
Fixed a bug in CDS prediction when exonerate found multiple matches.

1.1.0
Added CDS prediction (based on exonerate alignments).
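
Example

For users building their own databases, the 'OS=PARASITE' convention
described in the Databases section can be illustrated with a short
sketch. This is not the code used by ContamFinder. The function names
are hypothetical, and the decision rule in parasite_match? (the top
min_hits headers must all carry the tag) is only one possible reading
of "minimum number of parasite hits", stated here as an assumption.

```ruby
PARASITE_TAG = "OS=PARASITE"

# Insert the tag (followed by a space) at the beginning of every fasta
# header. Headers that already start with the tag are left unchanged.
def tag_headers(fasta_text, tag = PARASITE_TAG)
  fasta_text.each_line.map do |line|
    if line.start_with?(">") && !line.start_with?(">#{tag} ")
      ">#{tag} #{line[1..-1]}"
    else
      line
    end
  end.join
end

# Assumed rule, for illustration only: the query counts as a positive
# match if the first min_hits headers (best hits first) all contain
# the tag. The actual rule in ContamFinder may differ.
def parasite_match?(best_hit_headers, min_hits = 3, tag = PARASITE_TAG)
  return false if best_hit_headers.length < min_hits
  best_hit_headers.first(min_hits).all? { |header| header.include?(tag) }
end

if __FILE__ == $PROGRAM_NAME
  # Made-up sample record, not taken from any real database.
  sample = ">seq1 example protein\nMKVLA\n"
  print tag_headers(sample)
end
```

The tagging function is written so that running it twice on the same
file does not insert the tag a second time.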