Pathoscope Wiki

Predicts strains of genomes in Nextgen seq alignment file (sam/bl8)

Brought to you by: mani2012

clinical_pathoscope

Clinical Pathoscope 1.0

Introduction:

Clinical Pathoscope is a program to identify pathogens/commensals/contaminants in unassembled sequencing reads.

1. Installation

Download Clinical Pathoscope code from http://sourceforge.net/projects/pathoscope/files/clinical_pathoscope_v1.0.3.tar.gz/download
Extract the code to a separate folder. For example, you could issue the following command to extract the files:
"tar xvf clinical_pathoscope_v1.0.3.tar.gz"
Download bowtie2 from http://sourceforge.net/projects/bowtie-bio/
An example FASTQ file can be downloaded from http://sourceforge.net/projects/pathoscope/files/simulated_sample.fastq.gz/download
Reference databases (human, viral, & bacterial) as well as their associated alignment indexes can be downloaded from http://www.bu.edu/jlab/wp-assets/databases.tar.gz

2. Running

Prerequisite: Python 2.7.3 or later must be installed and on your PATH (this is usually set up as part of the Python installation). For earlier versions of Python, you will need to install the argparse module: https://pypi.python.org/pypi/argparse
Change directory to where you extracted the code.
Create a config file by filling in the necessary information shown in config.txt.
Run runClinicalPathoscope.py with the config file to generate the shell script that runs Clinical Pathoscope for a particular sample ("python runClinicalPathoscope.py config.txt").

3. Output Files

TSV file format (you may need to rename this file to .csv to open it in some spreadsheet programs, such as Excel or LibreOffice):

The first row of the file contains two fields, "Total Number of Aligned Reads" and "Total Number of Mapped Genomes". They give the total number of aligned reads in the given alignment file and the total number of genomes to which those reads align.
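
Reading the report programmatically could look like the sketch below. It is not part of Clinical Pathoscope. It assumes that the first row stores the two totals as alternating label and value cells, that the second row is the column header, and that every later row describes one genome; the README does not spell out this layout, so check it against a real report. The function name read_report and the example data are made up for illustration.

```python
import csv
import io


def read_report(text):
    """Parse a Pathoscope-style TSV report given as a string.

    Assumed layout (not confirmed by the README): row 1 holds the two
    totals as alternating label/value cells, row 2 is the column header,
    and every later row describes one genome.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    first, header, body = rows[0], rows[1], rows[2:]
    # Pair up label/value cells from the first row.
    totals = {
        first[i].rstrip(":"): first[i + 1]
        for i in range(0, len(first) - 1, 2)
    }
    # One dict per genome row, keyed by the file's own header names.
    genomes = [dict(zip(header, row)) for row in body if row]
    return totals, genomes


# Made-up example data, for illustration only.
example = (
    "Total Number of Aligned Reads:\t200\tTotal Number of Mapped Genomes:\t2\n"
    "Genome\tFinal Guess\tFinal Best Hit\n"
    "genome_A\t0.75\t0.8\n"
    "genome_B\t0.25\t0.2\n"
)
totals, genomes = read_report(example)
print(totals["Total Number of Aligned Reads"])
print(genomes[0]["Genome"], genomes[0]["Final Guess"])
```

All values are returned as strings; convert them with int() or float() as needed once the layout has been confirmed.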

Columns in the TSV file:
1. Genome:
  This is the name of the genome found in the alignment file.
2. Final Guess:
  This represents the percentage of reads that are mapped to the genome in Column 1 (reads aligning to multiple genomes are assigned proportionally) after Pathoscope reassignment is performed.
3. Final Best Hit:
  This represents the percentage of reads that have their highest score mapped to the genome in Column 1 after Pathoscope reassignment is performed. The difference between this and the previous column is that second-highest (and lower) scores are ignored.
4. Final Best Hit Read Numbers:
  This represents the number of reads that have their highest score mapped to the genome in Column 1 after Pathoscope reassignment is performed (it may include a fraction when a read is aligned to multiple top-hit genomes with the same highest score).
5. Final High Confidence Hits:
  This represents the percentage of reads that are mapped to the genome in Column 1 with a high-confidence alignment (50%-100% alignment probability) to this genome after Pathoscope reassignment is performed. If this value is equal to the value in #2, then all the reads are
6. Final Low Confidence Hits:
  This represents the percentage of reads that are mapped to the genome in Column 1 with a low alignment hit score (1%-50%) to this genome after Pathoscope reassignment is performed. These reads are shared with other genomes, and their "best hit" could be to another genome.
7. Initial Guess:
  This represents the percentage of reads that are mapped to the genome in Column 1 (reads aligning to multiple genomes are assigned proportionally) before Pathoscope reassignment is performed.
8. Initial Best Hit:
  This represents the percentage of reads that have their highest score mapped to the genome in Column 1 before Pathoscope reassignment is performed. The difference between this and the previous column is that second-highest (and lower) scores are ignored.
9. Initial Best Hit Read Numbers:
  This represents the number of reads that have their highest score mapped to the genome in Column 1 before Pathoscope reassignment is performed (it may include a fraction when a read is aligned to multiple top-hit genomes with the same highest score).
10. Initial High Confidence Hits:
  This represents the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 50%-100% to this genome before Pathoscope reassignment is performed.
11. Initial Low Confidence Hits:
  This represents the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 1%-50% to this genome before Pathoscope reassignment is performed.
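
The column definitions above can be illustrated with a small worked example. The sketch below is not Pathoscope's own code and does not perform any reassignment; it only shows how the "Guess", "Best Hit", "Best Hit Read Numbers", "High Confidence" and "Low Confidence" summaries differ for a given set of read-to-genome probabilities. The function name summarize, the input structure, and the toy probabilities are hypothetical. Values are returned as fractions rather than percentages, and a probability of exactly 50% is counted as high confidence, which is an assumption because the stated ranges overlap at that point.

```python
from collections import defaultdict


def summarize(read_probs, high=0.5, low=0.01):
    """Summarize per-genome columns from read-to-genome probabilities.

    read_probs maps read id -> {genome: probability}; each read's
    probabilities are expected to sum to 1.
    """
    n_reads = len(read_probs)
    out = defaultdict(lambda: {"guess": 0.0, "best_hit": 0.0,
                               "best_hit_reads": 0.0,
                               "high_conf": 0.0, "low_conf": 0.0})
    for probs in read_probs.values():
        top = max(probs.values())
        winners = [g for g, p in probs.items() if p == top]
        for genome, p in probs.items():
            row = out[genome]
            # Guess: each read contributes its proportional share.
            row["guess"] += p / n_reads
            # Confidence bins count whole reads (boundary is assumed).
            if p >= high:
                row["high_conf"] += 1.0 / n_reads
            elif p >= low:
                row["low_conf"] += 1.0 / n_reads
        # Best hit: only the top score counts; ties split the read.
        for genome in winners:
            out[genome]["best_hit_reads"] += 1.0 / len(winners)
    for row in out.values():
        row["best_hit"] = row["best_hit_reads"] / n_reads
    return dict(out)


# Toy input: four reads and two genomes, for illustration only.
example = {
    "read1": {"A": 1.0},
    "read2": {"A": 0.5, "B": 0.5},
    "read3": {"A": 0.25, "B": 0.75},
    "read4": {"B": 1.0},
}
summary = summarize(example)
for genome in sorted(summary):
    print(genome, summary[genome])
```

For genome A in this toy input, the guess is 0.4375 (the proportional shares 1.0, 0.5 and 0.25 divided by four reads), while the best hit is 0.375 because only read1 and half of the tied read2 count, giving 1.5 best-hit reads.
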
Updated alignment file:
Pathoscope will generate an updated alignment file in either .sam or BLAST (bl8) format, depending on the input format. This updated file will contain all reads in the input file, but with the previous alignment scores replaced by post-Pathoscope reassignment scores. Alignments that don't achieve the Pathoscope threshold value (parameter -s, default 0.01) will be deleted from this file. For example, with the default threshold, the updated file will not retain any alignments with reassignment probabilities less than 1% after Pathoscope. This means that the updated file will likely be smaller than the original and will contain only the high-probability reassignments. This new file can then be used for downstream analyses such as SNP calling and genome/scaffold assembly.
Shell script:
A shell script containing all commands and parameters that were executed during a given run. This allows the user to reproduce their exact analysis.
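
The threshold filtering described for the updated alignment file can be sketched as follows. This is a simplified illustration, not Pathoscope's implementation: each alignment is represented by a hypothetical (read id, genome, probability) tuple instead of a real SAM or bl8 line, and the function name filter_alignments is made up. Alignments with a probability below the threshold are dropped, matching the statement that alignments with reassignment probabilities less than 1% are not retained at the default setting.

```python
def filter_alignments(alignments, threshold=0.01):
    """Keep alignments whose reassigned probability reaches the threshold.

    alignments is a list of (read_id, genome, probability) tuples; this
    simplified record stands in for a SAM or bl8 line.
    """
    return [a for a in alignments if a[2] >= threshold]


# Toy alignments, for illustration only.
example_alignments = [
    ("read1", "A", 0.995),
    ("read1", "B", 0.005),  # below 1%, dropped at the default threshold
    ("read2", "A", 0.4),
    ("read2", "B", 0.6),
]
kept = filter_alignments(example_alignments)
for record in kept:
    print(record)
```

In this toy input every read still has at least one alignment after filtering, but the file shrinks from four alignments to three.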

4. Additional information

Clinical Pathoscope comes bundled with the original Pathoscope (Version 1.0), three prebuilt Bowtie2 databases (human, bacteria, and virus), and one simulated dataset.
The human host library consists of two sequences: the GRCh37/hg19 build of the human genome and the human ribosomal DNA sequence (GenBank:U13369).
The bacterial library was downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.fna.tar.gz, 12/15/12).
The viral library was also obtained from NCBI (ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.fna.tar.gz, 1/10/13).
To use databases other than those provided with the software, the user must provide their own Bowtie2 indexes. See http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml for specific details regarding how to create an index using Bowtie2.
For the simulated reads (simulated_sample.fastq), 10 million 100-base-pair (bp) reads were generated for each sample with 90% of reads originating from the host transcriptome (human RNA), 9% from bacterial genomes, and 1% from viral genomes.
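
For the custom-database case mentioned above, a Bowtie2 index is built with the bowtie2-build program, which takes a reference FASTA file and an index base name. The sketch below only assembles and optionally runs that command; it is not part of Clinical Pathoscope, the helper names are hypothetical, and the file names my_genomes.fna and my_index are placeholders. Consult the Bowtie2 manual for the available options.

```python
import shutil
import subprocess


def bowtie2_build_command(reference_fasta, index_prefix):
    """Return the argument list for building a Bowtie2 index."""
    return ["bowtie2-build", reference_fasta, index_prefix]


def build_index(reference_fasta, index_prefix):
    """Run bowtie2-build if it is on the PATH; raise otherwise."""
    if shutil.which("bowtie2-build") is None:
        raise RuntimeError("bowtie2-build not found on PATH")
    subprocess.run(
        bowtie2_build_command(reference_fasta, index_prefix), check=True
    )


# Placeholder file names; only the command is printed here.
command = bowtie2_build_command("my_genomes.fna", "my_index")
print(" ".join(command))
```

Calling build_index with real paths would run the command and raise an error if bowtie2-build is missing or exits with a failure.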

Simulated reads composition (Accession Number | Name):
Bacteria:
gi|296112228 | Moraxella_catarrhalis_RH4_chromosome,_complete_genome
gi|378696079 | Haemophilus_influenzae_10810,_complete_genome
gi|16271976 | Haemophilus_influenzae_Rd_KW20_chromosome,_complete_genome
gi|387787130 | Streptococcus_pneumoniae_ST556_chromosome,_complete_genome
gi|392427891 | Streptococcus_intermedius_JTH08,_complete_genome
Virus:
gi|49169782 | Human coronavirus NL63 (HCoV-NL63)
gi|9627719 | Human enterovirus A (HEV-A)
gi|160700581 | Human rhinovirus C (HRV-C)
gi|8486122,gi|8486125,gi|8486127,gi|8486129,gi|8486131,gi|8486134,gi|8486136,gi|8486138 | Influenza A virus (A/Puerto Rico/8/34/H1N1)
gi|77125236 | Human bocavirus (HBoV)
gi|56160876 | Human adenovirus type 7 (AdV7)

5. License: GNU-GPL

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

6. Support and Contact

Pathoscope is developed at the Johnson Lab at Boston University.
W. Evan Johnson, Ph.D.
Division of Computational Biomedicine
Boston University School of Medicine
72 E. Concord St., E-645
Boston, MA 02118
For support queries, please open a ticket or contact us at
jperezrogers@users.sourceforge.net
mani2012@users.sourceforge.net
https://sourceforge.net/p/pathoscope/tickets/