Pathoscope Wiki

Predicts strains of genomes in Nextgen seq alignment file (sam/bl8)

Brought to you by: mani2012

Pathoscope 2.0

Introduction:

Pathoscope 2.0 consists of four core and two optional analysis modules for sequencing-based metagenomic profiling. The PathoLib module extracts genome reference libraries (target or host/filter) from all available sequences in the NCBI Nucleotide database that belong to a user-defined taxonomic clade. The PathoMap module aligns the reads to the target reference library and removes any reads that have sequence similarity with the host or filter genomes. PathoID resolves read ambiguity, identifies which of the target genomes are present in the sample, and estimates the proportions of reads originating from each genome. PathoReport provides two report files: 1) a summary report (.tsv) that contains the numbers and proportions of reads aligned to each genome identified in the sample, and 2) a detailed report (.xml) that includes read coverage, read assignments, and contiguous sequences generated by combining the reads. PathoDB is an optional module that provides additional annotation (organism taxonomic lineage, gene loci, protein products) for all sequences identified in the sample. PathoQC is an optional module that can be used to preprocess the reads prior to alignment with PathoMap.

Please refer to the following papers:
* Pathoscope: Species identification and strain attribution with unassembled sequencing data, at http://genome.cshlp.org/content/23/10/1721
* PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples, at http://www.microbiomejournal.com/content/2/1/33

Tutorial:
http://sourceforge.net/projects/pathoscope/files/pathoscope2.0_v0.02_tutorial.pdf

1. Installation

Download the code from http://sourceforge.net/projects/pathoscope/
Extract the code to a separate folder, for example with the following command:
"tar xvf pathoscope_2.0.tar.gz"

Optional:

If you want to install the complete PathoDB and the complete NT library annotated with taxonomy IDs, you can download them from the following links:
ftp://pathoscope.bumc.bu.edu/data/pathodb.sql.gz
ftp://pathoscope.bumc.bu.edu/data/nt_ti.fa.gz

2. Running

Prerequisite: Python 2.7.3 or a later version must be installed and added to your PATH variable (this is usually done as part of the Python installation).
Change directory to where you extracted the code
Simply run "python pathoscope/pathoscope.py -h" for top level usage information.
Run "python pathoscope/pathoscope.py LIB -h" for detailed usage information to run patholib.
Run "python pathoscope/pathoscope.py MAP -h" for detailed usage information to run pathomap.
Run "python pathoscope/pathoscope.py ID -h" for detailed usage information to run pathoid.
Run "python pathoscope/pathoscope.py REP -h" for detailed usage information to run pathoreport.
There are also some unit tests for testing the validity of the functions.
Change directory to "pathoscope/pathomap/bowtie2wrapper/unittest" and simply run "python testBowtie2Wrap.py".
Change directory to "pathoscope/pathoid/unittest" and simply run "python testPathoID.py".

3. Usage

    usage: pathoscope.py [-h] [--version] [-verbose] {LIB,MAP,ID,REP} ...

    Pathoscope

    positional arguments:
      {LIB,MAP,ID,REP}  Select one of the following sub-commands
        LIB             Pathoscope taxon level reference genome Library creation
                        Module
        MAP             Pathoscope MAP Module
        ID              Pathoscope ID Module
        REP             Pathoscope Report Module

    optional arguments:
      -h, --help        show this help message and exit
      --version         show program's version number and exit
      -verbose          Prints verbose text while running

4. Example

There is a sample alignment file called 'MAP_3852_align.sam' that is included with this package in the example folder to test the pathoid and pathoreport modules.
You may also download the files called nt_ti.fa.gz and pathoscope2_example.tar.gz separately for running patholib and pathomap.
Test using the example alignment file included in the package as follows:

Suppose you have the alignment file 'MAP_3852_align.sam' in the 'example' directory and want the outputs generated in the 'results' directory. Then run the following commands.

pathoid and pathoreport:
Generate a TSV (tab-separated values) report that can be opened in Excel, along with an updated alignment file:
"python pathoscope/pathoscope.py ID -alignFile example/MAP_3852_align.sam -expTag 3852 -outDir results"

Generate the XML and TSV reports using the pathoreport module:
"python pathoscope/pathoscope.py REP -samfile results/updated_MAP_3852_align.sam -outDir results"

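The two commands above can also be chained from a script. The following is a minimal sketch, assuming it is run from the directory where the code was extracted. The helper names (build_id_command, build_rep_command, updated_alignment_path, run_example) are hypothetical; the flags are the ones shown in the commands above, and the "updated_" file name prefix is inferred from the example rather than from a documented rule.

```python
import os
import subprocess

# Path to the top-level script, relative to the extraction directory.
PATHOSCOPE = "pathoscope/pathoscope.py"


def build_id_command(align_file, exp_tag, out_dir):
    """Build the pathoid command line shown in the example above."""
    return ["python", PATHOSCOPE, "ID",
            "-alignFile", align_file,
            "-expTag", exp_tag,
            "-outDir", out_dir]


def build_rep_command(sam_file, out_dir):
    """Build the pathoreport command line shown in the example above."""
    return ["python", PATHOSCOPE, "REP",
            "-samfile", sam_file,
            "-outDir", out_dir]


def updated_alignment_path(align_file, out_dir):
    """Assumption: pathoid writes 'updated_<input file name>' into out_dir,
    as in the example (results/updated_MAP_3852_align.sam)."""
    return os.path.join(out_dir, "updated_" + os.path.basename(align_file))


def run_example(align_file="example/MAP_3852_align.sam",
                exp_tag="3852", out_dir="results"):
    """Run pathoid, then pathoreport on the updated alignment file."""
    subprocess.check_call(build_id_command(align_file, exp_tag, out_dir))
    updated = updated_alignment_path(align_file, out_dir)
    subprocess.check_call(build_rep_command(updated, out_dir))
```

Calling run_example() with no arguments reproduces the two commands above; check_call raises an error if either step exits with a non-zero status, so the report step is skipped when pathoid fails.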
TSV file format (you may need to rename this file to .csv to open it in some spreadsheet programs such as LibreOffice):

At the top of the file in the first row, there are two fields called "Total Number of Aligned Reads" and "Total Number of Mapped Genomes". They represent the total number of reads that are aligned and the total number of genomes to which those reads align from the given alignment file.

Columns in the TSV file:
1. Genome:
This is the name of the genome found in the alignment file.
2. Final Guess:
This represents the percentage of reads that are mapped to the genome in Column 1 (reads aligning to multiple genomes are assigned proportionally) after Pathoscope reassignment is performed.
3. Final Best Hit:
This represents the percentage of reads that have their highest score mapped to the genome in Column 1 after Pathoscope reassignment is performed. The difference between this and the previous column is that second-highest (and lower) scores are ignored.
4. Final Best Hit Read Numbers:
This represents the number of reads that have their highest score mapped to the genome in Column 1 (may include a fraction when a read is aligned to multiple top hit genomes with the same highest score) after Pathoscope reassignment is performed.
5. Final High Confidence Hits:
This represents the percentage of reads that are mapped to the genome in Column 1 with a high confidence alignment (50%-100% alignment probability) to this genome after Pathoscope reassignment is performed. If this value is equal to the value in #2, then all the reads are
6. Final Low Confidence Hits:
This represents the percentage of reads that are mapped to the genome in Column 1 with a low alignment hit score (1%-50%) to this genome after Pathoscope reassignment is performed. These reads are shared with other genomes and their "best hit" could be to another genome.
7. Initial Guess:
This represents the percentage of reads that are mapped to the genome in Column 1 (reads aligning to multiple genomes are assigned proportionally) before Pathoscope reassignment is performed.
8. Initial Best Hit:
This represents the percentage of reads that have their highest score mapped to the genome in Column 1 before Pathoscope reassignment is performed. The difference between this and the previous column is that second-highest (and lower) scores are ignored.
9. Initial Best Hit Read Numbers:
This represents the number of best hit reads that are mapped to the genome in Column 1 (may include a fraction when a read is aligned to multiple top hit genomes with the same highest score) before Pathoscope reassignment is performed.
10. Initial High Confidence Hits:
This represents the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 50%-100% to this genome before Pathoscope reassignment is performed.
11. Initial Low Confidence Hits:
This represents the percentage of reads that are mapped to the genome in Column 1 with an alignment hit score of 1%-50% to this genome before Pathoscope reassignment is performed.
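To make the column definitions concrete, here is a small worked sketch of one reading of them. It is not Pathoscope's own code: the function name summarize, the in-memory input format (one dict of genome to alignment probability per read), and the decision to count a probability of exactly 50% as high confidence are all assumptions made for illustration.

```python
from collections import defaultdict


def summarize(reads, high_cutoff=0.5, low_cutoff=0.01):
    """Compute guess, best-hit and confidence columns for toy data.

    reads: list of dicts mapping genome name -> alignment probability
           (the probabilities of one read are assumed to sum to 1).
    Returns a dict: genome -> dict of column values (percentages, except
    'best_hit_reads', which is a possibly fractional read count).
    """
    if not reads:
        return {}
    n = float(len(reads))
    guess = defaultdict(float)
    best_hit_reads = defaultdict(float)
    high = defaultdict(float)
    low = defaultdict(float)
    for probs in reads:
        top = max(probs.values())
        winners = [g for g, p in probs.items() if p == top]
        for g, p in probs.items():
            guess[g] += p                 # proportional assignment
            if p >= high_cutoff:          # assumption: 50% counts as high
                high[g] += 1
            elif p >= low_cutoff:
                low[g] += 1
        for g in winners:                 # ties share the read equally
            best_hit_reads[g] += 1.0 / len(winners)
    return dict((g, {
        "guess": 100.0 * guess[g] / n,
        "best_hit": 100.0 * best_hit_reads[g] / n,
        "best_hit_reads": best_hit_reads[g],
        "high_conf": 100.0 * high[g] / n,
        "low_conf": 100.0 * low[g] / n,
    }) for g in sorted(guess))


# Four hypothetical reads aligned to two hypothetical genomes A and B.
reads = [
    {"A": 1.0},
    {"A": 0.5, "B": 0.5},
    {"A": 0.75, "B": 0.25},
    {"B": 1.0},
]
report = summarize(reads)
for genome in sorted(report):
    print(genome, report[genome])
```

For genome A this gives a guess of 56.25%, a best hit of 62.5% (2.5 reads, because the tied read counts as half), 75% high confidence hits and 0% low confidence hits. The best hit exceeds the guess because the 25% of the third read that the guess assigns to B is ignored by the best-hit column.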
Updated alignment file:
Pathoscope will generate an updated alignment file in either .sam or BLAST (bl8) format, depending on the initial input format type. This updated file will contain all reads in the input file, but with the previous alignment scores replaced by post-Pathoscope reassignment scores. Alignments that don't achieve the Pathoscope threshold value (parameter -s, default 0.01) will be deleted from this file. For example, with the default threshold, the updated file will not retain any alignments with reassignment probabilities less than 1% after Pathoscope. This means that the updated file will likely be smaller than the original and will contain only the high-probability reassignments. This new file can then be used for downstream analyses such as SNP calling and genome/scaffold assembly.
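The thresholding step described above can be sketched as follows. This is only an illustration on in-memory tuples, not the code Pathoscope uses to rewrite .sam or bl8 files; the function name filter_alignments and the tuple layout are hypothetical, and the default of 0.01 mirrors the -s default mentioned above.

```python
def filter_alignments(alignments, threshold=0.01):
    """Keep alignments whose reassignment probability reaches the threshold.

    alignments: list of (read_id, genome, probability) tuples
                (a hypothetical stand-in for parsed alignment records).
    """
    return [a for a in alignments if a[2] >= threshold]


# Hypothetical post-reassignment probabilities for two reads.
alignments = [
    ("read1", "genomeA", 0.995),
    ("read1", "genomeB", 0.005),   # below 1%, dropped
    ("read2", "genomeA", 0.6),
    ("read2", "genomeB", 0.4),
]
kept = filter_alignments(alignments)
print(len(kept))
```

Both reads are still present in the filtered list, but read1 now has a single alignment, which is why the updated file is usually smaller than the input.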
Additional information on interpreting the results:
In purified samples for which the source genome is present in the database, Pathoscope will usually reassign all the reads with high probability to the single source genome. In addition, Pathoscope does extremely well in identifying strains in mixed samples and in accurately estimating the proportions of the aligned reads that come from each genome.

Alignment: Although Pathoscope is fairly independent of the aligner used, there are a few important alignment tricks that will increase the accuracy of the results. Most importantly, be sure to include all optimally scoring alignments and any high-scoring suboptimal alignments. For example, suppose you have several E. coli strains in your database and you are using Bowtie 2 to align the reads. If a read aligns perfectly to multiple strains in the database, by default Bowtie 2 will randomly report only one of the alignments. In this case, Pathoscope will not be able to reassign the read to the correct genome or definitively identify the correct strain. Instead, one should use the -k parameter with a value greater than the number of similar strains (e.g. -k 50 if there are 30 E. coli strains). BLAST by default will report optimal and suboptimal alignments. For GNUMAP the --print_all flag should be used. However, making alignment thresholds too low can slow down the aligner, increase output file size, and in some extreme cases lead to incorrect results, because reads that should not align to any genome may have a low-scoring match with an incorrect genome.
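As an illustration of the -k advice, the following sketch builds a Bowtie 2 command line for unpaired reads and refuses a -k value that is not greater than the number of similar strains. It calls bowtie2 directly rather than going through PathoMap; the function name and the file and index names are hypothetical, while -x, -U, -S and -k are standard Bowtie 2 options.

```python
def build_bowtie2_command(index, reads, out_sam, n_similar_strains, k):
    """Build a bowtie2 command that reports up to k alignments per read.

    Raises ValueError when k is not greater than the number of similar
    strains, since some equally good alignments could then go unreported.
    """
    if k <= n_similar_strains:
        raise ValueError(
            "-k (%d) should exceed the number of similar strains (%d)"
            % (k, n_similar_strains))
    return ["bowtie2",
            "-x", index,      # basename of the Bowtie 2 index
            "-U", reads,      # unpaired reads file
            "-S", out_sam,    # SAM output file
            "-k", str(k)]     # report up to k alignments per read


# Hypothetical file names; 30 similar strains with -k 50 as in the text.
cmd = build_bowtie2_command("ecoli_strains", "reads.fq", "align.sam", 30, 50)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.check_call once the index and reads exist.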

Closely related substrains: If read coverage is low and/or if you have closely related strains in the database, Pathoscope may fail to assign all reads to one genome and instead split the reads between two genomes (equally or unequally). This type of result indicates either that both strains/substrains are present, or that there is not enough information in the reads for Pathoscope to definitively identify the source strain in the sample, so it picks both. In our manuscript, we show that for nearly identical strains (99.9% similarity), ~20% (0.2X) coverage is usually sufficient to identify a difference. Note that this is about 1/500th of the data needed for whole genome assembly. However, in this same scenario, when read coverage was 5-10% (0.05X-0.10X), Pathoscope often struggled to distinguish between the nearly identical genomes.
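To relate these coverage figures to a read count, fold coverage can be estimated as the number of reads times the read length divided by the genome length. The sketch below uses that standard approximation; the function name and the example genome size and read length are hypothetical and not taken from the manuscript.

```python
def fold_coverage(num_reads, read_length, genome_length):
    """Approximate fold coverage: total sequenced bases / genome length."""
    return float(num_reads) * read_length / genome_length


# Hypothetical 5 Mb genome sequenced with 100 bp reads.
genome_length = 5000000
read_length = 100

enough = fold_coverage(10000, read_length, genome_length)   # about 0.2X
too_low = fold_coverage(2500, read_length, genome_length)   # about 0.05X
print(enough, too_low)
```

Under these assumptions, roughly 10,000 aligned reads reach the 0.2X level discussed above, while 2,500 reads fall in the 0.05X range where nearly identical genomes were hard to separate.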

Genome not in the database: Another common result occurs if the true species/strain is not contained in the database or if the true genome is incomplete. In these cases, Pathoscope does an excellent job at identifying the nearest fully assembled relative that is in the database, as long as there are at least some reads that align to that genome. However, one common result that we observed is that in these cases, Pathoscope won't completely converge to a single neighboring genome. Rather, Pathoscope will usually assign read proportions to multiple related genomes. In most cases, the one that scores the highest is the closest fully assembled neighbor. For example, in one case we sequenced reads from a clinical sample from a sick primate that was infected with a newly identified adenovirus. When we aligned the reads and applied Pathoscope, nearly a dozen adenoviruses received some proportion of the reads. The highest scoring genome, which received ~56% of the reads, turned out to be the adenovirus in the database that had the closest genome similarity to the infecting strain. Interestingly, when we added the correct genome into the mix, all the reads were reassigned to this correct genome with nearly 100% confidence, and all the other adenoviruses dropped off the list. We have observed this exact phenomenon in multiple examples so far.

Pathoscope and parsimony: One behavior of Pathoscope that can be either an advantage or a drawback is its tendency to identify a parsimonious list of genomes. This is advantageous if there is only a single strain of each particular species in the sample, or if the mere identification of the species or strain type is important. However, if there are multiple strains/substrains of the same species, Pathoscope may miss one of the substrains by reassigning its reads to another substrain. Basically, if there are reads that uniquely align to a single genome, Pathoscope can find the genome and nearly recover its proportions. However, if there are multiple substrains present and no reads map uniquely, Pathoscope may miss one of the strains. This could be a problem in viral infections where there can be many subpopulations present (e.g. HIV). In these cases, the more appropriate approach is to remove the unique read penalty (theta parameter) and use a more standard mixture model. This will remove the tendency of Pathoscope toward parsimony.
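The role of uniquely aligning reads can be illustrated with a plain mixture-model EM on toy data. This is a generic textbook sketch with no unique read penalty, not Pathoscope's implementation; the function name em_proportions, the input format and the toy reads are assumptions made for illustration.

```python
def em_proportions(reads, genomes, iterations=100):
    """Estimate genome proportions with a plain mixture-model EM.

    reads:   list of dicts mapping genome -> alignment likelihood
             (genomes a read does not align to are simply absent).
    genomes: list of genome names.
    Returns a dict genome -> estimated proportion of reads.
    """
    pi = dict((g, 1.0 / len(genomes)) for g in genomes)  # uniform start
    for _ in range(iterations):
        totals = dict((g, 0.0) for g in genomes)
        for q in reads:
            # E-step: share the read according to proportion * likelihood.
            weights = dict((g, pi[g] * q.get(g, 0.0)) for g in genomes)
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # read has no support under the current estimate
            for g in genomes:
                totals[g] += weights[g] / norm
        assigned = sum(totals.values())
        if assigned == 0.0:
            break
        # M-step: new proportions are the average read shares.
        pi = dict((g, totals[g] / assigned) for g in genomes)
    return pi


genomes = ["A", "B"]
# Three reads unique to A plus two reads aligning equally well to A and B.
with_unique = em_proportions(
    [{"A": 1.0}] * 3 + [{"A": 1.0, "B": 1.0}] * 2, genomes)
# Five reads that all align equally well to both genomes.
no_unique = em_proportions([{"A": 1.0, "B": 1.0}] * 5, genomes)
print(with_unique, no_unique)
```

In the first case the reads unique to A pull the shared reads toward A, so B's proportion shrinks toward zero. In the second case nothing maps uniquely and the estimate stays at its 50/50 starting point, which is the situation in which the reads alone cannot separate the two substrains.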

5. License: GNU-GPL

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.

6. Support and Contact

Pathoscope is developed at the Johnson Lab at Boston University.
W. Evan Johnson, Ph.D.
Division of Computational Biomedicine
Boston University School of Medicine
72 E. Concord St., E-645
Boston, MA 02118

Developers:
Solaiappan Manimaran
Changjin Hong

For support queries, please open a ticket or contact us at mani2012@users.sourceforge.net
https://sourceforge.net/p/pathoscope/tickets/

Project Admins:

Solaiappan Manimaran