MicrobeGPS Wiki

The Explorative Taxonomic Profiling Tool for Metagenomic Data

Brought to you by: lindnerm

Pipeline

MicrobeGPS Pipeline

The MicrobeGPS Pipeline window shows the analysis pipeline and allows you to adjust the analysis parameters. You can either run each step separately or run all steps at once.

The window is divided into four panels. The upper left panel shows the MicrobeGPS Pipeline, where all seven steps are listed. A red light means that the step has not yet been completed, a yellow light that it is currently running, and a green light that it is complete. A red X indicates that a step failed. To run a step, the previous step must be complete (green light). Clicking on a step shows its Parameters in the upper right panel. These parameters can be changed as long as the step has a red light. A Parameter Description is shown in the lower left panel. Progress messages, errors and other notifications are shown on the Console in the lower right panel.
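The status rules above can be summarised in a few lines of Python. This is only an illustrative sketch: the names `Status`, `can_run` and `can_edit_parameters` are made up for this example and are not taken from the MicrobeGPS source code. How a failed step (red X) is handled is not described on this page, so the sketch simply treats it as not runnable.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical names; the comments give the indicator shown in the window.
    PENDING = "red light"      # not yet completed
    RUNNING = "yellow light"   # currently running
    DONE = "green light"       # complete
    FAILED = "red X"           # step failed


def can_run(statuses, index):
    """A pending step may run if it is the first step or its predecessor is done."""
    if statuses[index] is not Status.PENDING:
        return False
    return index == 0 or statuses[index - 1] is Status.DONE


def can_edit_parameters(statuses, index):
    """Parameters can be changed as long as the step still has a red light."""
    return statuses[index] is Status.PENDING
```

For example, with `statuses = [Status.DONE, Status.PENDING, Status.PENDING]`, the second step can be run and edited, while the third can be edited but not yet run.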

You can either preconfigure all steps and then run the complete pipeline from beginning to end, or you can run each step separately. The former has the advantage that you can start your analysis and leave the program unattended while it is running. The latter allows you to check the progress of each step so that you can, for example, rerun a step with a different parameter setting.

Pipeline parameters

Raw filtering

Minimum Genome Support discards all reference sequences that obtained fewer than the specified number of reads in total (including shared reads). A higher threshold reduces the number of genomes to be analyzed (lower run time) at the risk of discarding genomes of low-abundance species.

Max. Read Matches discards all reads having matches to more than the specified number of genomes. These reads are uninformative for the clustering step of MicrobeGPS.

Max. Read Mapping Error filters out all read mappings with an error above the specified value. Here, the error is defined as the fraction of mismatches in the total read length.
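To make the three raw filters concrete, here is a minimal Python sketch. It assumes that each read mapping is available as a tuple `(read_id, genome_id, mismatches, read_length)` and that the filters are applied in the order mapping error, read matches, genome support. Both the data layout and the order are assumptions made for this example; the function name `raw_filter` is hypothetical and does not refer to the actual MicrobeGPS implementation.

```python
from collections import defaultdict


def raw_filter(mappings, min_genome_support, max_read_matches, max_mapping_error):
    """Apply the three raw filters to (read_id, genome_id, mismatches, read_length) tuples."""
    # Max. Read Mapping Error: error = mismatches / read length.
    kept = [m for m in mappings if m[2] / m[3] <= max_mapping_error]

    # Max. Read Matches: drop reads that match more genomes than allowed.
    genomes_per_read = defaultdict(set)
    for read_id, genome_id, _, _ in kept:
        genomes_per_read[read_id].add(genome_id)
    kept = [m for m in kept if len(genomes_per_read[m[0]]) <= max_read_matches]

    # Minimum Genome Support: count reads per genome, shared reads included.
    reads_per_genome = defaultdict(set)
    for read_id, genome_id, _, _ in kept:
        reads_per_genome[genome_id].add(read_id)
    return [m for m in kept if len(reads_per_genome[m[1]]) >= min_genome_support]
```

In this sketch a read that exceeds Max. Read Matches no longer counts towards the support of any genome, which is a consequence of the assumed filter order.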

Quality Filtering

Min. Number Unique Reads discards references with fewer than the specified number of unique reads. Here, a read is considered unique when it was mapped to this reference only. A read is also considered unique when it has multiple matches to the same reference (i.e. also on different chromosomes). A higher threshold can further reduce the amount of data to be analyzed, especially in datasets with many noisy read mappings (e.g. when a large fraction originates from completely uncharacterized organisms).
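The definition of a unique read can be illustrated with a short Python sketch. It assumes mappings are given as `(read_id, reference_id, chromosome)` tuples; this layout and the function names are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def count_unique_reads(mappings):
    """Count, per reference, the reads that map to that reference only."""
    references_per_read = defaultdict(set)
    for read_id, reference_id, _chromosome in mappings:
        # Several hits on one reference (e.g. different chromosomes) collapse here.
        references_per_read[read_id].add(reference_id)

    unique = defaultdict(int)
    for references in references_per_read.values():
        if len(references) == 1:
            unique[next(iter(references))] += 1
    return dict(unique)


def references_passing(mappings, min_unique_reads):
    """Return the references with at least min_unique_reads unique reads."""
    counts = count_unique_reads(mappings)
    return {ref for ref, n in counts.items() if n >= min_unique_reads}
```

A read hitting two chromosomes of reference A still counts as one unique read for A, whereas a read hitting both A and B counts for neither.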

Max. Homogeneity puts a threshold on the homogeneity of the distribution of the reads on the genome. The homogeneity of the read distribution over the genome is measured by comparing the read distribution to a uniform distribution using the Kolmogorov-Smirnov test statistic. Organisms with a test statistic higher than the specified value are discarded. Note that this is not a p-value, but the raw Kolmogorov-Smirnov test statistic!
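For readers unfamiliar with the statistic, the following Python sketch computes a one-sample Kolmogorov-Smirnov statistic of read positions against a uniform distribution over the genome. It assumes that each read is represented by a single position between 0 and the genome length; how MicrobeGPS derives these positions internally is not described here, and the function name is hypothetical.

```python
def ks_statistic_uniform(positions, genome_length):
    """Largest gap between the empirical CDF of positions and the uniform CDF."""
    scaled = sorted(p / genome_length for p in positions)
    n = len(scaled)
    statistic = 0.0
    for i, x in enumerate(scaled):
        # At x the empirical CDF jumps from i/n to (i+1)/n; the uniform CDF equals x.
        statistic = max(statistic, (i + 1) / n - x, x - i / n)
    return statistic
```

The value lies between 0 and 1: evenly spread reads give a value close to 0, while reads piled up in one region give a value close to 1. A genome is therefore discarded when its statistic exceeds the Max. Homogeneity setting.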

Calculate Candidates

Min. Genome Validity discards all organisms below the specified validity threshold. The validity is the estimated fraction of the genome that could be covered by reads. This threshold should be kept low (or even zero) when many uncharacterized organisms are expected in the dataset. Higher thresholds may be used to only keep very certain candidates.

Coverage Similarity sets the characteristics of the so-called Core Reads (CR). These are reads mapping to genomes with similar genome coverage depth. This parameter defines the maximum relative coverage difference among all target genomes of a read. A lower coverage similarity parameter requires a narrower range of measured coverages across all genomes a read maps to. This yields fewer CR in total.
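A possible reading of the Core Read criterion is sketched below in Python. The exact formula for the relative coverage difference is not given on this page; the sketch assumes it is `(max - min) / max` over the coverages of all genomes a read maps to. Both this formula and the function name `is_core_read` are assumptions for illustration only.

```python
def is_core_read(target_genomes, coverage, coverage_similarity):
    """Check whether all genomes hit by a read have sufficiently similar coverage.

    target_genomes: genome ids the read maps to
    coverage: dict mapping genome id -> measured coverage depth
    """
    depths = [coverage[g] for g in target_genomes]
    highest, lowest = max(depths), min(depths)
    if highest == 0:
        return False
    # Assumed definition of the relative coverage difference.
    return (highest - lowest) / highest <= coverage_similarity
```

With coverages of 10 and 9, the relative difference under this definition is 0.1, so the read is a Core Read at a setting of 0.2 but not at 0.05, which shows why a lower setting yields fewer CR.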

Fraction Shared USR defines the minimum fraction of CR (or Unique Source Reads, USR) that a reference must share with another reference to be put in the same cluster in the clustering step. This threshold prevents the clustering scheme from accidentally merging two references that have a similar coverage by chance into the same cluster. Lower thresholds should only be set when really required.

Fraction Shared Reads allows putting references in the same cluster that were not joined via the shared CR. Here, the references are required to have a fraction of reads mapping to both references set by this parameter.
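The interplay of the two clustering thresholds can be sketched as follows. This is not the MicrobeGPS clustering algorithm: it is a simplified single-linkage illustration that assumes the shared fraction is measured relative to the smaller of the two read sets and that references are merged whenever either criterion is met. All function and parameter names are hypothetical.

```python
from itertools import combinations


def shared_fraction(reads_a, reads_b):
    """Fraction of the smaller read set that also occurs in the other set (assumption)."""
    if not reads_a or not reads_b:
        return 0.0
    return len(reads_a & reads_b) / min(len(reads_a), len(reads_b))


def cluster_references(core_reads, all_reads, frac_shared_usr, frac_shared_reads):
    """Group references; core_reads and all_reads map reference -> set of read ids."""
    parent = {ref: ref for ref in all_reads}

    def find(ref):
        # Union-find lookup with path compression.
        while parent[ref] != ref:
            parent[ref] = parent[parent[ref]]
            ref = parent[ref]
        return ref

    for a, b in combinations(sorted(all_reads), 2):
        core_a, core_b = core_reads.get(a, set()), core_reads.get(b, set())
        joined_by_core = shared_fraction(core_a, core_b) >= frac_shared_usr
        joined_by_all = shared_fraction(all_reads[a], all_reads[b]) >= frac_shared_reads
        if joined_by_core or joined_by_all:
            parent[find(a)] = find(b)

    clusters = {}
    for ref in all_reads:
        clusters.setdefault(find(ref), set()).add(ref)
    return sorted(sorted(members) for members in clusters.values())
```

In this sketch, two references without common Core Reads can still end up in one cluster when enough of their reads map to both, which is the case Fraction Shared Reads is meant to cover.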
