This module performs a series of sanity checks on the automatically produced bins. It splits bins that need splitting and merges bins that were oversplit. See the parameters below for an explanation of the criteria.
If you are concerned that good bins will be mutilated, you can initially set the status of this module to "skipped" and run it separately after binning. The pipeline logbook provides information on which bins were split and merged and why. If necessary, you can adjust the parameters below.
Seconds.
None
.hmm file used for quality assessment (profileDBforQualityAssessment, String): If unspecified, MetaWatt uses the first .hmm file (in alphabetical order) that it detects.
Minimum taxonomic agreement to merge bins (setTaxonomyThreshold, double, 0.2): Taxonomic agreement is calculated as follows. Each bin has a taxonomic profile calculated from its contigs, consisting of the number of 500 bp fragments with a diamond blastx hit to each taxon. This is the profile that is displayed as a pie diagram in the lower part of the screen. When two bins are compared, all taxa that are represented in either bin are considered. Each taxonomic profile is treated as a vector in hyperspace (one dimension per relevant taxon). These vectors are normalized so that the value for the taxon with the most hits equals 1. Then the two vectors are subtracted. If the remainder in a dimension is lower than the minimum taxonomic agreement (this parameter), that dimension is considered a match and the value of the vector in this dimension is added to a total. If, after comparing all dimensions (taxa), the total is higher than the minimum taxonomic agreement, the bins may be merged (if the other criteria are also acceptable, see below).
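The comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not MetaWatt's actual code: the function names are hypothetical, the "remainder" is taken to be the absolute difference per taxon, and because the description does not say which vector's value is added to the total, the sketch adds the mean of the two normalized values.

```python
def normalize(profile):
    """Scale hit counts so that the taxon with the most hits equals 1."""
    top = max(profile.values(), default=0)
    if top == 0:
        return {taxon: 0.0 for taxon in profile}
    return {taxon: hits / top for taxon, hits in profile.items()}


def taxonomic_agreement(profile_a, profile_b, threshold=0.2):
    """Sum the values of all taxa whose normalized difference is below threshold."""
    a = normalize(profile_a)
    b = normalize(profile_b)
    total = 0.0
    # Consider every taxon that is represented in either bin.
    for taxon in set(a) | set(b):
        value_a = a.get(taxon, 0.0)
        value_b = b.get(taxon, 0.0)
        if abs(value_a - value_b) < threshold:
            # Assumption: the mean of both values is added for a matching taxon.
            total += (value_a + value_b) / 2
    return total


def may_merge_by_taxonomy(profile_a, profile_b, threshold=0.2):
    """True if the taxonomy criterion alone would allow a merge."""
    return taxonomic_agreement(profile_a, profile_b, threshold) > threshold


# Hit counts per taxon (number of 500 bp fragments); values are made up.
bin_a = {"Proteobacteria": 100, "Firmicutes": 10}
bin_b = {"Proteobacteria": 50, "Firmicutes": 4, "Euryarchaeota": 30}
print(may_merge_by_taxonomy(bin_a, bin_b))
```

In this made-up example the two shared taxa match and the third does not, so the total exceeds the default threshold of 0.2 and the taxonomy criterion would not block a merge.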
Minimum conserved genes complementarity to merge bins (setConservedGenesThreshold, double, 0.0): Conserved gene completeness (as a value between 0 and 1), with and without considering the threshold (see module [Six Frame PFAM]), and degree of duplication (as a value between 0 and 1) are calculated for the virtually merged bins A and B. The score is calculated as: (completeness of bins A+B) - (completeness of bin A) + (completeness considering threshold of bins A+B) - (completeness considering threshold of bin A) - 2*((degree of duplication of bins A+B) - (degree of duplication of bin A)).
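As a worked illustration of the score above, here is a minimal Python sketch. The class and function names are hypothetical, and the final comparison (merge allowed when the score is at least the threshold) is an assumption, since the text only calls the parameter a minimum.

```python
from dataclasses import dataclass


@dataclass
class BinQuality:
    """Conserved-gene statistics of a bin, each as a value between 0 and 1."""
    completeness: float              # without considering the threshold
    completeness_thresholded: float  # considering the threshold
    duplication: float               # degree of duplication


def complementarity_score(bin_a, merged):
    """Gain in both completeness values minus twice the gain in duplication."""
    return (
        (merged.completeness - bin_a.completeness)
        + (merged.completeness_thresholded - bin_a.completeness_thresholded)
        - 2 * (merged.duplication - bin_a.duplication)
    )


def may_merge_by_conserved_genes(bin_a, merged, threshold=0.0):
    # Assumption: ">=" is used; the source does not state ">" versus ">=".
    return complementarity_score(bin_a, merged) >= threshold


# Made-up values: merging raises completeness a lot and duplication a little.
quality_a = BinQuality(0.5, 0.4, 0.10)
quality_ab = BinQuality(0.9, 0.8, 0.15)
print(complementarity_score(quality_a, quality_ab))
```

With these made-up numbers the score is about 0.4 + 0.4 - 2*0.05 = 0.7. Because duplication is weighted twice, a merge that mostly adds duplicate conserved genes quickly turns the score negative.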
Threshold for coverage binning (setCovBinningThreshold, double, 0.99): If the cosine between the coverage vectors of two bins is below this value, those bins will not be merged.
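The coverage criterion can be illustrated with a short Python sketch. It assumes that a coverage vector holds one coverage value per sample and that the standard cosine similarity is meant; the function names are hypothetical, and treating an all-zero vector as a cosine of 0 is also an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    if len(u) != len(v):
        raise ValueError("coverage vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: a bin without any coverage never matches another bin.
        return 0.0
    return dot / (norm_u * norm_v)


def coverage_allows_merge(cov_a, cov_b, threshold=0.99):
    """Bins are not merged if the cosine is below the threshold."""
    return cosine_similarity(cov_a, cov_b) >= threshold


# Made-up coverage of two bins across three samples.
print(coverage_allows_merge([10, 20, 30], [20, 40, 60]))  # proportional profiles
print(coverage_allows_merge([10, 0, 0], [0, 10, 0]))      # unrelated profiles
```

The cosine depends only on the shape of the coverage profile, not on its magnitude, so two bins whose coverage differs by a constant factor across all samples still pass this criterion.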
Bin completeness threshold (setBinCompletenessThreshold, double, 1.0%): Completeness is estimated with each of the hmm profile databases provided to the project. If a bin is less complete than this threshold in all databases, it will be destroyed. The value should be specified as a percentage.
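A minimal Python sketch of this rule, assuming completeness is available per database as a percentage. The function name and database names are hypothetical, and keeping a bin when no database is available is an assumption.

```python
def bin_is_destroyed(completeness_by_db, threshold_percent=1.0):
    """True if the bin is below the threshold in every profile database."""
    if not completeness_by_db:
        # Assumption: without any database there is no basis to remove a bin.
        return False
    return all(value < threshold_percent for value in completeness_by_db.values())


# Made-up completeness values in percent, keyed by .hmm database name.
print(bin_is_destroyed({"bacteria.hmm": 0.5, "archaea.hmm": 0.8}))
print(bin_is_destroyed({"bacteria.hmm": 0.5, "archaea.hmm": 35.0}))
```

A bin survives as soon as it reaches the threshold in at least one database, which is why the second made-up bin is kept.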
/output/[sample file name].binning.optimized.sibci
Discussion: Problem [Compute tetranucleotide frequencies] module: No bins?
Wiki: Pipeline modules
Wiki: Six Frame PFAM
Wiki: Strategy hints