After startup, a window is displayed with two menus: File and Tools. On the left side there is a panel with four tabs: "Binning control", "Binning", "Contigs" and "Logbook". On the right is the GC versus coverage plot, and below it an empty area where the characteristics of a selected bin will be shown later.
To get started, select “New” from the File menu, or click the “New project” button.
(Alternatively, you can also open the test binning project "test.metawattproject" in the test folder. This project contains a small artificial metagenome consisting of 5 common marine microorganisms. It has already been binned and annotated but you can redo any step to get a flavor for the program.)
Enter the name of the project and a filename for saving.
After you have created your project, you can add your first sample. Click the “new sample” button next to the project name at the top. Select the fasta file with the assembled contigs and enter a name for the sample.
Under the tab "Binning control" you can now enter the average read length for the sample (important to compute sequencing coverage during annotation), the taxonomy file and the blast database. You can also indicate how many processors you would like to use.
There are also some fields related to the mapping of reads to the contigs. Read mapping can be used to compute frequencies of single nucleotide polymorphisms and to estimate differences in population abundances across samples. It is an experimental feature that I still need to play around with before it can become standard practice, so you can ignore these fields for the moment.
The option "minimum binsize" is used to filter out small bins at the end of the binning procedure. By default, bins smaller than 100 kb are discarded. If this behaviour is undesirable, the minimum binsize can be changed. More importantly, I find it generally quite useful to define a minimum contig size (nt). Shorter contigs will not be binned. Because binning is not very accurate for short DNA fragments, you will find that the quality of your bins improves if you set this threshold at (for example) 500 bp.
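To illustrate what a minimum contig size does, here is a minimal Python sketch that drops contigs shorter than a threshold from a fasta file. It is not Metawatt's own code; the function names `read_fasta` and `filter_by_length` and the example records are assumptions made for this illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=500):
    """Keep only contigs that are at least min_length nucleotides long."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Hypothetical input: one 40 nt contig and one 800 nt contig.
example = [">short", "ACGT" * 10, ">long", "ACGT" * 200]
kept = filter_by_length(read_fasta(example), min_length=500)
print([header for header, _ in kept])
```

With a threshold of 500, only the 800 nt contig remains.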
As explained elsewhere, Metawatt depends on the "blastn" and "makeblastdb" programs from the NCBI Blast suite and on the programs "build-icm" and "simple-score" from the Glimmer package. For mapping reads you also need the programs "bowtie2" and "bowtie2-build". In previous versions, these dependencies were checked automatically at startup. From version 1.6 onward, you can use the option "check dependencies" from the Tools menu to check whether these programs are available to Metawatt.
If this is the first time you run the binner, you need to create the taxonomy file and the blast DB. This can be done via the “tools” menu. Create the taxonomy file first and the blast DB after that. Finally, set the taxonomy file and blast database you just created for your current project.
When you create the taxonomy file, you will be asked for the folder "that contains the reference genbank files". What is meant here is the folder that contains the many subfolders, each holding one reference genome. In other words, the folder where you placed the all.tar.gz file you downloaded from the NCBI. Metawatt version 1.6 also contains a taxonomy file to get you started. This file was generated with the complete genomes available in January 2013.
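If you want to verify the same dependencies outside Metawatt, a minimal Python sketch is shown below. It only looks the program names up on your PATH with the standard library function `shutil.which`; how Metawatt itself locates the programs is not described here, so treat this as an independent check. The names `BINNING_PROGRAMS`, `MAPPING_PROGRAMS` and `find_missing` are assumptions made for this illustration.

```python
import shutil

# Program names as listed in this manual.
BINNING_PROGRAMS = ["blastn", "makeblastdb", "build-icm", "simple-score"]
MAPPING_PROGRAMS = ["bowtie2", "bowtie2-build"]


def find_missing(programs, which=shutil.which):
    """Return the programs that cannot be found on the search path."""
    return [name for name in programs if which(name) is None]


# Report every program that is not reachable from the current PATH.
for name in find_missing(BINNING_PROGRAMS + MAPPING_PROGRAMS):
    print("not found on PATH:", name)
```

If nothing is printed, all six programs were found.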
Under the "Binning control" tab click "compute N4 frequencies" and after that "annotate" to prepare your sample for binning. Annotation consists of the computation of GC content and coverage for each contig. Like the tetranucleotide composition, the GC content is computed directly from the sequence. The coverage is calculated by parsing the number of reads assigned to each contig from the header of the contig fasta file and multiplying the read count by the average read length. Alternatively, the coverage can also be parsed directly from the header line. The regular expressions used are:
(?:(?:numreads)|(?:read_count))[=_](\d+)
and
cov[a-z]*?[=_]([\d\.]+)
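To make the annotation values concrete, here is a minimal Python sketch (not Metawatt's own code) that computes GC content and applies the two patterns above to a header line. The patterns are copied verbatim. Everything else is an assumption made for this illustration: the function names, the example headers, the case-sensitive matching, the division by contig length to turn "read count times read length" into a per-base coverage, and the rule that a read count takes precedence when both fields are present.

```python
import re

# The two patterns quoted in the text above.
READ_COUNT_RE = re.compile(r"(?:(?:numreads)|(?:read_count))[=_](\d+)")
COVERAGE_RE = re.compile(r"cov[a-z]*?[=_]([\d\.]+)")


def gc_content(sequence):
    """Fraction of G and C in the sequence (0.0 for an empty sequence)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def coverage_from_header(header, contig_length, avg_read_length):
    """Estimate coverage from a fasta header, or return None if no field matches."""
    match = READ_COUNT_RE.search(header)
    if match:
        # Assumption: total sequenced bases divided by the contig length.
        return int(match.group(1)) * avg_read_length / contig_length
    match = COVERAGE_RE.search(header)
    if match:
        # Coverage stated directly in the header.
        return float(match.group(1))
    return None


print(gc_content("ACGT"))                                                   # 0.5
print(coverage_from_header("contig00001 length=1500 numreads=300", 1500, 100))  # 20.0
print(coverage_from_header("NODE_1_length_5000_cov_12.5", 5000, 100))       # 12.5
```

A header that contains neither a read count nor a coverage field yields `None`, so it is worth checking your assembler's header format before annotating.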
During annotation, the contigs are also split into 500 bp fragments and each fragment is blasted separately (blastn e-value cutoff 1e-3) against the blast database that you prepared. Based on the blast results and the taxonomy file, a taxonomic profile is calculated for each contig.
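The fragmentation step can be pictured with the following minimal Python sketch. It is only an illustration: the function name `fragment` is hypothetical, and keeping the shorter piece left over at the end of a contig is an assumption, since how Metawatt treats that last piece is not stated here.

```python
def fragment(sequence, size=500):
    """Split a sequence into consecutive, non-overlapping pieces of at most `size` nt."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


# A hypothetical 1200 nt contig gives two full fragments and one shorter remainder.
pieces = fragment("A" * 1200)
print([len(piece) for piece in pieces])  # [500, 500, 200]
```

Each piece would then be used as a separate blastn query, which is why a long contig can contribute many hits to its taxonomic profile.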
After annotation, move to the "Binning" tab. At the top you see two buttons, the left one labeled "Binset" and the right one labeled "Taxon rank". Below these buttons, you will see three lists with bins, the top one labeled "Shortlist", the middle one "Generated bins" and the bottom one "Taxa". Click on the "Binset" button and select "None". In the "Generated bins" list select bin 0. This is a "bin" that contains all your assembled contigs. You can now inspect the taxonomic composition of your metagenome in the bottom panel on the right. In the top panel you can inspect the GC/coverage plot and (hopefully) you see some "clouds" that already hint at the presence of distinct, binnable populations. In the "Taxa" list you should see the different taxa that were discovered by blast and some information about their GC content, N50, etc. When you click on any of these taxa, you will see the position of the contigs assigned to the selected taxon highlighted in blue on the GC/coverage plot. You can sort the taxa in the list by clicking on the header of the list, for example by name, N50 value or coverage. The taxonomic rank visualized can be set by clicking on the "Taxon rank" button at the top of the "Binning" tab.
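For readers unfamiliar with the N50 value shown in these lists, the following minimal Python sketch uses the common definition: the length of the contig at which the running total, counted from the longest contig down, first reaches half of the total length. That Metawatt uses exactly this definition is an assumption, and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Total length is 54; 10 + 9 + 8 = 27 reaches half of it, so the N50 is 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

A high N50 for a taxon or bin means that most of its sequence sits in long contigs, which generally makes the binning more reliable.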
It is also worthwhile to select the "Contigs" tab. Here you find a list of the contigs in the currently selected bin or assigned to the currently selected taxon. Again, you can sort the contigs in the list by clicking on the header of the different columns.
Move back to the "Binning control" tab and click the "Bin with tetranucleotides" button. The contigs are now binned based on their tetranucleotide composition. Binning should be fast (seconds, at most a few minutes). Again, you can inspect the results under the "Binning" tab. As described in the paper, binning is performed at three confidence levels: high, medium and low. High confidence will produce small bins. Sometimes this is good, but sometimes you will find the data are "overbinned", i.e. a single population is split over multiple bins. In that case, the medium and low confidence binning results may be better. You can view the results at each level of confidence by clicking the "Binset" button at the top of the "Binning" tab. At each confidence level you will find a different set of bins in the "Generated bins" list. When you click on a bin in this list, you can observe the position of the contigs of the bin on the GC/coverage plot and the taxonomic composition in the panel below the plot.
The panel showing the pie with the taxonomic composition has a 4-button navigation field in its top left corner. This field contains a back and a forward button. These buttons work like the back and forward buttons in your web browser: you can go back to the previously selected bin, etc. The up and down buttons in the navigator take you to a higher or lower level of confidence. This way, you can easily see which level is most appropriate for this population. You can also click on the taxon squares in the legend of the pie. When you click on a given taxon, the panel shows you how this taxon is distributed over the different bins. This way you can easily see whether the taxon was correctly "captured" by the bin.
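To give an idea of what a tetranucleotide composition is, here is a minimal Python sketch that counts all overlapping 4-mers in a contig and converts the counts to relative frequencies. It does not reproduce the binning algorithm itself, which is described in the paper. The function name is hypothetical, and skipping windows with ambiguous bases and not merging reverse complements are assumptions made for this illustration.

```python
from collections import Counter


def tetranucleotide_frequencies(sequence):
    """Relative frequency of each overlapping 4-mer made up of A, C, G and T only."""
    seq = sequence.upper()
    counts = Counter(
        seq[i:i + 4]
        for i in range(len(seq) - 3)
        if set(seq[i:i + 4]) <= set("ACGT")  # skip windows containing N etc.
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# "ACGTACGT" contains 5 windows: ACGT (twice), CGTA, GTAC and TACG.
freqs = tetranucleotide_frequencies("ACGTACGT")
print(freqs["ACGT"])  # 0.4
```

Contigs from the same population tend to have similar frequency profiles, which is the signal that tetranucleotide binning relies on. It is also why short contigs, with few windows to count, bin poorly.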
When you have decided that a certain bin is the best representation of a certain population, you can shortlist it by right-clicking the pie, or by right-clicking the bin in the list and selecting "add to shortlist". The bin is now added to your shortlist. Shortlisted bins are also indicated on the GC/coverage plot as a dot with a name and two bars indicating the distribution of GC content and coverage of the contigs that define the bin. You can fill your shortlist with all binnable populations you find.
Note that the shortlist is shared by all samples!
When you are done, right-click the shortlist to perform IMM binning with the shortlisted bins. Metawatt will ask you whether it should automatically complement your selected bins with those parts of your metagenome that you did not select. This way Metawatt creates additional IMM models that serve as negative models to prevent allocation of these non-interesting contigs to your IMM bins. The IMM bins can be inspected by selecting "IMM" as the binset. IMM binning is often, but not always, better than tetranucleotide binning. After IMM binning, you can replace bins in your shortlist with the IMM bins. Note that IMM binning is performed for all samples of the project.
If you see a clear cloud in the GC/coverage plot that is obviously not correctly binned, or see obvious taxonomic contamination of your bins, you can also create bins from the GC/coverage plot and add them to your shortlist. This can be done by selecting contigs on the GC/coverage plot by dragging with the mouse. Such a bin could for example be used for IMM binning. In the contig list, you can also manually select contigs and add them to, or remove them from, shortlisted bins. The various actions of the popup menus can also be used creatively to manipulate your bins in a way you judge correct. Of course you need to describe these actions in a reasonable way for your presentation.
It is important to carefully inspect the coverages and e-values reported for each taxon and compare them to each other. It is also insightful to observe how many of the contig fragments actually produced a blast hit. When your population is remote from all reference genomes, it will have only very few blast hits, with poor e-values, to scattered taxa. When your population is closer to a reference genome, it will have many blast hits with good e-values, and the hits will be to only a single genus or a few closely related genera. Recovery and inspection of 16S sequences can be very useful. It can also be useful to search for relevant draft genomes and add them to your taxonomy file and blast database.
Markov model for the bin. Via the popup menu you can also save the bin's contigs as a fasta file.
The taxonomy pies can be exported as svg graphics for use in presentations and publications. Unfortunately the GC/coverage plot can currently only be captured with a screenshot (Batik library version 1.7 does not export bitmap images). The GC/coverage plot has a few options on its right side that can be used to produce a good image. The bin lists can also be exported as csv for further processing of bin characteristics in other programs.