|
From: Ino de B. <ino...@sc...> - 2012-09-20 13:18:08
|
Hello everybody, I have been using MUMmer to align a metavelvet assembly of Illumina reads against a metagenomic reference of ~60 species. What I want to assess is how well the assembly covers the reference metagenome. I run nucmer only once with the metavelvet contigs as a query and all the reference genomes of the metagenome combined in one file. So far I have been using only the --maxmatch parameter for nucmer and the default parameters. Afterwards I filter with delta-filter -g, to get the longest mutually consistent set for all reference-query pairs. What I can easily extract then is for each contig how "pure" it is, i.e. the query coverage. The incorrect bases or gaps could come from (1) errors in the sequencing process i.e. misreading bases/contamination, or (2) misassembly where reads belonging to different species have been combined into one species. What I would like to know is: - Is just using nucmer --maxmatch with default parameters a good idea for determining this query coverage per contig for metagenomics or do you recommend other parameters? - When using delta-filter -g, what if the longest mutually consistent is equal for a query against two genomes in the reference metagenome e.g. a query that represents a shared region. Are both alignments outputted in that case or is simply one chosen randomly? It doesn't matter for determining the query coverage, since they will be equal, but it would be interesting to know. - How does the extension step work exactly, I understand Smith-Waterman is used and that --breaklen allows you to specify when it should stop trying to extend the alignment in case of a negatively scoring aligment (right?), but what are the substitution and gap scoring schemes applied? And how does that relate to the alignment identity? At what minimum identity score is the extension stopped? Maybe I should read the articles better but I couldn't really find that. - I would also like to know how much of each genome in the reference metagenome is covered by the contigs. What would be a good way to do this? Maybe delta-filter -r, sort on reference genome with show-coords -r and count the non-overlapping bases? I reckon this would include repeats since I choose --maxmatch and it would be no problem for shared regions between genomes either since delta-filter -r checks for the longest consistent match per reference genome. Or should I not do any delta-filtering at all? I guess I don't really understand what -r does, does it only remove query sequences that overlap? Hope someone can help and thanks a lot for making the software, it's great! Best regards, Ino de Bruijn |