Re: [MUMmer-help] overlapping matches when calling mummer -mum


Hi Afif,
Nucmer alignments will often overlap for the same reason MUMs can overlap
(repeats). In the example you give the alignments overlap on the reference,
but they do not overlap on the query. delta-filter -q tries to find the
best alignments that cover as much of the query sequence as possible. Thus,
this little 714 bp sequence appears to be duplicated in your query, but not
in your reference. delta-filter keeps it because it explains positions
3744105-3744818 on the query.
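
The overlap described above can be checked directly from the three show-coords lines quoted further down. This is only a sketch: it assumes the first four columns are reference start, reference end, query start, and query end (consistent with the query positions 3744105-3744818 named above), and the helper names are made up for illustration.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Count positions shared by two closed, 1-based intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# (ref_start, ref_end, qry_start, qry_end), copied from the quoted
# show-coords output; the column order is an assumption.
alignments = [
    (3569553, 3701911, 3570622, 3702980),
    (3701734, 3741905, 3702919, 3743090),
    (3741192, 3741905, 3744105, 3744818),
]


def pairwise_overlaps(rows):
    """Return (row_x, row_y, ref_overlap, qry_overlap) for every pair of rows."""
    result = []
    for x in range(len(rows)):
        for y in range(x + 1, len(rows)):
            rs1, re1, qs1, qe1 = rows[x]
            rs2, re2, qs2, qe2 = rows[y]
            result.append((x, y,
                           overlap(rs1, re1, rs2, re2),
                           overlap(qs1, qe1, qs2, qe2)))
    return result


for x, y, ref_ov, qry_ov in pairwise_overlaps(alignments):
    print(f"rows {x} and {y}: {ref_ov} bp shared on reference, "
          f"{qry_ov} bp shared on query")
```

For the second and third rows this gives 714 shared reference positions and none on the query, which is the case discussed above.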

I promise you will find many similar examples as you continue to play with
the data. These types of alignments are very common when aligning
assemblies to references (due either to mis-assembly or true variants), and
sorting them all out is a very difficult problem.

Best,
-Adam

On Mon, Nov 2, 2015 at 7:01 PM, Afif Elghraoui <ael...@sd...> wrote:

> Hi, Adam,
> Thanks for your prompt response. Please see below inline--
>
> On 11/02/2015 02:56 PM, Adam Phillippy wrote:
> > This is the intended behavior of the program. I believe you are
> > misinterpreting the meaning of a maximal unique match. A maximal match
> > is defined as a match that cannot be extended on either end without
> > encountering a mismatch. Just because a MUM contains a repetitive
> > sequence, does not make the whole MUM non-unique. For the two
> > sequences with unique seqs 'U' and tandem repeats 'T':
> >
> > A: UUUTTTUUU
> > B: UUUTTUUU
> >
> > There would be MUMs found on either side: UUUTT and TTUUU. These two
> > MUMs overlap on both T's in B and overlap on the middle T in A. Both
> > MUMs are unique, i.e. they don't appear anywhere else as a whole ...
> > though they do contain substrings that are repetitive. I hope this
> > little example makes it a little more clear.
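
The U/T example above can be made concrete with a brute-force search. This is a minimal sketch, not how MUMmer itself finds matches: the flanks ACG and GCA stand in for the unique 'U' regions, a single T stands in for each repeat unit 'T', and find_mums is a hypothetical helper that is far too slow for real genomes.

```python
def count_occurrences(text, sub):
    """Count occurrences of sub in text, including overlapping ones."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def find_mums(a, b, min_len=1):
    """Return (pos_in_a, pos_in_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right until a mismatch or a sequence end.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            match = a[i:i + k]
            # Keep the match only if it occurs exactly once in each sequence.
            if (k >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, k))
    return mums


seq_a = "ACG" + "TTT" + "GCA"  # stands in for UUUTTTUUU
seq_b = "ACG" + "TT" + "GCA"   # stands in for UUUTTUUU

for i, j, k in find_mums(seq_a, seq_b):
    print(seq_a[i:i + k], "at A:", i, "and B:", j)
```

With these stand-in sequences the two matches are ACGTT and TTGCA. They share the middle T in A and both Ts in B, yet each occurs only once as a whole.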
>
> Yes, thanks. That makes it clear.
>
> > Nucmer or dnadiff are the best tools in MUMmer for comparing contigs
> > to a reference. I prefer to run:
> > > nucmer -maxmatch -banded ref.fna contigs.fna
> > > delta-filter -q out.delta > out.qdelta
> > > show-coords -THrcl out.qdelta > out.coords
> >
> > This will report the best alignments found for each contig to the
> > reference, which you can parse to identify large differences or run
> > show-snps to identify smaller polymorphisms.
> >
>
> I have a single contig, but I've already tried these before I switched
> to the plain mummer program. These are the various settings I used for
> nucmer: (Some of these parameter settings were chosen based on my
> mistaken understanding of a MUM)
>
> nucmer01/Makefile:NUCMER = nucmer --maxmatch --banded -D 10
> nucmer02/Makefile:NUCMER = nucmer --mumreference --banded -D 10
> nucmer03/Makefile:NUCMER = nucmer --mum --banded -D 10
> nucmer04/Makefile:NUCMER = nucmer --maxmatch --banded -D 5
> nucmer05/Makefile:NUCMER = nucmer -c 40
> nucmer06/Makefile:NUCMER = nucmer --maxmatch --banded -D 10 -c 40
> nucmer07/Makefile:NUCMER = nucmer --maxmatch
> nucmer08/Makefile:NUCMER = nucmer --forward
> nucmer09/Makefile:NUCMER = nucmer --banded -D 1
> nucmer10/Makefile:NUCMER = nucmer --noextend
> nucmer11/Makefile:NUCMER = nucmer --noextend --mum
>
> My nucmer04/ attempt matches your example if the default value of -D is
> 5, as the nucmer -help text says. In any case, my out.qcoords
> (show-coords based on delta-filter -q output) for that attempt has these
> corresponding lines:
>
> 3569553 3701911 3570622 3702980 132359  132359  99.99   4419977 4425991 2.99    2.99 ...
> 3701734 3741905 3702919 3743090 40172   40172   99.98   4419977 4425991 0.91    0.91 ...
> 3741192 3741905 3744105 3744818 714     714     98.60   4419977 4425991 0.02    0.02 ...
>
> I thought that delta-filter -q would throw out the alignment
> corresponding to the middle line there since it overlaps another
> alignment on the query. Is delta-filter -q supposed to do that?
>
> Basically, my goal is to break the variant detection (both large and
> small) into steps to get a better handle of it for automatic processing
> of dozens of completed assemblies. For my first step, I want to know
> what exactly should be inserted or deleted from the reference in order
> to transform it into the query. After that, my next step would be to
> process all the differences and annotate/categorize them.
>
> I was apparently able to achieve my first step using plain old Unix diff
> once I appropriately formatted my files for it, though.
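
The diff-based first step mentioned above could look roughly like the sketch below. It uses Python's difflib rather than Unix diff, and how the files were actually formatted for diff is not stated in the thread. The function names and the toy sequences are illustrative only, and difflib is much too slow for megabase-scale genomes.

```python
import difflib


def edit_script(ref, qry):
    """List the edits that turn ref into qry.

    Each entry is (tag, ref_start, ref_end, replacement), with 0-based,
    end-exclusive reference coordinates.
    """
    matcher = difflib.SequenceMatcher(None, ref, qry, autojunk=False)
    return [(tag, i1, i2, qry[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_edits(ref, edits):
    """Rebuild the query by applying the edits to the reference in order."""
    pieces = []
    pos = 0
    for _tag, i1, i2, replacement in edits:
        pieces.append(ref[pos:i1])   # unchanged stretch before the edit
        pieces.append(replacement)   # empty string for a pure deletion
        pos = i2
    pieces.append(ref[pos:])
    return "".join(pieces)


reference = "ACGTTTGCA"  # toy sequences, not real data
query = "ACGTTGCA"

edits = edit_script(reference, query)
for tag, i1, i2, replacement in edits:
    print(tag, f"ref[{i1}:{i2}]", repr(reference[i1:i2]), "->", repr(replacement))

assert apply_edits(reference, edits) == query
```

Applying the edits back to the reference and comparing with the query is a cheap sanity check before moving on to annotating the differences.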
>
> Many thanks and regards
> Afif
>
>
>
> ------------------------------------------------------------------------------
> _______________________________________________
> MUMmer-help mailing list
> MUM...@li...
> https://lists.sourceforge.net/lists/listinfo/mummer-help
>