Re: [MUMmer-help] summary statistics for draft assembly alignments?

Hi Jacqueline,
Sorry, dnadiff hasn't made it into the online docs yet, but it is
described in the source distribution, in README and
docs/dnadiff.README.

I have some experience aligning bird genomes with MUMmer, and it can be
very computationally intensive. MUMmer was originally designed for
bacterial genome alignment, and thus doesn't scale all that well to large
genomes -- but it can be done.

The first question is how similar are your genomes? Nucmer doesn't perform
very well if the similarity drops below ~90% identity.

Second, you will want to run it in the "-mumreference" mode with the zebra
finch genome as the reference. This excludes repetitive matches that
would otherwise make the runtime prohibitive. Also, depending on your
available memory, you may need to align the chromosomes one at a time or in
batches (e.g. one finch chromosome as the reference per run of nucmer). If
you have enough RAM to do it all in one go:

> nucmer -mumreference zebrafinch.fasta yourgenome.fasta
> show-coords -THrcl out.delta > out.coords
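
If RAM is the limit instead, the one-chromosome-per-run approach can be
scripted. Here is a rough Python sketch, not something shipped with MUMmer:
split_fasta and nucmer_command are hypothetical helper names, and the file
names are placeholders. It writes each reference sequence to its own FASTA
file and builds one nucmer command per file, using the documented
"--mumreference" spelling and the "-p" output-prefix option. It only builds
the commands; running them (e.g. with subprocess.run) is left to you.

```python
import os


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir.

    Returns the list of files written, in input order.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                # Use the first word of the header as the file name.
                tokens = line[1:].split()
                name = tokens[0] if tokens else "record%d" % (len(paths) + 1)
                path = os.path.join(out_dir, name.replace("/", "_") + ".fasta")
                out = open(path, "w")
                paths.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths


def nucmer_command(reference, query, prefix):
    """Build (but do not run) a nucmer call; output goes to <prefix>.delta."""
    return ["nucmer", "--mumreference", "-p", prefix, reference, query]
```

Each run then leaves its own <prefix>.delta file, which you pass to
show-coords in the same way as out.delta above.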

The show-coords step produces a tab-delimited set of alignments for you to
analyze further. Usually, I would generate summary statistics like the ones
you mentioned using dnadiff, but I think that script would take too long on
your dataset (it also reports all SNPs, among other things). If you want to
try it, no guarantees, you can run it like so:

> dnadiff -d out.delta

And it will analyze the alignment (delta) file you generated previously.
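
If dnadiff turns out to be too slow, a rough version of the contig-level
numbers can be computed from out.coords directly. The Python sketch below is
only an illustration, not part of MUMmer: summarize_coords is a hypothetical
name, and it assumes the 13-column layout I would expect from
"show-coords -THrcl" (S1, E1, S2, E2, LEN1, LEN2, %IDY, LENR, LENQ, COVR,
COVQ, reference tag, query tag), so check that against your own file first.
Overlapping alignments on the same contig are merged so that no base is
counted twice.

```python
from collections import defaultdict


def covered_length(intervals):
    """Total length covered by 1-based inclusive intervals, overlaps merged."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


def summarize_coords(lines):
    """Summarize query-side coverage from show-coords -THrcl lines."""
    intervals = defaultdict(list)  # query contig -> aligned intervals
    lengths = {}                   # query contig -> contig length (LENQ)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip blank or unexpected lines
        s2, e2 = int(fields[2]), int(fields[3])
        query = fields[12]
        lengths[query] = int(fields[8])
        # Reverse-strand hits have S2 > E2, so normalize the order.
        intervals[query].append((min(s2, e2), max(s2, e2)))
    aligned_bases = sum(covered_length(iv) for iv in intervals.values())
    contig_length = sum(lengths.values())
    percent = 100.0 * aligned_bases / contig_length if contig_length else 0.0
    return {
        "aligned_contigs": len(intervals),
        "aligned_bases": aligned_bases,
        "aligned_contig_length": contig_length,
        "percent_aligned": percent,
    }
```

You could call it as summarize_coords(open("out.coords")). Contigs with no
alignment at all never appear in out.coords, so the unaligned counts have to
come from comparing against the full contig list in your assembly FASTA.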

The big caveat here is that by running nucmer in the -mumreference mode,
many repeats will not be aligned. You'll have to keep this in mind when
compiling your statistics.

If this all seems to take too long using these tools, you can try another
aligner that scales better for large genomes, like BLAT.

Best,
-Adam

On Mon, Dec 16, 2013 at 3:10 PM, Jacqueline R M Doyle <jm...@pu...> wrote:

> Hi,
>
> I have recently done a couple different assemblies of an avian genome and
> a reviewer has suggested aligning the two assemblies to the zebra finch
> genome and seeing which assembly aligns best.  The idea here is that the
> assembly that overlaps most closely with the zebra finch genome is probably
> the best one to use for downstream analyses.  I'd like to align each
> assembly to the zebra finch genome using nucmer and then generate some
> summary statistics like number of aligned/unaligned contigs, total
> aligned/unaligned length, percent of aligned bases, etc.  What is the best
> way to go about generating this type of data?  I found references to a
> script called "dnadiff" in the email help archives, but couldn't find the
> script referenced in the MUMmer 3 manual.
>
> Best wishes...