From: Adam P. <aph...@gm...> - 2017-08-21 12:12:55
Hi Manish,

Your parameter set `nucmer --maxmatch -c 100 -b 500 -l 50` will seed an alignment for every pair of repeats of ~100 bp or larger. There are quite a few of them in the human genome :)

The problem with repeats is that the program considers all pairs of them, so if you have 10 copies of a repeat, that leads to 10*10=100 alignments, and so on. This quadratic relationship is what makes the runtime so bad.

If you really need to find all those repeats, you could try breaking up the genome into multiple pieces and parallelizing the search. Otherwise, you can lower the sensitivity of the search by further increasing the -l and -c options, or by using the --mumreference option, which uses only 'unique' seeds and therefore avoids aligning many of the repeats. I usually run nucmer with --mumreference when dealing with large, repetitive genomes.

Best,
-Adam

On Fri, Aug 18, 2017 at 10:24 AM, Manish Goel <go...@mp...> wrote:
> Hi All,
>
> I am trying to run nucmer to align two human genomes using:
>
> nucmer --maxmatch -c 100 -b 500 -l 50 refGenome queryGenome
>
> The program starts and runs fine but gets stuck at the last step (finishing
> data).
>
> delta =
> running NUCMER
> 1: PREPARING DATA
> 2,3: RUNNING mummer AND CREATING CLUSTERS
> # reading input file "out.ntref" of length 3088286426
> # construct suffix tree for sequence of length 3088286426
> # (maximum reference length is 2305843009213693948)
> # (maximum query length is 18446744073709551615)
> # process 30882864 characters per dot
> #...........................................................
> .........................................
> # CONSTRUCTIONTIME ****/software/lib/MUMmer3.23/mummer out.ntref 4578.29
> # reading input file ********** of length 3088496978
> # matching query-file ************
> # against subject-file "out.ntref"
> # COMPLETETIME ****/software/lib/MUMmer3.23/mummer out.ntref 37778.09
> # SPACE *****/software/lib/MUMmer3.23/mummer out.ntref 6013.53
> 4: FINISHING DATA
>
> It is writing the out.delta file, but it seems to be doing so very
> slowly. I let the program run for more than 3 months (no kidding) before I
> killed the job. I started a new job more than 10 days ago; it is still
> running, with the out.delta file at more than 1.2 GB and growing. The last
> edit to out.delta happened 18 hours before I wrote this email. I know that
> my nucmer installation is working, as I have successfully aligned multiple
> plant genomes, although they also spent around 90% of their running time in
> the "Finishing data" step.
>
> Any suggestions as to why nucmer is showing this behavior and how to
> resolve it? I would hypothesize that it is because the human genome is
> quite large, but I don't want to believe that nucmer would take more than
> 3 months to align it.
>
> Thanks for your time and efforts.
>
> Regards
>
> Manish Goel
>
> _______________________________________________
> MUMmer-help mailing list
> MUM...@li...
> https://lists.sourceforge.net/lists/listinfo/mummer-help
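
The suggestion in the reply above to break the genome into pieces and parallelize the search could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the helper names `split_fasta` and `nucmer_commands` and the `out_<name>` prefix scheme are hypothetical, and the sketch assumes the query is a multi-FASTA file that is split one record per file while the reference is left whole. The nucmer options are the ones from the original post, passed through unchanged; the script only builds the command lines and does not run them.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths.

    Files are named after the first word of each header line, so this assumes
    headers are non-empty and contain no path separators.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new record starts: close the previous piece, open the next.
                if out is not None:
                    out.close()
                name = line[1:].split()[0]
                path = out_dir / f"{name}.fa"
                out = open(path, "w")
                paths.append(path)
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths


def nucmer_commands(ref_path, query_pieces,
                    options=("--maxmatch", "-c", "100", "-b", "500", "-l", "50")):
    """Build one nucmer command line per query piece (nothing is executed)."""
    commands = []
    for piece in query_pieces:
        # Hypothetical naming scheme; -p sets the output prefix so that the
        # jobs do not all write to the default "out.delta".
        prefix = f"out_{Path(piece).stem}"
        commands.append(
            ["nucmer", *options, "-p", prefix, str(ref_path), str(piece)]
        )
    return commands
```

Each command list could then be handed to `subprocess.run` or to a cluster scheduler, one job per piece. Splitting the query rather than the reference is a design choice made here, not something stated in the thread: it keeps the reference intact, at the cost of every job building the reference suffix tree again.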