Re: [Dart-help] help with windowlicker
Brought to you by:
ihh
From: Ian H. <ih...@be...> - 2008-04-28 19:30:48
|
Hi Marcin... So, I never ran xrate on an alignment this big before (6 million columns!) and it turned out there was a drastic inefficiency in the distance matrix code, that was causing it to scale as the square of the number of columns. Unnoticeable for most alignments (since the code inside the loop was pretty quick), but your example caused it to hang... I think it was quite possible that your xrate job was getting killed because it was running too long. Are you running queue management software of some kind (e.g. Sun Grid Engine)? That can often kill long-running jobs. I couldn't find a specific bug that was crashing on this data, although I didn't finish running it because it was taking sooooo long. Anyway, I've now fixed the inefficiency bug. The tree estimation by neighbor-joining now finishes pretty fast. However, it now takes a long time to complete the second step of the tree-estimation procedure, which is to refine the branch lengths of the tree using the EM algorithm. There is less that I can do to optimize this. The neighbor-joining code is not heavily dependent on the alignment length, but the branch-length refinement is. I could optimize the branch-length code in the same way that the neighbor-joining is optimized, so it was less dependent on alignment length, but that would take a little while, and unfortunately I don't have the time right now. Instead, what I have done is introduced a subtle hack. If you specify the command-line option "--maxrounds 0", then this will bypass the branch-length refinement code (specifically, what it does is tell xrate that you want the EM algorithm to run for a maximum of zero iterations, or "rounds"). Thus, with "--maxrounds 0" the tree will be estimated using neighbor-joining only. Sorry this has taken so long to fix! Like I said, I never used xrate on an alignment this big before (for our screens, we pre-specified the tree, so we went directly to windowlicker). Hopefully, this will now work for you, but let me know if you have further issues. Cheers, Ian marcin joachimiak wrote: > hey, > > Thanks for the help! > The short example worked just fine. > > log3 with my previous command so far produced this: > Finding eigenvectors using MatrixExpEigenPrepare > Largest eigenvalue was 0; set to 0 > Updating cached M() & J() > Finding eigenvectors using MatrixExpEigenPrepare > Largest eigenvalue was 0; set to 0 > Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) > Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) > Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0)) > Updating cached M() & J() > Sequence_database: converting ASCII sequences to score profiles > Processed 3 sequences > Estimating tree for the following sequences: CP000112 CP000527 AE017286 > Estimating distances of sequences 0 and 1 > > You can download the FASTA and Stock files here: > http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz > http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz > > I'll have more updates on Monday -- enjoy the weekend, > marcin > - Show quoted text - > > On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote: >> Thanks Marcin. >> >> Something has definitely gone wrong if no alignment is output. Can you >> check that the following outputs an alignment+tree? >> >> xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg >> ~/dart/src/ecfg/t/notree.stk >> >> >> Looking at your log messages it seems like the program has crashed while >> trying to do neighbor-joining to estimate the tree. For the above command, I >> get log messages after the "Estimating tree..." message. See below (BTW many >> of these are on stderr, not stdout) >> >> If you can reproduce this on the short example pairwise alignment but not >> with the 3M alignments, there's something else wrong that needs >> investigation. In that case, can you first try running "-log 3" instead of >> "-log 5" (this will give more verbose log messages during the >> neighbor-joining algorithm), then mail the stderr log messages to me, and/or >> please post a URL to your large file so I can download it and attempt to >> reproduce the error. >> >> Sorry about this... >> Ian >> >> >> >> >> Warning -- used random number generator during initialisation >> Read 1 alignments >> Sequence_database: converting ASCII sequences to score profiles >> Estimating tree for the following sequences: X Y >> Optimizing branch lengths of all trees in alignment database, using EM >> Optimizing tree branch lengths by EM. >> Tree before optimization: >> (X:5.0000000000,Y:5.0000000000)root; >> EM iteration #1: log-likelihood = -56.0101 bits >> Optimized tree after step #1: >> (X:5.1160179765,Y:5.1160180140)root; >> EM iteration #2: log-likelihood = -56.0118 bits >> Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM >> Failed EM improvement threshold for the 1th time; stopping >> Checking post-iteration log-likelihood >> Post-iteration log-likelihood = -56.0118 bits >> Restoring previous best branch lengths >> Tree after optimization: >> (X:5.0000000000,Y:5.0000000000)root; >> Alignment database log-likelihood: -56.0101 bits >> Processing alignment Alignment1 (1 of 1): 14 columns >> # STOCKHOLM 1.0 >> #=GF NH (X:5.0000000000,Y:5.0000000000)root; >> X AAAAAAaaUUUUUU >> Y GGGGGGaaCCCCCC >> // >> >> >> >> >> marcin joachimiak wrote: >> >>> Hey there, >>> >>> The xrate tree job took a while but it seemed to finish without >>> errors. Alas there isn't any tree output ... >>> >>> Here's what I get from stdout: >>> >>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e >>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5 >>> Warning -- used random number generator during initialisation >>> Read 1 alignments >>> Sequence_database: converting ASCII sequences to score profiles >>> Estimating tree for the following sequences: CP000112 CP000527 AE017286 >>> >>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a >>> problem right? And this was run on a hefty machine ... >>> >>> >>> marcin >>> >>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak >>> <mar...@gm...> wrote: >>> >>>> much better, thanks! >>>> I did notice the trees for each alignment in the output ... >>>> >>>> The tree estimation is running, I'll keep reporting ... >>>> >>>> >>>> >>>> >>>> On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote: >>>> > marcin joachimiak wrote: >>>> > > Excellent, its working now! >>>> > >>>> > Cool! >>>> > >>>> > >>>> > > I'll have the tree data soon and may try a run with as well, but >> best >>>> > > if I could know that the other tree is different from the xrate >> one >>>> > > with the Jukes-Cantor model. Can I get the Jukes-Cantor model >>>> > > tree/distances from xrate? >>>> > >>>> > Yes, it will insert the trees into the alignment files it outputs, >> using >>>> > the New-Hampshire-encoded-in-Stockholm format from my previous >> email. >>>> > >>>> > One caveat. The way you're running it now, it is calculating a >> separate >>>> > tree *for every window* in the alignment. This may not be what you >> want, >>>> > and it's certainly going to slow things down. >>>> > >>>> > If you want to calculate the tree just once for the whole alignment, >> type >>>> > >>>> > xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg >>>> > --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock >>>> > >>>> > >>>> > The "--noannotate" just means "don't waste time trying to annotate >> this >>>> > alignment, I just want the tree". The "-log 5" will just print some >> log >>>> > messages on stderr, so you can watch the progress of the >> tree-building >>>> > algorithms. >>>> > >>>> > hth, >>>> > >>>> > I. >>>> > >>>> >>>> |