Re: [Dart-help] help with windowlicker
Brought to you by:
ihh
From: marcin j. <mar...@gm...> - 2008-04-29 17:48:58
|
Hey hey, Indeed, the alignment size limitation makes sense and was one of the things I was suspecting. The job was run on a machine w.o queue management but I will investigate whether it could have been killed by other means. The 0 iteration option sounds like it will do the trick -- so I just need to checkout the DART code from cvs again, right? Thanks for adding this in and I definitely understand the time constraints etc., this should suffice for my purposes. I could of course try to obtain a tree in other ways, but actually I'm trying to turn this into a pipeline that can run on multi-genome alignments, its almost there. cheers, marcin On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote: > Hi Marcin... > > So, I never ran xrate on an alignment this big before (6 million > columns!) and it turned out there was a drastic inefficiency in the > distance matrix code, that was causing it to scale as the square of the > number of columns. Unnoticeable for most alignments (since the code > inside the loop was pretty quick), but your example caused it to hang... > > I think it was quite possible that your xrate job was getting killed > because it was running too long. Are you running queue management > software of some kind (e.g. Sun Grid Engine)? That can often kill > long-running jobs. I couldn't find a specific bug that was crashing on > this data, although I didn't finish running it because it was taking > sooooo long. > > Anyway, I've now fixed the inefficiency bug. The tree estimation by > neighbor-joining now finishes pretty fast. However, it now takes a long > time to complete the second step of the tree-estimation procedure, which > is to refine the branch lengths of the tree using the EM algorithm. > > There is less that I can do to optimize this. The neighbor-joining code > is not heavily dependent on the alignment length, but the branch-length > refinement is. I could optimize the branch-length code in the same way > that the neighbor-joining is optimized, so it was less dependent on > alignment length, but that would take a little while, and unfortunately > I don't have the time right now. > > Instead, what I have done is introduced a subtle hack. If you specify > the command-line option "--maxrounds 0", then this will bypass the > branch-length refinement code (specifically, what it does is tell xrate > that you want the EM algorithm to run for a maximum of zero iterations, > or "rounds"). > > Thus, with "--maxrounds 0" the tree will be estimated using > neighbor-joining only. > > Sorry this has taken so long to fix! Like I said, I never used xrate on > an alignment this big before (for our screens, we pre-specified the > tree, so we went directly to windowlicker). > > Hopefully, this will now work for you, but let me know if you have > further issues. > > Cheers, > Ian > > > > > marcin joachimiak wrote: > > hey, > > > > Thanks for the help! > > The short example worked just fine. > > > > log3 with my previous command so far produced this: > > Finding eigenvectors using MatrixExpEigenPrepare > > Largest eigenvalue was 0; set to 0 > > Updating cached M() & J() > > Finding eigenvectors using MatrixExpEigenPrepare > > Largest eigenvalue was 0; set to 0 > > Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) > > Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) > > Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0)) > > Updating cached M() & J() > > Sequence_database: converting ASCII sequences to score profiles > > Processed 3 sequences > > Estimating tree for the following sequences: CP000112 CP000527 AE017286 > > Estimating distances of sequences 0 and 1 > > > > You can download the FASTA and Stock files here: > > http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz > > http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz > > > > I'll have more updates on Monday -- enjoy the weekend, > > marcin > > - Show quoted text - > > > > On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote: > >> Thanks Marcin. > >> > >> Something has definitely gone wrong if no alignment is output. Can you > >> check that the following outputs an alignment+tree? > >> > >> xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg > >> ~/dart/src/ecfg/t/notree.stk > >> > >> > >> Looking at your log messages it seems like the program has crashed while > >> trying to do neighbor-joining to estimate the tree. For the above command, I > >> get log messages after the "Estimating tree..." message. See below (BTW many > >> of these are on stderr, not stdout) > >> > >> If you can reproduce this on the short example pairwise alignment but not > >> with the 3M alignments, there's something else wrong that needs > >> investigation. In that case, can you first try running "-log 3" instead of > >> "-log 5" (this will give more verbose log messages during the > >> neighbor-joining algorithm), then mail the stderr log messages to me, and/or > >> please post a URL to your large file so I can download it and attempt to > >> reproduce the error. > >> > >> Sorry about this... > >> Ian > >> > >> > >> > >> > >> Warning -- used random number generator during initialisation > >> Read 1 alignments > >> Sequence_database: converting ASCII sequences to score profiles > >> Estimating tree for the following sequences: X Y > >> Optimizing branch lengths of all trees in alignment database, using EM > >> Optimizing tree branch lengths by EM. > >> Tree before optimization: > >> (X:5.0000000000,Y:5.0000000000)root; > >> EM iteration #1: log-likelihood = -56.0101 bits > >> Optimized tree after step #1: > >> (X:5.1160179765,Y:5.1160180140)root; > >> EM iteration #2: log-likelihood = -56.0118 bits > >> Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM > >> Failed EM improvement threshold for the 1th time; stopping > >> Checking post-iteration log-likelihood > >> Post-iteration log-likelihood = -56.0118 bits > >> Restoring previous best branch lengths > >> Tree after optimization: > >> (X:5.0000000000,Y:5.0000000000)root; > >> Alignment database log-likelihood: -56.0101 bits > >> Processing alignment Alignment1 (1 of 1): 14 columns > >> # STOCKHOLM 1.0 > >> #=GF NH (X:5.0000000000,Y:5.0000000000)root; > >> X AAAAAAaaUUUUUU > >> Y GGGGGGaaCCCCCC > >> // > >> > >> > >> > >> > >> marcin joachimiak wrote: > >> > >>> Hey there, > >>> > >>> The xrate tree job took a while but it seemed to finish without > >>> errors. Alas there isn't any tree output ... > >>> > >>> Here's what I get from stdout: > >>> > >>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e > >>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5 > >>> Warning -- used random number generator during initialisation > >>> Read 1 alignments > >>> Sequence_database: converting ASCII sequences to score profiles > >>> Estimating tree for the following sequences: CP000112 CP000527 AE017286 > >>> > >>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a > >>> problem right? And this was run on a hefty machine ... > >>> > >>> > >>> marcin > >>> > >>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak > >>> <mar...@gm...> wrote: > >>> > >>>> much better, thanks! > >>>> I did notice the trees for each alignment in the output ... > >>>> > >>>> The tree estimation is running, I'll keep reporting ... > >>>> > >>>> > >>>> > >>>> > >>>> On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote: > >>>> > marcin joachimiak wrote: > >>>> > > Excellent, its working now! > >>>> > > >>>> > Cool! > >>>> > > >>>> > > >>>> > > I'll have the tree data soon and may try a run with as well, but > >> best > >>>> > > if I could know that the other tree is different from the xrate > >> one > >>>> > > with the Jukes-Cantor model. Can I get the Jukes-Cantor model > >>>> > > tree/distances from xrate? > >>>> > > >>>> > Yes, it will insert the trees into the alignment files it outputs, > >> using > >>>> > the New-Hampshire-encoded-in-Stockholm format from my previous > >> email. > >>>> > > >>>> > One caveat. The way you're running it now, it is calculating a > >> separate > >>>> > tree *for every window* in the alignment. This may not be what you > >> want, > >>>> > and it's certainly going to slow things down. > >>>> > > >>>> > If you want to calculate the tree just once for the whole alignment, > >> type > >>>> > > >>>> > xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg > >>>> > --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock > >>>> > > >>>> > > >>>> > The "--noannotate" just means "don't waste time trying to annotate > >> this > >>>> > alignment, I just want the tree". The "-log 5" will just print some > >> log > >>>> > messages on stderr, so you can watch the progress of the > >> tree-building > >>>> > algorithms. > >>>> > > >>>> > hth, > >>>> > > >>>> > I. > >>>> > > >>>> > >>>> > |