Re: [Dart-help] help with windowlicker
Brought to you by:
ihh
From: marcin j. <mar...@gm...> - 2008-05-19 21:09:53
|
hey Looks like I now have windowlicker output with 49500 alignment and trees like: // # STOCKHOLM 1.0 #=GF NH (CP000527:0.1287975598,(CP000112:0.0000000010,AE017286:0.0000000 010)node_3:0.1287875646)root; #=GF SC_max_rev -643.47 CP000112 ACAAAGTTTGCTATCCCTTCTTTGAGCGGAGTTCCGGGCTTGAAGCCGGTGTCGGCAATCAGGT CP000527 ACGAAGGCGGCTACGCCCTCTTCGACCGTGGTCGCGGGGCGGAAGCCCGTGTCAGCGATGAGGT AE017286 ---------------------------------------------------------------- CP000112 CGTCCACATTGGCATATGTTGCTTCCACATCGCCGGGCTGCATGGGCATCATGTTTTTAACGGC CP000527 CGTCGACGTCGGCGTAGGTCGCTGCCACGTCGCCGGGCTGCATGGGCAGCATGTTCTTGACCGC AE017286 ---------------------------------------------------------------- CP000112 CTTTTTGCCCAGGCATTCTTCCAGAGTCGCGATGAACTCGCCAAGCTCTACCGTGTTGTTGTTG CP000527 CTTCTTGCCGAGGCACTCTTCGAGTACTTCGATGAACCTGCCGAGTTCGACGGTGTTGTTGTTC AE017286 ---------------------------------------------------------------- CP000112 CCAATGTT CP000527 CCGATGTT AE017286 -------- But I'll need a little guidance as to what these are are and how to interpret them. Arre these ~50,000 ncRNA predictions? For completeness, here's the command I used to generate the output: perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg &> windowlicker.out thanks, marcin On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote: > Hi Marcin... > > So, I never ran xrate on an alignment this big before (6 million > columns!) and it turned out there was a drastic inefficiency in the > distance matrix code, that was causing it to scale as the square of the > number of columns. Unnoticeable for most alignments (since the code > inside the loop was pretty quick), but your example caused it to hang... > > I think it was quite possible that your xrate job was getting killed > because it was running too long. Are you running queue management > software of some kind (e.g. Sun Grid Engine)? That can often kill > long-running jobs. I couldn't find a specific bug that was crashing on > this data, although I didn't finish running it because it was taking > sooooo long. > > Anyway, I've now fixed the inefficiency bug. The tree estimation by > neighbor-joining now finishes pretty fast. However, it now takes a long > time to complete the second step of the tree-estimation procedure, which > is to refine the branch lengths of the tree using the EM algorithm. > > There is less that I can do to optimize this. The neighbor-joining code > is not heavily dependent on the alignment length, but the branch-length > refinement is. I could optimize the branch-length code in the same way > that the neighbor-joining is optimized, so it was less dependent on > alignment length, but that would take a little while, and unfortunately > I don't have the time right now. > > Instead, what I have done is introduced a subtle hack. If you specify > the command-line option "--maxrounds 0", then this will bypass the > branch-length refinement code (specifically, what it does is tell xrate > that you want the EM algorithm to run for a maximum of zero iterations, > or "rounds"). > > Thus, with "--maxrounds 0" the tree will be estimated using > neighbor-joining only. > > Sorry this has taken so long to fix! Like I said, I never used xrate on > an alignment this big before (for our screens, we pre-specified the > tree, so we went directly to windowlicker). > > Hopefully, this will now work for you, but let me know if you have > further issues. > > Cheers, > Ian > > > marcin joachimiak wrote: >> hey, >> >> Thanks for the help! >> The short example worked just fine. >> >> log3 with my previous command so far produced this: >> Finding eigenvectors using MatrixExpEigenPrepare >> Largest eigenvalue was 0; set to 0 >> Updating cached M() & J() >> Finding eigenvectors using MatrixExpEigenPrepare >> Largest eigenvalue was 0; set to 0 >> Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) >> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) >> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0)) >> Updating cached M() & J() >> Sequence_database: converting ASCII sequences to score profiles >> Processed 3 sequences >> Estimating tree for the following sequences: CP000112 CP000527 AE017286 >> Estimating distances of sequences 0 and 1 >> >> You can download the FASTA and Stock files here: >> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz >> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz >> >> I'll have more updates on Monday -- enjoy the weekend, >> marcin >> - Show quoted text - >> >> On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote: >>> Thanks Marcin. >>> >>> Something has definitely gone wrong if no alignment is output. Can you >>> check that the following outputs an alignment+tree? >>> >>> xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg >>> ~/dart/src/ecfg/t/notree.stk >>> >>> >>> Looking at your log messages it seems like the program has crashed while >>> trying to do neighbor-joining to estimate the tree. For the above command, I >>> get log messages after the "Estimating tree..." message. See below (BTW many >>> of these are on stderr, not stdout) >>> >>> If you can reproduce this on the short example pairwise alignment but not >>> with the 3M alignments, there's something else wrong that needs >>> investigation. In that case, can you first try running "-log 3" instead of >>> "-log 5" (this will give more verbose log messages during the >>> neighbor-joining algorithm), then mail the stderr log messages to me, and/or >>> please post a URL to your large file so I can download it and attempt to >>> reproduce the error. >>> >>> Sorry about this... >>> Ian >>> >>> >>> >>> >>> Warning -- used random number generator during initialisation >>> Read 1 alignments >>> Sequence_database: converting ASCII sequences to score profiles >>> Estimating tree for the following sequences: X Y >>> Optimizing branch lengths of all trees in alignment database, using EM >>> Optimizing tree branch lengths by EM. >>> Tree before optimization: >>> (X:5.0000000000,Y:5.0000000000)root; >>> EM iteration #1: log-likelihood = -56.0101 bits >>> Optimized tree after step #1: >>> (X:5.1160179765,Y:5.1160180140)root; >>> EM iteration #2: log-likelihood = -56.0118 bits >>> Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM >>> Failed EM improvement threshold for the 1th time; stopping >>> Checking post-iteration log-likelihood >>> Post-iteration log-likelihood = -56.0118 bits >>> Restoring previous best branch lengths >>> Tree after optimization: >>> (X:5.0000000000,Y:5.0000000000)root; >>> Alignment database log-likelihood: -56.0101 bits >>> Processing alignment Alignment1 (1 of 1): 14 columns >>> # STOCKHOLM 1.0 >>> #=GF NH (X:5.0000000000,Y:5.0000000000)root; >>> X AAAAAAaaUUUUUU >>> Y GGGGGGaaCCCCCC >>> // >>> >>> >>> >>> >>> marcin joachimiak wrote: >>> >>>> Hey there, >>>> >>>> The xrate tree job took a while but it seemed to finish without >>>> errors. Alas there isn't any tree output ... >>>> >>>> Here's what I get from stdout: >>>> >>>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e >>>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5 >>>> Warning -- used random number generator during initialisation >>>> Read 1 alignments >>>> Sequence_database: converting ASCII sequences to score profiles >>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286 >>>> >>>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a >>>> problem right? And this was run on a hefty machine ... >>>> >>>> >>>> marcin >>>> >>>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak >>>> <mar...@gm...> wrote: >>>> >>>>> much better, thanks! >>>>> I did notice the trees for each alignment in the output ... >>>>> >>>>> The tree estimation is running, I'll keep reporting ... >>>>> >>>>> >>>>> >>>>> >>>>> On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote: >>>>> > marcin joachimiak wrote: >>>>> > > Excellent, its working now! >>>>> > >>>>> > Cool! >>>>> > >>>>> > >>>>> > > I'll have the tree data soon and may try a run with as well, but >>> best >>>>> > > if I could know that the other tree is different from the xrate >>> one >>>>> > > with the Jukes-Cantor model. Can I get the Jukes-Cantor model >>>>> > > tree/distances from xrate? >>>>> > >>>>> > Yes, it will insert the trees into the alignment files it outputs, >>> using >>>>> > the New-Hampshire-encoded-in-Stockholm format from my previous >>> email. >>>>> > >>>>> > One caveat. The way you're running it now, it is calculating a >>> separate >>>>> > tree *for every window* in the alignment. This may not be what you >>> want, >>>>> > and it's certainly going to slow things down. >>>>> > >>>>> > If you want to calculate the tree just once for the whole alignment, >>> type >>>>> > >>>>> > xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg >>>>> > --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock >>>>> > >>>>> > >>>>> > The "--noannotate" just means "don't waste time trying to annotate >>> this >>>>> > alignment, I just want the tree". The "-log 5" will just print some >>> log >>>>> > messages on stderr, so you can watch the progress of the >>> tree-building >>>>> > algorithms. >>>>> > >>>>> > hth, >>>>> > >>>>> > I. >>>>> > >>>>> >>>>> > |