Re: [Dart-help] help with windowlicker
Brought to you by:
ihh
From: Ian H. <ih...@be...> - 2008-05-19 22:45:43
|
marcin joachimiak wrote: > hey > > Looks like I now have windowlicker output with 49500 alignment and trees like: > // > # STOCKHOLM 1.0 > #=GF NH (CP000527:0.1287975598,(CP000112:0.0000000010,AE017286:0.0000000 > 010)node_3:0.1287875646)root; > #=GF SC_max_rev -643.47 > CP000112 ACAAAGTTTGCTATCCCTTCTTTGAGCGGAGTTCCGGGCTTGAAGCCGGTGTCGGCAATCAGGT > CP000527 ACGAAGGCGGCTACGCCCTCTTCGACCGTGGTCGCGGGGCGGAAGCCCGTGTCAGCGATGAGGT > AE017286 ---------------------------------------------------------------- > > CP000112 CGTCCACATTGGCATATGTTGCTTCCACATCGCCGGGCTGCATGGGCATCATGTTTTTAACGGC > CP000527 CGTCGACGTCGGCGTAGGTCGCTGCCACGTCGCCGGGCTGCATGGGCAGCATGTTCTTGACCGC > AE017286 ---------------------------------------------------------------- > > CP000112 CTTTTTGCCCAGGCATTCTTCCAGAGTCGCGATGAACTCGCCAAGCTCTACCGTGTTGTTGTTG > CP000527 CTTCTTGCCGAGGCACTCTTCGAGTACTTCGATGAACCTGCCGAGTTCGACGGTGTTGTTGTTC > AE017286 ---------------------------------------------------------------- > > CP000112 CCAATGTT > CP000527 CCGATGTT > AE017286 -------- > > > But I'll need a little guidance as to what these are are and how to > interpret them. Arre these ~50,000 ncRNA predictions? > > > For completeness, here's the command I used to generate the output: > perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl > DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg > &> windowlicker.out Oops. It does not look like you have actually specified a genefinding grammar. The score "SC_max_rev" that's reported indicates that windowlicker was using xrate's default "rev" grammar which is not a genefinder... this is consistent with the command line above. You need to download a genefinding grammar e.g. from here http://biowiki.org/results/ncRNA_predictions/12fly/ncRna_v22.eg ...then change your command line to include this grammar... perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg -g ncRna_v22.eg &> windowlicker.out I am sorry that with all the discussion of how to get the trees right, I/we somehow lost sight of this fairly important point... I think we did mention it initially but while trying to figure out options for the tree estimation, it must have got dropped at some point... I. > > thanks, > marcin > > On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote: >> Hi Marcin... >> >> So, I never ran xrate on an alignment this big before (6 million >> columns!) and it turned out there was a drastic inefficiency in the >> distance matrix code, that was causing it to scale as the square of the >> number of columns. Unnoticeable for most alignments (since the code >> inside the loop was pretty quick), but your example caused it to hang... >> >> I think it was quite possible that your xrate job was getting killed >> because it was running too long. Are you running queue management >> software of some kind (e.g. Sun Grid Engine)? That can often kill >> long-running jobs. I couldn't find a specific bug that was crashing on >> this data, although I didn't finish running it because it was taking >> sooooo long. >> >> Anyway, I've now fixed the inefficiency bug. The tree estimation by >> neighbor-joining now finishes pretty fast. However, it now takes a long >> time to complete the second step of the tree-estimation procedure, which >> is to refine the branch lengths of the tree using the EM algorithm. >> >> There is less that I can do to optimize this. The neighbor-joining code >> is not heavily dependent on the alignment length, but the branch-length >> refinement is. I could optimize the branch-length code in the same way >> that the neighbor-joining is optimized, so it was less dependent on >> alignment length, but that would take a little while, and unfortunately >> I don't have the time right now. >> >> Instead, what I have done is introduced a subtle hack. If you specify >> the command-line option "--maxrounds 0", then this will bypass the >> branch-length refinement code (specifically, what it does is tell xrate >> that you want the EM algorithm to run for a maximum of zero iterations, >> or "rounds"). >> >> Thus, with "--maxrounds 0" the tree will be estimated using >> neighbor-joining only. >> >> Sorry this has taken so long to fix! Like I said, I never used xrate on >> an alignment this big before (for our screens, we pre-specified the >> tree, so we went directly to windowlicker). >> >> Hopefully, this will now work for you, but let me know if you have >> further issues. >> >> Cheers, >> Ian >> >> >> marcin joachimiak wrote: >>> hey, >>> >>> Thanks for the help! >>> The short example worked just fine. >>> >>> log3 with my previous command so far produced this: >>> Finding eigenvectors using MatrixExpEigenPrepare >>> Largest eigenvalue was 0; set to 0 >>> Updating cached M() & J() >>> Finding eigenvectors using MatrixExpEigenPrepare >>> Largest eigenvalue was 0; set to 0 >>> Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) >>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) >>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0)) >>> Updating cached M() & J() >>> Sequence_database: converting ASCII sequences to score profiles >>> Processed 3 sequences >>> Estimating tree for the following sequences: CP000112 CP000527 AE017286 >>> Estimating distances of sequences 0 and 1 >>> >>> You can download the FASTA and Stock files here: >>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz >>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz >>> >>> I'll have more updates on Monday -- enjoy the weekend, >>> marcin >>> - Show quoted text - >>> >>> On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote: >>>> Thanks Marcin. >>>> >>>> Something has definitely gone wrong if no alignment is output. Can you >>>> check that the following outputs an alignment+tree? >>>> >>>> xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg >>>> ~/dart/src/ecfg/t/notree.stk >>>> >>>> >>>> Looking at your log messages it seems like the program has crashed while >>>> trying to do neighbor-joining to estimate the tree. For the above command, I >>>> get log messages after the "Estimating tree..." message. See below (BTW many >>>> of these are on stderr, not stdout) >>>> >>>> If you can reproduce this on the short example pairwise alignment but not >>>> with the 3M alignments, there's something else wrong that needs >>>> investigation. In that case, can you first try running "-log 3" instead of >>>> "-log 5" (this will give more verbose log messages during the >>>> neighbor-joining algorithm), then mail the stderr log messages to me, and/or >>>> please post a URL to your large file so I can download it and attempt to >>>> reproduce the error. >>>> >>>> Sorry about this... >>>> Ian >>>> >>>> >>>> >>>> >>>> Warning -- used random number generator during initialisation >>>> Read 1 alignments >>>> Sequence_database: converting ASCII sequences to score profiles >>>> Estimating tree for the following sequences: X Y >>>> Optimizing branch lengths of all trees in alignment database, using EM >>>> Optimizing tree branch lengths by EM. >>>> Tree before optimization: >>>> (X:5.0000000000,Y:5.0000000000)root; >>>> EM iteration #1: log-likelihood = -56.0101 bits >>>> Optimized tree after step #1: >>>> (X:5.1160179765,Y:5.1160180140)root; >>>> EM iteration #2: log-likelihood = -56.0118 bits >>>> Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM >>>> Failed EM improvement threshold for the 1th time; stopping >>>> Checking post-iteration log-likelihood >>>> Post-iteration log-likelihood = -56.0118 bits >>>> Restoring previous best branch lengths >>>> Tree after optimization: >>>> (X:5.0000000000,Y:5.0000000000)root; >>>> Alignment database log-likelihood: -56.0101 bits >>>> Processing alignment Alignment1 (1 of 1): 14 columns >>>> # STOCKHOLM 1.0 >>>> #=GF NH (X:5.0000000000,Y:5.0000000000)root; >>>> X AAAAAAaaUUUUUU >>>> Y GGGGGGaaCCCCCC >>>> // >>>> >>>> >>>> >>>> >>>> marcin joachimiak wrote: >>>> >>>>> Hey there, >>>>> >>>>> The xrate tree job took a while but it seemed to finish without >>>>> errors. Alas there isn't any tree output ... >>>>> >>>>> Here's what I get from stdout: >>>>> >>>>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e >>>>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5 >>>>> Warning -- used random number generator during initialisation >>>>> Read 1 alignments >>>>> Sequence_database: converting ASCII sequences to score profiles >>>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286 >>>>> >>>>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a >>>>> problem right? And this was run on a hefty machine ... >>>>> >>>>> >>>>> marcin >>>>> >>>>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak >>>>> <mar...@gm...> wrote: >>>>> >>>>>> much better, thanks! >>>>>> I did notice the trees for each alignment in the output ... >>>>>> >>>>>> The tree estimation is running, I'll keep reporting ... >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote: >>>>>> > marcin joachimiak wrote: >>>>>> > > Excellent, its working now! >>>>>> > >>>>>> > Cool! >>>>>> > >>>>>> > >>>>>> > > I'll have the tree data soon and may try a run with as well, but >>>> best >>>>>> > > if I could know that the other tree is different from the xrate >>>> one >>>>>> > > with the Jukes-Cantor model. Can I get the Jukes-Cantor model >>>>>> > > tree/distances from xrate? >>>>>> > >>>>>> > Yes, it will insert the trees into the alignment files it outputs, >>>> using >>>>>> > the New-Hampshire-encoded-in-Stockholm format from my previous >>>> email. >>>>>> > >>>>>> > One caveat. The way you're running it now, it is calculating a >>>> separate >>>>>> > tree *for every window* in the alignment. This may not be what you >>>> want, >>>>>> > and it's certainly going to slow things down. >>>>>> > >>>>>> > If you want to calculate the tree just once for the whole alignment, >>>> type >>>>>> > >>>>>> > xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg >>>>>> > --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock >>>>>> > >>>>>> > >>>>>> > The "--noannotate" just means "don't waste time trying to annotate >>>> this >>>>>> > alignment, I just want the tree". The "-log 5" will just print some >>>> log >>>>>> > messages on stderr, so you can watch the progress of the >>>> tree-building >>>>>> > algorithms. >>>>>> > >>>>>> > hth, >>>>>> > >>>>>> > I. >>>>>> > >>>>>> >>>>>> |