Re: [Dart-help] help with windowlicker
Brought to you by:
ihh
From: marcin j. <mar...@gm...> - 2008-05-19 22:55:53
|
No problem, thanks again for all the hand-holding. I've started up a new job with the gene-finding grammar which will take at least a few hours based on the last (default grammar) run. So what is the default 'rev' grammar useful for? Just so I have a better context, what can I expect to get out of this given that the genomes are from fairly uncharacterized and potentially unusual sulfate-reducing anaerobic bacteria? Also, the three genomes have one close pair and two distant pairs. thanks. marcin On Mon, May 19, 2008 at 3:44 PM, Ian Holmes <ih...@be...> wrote: > marcin joachimiak wrote: >> hey >> >> Looks like I now have windowlicker output with 49500 alignment and trees like: >> // >> # STOCKHOLM 1.0 >> #=GF NH (CP000527:0.1287975598,(CP000112:0.0000000010,AE017286:0.0000000 >> 010)node_3:0.1287875646)root; >> #=GF SC_max_rev -643.47 >> CP000112 ACAAAGTTTGCTATCCCTTCTTTGAGCGGAGTTCCGGGCTTGAAGCCGGTGTCGGCAATCAGGT >> CP000527 ACGAAGGCGGCTACGCCCTCTTCGACCGTGGTCGCGGGGCGGAAGCCCGTGTCAGCGATGAGGT >> AE017286 ---------------------------------------------------------------- >> >> CP000112 CGTCCACATTGGCATATGTTGCTTCCACATCGCCGGGCTGCATGGGCATCATGTTTTTAACGGC >> CP000527 CGTCGACGTCGGCGTAGGTCGCTGCCACGTCGCCGGGCTGCATGGGCAGCATGTTCTTGACCGC >> AE017286 ---------------------------------------------------------------- >> >> CP000112 CTTTTTGCCCAGGCATTCTTCCAGAGTCGCGATGAACTCGCCAAGCTCTACCGTGTTGTTGTTG >> CP000527 CTTCTTGCCGAGGCACTCTTCGAGTACTTCGATGAACCTGCCGAGTTCGACGGTGTTGTTGTTC >> AE017286 ---------------------------------------------------------------- >> >> CP000112 CCAATGTT >> CP000527 CCGATGTT >> AE017286 -------- >> >> >> But I'll need a little guidance as to what these are are and how to >> interpret them. Arre these ~50,000 ncRNA predictions? >> >> >> For completeness, here's the command I used to generate the output: >> perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl >> DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg >> &> windowlicker.out > > > Oops. It does not look like you have actually specified a genefinding > grammar. The score "SC_max_rev" that's reported indicates that > windowlicker was using xrate's default "rev" grammar which is not a > genefinder... this is consistent with the command line above. > > You need to download a genefinding grammar e.g. from here > > http://biowiki.org/results/ncRNA_predictions/12fly/ncRna_v22.eg > > ...then change your command line to include this grammar... > > perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl > DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg > -g ncRna_v22.eg &> windowlicker.out > > > I am sorry that with all the discussion of how to get the trees right, > I/we somehow lost sight of this fairly important point... I think we did > mention it initially but while trying to figure out options for the tree > estimation, it must have got dropped at some point... > > I. > > > > > >> >> thanks, >> marcin >> >> On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote: >>> Hi Marcin... >>> >>> So, I never ran xrate on an alignment this big before (6 million >>> columns!) and it turned out there was a drastic inefficiency in the >>> distance matrix code, that was causing it to scale as the square of the >>> number of columns. Unnoticeable for most alignments (since the code >>> inside the loop was pretty quick), but your example caused it to hang... >>> >>> I think it was quite possible that your xrate job was getting killed >>> because it was running too long. Are you running queue management >>> software of some kind (e.g. Sun Grid Engine)? That can often kill >>> long-running jobs. I couldn't find a specific bug that was crashing on >>> this data, although I didn't finish running it because it was taking >>> sooooo long. >>> >>> Anyway, I've now fixed the inefficiency bug. The tree estimation by >>> neighbor-joining now finishes pretty fast. However, it now takes a long >>> time to complete the second step of the tree-estimation procedure, which >>> is to refine the branch lengths of the tree using the EM algorithm. >>> >>> There is less that I can do to optimize this. The neighbor-joining code >>> is not heavily dependent on the alignment length, but the branch-length >>> refinement is. I could optimize the branch-length code in the same way >>> that the neighbor-joining is optimized, so it was less dependent on >>> alignment length, but that would take a little while, and unfortunately >>> I don't have the time right now. >>> >>> Instead, what I have done is introduced a subtle hack. If you specify >>> the command-line option "--maxrounds 0", then this will bypass the >>> branch-length refinement code (specifically, what it does is tell xrate >>> that you want the EM algorithm to run for a maximum of zero iterations, >>> or "rounds"). >>> >>> Thus, with "--maxrounds 0" the tree will be estimated using >>> neighbor-joining only. >>> >>> Sorry this has taken so long to fix! Like I said, I never used xrate on >>> an alignment this big before (for our screens, we pre-specified the >>> tree, so we went directly to windowlicker). >>> >>> Hopefully, this will now work for you, but let me know if you have >>> further issues. >>> >>> Cheers, >>> Ian >>> >>> >>> marcin joachimiak wrote: >>>> hey, >>>> >>>> Thanks for the help! >>>> The short example worked just fine. >>>> >>>> log3 with my previous command so far produced this: >>>> Finding eigenvectors using MatrixExpEigenPrepare >>>> Largest eigenvalue was 0; set to 0 >>>> Updating cached M() & J() >>>> Finding eigenvectors using MatrixExpEigenPrepare >>>> Largest eigenvalue was 0; set to 0 >>>> Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) >>>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0)) >>>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0)) >>>> Updating cached M() & J() >>>> Sequence_database: converting ASCII sequences to score profiles >>>> Processed 3 sequences >>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286 >>>> Estimating distances of sequences 0 and 1 >>>> >>>> You can download the FASTA and Stock files here: >>>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz >>>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz >>>> >>>> I'll have more updates on Monday -- enjoy the weekend, >>>> marcin >>>> - Show quoted text - >>>> >>>> On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote: >>>>> Thanks Marcin. >>>>> >>>>> Something has definitely gone wrong if no alignment is output. Can you >>>>> check that the following outputs an alignment+tree? >>>>> >>>>> xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg >>>>> ~/dart/src/ecfg/t/notree.stk >>>>> >>>>> >>>>> Looking at your log messages it seems like the program has crashed while >>>>> trying to do neighbor-joining to estimate the tree. For the above command, I >>>>> get log messages after the "Estimating tree..." message. See below (BTW many >>>>> of these are on stderr, not stdout) >>>>> >>>>> If you can reproduce this on the short example pairwise alignment but not >>>>> with the 3M alignments, there's something else wrong that needs >>>>> investigation. In that case, can you first try running "-log 3" instead of >>>>> "-log 5" (this will give more verbose log messages during the >>>>> neighbor-joining algorithm), then mail the stderr log messages to me, and/or >>>>> please post a URL to your large file so I can download it and attempt to >>>>> reproduce the error. >>>>> >>>>> Sorry about this... >>>>> Ian >>>>> >>>>> >>>>> >>>>> >>>>> Warning -- used random number generator during initialisation >>>>> Read 1 alignments >>>>> Sequence_database: converting ASCII sequences to score profiles >>>>> Estimating tree for the following sequences: X Y >>>>> Optimizing branch lengths of all trees in alignment database, using EM >>>>> Optimizing tree branch lengths by EM. >>>>> Tree before optimization: >>>>> (X:5.0000000000,Y:5.0000000000)root; >>>>> EM iteration #1: log-likelihood = -56.0101 bits >>>>> Optimized tree after step #1: >>>>> (X:5.1160179765,Y:5.1160180140)root; >>>>> EM iteration #2: log-likelihood = -56.0118 bits >>>>> Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM >>>>> Failed EM improvement threshold for the 1th time; stopping >>>>> Checking post-iteration log-likelihood >>>>> Post-iteration log-likelihood = -56.0118 bits >>>>> Restoring previous best branch lengths >>>>> Tree after optimization: >>>>> (X:5.0000000000,Y:5.0000000000)root; >>>>> Alignment database log-likelihood: -56.0101 bits >>>>> Processing alignment Alignment1 (1 of 1): 14 columns >>>>> # STOCKHOLM 1.0 >>>>> #=GF NH (X:5.0000000000,Y:5.0000000000)root; >>>>> X AAAAAAaaUUUUUU >>>>> Y GGGGGGaaCCCCCC >>>>> // >>>>> >>>>> >>>>> >>>>> >>>>> marcin joachimiak wrote: >>>>> >>>>>> Hey there, >>>>>> >>>>>> The xrate tree job took a while but it seemed to finish without >>>>>> errors. Alas there isn't any tree output ... >>>>>> >>>>>> Here's what I get from stdout: >>>>>> >>>>>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e >>>>>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5 >>>>>> Warning -- used random number generator during initialisation >>>>>> Read 1 alignments >>>>>> Sequence_database: converting ASCII sequences to score profiles >>>>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286 >>>>>> >>>>>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a >>>>>> problem right? And this was run on a hefty machine ... >>>>>> >>>>>> >>>>>> marcin >>>>>> >>>>>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak >>>>>> <mar...@gm...> wrote: >>>>>> >>>>>>> much better, thanks! >>>>>>> I did notice the trees for each alignment in the output ... >>>>>>> >>>>>>> The tree estimation is running, I'll keep reporting ... >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote: >>>>>>> > marcin joachimiak wrote: >>>>>>> > > Excellent, its working now! >>>>>>> > >>>>>>> > Cool! >>>>>>> > >>>>>>> > >>>>>>> > > I'll have the tree data soon and may try a run with as well, but >>>>> best >>>>>>> > > if I could know that the other tree is different from the xrate >>>>> one >>>>>>> > > with the Jukes-Cantor model. Can I get the Jukes-Cantor model >>>>>>> > > tree/distances from xrate? >>>>>>> > >>>>>>> > Yes, it will insert the trees into the alignment files it outputs, >>>>> using >>>>>>> > the New-Hampshire-encoded-in-Stockholm format from my previous >>>>> email. >>>>>>> > >>>>>>> > One caveat. The way you're running it now, it is calculating a >>>>> separate >>>>>>> > tree *for every window* in the alignment. This may not be what you >>>>> want, >>>>>>> > and it's certainly going to slow things down. >>>>>>> > >>>>>>> > If you want to calculate the tree just once for the whole alignment, >>>>> type >>>>>>> > >>>>>>> > xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg >>>>>>> > --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock >>>>>>> > >>>>>>> > >>>>>>> > The "--noannotate" just means "don't waste time trying to annotate >>>>> this >>>>>>> > alignment, I just want the tree". The "-log 5" will just print some >>>>> log >>>>>>> > messages on stderr, so you can watch the progress of the >>>>> tree-building >>>>>>> > algorithms. >>>>>>> > >>>>>>> > hth, >>>>>>> > >>>>>>> > I. >>>>>>> > >>>>>>> >>>>>>> > |