Re: [Dart-help] help with windowlicker

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

No problem, thanks again for all the hand-holding. I've started up a
new job with the gene-finding grammar which will take at least a few
hours based on the last (default grammar) run. So what is the default
'rev' grammar useful for?

Just so I have a better context, what can I expect to get out of this
given that the genomes are from fairly uncharacterized and potentially
unusual sulfate-reducing anaerobic bacteria? Also, the three genomes
have one close pair and two distant pairs.

thanks.
marcin

On Mon, May 19, 2008 at 3:44 PM, Ian Holmes <ih...@be...> wrote:
> marcin joachimiak wrote:
>> hey
>>
>> Looks like I now have windowlicker output with 49500 alignment and trees like:
>> //
>> # STOCKHOLM 1.0
>> #=GF         NH (CP000527:0.1287975598,(CP000112:0.0000000010,AE017286:0.0000000
>> 010)node_3:0.1287875646)root;
>> #=GF SC_max_rev -643.47
>>      CP000112   ACAAAGTTTGCTATCCCTTCTTTGAGCGGAGTTCCGGGCTTGAAGCCGGTGTCGGCAATCAGGT
>>      CP000527   ACGAAGGCGGCTACGCCCTCTTCGACCGTGGTCGCGGGGCGGAAGCCCGTGTCAGCGATGAGGT
>>      AE017286   ----------------------------------------------------------------
>>
>>      CP000112   CGTCCACATTGGCATATGTTGCTTCCACATCGCCGGGCTGCATGGGCATCATGTTTTTAACGGC
>>      CP000527   CGTCGACGTCGGCGTAGGTCGCTGCCACGTCGCCGGGCTGCATGGGCAGCATGTTCTTGACCGC
>>      AE017286   ----------------------------------------------------------------
>>
>>      CP000112   CTTTTTGCCCAGGCATTCTTCCAGAGTCGCGATGAACTCGCCAAGCTCTACCGTGTTGTTGTTG
>>      CP000527   CTTCTTGCCGAGGCACTCTTCGAGTACTTCGATGAACCTGCCGAGTTCGACGGTGTTGTTGTTC
>>      AE017286   ----------------------------------------------------------------
>>
>>      CP000112   CCAATGTT
>>      CP000527   CCGATGTT
>>      AE017286   --------
>>
>>
>> But I'll need a little guidance as to what these are are and how to
>> interpret them. Arre these ~50,000 ncRNA predictions?
>>
>>
>> For completeness, here's the command I used to generate the output:
>> perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl
>> DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg
>> &> windowlicker.out
>
>
> Oops. It does not look like you have actually specified a genefinding
> grammar. The score "SC_max_rev" that's reported indicates that
> windowlicker was using xrate's default "rev" grammar which is not a
> genefinder... this is consistent with the command line above.
>
> You need to download a genefinding grammar e.g. from here
>
> http://biowiki.org/results/ncRNA_predictions/12fly/ncRna_v22.eg
>
> ...then change your command line to include this grammar...
>
> perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl
> DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg
> -g ncRna_v22.eg &> windowlicker.out
>
>
> I am sorry that with all the discussion of how to get the trees right,
> I/we somehow lost sight of this fairly important point... I think we did
> mention it initially but while trying to figure out options for the tree
> estimation, it must have got dropped at some point...
>
> I.
>
>
>
>
>
>>
>> thanks,
>> marcin
>>
>> On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote:
>>> Hi Marcin...
>>>
>>> So, I never ran xrate on an alignment this big before (6 million
>>> columns!) and it turned out there was a drastic inefficiency in the
>>> distance matrix code, that was causing it to scale as the square of the
>>> number of columns. Unnoticeable for most alignments (since the code
>>> inside the loop was pretty quick), but your example caused it to hang...
>>>
>>> I think it was quite possible that your xrate job was getting killed
>>> because it was running too long. Are you running queue management
>>> software of some kind (e.g. Sun Grid Engine)? That can often kill
>>> long-running jobs. I couldn't find a specific bug that was crashing on
>>> this data, although I didn't finish running it because it was taking
>>> sooooo long.
>>>
>>> Anyway, I've now fixed the inefficiency bug. The tree estimation by
>>> neighbor-joining now finishes pretty fast. However, it now takes a long
>>> time to complete the second step of the tree-estimation procedure, which
>>> is to refine the branch lengths of the tree using the EM algorithm.
>>>
>>> There is less that I can do to optimize this. The neighbor-joining code
>>> is not heavily dependent on the alignment length, but the branch-length
>>> refinement is. I could optimize the branch-length code in the same way
>>> that the neighbor-joining is optimized, so it was less dependent on
>>> alignment length, but that would take a little while, and unfortunately
>>> I don't have the time right now.
>>>
>>> Instead, what I have done is introduced a subtle hack. If you specify
>>> the command-line option "--maxrounds 0", then this will bypass the
>>> branch-length refinement code (specifically, what it does is tell xrate
>>> that you want the EM algorithm to run for a maximum of zero iterations,
>>> or "rounds").
>>>
>>> Thus, with "--maxrounds 0" the tree will be estimated using
>>> neighbor-joining only.
>>>
>>> Sorry this has taken so long to fix! Like I said, I never used xrate on
>>> an alignment this big before (for our screens, we pre-specified the
>>> tree, so we went directly to windowlicker).
>>>
>>> Hopefully, this will now work for you, but let me know if you have
>>> further issues.
>>>
>>> Cheers,
>>> Ian
>>>
>>>
>>> marcin joachimiak wrote:
>>>> hey,
>>>>
>>>> Thanks for the help!
>>>> The short example worked just fine.
>>>>
>>>> log3 with my previous command so far produced this:
>>>> Finding eigenvectors using MatrixExpEigenPrepare
>>>> Largest eigenvalue was 0; set to 0
>>>> Updating cached M() & J()
>>>> Finding eigenvectors using MatrixExpEigenPrepare
>>>> Largest eigenvalue was 0; set to 0
>>>> Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>>>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>>>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0))
>>>> Updating cached M() & J()
>>>> Sequence_database: converting ASCII sequences to score profiles
>>>> Processed 3 sequences
>>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286
>>>> Estimating distances of sequences 0 and 1
>>>>
>>>> You can download the FASTA and Stock files here:
>>>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz
>>>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz
>>>>
>>>> I'll have more updates on Monday -- enjoy the weekend,
>>>> marcin
>>>> - Show quoted text -
>>>>
>>>> On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote:
>>>>> Thanks Marcin.
>>>>>
>>>>>  Something has definitely gone wrong if no alignment is output. Can you
>>>>> check that the following outputs an alignment+tree?
>>>>>
>>>>>  xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg
>>>>> ~/dart/src/ecfg/t/notree.stk
>>>>>
>>>>>
>>>>>  Looking at your log messages it seems like the program has crashed while
>>>>> trying to do neighbor-joining to estimate the tree. For the above command, I
>>>>> get log messages after the "Estimating tree..." message. See below (BTW many
>>>>> of these are on stderr, not stdout)
>>>>>
>>>>>  If you can reproduce this on the short example pairwise alignment but not
>>>>> with the 3M alignments, there's something else wrong that needs
>>>>> investigation. In that case, can you first try running "-log 3" instead of
>>>>> "-log 5" (this will give more verbose log messages during the
>>>>> neighbor-joining algorithm), then mail the stderr log messages to me, and/or
>>>>> please post a URL to your large file so I can download it and attempt to
>>>>> reproduce the error.
>>>>>
>>>>>  Sorry about this...
>>>>>  Ian
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  Warning -- used random number generator during initialisation
>>>>>  Read 1 alignments
>>>>>  Sequence_database: converting ASCII sequences to score profiles
>>>>>  Estimating tree for the following sequences: X Y
>>>>>  Optimizing branch lengths of all trees in alignment database, using EM
>>>>>  Optimizing tree branch lengths by EM.
>>>>>  Tree before optimization:
>>>>>  (X:5.0000000000,Y:5.0000000000)root;
>>>>>  EM iteration #1: log-likelihood = -56.0101 bits
>>>>>  Optimized tree after step #1:
>>>>>  (X:5.1160179765,Y:5.1160180140)root;
>>>>>  EM iteration #2: log-likelihood = -56.0118 bits
>>>>>  Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM
>>>>>  Failed EM improvement threshold for the 1th time; stopping
>>>>>  Checking post-iteration log-likelihood
>>>>>  Post-iteration log-likelihood = -56.0118 bits
>>>>>  Restoring previous best branch lengths
>>>>>  Tree after optimization:
>>>>>  (X:5.0000000000,Y:5.0000000000)root;
>>>>>  Alignment database log-likelihood: -56.0101 bits
>>>>>  Processing alignment Alignment1 (1 of 1): 14 columns
>>>>>  # STOCKHOLM 1.0
>>>>>  #=GF NH (X:5.0000000000,Y:5.0000000000)root;
>>>>>  X       AAAAAAaaUUUUUU
>>>>>  Y       GGGGGGaaCCCCCC
>>>>>  //
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  marcin joachimiak wrote:
>>>>>
>>>>>> Hey there,
>>>>>>
>>>>>> The xrate tree job took a while but it seemed to finish without
>>>>>> errors. Alas there isn't any tree output ...
>>>>>>
>>>>>> Here's what I get from stdout:
>>>>>>
>>>>>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e
>>>>>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5
>>>>>> Warning -- used random number generator during initialisation
>>>>>> Read 1 alignments
>>>>>> Sequence_database: converting ASCII sequences to score profiles
>>>>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286
>>>>>>
>>>>>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a
>>>>>> problem right? And this was run on a hefty machine ...
>>>>>>
>>>>>>
>>>>>> marcin
>>>>>>
>>>>>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak
>>>>>> <mar...@gm...> wrote:
>>>>>>
>>>>>>> much better, thanks!
>>>>>>>  I did notice the trees for each alignment in the output ...
>>>>>>>
>>>>>>>  The tree estimation is running, I'll keep reporting ...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote:
>>>>>>>  > marcin joachimiak wrote:
>>>>>>>  >  > Excellent, its working now!
>>>>>>>  >
>>>>>>>  >  Cool!
>>>>>>>  >
>>>>>>>  >
>>>>>>>  >  > I'll have the tree data soon and may try a run with as well, but
>>>>> best
>>>>>>>  >  > if I could know that the other tree is different from the xrate
>>>>> one
>>>>>>>  >  > with the Jukes-Cantor model. Can I get the Jukes-Cantor model
>>>>>>>  >  > tree/distances from xrate?
>>>>>>>  >
>>>>>>>  >  Yes, it will insert the trees into the alignment files it outputs,
>>>>> using
>>>>>>>  >  the New-Hampshire-encoded-in-Stockholm format from my previous
>>>>> email.
>>>>>>>  >
>>>>>>>  >  One caveat. The way you're running it now, it is calculating a
>>>>> separate
>>>>>>>  >  tree *for every window* in the alignment. This may not be what you
>>>>> want,
>>>>>>>  >  and it's certainly going to slow things down.
>>>>>>>  >
>>>>>>>  >  If you want to calculate the tree just once for the whole alignment,
>>>>> type
>>>>>>>  >
>>>>>>>  >  xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg
>>>>>>>  >  --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock
>>>>>>>  >
>>>>>>>  >
>>>>>>>  >  The "--noannotate" just means "don't waste time trying to annotate
>>>>> this
>>>>>>>  >  alignment, I just want the tree". The "-log 5" will just print some
>>>>> log
>>>>>>>  >  messages on stderr, so you can watch the progress of the
>>>>> tree-building
>>>>>>>  >  algorithms.
>>>>>>>  >
>>>>>>>  >  hth,
>>>>>>>  >
>>>>>>>  >  I.
>>>>>>>  >
>>>>>>>
>>>>>>>
>