Re: [Dart-help] help with windowlicker

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

hey

Looks like I now have windowlicker output with 49500 alignment and trees like:
//
# STOCKHOLM 1.0
#=GF         NH (CP000527:0.1287975598,(CP000112:0.0000000010,AE017286:0.0000000
010)node_3:0.1287875646)root;
#=GF SC_max_rev -643.47
     CP000112   ACAAAGTTTGCTATCCCTTCTTTGAGCGGAGTTCCGGGCTTGAAGCCGGTGTCGGCAATCAGGT
     CP000527   ACGAAGGCGGCTACGCCCTCTTCGACCGTGGTCGCGGGGCGGAAGCCCGTGTCAGCGATGAGGT
     AE017286   ----------------------------------------------------------------

     CP000112   CGTCCACATTGGCATATGTTGCTTCCACATCGCCGGGCTGCATGGGCATCATGTTTTTAACGGC
     CP000527   CGTCGACGTCGGCGTAGGTCGCTGCCACGTCGCCGGGCTGCATGGGCAGCATGTTCTTGACCGC
     AE017286   ----------------------------------------------------------------

     CP000112   CTTTTTGCCCAGGCATTCTTCCAGAGTCGCGATGAACTCGCCAAGCTCTACCGTGTTGTTGTTG
     CP000527   CTTCTTGCCGAGGCACTCTTCGAGTACTTCGATGAACCTGCCGAGTTCGACGGTGTTGTTGTTC
     AE017286   ----------------------------------------------------------------

     CP000112   CCAATGTT
     CP000527   CCGATGTT
     AE017286   --------

But I'll need a little guidance as to what these are are and how to
interpret them. Arre these ~50,000 ncRNA predictions?

For completeness, here's the command I used to generate the output:
perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl
DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg
&> windowlicker.out

thanks,
marcin

On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote:
> Hi Marcin...
>
> So, I never ran xrate on an alignment this big before (6 million
> columns!) and it turned out there was a drastic inefficiency in the
> distance matrix code, that was causing it to scale as the square of the
> number of columns. Unnoticeable for most alignments (since the code
> inside the loop was pretty quick), but your example caused it to hang...
>
> I think it was quite possible that your xrate job was getting killed
> because it was running too long. Are you running queue management
> software of some kind (e.g. Sun Grid Engine)? That can often kill
> long-running jobs. I couldn't find a specific bug that was crashing on
> this data, although I didn't finish running it because it was taking
> sooooo long.
>
> Anyway, I've now fixed the inefficiency bug. The tree estimation by
> neighbor-joining now finishes pretty fast. However, it now takes a long
> time to complete the second step of the tree-estimation procedure, which
> is to refine the branch lengths of the tree using the EM algorithm.
>
> There is less that I can do to optimize this. The neighbor-joining code
> is not heavily dependent on the alignment length, but the branch-length
> refinement is. I could optimize the branch-length code in the same way
> that the neighbor-joining is optimized, so it was less dependent on
> alignment length, but that would take a little while, and unfortunately
> I don't have the time right now.
>
> Instead, what I have done is introduced a subtle hack. If you specify
> the command-line option "--maxrounds 0", then this will bypass the
> branch-length refinement code (specifically, what it does is tell xrate
> that you want the EM algorithm to run for a maximum of zero iterations,
> or "rounds").
>
> Thus, with "--maxrounds 0" the tree will be estimated using
> neighbor-joining only.
>
> Sorry this has taken so long to fix! Like I said, I never used xrate on
> an alignment this big before (for our screens, we pre-specified the
> tree, so we went directly to windowlicker).
>
> Hopefully, this will now work for you, but let me know if you have
> further issues.
>
> Cheers,
> Ian
>
>
> marcin joachimiak wrote:
>> hey,
>>
>> Thanks for the help!
>> The short example worked just fine.
>>
>> log3 with my previous command so far produced this:
>> Finding eigenvectors using MatrixExpEigenPrepare
>> Largest eigenvalue was 0; set to 0
>> Updating cached M() & J()
>> Finding eigenvectors using MatrixExpEigenPrepare
>> Largest eigenvalue was 0; set to 0
>> Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0))
>> Updating cached M() & J()
>> Sequence_database: converting ASCII sequences to score profiles
>> Processed 3 sequences
>> Estimating tree for the following sequences: CP000112 CP000527 AE017286
>> Estimating distances of sequences 0 and 1
>>
>> You can download the FASTA and Stock files here:
>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz
>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz
>>
>> I'll have more updates on Monday -- enjoy the weekend,
>> marcin
>> - Show quoted text -
>>
>> On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote:
>>> Thanks Marcin.
>>>
>>>  Something has definitely gone wrong if no alignment is output. Can you
>>> check that the following outputs an alignment+tree?
>>>
>>>  xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg
>>> ~/dart/src/ecfg/t/notree.stk
>>>
>>>
>>>  Looking at your log messages it seems like the program has crashed while
>>> trying to do neighbor-joining to estimate the tree. For the above command, I
>>> get log messages after the "Estimating tree..." message. See below (BTW many
>>> of these are on stderr, not stdout)
>>>
>>>  If you can reproduce this on the short example pairwise alignment but not
>>> with the 3M alignments, there's something else wrong that needs
>>> investigation. In that case, can you first try running "-log 3" instead of
>>> "-log 5" (this will give more verbose log messages during the
>>> neighbor-joining algorithm), then mail the stderr log messages to me, and/or
>>> please post a URL to your large file so I can download it and attempt to
>>> reproduce the error.
>>>
>>>  Sorry about this...
>>>  Ian
>>>
>>>
>>>
>>>
>>>  Warning -- used random number generator during initialisation
>>>  Read 1 alignments
>>>  Sequence_database: converting ASCII sequences to score profiles
>>>  Estimating tree for the following sequences: X Y
>>>  Optimizing branch lengths of all trees in alignment database, using EM
>>>  Optimizing tree branch lengths by EM.
>>>  Tree before optimization:
>>>  (X:5.0000000000,Y:5.0000000000)root;
>>>  EM iteration #1: log-likelihood = -56.0101 bits
>>>  Optimized tree after step #1:
>>>  (X:5.1160179765,Y:5.1160180140)root;
>>>  EM iteration #2: log-likelihood = -56.0118 bits
>>>  Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM
>>>  Failed EM improvement threshold for the 1th time; stopping
>>>  Checking post-iteration log-likelihood
>>>  Post-iteration log-likelihood = -56.0118 bits
>>>  Restoring previous best branch lengths
>>>  Tree after optimization:
>>>  (X:5.0000000000,Y:5.0000000000)root;
>>>  Alignment database log-likelihood: -56.0101 bits
>>>  Processing alignment Alignment1 (1 of 1): 14 columns
>>>  # STOCKHOLM 1.0
>>>  #=GF NH (X:5.0000000000,Y:5.0000000000)root;
>>>  X       AAAAAAaaUUUUUU
>>>  Y       GGGGGGaaCCCCCC
>>>  //
>>>
>>>
>>>
>>>
>>>  marcin joachimiak wrote:
>>>
>>>> Hey there,
>>>>
>>>> The xrate tree job took a while but it seemed to finish without
>>>> errors. Alas there isn't any tree output ...
>>>>
>>>> Here's what I get from stdout:
>>>>
>>>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e
>>>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5
>>>> Warning -- used random number generator during initialisation
>>>> Read 1 alignments
>>>> Sequence_database: converting ASCII sequences to score profiles
>>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286
>>>>
>>>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a
>>>> problem right? And this was run on a hefty machine ...
>>>>
>>>>
>>>> marcin
>>>>
>>>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak
>>>> <mar...@gm...> wrote:
>>>>
>>>>> much better, thanks!
>>>>>  I did notice the trees for each alignment in the output ...
>>>>>
>>>>>  The tree estimation is running, I'll keep reporting ...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote:
>>>>>  > marcin joachimiak wrote:
>>>>>  >  > Excellent, its working now!
>>>>>  >
>>>>>  >  Cool!
>>>>>  >
>>>>>  >
>>>>>  >  > I'll have the tree data soon and may try a run with as well, but
>>> best
>>>>>  >  > if I could know that the other tree is different from the xrate
>>> one
>>>>>  >  > with the Jukes-Cantor model. Can I get the Jukes-Cantor model
>>>>>  >  > tree/distances from xrate?
>>>>>  >
>>>>>  >  Yes, it will insert the trees into the alignment files it outputs,
>>> using
>>>>>  >  the New-Hampshire-encoded-in-Stockholm format from my previous
>>> email.
>>>>>  >
>>>>>  >  One caveat. The way you're running it now, it is calculating a
>>> separate
>>>>>  >  tree *for every window* in the alignment. This may not be what you
>>> want,
>>>>>  >  and it's certainly going to slow things down.
>>>>>  >
>>>>>  >  If you want to calculate the tree just once for the whole alignment,
>>> type
>>>>>  >
>>>>>  >  xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg
>>>>>  >  --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock
>>>>>  >
>>>>>  >
>>>>>  >  The "--noannotate" just means "don't waste time trying to annotate
>>> this
>>>>>  >  alignment, I just want the tree". The "-log 5" will just print some
>>> log
>>>>>  >  messages on stderr, so you can watch the progress of the
>>> tree-building
>>>>>  >  algorithms.
>>>>>  >
>>>>>  >  hth,
>>>>>  >
>>>>>  >  I.
>>>>>  >
>>>>>
>>>>>
>