Re: [Dart-help] help with windowlicker
Brought to you by:
ihh
|
From: marcin j. <mar...@gm...> - 2008-05-19 21:09:53
|
hey
Looks like I now have windowlicker output with 49500 alignment and trees like:
//
# STOCKHOLM 1.0
#=GF NH (CP000527:0.1287975598,(CP000112:0.0000000010,AE017286:0.0000000
010)node_3:0.1287875646)root;
#=GF SC_max_rev -643.47
CP000112 ACAAAGTTTGCTATCCCTTCTTTGAGCGGAGTTCCGGGCTTGAAGCCGGTGTCGGCAATCAGGT
CP000527 ACGAAGGCGGCTACGCCCTCTTCGACCGTGGTCGCGGGGCGGAAGCCCGTGTCAGCGATGAGGT
AE017286 ----------------------------------------------------------------
CP000112 CGTCCACATTGGCATATGTTGCTTCCACATCGCCGGGCTGCATGGGCATCATGTTTTTAACGGC
CP000527 CGTCGACGTCGGCGTAGGTCGCTGCCACGTCGCCGGGCTGCATGGGCAGCATGTTCTTGACCGC
AE017286 ----------------------------------------------------------------
CP000112 CTTTTTGCCCAGGCATTCTTCCAGAGTCGCGATGAACTCGCCAAGCTCTACCGTGTTGTTGTTG
CP000527 CTTCTTGCCGAGGCACTCTTCGAGTACTTCGATGAACCTGCCGAGTTCGACGGTGTTGTTGTTC
AE017286 ----------------------------------------------------------------
CP000112 CCAATGTT
CP000527 CCGATGTT
AE017286 --------
But I'll need a little guidance as to what these are are and how to
interpret them. Arre these ~50,000 ncRNA predictions?
For completeness, here's the command I used to generate the output:
perl /usr2/people/gtl/bin/dart/perl/windowlicker.pl
DvH_DP4_G20_mVISTA_wtree.stock -- -e $DARTDIR/grammars/jukescantor.eg
&> windowlicker.out
thanks,
marcin
On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote:
> Hi Marcin...
>
> So, I never ran xrate on an alignment this big before (6 million
> columns!) and it turned out there was a drastic inefficiency in the
> distance matrix code, that was causing it to scale as the square of the
> number of columns. Unnoticeable for most alignments (since the code
> inside the loop was pretty quick), but your example caused it to hang...
>
> I think it was quite possible that your xrate job was getting killed
> because it was running too long. Are you running queue management
> software of some kind (e.g. Sun Grid Engine)? That can often kill
> long-running jobs. I couldn't find a specific bug that was crashing on
> this data, although I didn't finish running it because it was taking
> sooooo long.
>
> Anyway, I've now fixed the inefficiency bug. The tree estimation by
> neighbor-joining now finishes pretty fast. However, it now takes a long
> time to complete the second step of the tree-estimation procedure, which
> is to refine the branch lengths of the tree using the EM algorithm.
>
> There is less that I can do to optimize this. The neighbor-joining code
> is not heavily dependent on the alignment length, but the branch-length
> refinement is. I could optimize the branch-length code in the same way
> that the neighbor-joining is optimized, so it was less dependent on
> alignment length, but that would take a little while, and unfortunately
> I don't have the time right now.
>
> Instead, what I have done is introduced a subtle hack. If you specify
> the command-line option "--maxrounds 0", then this will bypass the
> branch-length refinement code (specifically, what it does is tell xrate
> that you want the EM algorithm to run for a maximum of zero iterations,
> or "rounds").
>
> Thus, with "--maxrounds 0" the tree will be estimated using
> neighbor-joining only.
>
> Sorry this has taken so long to fix! Like I said, I never used xrate on
> an alignment this big before (for our screens, we pre-specified the
> tree, so we went directly to windowlicker).
>
> Hopefully, this will now work for you, but let me know if you have
> further issues.
>
> Cheers,
> Ian
>
>
> marcin joachimiak wrote:
>> hey,
>>
>> Thanks for the help!
>> The short example worked just fine.
>>
>> log3 with my previous command so far produced this:
>> Finding eigenvectors using MatrixExpEigenPrepare
>> Largest eigenvalue was 0; set to 0
>> Updating cached M() & J()
>> Finding eigenvectors using MatrixExpEigenPrepare
>> Largest eigenvalue was 0; set to 0
>> Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>> Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0))
>> Updating cached M() & J()
>> Sequence_database: converting ASCII sequences to score profiles
>> Processed 3 sequences
>> Estimating tree for the following sequences: CP000112 CP000527 AE017286
>> Estimating distances of sequences 0 and 1
>>
>> You can download the FASTA and Stock files here:
>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz
>> http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz
>>
>> I'll have more updates on Monday -- enjoy the weekend,
>> marcin
>> - Show quoted text -
>>
>> On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote:
>>> Thanks Marcin.
>>>
>>> Something has definitely gone wrong if no alignment is output. Can you
>>> check that the following outputs an alignment+tree?
>>>
>>> xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg
>>> ~/dart/src/ecfg/t/notree.stk
>>>
>>>
>>> Looking at your log messages it seems like the program has crashed while
>>> trying to do neighbor-joining to estimate the tree. For the above command, I
>>> get log messages after the "Estimating tree..." message. See below (BTW many
>>> of these are on stderr, not stdout)
>>>
>>> If you can reproduce this on the short example pairwise alignment but not
>>> with the 3M alignments, there's something else wrong that needs
>>> investigation. In that case, can you first try running "-log 3" instead of
>>> "-log 5" (this will give more verbose log messages during the
>>> neighbor-joining algorithm), then mail the stderr log messages to me, and/or
>>> please post a URL to your large file so I can download it and attempt to
>>> reproduce the error.
>>>
>>> Sorry about this...
>>> Ian
>>>
>>>
>>>
>>>
>>> Warning -- used random number generator during initialisation
>>> Read 1 alignments
>>> Sequence_database: converting ASCII sequences to score profiles
>>> Estimating tree for the following sequences: X Y
>>> Optimizing branch lengths of all trees in alignment database, using EM
>>> Optimizing tree branch lengths by EM.
>>> Tree before optimization:
>>> (X:5.0000000000,Y:5.0000000000)root;
>>> EM iteration #1: log-likelihood = -56.0101 bits
>>> Optimized tree after step #1:
>>> (X:5.1160179765,Y:5.1160180140)root;
>>> EM iteration #2: log-likelihood = -56.0118 bits
>>> Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM
>>> Failed EM improvement threshold for the 1th time; stopping
>>> Checking post-iteration log-likelihood
>>> Post-iteration log-likelihood = -56.0118 bits
>>> Restoring previous best branch lengths
>>> Tree after optimization:
>>> (X:5.0000000000,Y:5.0000000000)root;
>>> Alignment database log-likelihood: -56.0101 bits
>>> Processing alignment Alignment1 (1 of 1): 14 columns
>>> # STOCKHOLM 1.0
>>> #=GF NH (X:5.0000000000,Y:5.0000000000)root;
>>> X AAAAAAaaUUUUUU
>>> Y GGGGGGaaCCCCCC
>>> //
>>>
>>>
>>>
>>>
>>> marcin joachimiak wrote:
>>>
>>>> Hey there,
>>>>
>>>> The xrate tree job took a while but it seemed to finish without
>>>> errors. Alas there isn't any tree output ...
>>>>
>>>> Here's what I get from stdout:
>>>>
>>>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e
>>>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5
>>>> Warning -- used random number generator during initialisation
>>>> Read 1 alignments
>>>> Sequence_database: converting ASCII sequences to score profiles
>>>> Estimating tree for the following sequences: CP000112 CP000527 AE017286
>>>>
>>>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a
>>>> problem right? And this was run on a hefty machine ...
>>>>
>>>>
>>>> marcin
>>>>
>>>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak
>>>> <mar...@gm...> wrote:
>>>>
>>>>> much better, thanks!
>>>>> I did notice the trees for each alignment in the output ...
>>>>>
>>>>> The tree estimation is running, I'll keep reporting ...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote:
>>>>> > marcin joachimiak wrote:
>>>>> > > Excellent, its working now!
>>>>> >
>>>>> > Cool!
>>>>> >
>>>>> >
>>>>> > > I'll have the tree data soon and may try a run with as well, but
>>> best
>>>>> > > if I could know that the other tree is different from the xrate
>>> one
>>>>> > > with the Jukes-Cantor model. Can I get the Jukes-Cantor model
>>>>> > > tree/distances from xrate?
>>>>> >
>>>>> > Yes, it will insert the trees into the alignment files it outputs,
>>> using
>>>>> > the New-Hampshire-encoded-in-Stockholm format from my previous
>>> email.
>>>>> >
>>>>> > One caveat. The way you're running it now, it is calculating a
>>> separate
>>>>> > tree *for every window* in the alignment. This may not be what you
>>> want,
>>>>> > and it's certainly going to slow things down.
>>>>> >
>>>>> > If you want to calculate the tree just once for the whole alignment,
>>> type
>>>>> >
>>>>> > xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg
>>>>> > --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock
>>>>> >
>>>>> >
>>>>> > The "--noannotate" just means "don't waste time trying to annotate
>>> this
>>>>> > alignment, I just want the tree". The "-log 5" will just print some
>>> log
>>>>> > messages on stderr, so you can watch the progress of the
>>> tree-building
>>>>> > algorithms.
>>>>> >
>>>>> > hth,
>>>>> >
>>>>> > I.
>>>>> >
>>>>>
>>>>>
>
|