Re: [Dart-help] help with windowlicker

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hey hey,

Indeed, the alignment size limitation makes sense and was one of the
things I was suspecting.

The job was run on a machine w.o queue management but I will
investigate whether it could have been killed by other means.

The 0 iteration option sounds like it will do the trick -- so I just
need to checkout the DART code from cvs again, right? Thanks for
adding this in and I definitely understand the time constraints etc.,
this should suffice for my purposes. I could of course try to obtain a
tree in other ways, but actually I'm trying to turn this into a
pipeline that can run on multi-genome alignments, its almost there.

cheers,
marcin

On Mon, Apr 28, 2008 at 12:28 PM, Ian Holmes <ih...@be...> wrote:
> Hi Marcin...
>
>  So, I never ran xrate on an alignment this big before (6 million
>  columns!) and it turned out there was a drastic inefficiency in the
>  distance matrix code, that was causing it to scale as the square of the
>  number of columns. Unnoticeable for most alignments (since the code
>  inside the loop was pretty quick), but your example caused it to hang...
>
>  I think it was quite possible that your xrate job was getting killed
>  because it was running too long. Are you running queue management
>  software of some kind (e.g. Sun Grid Engine)? That can often kill
>  long-running jobs. I couldn't find a specific bug that was crashing on
>  this data, although I didn't finish running it because it was taking
>  sooooo long.
>
>  Anyway, I've now fixed the inefficiency bug. The tree estimation by
>  neighbor-joining now finishes pretty fast. However, it now takes a long
>  time to complete the second step of the tree-estimation procedure, which
>  is to refine the branch lengths of the tree using the EM algorithm.
>
>  There is less that I can do to optimize this. The neighbor-joining code
>  is not heavily dependent on the alignment length, but the branch-length
>  refinement is. I could optimize the branch-length code in the same way
>  that the neighbor-joining is optimized, so it was less dependent on
>  alignment length, but that would take a little while, and unfortunately
>  I don't have the time right now.
>
>  Instead, what I have done is introduced a subtle hack. If you specify
>  the command-line option "--maxrounds 0", then this will bypass the
>  branch-length refinement code (specifically, what it does is tell xrate
>  that you want the EM algorithm to run for a maximum of zero iterations,
>  or "rounds").
>
>  Thus, with "--maxrounds 0" the tree will be estimated using
>  neighbor-joining only.
>
>  Sorry this has taken so long to fix! Like I said, I never used xrate on
>  an alignment this big before (for our screens, we pre-specified the
>  tree, so we went directly to windowlicker).
>
>  Hopefully, this will now work for you, but let me know if you have
>  further issues.
>
>  Cheers,
>  Ian
>
>
>
>
>  marcin joachimiak wrote:
>  > hey,
>  >
>  > Thanks for the help!
>  > The short example worked just fine.
>  >
>  > log3 with my previous command so far produced this:
>  > Finding eigenvectors using MatrixExpEigenPrepare
>  > Largest eigenvalue was 0; set to 0
>  > Updating cached M() & J()
>  > Finding eigenvectors using MatrixExpEigenPrepare
>  > Largest eigenvalue was 0; set to 0
>  > Setting eigenvalue #2 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>  > Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #0 ((-1.3333,0))
>  > Setting eigenvalue #3 ((-1.3333,0)) equal to eigenvalue #2 ((-1.3333,0))
>  > Updating cached M() & J()
>  > Sequence_database: converting ASCII sequences to score profiles
>  > Processed 3 sequences
>  > Estimating tree for the following sequences: CP000112 CP000527 AE017286
>  > Estimating distances of sequences 0 and 1
>  >
>  > You can download the FASTA and Stock files here:
>  > http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.fasta.gz
>  > http://www.microbesonline.org/tmp/marcin/DvH_DP4_G20_mVISTA.stock.gz
>  >
>  > I'll have more updates on Monday -- enjoy the weekend,
>  > marcin
>  > - Show quoted text -
>  >
>  > On Sat, Apr 26, 2008 at 1:25 PM, Ian Holmes <ih...@be...> wrote:
>  >> Thanks Marcin.
>  >>
>  >>  Something has definitely gone wrong if no alignment is output. Can you
>  >> check that the following outputs an alignment+tree?
>  >>
>  >>  xrate -log 5 --noannotate -e ~/dart/grammars/jukescantor.eg
>  >> ~/dart/src/ecfg/t/notree.stk
>  >>
>  >>
>  >>  Looking at your log messages it seems like the program has crashed while
>  >> trying to do neighbor-joining to estimate the tree. For the above command, I
>  >> get log messages after the "Estimating tree..." message. See below (BTW many
>  >> of these are on stderr, not stdout)
>  >>
>  >>  If you can reproduce this on the short example pairwise alignment but not
>  >> with the 3M alignments, there's something else wrong that needs
>  >> investigation. In that case, can you first try running "-log 3" instead of
>  >> "-log 5" (this will give more verbose log messages during the
>  >> neighbor-joining algorithm), then mail the stderr log messages to me, and/or
>  >> please post a URL to your large file so I can download it and attempt to
>  >> reproduce the error.
>  >>
>  >>  Sorry about this...
>  >>  Ian
>  >>
>  >>
>  >>
>  >>
>  >>  Warning -- used random number generator during initialisation
>  >>  Read 1 alignments
>  >>  Sequence_database: converting ASCII sequences to score profiles
>  >>  Estimating tree for the following sequences: X Y
>  >>  Optimizing branch lengths of all trees in alignment database, using EM
>  >>  Optimizing tree branch lengths by EM.
>  >>  Tree before optimization:
>  >>  (X:5.0000000000,Y:5.0000000000)root;
>  >>  EM iteration #1: log-likelihood = -56.0101 bits
>  >>  Optimized tree after step #1:
>  >>  (X:5.1160179765,Y:5.1160180140)root;
>  >>  EM iteration #2: log-likelihood = -56.0118 bits
>  >>  Warning: log-likelihood dropped from -56.0101 to -56.0118 bits during EM
>  >>  Failed EM improvement threshold for the 1th time; stopping
>  >>  Checking post-iteration log-likelihood
>  >>  Post-iteration log-likelihood = -56.0118 bits
>  >>  Restoring previous best branch lengths
>  >>  Tree after optimization:
>  >>  (X:5.0000000000,Y:5.0000000000)root;
>  >>  Alignment database log-likelihood: -56.0101 bits
>  >>  Processing alignment Alignment1 (1 of 1): 14 columns
>  >>  # STOCKHOLM 1.0
>  >>  #=GF NH (X:5.0000000000,Y:5.0000000000)root;
>  >>  X       AAAAAAaaUUUUUU
>  >>  Y       GGGGGGaaCCCCCC
>  >>  //
>  >>
>  >>
>  >>
>  >>
>  >>  marcin joachimiak wrote:
>  >>
>  >>> Hey there,
>  >>>
>  >>> The xrate tree job took a while but it seemed to finish without
>  >>> errors. Alas there isn't any tree output ...
>  >>>
>  >>> Here's what I get from stdout:
>  >>>
>  >>> Parsed command line: xrate DvH_DP4_G20_mVISTA.stock -e
>  >>> /usr2/people/gtl/bin/dart//grammars/jukescantor.eg --noannotate -log 5
>  >>> Warning -- used random number generator during initialisation
>  >>> Read 1 alignments
>  >>> Sequence_database: converting ASCII sequences to score profiles
>  >>> Estimating tree for the following sequences: CP000112 CP000527 AE017286
>  >>>
>  >>> The alignment is 3 bacterial genomes, ~3M each -- that wouldn't be a
>  >>> problem right? And this was run on a hefty machine ...
>  >>>
>  >>>
>  >>> marcin
>  >>>
>  >>> On Thu, Apr 24, 2008 at 7:36 PM, marcin joachimiak
>  >>> <mar...@gm...> wrote:
>  >>>
>  >>>> much better, thanks!
>  >>>>  I did notice the trees for each alignment in the output ...
>  >>>>
>  >>>>  The tree estimation is running, I'll keep reporting ...
>  >>>>
>  >>>>
>  >>>>
>  >>>>
>  >>>>  On Thu, Apr 24, 2008 at 7:22 PM, Ian Holmes <ih...@be...> wrote:
>  >>>>  > marcin joachimiak wrote:
>  >>>>  >  > Excellent, its working now!
>  >>>>  >
>  >>>>  >  Cool!
>  >>>>  >
>  >>>>  >
>  >>>>  >  > I'll have the tree data soon and may try a run with as well, but
>  >> best
>  >>>>  >  > if I could know that the other tree is different from the xrate
>  >> one
>  >>>>  >  > with the Jukes-Cantor model. Can I get the Jukes-Cantor model
>  >>>>  >  > tree/distances from xrate?
>  >>>>  >
>  >>>>  >  Yes, it will insert the trees into the alignment files it outputs,
>  >> using
>  >>>>  >  the New-Hampshire-encoded-in-Stockholm format from my previous
>  >> email.
>  >>>>  >
>  >>>>  >  One caveat. The way you're running it now, it is calculating a
>  >> separate
>  >>>>  >  tree *for every window* in the alignment. This may not be what you
>  >> want,
>  >>>>  >  and it's certainly going to slow things down.
>  >>>>  >
>  >>>>  >  If you want to calculate the tree just once for the whole alignment,
>  >> type
>  >>>>  >
>  >>>>  >  xrate MY_ALIGNMENT.stock -e $DARTDIR/grammars/jukescantor.eg
>  >>>>  >  --noannotate -log 5 > MY_ALIGNMENT_WITH_TREE.stock
>  >>>>  >
>  >>>>  >
>  >>>>  >  The "--noannotate" just means "don't waste time trying to annotate
>  >> this
>  >>>>  >  alignment, I just want the tree". The "-log 5" will just print some
>  >> log
>  >>>>  >  messages on stderr, so you can watch the progress of the
>  >> tree-building
>  >>>>  >  algorithms.
>  >>>>  >
>  >>>>  >  hth,
>  >>>>  >
>  >>>>  >  I.
>  >>>>  >
>  >>>>
>  >>>>
>