I've been playing around a bit with the complearn toolkit (great stuff, btw), and noticed that the quartet distance calculation S(T) runs in N^4 (actually N^5 if one counts the path traversal routine as it is implemented), which becomes prohibitive for more than a few tens of objects. What is currently calculated is
sum_{x,y,u,v} d(x,y, u,v | T)
with d(x,y,u,v|T) = d(a,b) + d(c,d) { a,b,c,d being the consistent quartet ordering of x,y,u,v in T)
After this the value is scaled. As far as I can see, this is implemented as a four-way loop (over all quartets) in the code, then finding the quartet that is consistent. I think this can be done two orders faster ( roughly N^2 multiplied with the path traversal routine ).
First observe, that given two nodes (x,y) the value d(x,y) will be added to the overall sum as many times as there are other pairs (u,v) that do not cross the path between x and y. Which are these pairs?
For a ternary tree, and the path between the leafs x and y, each internal node in the path will have two branches that participate in the path (one going to x, one going to y). The third branch is connected to at least one other object. Now observe that if there are k objects connected through this third branch, all pairs of these objects are, together with (x,y), quartets which are consistent with the tree. There are thus $k \choose 2$ of these pairs, all of which will contribute at least d(x,y) to the overall sum.
All pairs that are connected through different internal nodes by definition cross the path of x to y, and therefore d(x,y) will not be part of the overall sum for these pairs.
This can be repeated for each internal node, and gives exactly the number of times d(x,y) should be added to the overall sum. Now do this for all pairs of nodes and the overall sum is computed.
The faster algorithm for computing S(T) thus proceeds as follows:
- for each branch of an internal node in the ternary tree, compute and cache the number of leafs (objects) connected to that branch
- loop over all nodes x
-- loop over all nodes y (unequal to x)
--- for each internal node between x and y, obtain k, the number of objects in the non-traversed branch
--- add (k choose 2) times d(x,y) to the overall sum
- Scale with the precomputed minimum and maximum score
The algorithm runs in N^2 for the two outer loops, multiplied with the quantity for the path traversal routine (proportional to internal path length, I guess).
I haven't been able to implement this algorithm yet in the libqtree source, as I first have to figure out exactly how the qtree data structures work. If I didn't make a mistake, this should however make the maketree algorithm run so much faster that it should easily scale to 100s possibly 1000s of objects (depending on the effectiveness of the search algorithm).
Hi,
I've been playing around a bit with the complearn toolkit (great stuff, btw), and noticed that the quartet distance calculation S(T) runs in N^4 (actually N^5 if one counts the path traversal routine as it is implemented), which becomes prohibitive for more than a few tens of objects. What is currently calculated is
sum_{x,y,u,v} d(x,y, u,v | T)
with d(x,y,u,v|T) = d(a,b) + d(c,d) { a,b,c,d being the consistent quartet ordering of x,y,u,v in T)
After this the value is scaled. As far as I can see, this is implemented as a four-way loop (over all quartets) in the code, then finding the quartet that is consistent. I think this can be done two orders faster ( roughly N^2 multiplied with the path traversal routine ).
First observe, that given two nodes (x,y) the value d(x,y) will be added to the overall sum as many times as there are other pairs (u,v) that do not cross the path between x and y. Which are these pairs?
For a ternary tree, and the path between the leafs x and y, each internal node in the path will have two branches that participate in the path (one going to x, one going to y). The third branch is connected to at least one other object. Now observe that if there are k objects connected through this third branch, all pairs of these objects are, together with (x,y), quartets which are consistent with the tree. There are thus $k \choose 2$ of these pairs, all of which will contribute at least d(x,y) to the overall sum.
All pairs that are connected through different internal nodes by definition cross the path of x to y, and therefore d(x,y) will not be part of the overall sum for these pairs.
This can be repeated for each internal node, and gives exactly the number of times d(x,y) should be added to the overall sum. Now do this for all pairs of nodes and the overall sum is computed.
The faster algorithm for computing S(T) thus proceeds as follows:
- for each branch of an internal node in the ternary tree, compute and cache the number of leafs (objects) connected to that branch
- loop over all nodes x
-- loop over all nodes y (unequal to x)
--- for each internal node between x and y, obtain k, the number of objects in the non-traversed branch
--- add (k choose 2) times d(x,y) to the overall sum
- Scale with the precomputed minimum and maximum score
The algorithm runs in N^2 for the two outer loops, multiplied with the quantity for the path traversal routine (proportional to internal path length, I guess).
I haven't been able to implement this algorithm yet in the libqtree source, as I first have to figure out exactly how the qtree data structures work. If I didn't make a mistake, this should however make the maketree algorithm run so much faster that it should easily scale to 100s possibly 1000s of objects (depending on the effectiveness of the search algorithm).
Any thoughts on this?
-Maarten-
http://www.cs.vu.nl/~mkeijzer