Re: [Treebase-devel] taxonomic query telecon

On May 29, 2009, at 10:52 AM, Rutger Vos wrote:

>> It's my understanding that we have not yet imported ncbi's classification
>> into TreeBASE2? Is this correct? If so, then "containing any Primates" is
>> not an option at the moment -- although it should be eventually (I'm putting
>> it in our poster!) -- and it should be an easy thing to add.
>
> Absolutely, not yet - but very important. A propos, what would be the
> right way to do this? I've thought about it a little bit and imagined
> importing the NCBI taxonomy as a very large tree against whose
> topology we run queries would be a way to do this that doesn't require
> schema changes. Reasonable? Silly?

That was my original proposal (before you joined the team) -- but it  
got nixed by others.  My rationale was that we would develop a system  
so that users could specify any TreeBASE tree to use as a  
classification, with the idea that we would not be locked into any  
particular classification system using dedicated tables.  On the other  
hand, seeing as we already have ncbi_taxids, it makes sense to be  
wedded to ncbi because any other classification tree (e.g. ToLWeb)  
would have weaker links among taxon labels (it would have to be done  
by string matching). Plus, I'm a bit skeptical that our tree parsing  
and importing system can really handle 500k-node trees (can headless  
Mesquite handle that?).

At any rate, importing and indexing the NCBI tables is easily done and
explained here. Ideally we want a one-click process for downloading,
refreshing, and reindexing these tables using the latest version from
NCBI. The two tables can have fields that exactly mirror the fields in
the download, plus two more fields (left_index and right_index).
Optionally, we can build a path table for transitive closure searches,
but seeing as there is only one tree, it is probably sufficient to use
the left/right index system.
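
To make the left/right idea concrete, here is a minimal sketch in Python. It assumes the parent links have already been read into a dict in which the root is its own parent (as tax_id 1 is in the NCBI download). The function names and the five-node toy tree are made up for illustration; the toy tree skips all intermediate ranks and is not the real lineage.

```python
from collections import defaultdict


def assign_indexes(parent_of):
    """Map each tax_id to (left_index, right_index) via a depth-first walk.

    parent_of maps tax_id -> parent tax_id; the root is its own parent.
    """
    children = defaultdict(list)
    root = None
    for tax_id, parent in parent_of.items():
        if tax_id == parent:
            root = tax_id
        else:
            children[parent].append(tax_id)

    left, right = {}, {}
    counter = 0
    # Explicit stack instead of recursion, so a ~500k-node tree does not
    # hit Python's recursion limit.
    stack = [(root, False)]
    while stack:
        tax_id, leaving = stack.pop()
        counter += 1
        if leaving:
            right[tax_id] = counter
        else:
            left[tax_id] = counter
            stack.append((tax_id, True))
            for child in sorted(children[tax_id], reverse=True):
                stack.append((child, False))
    return {t: (left[t], right[t]) for t in left}


def is_within(indexes, tax_id, ancestor_id):
    """True if tax_id is ancestor_id itself or one of its descendants."""
    node_left, node_right = indexes[tax_id]
    anc_left, anc_right = indexes[ancestor_id]
    return anc_left <= node_left and node_right <= anc_right


def tree_contains_any(indexes, leaf_tax_ids, clade_id):
    """True if any leaf of a stored tree falls inside the given clade."""
    return any(is_within(indexes, t, clade_id) for t in leaf_tax_ids)


# Hypothetical toy hierarchy: root(1) -> Primates(9443), mouse(10090);
# Primates -> human(9606), chimp(9598).
parent_of = {1: 1, 9443: 1, 10090: 1, 9606: 9443, 9598: 9443}
indexes = assign_indexes(parent_of)
has_primates = tree_contains_any(indexes, [10090, 9606], 9443)
```

With the two indexes stored as columns, "containing any Primates" becomes a range comparison against the Primates row, with no path table needed.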

>> PhyloWS seems to be missing a specification on how to search on a tree
>> topology.
>
> The wiki floats the idea of using PhyloCode for that, but I'm not sure
> if it can satisfy all our requirements.

There is some verbiage here (see point 3 below), but it doesn't  
specify how to input a query tree.

bp

Find/search examples:

1. Task: Find trees by nodes
   Input: a list of node specifiers, and a designation of what the
   specifiers should match (node label, sequence ID, taxon, gene name)

2. Task: Find trees by clade
   Input: clade specification (phylocode)

3. Task: Find, or filter trees matching a query topology.
   The query topology might have polytomies, of which matching trees may
   be a specialization.
   Input: A database (or result set) of trees, a query tree, and a
   distance metric
   Output: The matching trees (names, identifiers), or alternatively the
   subtrees of matching trees projected onto the query topology
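
One possible reading of the topology-matching task above, sketched in Python: a candidate tree counts as a specialization of the query if, once pruned to the query's taxa, it still contains every clade of the query. This is an assumption on my part, not something the PhyloWS text specifies. It handles only exact matches on rooted trees and ignores the distance metric. The nested-tuple representation and the function names are hypothetical.

```python
def leaf_set(tree):
    """Return all leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clusters(tree, keep=None):
    """Return the leaf sets below each internal node, pruned to `keep`.

    Leaf sets that shrink to fewer than two labels after pruning are
    dropped, since they carry no grouping information.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            if keep is None or node in keep:
                return frozenset([node])
            return frozenset()
        leaves = frozenset().union(*(walk(child) for child in node))
        if len(leaves) >= 2:
            found.add(leaves)
        return leaves

    walk(tree)
    return found


def matches(query, candidate):
    """True if candidate, pruned to the query's taxa, resolves the query."""
    query_leaves = leaf_set(query)
    if not query_leaves <= leaf_set(candidate):
        return False
    return clusters(query) <= clusters(candidate, keep=query_leaves)


# The query leaves the three apes unresolved (a polytomy).
query = (("human", "chimp", "gorilla"), "mouse")
resolved = ((("human", "chimp"), "gorilla"), ("mouse", "rat"))
conflicting = (("human", "mouse"), ("chimp", "gorilla"))

hits = [t for t in (resolved, conflicting) if matches(query, t)]
```

Here only the first candidate is kept: it resolves the ape polytomy without breaking the ape clade, while the second groups human with mouse.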