From: William P. <wil...@ya...> - 2009-05-29 16:10:55
On May 29, 2009, at 10:52 AM, Rutger Vos wrote:

>> It's my understanding that we have not yet imported ncbi's classification
>> into TreeBASE2? Is this correct? If so, then "containing any Primates" is
>> not an option at the moment -- although it should be eventually (I'm putting
>> it in our poster!) -- and it should be an easy thing to add.
>
> Absolutely, not yet - but very important. A propos, what would be the
> right way to do this? I've thought about it a little bit and imagined
> importing the NCBI taxonomy as a very large tree against whose
> topology we run queries would be a way to do this that doesn't require
> schema changes. Reasonable? Silly?

That was my original proposal (before you joined the team) -- but it got nixed by others. My rationale was that we would develop a system so that users could specify any TreeBASE tree to use as a classification, with the idea that we would not be locked into any particular classification system using dedicated tables.

On the other hand, seeing as we already have ncbi_taxids, it makes sense to be wedded to ncbi because any other classification tree (e.g. ToLWeb) would have weaker links among taxon labels (it would have to be done by string matching). Plus, I'm a bit skeptical that our tree parsing and importing system can really handle 500k-node trees (can headless Mesquite handle that?).

At any rate, importing and indexing the ncbi tables is easily done and explained here. Ideally we want a one-click process for downloading, refreshing and reindexing these tables using the latest version from ncbi. The two tables can have fields that exactly mirror the fields in the download, plus two more fields (left_index and right_index). Optionally, we can build a path table for transitive closure searches, but seeing as there is only one tree, it is probably sufficient to use the left/right id system.

>> PhyloWS seems to be missing a specification on how to search on a
>> tree topology.
>
> The wiki floats the idea of using PhyloCode for that, but I'm not sure
> if it can satisfy all our requirements.

There is some verbiage here (see point 3 below), but it doesn't specify how to input a query tree.

bp

Find/search examples:

Task: Find trees by nodes
Input: a list of node specifiers, and a designation of what the specifiers should match (node label, sequence ID, taxon, gene name)

Task: Find trees by clade
Input: clade specification (phylocode)

Task: Find, or filter trees matching a query topology. The query topology might have polytomies, of which matching trees may be a specialization.
Input: A database (or result set) of trees, a query tree, and a distance metric
Output: The matching trees (names, identifiers), or alternatively the subtrees of matching trees projected onto the query topology
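
For what it's worth, the left_index/right_index idea mentioned above can be sketched in a few lines of Python. This is only an illustration, not TreeBASE code: the function names (assign_nested_set_indices, is_within) and the toy parent_of mapping are made up here, and the toy tree collapses most ranks of the real NCBI lineage. It assumes the input is a child -> parent mapping like the tax_id / parent tax_id columns of the ncbi nodes download, where the root lists itself as its own parent. It uses an explicit stack rather than recursion so that a tree with hundreds of thousands of nodes would not run into Python's recursion limit.

```python
from collections import defaultdict


def assign_nested_set_indices(parent_of, root):
    """Return {tax_id: (left_index, right_index)} for the tree under root.

    parent_of maps each tax_id to its parent tax_id (assumed input shape).
    """
    # Invert child -> parent into parent -> [children], skipping the
    # root's self-reference.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if child != root:
            children[parent].append(child)

    index = {}
    left = {}
    counter = 0
    # Each stack entry is (node, done); done=True means all descendants
    # have been numbered and the node can receive its right index.
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        counter += 1
        if done:
            index[node] = (left[node], counter)
        else:
            left[node] = counter
            stack.append((node, True))
            # Reverse-sorted push so children are visited in ascending order.
            for child in sorted(children[node], reverse=True):
                stack.append((child, False))
    return index


def is_within(index, taxon, clade):
    """True if taxon is clade itself or one of its descendants."""
    t_left, t_right = index[taxon]
    c_left, c_right = index[clade]
    return c_left <= t_left and t_right <= c_right


# Toy tree (hypothetical subset; intermediate ranks collapsed):
# 1 = root, 40674 = Mammalia, 9443 = Primates,
# 9606 = Homo sapiens, 9598 = Pan troglodytes, 10090 = Mus musculus
parent_of = {1: 1, 40674: 1, 9443: 40674, 10090: 40674,
             9606: 9443, 9598: 9443}
index = assign_nested_set_indices(parent_of, root=1)

print(is_within(index, 9606, 9443))   # Homo sapiens inside Primates?
print(is_within(index, 10090, 9443))  # Mus musculus inside Primates?
```

With the two indices stored as extra columns, "containing any Primates" becomes a simple range test: a taxon belongs to the clade when its left_index falls between the clade's left_index and right_index, so no path table is needed for a single tree.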