From: William P. <wil...@ya...> - 2009-05-28 23:56:51
On May 28, 2009, at 5:20 PM, Rutger Vos wrote:

> Here's a doodle poll to pick a date/time: http://doodle.com/7x5aykszup954ysa

I will be in London until the 5th. Conceivably I can Skype from London, but not knowing the conference schedule, I can't commit.

> Val & I have concluded we should have a telecon (or other discussion
> format) about taxonomic queries.
>
> The proliferation of possibilities is too big for us to solve. If a
> client says "[...] where taxon=Primates [...]", what does that mean?
> Match only against leaf nodes called Primates? Only match trees
> containing only Primates? Etc. What about lists of taxa, or
> (topologically) disjoint sets?

It's my understanding that we have not yet imported NCBI's classification into TreeBASE2. Is this correct? If so, then "containing any Primates" is not an option at the moment -- although it should be eventually (I'm putting it in our poster!) -- and it should be an easy thing to add.

Anyway, this is why my prototype PhyloWS API has a proliferation of terms to search on:

taxon_name (the fullnamestring in the taxon_variants table)
taxon_label (the label string on either a tree leaf or on a matrix row)
h.taxon_name (the higher taxon name in the NCBI tables -- i.e. get all trees that have any kind of descendant from this higher name)
ncbi_taxid (the NCBI taxid)
h.ncbi_taxid (the NCBI taxid, but one that searches for all descendants of this node in the NCBI classification)
ubio_namebankid (the namebankid from uBio)
taxon_id (TreeBASE's own taxon_id from the taxon table)

Hilmar objects to the use of multiple terms, and would rather that I just use "taxonIdentifier", but then have some special namespace for what I'm searching on (e.g. "taxonIdentifier any ncbi_taxid:12345" vs "taxonIdentifier any ubio_namebankid:12345"). He would probably also protest about having separate "taxon_name" and "taxon_label".
In this instance, I don't mind using only "name" but then having the server know to treat this as "taxon_name OR taxon_label". In terms of h.taxon_name, I don't see any way around it: we really need a separate term to mean "any kind of Primates" instead of exactly matching "Primates".

> ...what does that mean? Match only against leaf nodes called
> Primates? Only match trees containing only Primates

"taxon_name any Primates" means "find any tree that has a node that maps to the name Primates". I think it is difficult to express "match trees containing only Primates", but I could approximate it like so:

h.taxon_name any Primates NOT (h.taxon_name any Scandentia OR h.taxon_name any Glires OR h.taxon_name any Dermoptera)

To do it exactly right, we need a special syntax so that the database knows to search the ncbi classification for the opposite of a subclade (i.e. everything except the specified subclade).

PhyloWS seems to be missing a specification on how to search on a tree topology. One possible solution is to take advantage of the query tree structure supported by CQL. For example:

/phylows/find/tree/?query=%28%28name+any+Homo+and+name+any+Pan%29+and+name+any+Gorilla%29

... returns any tree that has Homo and Pan and Gorilla in it. Whereas this query:

/phylows/find/topology/?query=%28%28name+any+Homo+and+name+any+Pan%29+and+name+any+Gorilla%29

... returns any tree that matches the topology: ((Homo, Pan),Gorilla)

bp
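P.S. For anyone who wants to try the queries above from a client, here is a minimal Python sketch of how the query strings could be assembled and URL-encoded. The helper names (any_of, both, approx_only, find_url) are made up for illustration and are not part of PhyloWS or TreeBASE. The sketch only builds the strings shown in this message; how a server interprets "tree" versus "topology" is still the open question.

```python
from urllib.parse import quote_plus

# Path prefix exactly as written in the message; not a confirmed endpoint.
BASE = "/phylows/find"


def any_of(index, value):
    """One CQL-style clause, e.g. 'name any Homo'."""
    return f"{index} any {value}"


def both(left, right):
    """Parenthesised 'and' of two clauses; nesting mirrors the tree shape."""
    return f"({left} and {right})"


def approx_only(clade, excluded_clades):
    """Approximate 'only <clade>' by excluding other named higher taxa."""
    excluded = " OR ".join(any_of("h.taxon_name", c) for c in excluded_clades)
    return f"{any_of('h.taxon_name', clade)} NOT ({excluded})"


def find_url(resource, query):
    """Relative URL with the query percent-encoded (spaces become '+')."""
    return f"{BASE}/{resource}/?query={quote_plus(query)}"


# ((Homo, Pan), Gorilla) expressed as nested 'and' clauses.
nested = both(both(any_of("name", "Homo"), any_of("name", "Pan")),
              any_of("name", "Gorilla"))

tree_url = find_url("tree", nested)          # trees containing all three names
topology_url = find_url("topology", nested)  # trees matching the nesting

only_primates = approx_only("Primates", ["Scandentia", "Glires", "Dermoptera"])

print(tree_url)
print(topology_url)
print(only_primates)
```

The same nested string is sent to both resources, so the parenthesisation carries the topology and the resource name decides whether the server pays attention to it.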