From: Rutger V. <rut...@gm...> - 2009-05-28 21:20:22
|
Hi,

Val & I have concluded we should have a telecon (or other discussion format) about taxonomic queries.

The proliferation of possibilities is too big for us to solve. If a client says "[...] where taxon=Primates [...]", what does that mean? Match only against leaf nodes called Primates? Only match trees containing only Primates? Etc. What about lists of taxa, or (topologically) disjoint sets?

Also, what syntax do we use for this? Perhaps CQL isn't expressive enough, so we may need another mini-syntax inside a CQL query (ideas floated: PhyloCode, NHX conventions).

Here's a doodle poll to pick a date/time: http://doodle.com/7x5aykszup954ysa

Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
|
From: William P. <wil...@ya...> - 2009-05-28 23:56:51
|
On May 28, 2009, at 5:20 PM, Rutger Vos wrote:

> Here's a doodle poll to pick a date/time: http://doodle.com/7x5aykszup954ysa

I will be in London until the 5th. Conceivably I can Skype from London, but not knowing the conference schedule, I can't commit.

> Val & I have concluded we should have a telecon (or other discussion
> format) about taxonomic queries.
>
> The proliferation of possibilities is too big for us to solve. If a
> client says "[...] where taxon=Primates [...]", what does that mean?
> Match only against leaf nodes called Primates? Only match trees
> containing only Primates? Etc. What about lists of taxa, or
> (topologically) disjoint sets?

It's my understanding that we have not yet imported ncbi's classification into TreeBASE2? Is this correct? If so, then "containing any Primates" is not an option at the moment -- although it should be eventually (I'm putting it in our poster!) -- and it should be an easy thing to add.

Anyway, this is why my prototype PhyloWS API has a proliferation of terms to search on:

taxon_name (the fullnamestring in the taxon_variants table)
taxon_label (the label string on either a tree leaf or on a matrix row)
h.taxon_name (the higher taxon name in the ncbi tables -- i.e. get all trees that have any kind of descendant from this higher name)
ncbi_taxid (the ncbi taxid)
h.ncbi_taxid (the ncbi taxid, but one that searches for all descendants of this node in the ncbi classification)
ubio_namebankid (the namebankid from ubio)
taxon_id (TreeBASE's own taxon_id from the taxon table)

Hilmar objects to the use of multiple terms, and would rather that I just use "taxonIdentifier", but then have some special namespace for what I'm searching on (e.g. "taxonIdentifier any ncbi_taxid:12345" vs "taxonIdentifier any ubio_namebankid:12345"). He would probably also protest about having separate "taxon_name" and "taxon_label". In this instance, I don't mind using only "name" but then having the server know to treat this as "taxon_name OR taxon_label". In terms of h.taxon_name, I don't see any way around it: we really need a separate term to mean "any kind of Primates" instead of exactly matching "Primates".

> ...what does that mean? Match only against leaf nodes called
> Primates? Only match trees containing only Primates

"taxon_name any Primates" means "find any tree that has a node that maps to the name Primates". I think it is difficult to express "match trees containing only Primates", but I could approximate it like so:

h.taxon_name any Primates NOT (h.taxon_name any Scandentia OR h.taxon_name any Glires OR h.taxon_name any Dermoptera)

To do it exactly right, we need a special syntax so that the database knows to search the ncbi classification for the opposite of a subclade (i.e. everything except the specified subclade).

PhyloWS seems to be missing a specification on how to search on a tree topology. One possible solution is to take advantage of the query tree structure supported by CQL. For example:

/phylows/find/tree/?query=%28%28name+any+Homo+and+name+any+Pan%29+and+name+any+Gorilla%29

... returns any tree that has Homo and Pan and Gorilla in it. Whereas this query:

/phylows/find/topology/?query=%28%28name+any+Homo+and+name+any+Pan%29+and+name+any+Gorilla%29

... returns any tree that matches the topology: ((Homo, Pan),Gorilla)

bp
|
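Editor's sketch: the two example URLs in the message above differ only in the resource name (tree vs. topology); the query string is the same nested CQL expression, percent-encoded. The Python below shows one way a client might build such URLs from a nested tuple. The function names topology_to_cql and find_url are hypothetical and are not part of any PhyloWS specification; the "name" index and the /phylows/find/... path are taken from the message itself.

```python
from urllib.parse import quote_plus


def topology_to_cql(node, index="name"):
    """Turn a nested tuple such as (("Homo", "Pan"), "Gorilla") into a
    parenthesised CQL expression. Leaves are taxon names; each tuple
    becomes one parenthesised group joined with 'and'."""
    if isinstance(node, str):
        return f"{index} any {node}"
    return "(" + " and ".join(topology_to_cql(child, index) for child in node) + ")"


def find_url(resource, cql):
    """Build a relative find URL for the given resource ('tree' or
    'topology'), percent-encoding the CQL query (spaces become '+')."""
    return f"/phylows/find/{resource}/?query={quote_plus(cql)}"


if __name__ == "__main__":
    cql = topology_to_cql((("Homo", "Pan"), "Gorilla"))
    print(cql)
    print(find_url("tree", cql))
    print(find_url("topology", cql))
```

Whether the server reads the parenthesisation as a mere boolean grouping (the tree resource) or as a topology constraint (the topology resource) is the design question raised in the message; the client-side encoding is identical either way.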
From: Mark D. <mj...@ge...> - 2009-06-05 14:13:16
|
On Thu, 2009-05-28 at 19:56 -0400, William Piel wrote:
>
> I will be in London until the 5th. Conceivably I can Skype from
> London, but not knowing the conference schedule, I can't commit.

Okay. I picked the call time based on your response to Rutger's poll. Maybe we'll just postpone the call until next week.

> It's my understanding that we have not yet imported ncbi's
> classification into TreeBASE2? Is this correct?

Yes.

--
Mark Jason Dominus mj...@ge...
Penn Genome Frontiers Institute +1 215 573 5387
|
From: Rutger V. <rut...@gm...> - 2009-05-29 14:52:37
|
On Thu, May 28, 2009 at 7:56 PM, William Piel <wil...@ya...> wrote:
>
> On May 28, 2009, at 5:20 PM, Rutger Vos wrote:
>
> Here's a doodle poll to pick a date/time: http://doodle.com/7x5aykszup954ysa
>
> I will be in London until the 5th. Conceivably I can Skype from London, but
> not knowing the conference schedule, I can't commit.

Obviously you're a key participant so let's just play it by ear; I suppose we can start the discussion by email and either call in next week or whenever you can make it.

> It's my understanding that we have not yet imported ncbi's classification
> into TreeBASE2? Is this correct? If so, then "containing any Primates" is
> not an option at the moment -- although it should be eventually (I'm putting
> it in our poster!) -- and it should be an easy thing to add.

Absolutely, not yet - but very important. A propos, what would be the right way to do this? I've thought about it a little bit and imagined importing the NCBI taxonomy as a very large tree against whose topology we run queries would be a way to do this that doesn't require schema changes. Reasonable? Silly?

> Hilmar objects to the use of multiple terms, and would rather that I just
> use "taxonIdentifier", but then have some special namespace for what I'm
> searching on. (e.g. "taxonIdentifier any ncbi_taxid:12345" vs
> "taxonIdentifier any ubio_namebankid:12345").

I can see the point of that: it would make inclusion into CDAO easier if all we need is a generic taxonIdentifier object. On the other hand, it would imply overloading the identifier string, with the namespacing smuggling some amount of extra semantics into something that really ought to be an opaque string.

> PhyloWS seems to be missing a specification on how to search on a tree
> topology.

The wiki floats the idea of using PhyloCode for that, but I'm not sure if it can satisfy all our requirements.

Rutger

--
Dr. Rutger A. Vos
Department of zoology
University of British Columbia
http://www.nexml.org
http://rutgervos.blogspot.com
|
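Editor's sketch: the concern above is that a term such as "ncbi_taxid:12345" is no longer an opaque string, because the server has to split it into a namespace and a value before it can search. A minimal illustration in Python of what that split might look like follows. The function name, the NamedTuple, and the set of accepted namespaces are assumptions made for this example (the namespaces are borrowed from the search terms listed earlier in the thread); nothing here is a defined PhyloWS or CDAO interface.

```python
from typing import NamedTuple

# Assumed namespaces, taken from the search terms proposed in this thread.
KNOWN_NAMESPACES = {"ncbi_taxid", "ubio_namebankid", "taxon_id"}


class TaxonIdentifier(NamedTuple):
    namespace: str
    value: str


def parse_taxon_identifier(term: str) -> TaxonIdentifier:
    """Split 'namespace:value' at the first colon and validate the namespace."""
    namespace, sep, value = term.partition(":")
    if not sep or not value:
        raise ValueError(f"expected 'namespace:value', got {term!r}")
    if namespace not in KNOWN_NAMESPACES:
        raise ValueError(f"unknown namespace {namespace!r}")
    return TaxonIdentifier(namespace, value)


if __name__ == "__main__":
    print(parse_taxon_identifier("ncbi_taxid:12345"))
    print(parse_taxon_identifier("ubio_namebankid:12345"))
```

The extra parsing and validation step is exactly the "smuggled semantics" in question: with separate index names (ncbi_taxid, ubio_namebankid) the same information lives in the CQL index instead of inside the search term.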
From: William P. <wil...@ya...> - 2009-05-29 16:10:55
|
On May 29, 2009, at 10:52 AM, Rutger Vos wrote:

>> It's my understanding that we have not yet imported ncbi's classification
>> into TreeBASE2? Is this correct? If so, then "containing any Primates" is
>> not an option at the moment -- although it should be eventually (I'm putting
>> it in our poster!) -- and it should be an easy thing to add.
>
> Absolutely, not yet - but very important. A propos, what would be the
> right way to do this? I've thought about it a little bit and imagined
> importing the NCBI taxonomy as a very large tree against whose
> topology we run queries would be a way to do this that doesn't require
> schema changes. Reasonable? Silly?

That was my original proposal (before you joined the team) -- but it got nixed by others. My rationale was that we would develop a system so that users could specify any TreeBASE tree to use as a classification, with the idea that we would not be locked into any particular classification system using dedicated tables.

On the other hand, seeing as we already have ncbi_taxids, it makes sense to be wedded to ncbi because any other classification tree (e.g. ToLWeb) would have weaker links among taxon labels (it would have to be done by string matching). Plus, I'm a bit skeptical that our tree parsing and importing system can really handle 500k-node trees (can headless Mesquite handle that?).

At any rate, importing and indexing the ncbi tables is easily done and explained here. Ideally we want a one-click process for downloading, refreshing and reindexing these tables using the latest version from ncbi. The two tables can have fields that exactly mirror the fields in the download, plus two more fields (left_index and right_index). Optionally, we can build a path table for transitive closure searches, but seeing as there is only one tree, it is probably sufficient to use the left/right id system.

>> PhyloWS seems to be missing a specification on how to search on a tree
>> topology.
>
> The wiki floats the idea of using PhyloCode for that, but I'm not sure
> if it can satisfy all our requirements.

There is some verbiage here (see point 3 below), but it doesn't specify how to input a query tree.

bp

Find/search examples:

1. Task: Find trees by nodes
   Input: a list of node specifiers, and a designation of what the specifiers should match (node label, sequence ID, taxon, gene name)

2. Task: Find trees by clade
   Input: clade specification (phylocode)

3. Task: Find, or filter trees matching a query topology. The query topology might have polytomies, of which matching trees may be a specialization.
   Input: A database (or result set) of trees, a query tree, and a distance metric
   Output: The matching trees (names, identifiers), or alternatively the subtrees of matching trees projected onto the query topology
|
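Editor's sketch: the left_index/right_index fields mentioned above are the usual nested-set encoding of a tree: number each node on the way down and again on the way back up, so that a node's descendants are exactly the nodes whose interval falls inside its own. The Python below illustrates the idea on a toy child-to-parent mapping. The function names and the in-memory dictionaries are assumptions for illustration only; the real import would read the ncbi download and write the two extra columns to the database, and the toy taxonomy here is a simplification, not the actual ncbi classification.

```python
from collections import defaultdict


def assign_nested_set_indexes(parent_of, root):
    """Return (left, right) dicts for every node reachable from root.

    parent_of maps child -> parent. The traversal is iterative so that a
    very large tree does not hit Python's recursion limit."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if child != root:
            children[parent].append(child)

    left, right = {}, {}
    counter = 0
    stack = [(root, False)]          # (node, have we already descended?)
    while stack:
        node, visited = stack.pop()
        counter += 1
        if visited:
            right[node] = counter    # number assigned on the way back up
        else:
            left[node] = counter     # number assigned on the way down
            stack.append((node, True))
            for child in sorted(children[node], reverse=True):
                stack.append((child, False))
    return left, right


def descendants(node, left, right):
    """All nodes whose (left, right) interval nests strictly inside node's."""
    return {n for n in left
            if left[node] < left[n] and right[n] < right[node]}


if __name__ == "__main__":
    # Toy classification for illustration only.
    toy = {"Primates": "Euarchontoglires", "Glires": "Euarchontoglires",
           "Homo": "Primates", "Pan": "Primates"}
    left, right = assign_nested_set_indexes(toy, "Euarchontoglires")
    print(sorted(descendants("Primates", left, right)))
```

In SQL terms the descendants lookup becomes a single range condition on left_index and right_index, which is why a separate path table is optional when there is only one classification tree.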
From: Mark D. <mj...@ge...> - 2009-06-05 17:37:35
|
On Fri, 2009-05-29 at 10:52 -0400, Rutger Vos wrote:
> > I will be in London until the 5th. Conceivably I can Skype from London, but
> > not knowing the conference schedule, I can't commit.
>
> Obviously you're a key participant so let's just play it by ear; I
> suppose we can start the discussion by email and either call in next
> week or whenever you can make it.

Yes, let's do that.

--
Mark Jason Dominus mj...@ge...
Penn Genome Frontiers Institute +1 215 573 5387
|
From: Mark D. <mj...@ge...> - 2009-06-04 15:27:29
|
On Thu, 2009-05-28 at 17:20 -0400, Rutger Vos wrote:
> Val & I have concluded we should have a telecon (or other discussion
> format) about taxonomic queries.

> Here's a doodle poll to pick a date/time: http://doodle.com/7x5aykszup954ysa

Did we pick a date and time?

--
Mark Jason Dominus mj...@ge...
Penn Genome Frontiers Institute +1 215 573 5387
|
From: Mark D. <mj...@ge...> - 2009-06-04 20:13:20
|
On Thu, 2009-06-04 at 11:27 -0400, Mark Dominus wrote:
> Did we pick a date and time?

The phone call will be Friday, 5 June at 14:30 EDT, 11:30 PDT.

Bill, Rutger, Val and I will attend. If anyone else wants to be involved, email Val.

--
Mark Jason Dominus mj...@ge...
Penn Genome Frontiers Institute +1 215 573 5387
|
From: William P. <wil...@ya...> - 2009-06-05 18:36:12
|
On Jun 4, 2009, at 4:12 PM, Mark Dominus wrote:

> The phone call will be Friday, 5 June at 14:30 EDT, 11:30 PDT.
>
> Bill, Rutger, Val and I will attend. If anyone else wants to be
> involved, email Val.

I have my iChat switched on -- but I don't see anyone else on-line. Is this really happening? (or is it happening by land-line phone?)

bp
|
From: Val T. <va...@ci...> - 2009-06-05 18:57:30
|
Sorry, I was in a taxi in traffic until a few minutes ago. I am in my office now.

Val

On Jun 5, 2009, at 2:35 PM, William Piel wrote:

>
> On Jun 4, 2009, at 4:12 PM, Mark Dominus wrote:
>
>> The phone call will be Friday, 5 June at 14:30 EDT, 11:30 PDT.
>>
>> Bill, Rutger, Val and I will attend. If anyone else wants to be
>> involved, email Val.
>
> I have my iChat switched on -- but I don't see anyone else on-line.
> Is this really happening? (or is it happening by land-line phone?)
>
> bp
|