Re: [Ncl-devel] annotation API for NCL

OK, I understand what you're saying here: NO CHANGES TO THE EXISTING INTERFACE.

Any new tree representations should be provided by new functions, and the existing functions should give identical output after any of my code modifications, and we should use the existing unique ids for attaching data to nodes/edges as much as possible.

The only thing I'm not sure about is the latter point - consider your example:

(1,(3,5),(4,2:3.4));

Internally we will need at least to store:

(1,(3,5)6,(4,2:3.4)7)8;

So perhaps some mechanism to remember which nodes are labelled with TAXA ids and which are labelled otherwise is in order.

Like (internally)

(1,(3,5)!6,(4,2:3.4)!7)!8;

or whatnot. What the '!' is doesn't really matter as long as it isn't input by client code in the initial tree file. It could be a non-printing ASCII code (like zero). It could be as simple as a space: spaces are already removed from the representation of NxsSimpleTree during reading, so they are guaranteed to be absent from the internal newick representation, right? If spaces are guaranteed to be absent from NxsSimpleTree taxon labels, then we are free to use them as custom tokens in the library.

input tree in nexus or other file = (1, (3, 5), (4,2 : 3.4) ); (or whatever)
input tree after loading into ncl using current code:  (1,(3,5),(4,2:3.4)); (spaces are ignored)
input tree after my code has applied ids to any "unlabelled" nodes: (1,(3,5) 6,(4,2:3.4) 7) 8;   [space used as delimiter for non-taxa labels]
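
To make the labelling step concrete, here is a minimal sketch of what I mean. The function name labelInternalNodes and its parameters are made up for illustration (nothing like this exists in NCL yet), and it assumes the input is the normalized, numbers-only newick string with no internal node labels already present:

```cpp
#include <string>

// Hypothetical helper, not part of NCL: append "<delim><id>" after every ')'
// in a normalized newick string. Ids start at nTaxa + 1, so they cannot
// clash with the taxon numbers (1..nTaxa) already used for the leaves.
std::string labelInternalNodes(const std::string &newick, unsigned nTaxa, char delim)
{
    std::string out;
    unsigned nextId = nTaxa + 1;
    for (char c : newick) {
        out += c;
        if (c == ')') {
            // Each closing parenthesis ends exactly one internal node.
            out += delim;
            out += std::to_string(nextId++);
        }
    }
    return out;
}
```

Called with the example tree, 5 taxa and '!' as the delimiter, this gives the '!'-marked string shown earlier; with ' ' as the delimiter it gives the space-delimited form.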

Each tree loaded from a non-nexml source has null node and edge hash tables (or equivalent):

std::map<unsigned int, SomeNodeMetadataStruct> * nodeData = 0;
std::map<unsigned int, SomeEdgeMetadataStruct> * edgeData = 0;

If the tree is loaded from a nexml source, the hash tables are instantiated:

nodeData = new std::map<unsigned int, SomeNodeMetadataStruct>;
edgeData = new std::map<unsigned int, SomeEdgeMetadataStruct>;

So the cost of these modifications is a slightly longer stored newick string, plus two null pointers per tree (when loaded from a non-nexml source) or a couple of int-keyed hash tables (when loaded from a nexml source).
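
Putting the pieces together, a rough sketch of the per-tree storage might look like the following. Everything here is hypothetical: StoredTree, enableMetadata and findNode are names invented for illustration, the metadata structs are placeholders, and std::unique_ptr is used in place of the raw pointers above only so the sketch cleans up after itself:

```cpp
#include <map>
#include <memory>
#include <string>

// Placeholder metadata records; the real contents are still to be decided.
struct SomeNodeMetadataStruct { std::string note; };
struct SomeEdgeMetadataStruct { std::string note; };

typedef std::map<unsigned int, SomeNodeMetadataStruct> NodeMetadataMap;
typedef std::map<unsigned int, SomeEdgeMetadataStruct> EdgeMetadataMap;

// Hypothetical per-tree record: the newick string plus optional metadata.
struct StoredTree {
    std::string newick;
    std::unique_ptr<NodeMetadataMap> nodeData; // stays null for non-nexml sources
    std::unique_ptr<EdgeMetadataMap> edgeData; // stays null for non-nexml sources

    // Called only when the tree comes from a nexml source.
    void enableMetadata() {
        if (!nodeData) nodeData.reset(new NodeMetadataMap());
        if (!edgeData) edgeData.reset(new EdgeMetadataMap());
    }

    // Returns the metadata for a node id, or a null pointer if there is none.
    const SomeNodeMetadataStruct *findNode(unsigned int id) const {
        if (!nodeData) return nullptr;
        NodeMetadataMap::const_iterator it = nodeData->find(id);
        return it == nodeData->end() ? nullptr : &it->second;
    }
};
```

A tree read from a nexus file never calls enableMetadata(), so it carries only the two null pointers; lookups on such a tree simply return null.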

Mick

----- Original Message -----
From: "Mark Holder" <mth...@gm...>
To: "ncl-devel" <ncl...@li...>
Sent: Wednesday, May 26, 2010 7:27:09 PM GMT -08:00 US/Canada Pacific
Subject: Re: [Ncl-devel] annotation API for NCL

Hi,

On May 25, 2010, at 11:18 PM, Jeet Sukumaran wrote:
> ...Instead, I think that if client code wants 
> to access annotations on tree components, it would be perfectly fair to 
> expect them to ask for a NCL Tree object and deal or otherwise harvest 
> info from that. That way, edge and node annotations can be managed using 
> the same "porcelain" as for any other "first class" NCL objects.
> 

On May 26, 2010, at 5:37 PM, Michael Elliot wrote:
> So the tree storage would be a set of modified newick strings. But the newick string would be used to generate a true tree data structure...

        I agree that storing the tree in some compact form (presumably a newick string) but generating a traversable tree (with node and edge structures) on-the-fly would work well and be non-disruptive.  NCL has never had a good metadata system before, so we can impose some constraints on that part of the API to make our life easier and not worry about breaking client code. It would certainly simplify the tree-metadata API if the contract were strict. Such as:

        1. NCL stores the annotations for the nodes and edges in some way (we could use unique keys as Mick suggests, but I have some other thoughts below). The details of the storage would be none of the client's business;

        2. Client code can still request the newick string without any funky markup. This is how a lot of client code interacts with NCL's trees block, so we can't change this.

        3. If the client wants the richer metadata, then they will have to request that the tree be instantiated as a simple traversable tree. This can already be done using the NxsSimpleTree class. We can expand to fit our needs as it has never been a core "promise" of the NCL interface.  We would just need to add a field to NxsSimpleNode and NxsSimpleEdge where the library can attach a collection of annotations.

I don't really want to make the core tree storage fatter (or "more prosperous" if you prefer) than it is now. I definitely use NCL to crunch huge tree files. Typically that would be done in a context in which I would not need or want lots of metadata. It would be good not to pay a heavy price in terms of performance (not that any of the proposed changes would be too expensive, but I just wanted to reiterate that constraint). Clearly adding the flexibility to deal with metadata may have some small performance cost, but we should be wary about things that will make it painful to parse files with thousands of trees (each of which has thousands of leaves).

On the topic of storing the association between the metadata and the edge and node:

After parsing the tree, NCL stores it as a newick string with numbers instead of names so that the client doesn't have to worry about parsing funky taxon names  (and the fact that 'a b' and a_b are the same taxon, plus the translate table could mean that 'biae' is also mapped to that taxon, plus the taxa have implicit numbering in nexus so a number is fine...).

We'd need to be able to produce the numbers-only newick form (to avoid breaking client code).

So, if we do want to decorate the tree with unique codes (delimited by # and $ for example), then we'd have to strip those codes out before returning the string to the client. That is not too big a deal, obviously.  We could also add a new method for getting the newick string with the codes in there (for clients who want to deal with that).
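
To show how little work the stripping is, here is a minimal sketch. The name stripNodeCodes is made up for illustration, and it assumes that the # and $ characters can never occur anywhere else in the stored string (which should hold if only numbers and newick punctuation are present):

```cpp
#include <string>

// Hypothetical helper, not part of NCL: remove every "#...$" code from a
// decorated newick string, leaving the plain numbers-only form.
std::string stripNodeCodes(const std::string &decorated, char open = '#', char close = '$')
{
    std::string out;
    bool inCode = false;
    for (char c : decorated) {
        if (inCode) {
            if (c == close)
                inCode = false; // end of the code; resume copying after it
        } else if (c == open) {
            inCode = true;      // start of a code; drop everything up to 'close'
        } else {
            out += c;
        }
    }
    return out;
}
```

The existing accessor would return the stripped string, and the new method would return the stored string untouched.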

I don't have a strenuous objection to that approach. However, I slightly prefer imposing the contract that you only get Edge and Node metadata if you ask for the full tree instantiation.  This gives the flexibility to play around with how we store the association internally, as long as we can decorate the tree that is generated on-the-fly.

Given that the internal structure is a "normalized" newick string, I'm tempted to say that we don't need an id system.  Consider the newick string:

(1,(3,5),(4,2:3.4));

The leaves are easily identified (they are numbers that do not follow : and they are constrained to be 1 + the index in the NxsTaxaBlock, so there will be no clashes).

If we have node-annotation storage and edge-annotation storage separate, then the taxon id can be used as the "key" to find the terminal edge annotations.
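
As a rough illustration of how the leaf keys fall out of the string, here is a sketch. The function name leafTaxonNumbers is invented for this message, and it assumes the normalized form described above: numbers only, no internal node labels, and branch lengths introduced by ':':

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of NCL: collect the taxon numbers of the
// leaves, in the order they appear in a normalized newick string.
std::vector<unsigned> leafTaxonNumbers(const std::string &newick)
{
    std::vector<unsigned> leaves;
    std::size_t i = 0;
    while (i < newick.size()) {
        const char c = newick[i];
        if (c == ':') {
            // Skip the branch length: everything up to the next ',', ')' or ';'.
            ++i;
            while (i < newick.size() && newick[i] != ',' && newick[i] != ')' && newick[i] != ';')
                ++i;
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            // A number that does not follow ':' is a taxon number.
            unsigned value = 0;
            while (i < newick.size() && std::isdigit(static_cast<unsigned char>(newick[i]))) {
                value = value * 10 + static_cast<unsigned>(newick[i] - '0');
                ++i;
            }
            leaves.push_back(value);
        } else {
            ++i;
        }
    }
    return leaves;
}
```

For the example string this yields 1, 3, 5, 4, 2, and each of those numbers can be used directly as the key into the terminal edge (or leaf node) annotation storage.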

Annotations for internal nodes (and edges) can then be accomplished by a number of systems.  Perhaps the easiest is to recognize that every internal node has a closing ')' character associated with it.  The offset of that closing parenthesis from the start of the string is thus a unique identifier for the node.
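
A minimal sketch of the offset idea, with the caveat that internalNodeKeys is a made-up name and that the offsets are only stable as long as the stored string itself is never edited:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of NCL: the zero-based offset of each ')'
// in the stored newick string serves as the key for one internal node.
std::vector<std::size_t> internalNodeKeys(const std::string &newick)
{
    std::vector<std::size_t> keys;
    for (std::size_t i = 0; i < newick.size(); ++i) {
        if (newick[i] == ')')
            keys.push_back(i);
    }
    return keys;
}
```

For (1,(3,5),(4,2:3.4)); the keys are 7, 17 and 18: the (3,5) clade, the (4,2) clade and the root. The on-the-fly tree builder could look up annotations by these offsets as it encounters each closing parenthesis.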

If we were not "exporting" these ids (and making client code learn the system), then the fact that the system is a bit cryptic (OK, more than a *bit* cryptic) would not cause headaches for clients.

Just a thought.  I don't have serious objections to coming up with a clear system of tagging edges and nodes with labels, but we may not need it.

all the best,
Mark

[snip]

------------------------------------------------------------------------------

_______________________________________________
Ncl-devel mailing list
Ncl...@li...
https://lists.sourceforge.net/lists/listinfo/ncl-devel