Re: [Treebase-devel] remedying mismatched OTU names

Thanks Arlin.  Indeed, this is a big issue. 

I'd say that there are two major sub-issues:

1. Taxon label consistency among objects within a submission/study. I gather that this is mostly what Arlin et al.'s PPT was addressing: if the taxon labels in the alignment don't match those in the tree(s), users can't do much with the data until this is fixed. Some minor comments to add:

	A- One source of error is that different programs comply with the NEXUS format to different degrees. For example, open your tree file in Dendroscope and then save it, and you'll find that the rules regarding illegal punctuation and underscore usage have changed, creating a mismatch with the original NEXUS alignment. Likewise, MacClade automatically converts *all* underscores to spaces even if they are single-quoted, whereas Mesquite "hard codes" underscores if the token has single quotes around it. Like Dendroscope, Archaeopteryx saves Newick and NEXUS trees so that the labels change (Christian wasn't aware of the arcane tokenization rules -- we just recently discussed this with him, so this may be fixed soon). This does call for smart algorithms that can read improperly tokenized files (i.e. the "relaxed" setting in PAUP) -- which is tough, seeing as the program has to guess at the meaning of "," or "(" in a Newick string: is it a new node or part of a token that was not quoted? And it calls for the ability to synonymize as needed, e.g. automatically recognizing that 'Homo_sapiens_x-2' in one file = 'Homo sapiens x-2' in another file.

	B- Mismatches sometimes arise when users try to indicate the GenBank accession numbers for separate locus alignments, but the tree is the result of simultaneous analysis. For example, one alignment will use "Homo_sapiens_AJ23423", another will use "Homo_sapiens_AJ564667", and the tree will use "Homo_sapiens". It's laudable that they want to include this valuable metadata, but it would be better to code it as metadata in a NeXML file. And this calls for easy-to-use NeXML editors, e.g. add the ability to enter GenBank accession numbers in Mesquite, and then save as NeXML, thus preserving "Homo_sapiens" consistently in all alignments and resulting trees, while still communicating the respective accession numbers for each locus. Summer-of-Code project here.

	C- The basic data model of matrix-rows-matching-with-tree-OTUs works for 99% of datasets, but a growing number of studies use BEAST species inference (and other similar methods) where the tree ends in species OTUs, but the alignment has many more haplotype OTUs -- i.e. there is, on purpose, a complete mismatch between alignment row labels and tree OTUs. Mesquite can handle this using a taxon association table, though I don't know whether this is formal NEXUS or just a Mesquite invention. I don't think that NeXML or PhyloML can handle this. This calls for expanding the capabilities of NeXML and PhyloML.
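
To make the "synonymize as needed" idea in 1A concrete, here is a minimal Python sketch of a relaxed comparison key for taxon labels. It is an illustration only, not how PAUP, Mesquite, or TreeBASE actually do it; the function names label_key and match_labels are hypothetical. It deliberately ignores the strict NEXUS rule that underscores inside single quotes are literal, since that rule is exactly what the programs above disagree on.

```python
import re


def label_key(token: str) -> str:
    """Return a relaxed comparison key for a NEXUS/Newick taxon label.

    Assumption (not strict NEXUS): underscores and spaces are treated as
    equivalent whether or not the token is single-quoted.
    """
    token = token.strip()
    # Drop one pair of enclosing single quotes and un-double embedded quotes.
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        token = token[1:-1].replace("''", "'")
    # Treat underscores as spaces, then collapse runs of whitespace.
    token = token.replace("_", " ")
    return re.sub(r"\s+", " ", token).strip()


def match_labels(matrix_labels, tree_labels):
    """Pair matrix labels with tree labels that share a relaxed key.

    Returns (matched, unmatched): a dict mapping each matched matrix label
    to its tree label, and a list of matrix labels left for manual curation.
    """
    by_key = {label_key(t): t for t in tree_labels}
    matched, unmatched = {}, []
    for label in matrix_labels:
        key = label_key(label)
        if key in by_key:
            matched[label] = by_key[key]
        else:
            unmatched.append(label)
    return matched, unmatched
```

With this key, 'Homo_sapiens_x-2' and 'Homo sapiens x-2' both reduce to Homo sapiens x-2 and are paired automatically; anything left in the unmatched list would still need a human (or a fuzzy matcher) to resolve.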

2. Taxon labels not mapped or not mappable to external authorities or standards. This issue is not really the focus of Arlin et al.'s PPT, but is what Brian was addressing below. Yet it's equally important for data sharing, if not more so. Some comments:

	A- Until taxon concepts are truly identifiable/citable, the mapping of taxon labels to "taxa" will always be imprecise (with precise taxonomic circumscription, usage, and meaning epistemologically impossible to communicate), but at least gross homonyms need to be addressed. This is a challenge for automated services -- the iPlant TNRS has some advantage given that it does not (yet) include animal or bacterial names, but even within a code there are inter-rank homonyms (e.g. "Drosophila" the genus or subgenus?). A "smart" service would resolve the gross homonym based on the topology of the submitted tree -- i.e. ((Aotus,Homo),Lemur) should cause the service to pick Aotus the monkey instead of Aotus the Eudicot. 

	B- Abbreviations in the taxon labels make it very difficult to do a smart TNRS lookup. Some of the examples of "resolved" labels in the PPT are nonetheless unacceptable with respect to TNRS resolution. Even something as ubiquitous as "E. coli" could refer to (or be confused with) Entamoeba coli (Grassi, 1879) instead of Escherichia coli (Migula 1895). 

	C- Another source of homonyms is virus names. This is a big problem for TreeBASE because TreeBASE's semi-automated name service starts by ignoring trailing strings that start with capital letters or that contain numbers -- e.g. the assumption is that the third part of "Homo_sapiens_AJ23423" is not part of the name, whereas the third part of "Homo_sapiens_sapiens" is part of the name. Yet, while "Neodiprion abietis" is a sawfly, "Neodiprion abietis NPV" is a gammabaculovirus that happens to infect the sawfly -- naturally, TreeBASE first tries to match the beginning part of the virus name to the host name, and the submitter needs to be sharp enough to notice and correct the problem. I'm going to guess that iPlant's TNRS will map "Ammi majus latent virus" to bishop's-weed, A. majus, instead of to a Potyvirus.
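
The trailing-string heuristic described in 2C can be sketched in a few lines of Python. This is a reconstruction from the description above, not TreeBASE's actual code; the function name candidate_name is hypothetical, and the rule of never stripping the first token (the capitalized genus) is an assumption added so that the heuristic does not consume the whole label.

```python
import re


def candidate_name(label: str) -> str:
    """Guess the name portion of a taxon label.

    Trailing tokens that start with a capital letter or contain a digit
    are dropped. The first token is always kept (assumption: it is the
    genus, which is itself capitalized).
    """
    parts = re.split(r"[_\s]+", label.strip())
    while len(parts) > 1 and (
        parts[-1][:1].isupper() or any(ch.isdigit() for ch in parts[-1])
    ):
        parts.pop()
    return " ".join(parts)
```

Under these assumptions, "Homo_sapiens_AJ23423" reduces to Homo sapiens and "Homo_sapiens_sapiens" is kept whole, as intended. But "Neodiprion abietis NPV" also reduces to Neodiprion abietis, which is precisely the virus-to-host mismatch described above, so the result can only be a first guess that the submitter confirms.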

bp

On Aug 13, 2011, at 2:19 AM, Brian O'Meara wrote:

> I agree that name matching is a problem. There is some recent work that might be of interest:
> 
> iPlant has done something similar to do just the name match up between two files in their discovery environment. Select a data file and a tree file, and it will find the names that match and then present the remainder to allow manual matching (there was talk of using fuzzy matching to get good preliminary guesses, but I don't know if that's implemented yet). It has a very similar interface to the one outlined in the slides. 
> 
> However, the long-term solution might be automatic name matching. For 30 taxa, doing fuzzy match with user curation can work, but there are now trees with tens of thousands of taxa. Having the names in two different files matched to a standard taxonomy [sadly, one has to say "a standard taxonomy" rather than "the standard taxonomy"] will allow them to be paired together as well as connected to existing information. There's a fairly new tool at http://tnrs.iplantcollaborative.org/ that does much of this now. It takes a list of plant names and matches it to a set of names from the Tropicos database. It can correct typos in names, deal with changes in taxonomy [something being moved to a different genus], etc. Due to its current database, it's limited to plants, but it's supposed to be written so that someone else can substitute a different names database. You can set it to automatically select the best match or return a set of possible matches. It also has an API that is pretty easy to use: I wrote a function to call it from within R to convert names on a phylogeny to standardized names (see code here) and it worked on a tree of 50K species.
> 
> Brian
> 
> 
> _______________________________________
> Brian O'Meara
> Assistant Professor
> Dept. of Ecology & Evolutionary Biology
> U. of Tennessee, Knoxville
> http://www.brianomeara.info
> 
> Students wanted: Applications due Dec. 15, annually
> Postdoc collaborators wanted: Check NIMBioS' website
> Funding wanted: Want to collaborate on a grant?
> 
> 
> On Fri, Aug 12, 2011 at 1:43 PM, Arlin Stoltzfus <ar...@um...> wrote:
> Dear all--
> 
> A common problem with data sharing in phylogenetics is that OTU names do not match between files, e.g., between the alignment and the tree from the same study.  I think I heard it from Bill that this is a common problem in TreeBASE submissions.  I have encountered it many times and have thought about how to design software to deal with the problem.
> 
> After discussing this with Vivek, I decided to make a more formal description of the problem which is available here (sorry about the pptx format):
> 
>  http://dl.dropbox.com/u/7727158/name_matching.pptx
> 
> This includes real examples of mismatched names collected in the wild, an explanation of why the problem occurs, mock-ups of  interactive user sessions, and implementation notes.  Vivek already started playing with some of the concepts and put an app on appspot (the link is in the presentation).
> 
> Comments are welcome.   If implemented as described, how well would this tool serve the community need for name-matching?  What would make it better?
> 
> Arlin
> -------
> Arlin Stoltzfus (ar...@um...)
> Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
> IBBR, 9600 Gudelsky Drive, Rockville, MD
> tel: 240 314 6208; web: www.molevol.org
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "MIAPA" group.
> For more options, visit this group at
> http://groups.google.com/group/miapa-discuss?hl=en