Re: [Treebase-devel] remedying mismatched OTU names

I agree that name matching is a problem. There is some recent work that
might be of interest:

iPlant has done something similar, handling just the name matching between two
files in their discovery environment. Select a data file and a tree file,
and it will find the names that match and then present the remainder to
allow manual matching (there was talk of using fuzzy matching to get good
preliminary guesses, but I don't know if that's implemented yet). It has a
very similar interface to the one outlined in the slides.
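
To make that workflow concrete, here is a rough sketch in Python (not iPlant's
code, which I haven't seen): pair the names that match exactly, then offer fuzzy
candidates for the leftovers so a person can confirm them. The function name
match_names and the 0.8 cutoff are my own assumptions; the fuzzy step uses
difflib from the standard library as a stand-in for whatever matcher a real tool
would use.

```python
import difflib

def match_names(data_names, tree_names, cutoff=0.8):
    """Pair names that match exactly, then suggest fuzzy candidates for the rest.

    Hypothetical helper for illustration; cutoff is an arbitrary similarity
    threshold between 0 and 1.
    """
    tree_set = set(tree_names)
    # Step 1: names present in both files are matched automatically.
    exact = [name for name in data_names if name in tree_set]
    exact_set = set(exact)

    # Step 2: everything else is left for manual matching.
    unmatched_data = [n for n in data_names if n not in exact_set]
    unmatched_tree = [n for n in tree_names if n not in exact_set]

    # Step 3: preliminary guesses, to be confirmed or rejected by the user.
    suggestions = {
        name: difflib.get_close_matches(name, unmatched_tree, n=3, cutoff=cutoff)
        for name in unmatched_data
    }
    return exact, suggestions

if __name__ == "__main__":
    data = ["Homo_sapiens", "Pan_troglodytes", "Mus_musculus"]
    tree = ["Homo_sapiens", "Pan_troglodites", "M_musculus"]
    exact, suggestions = match_names(data, tree)
    print("exact:", exact)
    for name, candidates in suggestions.items():
        print(name, "->", candidates)
```

An empty candidate list means the name still has to be matched by hand.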

However, the long-term solution might be automatic name matching. For 30
taxa, fuzzy matching with user curation can work, but there are now trees
with tens of thousands of taxa. Having the names in two different files
matched to a standard taxonomy [sadly, one has to say "a standard taxonomy"
rather than "the standard taxonomy"] will allow them to be paired together as
well as connected to existing information. There's a fairly new tool at
http://tnrs.iplantcollaborative.org/ that does much of this now. It takes a
list of plant names and matches it to a set of names from the Tropicos
database. It can correct typos in names, deal with changes in taxonomy
[something being moved to a different genus], etc. Due to its current
database, it's limited to plants, but it's supposed to be written so that
someone else can substitute a different names database. You can set it to
automatically select the best match or return a set of possible matches. It
also has an API that is pretty easy to use: I wrote a function to call it
from within R to convert names on a phylogeny to standardized names (see
code here: https://r-forge.r-project.org/scm/viewvc.php/pkg/R/resolveNames.R?view=markup&revision=180&root=omearalab)
and it worked on a tree of 50K species.
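
The idea of pairing two files through a standard taxonomy can be sketched
without any web service. The Python below is only an illustration and does not
call TNRS or reproduce its API: the REFERENCE table is a tiny hypothetical
stand-in for a real names database, and resolve and pair_by_accepted_name are
names I made up. Each name is resolved to an accepted name (exactly, via a
synonym, or via a fuzzy match for typos), and names from the two files are
paired when they resolve to the same accepted name.

```python
import difflib

# Hypothetical reference table: every known name (accepted or synonym)
# maps to its accepted name. A real names database would replace this.
REFERENCE = {
    "Quercus alba": "Quercus alba",
    "Solanum lycopersicum": "Solanum lycopersicum",
    "Lycopersicon esculentum": "Solanum lycopersicum",  # moved to another genus
}

def resolve(name, reference=REFERENCE, cutoff=0.85):
    """Return the accepted name for `name`, or None if nothing is close enough."""
    cleaned = name.replace("_", " ").strip()
    if cleaned in reference:
        return reference[cleaned]
    # Fall back to the single best fuzzy match to absorb typos.
    close = difflib.get_close_matches(cleaned, list(reference), n=1, cutoff=cutoff)
    return reference[close[0]] if close else None

def pair_by_accepted_name(data_names, tree_names):
    """Pair names from two files that resolve to the same accepted name."""
    tree_index = {}
    for tree_name in tree_names:
        accepted = resolve(tree_name)
        if accepted is not None:
            # Keep the first tree name seen for each accepted name.
            tree_index.setdefault(accepted, tree_name)
    pairs = {}
    for data_name in data_names:
        accepted = resolve(data_name)
        if accepted in tree_index:
            pairs[data_name] = tree_index[accepted]
    return pairs

if __name__ == "__main__":
    data = ["Lycopersicon_esculentum", "Quercus_abla"]
    tree = ["Solanum lycopersicum", "Quercus alba"]
    print(pair_by_accepted_name(data, tree))
```

Because each name is resolved independently against the reference, the cost
grows with the number of names rather than with the number of name pairs, which
is what matters for trees with tens of thousands of taxa.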

Brian

_______________________________________
Brian O'Meara
Assistant Professor
Dept. of Ecology & Evolutionary Biology
U. of Tennessee, Knoxville
http://www.brianomeara.info

Students wanted: Applications due Dec. 15, annually
Postdoc collaborators wanted: Check NIMBioS' website
Funding wanted: Want to collaborate on a grant?

On Fri, Aug 12, 2011 at 1:43 PM, Arlin Stoltzfus <ar...@um...> wrote:

> Dear all--
>
> A common problem with data sharing in phylogenetics is that OTU names do
> not match between files, e.g., between the alignment and the tree from the
> same study.  I think I heard it from Bill that this is a common problem in
> TreeBASE submissions.  I have encountered it many times and have thought
> about how to design software to deal with the problem.
>
> After discussing this with Vivek, I decided to make a more formal
> description of the problem which is available here (sorry about the pptx
> format):
>
>  http://dl.dropbox.com/u/7727158/name_matching.pptx
>
> This includes real examples of mismatched names collected in the wild, an
> explanation of why the problem occurs, mock-ups of  interactive user
> sessions, and implementation notes.  Vivek already started playing with some
> of the concepts and put an app on appspot (the link is in the presentation).
>
> Comments are welcome.   If implemented as described, how well would this
> tool serve the community need for name-matching?  What would make it better?
>
> Arlin
> -------
> Arlin Stoltzfus (ar...@um...)
> Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
> IBBR, 9600 Gudelsky Drive, Rockville, MD
> tel: 240 314 6208; web: www.molevol.org