From: Brian O'M. <bo...@ut...> - 2011-08-13 06:27:20
I agree that name matching is a problem. There is some recent work that might be of interest: iPlant has built something similar in their Discovery Environment that handles just the name matching between two files. Select a data file and a tree file, and it will find the names that match and then present the remainder for manual matching (there was talk of using fuzzy matching to get good preliminary guesses, but I don't know if that's implemented yet). Its interface is very similar to the one outlined in the slides.

However, the long-term solution might be automatic name matching. For 30 taxa, fuzzy matching with user curation can work, but there are now trees with tens of thousands of taxa. Having the names in two different files matched to a standard taxonomy [sadly, one has to say "a standard taxonomy" rather than "the standard taxonomy"] will allow them to be paired together as well as connected to existing information.

There's a fairly new tool at http://tnrs.iplantcollaborative.org/ that does much of this now. It takes a list of plant names and matches it to a set of names from the Tropicos database. It can correct typos in names, deal with changes in taxonomy [something being moved to a different genus], etc. Due to its current database, it's limited to plants, but it's supposed to be written so that someone else can substitute a different names database. You can set it to automatically select the best match or return a set of possible matches. It also has an API that is pretty easy to use: I wrote a function to call it from within R to convert names on a phylogeny to standardized names (see code here: https://r-forge.r-project.org/scm/viewvc.php/pkg/R/resolveNames.R?view=markup&revision=180&root=omearalab) and it worked on a tree of 50K species.

Brian

_______________________________________
Brian O'Meara
Assistant Professor
Dept. of Ecology & Evolutionary Biology
U. of Tennessee, Knoxville
http://www.brianomeara.info

Students wanted: Applications due Dec. 15, annually
Postdoc collaborators wanted: Check NIMBioS' website
Funding wanted: Want to collaborate on a grant?

On Fri, Aug 12, 2011 at 1:43 PM, Arlin Stoltzfus <ar...@um...> wrote:

> Dear all--
>
> A common problem with data sharing in phylogenetics is that OTU names do
> not match between files, e.g., between the alignment and the tree from the
> same study. I think I heard it from Bill that this is a common problem in
> TreeBASE submissions. I have encountered it many times and have thought
> about how to design software to deal with the problem.
>
> After discussing this with Vivek, I decided to make a more formal
> description of the problem which is available here (sorry about the pptx
> format):
>
> http://dl.dropbox.com/u/7727158/name_matching.pptx
>
> This includes real examples of mismatched names collected in the wild, an
> explanation of why the problem occurs, mock-ups of interactive user
> sessions, and implementation notes. Vivek already started playing with some
> of the concepts and put an app on appspot (the link is in the presentation).
>
> Comments are welcome. If implemented as described, how well would this
> tool serve the community need for name-matching? What would make it better?
>
> Arlin
> -------
> Arlin Stoltzfus (ar...@um...)
> Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
> IBBR, 9600 Gudelsky Drive, Rockville, MD
> tel: 240 314 6208; web: www.molevol.org
>
> --
> You received this message because you are subscribed to the Google
> Groups "MIAPA" group.
> For more options, visit this group at
> http://groups.google.com/group/miapa-discuss?hl=en
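
For anyone who wants to experiment with the exact-then-fuzzy workflow discussed in this thread (pair the names that match, then offer preliminary guesses for the remainder), here is a minimal Python sketch. It is not the iPlant or TNRS implementation: the function names `normalize` and `match_names`, the normalization rules, and the 0.8 cutoff are all assumptions chosen for illustration, and the fuzzy step simply uses the standard library's `difflib`.

```python
from difflib import get_close_matches


def normalize(name):
    """Hypothetical normalization: lowercase, and treat underscores,
    periods, and runs of whitespace as a single space."""
    return " ".join(name.replace("_", " ").replace(".", " ").lower().split())


def match_names(data_names, tree_names, cutoff=0.8):
    """Pair names from a data file with names from a tree file.

    Returns (exact, suggested, unmatched_data, unmatched_tree):
      exact          -- data name -> tree name, identical after normalization
      suggested      -- data name -> best fuzzy guess, meant for user review
      unmatched_data -- data names with no guess above the cutoff
      unmatched_tree -- tree names that nothing was paired with
    Assumes tree names are distinct after normalization.
    """
    remaining = {normalize(t): t for t in tree_names}
    exact, suggested, pending = {}, {}, []

    # Pass 1: claim exact matches first so they cannot be taken by a fuzzy guess.
    for d in data_names:
        hit = remaining.pop(normalize(d), None)
        if hit is not None:
            exact[d] = hit
        else:
            pending.append(d)

    # Pass 2: fuzzy guesses against whatever tree names are still unpaired.
    unmatched_data = []
    for d in pending:
        close = get_close_matches(normalize(d), list(remaining), n=1, cutoff=cutoff)
        if close:
            suggested[d] = remaining.pop(close[0])
        else:
            unmatched_data.append(d)

    return exact, suggested, unmatched_data, sorted(remaining.values())


if __name__ == "__main__":
    data = ["Homo_sapiens", "Pan troglodites", "Mus musculus"]
    tree = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
    for label, result in zip(
        ("exact", "suggested", "unmatched data", "unmatched tree"),
        match_names(data, tree),
    ):
        print(label, result)
```

The suggestions are kept separate from the exact matches on purpose, so a person can confirm or reject each guess. Pairwise fuzzy comparison like this is quadratic in the number of unmatched names, which is one reason it suits small data sets better than trees with tens of thousands of taxa.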