From: Arlin S. <ar...@um...> - 2011-11-04 17:53:58
I added a link to Linnaeus (see last slide) in my PowerPoint
presentation of this problem:

http://dl.dropbox.com/u/7727158/name_matching.pptx

Let me know if there are other resources that I should note. That way
we won't lose the knowledge that we accumulated in this discussion
thread.

Arlin

On Aug 27, 2011, at 5:28 AM, Hilmar Lapp wrote:

> I spoke with the developer, Martin Gerner. He thought it might be
> well applicable to this task, even though the tool does a lot more
> than we possibly need here. For example, it tokenizes the input, and
> also is capable of applying some special "inference" rules (for
> instance, "HeLa cells" will be tagged with "Homo sapiens") that are
> quite useful if the purpose is linking of text to knowledge terms,
> but go beyond simple synonym matching (which it does, too, though).
> The dictionaries are pluggable, and apparently it is quite fast in
> principle.
>
> -hilmar
>
> On Aug 22, 2011, at 6:39 PM, Mark Holder wrote:
>
>> Hi all,
>> I just noticed that Hilmar tweeted a link to Linnaeus: http://linnaeus.sourceforge.net/
>> which seems relevant to this thread.
>>
>> all the best,
>> Mark
>>
>> On Aug 19, 2011, at 11:06 AM, Arlin Stoltzfus wrote:
>>
>>> On Aug 15, 2011, at 4:09 AM, Roderic Page wrote:
>>>
>>>> Mapping tree names to matrix names could be formulated as a
>>>> bipartite matching problem, where we have two lists of names and
>>>> want to find the best matching. See http://iphylo.blogspot.com/2007/09/matching-names-in-phylogeny-data-files.html
>>>> for more details.
>>>
>>> In computer science, this is called the "marriage problem" when
>>> the two lists are the same size. We have a set { X } and a set
>>> { Y } of elements with some properties. We have a function
>>> f( X_i, Y_j ) that computes a match score for each pair, using the
>>> properties. In our case, the only property is the name-string.
>>> The marriage problem is to find a pairwise mapping that is optimal
>>> in some way.
>>> If optimality means minimizing the cost of the worst match, then
>>> this is (apparently, to me) the same as the linear bottleneck
>>> assignment problem.
>>>
>>> An obvious function to use (not necessarily the best for our case)
>>> is the edit distance, i.e., the minimum number of character-wise
>>> edit operations to convert X_i into Y_j. This is called the
>>> Levenshtein distance
>>> (http://en.wikipedia.org/wiki/Levenshtein_distance).
>>>
>>> But there is nothing to stop us from creating a distance function
>>> that is optimized to work well in phyloinformatics. We could test
>>> different functions using real cases such as the ones in my
>>> slideshow.
>>>
>>> One special condition is that, for us, the cost of a
>>> s/<underscore>/<space>/ edit is very low. Another special
>>> condition is reflected in Rod's longest-common-substring method of
>>> matching -- we often have pairs of matching names that have long
>>> matching substrings and differ by interruptions. Maybe we need a
>>> gap-open and gap-extend penalty like in sequence alignment
>>> algorithms.
>>>
>>> Arlin
>>>
>>>> This approach could be extended to, say, matching names in a NEXUS
>>>> file to those in a publication, or a GenBank POPSET from a
>>>> publication. For example, if we have a NEXUS file and a POPSET we
>>>> could compute the best matching between the two sets of names. Or
>>>> taxon names and/or accession numbers could be retrieved from the
>>>> publication.
>>>>
>>>> This would also help provide the context to help avoid homonyms,
>>>> such as matching animal names to plant names.
>>>>
>>>> Regards
>>>>
>>>> Rod
>>>>
>>>>
>>>> On 15 Aug 2011, at 05:13, Rutger Vos wrote:
>>>>
>>>>>> this calls for easy-to-use NeXML editors. e.g.
>>>>>> add the ability to enter Genbank accession numbers in Mesquite,
>>>>>> and then save as NeXML, thus preserving "Homo_sapiens"
>>>>>> consistently in all alignments and resulting trees, while still
>>>>>> communicating the respective accession numbers for each locus.
>>>>>> Summer-of-Code project here.
>>>>>
>>>>> Indeed.
>>>>>
>>>>>> C- The basic data model of matrix-rows-matching-with-tree-OTUs
>>>>>> works for 99% of datasets, but a growing number of studies use
>>>>>> BEAST species inference (and other similar methods) where the
>>>>>> tree ends in species OTUs, but the alignment has many more
>>>>>> haplotype OTUs. -- i.e. there is, on purpose, a complete
>>>>>> mismatch between alignment row labels and tree OTUs. Mesquite
>>>>>> can handle this using a taxon association table, though I don't
>>>>>> know that this is formal NEXUS or just a Mesquite invention. I
>>>>>> don't think that NeXML or PhyloML can handle this. This calls
>>>>>> for expanding the capabilities of NeXML and PhyloML.
>>>>>
>>>>> Yes and no. Multiple matrix rows can reference the same otu, but
>>>>> that's not quite what we want. Multiple, separately annotatable
>>>>> matrix row segments would be a good feature to have, also for
>>>>> TreeBASE's needs.
>>>>>
>>>>> --
>>>>> Dr. Rutger A. Vos
>>>>> School of Biological Sciences
>>>>> Philip Lyle Building, Level 4
>>>>> University of Reading
>>>>> Reading, RG6 6BX, United Kingdom
>>>>> Tel: +44 (0) 118 378 7535
>>>>> http://rutgervos.blogspot.com
>>>>>
>>>>> _______________________________________________
>>>>> Treebase-devel mailing list
>>>>> Tre...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>>>>
>>>> ---------------------------------------------------------
>>>> Roderic Page
>>>> Professor of Taxonomy
>>>> Institute of Biodiversity, Animal Health and Comparative Medicine
>>>> College of Medical, Veterinary and Life Sciences
>>>> Graham Kerr Building
>>>> University of Glasgow
>>>> Glasgow G12 8QQ, UK
>>>>
>>>> Email: r....@bi...
>>>> Tel: +44 141 330 4778
>>>> Fax: +44 141 330 2792
>>>> AIM: rod...@ai...
>>>> Facebook: http://www.facebook.com/profile.php?id=1112517192
>>>> Twitter: http://twitter.com/rdmpage
>>>> Blog: http://iphylo.blogspot.com
>>>> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
>>>
>>> -------
>>> Arlin Stoltzfus (ar...@um...)
>>> Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
>>> IBBR, 9600 Gudelsky Drive, Rockville, MD
>>> tel: 240 314 6208; web: www.molevol.org
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
> ===========================================================
>
> --
> You received this message because you are subscribed to the Google
> Groups "MIAPA" group.
> For more options, visit this group at
> http://groups.google.com/group/miapa-discuss?hl=en

-------
Arlin Stoltzfus (ar...@um...)
Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
IBBR, 9600 Gudelsky Drive, Rockville, MD
tel: 240 314 6208; web: www.molevol.org
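
One way to make the distance-function idea from the thread concrete is
sketched below in Python. It is only an illustration under stated
assumptions: the 0.1 cost for an underscore/space swap, the names
name_distance and best_matching, and the choice to minimize the total
cost (rather than the worst match, as in the bottleneck formulation)
are placeholders, not anything proposed on the list. The brute-force
search over permutations is only practical for a handful of names, and
gap-open/gap-extend penalties are not modelled.

```python
from itertools import permutations

# Assumed cost for swapping "_" and " "; the thread only says "very low".
CHEAP = 0.1


def sub_cost(a, b):
    """Cost of substituting character a with character b."""
    if a == b:
        return 0.0
    if {a, b} == {"_", " "}:
        return CHEAP
    return 1.0


def name_distance(x, y):
    """Levenshtein-style distance with a cheap underscore/space swap."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = [float(j) for j in range(len(y) + 1)]
    for i, cx in enumerate(x, start=1):
        cur = [float(i)]
        for j, cy in enumerate(y, start=1):
            cur.append(min(
                prev[j] + 1.0,                   # delete cx
                cur[j - 1] + 1.0,                # insert cy
                prev[j - 1] + sub_cost(cx, cy),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def best_matching(xs, ys):
    """Brute-force minimum-total-cost pairing of two equal-size lists."""
    if len(xs) != len(ys):
        raise ValueError("lists must be the same size")
    best = min(
        permutations(range(len(ys))),
        key=lambda p: sum(name_distance(x, ys[j]) for x, j in zip(xs, p)),
    )
    return [(x, ys[j]) for x, j in zip(xs, best)]
```

For example, name_distance("Homo_sapiens", "Homo sapiens") gives 0.1
under these assumed costs, whereas plain Levenshtein distance would
give 1. For lists of realistic size, a polynomial-time assignment
solver would be the natural replacement for the permutation search.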