Re: [Treebase-devel] remedying mismatched OTU names


I added a link to Linnaeus (see last slide) in my PowerPoint  
presentation of this problem:

    http://dl.dropbox.com/u/7727158/name_matching.pptx

Let me know if there are other resources that I should note.  That way  
we won't lose the knowledge that we accumulated in this discussion  
thread.

Arlin

On Aug 27, 2011, at 5:28 AM, Hilmar Lapp wrote:

> I spoke with the developer, Martin Gerner. He thought it could work
> well for this task, even though the tool does a lot more than we
> need here. For example, it tokenizes the input and can apply
> special "inference" rules (for instance, "HeLa cells" will be tagged
> with "Homo sapiens"). Those are quite useful if the purpose is
> linking text to knowledge terms, but they go beyond simple synonym
> matching (which it also does). The dictionaries are pluggable, and
> apparently it is quite fast in principle.
>
> 	-hilmar
>
> On Aug 22, 2011, at 6:39 PM, Mark Holder wrote:
>
>> Hi all,
>> 	I just noticed that Hilmar tweeted a link to Linnaeus
>> (http://linnaeus.sourceforge.net/), which seems relevant to this thread.
>>
>> all the best,
>> Mark
>>
>> On Aug 19, 2011, at 11:06 AM, Arlin Stoltzfus wrote:
>>
>>> On Aug 15, 2011, at 4:09 AM, Roderic Page wrote:
>>>
>>>> Mapping tree names to matrix names could be formulated as a  
>>>> bipartite matching problem, where we have two lists of names and  
>>>> want to find the best matching. See
>>>> http://iphylo.blogspot.com/2007/09/matching-names-in-phylogeny-data-files.html
>>>> for more details.
>>>
>>> In computer science, this is called the "marriage problem" when  
>>> the two lists are the same size.  We have a set { X } and a set  
>>> { Y } of elements with some properties.  We have a function  
>>> f( X_i, Y_j ) that computes a match score for each pair, using the  
>>> properties.  In our case, the only property is the name-string.   
>>> The marriage problem is to find a pairwise mapping that is optimal  
>>> in some way.   If optimality means minimizing the cost of the  
>>> worst match, then this is (apparently, to me) the same as the  
>>> linear bottleneck assignment problem.
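
To make the "minimize the cost of the worst match" formulation concrete, here is a minimal Python sketch of the linear bottleneck assignment problem. It is an illustration only, not something proposed in this thread: the function names are hypothetical, the input is assumed to be a square matrix with cost[i][j] = f( X_i, Y_j ), and the method (binary search over a cost threshold plus a simple augmenting-path bipartite matching) is just one of several ways to solve it.

```python
from typing import List, Optional, Tuple


def perfect_matching(allowed: List[List[bool]]) -> Optional[List[int]]:
    """Augmenting-path bipartite matching on a square boolean matrix.

    Returns match_of_x, where match_of_x[i] is the Y index paired with X_i,
    or None if no perfect matching exists among the allowed pairs.
    """
    n = len(allowed)
    match_of_y = [-1] * n  # match_of_y[j] = X index currently paired with Y_j

    def try_assign(i: int, seen: List[bool]) -> bool:
        # Try to pair X_i with some allowed Y_j, re-homing earlier pairs if needed.
        for j in range(n):
            if allowed[i][j] and not seen[j]:
                seen[j] = True
                if match_of_y[j] == -1 or try_assign(match_of_y[j], seen):
                    match_of_y[j] = i
                    return True
        return False

    for i in range(n):
        if not try_assign(i, [False] * n):
            return None
    match_of_x = [-1] * n
    for j, i in enumerate(match_of_y):
        match_of_x[i] = j
    return match_of_x


def bottleneck_assignment(
    cost: List[List[float]],
) -> Optional[Tuple[float, List[int]]]:
    """Return (worst_cost, match_of_x) minimizing the largest pair cost.

    Returns None for an empty matrix.
    """
    thresholds = sorted({c for row in cost for c in row})
    lo, hi = 0, len(thresholds) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        t = thresholds[mid]
        # Allow only pairs whose cost does not exceed the candidate threshold.
        m = perfect_matching([[c <= t for c in row] for row in cost])
        if m is not None:
            best = (t, m)  # feasible: try a smaller threshold
            hi = mid - 1
        else:
            lo = mid + 1   # infeasible: the worst match must cost more
    return best
```

For the cost matrix [[1, 5], [4, 9]], pairing X_0 with Y_1 and X_1 with Y_0 gives a worst-match cost of 5, whereas the other pairing has a worst-match cost of 9. The cost matrix itself can be filled in by whatever name-distance function is being tested.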
>>>
>>> An obvious function to use (not necessarily the best for our case)  
>>> is the edit distance, i.e., the number of character-wise edit  
>>> operations to convert X_i into Y_j.   This is called the  
>>> Levenshtein distance
>>> (http://en.wikipedia.org/wiki/Levenshtein_distance).
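
For reference, a minimal Python sketch of the standard dynamic-programming computation of the Levenshtein distance, with unit cost for each insertion, deletion, and substitution. The function name is just an illustrative choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free if equal
                )
            )
        prev = curr
    return prev[-1]
```

With unit costs, "Homo_sapiens" and "Homo sapiens" are at distance 1, the same as any other single-character difference.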
>>>
>>> But there is nothing to stop us from creating a distance function  
>>> that is optimized to work well in phyloinformatics.  We could test  
>>> different functions using real cases such as the ones in my  
>>> slideshow.
>>>
>>> One special condition is that, for us, the cost of a
>>> s/<underscore>/<space>/ edit is very low.   Another special
>>> condition is reflected in Rod's longest-common-substring method of  
>>> matching-- we often have pairs of matching names that have long  
>>> matching substrings and differ by interruptions.   Maybe we need a  
>>> gap-open and gap-extend penalty like in sequence alignment  
>>> algorithms.
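
One way these two conditions could be sketched in Python is an edit distance with a cheap underscore/space substitution and affine gap costs, computed with the usual three-state recurrence from sequence alignment. Everything specific here is an assumption for illustration: the function names, and the values of sep_cost, gap_open, and gap_extend, are placeholders that would need tuning against real cases.

```python
import math


def sub_cost(x: str, y: str, sep_cost: float) -> float:
    """Substitution cost: free if equal, cheap for underscore vs. space."""
    if x == y:
        return 0.0
    if {x, y} == {"_", " "}:
        return sep_cost
    return 1.0


def name_distance(
    a: str,
    b: str,
    gap_open: float = 1.0,    # assumed cost of the first character of a gap
    gap_extend: float = 0.2,  # assumed cost of each further gap character
    sep_cost: float = 0.1,    # assumed cost of underscore <-> space
) -> float:
    """Global alignment cost between two names with affine gap penalties."""
    n, m = len(a), len(b)
    inf = math.inf
    # M: ends with a[i-1] aligned to b[j-1]; X: ends with a gap in b;
    # Y: ends with a gap in a.
    M = [[inf] * (m + 1) for _ in range(n + 1)]
    X = [[inf] * (m + 1) for _ in range(n + 1)]
    Y = [[inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = sub_cost(a[i - 1], b[j - 1], sep_cost) + min(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            X[i][j] = min(
                M[i - 1][j] + gap_open,
                X[i - 1][j] + gap_extend,
                Y[i - 1][j] + gap_open,
            )
            Y[i][j] = min(
                M[i][j - 1] + gap_open,
                Y[i][j - 1] + gap_extend,
                X[i][j - 1] + gap_open,
            )
    return min(M[n][m], X[n][m], Y[n][m])
```

With these placeholder values, "Homo_sapiens" vs. "Homo sapiens" costs 0.1, and a name that differs only by one contiguous 9-character interruption (such as an appended accession number) costs 1.0 + 8 x 0.2 = 2.6 rather than the 9 that a unit-cost edit distance would assign.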
>>>
>>> Arlin
>>>
>>>> This approach could be extended to, say, matching names in a NEXUS
>>>> file to those in a publication, or a GenBank POPSET from a  
>>>> publication. For example, if we have a NEXUS file and a POPSET we  
>>>> could compute the best matching between the two sets of names. Or  
>>>> taxon names and/or accession numbers could be retrieved from the  
>>>> publication.
>>>>
>>>> This would also provide context that helps avoid homonym
>>>> mismatches, such as matching an animal name to a plant name.
>>>>
>>>> Regards
>>>>
>>>> Rod
>>>>
>>>>
>>>> On 15 Aug 2011, at 05:13, Rutger Vos wrote:
>>>>
>>>>>> this calls for easy-to-use NeXML editors. e.g. add the ability to enter
>>>>>> GenBank accession numbers in Mesquite, and then save as NeXML, thus
>>>>>> preserving "Homo_sapiens" consistently in all alignments and resulting
>>>>>> trees, while still communicating the respective accession numbers for
>>>>>> each locus. Summer-of-Code project here.
>>>>>
>>>>> Indeed.
>>>>>
>>>>>> C- The basic data model of matrix-rows-matching-with-tree-OTUs works
>>>>>> for 99% of datasets, but a growing number of studies use BEAST species
>>>>>> inference (and other similar methods) where the tree ends in species
>>>>>> OTUs, but the alignment has many more haplotype OTUs. -- i.e. there is,
>>>>>> on purpose, a complete mismatch between alignment row labels and tree
>>>>>> OTUs. Mesquite can handle this using a taxon association table, though
>>>>>> I don't know that this is formal NEXUS or just a Mesquite invention. I
>>>>>> don't think that NeXML or PhyloML can handle this. This calls for
>>>>>> expanding the capabilities of NeXML and PhyloML.
>>>>>
>>>>> Yes and no. Multiple matrix rows can reference the same otu, but
>>>>> that's not quite what we want. Multiple, separately annotatable matrix
>>>>> row segments would be a good feature to have, also for TreeBASE's
>>>>> needs.
>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>> Dr. Rutger A. Vos
>>>>> School of Biological Sciences
>>>>> Philip Lyle Building, Level 4
>>>>> University of Reading
>>>>> Reading, RG6 6BX, United Kingdom
>>>>> Tel: +44 (0) 118 378 7535
>>>>> http://rutgervos.blogspot.com
>>>>>
>>>>> _______________________________________________
>>>>> Treebase-devel mailing list
>>>>> Tre...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------
>>>> Roderic Page
>>>> Professor of Taxonomy
>>>> Institute of Biodiversity, Animal Health and Comparative Medicine
>>>> College of Medical, Veterinary and Life Sciences
>>>> Graham Kerr Building
>>>> University of Glasgow
>>>> Glasgow G12 8QQ, UK
>>>>
>>>> Email: r....@bi...
>>>> Tel: +44 141 330 4778
>>>> Fax: +44 141 330 2792
>>>> AIM: rod...@ai...
>>>> Facebook: http://www.facebook.com/profile.php?id=1112517192
>>>> Twitter: http://twitter.com/rdmpage
>>>> Blog: http://iphylo.blogspot.com
>>>> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
>>>>
>>>
>>> -------
>>> Arlin Stoltzfus (ar...@um...)
>>> Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
>>> IBBR, 9600 Gudelsky Drive, Rockville, MD
>>> tel: 240 314 6208; web: www.molevol.org
>>>
>>
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
> ===========================================================
>
>
>
>
> -- 
> You received this message because you are subscribed to the Google
> Groups "MIAPA" group.
> For more options, visit this group at
> http://groups.google.com/group/miapa-discuss?hl=en

-------
Arlin Stoltzfus (ar...@um...)
Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
IBBR, 9600 Gudelsky Drive, Rockville, MD
tel: 240 314 6208; web: www.molevol.org