Re: [Treebase-devel] remedying mismatched OTU names


On Aug 15, 2011, at 4:09 AM, Roderic Page wrote:

> Mapping tree names to matrix names could be formulated as a
> bipartite matching problem, where we have two lists of names and
> want to find the best matching. See
> http://iphylo.blogspot.com/2007/09/matching-names-in-phylogeny-data-files.html
> for more details.

In computer science, this is called the "marriage problem" when the  
two lists are the same size.  We have a set { X } and a set { Y } of  
elements with some properties.  We have a function f( X_i, Y_j ) that  
computes a match score for each pair, using the properties.  In our  
case, the only property is the name-string.  The marriage problem is  
to find a pairwise mapping that is optimal in some way.  If
optimality means minimizing the cost of the worst match, then this is
(as far as I can tell) the same as the linear bottleneck assignment
problem.
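
To make the "minimize the worst match" idea concrete, here is a minimal
sketch in Python.  It simply tries every pairing, so it is only usable
for very small lists; real solvers for the bottleneck assignment problem
are far more efficient.  The names bottleneck_assignment and toy_cost
are made up for illustration, and toy_cost is a deliberately crude
stand-in for f( X_i, Y_j ).

```python
from itertools import permutations


def toy_cost(a, b):
    """Crude placeholder cost: mismatched positions plus length difference."""
    return sum(ca != cb for ca, cb in zip(a, b)) + abs(len(a) - len(b))


def bottleneck_assignment(xs, ys, cost):
    """Pair each x with a distinct y so that the worst pair cost is minimal.

    Brute force over all permutations (O(n!)), for illustration only.
    """
    if len(xs) != len(ys):
        raise ValueError("the two lists must be the same size")
    best_perm, best_worst = None, float("inf")
    for perm in permutations(range(len(ys))):
        # The "bottleneck" of a pairing is its single most expensive pair.
        worst = max((cost(x, ys[j]) for x, j in zip(xs, perm)), default=0)
        if worst < best_worst:
            best_perm, best_worst = perm, worst
    pairs = [(x, ys[j]) for x, j in zip(xs, best_perm)]
    return pairs, best_worst


tree_names = ["Homo_sapiens", "Pan_troglodytes"]
matrix_names = ["Pan troglodytes", "Homo sapiens"]
pairs, worst = bottleneck_assignment(tree_names, matrix_names, toy_cost)
```

Replacing max with sum in the inner loop would give the ordinary
(sum-of-costs) assignment problem instead of the bottleneck version.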

An obvious function to use (not necessarily the best for our case) is
the edit distance, i.e., the number of character-wise edit operations
needed to convert X_i into Y_j.  This is called the Levenshtein distance
(http://en.wikipedia.org/wiki/Levenshtein_distance).
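
For reference, the standard dynamic-programming computation of the
Levenshtein distance looks roughly like this in Python.  This is the
textbook recurrence with unit costs for insertion, deletion, and
substitution, not anything tuned for taxon names; the function name is
my own.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b (all edits cost 1)."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]
```

With unit costs, "Homo_sapiens" and "Homo sapiens" are at distance 1,
the same as any other single-character typo.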

But there is nothing to stop us from creating a distance function that  
is optimized to work well in phyloinformatics.  We could test  
different functions using real cases such as the ones in my slideshow.

One special condition is that, for us, the cost of an
s/<underscore>/<space>/ edit is very low.  Another special condition is
reflected in Rod's longest-common-substring method of matching: we often
have pairs of matching names that share long substrings and differ only
by interruptions.  Maybe we need gap-open and gap-extend penalties like
those in sequence alignment algorithms.
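
A rough sketch of what such a distance might look like, in Python.  It
uses the usual three-table affine-gap recurrence from sequence alignment
(Gotoh-style), with a cheap underscore/space substitution.  The function
names and all the cost values (0.1, 1.0, 0.2) are arbitrary assumptions
for illustration; they would need to be tuned against real cases.

```python
INF = float("inf")


def sub_cost(ca, cb):
    """Substitution cost; underscore <-> space is nearly free (assumed 0.1)."""
    if ca == cb:
        return 0.0
    if {ca, cb} == {"_", " "}:
        return 0.1
    return 1.0


def name_distance(a, b, gap_open=1.0, gap_extend=0.2):
    """Edit distance with affine gaps: a run of k inserted or deleted
    characters costs gap_open + (k - 1) * gap_extend."""
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column aligns a[i-1] with a gap (deletion from a)
    # Y: last column aligns b[j-1] with a gap (insertion into a)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i][j] = sub_cost(a[i - 1], b[j - 1]) + min(
                    M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            if i > 0:
                X[i][j] = min(M[i - 1][j] + gap_open,
                              X[i - 1][j] + gap_extend,
                              Y[i - 1][j] + gap_open)
            if j > 0:
                Y[i][j] = min(M[i][j - 1] + gap_open,
                              Y[i][j - 1] + gap_extend,
                              X[i][j - 1] + gap_open)
    return min(M[n][m], X[n][m], Y[n][m])
```

With these made-up costs, an underscore/space swap is much cheaper than
an ordinary typo, and one long interruption (e.g. an extra " sapiens")
is much cheaper than the same number of scattered edits.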

Arlin

> This approach could be extended to, say, matching names in a NEXUS file
> to those in a publication, or a GenBank POPSET from a publication.
> For example, if we have a NEXUS file and a POPSET we could compute
> the best matching between the two sets of names. Or taxon names
> and/or accession numbers could be retrieved from the publication.
>
> This would also help provide the context to help avoid homonyms,  
> such as matching animal names to plant names.
>
> Regards
>
> Rod
>
>
> On 15 Aug 2011, at 05:13, Rutger Vos wrote:
>
>>> this calls for easy-to-use NeXML editors. e.g. add the ability to enter
>>> Genbank accession numbers in Mesquite, and then save as NeXML, thus
>>> preserving "Homo_sapiens" consistently in all alignments and resulting
>>> trees, while still communicating the respective accession numbers for each
>>> locus. Summer-of-Code project here.
>>
>> Indeed.
>>
>>> C- The basic data model of matrix-rows-matching-with-tree-OTUs works for 99%
>>> of datasets, but a growing number of studies use BEAST species inference
>>> (and other similar methods) where the tree ends in species OTUs, but the
>>> alignment has many more haplotype OTUs. -- i.e. there is, on purpose, a
>>> complete mismatch between alignment row labels and tree OTUs. Mesquite can
>>> handle this using a taxon association table, though I don't know that this
>>> is formal NEXUS or just a Mesquite invention. I don't think that NeXML or
>>> PhyloML can handle this. This calls for expanding the capabilities of NeXML
>>> and PhyloML.
>>
>> Yes and no. Multiple matrix rows can reference the same otu, but
>> that's not quite what we want. Multiple, separately annotatable matrix
>> row segments would be a good feature to have, also for TreeBASE's
>> needs.
>>
>>
>>
>> -- 
>> Dr. Rutger A. Vos
>> School of Biological Sciences
>> Philip Lyle Building, Level 4
>> University of Reading
>> Reading, RG6 6BX, United Kingdom
>> Tel: +44 (0) 118 378 7535
>> http://rutgervos.blogspot.com
>>
>> _______________________________________________
>> Treebase-devel mailing list
>> Tre...@li...
>> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>>
>
>
> ---------------------------------------------------------
> Roderic Page
> Professor of Taxonomy
> Institute of Biodiversity, Animal Health and Comparative Medicine
> College of Medical, Veterinary and Life Sciences
> Graham Kerr Building
> University of Glasgow
> Glasgow G12 8QQ, UK
>
> Email: r....@bi...
> Tel: +44 141 330 4778
> Fax: +44 141 330 2792
> AIM: rod...@ai...
> Facebook: http://www.facebook.com/profile.php?id=1112517192
> Twitter: http://twitter.com/rdmpage
> Blog: http://iphylo.blogspot.com
> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
>

-------
Arlin Stoltzfus (ar...@um...)
Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
IBBR, 9600 Gudelsky Drive, Rockville, MD
tel: 240 314 6208; web: www.molevol.org