From: Arlin S. <ar...@um...> - 2011-11-04 17:53:58
I added a link to Linnaeus (see last slide) in my PowerPoint
presentation of this problem:
http://dl.dropbox.com/u/7727158/name_matching.pptx
Let me know if there are other resources that I should note. That way
we won't lose the knowledge that we accumulated in this discussion
thread.
Arlin
On Aug 27, 2011, at 5:28 AM, Hilmar Lapp wrote:
> I spoke with the developer, Martin Gerner. He thought it could be
> well suited to this task, even though the tool does a lot more
> than we probably need here. For example, it tokenizes the input, and
> is also capable of applying some special "inference" rules (for
> instance, "HeLa cells" will be tagged with "Homo sapiens") that are
> quite useful if the purpose is linking text to knowledge terms,
> but that go beyond simple synonym matching (which it also does).
> The dictionaries are pluggable, and apparently it is quite fast in
> principle.
>
> -hilmar
>
> On Aug 22, 2011, at 6:39 PM, Mark Holder wrote:
>
>> Hi all,
>> I just noticed that Hilmar tweeted a link to Linnaeus: http://linnaeus.sourceforge.net/
>> which seems relevant to this thread.
>>
>> all the best,
>> Mark
>>
>> On Aug 19, 2011, at 11:06 AM, Arlin Stoltzfus wrote:
>>
>>> On Aug 15, 2011, at 4:09 AM, Roderic Page wrote:
>>>
>>>> Mapping tree names to matrix names could be formulated as a
>>>> bipartite matching problem, where we have two lists of names and
>>>> want to find the best matching. See http://iphylo.blogspot.com/2007/09/matching-names-in-phylogeny-data-files.html
>>>> for more details.
>>>
>>> In computer science, this is usually called the "assignment
>>> problem" (informally, a "marriage problem") when the two lists
>>> are the same size. We have a set { X } and a set { Y } of
>>> elements with some properties. We have a function f( X_i, Y_j )
>>> that computes a match score for each pair, using the properties.
>>> In our case, the only property is the name-string. The problem is
>>> to find a pairwise mapping that is optimal in some way. If
>>> optimality means minimizing the cost of the worst match, then this
>>> is (apparently, to me) the same as the linear bottleneck
>>> assignment problem.
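
A minimal sketch of the bottleneck formulation described above, in
Python. It assumes equal-sized lists and a caller-supplied cost
function. The names bottleneck_assignment, perfect_matching and
positional_mismatch are made up for illustration, and
positional_mismatch is only a crude placeholder cost, not a
recommendation. The approach is to binary-search the cost threshold and
test each threshold for a perfect matching with augmenting paths.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def perfect_matching(allowed: List[List[bool]]) -> Optional[List[int]]:
    """Return row -> column indices of a perfect matching, or None.

    allowed[i][j] is True when row i may be paired with column j.
    Uses simple augmenting paths (Kuhn's algorithm).
    """
    n = len(allowed)
    col_owner = [-1] * n  # col_owner[j] = row currently holding column j

    def try_row(i: int, seen: List[bool]) -> bool:
        for j in range(n):
            if allowed[i][j] and not seen[j]:
                seen[j] = True
                if col_owner[j] == -1 or try_row(col_owner[j], seen):
                    col_owner[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, [False] * n):
            return None
    match_of_row = [0] * n
    for j, i in enumerate(col_owner):
        match_of_row[i] = j
    return match_of_row


def bottleneck_assignment(
    xs: Sequence[str], ys: Sequence[str], cost: Callable[[str, str], float]
) -> Tuple[float, List[Tuple[str, str]]]:
    """Pair every x with a distinct y so that the worst pair cost is minimal."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes lists of equal size")
    n = len(xs)
    if n == 0:
        return 0, []
    c = [[cost(x, y) for y in ys] for x in xs]
    thresholds = sorted({v for row in c for v in row})
    lo, hi = 0, len(thresholds) - 1
    best = None
    while lo <= hi:  # binary search for the smallest feasible threshold
        mid = (lo + hi) // 2
        t = thresholds[mid]
        m = perfect_matching([[v <= t for v in row] for row in c])
        if m is not None:
            best = (t, m)
            hi = mid - 1
        else:
            lo = mid + 1
    t, m = best  # the largest threshold always admits a perfect matching
    return t, [(xs[i], ys[m[i]]) for i in range(n)]


def positional_mismatch(a: str, b: str) -> int:
    """Placeholder cost: mismatched positions plus the length difference."""
    return sum(p != q for p, q in zip(a, b)) + abs(len(a) - len(b))


tree_names = ["Homo_sapiens", "Pan_troglodytes"]
matrix_names = ["Pan troglodytes", "Homo sapiens"]
worst, pairs = bottleneck_assignment(tree_names, matrix_names, positional_mismatch)
```

Any other cost function with the same signature can be swapped in for
positional_mismatch. If the goal were to minimize the total cost rather
than the worst pair, that would be the ordinary linear assignment
problem, which needs a different solver.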
>>>
>>> An obvious function to use (not necessarily the best for our case)
>>> is the edit distance, i.e., the minimum number of single-character
>>> edit operations (insertions, deletions, substitutions) needed to
>>> convert X_i into Y_j. This is called the Levenshtein distance
>>> (http://en.wikipedia.org/wiki/Levenshtein_distance).
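
For reference, the Levenshtein distance mentioned above can be sketched
with the standard dynamic-programming recurrence. This is a plain
textbook version in Python (the function name levenshtein is just a
label used here), keeping only two rows of the table at a time.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, free on a match
                )
            )
        prev = curr
    return prev[-1]
```

With this function, "Homo_sapiens" and "Homo sapiens" are at distance 1,
the same as any other single-character difference, which is why a
domain-specific cost may be worth testing.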
>>>
>>> But there is nothing to stop us from creating a distance function
>>> that is optimized to work well in phyloinformatics. We could test
>>> different functions using real cases such as the ones in my
>>> slideshow.
>>>
>>> One special condition is that, for us, the cost of an
>>> s/<underscore>/<space>/ edit is very low. Another special
>>> condition is reflected in Rod's longest-common-substring method of
>>> matching: we often have pairs of matching names that have long
>>> matching substrings and differ by interruptions. Maybe we need
>>> gap-open and gap-extend penalties like in sequence alignment
>>> algorithms.
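
One way the two special conditions above could be combined is a global
alignment cost with affine gaps (the Gotoh recurrence from sequence
alignment) and a cheap underscore/space substitution. The sketch below
is in Python. All the numbers (0.125 for underscore/space, 1.0 for
other substitutions, gap_open=2.0, gap_extend=0.25) and the names
name_distance and substitution_cost are assumptions chosen for
illustration, not tuned or tested values. A gap of length k is charged
gap_open + (k - 1) * gap_extend.

```python
INF = float("inf")


def substitution_cost(ca: str, cb: str) -> float:
    """Cost of aligning character ca with cb (illustrative values)."""
    if ca == cb:
        return 0.0
    if {ca, cb} == {"_", " "}:
        return 0.125  # underscore <-> space is nearly free
    return 1.0


def name_distance(a: str, b: str, gap_open: float = 2.0,
                  gap_extend: float = 0.25) -> float:
    """Global alignment cost between two names with affine gap penalties."""
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] aligned to b[j-1]
    # X: alignment ends with a[i-1] aligned to a gap (deletion from a)
    # Y: alignment ends with b[j-1] aligned to a gap (insertion into a)
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = substitution_cost(a[i - 1], b[j - 1]) + min(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            X[i][j] = min(
                M[i - 1][j] + gap_open,    # open a new gap
                Y[i - 1][j] + gap_open,
                X[i - 1][j] + gap_extend,  # extend the current gap
            )
            Y[i][j] = min(
                M[i][j - 1] + gap_open,
                X[i][j - 1] + gap_open,
                Y[i][j - 1] + gap_extend,
            )
    return min(M[n][m], X[n][m], Y[n][m])
```

Under these assumed costs, "Homo_sapiens" vs "Homo sapiens" scores
0.125, and a name with one long interruption or suffix (for example an
appended accession-like string) is charged for a single gap rather than
for each extra character. Whether this behaves better than plain edit
distance would have to be checked against real cases.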
>>>
>>> Arlin
>>>
>>>> This approach could be extended to, say, matching names in a NEXUS
>>>> file to those in a publication, or a GenBank POPSET from a
>>>> publication. For example, if we have a NEXUS file and a POPSET we
>>>> could compute the best matching between the two sets of names. Or
>>>> taxon names and/or accession numbers could be retrieved from the
>>>> publication.
>>>>
>>>> This would also help provide the context to help avoid homonyms,
>>>> such as matching animal names to plant names.
>>>>
>>>> Regards
>>>>
>>>> Rod
>>>>
>>>>
>>>> On 15 Aug 2011, at 05:13, Rutger Vos wrote:
>>>>
>>>>>> this calls for easy-to-use NeXML editors. e.g. add the ability to
>>>>>> enter Genbank accession numbers in Mesquite, and then save as
>>>>>> NeXML, thus preserving "Homo_sapiens" consistently in all
>>>>>> alignments and resulting trees, while still communicating the
>>>>>> respective accession numbers for each locus. Summer-of-Code
>>>>>> project here.
>>>>>
>>>>> Indeed.
>>>>>
>>>>>> C- The basic data model of matrix-rows-matching-with-tree-OTUs
>>>>>> works for 99% of datasets, but a growing number of studies use
>>>>>> BEAST species inference (and other similar methods) where the
>>>>>> tree ends in species OTUs, but the alignment has many more
>>>>>> haplotype OTUs. -- i.e. there is, on purpose, a complete mismatch
>>>>>> between alignment row labels and tree OTUs. Mesquite can handle
>>>>>> this using a taxon association table, though I don't know that
>>>>>> this is formal NEXUS or just a Mesquite invention. I don't think
>>>>>> that NeXML or PhyloML can handle this. This calls for expanding
>>>>>> the capabilities of NeXML and PhyloML.
>>>>>
>>>>> Yes and no. Multiple matrix rows can reference the same otu, but
>>>>> that's not quite what we want. Multiple, separately annotatable
>>>>> matrix row segments would be a good feature to have, also for
>>>>> TreeBASE's needs.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Dr. Rutger A. Vos
>>>>> School of Biological Sciences
>>>>> Philip Lyle Building, Level 4
>>>>> University of Reading
>>>>> Reading, RG6 6BX, United Kingdom
>>>>> Tel: +44 (0) 118 378 7535
>>>>> http://rutgervos.blogspot.com
>>>>>
>>>>> _______________________________________________
>>>>> Treebase-devel mailing list
>>>>> Tre...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/treebase-devel
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------
>>>> Roderic Page
>>>> Professor of Taxonomy
>>>> Institute of Biodiversity, Animal Health and Comparative Medicine
>>>> College of Medical, Veterinary and Life Sciences
>>>> Graham Kerr Building
>>>> University of Glasgow
>>>> Glasgow G12 8QQ, UK
>>>>
>>>> Email: r....@bi...
>>>> Tel: +44 141 330 4778
>>>> Fax: +44 141 330 2792
>>>> AIM: rod...@ai...
>>>> Facebook: http://www.facebook.com/profile.php?id=1112517192
>>>> Twitter: http://twitter.com/rdmpage
>>>> Blog: http://iphylo.blogspot.com
>>>> Home page: http://taxonomy.zoology.gla.ac.uk/rod/rod.html
>>>>
>>>
>>>
>>
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org :
> ===========================================================
>
>
>
>
> --
> You received this message because you are subscribed to the Google
> Groups "MIAPA" group.
> For more options, visit this group at
> http://groups.google.com/group/miapa-discuss?hl=en
-------
Arlin Stoltzfus (ar...@um...)
Fellow, IBBR; Adj. Assoc. Prof., UMCP; Research Biologist, NIST
IBBR, 9600 Gudelsky Drive, Rockville, MD
tel: 240 314 6208; web: www.molevol.org