Colorado Computational Pharmacology, University of Colorado School of Medicine October 22, 2013
The BioLemmatizer is a lemmatization tool for the morphological analysis of
biomedical literature. It is tailored to the biological domain through
integration of several published lexical resources related to molecular
biology. It focuses on the inflectional morphology of English, including the
plural form of nouns, the conjugations of verbs, and the comparative and
superlative form of adjectives and adverbs. The BioLemmatizer retrieves lemmas
based on the use of a lexicon that covers an exhaustive list of inflected word
forms and their corresponding lemmas in both general English and the biomedical
domain, as well as a set of rules that generalize morphological transformations
to heuristically handle words that are not encountered in the lexicon.
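
The lexicon-first, rules-second strategy described above can be sketched in a few lines of Python. This is only a toy illustration, not the BioLemmatizer implementation: the tiny LEXICON, the DETACHMENT_RULES table and the function name lemmatize are all assumptions made for this example, and the real tool uses a far larger lexicon and rule set.

```python
# Toy illustration only: the entries and rules below are hypothetical and
# are not taken from the BioLemmatizer lexicon or rule set.
LEXICON = {
    ("mice", "NNS"): "mouse",
    ("catalyses", "NNS"): "catalysis",
}

# (suffix to detach, replacement) pairs per Penn Treebank tag, tried in order.
DETACHMENT_RULES = {
    "NNS": [("ies", "y"), ("s", "")],
    "VBD": [("ed", "e")],
}


def lemmatize(form, pos):
    """Return a lemma: lexicon lookup first, then a rule-based fallback."""
    key = (form.lower(), pos)
    if key in LEXICON:
        return LEXICON[key]
    for suffix, replacement in DETACHMENT_RULES.get(pos, []):
        if form.endswith(suffix):
            return form[: len(form) - len(suffix)] + replacement
    return form  # no rule applies; return the form unchanged
```

Rules this naive overgenerate (the single VBD rule here would turn "showed" into "showe"), which is why a rule-based fallback is normally combined with a large lexicon.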
This directory contains the software developed by Haibin Liu
<Haibin.Liu@ucdenver.edu>, William A Baumgartner Jr <William.Baumgartner@ucdenver.edu>
and Karin Verspoor <Karin.Verspoor@ucdenver.edu>. The BioLemmatizer is developed
in Java and is released as open source software to the NLP and text mining
research communities to be used for research purposes only (see section 9 below
for copyright information). It can be downloaded from http://biolemmatizer.sourceforge.net.
If you make any changes, the authors would appreciate it if you could send them the
details of what you have done. A Perl module for the BioLemmatizer,
Lingua::EN::BioLemmatizer, was developed by Tom Christiansen <tchrist@perl.com> and
is available on CPAN at http://search.cpan.org/perldoc?Lingua::EN::BioLemmatizer
Note: The BioLemmatizer code requires Java version 6 or greater.
***** What's New in This Version *****
BioLemmatizer 1.2 adds an optional feature that normalizes British English spellings
into American English spellings before retrieving the corresponding lemmas. For instance,
with this option the lemma of "haemangioblastoma" is "hemangioblastoma". The normalization
is based on a mapping list and some deterministic rules. It allows the tool to handle
surface variants of words better and further reduces the complexity of the analyzed text.
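
As a rough sketch of how a mapping list combined with deterministic rules can normalize spellings, consider the Python fragment below. The mapping entries, the two rewrite rules and the function name americanize are hypothetical and chosen only to reproduce the examples in this README; they are not the list or rules shipped with the BioLemmatizer.

```python
import re

# Hypothetical mapping list: whole-word British -> American spellings.
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "tumour": "tumor",
}

# Hypothetical deterministic rules, applied when the mapping has no entry.
SPELLING_RULES = [
    (re.compile(r"haem"), "hem"),
    (re.compile(r"phaeo"), "pheo"),
]


def americanize(word):
    """Normalize a word using the mapping list first, then the rules."""
    if word in BRITISH_TO_AMERICAN:
        return BRITISH_TO_AMERICAN[word]
    for pattern, replacement in SPELLING_RULES:
        word = pattern.sub(replacement, word)
    return word
```

Checking the mapping before the rules lets irregular cases be listed explicitly while the rules cover productive patterns.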
1. Files and Folders
---------------------
README.txt                            this file
biolemmatizer-1.2.tar.gz              the source code, resources, and license for the BioLemmatizer
biolemmatizer-core-1.2-jar-with-dependencies.jar
                                      Jar file for the biolemmatizer-core module, including all
                                      required dependencies
lexicon.lex.gz                        contains the full lexicon used by the BioLemmatizer
biolemmatizer-eval-datasets.tar.gz    contains all the experimental datasets (CRAFT, OED, LLL),
                                      and the gold and silver annotations used for testing the
                                      BioLemmatizer (see section 8 for detailed description)
2. Usage
--------
Set the MAVEN_OPTS environment variable to provide the JVM enough memory to load
the lexicon file (this command only needs to be executed once):

    export MAVEN_OPTS="-Xmx1G"

Lemmatize one single input string:

    mvn -f biolemmatizer-core/pom.xml exec:java -Dexec.mainClass="edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer" -Dexec.args="<input string> [POS tag]"

Lemmatize input strings in a file, output lemmas to a different file:

    mvn -f biolemmatizer-core/pom.xml exec:java -Dexec.mainClass="edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer" -Dexec.args="-i <input file name> -o <output file name>"

Run the BioLemmatizer in interactive mode, i.e. lemmatize input strings from standard input (exit when an empty line is used as input):

    mvn -f biolemmatizer-core/pom.xml exec:java -Dexec.mainClass="edu.ucdenver.ccp.nlp.biolemmatizer.BioLemmatizer" -Dexec.args="-t"

Input parameter descriptions:
    -f VAL  : optional path to a lexicon file. If not set, the default lexicon
              available on the classpath is used
    -l      : By default, the BioLemmatizer output contains the resulting
              lemma, the POS tag of the input string and the tagset name of the POS tag.
              The option -l returns only the lemma and ignores other information.
    -a      : to invoke the Americanization process that normalizes common British English
              spellings into American English spellings, and retrieves corresponding lemmas.
              This is achieved based on a mapping list and some deterministic rules.
              For instance: the lemma of "haemangioblastoma" will be "hemangioblastoma".
    POS tag : The POS tag associated with the input string.
              It is optional and is expected to follow the Penn Treebank tagset.
    -i VAL  : to specify the input file name
    -o VAL  : to specify the output file name
    -t      : to invoke the interactive mode. With this mode, the BioLemmatizer can be easily
              integrated into applications written in other languages, such as Perl. To exit
              the interactive mode enter a blank line.
See the following sections for specifications of input and output formats, and examples of usage.
3. BioLemmatizer Input Specification
-------------------------------------
The BioLemmatizer can be run to lemmatize a single input string or a batch
of strings submitted in an input file.
Character encoding for all input is assumed to be UTF-8.
(a) Each input token is expected to be of the form <input string> [POS tag]. For example:
    roles NNS
or
    quantitated VBD
The POS tag associated with the input string is expected to follow the widely
used Penn Treebank tagset. The POS information is optional. When it is not
given in the input, the BioLemmatizer returns lemmas for all possible parts
of speech, in terms of both POS tagsets (NUPOS and Penn Treebank tagsets)
represented in the lexicon. Our assumption is that without knowing the word
context, the lemmatizer should return all possible lemmas and allow the user
or calling application to resolve the ambiguities.
(b) Each input file is expected to be in the lemmatization format with or without blank lines.
The lemmatization format requires 2 fields:
* FORM: input string
* POSTAG: POS tag
Each field is delimited by a tab character ('\t'). Each sentence is delimited
by a blank line. The POS tag is expected to follow the Penn Treebank tagset.
Likewise, the POS information is optional. For example:
    Bmp7        NN
    knockout    NN
    mice        NNS
    do          VBP
    not         RB
    show        VB
    any         DT
    defect      NN
    in          IN
    limb        NN
    polarity    NN
    .           .

    Bmp2        NN
    mutant      NN
    embryos     NNS
    die         VBP
    too         RB
    early       RB
    to          TO
    assess      VB
    their       PRP$
    limb        NN
    phenotypes  NNS
    .           .
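
For readers who want to prepare or check input files programmatically, the following Python sketch reads the lemmatization format described above. The function name read_sentences is hypothetical; the only assumptions are the ones stated in this section (tab-delimited fields, blank lines between sentences, optional POS tag).

```python
def read_sentences(lines):
    """Group FORM<TAB>POSTAG lines into sentences of (form, pos) pairs.

    Blank lines end a sentence. A missing POS tag is returned as None.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        form = fields[0]
        pos = fields[1] if len(fields) > 1 and fields[1] else None
        current.append((form, pos))
    if current:
        sentences.append(current)
    return sentences
```

Because the function accepts any iterable of lines, it can be applied directly to a file opened with encoding="utf-8", matching the encoding assumed for all input.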
4. BioLemmatizer Output Specification
--------------------------------------
By default, the BioLemmatizer output consists of the resulting lemma, the POS
tag of the input string and the tagset name of the POS tag. For example, for
the input "quantitated VBD", the BioLemmatizer produces "quantitate VBD
PennPOS". If the POS information is not provided in the input, the
BioLemmatizer returns lemmas for all possible parts of speech across all POS
tagsets, separated by a separator "||". For example, for the input
"diminished", the output is "diminish VBD PennPOS||diminished JJ PennPOS".
BioLemmatizer output is encoded using UTF-8.
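
A calling application usually needs to split this output back into its parts. The Python sketch below does so under two assumptions that go beyond what this README states: that the three fields of each entry are separated by whitespace, and that lemmas themselves contain no whitespace. The function names parse_output and distinct_lemmas are hypothetical.

```python
def parse_output(output):
    """Split default output into a list of (lemma, pos, tagset) tuples."""
    entries = []
    for chunk in output.strip().split("||"):
        # Assumes exactly three whitespace-separated fields per entry.
        lemma, pos, tagset = chunk.split()
        entries.append((lemma, pos, tagset))
    return entries


def distinct_lemmas(output):
    """Collapse the entries to their distinct lemmas, in first-seen order."""
    seen = []
    for lemma, _pos, _tagset in parse_output(output):
        if lemma not in seen:
            seen.append(lemma)
    return "||".join(seen)
```

For the "diminished" example above, distinct_lemmas yields "diminish||diminished", which is the same string the README reports for the lemma-only option -l.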
The option -l is provided to have the BioLemmatizer return only the lemma in
the output. With the option -l, the output for the above examples would be
"quantitate" and "diminish||diminished".
If the input is a file, the resulting lemma is inserted as a new field in the
output file, delimited by a tab character ('\t'). For example:
    Bmp7        NN    Bmp7
    knockout    NN    knockout
    mice        NNS   mouse
    do          VBP   do
    not         RB    not
    show        VB    show
    any         DT    any
    defect      NN    defect
    in          IN    in
    limb        NN    limb
    polarity    NN    polarity
    .           .     .

    Bmp2        NN    Bmp2
    mutant      NN    mutant
    embryos     NNS   embryo
    die         VBP   die
    too         RB    too
    early       RB    early
    to          TO    to
    assess      VB    assess
    their       PRP$  their
    limb        NN    limb
    phenotypes  NNS   phenotype
    .           .     .
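
One common use of such an output file is to rebuild each sentence as a sequence of lemmas. The Python sketch below does this, assuming the lemma is the last tab-delimited field on each line and that blank lines separate sentences, as in the example above. The function name lemma_sentences is hypothetical.

```python
def lemma_sentences(lines):
    """Return one space-joined string of lemmas per sentence."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        # Assumption: the lemma is the last tab-delimited field.
        current.append(line.split("\t")[-1])
    if current:
        sentences.append(" ".join(current))
    return sentences
```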
5. Usage Examples (shown using executable jar available in biolemmatizer-core/target/ directory after the project is built)
---------------------------------------------------------------------------------------------------------------------------
(a) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar catalyses NNS
    => catalysis NNS PennPOS
(b) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l catalyses NNS
    => catalysis
(c) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar running
    => run vvg NUPOS||running JJ PennPOS||run j-vvg NUPOS||run n-vvg NUPOS||running NN PennPOS
(d) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l running
    => run||running
(e) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l phaeochromocytomata NNS
    => phaeochromocytoma NNS PennPOS
(f) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l -a phaeochromocytomata NNS
    => pheochromocytoma NNS PennPOS
(g) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -t
    running
    => run vvg NUPOS||running JJ PennPOS||run VBG PennPOS||run j-vvg NUPOS||run n-vvg NUPOS||running NN PennPOS
    catalyses NNS
    => catalysis NNS PennPOS
(h) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l -t
    running
    => run||running
    catalyses NNS
    => catalysis
(i) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -i inputfile -o outputfile
(j) java -Xmx1G -jar biolemmatizer-core-1.2-jar-with-dependencies.jar -l -i inputfile -o outputfile
See the sections "BioLemmatizer Input Specification" and "BioLemmatizer
Output Specification" above for the format of input and output files.
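
When the jar is called from a script, it can help to assemble the command line in one place. The Python sketch below builds an argument list from the options documented in section 2. The function name build_command and its parameter names are hypothetical, and the flag order simply mirrors the examples above.

```python
def build_command(jar, form=None, pos=None, lemma_only=False,
                  americanize=False, input_file=None, output_file=None,
                  interactive=False, heap="1G"):
    """Build an argument list for running the BioLemmatizer jar."""
    cmd = ["java", "-Xmx" + heap, "-jar", jar]
    if lemma_only:
        cmd.append("-l")
    if americanize:
        cmd.append("-a")
    if interactive:
        cmd.append("-t")
    elif input_file is not None:
        if output_file is None:
            raise ValueError("an output file is required with an input file")
        cmd += ["-i", input_file, "-o", output_file]
    else:
        if form is None:
            raise ValueError("an input string, a file, or interactive mode is required")
        cmd.append(form)
        if pos:
            cmd.append(pos)
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True) from the Python standard library. Passing a list rather than a single shell string avoids quoting problems with tags such as PRP$.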
6. Lexical data from the BioLexicon
----------------------------------------------------
The BioLemmatizer integrates lexical resources from three sources: MorphAdorner,
the GENIA tagger and the BioLexicon database. Since the BioLexicon morphological
data used in the BioLemmatizer is included in the publicly available part of the
data in the BioLexicon (EBI term repository), we are able to redistribute it in
the public release of the full version of the BioLemmatizer. For the original
morphological data in the BioLexicon database, please refer to the following
BioLexicon publication and the download link of the freely available data in the
BioLexicon.
Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S,
Monachini M, Pezik P, Quochi V, Rupp C, Sasaki Y, Venturi G, Rebholz-Schuhmann D,
Ananiadou S: The BioLexicon: a large-scale terminological resource for biomedical
text mining. BMC Bioinformatics 2011, 12:397.
Download link of the EBI term repository of the BioLexicon:
http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html
ELRA link of the full version of the BioLexicon
http://catalog.elra.info/product_info.php?products_id=1113
7. Performance comparison with/without BioLexicon data
-------------------------------------------------------
Please refer to the following publication for more
detailed performance comparison.
Haibin Liu, Tom Christiansen, William A Baumgartner Jr, and Karin Verspoor
BioLemmatizer: a lemmatization tool for morphological processing of biomedical text
Journal of Biomedical Semantics 2012, 3:3.
After the experiments reported in the publication, we collected all false positive
lemmas we encountered, and we have fixed nearly all of them, either by adding an
entry to the BioLemmatizer lexicon or by modifying the rules of detachment, in some
cases adding the lexicon validation constraint.
Here we provide the lemmatization results on three of our evaluation datasets to
highlight the performance difference between the BioLemmatizer with and without
the BioLexicon data. For comparison, each table also lists the tool that achieved
the second best performance among the 9 lemmatizers we tested.

Evaluation on silver consensus set of CRAFT
                            Recall               Precision            F-score
ExcludeBioLexicon           99.56% (5836/5862)   99.56% (5836/5862)   99.56%
IncludeBioLexicon           100% (5862/5862)     100% (5862/5862)     100%
Second best (morpha tool)   100% (5862/5862)     100% (5862/5862)     100%

Evaluation on gold difference set of CRAFT
                            Recall               Precision            F-score
ExcludeBioLexicon           94.30% (546/579)     94.30% (546/579)     94.30%
IncludeBioLexicon           99.65% (577/579)     99.65% (577/579)     99.65%
Second best (MorphAdorner)  81.87% (474/579)     82.29% (474/576)     82.08%

Evaluation on gold OED set
                            Recall               Precision            F-score
ExcludeBioLexicon           82.55% (667/808)     82.55% (667/808)     82.55%
IncludeBioLexicon           84.65% (684/808)     84.65% (684/808)     84.65%
Second best (morpha tool)   75.74% (612/808)     75.74% (612/808)     75.74%

Currently, for the performance on biomedical text (the CRAFT set), the
overall lemmatization accuracy of the public release of BioLemmatizer is 99.9%
(the full version of BioLemmatizer, including the BioLexicon data). The version
of the BioLexicon database used in our experiments is: Version of May 22nd, 2009.
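
The figures in the tables above can be reproduced from the raw counts. The Python sketch below reads the recall denominator as the number of gold instances and the precision denominator as the number of instances for which a tool returned a lemma (an interpretation, not something this README states) and computes the F-score as the harmonic mean of the two. The function name prf is hypothetical.

```python
def prf(correct, gold_total, returned_total):
    """Return (recall, precision, f_score) as fractions between 0 and 1."""
    recall = correct / gold_total
    precision = correct / returned_total
    if precision + recall == 0:
        return recall, precision, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


# Second-best row of the CRAFT gold difference set: 474/579 and 474/576.
r, p, f = prf(474, 579, 576)
print(f"{r:.2%} {p:.2%} {f:.2%}")  # prints "81.87% 82.29% 82.08%"
```

When recall and precision share the same denominator, as in every other row, the F-score equals both of them.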
8. Description of contents of biolemmatizer-eval-datasets.tar.gz
----------------------------------------------------------------
CRAFT_development_data    subset of the CRAFT corpus, containing 7 full-text articles
CRAFT_consensus_silver    consensus set of CRAFT_development_data (excluding adverbs),
                          representing agreement among 6 lemmatizers, to form a
                          "silver lemma standard"
CRAFT_difference_gold     gold lemma annotation of the set of disagreements among 9 lemmatizers
OED_gold                  gold lemma annotation of the OED (Oxford English Dictionary) set
LLL_gold                  gold lemma annotation of the LLL05 set, curated with automatically
                          generated POS information
LLL_gold_updated          LLL_gold with fixed annotation on incorrect or inconsistent
                          instances and task-specific normalizations
9. Copyright and License
------------------------------------
The software is released under the New BSD license
(http://www.opensource.org/licenses/bsd-license.php).
Copyright (c) 2012, Regents of the University of Colorado
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the University of Colorado nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Any documentation, advertising materials, publications and other materials
related to redistribution and use must acknowledge that the software was
developed by Haibin Liu <Haibin.Liu@ucdenver.edu>, William A Baumgartner Jr
<William.Baumgartner@ucdenver.edu> and Karin Verspoor <Karin.Verspoor@ucdenver.edu>
and must refer to the following publication:
Haibin Liu, Tom Christiansen, William A Baumgartner Jr, and Karin Verspoor
BioLemmatizer: a lemmatization tool for morphological processing of biomedical text
Journal of Biomedical Semantics 2012, 3:3.
10. Incorporated software and resources
---------------------------------------
This software incorporates the MorphAdorner software (http://morphadorner.northwestern.edu/),
lexical resources from the BioLexicon database (http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html)
and the GENIA Tagger (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/).
We redistribute this software and these resources here.
MorphAdorner license:
The MorphAdorner source code and data files fall under the following NCSA style license.
Some of the incorporated code and data fall under different licenses as noted in the
section third-party licenses below.
Copyright (c) 2006-2009 by Northwestern University.
All rights reserved.
Developed by:
Academic and Research Technologies
Northwestern University
http://www.it.northwestern.edu/about/departments/at/
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
associated documentation files (the "Software"), to deal with the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the
following conditions:
1. Redistributions of source code must retain the above copyright notice, this list of conditions and
the following disclaimers.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions
and the following disclaimers in the documentation and/or other materials provided with the
distribution.
3. Neither the names of Academic and Research Technologies, Northwestern University, nor the
names of its contributors may be used to endorse or promote products derived from this
Software without specific prior written permission.
BioLexicon database citation:
Thompson P, McNaught J, Montemagni S, Calzolari N, del Gratta R, Lee V, Marchi S,
Monachini M, Pezik P, Quochi V, Rupp C, Sasaki Y, Venturi G, Rebholz-Schuhmann D,
Ananiadou S: The BioLexicon: a large-scale terminological resource for biomedical
text mining. BMC Bioinformatics 2011, 12:397.
ELRA link of the full version of the BioLexicon
http://catalog.elra.info/product_info.php?products_id=1113
Download link of the EBI term repository of the BioLexicon:
http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html
GENIA Tagger License
Copyright (c) 2005, Tsujii Laboratory, The University of Tokyo
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted for non-commercial purposes provided
that the following conditions are met:
- Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Because the GENIA Tagger uses a WordNet dictionary for morphological analysis,
the corresponding WordNet license is also included here.
WordNet Release 2.1
This software and database is being provided to you, the LICENSEE, by
Princeton University under the following license. By obtaining, using
and/or copying this software and database, you agree that you have
read, understood, and will comply with these terms and conditions.:
Permission to use, copy, modify and distribute this software and
database and its documentation for any purpose and without fee or
royalty is hereby granted, provided that you agree to comply with
the following copyright notice and statements, including the disclaimer,
and that the same appear on ALL copies of the software, database and
documentation, including modifications that you make for internal
use or for distribution.
WordNet 2.1 Copyright 2005 by Princeton University. All rights reserved.
THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
OTHER RIGHTS.
The name of Princeton University or Princeton may not be used in
advertising or publicity pertaining to distribution of the software
and/or database. Title to copyright in this software, database and
any associated documentation shall at all times remain with
Princeton University and LICENSEE agrees to preserve same.
11. Acknowledgements
------------------------------------
Many thanks to Professor Lawrence Hunter, Helen Johnson, Kevin B. Cohen,
and other members of the Colorado Computational Pharmacology group for
providing valuable effort and suggestions related to this work.