Name | Modified | Size | Downloads / Week |
---|---|---|---|
termometer_suite_080914_win32_binonly.7z | 2014-09-08 | 8.6 MB | |
CHANGELOG | 2014-09-08 | 2.2 kB | |
termometer_suite_080914.tar.gz | 2014-09-08 | 259.3 kB | |
README | 2011-12-12 | 5.0 kB | |
Totals: 4 Items | | 8.9 MB | 0 |
Termometer - find closest terms between a list of terms and a gold standard
Copyright (C) 2011 LIPN, Université Paris 13
Author : Thibault Mondary (LIPN), based on an original version from J.Van Puymbrouck (LIPN)

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.

Termometer suite is composed of two executables: termometer, which computes terminological distances, and clustering, which clusters the output of termometer according to a threshold.

*************************************************
To build termometer_suite, you need Qt >= 4.6
Simply run :
- qmake
- make
and you should obtain two executables.
To speed up the build, you can try parallel compilation with "make -jN", where N is your number of CPU cores.

*************************************************
Usage :

Let's start with a reference terminology (R) and a list of terms (T).
We want to obtain the terminological precision and terminological recall of T with respect to R.

IT IS ASSUMED THAT T AND R ARE UTF-8 FILES! OTHERWISE NO ERROR WILL BE THROWN, BUT RESULTS MAY BE WRONG!

1/ Compute terminological distances between T and R, and find for each term of T its closest reference :
- ./termometer R.txt T.txt output_dist.txt
Note that the estimated computation time isn't very reliable.
2/ Optionally, determine the threshold for the clustering, using the variability of R :
- ./termometer R.txt R.txt self_R.txt
- Look at "self_R.txt", 4th line, PROPOSED GLOBAL THRESHOLD=0.xxxxx

3/ Do the clustering using the determined threshold, or a custom threshold. A threshold of 0 means no adaptation at all (except normalization) :
- ./clustering R.txt output_dist.txt output_clusters.txt 0.6578

About normalization and adaptation

Termometer normalizes terms before computing terminological distances, with respect to a hard-coded list of separators. The main idea is to identify separators that are not semantic elements of a compound term. Currently, the non-semantic separators are the space, the tab (\t) and "-".

The normalization process consists of the following steps :
- remove duplicate terms
- convert every term to lowercase
- convert every non-semantic separator to a space
- remove leading and trailing spaces
- remove duplicate spaces

The list of separators is coded in "termlist.cpp", at the top of the private function "normalize".

Internally, termometer computes closest terms using normalized forms, but prints the original forms into the output file. That is why the output file may contain distances like this :
A compound-TERM 0.000 a compound term

The clustering module uses normalized forms to compute T-P/R/FM and strict P/R/FM. This means that the lengths of the term list and of the reference may be lower than their initial lengths.
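As an illustration, the normalization steps listed above can be sketched in Python as follows. This is not the code from "termlist.cpp" (which is C++): the names NON_SEMANTIC_SEPARATORS, normalize_term and normalize_terms are hypothetical, and removing duplicates on the normalized form (rather than on the original form) is an assumption.

```python
# Separators the README lists as non-semantic: space, tab and "-".
NON_SEMANTIC_SEPARATORS = (" ", "\t", "-")


def normalize_term(term):
    """Lowercase a term, turn separators into spaces, trim and collapse spaces."""
    normalized = term.lower()
    for separator in NON_SEMANTIC_SEPARATORS:
        normalized = normalized.replace(separator, " ")
    # Dropping empty pieces removes leading, trailing and duplicate spaces.
    return " ".join(piece for piece in normalized.split(" ") if piece)


def normalize_terms(terms):
    """Normalize every term, keeping the first occurrence of each normalized form.

    Assumption: duplicates are detected on the normalized form.
    """
    seen = set()
    result = []
    for term in terms:
        normalized = normalize_term(term)
        if normalized and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


print(normalize_term("A compound-TERM"))  # a compound term
```

With this sketch, "A compound-TERM" and "a compound term" share the same normalized form, which matches the distance of 0.000 shown above.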
For example, let's take a reference composed of :
thermo-nuclear supernovae
thermonuclear supernovae
supernovae

and a termlist composed of :
thermonuclear super-novae
thermo nuclear supernovae
supernova

The output of termometer will be :
TERM DISTANCE CLOSEST REFERENCE
thermonuclear super-novae 0.27 thermonuclear supernovae
thermo nuclear supernovae 0 thermo-nuclear supernovae
supernova 0.1 supernovae

and the output of the clustering module at threshold 0 :
Reference : thermo-nuclear supernovae
thermo nuclear supernovae 0
T-precision = 1/3 = 0.333333
T-recall = 1/3 = 0.333333
T-F-measure = 0.333333
Strict precision = 1/3 = 0.333333
Strict recall = 1/3 = 0.333333
Strict F-measure = 0.333333

and at threshold 0.3 :
Reference : thermo-nuclear supernovae
thermo nuclear supernovae 0
Reference : thermonuclear supernovae
thermonuclear super-novae 0.27
Reference : supernovae
supernova 0.1
T-precision = 2.63/3 = 0.876667
T-recall = 2.63/3 = 0.876667
T-F-measure = 0.876667
Strict precision = 1/3 = 0.333333
Strict recall = 1/3 = 0.333333
Strict F-measure = 0.333333

******************************************************************************************
Parsing results

Four Perl scripts are provided to help parse the results of both Termometer and its clustering module.
- parse_clusters.pl transforms a list of clustering results into CSV (useful when many systems are tested)
- extract_unmatched_terms.pl takes as input a cluster file and extracts the terms that were above the fixed threshold
- find_above_threshold.pl and find_beyond_threshold_but_not_zero.pl take as input a Termometer output file and filter it with a given threshold, without clustering.

T.Mondary
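A note on the figures in the worked example: the T-precision values are consistent with summing (1 - distance) over the terms whose distance does not exceed the threshold, then dividing by the number of terms. At threshold 0.3 this gives (1 - 0.27) + (1 - 0) + (1 - 0.1) = 2.63, and at threshold 0 only the exact match counts, giving 1. The Python sketch below reproduces that arithmetic. The formula is inferred from the example, not taken from the clustering source code, and the function names are hypothetical. T-recall is not sketched because the example does not show how it is computed when several terms share a reference.

```python
def t_precision(distances, threshold):
    """Inferred T-precision: sum of (1 - distance) over matched terms / number of terms.

    `distances` holds, for each term of T, the distance to its closest reference.
    Assumption: a term is matched when its distance is <= threshold.
    """
    if not distances:
        return 0.0
    matched = [d for d in distances if d <= threshold]
    return sum(1.0 - d for d in matched) / len(distances)


def strict_precision(distances):
    """Inferred strict precision: share of terms at distance 0 from a reference."""
    if not distances:
        return 0.0
    return sum(1 for d in distances if d == 0) / len(distances)


# Distances from the example: 0.27, 0 and 0.1.
example = [0.27, 0.0, 0.1]
print(round(t_precision(example, 0.3), 6))  # 0.876667
print(round(t_precision(example, 0.0), 6))  # 0.333333
print(round(strict_precision(example), 6))  # 0.333333
```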