Name                                        Modified     Size
termometer_suite_080914_win32_binonly.7z    2014-09-08   8.6 MB
CHANGELOG                                   2014-09-08   2.2 kB
termometer_suite_080914.tar.gz              2014-09-08   259.3 kB
README                                      2011-12-12   5.0 kB
Totals: 4 Items                                          8.9 MB

Termometer - find closest terms between a list of terms and a gold standard 
Copyright (C) 2011  LIPN, Université Paris 13
Author : Thibault Mondary (LIPN), based on an original version from J.Van Puymbrouck (LIPN)

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program.  If not, see <http://www.gnu.org/licenses/>.


The Termometer suite is composed of two executables: termometer, which computes terminological distances, and clustering,
which clusters the output of termometer according to a threshold.


*************************************************
To build termometer_suite, you need Qt >= 4.6

Simply run :

	- qmake
	- make

and you should obtain two executables. To speed up the build, you can try parallel compilation with "make -jN", where N is your number of CPU cores.
*************************************************


Usage : 

Let's start with a reference terminology (R) and a list of terms (T). We want to obtain terminological precision and terminological recall of T with respect to R.

IT IS ASSUMED THAT T AND R ARE UTF-8 FILES ! OTHERWISE NO ERROR WILL BE REPORTED, BUT THE RESULTS MAY BE WRONG !


	1/ Compute the terminological distances between T and R, and find for each term of T its closest reference :

	- ./termometer R.txt T.txt output_dist.txt

Note that the estimated computation time isn't very reliable.



	2/ Optionally, determine the threshold for the clustering, using the variability of R :

	- ./termometer R.txt R.txt self_R.txt
	- Look at "self_R.txt", 4th line, PROPOSED GLOBAL THRESHOLD=0.xxxxx
	

	3/ Do the clustering using the determined threshold or a custom one. A threshold of 0 means no adaptation at all (except for normalization) :

	- ./clustering R.txt output_dist.txt output_clusters.txt 0.6578
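
Step 2 asks you to read the proposed threshold by hand. As a convenience, the value could also be extracted with a few lines of Python. This is only a sketch: the function name is hypothetical, and the only thing assumed about the file is the "PROPOSED GLOBAL THRESHOLD=0.xxxxx" line mentioned above.

```python
import re


def proposed_threshold(report_text):
    """Return the proposed global threshold found in the text, or None."""
    # Assumption: the value follows "PROPOSED GLOBAL THRESHOLD=" on one line.
    pattern = re.compile(r"PROPOSED GLOBAL THRESHOLD\s*=\s*([0-9]*\.?[0-9]+)")
    for line in report_text.splitlines():
        match = pattern.search(line)
        if match:
            return float(match.group(1))
    return None


# Hypothetical usage on the file produced in step 2:
# with open("self_R.txt", encoding="utf-8") as handle:
#     print(proposed_threshold(handle.read()))
```

The returned value can then be passed as the last argument of ./clustering.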


About normalization and adaptation

Termometer normalizes terms before computing terminological distances, using a hard-coded list of separators.
The main idea is to identify separators that are not semantic elements of a compound term. Currently, the non-semantic
separators are the space, the tab (\t) and "-". The normalization process consists of the following steps :

	- remove duplicate terms
	- convert every term to lowercase
	- convert every non-semantic separator to a space
	- remove leading and trailing spaces
	- collapse repeated spaces into a single space

The list of separators is coded in "termlist.cpp", at the top of the private function "normalize".
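
The normalization steps listed above can be sketched in Python as follows. This is an illustration only, not a transcription of "termlist.cpp": the function names are hypothetical, the separator set is the one given in this README, and removing duplicates on the normalized forms (rather than on the raw terms) is an assumption.

```python
# Non-semantic separators as listed in this README (assumed complete).
SEPARATORS = (" ", "\t", "-")


def normalize_term(term):
    """Lowercase a term, turn separators into spaces, and tidy the spacing."""
    lowered = term.lower()
    for separator in SEPARATORS:
        lowered = lowered.replace(separator, " ")
    # split() drops leading/trailing whitespace and collapses repeated spaces.
    return " ".join(lowered.split())


def normalize_terms(terms):
    """Normalize every term and keep the first occurrence of each form."""
    seen = set()
    result = []
    for term in terms:
        normalized = normalize_term(term)
        if normalized and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result


print(normalize_term("A compound-TERM"))
```

With this sketch, "A compound-TERM" and "a compound term" end up with the same normalized form.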

Internally, termometer computes closest terms using normalized forms, but prints original forms into the output file.

That is why the output file may contain a zero distance between two terms that look different, such as :

A compound-TERM		0.000		a compound term

The clustering module uses the normalized forms to compute T-P/R/FM and strict P/R/FM. This means that the termlist and the reference may contain fewer terms than the original files. For example, let's take a reference composed of :

thermo-nuclear supernovae
thermonuclear supernovae
supernovae

and a termlist composed of :

thermonuclear super-novae
thermo nuclear supernovae
supernova


The output of termometer will be :

TERM	DISTANCE	CLOSEST REFERENCE
thermonuclear super-novae	0.27	thermonuclear supernovae
thermo nuclear supernovae	0	thermo-nuclear supernovae
supernova	0.1	supernovae


and the output of the clustering module at threshold 0 :

Reference : thermo-nuclear supernovae
	thermo nuclear supernovae	0


T-precision = 1/3 = 0.333333
T-recall = 1/3 = 0.333333
T-F-measure = 0.333333

Strict precision = 1/3 = 0.333333
Strict recall = 1/3 = 0.333333
Strict F-measure = 0.333333


and at threshold 0.3 :


Reference : thermo-nuclear supernovae
	thermo nuclear supernovae	0

Reference : thermonuclear supernovae
	thermonuclear super-novae	0.27

Reference : supernovae
	supernova	0.1


T-precision = 2.63/3 = 0.876667
T-recall = 2.63/3 = 0.876667
T-F-measure = 0.876667

Strict precision = 1/3 = 0.333333
Strict recall = 1/3 = 0.333333
Strict F-measure = 0.333333
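
The figures above are consistent with each clustered term contributing (1 - distance) to the T-measures, and with only zero-distance terms counting for the strict measures. The Python sketch below reproduces them under that assumption. It is not the clustering module's code, the function names are hypothetical, and how recall is counted when several terms share one reference is not covered here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def t_scores(distances, n_terms, n_refs, threshold):
    """Return (T-precision, T-recall, T-F-measure) under the assumed formula."""
    # Assumption: a term is clustered when its distance is <= threshold,
    # and each clustered term contributes (1 - distance).
    credit = sum(1.0 - d for d in distances if d <= threshold)
    precision = credit / n_terms
    recall = credit / n_refs
    return precision, recall, f_measure(precision, recall)


def strict_scores(distances, n_terms, n_refs):
    """Same measures, counting only exact matches (distance 0)."""
    exact = sum(1 for d in distances if d == 0)
    precision = exact / n_terms
    recall = exact / n_refs
    return precision, recall, f_measure(precision, recall)


# Distances from the example: 0.27, 0 and 0.1, with 3 terms and 3 references.
example = [0.27, 0.0, 0.1]
print(t_scores(example, 3, 3, 0.0))
print(t_scores(example, 3, 3, 0.3))
print(strict_scores(example, 3, 3))
```

At threshold 0.3 the credit is 0.73 + 1 + 0.9 = 2.63, which matches the 2.63/3 shown above.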

******************************************************************************************
Parsing results

Four Perl scripts are provided to help parse the results of both Termometer and its clustering module.

	- parse_clusters.pl transforms a list of clustering results into CSV (useful when many systems are tested)
	- extract_unmatched_terms.pl takes as input a cluster file and extracts terms that were above the fixed threshold
	- find_above_threshold.pl and find_beyond_threshold_but_not_zero.pl take as input a Termometer output file and filter it with a given threshold, without clustering.
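
For readers who prefer not to use Perl, threshold filtering of a Termometer output file can be approximated in Python. The sketch below does not reproduce the scripts above: it assumes tab-separated TERM / DISTANCE / CLOSEST REFERENCE lines as in the example output, skips every line that does not fit that shape, and uses hypothetical function names.

```python
def parse_distance_lines(text):
    """Return (term, distance, reference) tuples found in the text."""
    rows = []
    for line in text.splitlines():
        # Ignore empty fields so that repeated tabs are tolerated.
        fields = [field.strip() for field in line.split("\t") if field.strip()]
        if len(fields) != 3:
            continue
        try:
            distance = float(fields[1])
        except ValueError:
            continue  # e.g. the "TERM  DISTANCE  CLOSEST REFERENCE" header
        rows.append((fields[0], distance, fields[2]))
    return rows


def above_threshold(rows, threshold):
    """Keep the rows whose distance is strictly above the threshold."""
    return [row for row in rows if row[1] > threshold]


sample = (
    "TERM\tDISTANCE\tCLOSEST REFERENCE\n"
    "thermonuclear super-novae\t0.27\tthermonuclear supernovae\n"
    "thermo nuclear supernovae\t0\tthermo-nuclear supernovae\n"
    "supernova\t0.1\tsupernovae\n"
)
print(above_threshold(parse_distance_lines(sample), 0.2))
```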



T.Mondary
Source: README, updated 2011-12-12