From: <ada...@us...> - 2014-03-04 21:00:16
Revision: 17537
http://sourceforge.net/p/gate/code/17537
Author: adamfunk
Date: 2014-03-04 21:00:13 +0000 (Tue, 04 Mar 2014)
Log Message:
-----------
More details of the new TR stuff.
Modified Paths:
--------------
userguide/trunk/misc-creole.tex
Modified: userguide/trunk/misc-creole.tex
===================================================================
--- userguide/trunk/misc-creole.tex 2014-03-04 19:49:45 UTC (rev 17536)
+++ userguide/trunk/misc-creole.tex 2014-03-04 21:00:13 UTC (rev 17537)
@@ -3252,9 +3252,9 @@
 \sect[sec:creole:termraider]{TermRaider term extraction tools}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 TermRaider is a set of term extraction and scoring tools developed in the NeOn
-and ARCOMEM projects. Although the plugin is still experimental, we are now
-including it in GATE as a response to frequent requests from GATE users who have
-read publications related to those projects.
+and ARCOMEM projects. Although some parts of the plugin are still experimental,
+we are now including it in GATE as a response to frequent requests from GATE
+users who have read publications related to those projects.
 The easiest way to test TermRaider is to populate a corpus with related
 documents, load the sample
@@ -3262,8 +3262,8 @@
 and run it. This application will process the documents and create instances of
 three termbank language resources with sensible parameters.
-All the language resources in TermRaider are properly serializable and so can be
-stored in GATE datastores.
+All the language resources in TermRaider are serializable and can be stored in
+GATE datastores.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{Termbank language resources}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3274,25 +3274,34 @@
 \begin{itemize}
 \item \textbf{corpora}: a \texttt{Set<gate.Corpus>} from which the termbank is
 generated.
-\item \textbf{inputASName}: the annotation set name in which to find the term
- candidates.
+\item \textbf{inputASName} (\texttt{String}): the annotation set name in which
+ to find the term candidates.
 \item \textbf{inputAnnotationTypes} (\texttt{Set<String>}): annotation types
 which are treated as term candidates.
-\item \textbf{inputAnnotationFeature}: the feature of each annotation used as
- the term string (if the feature is missing from the annotation, the underlying
- document content will be whitespace-trimmed and used). Note that these values
- are case-sensitive; normally the lemma (\emph{root} feature from the GATE
- Morphological Analyser) is used for consistency.
+\item \textbf{inputAnnotationFeature} (\texttt{String}): the feature of each
+ annotation used as the term string (if the feature is missing from the
+ annotation, the underlying document content will be whitespace-trimmed and
+ used). Note that these values are case-sensitive; normally the lemma
+ (\emph{root} feature from the GATE Morphological Analyser) is used for
+ consistency.
 \item \textbf{languageFeature} (\texttt{String}): the feature of each annotation
 identifying the language of the term. (Annotations without the feature will
- get a blank language code.)
-\item \textbf{scoreProperty}: a description of the score, used in the CSV
- output and the Termbank Score Copier PR.
+ get an empty string as a language code, which can match language-coded terms
+ more flexibly in some situations.)
+\item \textbf{scoreProperty} (\texttt{String}): a description of the principal
+ output score, used in the termbank's GUI and CSV output and in the Termbank
+ Score Copier PR. (A sensible default is provided for each termbank type.)
 \item \textbf{debugMode} (\texttt{Boolean}): this sets the verbosity of the
 output while creating the termbank.
 \end{itemize}
+Each type of termbank has one or more score types, shown as columns in the
+\emph{Details} tab of the GUI and listed in the \emph{Type} pull-down menu in
+the \emph{Term Cloud} tab.
The first score is always the principal one named by
+the \emph{scoreProperty} parameter above.
+
+
 The \texttt{Term} class is defined in terms of the term string itself, the
 language code, and the annotation type, so it is possible to distinguish
 \emph{affect}(\emph{english},\emph{Noun}) from
@@ -3308,7 +3317,7 @@
 of tf.idf over other corpora.
 A document frequency bank can be constructed from one or more corpora, from one
-or more existing document frequency banks, or from a combination of them, so
+or more existing document frequency banks, or from a combination of both, so
 that document frequency counts from different sources can be compiled together.
 It therefore has one additional parameter:
 %%
@@ -3316,6 +3325,10 @@
 \item \textbf{inputBanks} zero or more other instances of
 \emph{DocumentFrequencyBank}.
 \end{itemize}
+
+
+This type of termbank has only the principal score type.
+%% TODO document the flexible language code matching
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{TfIdf Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3323,52 +3336,76 @@
 of corpora. It has the following additional init parameters.
 %%
 \begin{itemize}
-\item \textbf{docFreqSource}
+\item \textbf{docFreqSource}: an instance of \emph{DocumentFrequencyBank}, which
+ could be derived from another set of corpora (as described above); if this
+ parameter is \texttt{null} (\verb!<none>! in the GUI), an instance of
+ DocumentFrequencyBank will be constructed from this LR's corpora parameter and
+ used here.
 \item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the
- following options for inverted document frequency:
+ following options for adjusting inverted document frequency (all adjusted so
+ they must return a positive value, to prevent division by zero), $g(\mathit{df})$:
 \begin{itemize}
- \item \emph{Logarithmic} $=\log_{2}(n/df)$;
- \item \emph{Scaled} $= 1/df$;
- \item \emph{Natural} $= 1/df$;
+ % TODO: add unscaled Logarithmic as below
+ % change below to LogarithmicScaled
+ \item \emph{Logarithmic} $=\log_{2}(1+n/\mathit{df})$;
+ \item \emph{Scaled} $=(1+n)/(1+\mathit{df})$;
+ \item \emph{Natural} $=1/(1+\mathit{df})$.
 \end{itemize}
-\item \textbf{normalization}
+\item \textbf{normalization}: an enum (pull-down) with the following options for
+ normalizing the raw score $s$, where $s=f(\mathit{tf}){\times}g(\mathit{df})$:
 \begin{itemize}
- \item \emph{None}
- \item \emph{Hundred}
- \item \emph{Sigmoid}
+ \item \emph{None} $=s$ (this may return numbers in a low range);
+ \item \emph{Hundred} $=100s$ (this makes the sliders easier to use);
+ \item \emph{Sigmoid} $=\frac{200}{1+e^{-s/k}}-100$ (this maps all raw scores
+ monotonically to values in the 0--100 range, so that $0{\rightarrow}0$ and
+ ${\infty}{\rightarrow}100$).
 \end{itemize}
 \item \textbf{tfCalculation}: an enum (pull-down) with the following options for
- term frequency:
+ adjusting term frequency $f(\mathit{tf})$:
 \begin{itemize}
- \item \emph{Natural} $=tf$;
- \item \emph{Sqrt}
- \item \emph{Logarithmic} $=1+\log_{2} tf$.
+ \item \emph{Natural} $=\mathit{tf}$;
+ \item \emph{Sqrt} $=\sqrt{\mathit{tf}}$;
+ \item \emph{Logarithmic} $=1+\log_{2} \mathit{tf}$.
 \end{itemize}
 \end{itemize}
 %%
-For these calcutations, $tf$ is the term frequency (number of occurrences of the
-term in the corpora), $df$ is the document frequency according to the
-DocumentFrequencySource, and $n$ is the total number of documents.
+For the calculations above, $\mathit{tf}$ is the term frequency (number of
+individual occurrences of the term in the current corpora), whereas
+$\mathit{df}$ is the document frequency of the term according to the
+DocumentFrequencySource and $n$ is the total number of documents in the
+DocumentFrequencySource. The raw score
+$s=f(\mathit{tf}){\times}g(\mathit{df})$.
+
+This type of termbank has five score types: the principal one (normalized), the
+raw score ($s$ above, with the principal name plus the suffix ``.raw''),
+\emph{termFrequency}, \emph{localDocFrequency} (number of documents in the
+current corpora containing the term; not used in the tf.idf calculation), and
+\emph{refDocFrequency} ($\mathit{df}$ above; this will be the same as
+\emph{localDocFrequency} if no external \emph{docFreqSource} was specified).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Annotation Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-This termbank collects the values of scoring features on all the term candidates
-and selects the minimum or maximum score or averages them, according to the
-\textbf{mergingMode} parameter. It has the following additional init
-parameters.
+This termbank collects the values of scoring features on all the term candidate
+annotations, and for each term determines the minimum, maximum, or mean
+according to the \textbf{mergingMode} parameter. It has the following
+additional parameters.
 %%
 \begin{itemize}
 \item \textbf{inputScoreFeature}: an annotation feature whose value should be a
 \texttt{Number} or interpretable as a number.
 \item \textbf{mergingMode}: an enum (pull-down menu in the GUI) with the options
 \emph{MINIMUM}, \emph{MEAN}, or \emph{MAXIMUM}.
-\item \textbf{normalization}
- \begin{itemize}
- \item \emph{None}
- \item \emph{Hundred}
- \item \emph{Sigmoid}
- \end{itemize}
+\item \textbf{normalization}: the same normalization options as for the TfIdf
+ Termbank above.
To produce augmented tf.idf scores (as in the sample
+ application), it is generally better to augment the \texttt{tfIdfScore.raw}
+ values, compile them into an Annotation Termbank, and normalize the results
+ (rather than carrying out augmentation on the normalized tf.idf scores).
 \end{itemize}
+
+This type of termbank has four score types: the principal one (normalized), the
+raw score (minimum, maximum, or mean above; with the principal name plus the
+suffix ``.raw''), \emph{termFrequency}, and \emph{localDocFrequency} (the last
+two are not used in the calculation).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Hyponymy Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3378,43 +3415,39 @@
 \begin{itemize}
 \item \textbf{inputHeadFeatures} (\texttt{List<String>}): annotation features on
 term candidates containing the head of the expression.
-\item \textbf{normalization}
- \begin{itemize}
- \item \emph{None}
- \item \emph{Hundred}
- \item \emph{Sigmoid}
- \end{itemize}
+\item \textbf{normalization}: the same normalization options as for the TfIdf
+ Termbank above.
 \end{itemize}
 %%
 Head information is generated by the multiword JAPE grammar included in the
-application. We consider $T_1$ a hyponym of $T_2$ if and only if $T_2$'s head
-feature value ends with $T_1$'s head or string feature value.
+application. This LR treats $T_1$ as a hyponym of $T_2$ if and only if $T_2$'s
+head feature's value ends with $T_1$'s head or string feature's value. (This
+depends on \emph{head-final} construction of compound nouns, as used in English
+and German.)
+
+This type of termbank has five score types: the principal one (normalized), the
+raw score ($s$ above, with the principal name plus the suffix ``.raw''),
+\emph{termFrequency}, \emph{hyponymCount} (number of distinct hyponyms found in
+the current corpora), and \emph{localDocFrequency} (number of documents in the
+current corpora containing the term; not used in other calculations).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{Termbank Score Copier}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-This processing resource copies the scores and optionally frequencies from a
-termbank into features of the term annotations. It has no init parameters and
-two runtime parameters.
+This processing resource copies the scores from a termbank onto features of the
+term annotations. It has no init parameters and two runtime parameters.
 %%
 \begin{itemize}
 \item \textbf{annotationSetName}
-\item \textbf{frequencyFeature} default value \emph{frequency}
-\item \textbf{docFrequencyFeature} default value \emph{docFrequency}
 \item \textbf{termbank}
 \end{itemize}
 %%
-This PR uses the annotation types, string and language code features, and score
-features from the selected termbank. It treats any annotation with a matching
-type and matching string and language feature (where a missing feature matches
-the triple-underscore ``not found'' code) as a match, and copies the score to a
-feature on the annotation specified by the termbank's \emph{scoreProperty}
-parameter.
-
-It also copies the term frequency to the annotation's \emph{frequencyFeature}
-unless that parameter is blank; and copies the document frequency to the
-\emph{docFrequencyFeature} unless that is blank. Note that the default values
-are not blank---you need to clear either or both parameters to prevent these
-annotation features from being filled in.
+This PR uses the annotation types, string and language code features, and scores
+from the selected termbank. It treats any annotation with a matching type and
+matching string and language feature as a match (although a missing language
+feature matches the empty string used as a ``not found'' code), and copies all
+the termbank's scores to features on the annotation with the scores' names.
+(The principal score name is determined by the termbank's \emph{scoreProperty}
+parameter.)
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{The PMI bank language resource}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
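
As a rough numerical check of the TfIdf Termbank formulas in the patch above, the following worked example plugs in made-up counts ($\mathit{tf}=16$, $\mathit{df}=3$, $n=21$). These values are illustrative assumptions, not taken from any corpus. The patch does not give a value for the constant $k$ in the Sigmoid option, so it is left symbolic here, and the stated limits assume $k>0$. The \texttt{align*} environment assumes the \texttt{amsmath} package.

```latex
% Illustrative values only (assumed): tf = 16, df = 3, n = 21.
% Requires \usepackage{amsmath} for align*.
\begin{align*}
f_{\mathrm{Natural}}(\mathit{tf}) &= 16, &
f_{\mathrm{Sqrt}}(\mathit{tf}) &= \sqrt{16} = 4, &
f_{\mathrm{Logarithmic}}(\mathit{tf}) &= 1+\log_{2}16 = 5,\\
g_{\mathrm{Logarithmic}}(\mathit{df}) &= \log_{2}(1+21/3) = 3, &
g_{\mathrm{Scaled}}(\mathit{df}) &= (1+21)/(1+3) = 5.5, &
g_{\mathrm{Natural}}(\mathit{df}) &= 1/(1+3) = 0.25.
\end{align*}
% Raw score with tfCalculation = Logarithmic and idfCalculation = Logarithmic:
\[ s = f(\mathit{tf}) \times g(\mathit{df}) = 5 \times 3 = 15 \]
% Normalization options applied to s = 15 (k left symbolic):
\[ \mathrm{None} = 15, \qquad
   \mathrm{Hundred} = 100 \times 15 = 1500, \qquad
   \mathrm{Sigmoid} = \frac{200}{1+e^{-15/k}} - 100 \]
% Sigmoid limits, assuming k > 0:
\[ s = 0 \;\Rightarrow\; \frac{200}{1+1} - 100 = 0, \qquad
   s \rightarrow \infty \;\Rightarrow\; \frac{200}{1+0} - 100 = 100 \]
```

With the same counts, choosing the Natural idf option instead would give $s = 5 \times 0.25 = 1.25$, which suggests how strongly the idfCalculation setting affects the range of the raw score before normalization.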