[gate-cvs] SF.net SVN: gate:[17537] userguide/trunk/misc-creole.tex

Revision: 17537
          http://sourceforge.net/p/gate/code/17537
Author:   adamfunk
Date:     2014-03-04 21:00:13 +0000 (Tue, 04 Mar 2014)
Log Message:
-----------
More details of the new TR stuff.

Modified Paths:
--------------
    userguide/trunk/misc-creole.tex

Modified: userguide/trunk/misc-creole.tex
===================================================================

--- userguide/trunk/misc-creole.tex	2014-03-04 19:49:45 UTC (rev 17536)
+++ userguide/trunk/misc-creole.tex	2014-03-04 21:00:13 UTC (rev 17537)
@@ -3252,9 +3252,9 @@
 \sect[sec:creole:termraider]{TermRaider term extraction tools}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 TermRaider is a set of term extraction and scoring tools developed in the NeOn
-and ARCOMEM projects.  Although the plugin is still experimental, we are now
-including it in GATE as a response to frequent requests from GATE users who have
-read publications related to those projects.
+and ARCOMEM projects.  Although some parts of the plugin are still experimental,
+we are now including it in GATE as a response to frequent requests from GATE
+users who have read publications related to those projects.
 
 The easiest way to test TermRaider is to populate a corpus with related
 documents, load the sample
@@ -3262,8 +3262,8 @@
 and run it.  This application will process the documents and create instances of
 three termbank language resources with sensible parameters.
 
-All the language resources in TermRaider are properly serializable and so can be
-stored in GATE datastores.
+All the language resources in TermRaider are serializable and can be stored in
+GATE datastores.
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{Termbank language resources}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3274,25 +3274,34 @@
 \begin{itemize}
 \item \textbf{corpora}: a \texttt{Set<gate.Corpus>} from which the termbank is
   generated.
-\item \textbf{inputASName}: the annotation set name in which to find the term
-  candidates.
+\item \textbf{inputASName} (\texttt{String}): the annotation set name in which
+  to find the term candidates.
 \item \textbf{inputAnnotationTypes} (\texttt{Set<String>}): annotation types
   which are treated as term candidates.
-\item \textbf{inputAnnotationFeature}: the feature of each annotation used as
-  the term string (if the feature is missing from the annotation, the underlying
-  document content will be whitespace-trimmed and used).  Note that these values
-  are case-sensitive; normally the lemma (\emph{root} feature from the GATE
-  Morphological Analyser) is used for consistency.
+\item \textbf{inputAnnotationFeature} (\texttt{String}): the feature of each
+  annotation used as the term string (if the feature is missing from the
+  annotation, the underlying document content will be whitespace-trimmed and
+  used).  Note that these values are case-sensitive; normally the lemma
+  (\emph{root} feature from the GATE Morphological Analyser) is used for
+  consistency.
 \item \textbf{languageFeature} (\texttt{String}): the feature of each annotation
   identifying the language of the term.  (Annotations without the feature will
-  get a blank language code.)
-\item \textbf{scoreProperty}: a description of the score, used in the CSV
-  output and the Termbank Score Copier PR.
+  get an empty string as a language code, which can match language-coded terms
+  more flexibly in some situations.)
+\item \textbf{scoreProperty} (\texttt{String}): a description of the principal
+  output score, used in the termbank's GUI and CSV output and in the Termbank
+  Score Copier PR.  (A sensible default is provided for each termbank type.)
 \item \textbf{debugMode} (\texttt{Boolean}): this sets the verbosity of the
   output while creating the termbank.
 \end{itemize}
 
 
+Each type of termbank has one or more score types, shown as columns in the
+\emph{Details} tab of the GUI and listed in the \emph{Type} pull-down menu in
+the \emph{Term Cloud} tab.  The first score is always the principal one named by
+the \emph{scoreProperty} parameter above.
+
+
 The \texttt{Term} class is defined in terms of the term string itself, the
 language code, and the annotation type, so it is possible to distinguish
 \emph{affect}(\emph{english},\emph{Noun}) from
@@ -3308,7 +3317,7 @@
 of tf.idf over other corpora.
 
 A document frequency bank can be constructed from one or more corpora, from one
-or more existing document frequency banks, or from a combination of them, so
+or more existing document frequency banks, or from a combination of both, so
 that document frequency counts from different sources can be compiled together.
 It therefore has one additional parameter:
 %%
@@ -3316,6 +3325,10 @@
 \item \textbf{inputBanks} zero or more other instances of
   \emph{DocumentFrequencyBank}.
 \end{itemize}
+
+
+This type of termbank has only the principal score type.
+%% TODO document the flexible language code matching
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{TfIdf Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3323,52 +3336,76 @@
 of corpora.  It has the following additional init parameters.
 %%
 \begin{itemize}
-\item \textbf{docFreqSource}
+\item \textbf{docFreqSource}: an instance of \emph{DocumentFrequencyBank}, which
+  could be derived from another set of corpora (as described above); if this
+  parameter is \texttt{null} (\verb!<none>! in the GUI), an instance of
+  DocumentFrequencyBank will be constructed from this LR's corpora parameter and
+  used here.
 \item \textbf{idfCalculation}: an enum (pull-down menu in the GUI) with the
-  following options for inverted document frequency:
+  following options for the adjusted inverse document frequency
+  $g(\mathit{df})$ (each yields a positive value, to prevent division by zero):
   \begin{itemize}
-  \item \emph{Logarithmic} $=\log_{2}(n/df)$;
-  \item \emph{Scaled} $= 1/df$;
-  \item \emph{Natural} $= 1/df$;
+    % TODO: add unscaled Logarithmic as below
+    % change below to LogarithmicScaled
+  \item \emph{Logarithmic} $=\log_{2}(1+n/\mathit{df})$;
+  \item \emph{Scaled} $=(1+n)/(1+\mathit{df})$;
+  \item \emph{Natural} $=1/(1+\mathit{df})$.
   \end{itemize}
-\item \textbf{normalization}
+\item \textbf{normalization}: an enum (pull-down) with the following options for
+  normalizing the raw score $s$, where $s=f(\mathit{tf}){\times}g(\mathit{df})$:
   \begin{itemize}
-  \item \emph{None}
-  \item \emph{Hundred}
-  \item \emph{Sigmoid}
+  \item \emph{None} $=s$ (this may return numbers in a low range);
+  \item \emph{Hundred} $=100s$ (this makes the sliders easier to use);
+  \item \emph{Sigmoid} $=\frac{200}{1+e^{-s/k}}-100$ (this maps all raw scores
+    monotonically to values in the 0--100 range, so that $0{\rightarrow}0$ and
+    ${\infty}{\rightarrow}100$).
   \end{itemize}
 \item \textbf{tfCalculation}: an enum (pull-down) with the following options for
-  term frequency:
+  adjusting term frequency $f(\mathit{tf})$:
   \begin{itemize}
-  \item \emph{Natural} $=tf$;
-  \item \emph{Sqrt} 
-  \item \emph{Logarithmic} $=1+\log_{2} tf$.
+  \item \emph{Natural} $=\mathit{tf}$;
+  \item \emph{Sqrt} $=\sqrt{\mathit{tf}}$;
+  \item \emph{Logarithmic} $=1+\log_{2} \mathit{tf}$.
   \end{itemize}
 \end{itemize}
 %%
-For these calcutations, $tf$ is the term frequency (number of occurrences of the
-term in the corpora), $df$ is the document frequency according to the
-DocumentFrequencySource, and $n$ is the total number of documents.
+For the calculations above, $\mathit{tf}$ is the term frequency (number of
+individual occurrences of the term in the current corpora), whereas
+$\mathit{df}$ is the document frequency of the term according to the
+DocumentFrequencySource and $n$ is the total number of documents in the
+DocumentFrequencySource.  The raw score
+$s=f(\mathit{tf}){\times}g(\mathit{df})$.
+
+This type of termbank has five score types: the principal one (normalized), the
+raw score ($s$ above, with the principal name plus the suffix ``.raw''),
+\emph{termFrequency}, \emph{localDocFrequency} (number of documents in the
+current corpora containing the term; not used in the tf.idf calculation), and
+\emph{refDocFrequency} ($\mathit{df}$ above; this will be the same as
+\emph{localDocFrequency} if no external \emph{docFreqSource} was specified).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Annotation Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-This termbank collects the values of scoring features on all the term candidates
-and selects the minimum or maximum score or averages them, according to the
-\textbf{mergingMode} parameter.  It has the following additional init
-parameters.
+This termbank collects the values of scoring features on all the term candidate
+annotations, and for each term determines the minimum, maximum, or mean
+according to the \textbf{mergingMode} parameter.  It has the following
+additional parameters.
 %%
 \begin{itemize}
 \item \textbf{inputScoreFeature}: an annotation feature whose value should be a
   \texttt{Number} or interpretable as a number.
 \item \textbf{mergingMode}: an enum (pull-down menu in the GUI) with the options
   \emph{MINIMUM}, \emph{MEAN}, or \emph{MAXIMUM}.
-\item \textbf{normalization}
-  \begin{itemize}
-  \item \emph{None}
-  \item \emph{Hundred}
-  \item \emph{Sigmoid}
-  \end{itemize}
+\item \textbf{normalization}: the same normalization options as for the TfIdf
+  Termbank above.  To produce augmented tf.idf scores (as in the sample
+  application), it is generally better to augment the \texttt{tfIdfScore.raw}
+  values, compile them into an Annotation Termbank, and normalize the results
+  (rather than carrying out augmentation on the normalized tf.idf scores).
 \end{itemize}
+
+This type of termbank has four score types: the principal one (normalized), the
+raw score (minimum, maximum, or mean above; with the principal name plus the
+suffix ``.raw''), \emph{termFrequency}, and \emph{localDocFrequency} (the last
+two are not used in the calculation).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsubsect{Hyponymy Termbank}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -3378,43 +3415,39 @@
 \begin{itemize}
 \item \textbf{inputHeadFeatures} (\texttt{List<String>}): annotation features on
   term candidates containing the head of the expression.
-\item \textbf{normalization}
-  \begin{itemize}
-  \item \emph{None}
-  \item \emph{Hundred}
-  \item \emph{Sigmoid}
-  \end{itemize}
+\item \textbf{normalization}: the same normalization options as for the TfIdf
+  Termbank above.
 \end{itemize}
 %%
 Head information is generated by the multiword JAPE grammar included in the
-application.  We consider $T_1$ a hyponym of $T_2$ if and only if $T_2$'s head
-feature value ends with $T_1$'s head or string feature value.
+application.  This LR treats $T_1$ as a hyponym of $T_2$ if and only if $T_2$'s
+head feature's value ends with $T_1$'s head or string feature's value.  (This
+depends on \emph{head-final} construction of compound nouns, as used in English
+and German.)
+
+This type of termbank has five score types: the principal one (normalized), the
+raw score ($s$ above, with the principal name plus the suffix ``.raw''),
+\emph{termFrequency}, \emph{hyponymCount} (number of distinct hyponyms found in
+the current corpora), and \emph{localDocFrequency} (number of documents in the
+current corpora containing the term; not used in other calculations).
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{Termbank Score Copier}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-This processing resource copies the scores and optionally frequencies from a
-termbank into features of the term annotations.  It has no init parameters and
-two runtime parameters.
+This processing resource copies the scores from a termbank onto features of the
+term annotations.  It has no init parameters and two runtime parameters.
 %%
 \begin{itemize}
 \item \textbf{annotationSetName}
-\item \textbf{frequencyFeature} default value \emph{frequency}
-\item \textbf{docFrequencyFeature} default value \emph{docFrequency}
 \item \textbf{termbank}
 \end{itemize}
 %%
-This PR uses the annotation types, string and language code features, and score
-features from the selected termbank.  It treats any annotation with a matching
-type and matching string and language feature (where a missing feature matches
-the triple-underscore ``not found'' code) as a match, and copies the score to a
-feature on the annotation specified by the termbank's \emph{scoreProperty}
-parameter.  
-
-It also copies the term frequency to the annotation's \emph{frequencyFeature}
-unless that parameter is blank; and copies the document frequency to the
-\emph{docFrequencyFeature} unless that is blank.  Note that the default values
-are not blank---you need to clear either or both parameters to prevent these
-annotation features from being filled in.
+This PR uses the annotation types, string and language code features, and scores
+from the selected termbank.  It treats any annotation with a matching type and
+matching string and language feature as a match (although a missing language
+feature matches the empty string used as a ``not found'' code), and copies all
+the termbank's scores to features on the annotation with the scores' names.
+(The principal score name is determined by the termbank's \emph{scoreProperty}
+parameter.)
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsect{The PMI bank language resource}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
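
Reviewer's note: the TfIdf Termbank scoring documented in this revision can be
sketched in Python as below.  This is only an illustration of the formulas in
the diff, not TermRaider's Java implementation.  The function names are
hypothetical, the sigmoid constant k is not defined in the diff (the default
of 1.0 here is an assumption), and the logarithmic options assume tf >= 1 and
df >= 1.

```python
import math


def adjust_tf(mode: str, tf: float) -> float:
    """f(tf): the tfCalculation options listed in the diff."""
    if mode == "Natural":
        return tf
    if mode == "Sqrt":
        return math.sqrt(tf)
    if mode == "Logarithmic":
        return 1 + math.log2(tf)  # assumes tf >= 1
    raise ValueError(f"unknown tfCalculation: {mode}")


def adjust_idf(mode: str, df: float, n: float) -> float:
    """g(df): the idfCalculation options listed in the diff."""
    if mode == "Logarithmic":
        return math.log2(1 + n / df)  # assumes df >= 1
    if mode == "Scaled":
        return (1 + n) / (1 + df)
    if mode == "Natural":
        return 1 / (1 + df)
    raise ValueError(f"unknown idfCalculation: {mode}")


def normalize(mode: str, s: float, k: float = 1.0) -> float:
    """The normalization options; k = 1.0 is an assumed placeholder."""
    if mode == "None":
        return s
    if mode == "Hundred":
        return 100 * s
    if mode == "Sigmoid":
        # Maps non-negative raw scores monotonically into the 0-100 range.
        return 200 / (1 + math.exp(-s / k)) - 100
    raise ValueError(f"unknown normalization: {mode}")


def tfidf_score(tf, df, n, tf_mode, idf_mode, norm_mode, k=1.0):
    """Return (raw, normalized), where raw is s = f(tf) * g(df)."""
    raw = adjust_tf(tf_mode, tf) * adjust_idf(idf_mode, df, n)
    return raw, normalize(norm_mode, raw, k)


if __name__ == "__main__":
    # A term seen 4 times, in 3 of the 7 reference documents.
    print(tfidf_score(4, 3, 7, "Sqrt", "Scaled", "Hundred"))
```

Returning the raw and normalized values together mirrors the separate ``.raw''
and principal score types described for this termbank.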
