#1250 Cull excessive English stop words list

4.0
closed-fixed
None
5
2016-09-06
2016-05-19
No

For historical reasons, OmegaT includes a custom English stop words list. This list was originally intended for terminology extraction, so it contains a very large number of words, many of which are not trivial.

OmegaT uses stop words for two main purposes:

  1. To calculate fuzzy match statistics that accurately reflect linguistic similarity, where stop words tend to be noise
  2. Per [#1124], to reduce noise when looking up words in reference material

StopList_en.txt is ill-suited to both of these tasks: it causes confusion when seemingly quite different sentences appear to be 100% matches, and it excludes some non-trivial words from dictionary lookup.

Lucene now includes a reasonable set of English stop words that better aligns with our goals:

      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "such",
      "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"

Thus I have removed StopList_en.txt so that we fall back to Lucene instead.
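
To illustrate the effect of the shorter list, here is a minimal sketch of stop word filtering. It is not OmegaT's tokenizer code and does not call Lucene: the 33 words are hard-coded from the list quoted above, and the class name StopWordSketch, the method contentTokens, and the naive letter-based splitting are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // The 33 stop words quoted in this ticket, hard-coded so that the
    // sketch has no Lucene dependency (assumption: real code would take
    // the set from Lucene rather than copying it).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by",
            "for", "if", "in", "into", "is", "it",
            "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these",
            "they", "this", "to", "was", "will", "with"));

    // Hypothetical helper: lowercase the text, split on runs of
    // non-letter characters, and drop empty tokens and stop words.
    static List<String> contentTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ENGLISH).split("[^\\p{L}]+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) {
                out.add(tok);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Only the content words remain for matching or dictionary lookup.
        System.out.println(contentTokens("The file is not in the project folder"));
    }
}
```

With a list this short, only function words are dropped, so two sentences compare as equal only when their content words actually agree. A much larger list removes more of each sentence before comparison, which is how quite different sentences could end up looking identical.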

Related

Feature Requests: #1124

Discussion

    Didier Briel - 2016-09-06
    • status: open-fixed --> closed-fixed
     
    Didier Briel - 2016-09-06

    Implemented in the released version 4.0 of OmegaT.

    Didier

     
