#1250 Cull excessive English stop words list

Milestone: 4.0

Status: closed-fixed

Owner: Aaron Madlon-Kay

Labels: None

Priority: 5

Updated: 2016-09-06

Created: 2016-05-19

Creator: Aaron Madlon-Kay

Private: No

For historical reasons, OmegaT includes a custom English stop words list. This list was originally intended for terminology extraction, so it contains a very large number of words, many of which are not trivial.

OmegaT uses stop words for two main purposes:

1. To calculate fuzzy match statistics that accurately reflect linguistic similarity, where stop words tend to be noise.
2. Per [#1124], to reduce noise when looking up words in reference material.

StopList_en.txt is ill-suited to both of these tasks: it causes confusion when seemingly quite different sentences appear to be 100% matches, and it excludes some non-trivial words from dictionary lookup.
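To illustrate the first problem, here is a minimal sketch in Python of how an over-broad stop list can inflate a similarity score. It is not OmegaT's actual fuzzy match algorithm: `difflib.SequenceMatcher` over word tokens is used as a stand-in, and the extra words in `AGGRESSIVE_STOP_WORDS` are hypothetical, not taken from StopList_en.txt.

```python
from difflib import SequenceMatcher

# Lucene's default English stop words, as quoted in this ticket.
LUCENE_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with",
})

# Hypothetical over-broad list: the extra entries are illustrative only
# and are NOT copied from StopList_en.txt.
AGGRESSIVE_STOP_WORDS = LUCENE_STOP_WORDS | {"before", "after", "never", "always"}


def content_tokens(sentence, stop_words):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in stop_words]


def similarity(a, b, stop_words):
    """Token-level similarity in [0, 1] after stop word removal.

    Stand-in metric for illustration; OmegaT computes its own statistics.
    """
    return SequenceMatcher(
        None, content_tokens(a, stop_words), content_tokens(b, stop_words)
    ).ratio()


if __name__ == "__main__":
    s1 = "Close the valve before shutdown"
    s2 = "Close the valve after shutdown"
    print(similarity(s1, s2, AGGRESSIVE_STOP_WORDS))  # "before"/"after" are dropped
    print(similarity(s1, s2, LUCENE_STOP_WORDS))      # "before"/"after" are kept
```

With the hypothetical aggressive list, the two sentences reduce to the same tokens and score as identical even though their meanings are opposite; with the shorter list, the difference stays visible in the score.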

Lucene now includes a reasonable set of English stop words that better align with our goals:

      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
      "no", "not", "of", "on", "or", "such",
      "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"

Thus I have removed StopList_en.txt so that we fall back to Lucene instead.
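The fallback behaviour can be pictured roughly as follows. This is a hedged sketch, not OmegaT's code: the function name `load_stop_words`, the one-word-per-line file format, and the `StopList_<lang>.txt` naming for other languages are assumptions made for illustration, and `DEFAULT_STOP_WORDS` is an abbreviated stand-in for the Lucene set.

```python
from pathlib import Path
import tempfile

# Abbreviated stand-in for the Lucene default set (assumption for brevity).
DEFAULT_STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def load_stop_words(resource_dir, lang, default):
    """Return the custom stop list for `lang` if its file exists, else `default`.

    Hypothetical helper: assumes whitespace-separated words in
    `StopList_<lang>.txt` under `resource_dir`.
    """
    custom = Path(resource_dir) / f"StopList_{lang}.txt"
    if custom.is_file():
        words = custom.read_text(encoding="utf-8").split()
        return frozenset(w.lower() for w in words)
    # No custom file: fall back to the library-provided default.
    return default


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # With no StopList_en.txt present, the default set is returned.
        print(sorted(load_stop_words(tmp, "en", DEFAULT_STOP_WORDS)))
```

Under this model, deleting the custom file is enough to switch English to the default set, with no other code change.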

Didier Briel - 2016-09-06

status: open-fixed --> closed-fixed

Didier Briel - 2016-09-06

Implemented in the released version 4.0 of OmegaT.

Didier
