The free computer aided translation (CAT) tool for professionals

#1138 Improve glossary matching of terms containing numbers

Milestone: 3.5

Status: closed-fixed

Owner: Aaron Madlon-Kay

Labels: None

Priority: 5

Updated: 2015-12-03

Created: 2015-10-06

Creator: Aaron Madlon-Kay

Private: No

Related to RFE#1033

OmegaT's tokenizers drop tokens containing numbers when tokenizing with tokenizeWords methods. This is intentional for matching purposes, where similarity statistics should ignore numerical tokens in order to more accurately represent the linguistic similarity of two texts.

However, when tokenizing for glossary comparison purposes a more literal comparison is desired because glossary entries are somewhat likely to include numbers that are significant to the meaning of the entry.

Dropping the numerical tokens in this case results in a token search of only the non-numerical tokens, e.g.

"CO2 gas" is tokenized to "gas"
"gas" is found in the tokenization of the source segment "poison gas"
Thus "CO2 gas" is a match for "poison gas", but this is undesirable
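
The false match described above can be reproduced with a small, self-contained sketch. This is not OmegaT's code: `tokenizeDroppingNumbers` and `termMatches` are hypothetical stand-ins that only imitate the described behaviour (split on whitespace, lowercase, discard any token containing a digit), and the matching rule (every term token appears somewhere in the source) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NumberDroppingDemo {

    // Hypothetical stand-in for a tokenizeWords-style method: splits on
    // whitespace, lowercases, and drops any token that contains a digit.
    static List<String> tokenizeDroppingNumbers(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            boolean hasDigit = raw.chars().anyMatch(Character::isDigit);
            if (!hasDigit) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Assumed matching rule: a term matches when it has at least one token
    // and every one of its tokens occurs in the tokenized source.
    static boolean termMatches(String term, String source) {
        List<String> termTokens = tokenizeDroppingNumbers(term);
        return !termTokens.isEmpty()
                && tokenizeDroppingNumbers(source).containsAll(termTokens);
    }

    public static void main(String[] args) {
        // "CO2" is discarded, so only "gas" is left to search for.
        System.out.println(tokenizeDroppingNumbers("CO2 gas"));
        // The remaining token is found in "poison gas": an unwanted match.
        System.out.println(termMatches("CO2 gas", "poison gas"));
    }
}
```

Under these assumptions the term is reduced to the single token "gas", which is why it is reported as a match for "poison gas".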

This problem cannot be addressed by adjusting existing preferences (stemming on/off, "Use Terms Appearing Separately in Source Text" on/off).

Solution:

Glossary matching with stemming on: StemmingMode.GLOSSARY will no longer filter out tokens containing numbers
Stemming off: Instead of StemmingMode.NONE (which is used for fuzzy matching as well) we use the tokenizeVerbatim() method. This should let you match literally anything you like.
Glossary matching will continue to be case-insensitive in either mode.
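
The intended effect of the change can be sketched as follows. This is an illustration only, not the code committed to OmegaT: `tokenizeKeepingNumbers` and `termMatches` are hypothetical helpers, stemming is not modelled, and the real `StemmingMode.GLOSSARY` and `tokenizeVerbatim()` implementations may split text differently. The sketch assumes whitespace tokenization, keeps tokens containing digits, and lowercases both sides so the comparison stays case-insensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class GlossaryTokenDemo {

    // Hypothetical glossary-oriented tokenizer: splits on whitespace and
    // lowercases, but keeps tokens that contain digits.
    static List<String> tokenizeKeepingNumbers(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Assumed matching rule: every token of the term must occur in the
    // source (order and adjacency are not checked in this sketch).
    static boolean termMatches(String term, String source) {
        List<String> termTokens = tokenizeKeepingNumbers(term);
        return !termTokens.isEmpty()
                && tokenizeKeepingNumbers(source).containsAll(termTokens);
    }

    public static void main(String[] args) {
        // "co2" is now part of the search, so "poison gas" no longer matches.
        System.out.println(termMatches("CO2 gas", "poison gas"));
        // Case differences are ignored, so this still matches.
        System.out.println(termMatches("co2 GAS", "Release of CO2 gas"));
    }
}
```

With numeric tokens retained, "CO2 gas" requires both "co2" and "gas" to be present in the source segment, which rules out "poison gas" while still matching segments that contain the full term in any letter case.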

Aaron Madlon-Kay - 2015-10-06

This is implemented in trunk, r7786.


Aaron Madlon-Kay - 2015-10-06

status: open --> open-fixed

JDR - 2015-10-06

Thanks for implementing this, Aaron. It seems to work exactly as I had hoped. Thanks very much.
Julian


Didier Briel - 2015-12-03

Closed in the released version 3.5.3 of OmegaT.

Didier


Didier Briel - 2015-12-03

status: open-fixed --> closed-fixed

JDR - 2015-12-03

Thank you. That's perfect.

Julian
  
