
#1138 Improve glossary matching of terms containing numbers

Group: 3.5
Status: closed-fixed
Labels: None
Priority: 5
Updated: 2015-12-03
Created: 2015-10-06
Private: No

Related to RFE#1033

OmegaT's tokenizers drop tokens containing numbers when tokenizing with tokenizeWords methods. This is intentional for matching purposes, where similarity statistics should ignore numerical tokens in order to more accurately represent the linguistic similarity of two texts.

However, when tokenizing for glossary comparison purposes a more literal comparison is desired because glossary entries are somewhat likely to include numbers that are significant to the meaning of the entry.

Dropping the numerical tokens in this case results in a token search of only the non-numerical tokens, e.g.

  • "CO2 gas" is tokenized to "gas"
  • "gas" is found in the tokenization of the source segment "poison gas"
  • Thus "CO2 gas" is a match for "poison gas", but this is undesirable
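
The false match can be reproduced with a toy tokenizer. This is a minimal sketch and not OmegaT's actual tokenizer code: the class NumberDroppingDemo and the bodies of its tokenizeWords and isGlossaryMatch methods are hypothetical, and splitting on whitespace is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not OmegaT's real tokenizer.
final class NumberDroppingDemo {

    // Split on whitespace, lowercase each token, and drop any token
    // that contains a digit, mimicking the behavior described above.
    static List<String> tokenizeWords(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            if (raw.chars().anyMatch(Character::isDigit)) {
                continue; // "CO2" is discarded here
            }
            tokens.add(raw.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // A term matches when all of its surviving tokens occur in the segment.
    static boolean isGlossaryMatch(String term, String segment) {
        List<String> termTokens = tokenizeWords(term);
        return !termTokens.isEmpty()
                && tokenizeWords(segment).containsAll(termTokens);
    }

    public static void main(String[] args) {
        System.out.println(tokenizeWords("CO2 gas"));                 // [gas]
        System.out.println(isGlossaryMatch("CO2 gas", "poison gas")); // true
    }
}
```

Because only "gas" survives tokenization, the term matches any segment containing "gas", which is the unwanted result.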

This problem cannot be addressed by adjusting existing preferences (stemming on/off, "Use Terms Appearing Separately in Source Text" on/off).

Solution:

  • Glossary matching with stemming on: StemmingMode.GLOSSARY will no longer filter out tokens containing numbers
  • Stemming off: Instead of StemmingMode.NONE (which is used for fuzzy matching as well) we use the tokenizeVerbatim() method. This should let you match literally anything you like.
  • Glossary matching will continue to be case-insensitive in either mode.
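
The intended effect of the change can be sketched as follows. This is an illustration under assumptions, not the code committed to OmegaT: the class VerbatimGlossaryDemo and its methods are hypothetical stand-ins for the idea behind tokenizeVerbatim(), stemming is left out, and whitespace splitting replaces real tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not OmegaT's real implementation.
final class VerbatimGlossaryDemo {

    // Keep every whitespace-separated token, including those with digits,
    // and lowercase it so that comparison stays case-insensitive.
    static List<String> verbatimTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // A term matches only when all of its tokens, numbers included,
    // occur in the segment.
    static boolean isGlossaryMatch(String term, String segment) {
        List<String> termTokens = verbatimTokens(term);
        return !termTokens.isEmpty()
                && verbatimTokens(segment).containsAll(termTokens);
    }

    public static void main(String[] args) {
        System.out.println(isGlossaryMatch("CO2 gas", "poison gas"));      // false
        System.out.println(isGlossaryMatch("CO2 gas", "Leaking co2 GAS")); // true
    }
}
```

With the numeric token kept, "CO2 gas" no longer matches "poison gas", while a segment that actually contains both tokens still matches regardless of case.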

Related

Feature Requests: #1138

Discussion

  • Aaron Madlon-Kay

    This is implemented in trunk, r7786.

     
  • Aaron Madlon-Kay

    • status: open --> open-fixed
     
  • JDR

    JDR - 2015-10-06

    Thanks for implementing this Aaron. It seems to work exactly as I had hoped. Thanks very much.
    Julian

     
  • Didier Briel

    Didier Briel - 2015-12-03

    Closed in the released version 3.5.3 of OmegaT.

    Didier

     
  • Didier Briel

    Didier Briel - 2015-12-03
    • status: open-fixed --> closed-fixed
     
    • JDR

      JDR - 2015-12-03

      Thank you. That's perfect.

      Julian

      On 3 December 2015 at 08:40, Didier Briel didierbr@users.sf.net wrote:

      • status: open-fixed --> closed-fixed
      • Comment:

      Closed in the released version 3.5.3 of OmegaT.

      Didier

      Status: closed-fixed
      Group: 3.5
      Created: Tue Oct 06, 2015 05:41 AM UTC by Aaron Madlon-Kay
      Last Updated: Tue Oct 06, 2015 08:21 AM UTC
      Owner: Aaron Madlon-Kay

      Related to RFE#1033
      https://sourceforge.net/p/omegat/feature-requests/1033/

