
#1785 Tokenizer to support the snowball generated stemming mode

Group: 6.1
Status: open-fixed
Priority: 5
Updated: 2025-11-02
Created: 2022-02-05
Private: No

With the LuceneItalianTokenizer selected for an Italian-to-English project, stemming does not work for glossary entries. For example, with "paese" and "emettere" added to a sample project glossary, the words "paesi" and "emettono" in a source segment are not recognized by the glossary function.

Switching to the Hunspell tokenizer resolved this for me, but I wanted to flag the problem with the Lucene tokenizer.
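
For context, stem-based glossary lookup can be sketched as follows: a glossary term counts as found when its stem equals the stem of some token in the source segment, so the result depends entirely on what the stemmer returns. This is a minimal sketch, not OmegaT's code. The class `StemGlossaryMatch`, the method `matches`, and both stand-in stemmers (an identity function and a small lemma table) are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical illustration; not part of OmegaT or Lucene.
class StemGlossaryMatch {

    /** Returns the glossary terms whose stem equals the stem of some token in the segment. */
    static List<String> matches(List<String> glossaryTerms, String segment,
                                UnaryOperator<String> stemmer) {
        // Stem every word of the source segment once.
        Set<String> sourceStems = new HashSet<>();
        for (String token : segment.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (!token.isEmpty()) {
                sourceStems.add(stemmer.apply(token));
            }
        }
        // A term is a hit only if its own stem appears among the source stems.
        List<String> hits = new ArrayList<>();
        for (String term : glossaryTerms) {
            if (sourceStems.contains(stemmer.apply(term.toLowerCase(Locale.ROOT)))) {
                hits.add(term);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> glossary = Arrays.asList("paese", "emettere");
        String segment = "I paesi emettono nuove norme.";

        // Stand-in 1: no stemming at all, so only identical surface forms match.
        System.out.println(matches(glossary, segment, t -> t)); // prints []

        // Stand-in 2: a tiny hand-written lemma table (an assumption, not real data).
        Map<String, String> lemmas = new HashMap<>();
        lemmas.put("paesi", "paese");
        lemmas.put("emettono", "emettere");
        System.out.println(matches(glossary, segment,
                t -> lemmas.getOrDefault(t, t))); // prints [paese, emettere]
    }
}
```

With a stemmer that leaves "paesi" and "emettono" distinct from "paese" and "emettere", the lookup behaves like the first call and finds nothing, which is the symptom described here.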

Version: OmegaT-5.7.0_0_8ae1ecfb5
Platform: Mac OS X 10.16
Java: 1.8.0_312 x86_64
Memory: 445MiB total / 324MiB free / 3641MiB max

Discussion

  • adrm - 2022-02-12

    Confirmed.

    It seems related to the tokenizer and those specific terms, as some glossary entries are recognized (traduzione/traduzioni, frutto/frutti, lavoro/lavori, for instance), but others are not.

    Switching to the Hunspell tokenizer does lead to different results, but it does not resolve the issue for me: it recognizes even fewer entries. paese/paesi and emettere/emettono are still not matched, and neither are other terms that Lucene did recognize.

    Version: OmegaT-5.7.0 (8ae1ecfb)
    Platform: Ubuntu 20.04
    Java: 11.01.13 (64 bits)
    Memory: 142 MB; 47 MB free; 1.410 MB max

     
  • Hiroshi Miura - 2025-08-18

    Since OmegaT 6.1 beta has upgraded to LanguageTool 6.5 and Lucene 8.11.4, have there been any improvements to the Italian glossary stemming issues reported in ticket #1088? Specifically, does the newer LanguageTool version better handle stemmed forms like paesi/paese and emettono/emettere when using the LuceneItalianTokenizer?

     

    Last edit: Hiroshi Miura 2025-08-18
  • Hiroshi Miura - 2025-08-18

    Ticket moved from /p/omegat/bugs/1088/

     
  • Hiroshi Miura - 2025-08-18
    • labels: --> lucene, stemmer, tokenizer
    • summary: No glossary stemming with LuceneItalianTokenizer --> Tokenizer to support the snowball generated stemming mode
    • assigned_to: Hiroshi Miura
    • Group: None --> 6.1
     
  • Hiroshi Miura - 2025-08-22
     
  • Hiroshi Miura - 2025-11-02
    • status: open --> open-fixed
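
The split reported in the discussion (traduzione/traduzioni, frutto/frutti and lavoro/lavori recognized; paese/paesi and emettere/emettono not) is the pattern one would expect from a light stemmer that only trims a final vowel and leaves short words untouched. The toy below reproduces that pattern. It is an assumption made for illustration, not Lucene's or OmegaT's implementation: the class name `ToyLightStemmer`, the six-letter threshold, and the vowel rule are all hypothetical.

```java
import java.util.Locale;

// Hypothetical toy stemmer; not Lucene's or OmegaT's code.
class ToyLightStemmer {

    // Assumed threshold: words shorter than this are returned unchanged.
    static final int MIN_LENGTH = 6;

    /** Lowercases the word and strips one final vowel (a, e, i, o) from long enough words. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() < MIN_LENGTH) {
            return w; // "paese" and "paesi" (5 letters) stay distinct
        }
        char last = w.charAt(w.length() - 1);
        if ("aeio".indexOf(last) >= 0) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    /** Two forms are treated as the same glossary entry only if their stems are equal. */
    static boolean sameStem(String a, String b) {
        return stem(a).equals(stem(b));
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"traduzione", "traduzioni"}, // traduzion / traduzion -> true
            {"frutto", "frutti"},         // frutt / frutt         -> true
            {"lavoro", "lavori"},         // lavor / lavor         -> true
            {"paese", "paesi"},           // paese / paesi         -> false (too short)
            {"emettere", "emettono"},     // emetter / emetton     -> false (verb endings differ)
        };
        for (String[] p : pairs) {
            System.out.println(p[0] + "/" + p[1] + " -> "
                    + stem(p[0]) + " / " + stem(p[1]) + " : " + sameStem(p[0], p[1]));
        }
    }
}
```

Whether a Snowball-generated Italian stemmer conflates these particular pairs would need to be checked against the real stemmer; the toy only shows how a minimal rule set can produce the mixed results seen in the reports.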
     
