Related to RFE#1033
OmegaT's tokenizers drop tokens containing numbers when tokenizing with tokenizeWords methods. This is intentional for matching purposes, where similarity statistics should ignore numerical tokens in order to more accurately represent the linguistic similarity of two texts.
However, when tokenizing for glossary comparison purposes a more literal comparison is desired because glossary entries are somewhat likely to include numbers that are significant to the meaning of the entry.
Dropping the numerical tokens in this case means the glossary search is performed over only the non-numerical tokens, so entries whose meaning depends on a number cannot be matched reliably.
This problem cannot be addressed by adjusting existing preferences (stemming on/off, "Use Terms Appearing Separately in Source Text" on/off).
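A minimal standalone sketch of the behavior described above (this is illustrative only, not OmegaT's actual tokenizer code): a word tokenizer that silently discards any token containing a digit loses exactly the parts of a glossary entry that carry numbers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a tokenizeWords-style pass that can
// optionally drop digit-bearing tokens, as fuzzy matching does.
public class NumericTokenDrop {
    static List<String> tokenizeWords(String text, boolean dropNumeric) {
        List<String> out = new ArrayList<>();
        // Split on anything that is not a letter or digit.
        for (String tok : text.split("[^\\p{L}\\p{N}]+")) {
            if (tok.isEmpty()) continue;
            // Matching-style pass: discard any token that contains a digit.
            if (dropNumeric && tok.chars().anyMatch(Character::isDigit)) continue;
            out.add(tok);
        }
        return out;
    }

    public static void main(String[] args) {
        String entry = "ISO 9001 certification";
        System.out.println(tokenizeWords(entry, true));  // numeric token dropped
        System.out.println(tokenizeWords(entry, false)); // full entry preserved
    }
}
```

With the digit filter on, a glossary entry like "ISO 9001 certification" is reduced to its non-numerical tokens, which is the lossy search this ticket describes.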
Solution:
- StemmingMode.GLOSSARY will no longer filter out tokens containing numbers.
- For StemmingMode.NONE (which is used for fuzzy matching as well) we use the tokenizeVerbatim() method. This should let you match literally anything you like.
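A hedged sketch of how the two-part fix could be dispatched. The StemmingMode names and the tokenizeVerbatim() method come from the ticket itself; the method bodies here are illustrative stand-ins, not OmegaT's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: per-mode tokenization after the fix described above.
public class GlossaryTokenFix {
    enum StemmingMode { NONE, MATCHING, GLOSSARY }

    // Word tokenizer: letters/digits only; matching mode drops numeric tokens.
    static List<String> tokenizeWords(String text, boolean dropNumeric) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("[^\\p{L}\\p{N}]+")) {
            if (tok.isEmpty()) continue;
            if (dropNumeric && tok.chars().anyMatch(Character::isDigit)) continue;
            out.add(tok);
        }
        return out;
    }

    // Verbatim tokenizer: split on whitespace only, keeping tokens literally.
    static List<String> tokenizeVerbatim(String text) {
        List<String> out = new ArrayList<>();
        for (String tok : text.split("\\s+")) {
            if (!tok.isEmpty()) out.add(tok);
        }
        return out;
    }

    static List<String> tokensFor(String text, StemmingMode mode) {
        switch (mode) {
            case MATCHING:
                return tokenizeWords(text, true);   // similarity stats ignore numbers
            case GLOSSARY:
                return tokenizeWords(text, false);  // after the fix: numbers kept
            default:
                return tokenizeVerbatim(text);      // NONE: match literally anything
        }
    }

    public static void main(String[] args) {
        String s = "RFC 2616 section 4";
        System.out.println(tokensFor(s, StemmingMode.MATCHING));
        System.out.println(tokensFor(s, StemmingMode.GLOSSARY));
        System.out.println(tokensFor(s, StemmingMode.NONE));
    }
}
```

The design point is that each consumer gets the strictness it needs: fuzzy matching keeps ignoring numbers, glossary lookup now sees them, and verbatim mode makes no lexical assumptions at all.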
This is implemented in trunk, r7786.
Thanks for implementing this, Aaron. It seems to work exactly as I had hoped. Thanks very much.
Julian
Closed in the released version 3.5.3 of OmegaT.
Didier
Thank you. That's perfect.
Julian
On 3 December 2015 at 08:40, Didier Briel didierbr@users.sf.net wrote:
Related Feature Requests: #1138