Some characters, such as ü, can be encoded in multiple ways in Unicode, e.g.

U+00FC LATIN SMALL LETTER U WITH DIAERESIS (composed)
U+0075 LATIN SMALL LETTER U + U+0308 COMBINING DIAERESIS (decomposed)

Currently OmegaT treats these as entirely different strings for the purposes of TM and glossary matching. It is reasonable, however, for a user to assume that they would be treated as identical. Whether the current behavior is a bug or not is a bit subjective.
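The mismatch is easy to demonstrate with the standard `java.text.Normalizer` class (a minimal sketch; the demo class is illustrative and not OmegaT code):

```java
import java.text.Normalizer;

public class ComposedVsDecomposed {
    public static void main(String[] args) {
        String composed = "\u00FC";    // ü as a single precomposed code point
        String decomposed = "u\u0308"; // 'u' followed by COMBINING DIAERESIS

        // The two forms render identically but compare as different strings.
        System.out.println(composed.equals(decomposed)); // false

        // After NFC normalization the decomposed form collapses to U+00FC.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(nfc)); // true
    }
}
```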
In the context of the attached sample project, strings that differ only in character composition should be considered 100% matches.
The most reasonable way to handle this seems to be to apply Unicode normalization to strings we read into OmegaT that might be subject to matching, i.e. source files, TMs, and glossaries.
It should not be necessary to normalize translation input from the user (direct user typing or machine translations) because target text is not subject to matching within a project. If the output of a project becomes input for another project, it will be normalized upon loading.
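Normalization on load could be sketched roughly as follows (a hypothetical `normalizeOnLoad` helper for illustration, not the actual r7637 implementation, which applies NFC when reading source files, external TMXs, and glossaries):

```java
import java.text.Normalizer;

public class LoadNormalizer {
    /**
     * Normalize text read from matching-relevant inputs (source files,
     * external TMXs, glossaries) to NFC. Target text typed by the user
     * is left alone; it gets normalized if it is later re-read as input.
     */
    static String normalizeOnLoad(String text) {
        if (text == null || Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text; // already NFC; skip the copy
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposedSource = "Gru\u0308n"; // "Grün" with combining diaeresis
        String normalized = normalizeOnLoad(decomposedSource);
        System.out.println(normalized.equals("Gr\u00FCn")); // true
    }
}
```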
Issues for discussion:
Current behavior (trunk, r7583) with attached test project
Expected behavior (topic/aaron/unicode-normalization branch in official git)
From my own investigation:
In light of the above, I have committed changes to trunk that apply NFC normalization to text read from source files, external TMXs, and glossary files (r7637).
Fixed in the released version 3.5.2 of OmegaT.
Didier
Diff:
Related
Bugs:
#757