The free computer aided translation (CAT) tool for professionals

#758 Equivalent Unicode composed and decomposed characters not treated as equivalent

Milestone: 3.5

Status: closed-fixed

Owner: Aaron Madlon-Kay

Labels: None

Priority: 5

Updated: 2018-02-27

Created: 2015-08-02

Creator: Aaron Madlon-Kay

Private: No

Some characters such as ü can be encoded in multiple ways in Unicode, e.g.

U+00FC LATIN SMALL LETTER U WITH DIAERESIS (composed)
<U+0075 LATIN SMALL LETTER U U+0308 COMBINING DIAERESIS> (decomposed)

Currently OmegaT treats these as entirely different strings for the purposes of TM and glossary matching. It is reasonable, however, for a user to assume that they would be treated as identical. Whether the current behavior is a bug or not is a bit subjective.
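To make the mismatch concrete, here is a minimal sketch using the JDK's `java.text.Normalizer`. The class and method names are made up for illustration and are not taken from the OmegaT code base. It shows that the two encodings of ü above are unequal under a plain string comparison, but equal once both are converted to the same normalization form.

```java
import java.text.Normalizer;

class ComposedVsDecomposed {
    // U+00FC, the precomposed form of the letter.
    static final String COMPOSED = "\u00FC";
    // U+0075 followed by U+0308, the decomposed form of the same letter.
    static final String DECOMPOSED = "u\u0308";

    // Compares two strings after converting both to NFC.
    static boolean equalAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.equals(DECOMPOSED));         // false
        System.out.println(equalAfterNfc(COMPOSED, DECOMPOSED)); // true
    }
}
```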

Current behavior

* Fuzzy matches that would seem to be 100% matches are not recognized as such.
* Glossary entries that would seem to match are not found.

In the context of the attached sample project:

* The entry in the project TM is not recognized as matching, and thus appears as an orphan fuzzy match for the first segment.
* The entry in the external TM is not recognized as a 100% match for the first segment.
* Only one of the glossary entries finds a hit in the first segment.

Expected behavior

Strings that differ only in character composition should be considered 100% matches.

In the context of the attached sample project:

* The entry in the project TM should be recognized as a match for the first segment.
* The entry in the external TM should be a 100% match for the first segment.
* Both glossary entries should hit in the first segment, and should collapse to one item in the glossary pane.
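The glossary "collapse" could look roughly like the sketch below. This is an assumption about one way to get that behavior, not a description of OmegaT's glossary code. The class name, the `buildGlossary` helper, and the sample term are all hypothetical. Keying entries by the NFC form of the source term means that the composed and decomposed spellings land on the same key.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

class GlossaryCollapseSketch {
    // Hypothetical glossary: each entry is {source term, translation}.
    // Entries are keyed by the NFC form of the source term.
    static Map<String, String> buildGlossary(String[][] entries) {
        Map<String, String> glossary = new LinkedHashMap<>();
        for (String[] entry : entries) {
            String key = Normalizer.normalize(entry[0], Normalizer.Form.NFC);
            // Keep the first translation seen for each normalized term.
            glossary.putIfAbsent(key, entry[1]);
        }
        return glossary;
    }

    public static void main(String[] args) {
        String[][] entries = {
            {"\u00FCber", "over"},  // composed spelling
            {"u\u0308ber", "over"}, // decomposed spelling
        };
        System.out.println(buildGlossary(entries).size()); // 1
    }
}
```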

Proposed fix

The most reasonable way to handle this seems to be to apply Unicode normalization to strings we read into OmegaT that might be subject to matching, i.e. source files, TMs, and glossaries.

It should not be necessary to normalize translation input from the user (direct user typing or machine translations) because target text is not subject to matching within a project. If the output of a project becomes input for another project, it will be normalized upon loading.
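As a rough illustration of normalizing at the point where text enters the program, the sketch below normalizes each line as it is read. The one-segment-per-line loader is a hypothetical stand-in and does not reflect how OmegaT's file filters or TMX and glossary readers are structured. The `isNormalized` check is an optional shortcut that returns already-normalized text unchanged.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class LoadTimeNormalizationSketch {
    // Returns the text in NFC, skipping the conversion when it is already normalized.
    static String toNfc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC)
                ? text
                : Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Hypothetical loader: reads one segment per line and normalizes each as it is read.
    static List<String> loadSegments(Reader source) throws IOException {
        List<String> segments = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                segments.add(toNfc(line));
            }
        }
        return segments;
    }

    public static void main(String[] args) throws IOException {
        // First line is in decomposed form, second line is already composed.
        List<String> segments = loadSegments(new StringReader("u\u0308ber\n\u00FCber\n"));
        System.out.println(segments.get(0).equals(segments.get(1))); // true
    }
}
```

Because everything downstream of the loader sees only NFC text, the matching code itself would not need to change under this approach.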

Issues for discussion:

* Is normalization behavior undesirable for any reason?
* What performance penalties are associated with normalization?
* What normalization scheme should we use? NFC seems reasonable, and would suffice to work around [#757], but would NFD be better for any reason?
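On the NFC versus NFD question, a small sketch (class and helper names are illustrative only) suggests that either form makes the two spellings of ü compare equal. The visible difference is the resulting representation: NFC yields the single precomposed code point, while NFD yields the base letter plus the combining mark.

```java
import java.text.Normalizer;

class NfcVersusNfdSketch {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00FC";
        String decomposed = "u\u0308";
        // Either form makes the two spellings compare equal.
        System.out.println(nfc(composed).equals(nfc(decomposed))); // true
        System.out.println(nfd(composed).equals(nfd(decomposed))); // true
        // The forms differ in length: NFC yields one char here, NFD yields two.
        System.out.println(nfc(decomposed).length()); // 1
        System.out.println(nfd(composed).length());   // 2
    }
}
```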

1 Attachment

NormalizationTest.zip

Aaron Madlon-Kay - 2015-08-02

Current behavior (trunk, r7583) with attached test project

Normalization-current.png

Aaron Madlon-Kay - 2015-08-02

Expected behavior (topic/aaron/unicode-normalization branch in official git)

Normalization-expected.png

Aaron Madlon-Kay - 2015-08-17

From my own investigation:

There seems to be no reason not to perform Unicode normalization.

Adding appropriate normalization calls has not had a noticeable impact on performance.

There seems to be no reason to choose NFD over NFC.

In light of the above, I have committed changes to trunk that apply NFC normalization to text read from source files, external TMXs, and glossary files (r7637).

Aaron Madlon-Kay - 2015-08-17

status: open --> closed-fixed

Aaron Madlon-Kay - 2015-08-17

status: closed-fixed --> open-fixed

Didier Briel - 2015-08-17

summary: Equivalent Unicode composed and decomposed characters are not treated as equivalent --> Equivalent Unicode composed and decomposed characters not treated as equivalent

Didier Briel - 2015-09-20

Fixed in the released version 3.5.2 of OmegaT.

Didier

Didier Briel - 2015-09-20

status: open-fixed --> closed-fixed

Description has changed:

Diff:

--- old
+++ new
@@ -33,4 +33,4 @@


 * Is normalization behavior undesirable for any reason?
 * What performance penalties are associated with normalization?
-* What normalization scheme should we use? NFC seems reasonable, and would suffice to work around [bug 757](https://sourceforge.net/p/omegat/bugs/757/), but would NFD be better for any reason?
+* What normalization scheme should we use? NFC seems reasonable, and would suffice to work around [#757], but would NFD be better for any reason?

Bugs: ~~#757~~
