#758 Equivalent Unicode composed and decomposed characters not treated as equivalent

3.5
closed-fixed
None
5
2018-02-27
2015-08-02
No

Some characters such as ü can be encoded in multiple ways in Unicode, e.g.

  • U+00FC LATIN SMALL LETTER U WITH DIAERESIS (composed)
  • <U+0075 LATIN SMALL LETTER U, U+0308 COMBINING DIAERESIS> (decomposed)
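
To make the difference concrete, here is a minimal Java sketch using the standard `java.text.Normalizer` class. The class name `ComposedVsDecomposed` and the helper `canonicallyEqual` are made up for illustration and are not OmegaT code.

```java
import java.text.Normalizer;

public class ComposedVsDecomposed {
    // Composed form: U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    static final String COMPOSED = "\u00FC";
    // Decomposed form: U+0075 LATIN SMALL LETTER U followed by U+0308 COMBINING DIAERESIS
    static final String DECOMPOSED = "u\u0308";

    /** Hypothetical helper: true if both strings are equal after NFC normalization. */
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // Plain String.equals compares code units, so the two spellings differ.
        System.out.println(COMPOSED.equals(DECOMPOSED));            // prints false
        // After normalization the two spellings compare equal.
        System.out.println(canonicallyEqual(COMPOSED, DECOMPOSED)); // prints true
    }
}
```

Both strings render as the same glyph, which is why a user would reasonably expect them to match.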

Currently OmegaT treats these as entirely different strings for the purposes of TM and glossary matching. It is reasonable, however, for a user to assume that they would be treated as identical. Whether the current behavior is a bug or not is a bit subjective.

Current behavior

  • Fuzzy matches that would seem to be 100% matches are not recognized as such.
  • Glossary entries that would seem to match are not found.

In the context of the attached sample project,

  • The entry in the project TM is not recognized as matching, and thus appears as an orphan fuzzy match for the first segment.
  • The entry in the external TM is not recognized as a 100% match for the first segment.
  • Only one of the glossary entries finds a hit in the first segment.

Expected behavior

Strings that differ only in character composition should be considered 100% matches.

In the context of the attached sample project,

  • The entry in the project TM should be recognized as a match for the first segment.
  • The entry in the external TM should be a 100% match for the first segment.
  • Both glossary entries should hit in the first segment, and should collapse to one item in the glossary pane.

Proposed fix

The most reasonable way to handle this seems to be to apply Unicode normalization to strings we read into OmegaT that might be subject to matching, i.e. source files, TMs, and glossaries.

It should not be necessary to normalize translation input from the user (direct user typing or machine translations) because target text is not subject to matching within a project. If the output of a project becomes input for another project, it will be normalized upon loading.
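
A rough sketch of the "normalize on load" idea, assuming a toy glossary held in a map. `NormalizeOnLoad`, `normalizeLoaded`, and `loadGlossary` are hypothetical names for illustration only; they do not correspond to actual OmegaT classes or methods.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;

public class NormalizeOnLoad {
    /** Hypothetical helper: NFC-normalize text as it is read from a file. */
    static String normalizeLoaded(String text) {
        return text == null ? null : Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /** Builds a toy glossary keyed by the normalized source term. */
    static Map<String, String> loadGlossary(String[][] rows) {
        Map<String, String> glossary = new LinkedHashMap<>();
        for (String[] row : rows) {
            // Rows that differ only in character composition map to the same key,
            // so putIfAbsent keeps the first one and the entries collapse.
            glossary.putIfAbsent(normalizeLoaded(row[0]), row[1]);
        }
        return glossary;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"M\u00FCller", "Mueller"},   // source term with composed u-diaeresis
            {"Mu\u0308ller", "Mueller"},  // same term with decomposed u-diaeresis
        };
        Map<String, String> glossary = loadGlossary(rows);

        // A source segment is normalized the same way when it is loaded.
        String segmentTerm = normalizeLoaded("Mu\u0308ller");

        System.out.println(glossary.size());                   // prints 1
        System.out.println(glossary.containsKey(segmentTerm)); // prints true
    }
}
```

Because every string is normalized at the point of loading, the matching code itself does not need to know about composition at all.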

Issues for discussion:

  • Is normalization behavior undesirable for any reason?
  • What performance penalties are associated with normalization?
  • What normalization scheme should we use? NFC seems reasonable, and would suffice to work around [#757], but would NFD be better for any reason?
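
As a small aid to the NFC versus NFD question, the following sketch (again using only `java.text.Normalizer`; the class `NfcVsNfd` and its helpers are illustrative names) shows that either form makes the two spellings equal, and that they differ only in which representation is kept. It makes no claim about real-world performance.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NfcVsNfd {
    static String nfc(String s) { return Normalizer.normalize(s, Form.NFC); }
    static String nfd(String s) { return Normalizer.normalize(s, Form.NFD); }

    public static void main(String[] args) {
        String composed = "\u00FC";    // U+00FC
        String decomposed = "u\u0308"; // U+0075 U+0308

        // Either scheme makes the two spellings compare equal.
        System.out.println(nfc(composed).equals(nfc(decomposed))); // prints true
        System.out.println(nfd(composed).equals(nfd(decomposed))); // prints true

        // NFC keeps the single precomposed char; NFD keeps base letter plus combining mark.
        System.out.println(nfc(decomposed).length()); // prints 1
        System.out.println(nfd(composed).length());   // prints 2

        // isNormalized can be used to check whether text is already in a given form.
        System.out.println(Normalizer.isNormalized(composed, Form.NFC));   // prints true
        System.out.println(Normalizer.isNormalized(decomposed, Form.NFC)); // prints false
    }
}
```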
1 Attachment

Related

Bugs: #757

Discussion

  • Aaron Madlon-Kay

    Current behavior (trunk, r7583) with attached test project

     
  • Aaron Madlon-Kay

    From my own investigation:

    • There seems to be no reason not to perform Unicode normalization.
    • Adding appropriate normalization calls has not had a noticeable impact on performance.
    • There seems to be no reason to choose NFD over NFC.

    In light of the above, I have committed changes to trunk that apply NFC normalization to text read from source files, external TMXs, and glossary files (r7637).

     
  • Aaron Madlon-Kay

    • status: open --> closed-fixed
     
  • Aaron Madlon-Kay

    • status: closed-fixed --> open-fixed
     
  • Didier Briel

    Didier Briel - 2015-08-17
    • summary: Equivalent Unicode composed and decomposed characters are not treated as equivalent --> Equivalent Unicode composed and decomposed characters not treated as equivalent
     
  • Didier Briel

    Didier Briel - 2015-09-20

    Fixed in the released version 3.5.2 of OmegaT.

    Didier

     
  • Didier Briel

    Didier Briel - 2015-09-20
    • status: open-fixed --> closed-fixed
     
  • Aaron Madlon-Kay

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -33,4 +33,4 @@
    
    
     * Is normalization behavior undesirable for any reason?
     * What performance penalties are associated with normalization?
    -* What normalization scheme should we use? NFC seems reasonable, and would suffice to work around [bug 757](https://sourceforge.net/p/omegat/bugs/757/), but would NFD be better for any reason?
    +* What normalization scheme should we use? NFC seems reasonable, and would suffice to work around [#757], but would NFD be better for any reason?
    
     


