#1046 Glossary has wrong encoding for non-ASCII characters

Group: 5.5
Status: closed-rejected
Owner: nobody
Priority: 3
Updated: 2021-07-31
Created: 2021-04-30
Creator: msoutopico
Private: No
Labels: encoding utf8 glossary unicode

Preconditions

  • You have installed OmegaT 5.4.4 (Windows 10)
  • The option View > Mark glossary matches is ticked.

Steps to reproduce

  1. Create a project with es as source language and en as target language
  2. Add a source text file containing text with accents, e.g. Añade el término "traducción" al glosario.
  3. Create a new entry in the glossary, with traducción as the source term and translation (or whatever) as the target term.

Expected results

  1. The term pair appears in the glossary pane while that segment is active.
  2. The source term traducción is underlined in the segment.
  3. If you open the glossary, you can see row traducción TAB translation
  4. The glossary.txt file has UTF-8 encoding.

Actual results

  1. The term pair does not appear in the glossary pane while that segment is active.
  2. The source term traducción is not underlined in the segment.
  3. If you open the glossary, you can see row traducci?n TAB translation (a question mark instead of ó)
  4. The glossary.txt file has ASCII encoding.

Watch it in action: https://recordit.co/wyE1Br1XIJ

Comments

I haven't tested it but I assume the same issue would happen if the language combination was inverted.

Related

Bugs: #1046
Feature Requests: #1579

Discussion

  • Kos Ivantsov

    Kos Ivantsov - 2021-04-30

    There was also this report in the users' mailing list:
    https://sourceforge.net/p/omegat/mailman/message/37264390/
    The message had a screenshot attached, but I can't see it there in the web version. I'm attaching it here too.
    It looks like a similar problem, though complicated and made apparent by an attempted git commit. I've experienced the same issue Valter describes in his email (but in EN>UK projects), though I couldn't reliably reproduce it, so I didn't report it.

     

    Last edit: Kos Ivantsov 2021-04-30
  • Anil Duggirala

    Anil Duggirala - 2021-05-05

    I'm on a team project on version 5.4.3 and am experiencing exactly the same issue. It does not appear when translating from en-US to es-CO, only when translating from es-CO to en-US.

     

    Last edit: Anil Duggirala 2021-05-05
  • msoutopico

    msoutopico - 2021-05-06

    The operating system should be added to the preconditions: Windows 10.

     
    • Kos Ivantsov

      Kos Ivantsov - 2021-05-06

      I've experienced the same bug on Linux, but can't really reproduce it.

       
  • Kos Ivantsov

    Kos Ivantsov - 2021-05-06
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -27,3 +27,4 @@
    
     ### Comments
     I haven't tested it but I assume the same issue would happen if the language combination was inverted.
    +Discovered on Windows 10.
    
     
  • Kos Ivantsov

    Kos Ivantsov - 2021-05-06
    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,6 +1,6 @@
     ### Preconditions
    
    -* You have installed OmegaT 5.4.4.
    +* You have installed OmegaT 5.4.4 (Windows 10)
     * The option **View > Mark glossary matches** is ticked.
    
     ### Steps to reproduce
    @@ -27,4 +27,3 @@
    
     ### Comments
     I haven't tested it but I assume the same issue would happen if the language combination was inverted.
    -Discovered on Windows 10.
    
     
  • Hiroshi Miura

    Hiroshi Miura - 2021-06-18

    The GlossaryReaderTSV class uses the util/EncodingDetector utility class. Generally speaking, encoding detection works when a sufficient number of characters is present in the input data.

    When there are only very few characters, detection fails. IMHO, that is what the error 'MalformedInput' length = 1 means.
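
    For illustration, here is a minimal Java sketch of that failure mode. It uses ICU4J's CharsetDetector as a stand-in detector (an assumption made for illustration; the internals of OmegaT's EncodingDetector are not shown in this ticket):

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;
    import java.nio.charset.StandardCharsets;

    public class ShortInputDetection {
        public static void main(String[] args) {
            // A glossary holding one short line: "traducción<TAB>translation".
            // In UTF-8, the "ó" contributes only two non-ASCII bytes (0xC3 0xB3).
            byte[] shortInput = "traducción\ttranslation\n".getBytes(StandardCharsets.UTF_8);

            CharsetDetector detector = new CharsetDetector();
            detector.setText(shortInput);
            // With so few bytes, several encodings fit the byte-frequency
            // statistics almost equally well: confidences are low and the
            // top-ranked guess is not guaranteed to be UTF-8.
            for (CharsetMatch match : detector.detectAll()) {
                System.out.println(match.getName() + " confidence=" + match.getConfidence());
            }
        }
    }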

     

    Last edit: Hiroshi Miura 2021-06-18
  • Anil Duggirala

    Anil Duggirala - 2021-06-18

    I'm on version 5.5.0 now, on Windows 10 Home, and the issue is gone for me. I have tried adding glossary entries in both directions of the en-US/es-CO language pair. It appears to have been fixed in this latest version.
    Thank you,

     
  • msoutopico

    msoutopico - 2021-06-19

    In version 5.5.0 I can reproduce the issue when I follow Hiroshi's steps:

    • create a new OmegaT project
    • create a glossary.txt whose content is just a single space character: echo " " > glossary.txt
    • try to add a term with an accent from the OmegaT GUI: traducción > translation
    • observe glossary.txt: it shows traducci?n in the second row

    I can reproduce it as well if, instead of a space, the first term pair that I add does not have any accented characters (pure ASCII). In both cases I get an ASCII text file.

    However, if the first term that I add to a new glossary already has an accent, the bug is not reproduced and I get a UTF-8 Unicode file.

    In my book, it shouldn't matter what the first characters added to the glossary file are, or whether the amount of characters is big enough. The encoding of the glossary file should be UTF-8 by default, even in the unlikely event that both languages in the project's language pair use only ASCII characters in their alphabets.
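
    As an aside, the traducci?n mojibake is exactly what charset replacement produces. A minimal Java sketch (illustration only, not OmegaT code) of what appending with a misdetected ASCII charset does to the term pair:

    import java.nio.charset.StandardCharsets;

    public class ReplacementDemo {
        public static void main(String[] args) {
            String term = "traducción\ttranslation";

            // If the glossary was misdetected as ASCII and is appended to with
            // that charset, each unmappable character is silently replaced:
            byte[] asAscii = term.getBytes(StandardCharsets.US_ASCII);
            System.out.println(new String(asAscii, StandardCharsets.US_ASCII));
            // -> traducci?n	translation  (the '?' reported in this ticket)

            // An explicit UTF-8 write round-trips correctly:
            byte[] asUtf8 = term.getBytes(StandardCharsets.UTF_8);
            System.out.println(new String(asUtf8, StandardCharsets.UTF_8));
            // -> traducción	translation
        }
    }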

     

    Last edit: msoutopico 2021-06-19
  • Aaron Madlon-Kay

    Nothing relevant to this issue has changed in the last several releases.

    This is the same issue as was discussed on the Telegram channel a bit ago where a file containing ÷ was misinterpreted as a Thai encoding.

    If the encoding of a file is not specified by out-of-band information, it will be automatically detected. The detection algorithm can only be 100% certain of the encoding when there is a definitive indicator like a BOM for Unicode.

    In the absence of a definitive indicator, the algorithm must rely on heuristics. The gist of it is that it looks at the first n bytes of the file, compares to known byte frequency distributions for various encodings, and chooses whatever seems to fit best. This heuristic is very easily fooled by short inputs.

    The only reliable way to avoid this issue is to provide a definitive indicator like a BOM, or for you to explicitly communicate the encoding via out-of-band information. For source files, you can choose the encoding in the file filter settings or via file extension (.utf8 for text files). For the glossary I guess the only option is to use the .utf8 extension.

    Always appending a BOM is not a great solution because other software often doesn't interpret it correctly.
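
    For what it's worth, checking for such a "definitive indicator" is trivial compared to the heuristics. A sketch (not OmegaT's actual code) of testing for a UTF-8 BOM:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class BomCheck {
        // True if the file starts with the UTF-8 byte order mark (EF BB BF),
        // the kind of definitive indicator detection can rely on outright.
        static boolean hasUtf8Bom(Path file) throws IOException {
            try (InputStream in = Files.newInputStream(file)) {
                byte[] head = new byte[3];
                return in.readNBytes(head, 0, 3) == 3
                        && head[0] == (byte) 0xEF
                        && head[1] == (byte) 0xBB
                        && head[2] == (byte) 0xBF;
            }
        }
    }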

     
    • msoutopico

      msoutopico - 2021-06-20

      Thanks, Aaron.

      This is how I see it, from my limited (and perhaps biased) perspective:

      This issue is about the writeable glossary that OmegaT creates when a user adds the first term pair (it doesn't affect read-only glossaries, which are not created by OmegaT and over which the user therefore has control). Since OmegaT creates that writeable glossary file, I think OmegaT can legitimately decide what encoding the file must have to work correctly. UTF-8 is the encoding that accommodates all languages most conveniently, so it makes sense to create the file with that encoding. So why not declare UTF-8 explicitly rather than leaving it to the heuristics, which are clearly problematic?

      It seems you're saying that using a .utf8 extension would let OmegaT read the glossary as UTF-8. If that's the case, couldn't the writeable glossary be glossary.utf8 instead of glossary.txt (maybe with an option in the preferences)? If that's possible, the solution seems easy. I think it's an affordable cost (for Windows users, at least) to have to do right-click > "Open with" to open the file, or even to assign Notepad (or Virtaal, or LO Calc, etc.) to the .utf8 extension. Otherwise, i.e. if signing the file with a BOM is the only way to make it clear that its encoding is UTF-8, then I'd still think it's worth it.

      About the heuristics and their success rate: if I understand correctly, in the absence of a definitive indicator, the encoding of the file must be decided when the file is created, which means that the heuristics really depend on the first term pair added to the new glossary. Thinking of my native languages (Spanish, Galician, Portuguese), it is very likely (~80%) that the first line has no accent, leading the algorithm to conclude that the encoding can be ASCII. Other Latin-script languages might use accents more profusely, making it less likely that the first line is pure ASCII, but making it a bigger problem when that happens (because the problem will affect more words in the rest of the file).

      You mention that the BOM can be a problem for "other software". Indeed, it could be the case that third-party applications need to further process the writeable glossaries (e.g. I might want to collect the terminology that my users have recorded in their projects). However, if the BOM is a problem for those further uses of the glossaries, it shouldn't be very hard for an engineer or a terminologist to remove it if necessary on their end (e.g. just a sed that does s/^\uFEFF// or s/^\xef\xbb\xbf//), and that is certainly much less of a burden than a glossary file with the wrong encoding is for a translator.

      On the other hand, I'm ignorant about the technicalities, but I would expect any modern Unicode-aware software not to choke on this byte order mark...

      In a nutshell, I would say OmegaT should be more concerned with the certain, immediate needs of everyday users of the tool (e.g. translators), and less with potential further processing that third parties might undertake. In my book, a priori, the benefit of declaring UTF-8 explicitly (with a BOM or with the .utf8 extension) outweighs the potential disadvantages.

      Please note that I'm referring only to the glossary, not the source files; I'm not sure the problem is the same there, nor whether OmegaT must necessarily handle the glossary and the source files identically. In any case, I don't see it as a problem for the user (or whoever prepares the project) to select the right encoding for source files, since there is a clickable interface for that (in the file list window) and it's part of their job anyway. However, it is a problem for the average user if the terms they add to the glossary are not recognized. The issue is not necessarily easy to notice, and it's definitely not easy to fix for an average user who knows nothing about file encodings.

      Thanks, and sorry about my long post.

       
      • Hiroshi Miura

        Hiroshi Miura - 2021-06-22

        There is no perfect solution for encoding detection of a vanilla text file; as @amake explains, it is based on heuristics.

        Emacs uses a hack: an encoding signature added as a comment on the first line:
        # -*- coding: utf-8 -*-

        https://github.com/omegat-org/omegat/pull/112 is a simple implementation of that idea.
        It works for me.
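
        For readers unfamiliar with the trick, here is a hypothetical sketch of honouring such a coding cookie on the reading side (the actual code in PR #112 may differ):

        import java.io.BufferedReader;
        import java.nio.charset.Charset;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class MagicComment {
            // Matches an Emacs-style cookie such as "# -*- coding: utf-8 -*-".
            private static final Pattern COOKIE = Pattern.compile("coding:\\s*([\\w.-]+)");

            // Peek at the first line; return the declared charset, or null so
            // the caller can fall back to heuristic detection.
            static Charset fromMagicComment(Path glossary) throws Exception {
                // ISO-8859-1 maps every byte to a char, so peeking is lossless.
                try (BufferedReader r = Files.newBufferedReader(glossary, StandardCharsets.ISO_8859_1)) {
                    String first = r.readLine();
                    if (first != null && first.startsWith("#")) {
                        Matcher m = COOKIE.matcher(first);
                        if (m.find()) {
                            return Charset.forName(m.group(1));
                        }
                    }
                }
                return null;
            }
        }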

         

        Last edit: Hiroshi Miura 2021-06-22
        • Hiroshi Miura

          Hiroshi Miura - 2021-07-01

          Fix merged.

           
          • msoutopico

            msoutopico - 2021-07-01

            Thank you so much for your work, Hiroshi. Is there or will there be a nightly build that I can use to test the fix?

             
        • msoutopico

          msoutopico - 2021-07-01

          Something I haven't understood is why OmegaT needs to "detect" the encoding of the glossary file it creates itself, rather than simply create the file with UTF-8 encoding (defining the encoding explicitly).

           
          • Aaron Madlon-Kay

            You seem to think that a plain text file will somehow "know" its own encoding. This is not the case: with unstructured data there is nowhere for that knowledge to be stored. The encoding of a plain text file is a property of the bytes that are contained within. You can't know that a file was "created with UTF-8" except by inspecting the bytes and determining that they must be UTF-8.

            In the case of UTF-8 it's particularly bad because UTF-8 is a superset of ASCII. Imagine: You add several glossary entries that all happen to fit into ASCII, and the file will be detected as ASCII. Now add another term that contains a single non-ASCII character. What will the result be detected as? Who knows: the file is now 99% ASCII plus perhaps a handful of non-ASCII bytes. If those bytes happen to fit the expected distribution of the wrong encoding, then you're going to get garbage and there's nothing we can do about it.

            Further, OmegaT can't assume that the writable glossary file it creates will still be the same file at any point in the future, because you could have replaced it with any other arbitrary file in the meantime. It must always detect the encoding.
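
            A short Java sketch (illustration only) makes the ASCII/UTF-8 point concrete: ASCII-only entries produce byte-for-byte identical files regardless of which of these encodings "created" them, so no detector can tell them apart.

            import java.nio.charset.StandardCharsets;
            import java.util.Arrays;

            public class IndistinguishableBytes {
                public static void main(String[] args) {
                    String entry = "translation\ttraduccion\n"; // pure ASCII

                    // The same bytes are simultaneously a valid US-ASCII,
                    // UTF-8 and ISO-8859-1 file.
                    byte[] ascii = entry.getBytes(StandardCharsets.US_ASCII);
                    byte[] utf8 = entry.getBytes(StandardCharsets.UTF_8);
                    byte[] latin1 = entry.getBytes(StandardCharsets.ISO_8859_1);

                    System.out.println(Arrays.equals(ascii, utf8));   // true
                    System.out.println(Arrays.equals(ascii, latin1)); // true
                }
            }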

             
            • msoutopico

              msoutopico - 2021-07-06

              Thanks for the clarification. This matter makes a lot more sense to me now.

              Yes, it seems I was making some wrong assumptions. I know one can set the encoding of a string (e.g. Python's str.encode() method, Java seems to have something equivalent: https://www.baeldung.com/java-string-encode-utf-8#core-java) and my wrong assumption was that if you write a UTF-8 string to a text file, that will set the encoding of the file to UTF-8 even if the string does not contain any non-ASCII character.

              On the other hand, my understanding is that in some programming languages (like Python 3) it's possible to use a specific encoding (normally UTF-8) by default for most operations that require an encoding (e.g. reading in a text file without a specified encoding), so they don't try to detect it. I was assuming that would be possible in Java too (not sure this was also a wrong assumption).

              Regarding your last point about the writeable glossary (my ticket was specifically about the writeable glossary), in my opinion some minimum requirements are reasonable in the interest of the greatest good for the greatest number of users. For example, a specific encoding by default for the writeable glossary, which the user should not change.

              Thanks for accepting Hiroshi's fix. I'll test it as soon as I can.

               
              • Hiroshi Miura

                Hiroshi Miura - 2021-07-31

                On the other hand, my understanding is that in some programming languages (like Python 3) it's possible to use a specific encoding (normally UTF-8) by default for most operations that require an encoding (e.g. reading in a text file without a specified encoding), so they don't try to detect it. I was assuming that would be possible in Java too (not sure this was also a wrong assumption).

                Since I'm more of a Python enthusiast and dislike Java :-P, I'd like to leave a comment.

                Setting UTF-8 as the default encoding is not a solution but a mistake.

                For example, Microsoft Windows, which has the largest desktop OS market share, has two default encodings: one is Unicode (UTF-16LE) and the other depends on the language of the Windows installation.

                For some Latin-script languages, such as English, the default encoding is compatible with UTF-8, because UTF-8 is designed to be compatible with ASCII.

                On the other hand, the default encodings for CJK (Chinese, Japanese and Korean, three friendly cultures) are different, for historical reasons.
                The default on Japanese MS Windows is CP932, which is not compatible with UTF-8. If a user opens glossary.txt with Windows' default Notepad editor and saves it, that easily breaks Python's "normal". Is Japanese Windows abnormal? ^^;;

                You can find detailed information and discussion about text encoding here:

                https://www.python.org/dev/peps/pep-0597/

                https://discuss.python.org/t/pep-597-use-utf-8-for-default-text-file-encoding/1819

                As PEP 597 describes:

                Using the default encoding is a common mistake

                Developers using macOS or Linux may forget that the default encoding is not always UTF-8.
                (snip)
                Even Python experts may assume that the default encoding is UTF-8. This creates bugs that only happen on Windows;

                This is the background for why OmegaT doesn't use the default encoding for text.
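
                For what it's worth, the same pitfall exists in Java. A small sketch (illustration only, not OmegaT code):

                import java.io.BufferedReader;
                import java.io.FileReader;
                import java.nio.charset.Charset;
                import java.nio.charset.StandardCharsets;
                import java.nio.file.Files;
                import java.nio.file.Path;

                public class DefaultCharsetPitfall {
                    public static void main(String[] args) throws Exception {
                        // Typically UTF-8 on macOS/Linux; on Japanese Windows
                        // (before JDK 18's JEP 400) it is windows-31j, i.e. CP932.
                        System.out.println(Charset.defaultCharset());

                        Path glossary = Files.createTempFile("glossary", ".txt");
                        Files.write(glossary, "traducción\ttranslation\n".getBytes(StandardCharsets.UTF_8));

                        // Pitfall: FileReader without a charset decodes with the
                        // platform default, so this reads differently per machine.
                        try (BufferedReader r = new BufferedReader(new FileReader(glossary.toFile()))) {
                            System.out.println(r.readLine()); // mojibake on CP932 systems
                        }

                        // Portable: state the charset explicitly.
                        try (BufferedReader r = Files.newBufferedReader(glossary, StandardCharsets.UTF_8)) {
                            System.out.println(r.readLine());
                        }
                    }
                }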

                 

                Last edit: Hiroshi Miura 2021-07-31
  • Aaron Madlon-Kay

    • status: open --> closed-rejected
     
  • Aaron Madlon-Kay

    Closing because the behavior is working as intended within the limits of encoding detection; workarounds (such as the .utf8 extension) are described above.

    New glossaries created by OmegaT will have magic comments, so I consider the issue to be adequately addressed.

     

    Related

    Feature Requests: #1579

    • Hiroshi Miura

      Hiroshi Miura - 2021-07-31

      We can also set per-environment fallback defaults for when there are not enough characters for detection:

      • for macOS and Linux, fall back to UTF-8
      • for Windows, the encoding depends on the locale.
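
      A sketch of what such a fallback could look like (hypothetical; not what OmegaT currently does):

      import java.nio.charset.Charset;
      import java.nio.charset.StandardCharsets;

      public class FallbackEncoding {
          // Hypothetical fallback for when detection is inconclusive: assume
          // UTF-8 on macOS/Linux and the locale-dependent charset on Windows.
          // On JDK 18+ the locale charset is exposed via the "native.encoding"
          // system property; on older JDKs Charset.defaultCharset() matches it.
          static Charset fallbackCharset() {
              String os = System.getProperty("os.name").toLowerCase();
              if (os.contains("windows")) {
                  String nativeEnc = System.getProperty("native.encoding");
                  return nativeEnc != null ? Charset.forName(nativeEnc)
                                           : Charset.defaultCharset(); // e.g. windows-31j
              }
              return StandardCharsets.UTF_8;
          }
      }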
       
      • msoutopico

        msoutopico - 2021-08-02

        Hi Hiroshi,

        From what you say, I understand that OmegaT cannot have a default encoding on Windows because Microsoft doesn't use the Unicode standard for the Japanese version of its OS. I'm ignorant in this regard; I don't understand why Microsoft doesn't use a Unicode-compatible encoding for Japanese.

        However, wouldn't all this hassle be avoided if the writeable glossary had the .utf8 extension? Why isn't that an option?

        In any case, thanks for your insight about encodings in Japanese Windows. The UTF-8 default sounds good as a fallback plan!

        Cheers, Manuel


  • Marco Cevoli

    Marco Cevoli - 2021-07-16

    Closing this issue is frankly not very useful. If you let OmegaT create a glossary, as reported, after a while the glossary doesn't show the non-ASCII characters. It seems like a simple bug to me. That nobody is able to fix it is another story, but closing it with "behaviour is working as intended" is laughable.

     
