Create a project with es as the source language and en as the target language. Add the term "traducción" to the glossary, with traducción as the source term and translation (or whatever) as the target term.

Expected behavior:
- traducción is underlined in the segment.
- The glossary file contains traducción TAB translation.
- The glossary.txt file has UTF-8 encoding.

Actual behavior:
- traducción is not underlined in the segment.
- The glossary file contains traducci?n TAB translation (a question mark instead of ó).
- The glossary.txt file has ASCII encoding.

Watch it in action: https://recordit.co/wyE1Br1XIJ
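The mangled output in the actual behavior is exactly what forcing non-ASCII text through an ASCII encoder produces. A minimal Python illustration (this is not OmegaT's code, just a sketch of the symptom):

```python
# Minimal illustration (not OmegaT's code): encoding the glossary line
# as ASCII with replacement turns the ó into a question mark, while
# UTF-8 preserves it as the two-byte sequence \xc3\xb3.
entry = "traducción\ttranslation"

ascii_bytes = entry.encode("ascii", errors="replace")
print(ascii_bytes)  # b'traducci?n\ttranslation'

utf8_bytes = entry.encode("utf-8")
print(utf8_bytes)   # b'traducci\xc3\xb3n\ttranslation'
```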
I haven't tested it but I assume the same issue would happen if the language combination was inverted.
There was also this report in the users' mailing list:
https://sourceforge.net/p/omegat/mailman/message/37264390/
The message had a screenshot attached, but I can't see it there in the web version. I'm attaching it here too.
It looks like a similar problem, though more complicated, and made apparent by an attempted git commit. I've experienced the same issue Valter describes in his email (but in EN>UK projects), though I couldn't reliably reproduce it, so I didn't report it.
Last edit: Kos Ivantsov 2021-04-30
I'm on a Team project on version 5.4.3 and am experiencing exactly the same issue. The problem does not appear when translating from en-US to es-CO, only when translating from es-CO to en-US.
Last edit: Anil Duggirala 2021-05-05
The operating system should be added to the preconditions: Windows 10.
I've experienced the same bug on Linux, but can't really reproduce it.
The GlossaryReaderTSV class uses the util/EncodingDetector utility class.
Generally speaking, encoding detection works when there is a sufficient amount of characters in the input data.
When there are only a few characters, detection will fail.
IMHO, the error 'MalformedInput' length = 1 indicates this.
Last edit: Hiroshi Miura 2021-06-18
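Hiroshi's point about short inputs can be illustrated in Python (a sketch; OmegaT's detector is Java, but the ambiguity is the same): the same two bytes decode without error under several encodings, so with very little data a detector can only guess.

```python
# The UTF-8 bytes for "ó" also decode cleanly in single-byte encodings,
# just to different (wrong) characters -- nothing in the bytes themselves
# says which interpretation is correct.
data = "ó".encode("utf-8")      # b'\xc3\xb3'
print(data.decode("utf-8"))      # ó
print(data.decode("latin-1"))    # Ã³
print(data.decode("cp1252"))     # Ã³
```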
I'm on version 5.5.0 now on Windows 10 Home. The issue no longer occurs for me. I have tried adding glossary entries in both directions of the en-US / es-CO language pair. It appears to have been fixed in this latest version.
thank you,
In version 5.5.0 I can reproduce the issue when I follow Hiroshi's steps:

echo " " > glossary.txt

then add traducción > translation and get traducci?n in the second row.

I can reproduce it as well if, instead of a space, the first term pair that I add does not contain any accented characters (pure ASCII). In both cases I get an ASCII text file.
However, if the first term that I add to a new glossary already has an accent, then the bug is not reproduced and I get a UTF-8 Unicode file.
In my book, it shouldn't matter what the first characters added to the glossary file are, or whether the number of characters is big enough. The encoding of the glossary file should be UTF-8 by default, even in the unlikely event that the two languages in the project's language pair use only ASCII characters in their alphabets.
Last edit: msoutopico 2021-06-19
Nothing relevant to this issue has changed in the last several releases.
This is the same issue as was discussed on the Telegram channel a while ago, where a file containing ÷ was misinterpreted as a Thai encoding.

If the encoding of a file is not specified by out-of-band information, it will be automatically detected. The detection algorithm can only be 100% certain of the encoding when there is a definitive indicator, like a BOM for Unicode.

In the absence of a definitive indicator, the algorithm must rely on heuristics. The gist is that it looks at the first n bytes of the file, compares them to known byte-frequency distributions for various encodings, and chooses whatever seems to fit best. This heuristic is very easily fooled by short inputs.
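A toy sketch of this kind of heuristic (the function name and candidate list are invented for illustration; this is not OmegaT's actual algorithm, which compares real byte-frequency distributions): try candidate encodings and prefer the one whose decoded text looks most plausible.

```python
# Toy encoding detector (invented for illustration, not OmegaT's code).
# Decoding failures rule an encoding out; among the survivors we crudely
# prefer the decode that produces the fewest characters outside Latin
# script. With only a byte or two, several candidates survive and the
# choice comes down to guesswork.
CANDIDATES = ["utf-8", "cp1252", "tis-620"]  # tis-620 is a Thai encoding

def guess_encoding(data):
    best, best_score = None, float("-inf")
    for enc in CANDIDATES:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        # Crude score: fewer non-Latin characters looks "more plausible".
        score = -sum(1 for ch in text if ord(ch) > 0xFF)
        if score > best_score:
            best, best_score = enc, score
    return best

print(guess_encoding("traducción".encode("utf-8")))  # utf-8
print(guess_encoding(b"\xf7"))  # a single byte: several encodings fit
```

Note that for the single byte b"\xf7" ('÷' in cp1252) both cp1252 and tis-620 decode successfully; which one wins depends entirely on the scoring details, which is exactly why short inputs are unreliable.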
The only reliable way to avoid this issue is to provide a definitive indicator like a BOM, or to explicitly communicate the encoding via out-of-band information. For source files, you can choose the encoding in the file filter settings or via the file extension (.utf8 for text files). For the glossary, I guess the only option is to use the .utf8 extension.

Always appending a BOM is not a great solution because other software often doesn't interpret it correctly.
Thanks, Aaron.
This is how I see it from my limited (and perhaps biased) perspective:

This issue is about the writeable glossary that OmegaT creates when a user adds the first term pair (it doesn't affect read-only glossaries, which are not created by OmegaT, so the user has control over them). Since OmegaT creates that writeable glossary file, I think OmegaT can legitimately decide what encoding the file must have to make it work correctly. UTF-8 is the encoding that accommodates all languages most conveniently, so it makes sense to create the file with that encoding. So why not declare UTF-8 explicitly rather than leaving it to the heuristics, which are clearly problematic?
It seems you're saying that using an .utf8 extension would let OmegaT read the glossary as UTF-8. If that's the case, couldn't the writeable glossary be glossary.utf8 instead of glossary.txt (maybe with an option in the preferences)? If that's possible, the solution seems easy. I think it's an affordable cost (for Windows users, at least) to have to do right-click > "Open with" to open the file, or even to assign Notepad (or Virtaal, or LO Calc, etc.) to the .utf8 extension. Otherwise, i.e. if signing the file with a BOM is the only way to make it clear that its encoding is UTF-8, then I'd still think it's worth it.

About the heuristics and their success rate: if I understand correctly, in the absence of a definitive indicator, the encoding of the file must be decided when the file is created, which means that the heuristics really depend on the first term pair added to the new glossary. If I think of my native languages (Spanish, Galician, Portuguese), it is very likely (~80%) that the first line does not have any accent, leading the algorithm to think that the encoding can be ASCII. Other Latin-script languages might use accents more profusely, making it less likely that the first line is pure ASCII, but making it a bigger problem when that happens (because the problem will affect more words in the rest of the file).
You mention the BOM can be a problem for "other software". Indeed, it could be the case that third-party applications need to further process the writeable glossaries (e.g. I might want to collect the terminology that my users have recorded in their projects). However, if the BOM is a problem for those further uses of the glossaries, it shouldn't be very problematic for an engineer or terminologist to remove the BOM if that's necessary on their end (e.g. just a sed that does s/^\uFEFF// or s/^\xef\xbb\xbf//), or at least much less problematic than it is for a translator to deal with a glossary file in the wrong encoding. On the other hand, I'm ignorant about the technicalities, but I would expect any modern Unicode-aware software not to choke on this byte order mark...
In a nutshell, I would say OmegaT should be more concerned with the certain, immediate needs of everyday users of the tool (e.g. translators), and less with potential further processing that third parties might undertake. In my book, a priori, the benefit of declaring UTF-8 explicitly (with a BOM or with the .utf8 extension) outweighs the potential disadvantages.

Please note that I'm referring only to the glossary, not the source files; I'm not sure the problem is the same there, or that OmegaT must necessarily handle the glossary and the source files identically. In any case, I don't see it as a problem for the user (or whoever prepares the project) to select the right encoding for source files, since there is a clickable interface for that (in the file list window) and it's part of their job anyway. However, it is a problem for the average user if the terms they add to the glossary are not recognized. The issue is not necessarily easy to notice, and it's definitely not easy to fix for an average user who knows nothing about file encodings.
Thanks, and sorry about my long post.
There is no perfect solution for encoding detection of a vanilla text file; as @amake explains, it is based on heuristics.
Emacs uses a hack: adding an encoding signature as a comment on the first line:
# -*- coding: utf-8 -*-
https://github.com/omegat-org/omegat/pull/112 is a simple implementation of this approach.
It works for me.
Last edit: Hiroshi Miura 2021-06-22
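The Emacs-style signature can be recognized with a small regex. A sketch of the idea (the function name is mine, and the actual implementation in the pull request may well differ):

```python
import re

# Matches Emacs-style "-*- coding: utf-8 -*-" signatures, as used by the
# magic-comment approach mentioned above.
MAGIC = re.compile(r"-\*-\s*coding:\s*([-\w.]+)\s*-\*-")

def encoding_from_magic_comment(first_line):
    """Return the declared encoding, or None if no signature is present."""
    m = MAGIC.search(first_line)
    return m.group(1) if m else None

print(encoding_from_magic_comment("# -*- coding: utf-8 -*-"))  # utf-8
print(encoding_from_magic_comment("traducción\ttranslation"))  # None
```

The appeal of this scheme is that the signature is ordinary ASCII text, so it survives any editor and never requires the detector to guess.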
Fix merged.
Thank you so much for your work, Hiroshi. Is there or will there be a nightly build that I can use to test the fix?
https://sourceforge.net/projects/omegat/files/Nightly/
Something I haven't understood is why OmegaT needs to "detect" the encoding of the glossary file it creates itself, rather than simply create the file with UTF-8 encoding (defining the encoding explicitly).
You seem to think that a plain text file will somehow "know" its own encoding. This is not the case: with unstructured data there is nowhere for that knowledge to be stored. The encoding of a plain text file is a property of the bytes that are contained within. You can't know that a file was "created with UTF-8" except by inspecting the bytes and determining that they must be UTF-8.
In the case of UTF-8 it's particularly bad because UTF-8 is a superset of ASCII. Imagine: You add several glossary entries that all happen to fit into ASCII, and the file will be detected as ASCII. Now add another term that contains a single non-ASCII character. What will the result be detected as? Who knows: the file is now 99% ASCII plus perhaps a handful of non-ASCII bytes. If those bytes happen to fit the expected distribution of the wrong encoding, then you're going to get garbage and there's nothing we can do about it.
Further, OmegaT can't assume that the writeable glossary file it creates will still be the same file at any point in the future, because you could have replaced it with any other arbitrary file in the meantime. It must always detect the encoding.
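Aaron's point that a plain text file carries no record of its encoding can be checked directly. A small Python sketch (the file is a throwaway temp file, not an actual OmegaT glossary):

```python
import os
import tempfile

# Write an ASCII-only glossary line "with UTF-8": the resulting bytes are
# indistinguishable from a file written with ASCII encoding, because
# UTF-8 is a superset of ASCII and nothing else is stored in the file.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write("term\ttranslation\n")
with open(path, "rb") as f:
    data = f.read()
os.remove(path)

print(data)  # b'term\ttranslation\n' -- no trace of "UTF-8" anywhere
```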
Thanks for the clarification. This matter makes a lot more sense to me now.
Yes, it seems I was making some wrong assumptions. I know one can set the encoding of a string (e.g. Python's str.encode() method; Java seems to have something equivalent: https://www.baeldung.com/java-string-encode-utf-8#core-java), and my wrong assumption was that if you write a UTF-8 string to a text file, that sets the encoding of the file to UTF-8 even if the string does not contain any non-ASCII characters.

On the other hand, my understanding is that in some programming languages (like Python 3) it's possible to use a specific encoding (normally UTF-8) by default for most operations that require an encoding (e.g. reading in a text file without a specified encoding), so they don't try to detect it. I was assuming that would be possible in Java too (not sure whether this was also a wrong assumption).
Regarding your last point about the writeable glossary (my ticket was specifically about the writeable glossary), in my opinion some minimum requirements are reasonable in the interest of the greatest good for the greatest number of users. For example, a specific encoding by default for the writeable glossary, which the user should not change.
Thanks for accepting Hiroshi's fix. I'll test it as soon as I can.
Because I'm more of a Python enthusiast and dislike Java :-P , I'd like to leave a comment.
Setting UTF-8 as the default encoding "normally" is not a solution but a mistake.
For example, Microsoft Windows, which has the largest market share among desktop OSes, has two default encodings: one is Unicode (UTF-16LE) and the other depends on the language of the Windows installation.
The default encoding for some Latin-script languages, such as English, is compatible with UTF-8 because UTF-8 is designed to be compatible with ASCII.
On the other hand, the default encoding for CJK (Chinese, Japanese, and Korean, three friendly cultures) is different for historical reasons.
The default on Japanese Windows is CP932, which is not compatible with UTF-8. If a user opens glossary.txt with the default Windows Notepad editor and saves it, that easily breaks the assumption of a "normal" encoding. Is Japanese Windows abnormal? ^^;;

You can find detailed information and discussion about text encoding here:
https://www.python.org/dev/peps/pep-0597/
https://discuss.python.org/t/pep-597-use-utf-8-for-default-text-file-encoding/1819
As PEP-597 describes, this is the background for why OmegaT doesn't rely on a default encoding for text.
Last edit: Hiroshi Miura 2021-07-31
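The platform-dependent default that Hiroshi and PEP 597 describe is visible from Python itself (the output varies by system; on Japanese Windows it would be cp932, on most modern Linux systems UTF-8):

```python
import locale

# open() without an explicit encoding= argument falls back on this
# locale-derived value, which is why "just use the default" behaves
# differently on Japanese Windows than on a typical Linux box.
default = locale.getpreferredencoding(False)
print(default)  # platform-dependent, e.g. 'cp932' or 'UTF-8'
```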
Closing because the behavior is working as intended within the limits of encoding detection. Workarounds:
- Use the .utf8 extension.
New glossaries created by OmegaT will have magic comments, so I consider the issue to be adequately addressed.
Related
Feature Requests:
#1579

We can also set some environment-based fallback defaults for when there are not enough characters to detect the encoding.
Hi Hiroshi,
From what you say I understand that OmegaT cannot have a default encoding
in Windows because Microsoft doesn't use the Unicode standard for the
Japanese version of its OS. I'm ignorant in this regard, I don't know and I
don't understand why Microsoft doesn't use a Unicode-compatible encoding
for Japanese.
However, wouldn't all this hassle be avoided if the writeable glossary had
the .utf8 extension? Why isn't that an option?
In any case, thanks for your insight about encodings in Japanese Windows.
The UTF-8 default sounds good as a fallback plan!
Cheers, Manuel
On Sat, 31 Jul 2021 at 05:47, Hiroshi Miura miurahr9@users.sourceforge.net
wrote:
Related
Bugs:
#1046

Closing this issue is frankly not really useful. If you let OmegaT create a glossary, as reported, after a while the glossary doesn't show the non-ASCII characters. It seems a simple bug to me. Another story is that no one is able to fix it, but closing it with "behaviour is working as intended" is laughable.