A very common query on the user group list relates to input file encodings. For users unfamiliar with the issue, it can be very difficult to determine the correct encoding, let alone change the input files' encoding or set up OmegaT correctly.
Mozilla has made available a Java port of their charset detection code under GPL2 (and others):
https://code.google.com/p/juniversalchardet/
With this we can perform best-effort detection when the user has not specified an encoding and when the filter allows it.
I have a patch ready for this and am waiting for the next beta branch.
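To illustrate what "best-effort detection" means here, below is a minimal sketch. It does not use juniversalchardet and is not the patch: it is a standard-library-only stand-in that checks for a byte-order mark and then tries a strict UTF-8 decode. The class name EncodingGuess, the method guess() and the fallback parameter are hypothetical names made up for the example, and it assumes the whole file is passed in (a truncated sample could cut a multi-byte sequence in half).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not OmegaT code and not juniversalchardet.
final class EncodingGuess {

    /** Best-effort guess: BOM first, then strict UTF-8, else the given fallback. */
    static Charset guess(byte[] data, Charset fallback) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;   // UTF-8 byte-order mark
        }
        if (startsWith(data, 0xFE, 0xFF) || startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16;  // the UTF-16 decoder reads the BOM itself
        }
        if (isValidUtf8(data)) {
            return StandardCharsets.UTF_8;   // also covers plain ASCII
        }
        return fallback;                     // e.g. a legacy single-byte encoding
    }

    /** True if the bytes decode as UTF-8 with no malformed sequences. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A statistical detector such as the Mozilla one goes further than this and can tell legacy encodings apart from each other, which this sketch cannot do; it only separates "looks like Unicode" from "something else".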
This is great.
Sadly however, it won't solve what I believe is a significant part of the issues (I had one again today): the source text is in English (so automatic detection will give something like Windows 1252) and the target is garbage, because the target language (e.g., Chinese) is not compatible with the source encoding.
Didier
Wouldn't it be possible for OmegaT to know that Chinese is not covered by Windows 1252 and force the target encoding to UTF-8, for example?
You're right, this doesn't solve that, but they're unrelated problems. We're not using the input encoding as output encoding; we're using the system's default encoding (likely equivalent to Charset.defaultCharset()) as output encoding.
There are only two solutions that will fix Windows:
1. Maintain a table matching languages to encodings
2. Use Unicode
To me it's a no-brainer to use Unicode.
If it's important that the output docs open cleanly in Notepad.exe then we could use UTF-16 for Windows and UTF-8 everywhere else.
There's still the issue of e.g. inserting Unicode Chinese into an HTML file with a meta tag specifying Windows-1252 or whatever. That has to be handled on the filter level, and will still rely on us choosing an encoding that supports Chinese (so might as well choose Unicode).
Note that I'm only talking about the fallback for when the user has not specified an output encoding.
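For reference, the check Didier asked about does not strictly need a language-to-encoding table: Java can report whether a given charset is able to encode a given piece of text. A minimal sketch, assuming we have the translated text in hand before writing; OutputCharsetFallback and chooseOutputCharset() are hypothetical names for illustration, not existing OmegaT code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not OmegaT code.
final class OutputCharsetFallback {

    /**
     * Keeps the preferred charset (e.g. Charset.defaultCharset()) only if it
     * can represent every character of the text; otherwise picks UTF-8.
     */
    static Charset chooseOutputCharset(CharSequence text, Charset preferred) {
        // Charset.canEncode() is false for decode-only charsets,
        // for which newEncoder() would throw.
        if (preferred.canEncode() && preferred.newEncoder().canEncode(text)) {
            return preferred;
        }
        return StandardCharsets.UTF_8;
    }
}
```

With ISO-8859-1 as the preferred charset this keeps ISO-8859-1 for plain English text and switches to UTF-8 as soon as the text contains Chinese. It means the output encoding can vary from file to file depending on content, which is one more reason why simply always using Unicode as the fallback looks simpler.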
Last edit: Aaron Madlon-Kay 2015-03-24
It's distribution/configuration-dependent, but, nowadays, UTF-8 is used in most cases indeed.
Notepad works fine with UTF-8.
For the XML filters' source encoding, we apply the XML rules (use the declared charset; if there isn't one, default to UTF-8), not the OS encoding.
Didier
OK, great, then special treatment isn't needed.
The XML filter overrides the areas I'm changing anyway, so it is unaffected. Even if it used the new logic the behavior would not change (I don't autodetect or default to UTF-8 unless the user has not specified an encoding AND the filter has is{Source,Target}EncodingVariable() == true; XML filter has {false,true}).
The input encoding detection is now in trunk, r7096.
I will make a separate ticket for the output encoding issue.
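The condition described above can be sketched roughly as follows. This is an illustration of the rule, not the code committed in trunk: EncodingPolicy, resolve() and all four parameters are hypothetical, and falling back to UTF-8 when detection yields nothing is an assumption made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the rule, not the actual OmegaT implementation.
final class EncodingPolicy {

    /**
     * @param userChoice       encoding set by the user, or null if unspecified
     * @param encodingVariable what the filter reports for
     *                         is{Source,Target}EncodingVariable()
     * @param detected         detection result for a source file, or null
     *                         (always null when resolving a target encoding)
     * @param filterOwn        whatever the filter would use on its own
     */
    static Charset resolve(Charset userChoice, boolean encodingVariable,
                           Charset detected, Charset filterOwn) {
        if (userChoice != null) {
            return userChoice;   // an explicit setting always wins
        }
        if (!encodingVariable) {
            return filterOwn;    // filter does not allow changing it
        }
        // User left it unspecified and the filter allows it:
        // use the detected encoding, else default to UTF-8 (assumption).
        return detected != null ? detected : StandardCharsets.UTF_8;
    }
}
```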
Implemented in the released version 3.4 of OmegaT.
Didier