#1075 Automatically detect input file encoding when possible

3.4
closed-fixed
None
5
2015-04-22
2015-03-23
No

A very common query on the user group list relates to input file encodings. For users unfamiliar with the issue, it can be very difficult to determine the correct encoding, let alone change the input files' encoding or set up OmegaT correctly.

Mozilla has made available a Java port of their charset detection code under GPL2 (and others):
https://code.google.com/p/juniversalchardet/

With this we can perform best-effort detection when the user has not specified an encoding and when the filter allows it.
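
The precedence described above (user setting first, then detection if the filter allows it, then a fallback) could be sketched roughly as follows. This is only an illustration, not the actual patch: the class and method names are hypothetical, and the detector here is a stand-in that recognizes nothing but Unicode byte-order marks, where the real change would call juniversalchardet instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; none of these names come from OmegaT or juniversalchardet.
final class EncodingChooser {

    /**
     * Stand-in detector: recognizes only UTF-8 and UTF-16 byte-order marks.
     * Returns null when no BOM is found. Simplified: it does not try to
     * tell a UTF-32 BOM apart from a UTF-16LE one.
     */
    static Charset detectByBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /**
     * Best-effort choice of input encoding:
     * 1. an encoding explicitly set by the user always wins;
     * 2. otherwise, if the filter allows it, try detection;
     * 3. otherwise (or if detection finds nothing) use the fallback.
     */
    static Charset chooseInputCharset(Charset userChoice,
                                      boolean filterAllowsDetection,
                                      byte[] head,
                                      Charset fallback) {
        if (userChoice != null) {
            return userChoice;
        }
        if (filterAllowsDetection) {
            Charset detected = detectByBom(head);
            if (detected != null) {
                return detected;
            }
        }
        return fallback;
    }
}
```

The point of the sketch is that detection never overrides an explicit user setting and is skipped entirely for filters that determine the encoding themselves.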

Discussion

  • Aaron Madlon-Kay

    I have a patch ready for this and am waiting for the next beta branch.

     
  • Didier Briel

    Didier Briel - 2015-03-23

    This is great.

    Sadly however, it won't solve what I believe is a significant part of the issues (I had one again today): the source text is in English (so automatic detection will give something like Windows 1252) and the target is garbage, because the target language (e.g., Chinese) is not compatible with the source encoding.

    Didier

     
    • Jean-Christophe Helary

      Sadly however, it won't solve what I believe is a significant part of the issues (I had one again today): the source text is in English (so automatic detection will give something like Windows 1252) and the target is garbage, because the target language (e.g., Chinese) is not compatible with the source encoding.

      Wouldn't it be possible for OmegaT to know that Chinese is not covered by Windows 1252 and force the target encoding to UTF-8, for example?

       
    • Aaron Madlon-Kay

      You're right, this doesn't solve that, but they're unrelated problems. We're not using the input encoding as output encoding; we're using the system's default encoding (likely equivalent to Charset.defaultCharset()) as output encoding.

      • OS X: This appears to always be UTF-8, so no problem.
      • Linux: Ubuntu 14.04 reports UTF-8 as well.
      • Windows: This depends on the OS's current language, and will cause problems whenever the translation target language is not the OS language.

      There are only two solutions that will fix Windows:
      1. Maintain a table matching languages to encodings
      2. Use Unicode

      To me it's a no-brainer to use Unicode.

      If it's important that the output docs open cleanly in Notepad.exe then we could use UTF-16 for Windows and UTF-8 everywhere else.

      There's still the issue of e.g. inserting Unicode Chinese into an HTML file with a meta tag specifying Windows-1252 or whatever. That has to be handled on the filter level, and will still rely on us choosing an encoding that supports Chinese (so might as well choose Unicode).

      Note that I'm only talking about the fallback for when the user has not specified an output encoding.
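
The mismatch discussed in this comment can be checked with the standard Java charset API. The following is a minimal sketch, assuming a hypothetical helper class; it uses ISO-8859-1 in place of Windows-1252 because only the standard charsets are guaranteed to exist on every JVM. It is not OmegaT code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class OutputEncodingCheck {

    /** True if the charset supports encoding and can represent every character of text. */
    static boolean canEncode(Charset cs, String text) {
        return cs.canEncode() && cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // two Chinese characters

        // The platform default varies by OS and configuration, as noted above.
        Charset platform = Charset.defaultCharset();
        System.out.println(platform + " can encode the target text: "
                + canEncode(platform, chinese));

        // A Western single-byte encoding cannot represent Chinese; UTF-8 can.
        System.out.println("ISO-8859-1: " + canEncode(StandardCharsets.ISO_8859_1, chinese));
        System.out.println("UTF-8: " + canEncode(StandardCharsets.UTF_8, chinese));
    }
}
```

A Unicode encoding passes this check for any target language, which is why it avoids the need for a table matching languages to encodings.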

       

      Last edit: Aaron Madlon-Kay 2015-03-24
  • Didier Briel

    Didier Briel - 2015-03-24

    Linux: Ubuntu 14.04 reports UTF-8 as well.

    It depends on the distribution and configuration but, nowadays, UTF-8 is indeed used in most cases.

    If it's important that the output docs open cleanly in Notepad.exe then we could use UTF-16 for Windows and UTF-8 everywhere else.

    Notepad works fine with UTF-8.

    For XML filters source encoding, we're applying the XML rules (use charset, if there isn't one, default to UTF-8), not the OS encoding.

    Didier

     
  • Aaron Madlon-Kay

    Notepad works fine with UTF-8.

    OK, great, then special treatment isn't needed.

    For XML filters source encoding, we're applying the XML rules (use charset, if there isn't one, default to UTF-8), not the OS encoding.

    The XML filter overrides the areas I'm changing anyway, so it is unaffected. Even if it used the new logic the behavior would not change (I don't autodetect or default to UTF-8 unless the user has not specified an encoding AND the filter has is{Source,Target}EncodingVariable() == true; XML filter has {false,true}).

     
  • Aaron Madlon-Kay

    • status: open --> open-fixed
     
  • Aaron Madlon-Kay

    The input encoding detection is now in trunk, r7096.

    I will make a separate ticket for the output encoding issue.

     
  • Didier Briel

    Didier Briel - 2015-04-22

    Implemented in the released version 3.4 of OmegaT.

    Didier

     
  • Didier Briel

    Didier Briel - 2015-04-22
    • status: open-fixed --> closed-fixed
     