OmegaT - multiplatform CAT tool / Feature Requests / #1075 Automatically detect input file encoding when possible

Aaron Madlon-Kay - 2015-03-23

I have a patch ready for this and am waiting for the next beta branch.


Didier Briel - 2015-03-23

This is great.

Sadly however, it won't solve what I believe is a significant part of the issues (I had one again today): the source text is in English (so automatic detection will give something like Windows 1252) and the target is garbage, because the target language (e.g., Chinese) is not compatible with the source encoding.

Didier

- Jean-Christophe Helary - 2015-03-23
  
  Sadly however, it won't solve what I believe is a significant part of the issues (I had one again today): the source text is in English (so automatic detection will give something like Windows 1252) and the target is garbage, because the target language (e.g., Chinese) is not compatible with the source encoding.
  
  Wouldn't it be possible for OmegaT to know that Chinese is not covered by Windows-1252 and force the target encoding to UTF-8, for example?
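
  One way to approximate the check asked about here, as a minimal sketch rather than anything OmegaT is confirmed to do: instead of a language-to-encoding table, ask the candidate charset whether it can encode the translated text and fall back to UTF-8 when it cannot. The class and method names are hypothetical; only the `java.nio.charset` calls are standard.

  ```java
  import java.nio.charset.Charset;
  import java.nio.charset.StandardCharsets;

  // Hypothetical helper, not part of OmegaT.
  class TargetEncodingFallback {

      /**
       * Returns {@code preferred} if it can represent every character of
       * {@code text}; otherwise returns UTF-8, which can encode any valid
       * Unicode text.
       */
      static Charset chooseOutputCharset(String text, Charset preferred) {
          // Some charsets are decode-only, so check canEncode() first.
          if (preferred.canEncode() && preferred.newEncoder().canEncode(text)) {
              return preferred;
          }
          return StandardCharsets.UTF_8;
      }

      public static void main(String[] args) {
          Charset cp1252 = Charset.forName("windows-1252");
          // English text fits in Windows-1252; Chinese text does not.
          System.out.println(chooseOutputCharset("Hello", cp1252));
          System.out.println(chooseOutputCharset("\u4F60\u597D", cp1252));
      }
  }
  ```

  This inspects the actual target text rather than the target language, so it needs the whole translation before the output encoding can be chosen.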
  
- Aaron Madlon-Kay - 2015-03-24
  
  You're right, this doesn't solve that, but they're unrelated problems. We're not using the input encoding as output encoding; we're using the system's default encoding (likely equivalent to Charset.defaultCharset()) as output encoding.
  
  OS X: This appears to always be UTF-8, so no problem.
  
  Linux: Ubuntu 14.04 reports UTF-8 as well.
  
  Windows: This depends on the OS's current language, and will cause problems whenever the translation target language is not the OS language.
  
  There are only two solutions that will fix Windows:
  1. Maintain a table matching languages to encodings
  2. Use Unicode
  
  To me it's a no-brainer to use Unicode.
  
  If it's important that the output docs open cleanly in Notepad.exe then we could use UTF-16 for Windows and UTF-8 everywhere else.
  
  There's still the issue of e.g. inserting Unicode Chinese into an HTML file with a meta tag specifying Windows-1252 or whatever. That has to be handled on the filter level, and will still rely on us choosing an encoding that supports Chinese (so might as well choose Unicode).
  
  Note that I'm only talking about the fallback for when the user has not specified an output encoding.
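
  To illustrate why a non-Unicode system default produces garbage for a target language it does not cover, here is a small sketch (hypothetical class name, standard JDK calls only). It encodes a string with a given charset and decodes it again, roughly what happens when a target file is written and later reopened.

  ```java
  import java.nio.charset.Charset;
  import java.nio.charset.StandardCharsets;

  // Hypothetical demo class, not OmegaT code.
  class DefaultCharsetDemo {

      /** Encodes text with cs and decodes the bytes again with the same charset. */
      static String roundTrip(String text, Charset cs) {
          // String.getBytes(Charset) silently replaces unmappable characters
          // with the charset's replacement byte instead of failing.
          return new String(text.getBytes(cs), cs);
      }

      public static void main(String[] args) {
          // What the JVM considers the platform default on this machine.
          System.out.println("Default charset: " + Charset.defaultCharset());

          String chinese = "\u4F60\u597D";
          Charset cp1252 = Charset.forName("windows-1252");

          // Lossy: the Chinese characters do not survive Windows-1252.
          System.out.println(roundTrip(chinese, cp1252).equals(chinese));
          // Lossless: UTF-8 covers all of Unicode.
          System.out.println(roundTrip(chinese, StandardCharsets.UTF_8).equals(chinese));
      }
  }
  ```

  The first line of output depends on the machine and the Java version; newer Java releases (18 and later) report UTF-8 by default unless configured otherwise.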
  
  Last edit: Aaron Madlon-Kay 2015-03-24
  

Didier Briel - 2015-03-24

Linux: Ubuntu 14.04 reports UTF-8 as well.

It's distribution- and configuration-dependent, but nowadays UTF-8 is indeed used in most cases.

If it's important that the output docs open cleanly in Notepad.exe then we could use UTF-16 for Windows and UTF-8 everywhere else.

Notepad works fine with UTF-8.

For the XML filters' source encoding, we're applying the XML rules (use the declared charset; if there isn't one, default to UTF-8), not the OS encoding.

Didier


Aaron Madlon-Kay - 2015-03-24

Notepad works fine with UTF-8.

OK, great, then special treatment isn't needed.

For the XML filters' source encoding, we're applying the XML rules (use the declared charset; if there isn't one, default to UTF-8), not the OS encoding.

The XML filter overrides the areas I'm changing anyway, so it is unaffected. Even if it used the new logic, the behavior would not change: I only autodetect or default to UTF-8 when the user has not specified an encoding AND the filter has is{Source,Target}EncodingVariable() == true; the XML filter has {false,true}.
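
The condition described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual patch: the `FilterFlags` interface is a hypothetical stand-in for the filter, a `null` user encoding is assumed to mean "not specified", and pairing the source flag with the source side and the target flag with the target side is my reading of the comment.

```java
// Hypothetical sketch of the decision rule, not OmegaT code.
class EncodingDecision {

    /** Stand-in for the two filter methods named in the comment above. */
    interface FilterFlags {
        boolean isSourceEncodingVariable();
        boolean isTargetEncodingVariable();
    }

    /** The new logic applies to the source only if the user set nothing and the filter allows it. */
    static boolean newLogicAppliesToSource(String userEncoding, FilterFlags filter) {
        return userEncoding == null && filter.isSourceEncodingVariable();
    }

    /** Same rule for the target side. */
    static boolean newLogicAppliesToTarget(String userEncoding, FilterFlags filter) {
        return userEncoding == null && filter.isTargetEncodingVariable();
    }
}
```

With the XML filter's {false,true}, the source side is never touched by the new logic, which matches the statement that the filter's own XML rules stay in charge there.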


Aaron Madlon-Kay - 2015-04-07

status: open --> open-fixed

Aaron Madlon-Kay - 2015-04-07

The input encoding detection is now in trunk, r7096.

I will make a separate ticket for the output encoding issue.
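
The thread does not say how the detection in r7096 works, so the following is only a generic sketch of detecting an input encoding "when possible": look for a byte-order mark, then accept UTF-8 if the bytes decode strictly, and otherwise report that nothing could be determined. All names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch, not the detector used by OmegaT.
class EncodingSniffer {

    /** Returns a detected charset, or empty if the bytes give no clear answer. */
    static Optional<Charset> detect(byte[] data) {
        // 1. Byte-order marks are the strongest signal.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of(StandardCharsets.UTF_8);
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of(StandardCharsets.UTF_16BE);
        // Note: FF FE also begins a UTF-32LE BOM; this sketch ignores UTF-32.
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of(StandardCharsets.UTF_16LE);

        // 2. No BOM: accept UTF-8 only if every byte sequence is well formed.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return Optional.of(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            // Probably a legacy single-byte or multi-byte encoding; let the caller decide.
            return Optional.empty();
        }
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would fall back to a user-specified or default encoding when the result is empty. Telling legacy encodings apart needs statistical heuristics, which this sketch deliberately leaves out.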


Didier Briel - 2015-04-22

Implemented in the released version 3.4 of OmegaT.

Didier


Didier Briel - 2015-04-22

status: open-fixed --> closed-fixed
