Using OmegaT 4.1.5 on Windows 7 / Russian locale / JRE 1.8.0_171. When running the Extract Text Content script (extract_text_content.groovy), I found the following issue: the output files are encoded in cp1251 (i.e. the system default encoding), so any characters that cp1251 cannot represent are lost.
Steps to reproduce:
1. Download the attached archive and unpack the project.
2. Open the project in OmegaT on Windows with a Russian (or Belarusian, Ukrainian) locale.
3. Run the script. It should run fine, no errors.
4. Open project_source_content.txt and project_target_content.txt. The latter opens fine. However, project_source_content.txt shows question marks instead of French letters with diacritics such as é, à, ç.
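The character loss in step 4 can be reproduced outside OmegaT. The following is a minimal sketch, not part of the script itself: the sample string is a hypothetical stand-in for a French source segment, written with Unicode escapes so that the snippet does not depend on the encoding of the script file. It encodes the text as windows-1251 (cp1251) and as UTF-8, then decodes each result back.

```groovy
// Hypothetical sample standing in for a French source segment: "é à ç".
def sample = '\u00e9 \u00e0 \u00e7'

// Encode as cp1251, then decode the same bytes again.
// cp1251 has no mapping for these letters, so each becomes '?'.
def cp1251Bytes = sample.getBytes('windows-1251')
def roundTripCp1251 = new String(cp1251Bytes, 'windows-1251')

// The same round trip through UTF-8 preserves the text.
def utf8Bytes = sample.getBytes('UTF-8')
def roundTripUtf8 = new String(utf8Bytes, 'UTF-8')

println roundTripCp1251
println roundTripUtf8
```

The question marks are written at encoding time, so the original letters cannot be recovered from the output file afterwards.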
Suggestion: explicitly force UTF-8 (or UTF-16) for the script output.
P.S.
I have no experience with Groovy at all, so I searched the Web and tried to modify the lines
srcTextFile << source + "\n";
and
tgtTextFile << target + "\n";
to, respectively,
srcTextFile.withWriter('UTF-8') << source + "\n";
and
tgtTextFile.withWriter('UTF-8') << target + "\n";
However, this results in empty output files. I can't figure out anything else.
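For reference, Groovy's File.withWriter(charset) expects a closure as its last argument and passes the writer into it, which is probably why the attempt above does not write anything useful. Below is a minimal sketch of two ways to write with an explicit encoding. The file and segment names are hypothetical stand-ins, not the script's real variables, and this is not necessarily how the script was eventually fixed.

```groovy
// Hypothetical stand-ins for the script's output file and extracted segments.
def outFile = File.createTempFile('project_source_content', '.txt')
outFile.deleteOnExit()
def segments = ['caf\u00e9', 'gar\u00e7on']

// Option 1: open one UTF-8 writer and do all the writing inside the closure.
// withWriter replaces any existing content and closes the writer afterwards.
outFile.withWriter('UTF-8') { writer ->
    segments.each { segment ->
        writer << segment + "\n"
    }
}

// Option 2: append a single string with an explicit charset.
outFile.append('d\u00e9j\u00e0' + "\n", 'UTF-8')

// Read the file back as UTF-8 to check the result.
def written = new String(outFile.bytes, 'UTF-8')
println written
```

Option 1 opens the file only once, which is preferable when writing many segments. Option 2 is the smallest change to a line of the form file << text.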
Do not worry, I'll make the changes.
Didier
Implemented in SVN (/trunk, [r10427]).
In addition to saving in UTF-8, I have used the system newline (instead of a hardcoded \n), so that line breaks display correctly in Windows Notepad, for instance.
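As an illustration only (this is a sketch with hypothetical names, not the code committed in r10427), combining UTF-8 output with the platform line separator could look like this:

```groovy
// Platform line separator: "\r\n" on Windows, "\n" on Linux and macOS.
def eol = System.lineSeparator()

// Hypothetical stand-ins for the script's target file and target segments.
def tgtFile = File.createTempFile('project_target_content', '.txt')
tgtFile.deleteOnExit()
def targets = ['first segment', 'second segment']

// Write each segment followed by the system newline, encoded as UTF-8.
tgtFile.withWriter('UTF-8') { writer ->
    targets.each { target ->
        writer << target << eol
    }
}

// Read the raw bytes back so that the line endings are not normalized.
def content = new String(tgtFile.bytes, 'UTF-8')
```

The trade-off is that the output then differs between platforms, which matters only if the files are compared byte for byte across systems.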
Didier
Related
Commit: [r10427]
Closed in the released version 4.1.5 update 1 of OmegaT.
Didier