OmegaT - multiplatform CAT tool / Feature Requests / #1393 Extract Text Content script to export to UTF-8

The free computer aided translation (CAT) tool for professionals

#1393 Extract Text Content script to export to UTF-8

Milestone: 4.1

Status: closed-fixed

Owner: Didier Briel

Labels: scripts (9)

Priority: 5

Updated: 2018-08-06

Created: 2018-06-21

Creator: Gabix

Private: No

Using OmegaT 4.1.5 on Windows 7 with a Russian locale and JRE 1.8.0_171. When running the Extract Text Content script (extract_text_content.groovy), I found the following issue: the output files are encoded in cp1251 (i.e. the system default encoding), which causes character loss for languages that use characters not included in that encoding.

Steps to reproduce:
1. Download the attached archive and unpack the project
2. Open it in OmegaT on Windows with a Russian (or Belarusian or Ukrainian) locale.
3. Run the script. It should run without errors.
4. Open project_source_content.txt and project_target_content.txt. The latter opens fine. However, project_source_content.txt shows question marks instead of French letters with diacritics such as é, à, ç.
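
For illustration, the question marks in step 4 are what one would expect when accented Latin letters are encoded to cp1251. The snippet below is only a minimal sketch, not part of the script; the sample string and variable names are made up, and it assumes the JRE provides the windows-1251 charset.

```groovy
import java.nio.charset.Charset
import java.nio.charset.StandardCharsets

// Hypothetical sample text: "café", written with a Unicode escape so the
// snippet does not depend on the encoding of the script file itself.
def sample = 'caf\u00e9'

// Encode to windows-1251 (cp1251) and decode again. getBytes(Charset)
// replaces characters the charset cannot represent with its replacement byte.
def cp1251 = Charset.forName('windows-1251')
def roundTrip = new String(sample.getBytes(cp1251), cp1251)
println roundTrip

// The same round trip through UTF-8 keeps every character.
def utf8RoundTrip = new String(sample.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
println utf8RoundTrip
```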

So my suggestion is to explicitly force UTF-8 (or UTF-16) for the script output.

P. S.
My skills in Groovy are absolute zero, so I searched the Web and tried to modify the lines

srcTextFile << source + "\n";
and
tgtTextFile << target + "\n";
to, respectively,
srcTextFile.withWriter('UTF-8') << source + "\n";
and
tgtTextFile.withWriter('UTF-8') << target + "\n";

However, this results in empty output files. I can't figure out anything else.
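
A note on the attempt above: in the Groovy GDK, File.withWriter takes a closure as its last argument, and the writing happens inside that closure, so calling it with only a charset name is probably not going to do what was intended. The following is a minimal sketch of two ways to write UTF-8 from Groovy. It is not the change that was eventually committed; the temporary files and the segments list are hypothetical stand-ins for what the script works with.

```groovy
// Hypothetical stand-ins for the script's output files and segment texts.
def srcTextFile = File.createTempFile('project_source_content', '.txt')
def tgtTextFile = File.createTempFile('project_target_content', '.txt')
def segments = ['caf\u00e9', 'gar\u00e7on']

// Option 1: withWriter(charset) { ... } opens the file once with the given
// encoding and closes the writer when the closure returns.
srcTextFile.withWriter('UTF-8') { writer ->
    segments.each { source ->
        writer << source + "\n"
    }
}

// Option 2: File.append(text, charset) keeps the one-line-per-segment style
// of the original statements while naming the encoding explicitly.
segments.each { target ->
    tgtTextFile.append(target + "\n", 'UTF-8')
}

// Read both files back as UTF-8 to check the content survived.
def srcContent = srcTextFile.getText('UTF-8')
def tgtContent = tgtTextFile.getText('UTF-8')
println srcContent
println tgtContent
```

Option 1 replaces any existing file content, while Option 2 appends to it, so which one fits depends on whether the script clears the output files beforehand.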

1 Attachment

Test_FR-RU.zip

Discussion

Didier Briel - 2018-06-21

assigned_to: Didier Briel

Group: future --> 4.1

Didier Briel - 2018-06-21

Do not worry, I'll make the changes.

Didier


Didier Briel - 2018-06-21

summary: Extract Text Content script to xport to UTF-8 --> Extract Text Content script to export to UTF-8

Didier Briel - 2018-06-22

status: open --> open-fixed

Didier Briel - 2018-06-22

Implemented in SVN (/trunk, [r10427]).

In addition to saving in UTF-8, I have used a system-newline (instead of a hardcoded \n), so that newlines are visible under Windows Notepad, for instance.

Didier
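
For readers who want to see what the two points above could look like in Groovy, here is a minimal sketch. It is an assumption about the general approach, not the code from [r10427]; the file and the sample lines are hypothetical.

```groovy
// Platform line separator: "\r\n" on Windows, "\n" on Linux and macOS.
def eol = System.lineSeparator()

// Hypothetical output file and sample lines, for illustration only.
def outFile = File.createTempFile('extract_sketch', '.txt')
def lines = ['premi\u00e8re ligne', 'deuxi\u00e8me ligne']

// Write every line as UTF-8, terminated by the platform line separator
// instead of a hardcoded "\n".
outFile.withWriter('UTF-8') { writer ->
    lines.each { line ->
        writer << line + eol
    }
}

def written = outFile.getText('UTF-8')
println written
```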

Related

Commit: [r10427]


Didier Briel - 2018-08-06

status: open-fixed --> closed-fixed

Didier Briel - 2018-08-06

Closed in the released version 4.1.5 update 1 of OmegaT.

Didier

