OmegaT - multiplatform CAT tool

The free computer aided translation (CAT) tool for professionals

#740 Valid XML characters outside the BMP are stripped from input

Milestone: 3.4

Status: closed-fixed

Owner: Aaron Madlon-Kay

Labels: None

Priority: 5

Updated: 2015-04-22

Created: 2015-04-07

Creator: Aaron Madlon-Kay

Private: No

OmegaT sanitizes input so that every character in an input string is one of the characters allowed in XML. However, this is done incorrectly: the string is iterated by char instead of by code point.

Java's Strings are UTF-16, so characters outside the BMP are represented by surrogate pairs. A lone surrogate is not allowed in XML, but a valid pair is. Because we iterate by char and not by code point, we look at each surrogate separately and remove both, which strips the whole character.
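To illustrate the failure mode, here is a minimal sketch that reproduces it outside OmegaT. The class and method names (SurrogateDemo, isValidXmlCodePoint, stripByChar) are hypothetical and are not taken from the OmegaT source. The validity check follows the Char production of XML 1.0, and U+1F600 is simply an arbitrary non-BMP character chosen for the demonstration.

```java
// Hypothetical reproduction of the bug; not OmegaT's actual code.
final class SurrogateDemo {

    /** XML 1.0 "Char" production, applied to a single value. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Filters char by char, so each half of a surrogate pair is tested alone. */
    static String stripByChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // A surrogate (U+D800..U+DFFF) never passes this check on its own.
            if (isValidXmlCodePoint(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 as a surrogate pair, 'b'
        System.out.println(s.length());                      // 4 chars
        System.out.println(s.codePointCount(0, s.length())); // 3 code points
        System.out.println(stripByChar(s));                  // ab
    }
}
```

The string holds three code points in four chars, and the char-wise filter drops both halves of the pair even though U+1F600 itself is a valid XML character.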

As this does not appear to be the intended behavior, I will fix it by correcting the iteration method.
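One way such a correction could look is sketched below. This is an assumption about the general shape of a fix, not the change committed to OmegaT; XmlSanitizerSketch and stripInvalid are hypothetical names. It walks the string with String.codePointAt and Character.charCount so that a valid surrogate pair is checked as one code point.

```java
// Hypothetical sketch of code-point-based filtering; not OmegaT's actual fix.
final class XmlSanitizerSketch {

    /** XML 1.0 "Char" production, applied to a full code point. */
    static boolean isValidXmlCodePoint(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes code points that XML 1.0 does not allow, keeping valid pairs. */
    static String stripInvalid(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // For a valid pair this returns the combined code point;
            // for a lone surrogate it returns the surrogate value itself.
            int cp = s.codePointAt(i);
            if (isValidXmlCodePoint(cp)) {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

With this loop a well-formed pair such as "\uD83D\uDE00" survives, while an unpaired surrogate or a control character such as U+0000 is still removed.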

Discussion

Aaron Madlon-Kay - 2015-04-07

This is fixed in trunk, r7100.

Known issue: The Editor appears to be very slow when it contains non-BMP characters.

Description has changed:

Diff:

--- old
+++ new
@@ -1,4 +1,4 @@
-OmegaT sanitizes input so that all characters in input strings conform to the allowed characters in XML (http://en.wikipedia.org/wiki/Valid_characters_in_XML). However this is done incorrectly, iterating the string by `char`s instead of by `codepoint`s.
+OmegaT sanitizes input so that all characters in input strings conform to the [allowed characters in XML](http://en.wikipedia.org/wiki/Valid_characters_in_XML). However this is done incorrectly, iterating the string by `char`s instead of by `codepoint`s.

 Since Java's `String`s are UTF-16, characters outside the BMP are represented by surrogate pairs; any single surrogate is not allowed free-standing in XML, but valid pairs are; as we are iterating by `char` and not `codepoint`, we look at each surrogate separately and thus remove it.

Didier Briel - 2015-04-22

status: open-fixed --> closed-fixed

Didier Briel - 2015-04-22

Fixed in the released version 3.4 of OmegaT.

Didier
