#740 Valid XML characters outside the BMP are stripped from input

3.4
closed-fixed
None
5
2015-04-22
2015-04-07
No

OmegaT sanitizes input so that every character in an input string is one of the allowed characters in XML. However, this is done incorrectly: the string is iterated by `char` instead of by code point.

Java's `String`s are UTF-16, so characters outside the BMP are represented by surrogate pairs. A free-standing surrogate is not allowed in XML, but a valid pair is. Because we iterate by `char` and not by code point, we examine each surrogate separately and therefore remove both halves of every valid pair.

As this does not appear to be the intended behavior, I will fix it by correcting the iteration method.
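
To illustrate the difference, here is a minimal sketch; it is not OmegaT's actual code. The class and method names (`XmlSanitizerSketch`, `isValidXmlChar`, `stripByChar`, `stripByCodePoint`) are hypothetical, and the validity check is assumed to follow the `Char` production of XML 1.0.

```java
// Hypothetical sketch, not the OmegaT source: contrasts per-char and
// per-code-point filtering of characters that are invalid in XML 1.0.
class XmlSanitizerSketch {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Buggy approach: each UTF-16 code unit is checked on its own, so both
    // halves of a valid surrogate pair fall in the disallowed range
    // 0xD800-0xDFFF and are dropped.
    static String stripByChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isValidXmlChar(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Corrected approach: a valid surrogate pair is decoded into a single
    // code point before the check. A lone surrogate is returned as-is by
    // codePointAt and is still rejected.
    static String stripByCodePoint(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance 1 or 2 code units
        }
        return sb.toString();
    }
}
```

With the input `"a\uD83D\uDE00b"` (U+1F600 between two ASCII letters), `stripByChar` returns `"ab"`, while `stripByCodePoint` returns the input unchanged. A lone surrogate such as the one in `"a\uD83Db"` is removed by both methods.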

Discussion

  • Aaron Madlon-Kay

    This is fixed in trunk, r7100.

    Known issue: The Editor appears to be very slow when it contains non-BMP characters.

     
  • Aaron Madlon-Kay

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -1,4 +1,4 @@
    -OmegaT sanitizes input so that all characters in input strings conform to the allowed characters in XML (http://en.wikipedia.org/wiki/Valid_characters_in_XML). However this is done incorrectly, iterating the string by `char`s instead of by `codepoint`s.
    +OmegaT sanitizes input so that all characters in input strings conform to the [allowed characters in XML](http://en.wikipedia.org/wiki/Valid_characters_in_XML). However this is done incorrectly, iterating the string by `char`s instead of by `codepoint`s.
    
     Since Java's `String`s are UTF-16, characters outside the BMP are represented by surrogate pairs; any single surrogate is not allowed free-standing in XML, but valid pairs are; as we are iterating by `char` and not `codepoint`, we look at each surrogate separately and thus remove it.
    
     
  • Didier Briel

    Didier Briel - 2015-04-22
    • status: open-fixed --> closed-fixed
     
  • Didier Briel

    Didier Briel - 2015-04-22

    Fixed in the released version 3.4 of OmegaT.

    Didier

     
