OmegaT sanitizes input so that every character in an input string is one that XML allows. However, this is done incorrectly: the string is iterated by char instead of by code point.
Since Java Strings are UTF-16, characters outside the BMP are represented by surrogate pairs. A lone surrogate is not allowed in XML, but a valid pair is. Because we iterate by char rather than by code point, we examine each surrogate separately, judge it invalid, and remove it, so every non-BMP character is stripped from the input.
As this does not appear to be the intended behavior, I will fix it by correcting the iteration method.
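To illustrate the difference, here is a minimal sketch of the two iteration styles. It is not the actual OmegaT code: the class and method names (XmlSanitizerSketch, isValidXmlChar, sanitizeByChar, sanitizeByCodePoint) are hypothetical, and the validity check simply follows the Char production of XML 1.0.

```java
final class XmlSanitizerSketch {

    // XML 1.0 "Char" production: tab, LF, CR, and the ranges
    // 0x20-0xD7FF, 0xE000-0xFFFD, 0x10000-0x10FFFF.
    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Buggy style: each UTF-16 code unit is checked on its own, so both
    // halves of a surrogate pair fall in 0xD800-0xDFFF and are dropped.
    static String sanitizeByChar(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isValidXmlChar(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Corrected style: a valid surrogate pair is read as one code point.
    // An unpaired surrogate is returned as-is by codePointAt and rejected.
    static String sanitizeByCodePoint(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (outside the BMP), 'b', then a NUL character.
        String smile = new String(Character.toChars(0x1F600));
        String input = "a" + smile + "b" + "\0";

        System.out.println(sanitizeByChar(input).length());      // 2: only "ab" survives
        System.out.println(sanitizeByCodePoint(input).length()); // 4: 'a', the pair, 'b'
    }
}
```

In both versions the NUL character is removed, as it should be; only the code point version keeps the non-BMP character.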
This is fixed in trunk, r7100.
Known issue: The Editor appears to be very slow when it contains non-BMP characters.
Diff:
Fixed in the released version 3.4 of OmegaT.
Didier