#569 Codepoint collation with non-BMP chars

v8.7.3
closed
5
2012-10-08
2006-08-12
Michael Kay
No

Saxon's implementation of the Unicode codepoint
collation actually compares UTF-16 code units rather
than Unicode codepoint values. This leads to incorrect
results when comparing a character in the range
56320-65535 (xDC00-xFFFF) with a character outside the
BMP (that is, greater than 65535). Because the non-BMP
character is represented by a pair of code units (a
"surrogate pair") of which the first one is less than
56320, the non-BMP character collates as "less than" the
other character, even though its codepoint value is
greater. For example, the result of the expression

"" lt "𑅰"

is incorrectly returned as false.

The problem does not affect equality comparisons.
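
The difference between the two orderings can be reproduced outside Saxon with a minimal Java sketch. This is not Saxon's code: the class name CodepointCompare, the method compareByCodePoint, and the two sample characters (xFFFD and x10000) are assumptions chosen for illustration, and are not the characters from the expression in this report. The standard String.compareTo compares UTF-16 code units, while the hypothetical helper walks both strings one codepoint at a time.

```java
public final class CodepointCompare {

    /**
     * Compares two strings by Unicode codepoint value rather than by
     * UTF-16 code unit. Hypothetical helper, not part of Saxon.
     */
    public static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);   // decodes a surrogate pair if present
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            i += Character.charCount(cpA); // advance by 1 or 2 code units
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration only.
        String bmp = "\uFFFD";          // xFFFD, a BMP character above the surrogate range
        String nonBmp = "\uD800\uDC00"; // x10000, stored as a surrogate pair

        // Code-unit comparison: xFFFD is compared with xD800, so bmp sorts last.
        System.out.println(bmp.compareTo(nonBmp) < 0);           // prints false

        // Codepoint comparison: xFFFD is compared with x10000, so bmp sorts first.
        System.out.println(compareByCodePoint(bmp, nonBmp) < 0); // prints true

        // Equality is the same under both orderings.
        System.out.println(compareByCodePoint(nonBmp, nonBmp) == 0); // prints true
    }
}
```

Both orderings agree whenever neither string contains a non-BMP character, and they always agree on equality, which is consistent with the observation that only "less than" and "greater than" comparisons are affected.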
