I have a problem parsing an XML document that contains Unicode characters at or above U+10000, for example U+1F1E7
(http://apps.timwhitlock.info/unicode/inspect/hex/1F1E7).
I'm getting \uF1E7 instead. Can someone help me with this?
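import static org.junit.Assert.assertEquals;
import org.junit.Test;
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

@Test
public void testEmoji() throws Exception {
    // Build a small document whose <token> text ends with U+1F1E7,
    // which Java stores as the surrogate pair \uD83C\uDDE7.
    String xmlDocument = "";
    xmlDocument += "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
    xmlDocument += "<root>";
    //xmlDocument += "<token>emoji\uD83C\uDDE7</token>";
    xmlDocument += "<token>emoji" + new String(Character.toChars(0x1F1E7)) + "</token>";
    xmlDocument += "</root>";

    VTDGen vg = new VTDGen();
    vg.setDoc(xmlDocument.getBytes("UTF-8"));
    vg.parse(true); // namespace-aware parse
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    ap.selectElementNS("", "token");
    while (ap.iterate()) {
        int t = vn.getText();
        if (t != -1) {
            String value = vn.toRawString(t);
            // assertEquals takes (expected, actual)
            //assertEquals("emoji\uD83C\uDDE7", value);
            assertEquals("emoji" + new String(Character.toChars(0x1F1E7)), value);
        }
    }
}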
FYI, we're using 2.11.
I also tested with toNormalizedString instead of toRawString and got the same failure.
OK, I will look into it and get back to you.
That particular character is outside the range that a single Java char can represent. More specifically, it requires 4 bytes in UTF-8, which cannot fit in one 16-bit char value in Java.
Java's char is a UTF-16 code unit, so a character with a code point above 0xFFFF is encoded as two chars (a surrogate pair).
As I wrote in the test, we can handle it with a surrogate pair.
http://www.oracle.com/us/technologies/java/supplementary-142654.html
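To make the surrogate-pair point concrete, here is a minimal sketch using only the standard Java library. The class name SurrogateDemo and its helper methods are made up for illustration. It shows that U+1F1E7 takes two chars in a String and 4 bytes in UTF-8, and that a narrowing cast to char keeps only the low 16 bits, which happens to give the \uF1E7 value reported above. Whether that is the actual cause inside the parser is an assumption, not something confirmed in this thread.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Encode one code point as a Java String (one or two UTF-16 code units).
    static String toUtf16(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // A narrowing cast keeps only the low 16 bits of the code point.
    // Assumption: this mimics the kind of truncation that would yield \uF1E7.
    static char truncate(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int cp = 0x1F1E7;
        String s = toUtf16(cp);

        System.out.println(s.length());                       // 2 chars (surrogate pair)
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
        System.out.printf("%04X %04X%n",
                (int) s.charAt(0), (int) s.charAt(1));        // D83C DDE7
        System.out.println(
                s.getBytes(StandardCharsets.UTF_8).length);   // 4 bytes in UTF-8
        System.out.printf("%04X%n", (int) truncate(cp));      // F1E7
    }
}
```

Code that walks a String by code point (for example with codePointAt and Character.charCount) rather than by char avoids splitting or truncating such characters.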
It is being fixed; a patch or new release will go out soon.
Download 2.13_4 and you will find the fix.
This bug is marked as fixed.