
#7 problem parsing xml document with unicode characters past U10000

2.0
closed
unicode (1)
2017-08-27
2016-08-02
Keunwoo Ryu
No

I have a problem parsing an XML document that contains Unicode characters past U+10000, for example U+1F1E7
(http://apps.timwhitlock.info/unicode/inspect/hex/1F1E7).
I'm getting \uF1E7 instead. Can someone help me with this?

@Test
public void testEmoji() throws Exception {
    String xmlDocument = "";
    xmlDocument += "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
    xmlDocument += "<root>";
    //xmlDocument += "<token>emoji\uD83C\uDDE7</token>";
    xmlDocument += "<token>emoji"+new String(Character.toChars(0x1F1E7))+"</token>";
    xmlDocument += "</root>";
    VTDGen vg = new VTDGen();
    vg.setDoc(xmlDocument.getBytes("UTF-8"));
    vg.parse(true);
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    ap.selectElementNS("","token");
    while(ap.iterate()) {
        int t = vn.getText();
        if (t != -1) {
            String value = vn.toRawString(t);
            // JUnit's assertEquals takes the expected value first, then the actual value
            //assertEquals("emoji\uD83C\uDDE7", value);
            assertEquals("emoji"+new String(Character.toChars(0x1F1E7)), value);
        }
    }
}
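
For comparison, here is a minimal sketch that uses only the JDK (no VTD-XML) to show what the test expects and where a value like \uF1E7 could come from. The class and method names are hypothetical, and the truncation step is only an assumption about how the symptom might arise, not a description of VTD-XML's actual code.

```java
import java.nio.charset.StandardCharsets;

public class SupplementaryRoundTrip {
    // Encode one code point as UTF-8 bytes and decode it again with the JDK.
    public static String roundTrip(int codePoint) {
        String original = new String(Character.toChars(codePoint));
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // A narrowing cast to char keeps only the low 16 bits of the code point.
    // Assumption: a decoder that stores the code point in a single char
    // would lose the upper bits in this way.
    public static char truncateToChar(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int cp = 0x1F1E7;
        String s = roundTrip(cp);
        // Number of UTF-16 code units in the decoded string
        System.out.println(s.length());
        // Code point recovered from the decoded string, in hex
        System.out.println(Integer.toHexString(s.codePointAt(0)));
        // Value left after truncating the code point to 16 bits, in hex
        System.out.println(Integer.toHexString(truncateToChar(cp)));
    }
}
```

The JDK round trip keeps the character as two UTF-16 code units, while the truncated value equals the \uF1E7 reported above.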

Discussion

  • Keunwoo Ryu

    Keunwoo Ryu - 2016-08-02

    FYI we're using 2.11

     
  • Keunwoo Ryu

    Keunwoo Ryu - 2016-08-02

    I also tested with toNormalizedString instead of toRawString and got the same failure.

     
  • jimmy zhang

    jimmy zhang - 2016-08-17

    ok, will look into it and get back

     
  • jimmy zhang

    jimmy zhang - 2016-08-17

    That particular char is outside the range of chars supportable by Java's internal char representation. More specifically, that particular char requires 4 bytes, which can't fit in a 16-bit char value in Java.

     
  • Keunwoo Ryu

    Keunwoo Ryu - 2016-09-20

    Java's char is a UTF-16 code unit, so a character with a code point > 0xFFFF is encoded as 2 chars (a surrogate pair).

    As I wrote in the test, we can handle it with a surrogate pair.

    http://www.oracle.com/us/technologies/java/supplementary-142654.html

     
  • jimmy zhang

    jimmy zhang - 2017-02-10

    It is being fixed; a patch or new release will go out soon.

     
  • jimmy zhang

    jimmy zhang - 2017-07-20

    download 2.13_4 and you will find the fix

     
  • jimmy zhang

    jimmy zhang - 2017-08-27
    • labels: --> unicode
    • status: open --> closed
    • assigned_to: jimmy zhang
     
  • jimmy zhang

    jimmy zhang - 2017-08-27

    this bug is marked as fixed
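
The surrogate-pair point raised in the discussion can be illustrated with a short sketch. It splits a supplementary code point into its high and low surrogates using the standard UTF-16 arithmetic and recombines them with the JDK. The class and method names are hypothetical, and this is not the code of the VTD-XML fix.

```java
public class SurrogatePairs {
    // Split a supplementary code point (above 0xFFFF) into its UTF-16 surrogate pair.
    public static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombine a surrogate pair into the original code point.
    public static int toCodePoint(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F1E7);
        // Print both surrogates and the recombined code point in hex
        System.out.println(Integer.toHexString(pair[0]));
        System.out.println(Integer.toHexString(pair[1]));
        System.out.println(Integer.toHexString(toCodePoint(pair[0], pair[1])));
    }
}
```

For U+1F1E7 this yields the pair \uD83C \uDDE7, the same two chars that appear in the commented-out lines of the test. The JDK's Character.toChars does the same split, which is why both forms of the test string are equivalent.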

     
