VTD-XML: The Future of XML Processing / Tickets / #7 problem parsing xml document with unicode characters past U10000

#7 problem parsing xml document with unicode characters past U10000

Milestone: 2.0

Status: closed

Owner: jimmy zhang

Labels: unicode (1)

Updated: 2017-08-27

Created: 2016-08-02

Creator: Keunwoo Ryu

Private: No

I have a problem parsing an XML document with Unicode characters past U+10000, for example U+1F1E7
(http://apps.timwhitlock.info/unicode/inspect/hex/1F1E7).
I'm getting \uF1E7 instead. Can someone help me with this?

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;

@Test
public void testEmoji() throws Exception {
    // U+1F1E7 is outside the BMP; in a Java String it is the surrogate pair \uD83C\uDDE7
    String emoji = new String(Character.toChars(0x1F1E7));
    String xmlDocument = "";
    xmlDocument += "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
    xmlDocument += "<root>";
    xmlDocument += "<token>emoji" + emoji + "</token>";
    xmlDocument += "</root>";
    // Parse the UTF-8 bytes with namespace awareness enabled
    VTDGen vg = new VTDGen();
    vg.setDoc(xmlDocument.getBytes("UTF-8"));
    vg.parse(true);
    VTDNav vn = vg.getNav();
    AutoPilot ap = new AutoPilot(vn);
    ap.selectElementNS("", "token");
    while (ap.iterate()) {
        int t = vn.getText();
        if (t != -1) {
            String value = vn.toRawString(t);
            // JUnit convention: expected value first, actual value second
            assertEquals("emoji" + emoji, value);
        }
    }
}

Discussion

Keunwoo Ryu - 2016-08-02

FYI we're using 2.11

Keunwoo Ryu - 2016-08-02

I also tested with toNormalizedString instead of toRawString and got the same failure.

jimmy zhang - 2016-08-17

ok, will look into it and get back

jimmy zhang - 2016-08-17

That particular char is outside the range of chars supported by Java's internal char representation. More specifically, that particular char requires 4 bytes, which can't fit in a 16-bit char value in Java.
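
To make the 4-byte point concrete, here is a minimal sketch of how the UTF-8 bytes of U+1F1E7 decode to a code point, and what a narrowing cast to a 16-bit char does to it. The class name and the `decode4` helper are hypothetical, written only for this illustration; that a similar truncation caused the reported \uF1E7 is an assumption, not something confirmed against the VTD-XML source.

```java
import java.nio.charset.StandardCharsets;

class Utf8FourByteDemo {
    // Hypothetical helper: decodes one 4-byte UTF-8 sequence
    // (lead byte 11110xxx followed by three 10xxxxxx continuation bytes).
    static int decode4(byte[] b, int off) {
        return ((b[off] & 0x07) << 18)
             | ((b[off + 1] & 0x3F) << 12)
             | ((b[off + 2] & 0x3F) << 6)
             | (b[off + 3] & 0x3F);
    }

    public static void main(String[] args) {
        // U+1F1E7 encodes to the four bytes F0 9F 87 A7 in UTF-8
        byte[] utf8 = new String(Character.toChars(0x1F1E7))
                .getBytes(StandardCharsets.UTF_8);
        int codePoint = decode4(utf8, 0);

        // Casting to char keeps only the low 16 bits of the code point
        char truncated = (char) codePoint;

        System.out.println(utf8.length);                     // 4
        System.out.println(Integer.toHexString(codePoint));  // 1f1e7
        System.out.println(Integer.toHexString(truncated));  // f1e7
    }
}
```

The low 16 bits of 0x1F1E7 are 0xF1E7, which matches the value reported at the top of this ticket.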

Keunwoo Ryu - 2016-09-20

Java's char is a UTF-16 code unit, so a character with a code point > 0xFFFF is encoded as 2 chars (a surrogate pair).

As I wrote in the test, we can handle it with a surrogate pair.

http://www.oracle.com/us/technologies/java/supplementary-142654.html
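
As a small sketch of the surrogate-pair handling described above, the following uses only standard `java.lang.Character` and `String` methods. The class name and the `fromCodePoint` helper are hypothetical and exist only for this example; nothing here reflects how VTD-XML itself implements the fix.

```java
class SurrogatePairDemo {
    // Hypothetical helper: builds a String from a single Unicode code point
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int cp = 0x1F1E7;
        String s = fromCodePoint(cp);

        // One code point, but two UTF-16 code units (chars)
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1

        // The two chars form a valid high/low surrogate pair
        char high = Character.highSurrogate(cp);
        char low = Character.lowSurrogate(cp);
        System.out.println(Integer.toHexString(high));       // d83c
        System.out.println(Integer.toHexString(low));        // dde7
        System.out.println(Character.isSurrogatePair(s.charAt(0), s.charAt(1)));

        // Reading by code point rather than by char recovers the full value
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f1e7
    }
}
```

A decoder that emits both surrogates, for example via `StringBuilder.appendCodePoint`, keeps the full code point, whereas one that stores each decoded character in a single char cannot.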

jimmy zhang - 2017-02-10

It is being fixed; a patch or new release will go out soon.

jimmy zhang - 2017-07-20

download 2.13_4 and you will find the fix

jimmy zhang - 2017-08-27

labels: --> unicode

status: open --> closed

assigned_to: jimmy zhang

jimmy zhang - 2017-08-27

this bug is marked as fixed
