CyberNeko HTML Parser / Bugs / #155 Surrogate characters handling in XHTML importer

#155 Surrogate characters handling in XHTML importer

Milestone: 1.9.15

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2015-01-18

Created: 2014-04-07

Creator: Radu Coravu

Private: No

An XHTML document may contain characters that are encoded as surrogate pairs.
For example, its text content may include ideographs like the ones listed here:

http://www.alanwood.net/unicode/cjk_unified_ideographs_extension_b.html

Each such ideograph is represented by two Java chars (a high surrogate followed by a low surrogate).
Right now the purifier (Purifier.purifyText(XMLString)) escapes each half of the pair to "\u....", which is incorrect.
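
As an illustration of the pair structure described above, the following sketch uses only the JDK's java.lang.Character (not the Xerces XMLChar class used in the proposed fix) to split U+20000, the first CJK Extension B code point, into its two UTF-16 code units. The class and method names are hypothetical.

```java
// Illustrative only: JDK-based helper, not part of NekoHTML or Xerces.
final class SurrogatePairDemo {

    /** Returns the UTF-16 code units of a code point: one char for the BMP, two otherwise. */
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    /** Recombines a high/low surrogate pair into the code point it encodes. */
    static int combine(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }
}
```

SurrogatePairDemo.toUtf16(0x20000) returns the chars 0xD840 and 0xDC00. Neither char is a valid XML character on its own, which is why a purifier that inspects one char at a time ends up escaping both.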

If the two characters form a valid surrogate pair, they should be written to the output directly. A possible fix:

    private final XMLStringBuffer fStringBuffer = new XMLStringBuffer();

    /** Purify content, passing valid surrogate pairs through unescaped. */
    @Override
    protected XMLString purifyText(XMLString text) {
        fStringBuffer.length = 0;
        for (int i = 0; i < text.length; i++) {
            char c = text.ch[text.offset + i];
            if (!XMLChar.isInvalid(c)) {
                fStringBuffer.append(c);
                continue;
            }
            // c is invalid on its own; check whether it starts a valid surrogate pair.
            if (i < text.length - 1 && XMLChar.isHighSurrogate(c)) {
                char low = text.ch[text.offset + i + 1];
                if (XMLChar.isLowSurrogate(low)
                        && XMLChar.isValid(XMLChar.supplemental(c, low))) {
                    // EXM-30072: valid surrogate pair, write both halves directly.
                    fStringBuffer.append(c);
                    fStringBuffer.append(low);
                    i++; // the low surrogate has been consumed
                    continue;
                }
            }
            // Anything else is invalid XML: escape it as before.
            fStringBuffer.append("\\u" + toHexString(c, 4));
        }
        return fStringBuffer;
    }

Discussion

Marc Guillemot - 2014-06-02

Could you provide a unit test for this issue?

The test case would be something like:

    String in = "<html><p>\uD840\uDC00</p></html> ";
    // NEKO_API_HERE is a placeholder for the actual importer entry point.
    String out = NEKO_API_HERE.import(new InputSource(
            new StringReader(in)));

    assertEquals("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n" +
        "                      \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" +
        "<html>\n" +
        "    <head></head>\n" +
        "    <body><p>\uD840\uDC00</p></body>\n" +
        "</html>\n", out);
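
Since the test case above leaves the importer entry point as a placeholder, the surrogate handling itself could also be checked in isolation. The following is a minimal standalone sketch of the same rule, assuming plain Strings and the JDK's Character class instead of XMLString and XMLChar. The class name, the isXmlChar helper, and the lowercase four-digit escape format are assumptions for illustration, not NekoHTML code.

```java
// Standalone sketch of the surrogate-aware purify rule; not the NekoHTML Purifier.
final class SurrogateAwarePurifier {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies valid chars and valid surrogate pairs; escapes everything else. */
    static String purify(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isXmlChar(c)) {
                out.append(c);
                continue;
            }
            if (Character.isHighSurrogate(c) && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                char low = text.charAt(i + 1);
                if (isXmlChar(Character.toCodePoint(c, low))) {
                    // Valid pair: write both halves and skip the low surrogate.
                    out.append(c).append(low);
                    i++;
                    continue;
                }
            }
            // Lone surrogate or other invalid char: write a backslash-u escape.
            out.append("\\u").append(String.format("%04x", (int) c));
        }
        return out.toString();
    }
}
```

With this sketch, a string containing the pair 0xD840 0xDC00 comes back unchanged, while a high surrogate without a following low surrogate is still escaped.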
