#155 Surrogate characters handling in XHTML importer

1.9.15
open
nobody
None
5
2014-08-20
2014-04-07
Radu Coravu
No

An XHTML document might have surrogate characters in it.
For example, its text content might contain ideographs like the ones listed here:

http://www.alanwood.net/unicode/cjk_unified_ideographs_extension_b.html

Each such ideograph is encoded in UTF-16 as two characters (a high surrogate followed by a low surrogate).
Right now the purifier (Purifier.purifyText(XMLString)) escapes each of these characters to "\u....", which is incorrect.
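
To make the two-character encoding concrete, here is a small sketch using only java.lang.String and java.lang.Character. The class and method names are made up for illustration; U+20000 is the first code point of CJK Unified Ideographs Extension B.

```java
final class SurrogatePairDemo {

    /** Summarizes how a string is laid out in UTF-16 (hypothetical helper). */
    static String describe(String s) {
        int codePoints = s.codePointCount(0, s.length());
        return s.length() + " code units, " + codePoints + " code point(s), first = U+"
                + Integer.toHexString(s.codePointAt(0)).toUpperCase();
    }

    public static void main(String[] args) {
        // U+20000 written as its high and low surrogate.
        String ideograph = "\uD840\uDC00";
        System.out.println(describe(ideograph));
        System.out.println(Character.isHighSurrogate(ideograph.charAt(0)));
        System.out.println(Character.isLowSurrogate(ideograph.charAt(1)));
    }
}
```

Neither half is a valid XML character on its own, which is presumably why a check that looks at one char at a time rejects both.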

If the surrogate pair is valid, it should be written directly:

    private final XMLStringBuffer fStringBuffer = new XMLStringBuffer();

    /** Purify content. */
    @Override
    protected XMLString purifyText(XMLString text) {
        fStringBuffer.length = 0;
        for (int i = 0; i < text.length; i++) {
            char c = text.ch[text.offset + i];
            if (!XMLChar.isInvalid(c)) {
                fStringBuffer.append(c);
                continue;
            }
            // A surrogate is invalid on its own, but a high surrogate followed
            // by a low surrogate may encode a valid supplemental character.
            if (XMLChar.isHighSurrogate(c) && i < text.length - 1) {
                char low = text.ch[text.offset + i + 1];
                if (XMLChar.isLowSurrogate(low)
                        && XMLChar.isValid(XMLChar.supplemental(c, low))) {
                    // EXM-30072 Valid surrogate case, write the surrogate pair directly.
                    fStringBuffer.append(c);
                    fStringBuffer.append(low);
                    i++; // The low surrogate has been consumed.
                    continue;
                }
            }
            // Invalid XML (including unpaired surrogates): escape as before.
            fStringBuffer.append("\\u" + toHexString(c, 4));
        }
        return fStringBuffer;
    }
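
For anyone who wants to try the logic without the Xerces/NekoHTML classes, the following is a rough standalone sketch of the same idea. It is not the proposed patch: the class name, the isValidXmlChar helper and the lower-case hex escape format are assumptions, and java.lang.Character stands in for XMLChar.

```java
final class SurrogateAwarePurifier {

    /** XML 1.0 Char production, applied to a full code point. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Keeps valid characters and valid surrogate pairs, escapes the rest. */
    static String purify(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isValidXmlChar(c)) {
                out.append(c);
                continue;
            }
            // A well-formed pair always decodes to U+10000..U+10FFFF,
            // which the Char production accepts, so it is copied as is.
            if (i + 1 < text.length()
                    && Character.isSurrogatePair(c, text.charAt(i + 1))) {
                out.append(c).append(text.charAt(i + 1));
                i++; // Skip the low surrogate.
                continue;
            }
            // Control characters and unpaired surrogates are escaped.
            // The exact escape format here is an assumption.
            out.append(String.format("\\u%04x", (int) c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(purify("<p>\uD840\uDC00</p>")); // pair is kept
        System.out.println(purify("\uD840x"));              // lone high surrogate is escaped
    }
}
```

With this sketch a valid pair passes through unchanged, while a high surrogate that is not followed by a low surrogate (or a low surrogate on its own) is still escaped.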

Discussion

  • Marc Guillemot
    2014-06-02

    Could you provide a unit test for this issue?

     
  • Radu Coravu
    2014-06-02

    The test case would be something like:

    String in = "<html><p>\uD840\uDC00</p></html> ";
    String out = NEKO_API_HERE.import(new InputSource(
            new StringReader(in)));
    
    assertEquals("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + 
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n" + 
        "                      \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" + 
        "<html>\n" + 
        "    <head></head>\n" + 
        "    <body><p>\uD840\uDC00</p></body>\n" + 
        "</html>\n" + 
        "", out);