
#155 Surrogate characters handling in XHTML importer

1.9.15
open
nobody
None
5
2015-01-18
2014-04-07
Radu Coravu
No

An XHTML document might contain surrogate characters.
For example, its text content might include ideographs such as the ones listed here:

http://www.alanwood.net/unicode/cjk_unified_ideographs_extension_b.html

Each such ideograph is encoded as two UTF-16 code units (a high surrogate followed by a low surrogate).
Right now the purifier (Purifier.purifyText(XMLString)) escapes each of the two to "\u....", which is incorrect.
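
To make the pairing concrete, here is a small sketch that uses only java.lang.Character, not the XMLChar class the purifier relies on. The class name SurrogateDemo and the helper combine are hypothetical and only serve the illustration; U+20000 is the first code point of CJK Unified Ideographs Extension B.

```java
public class SurrogateDemo {
    /** Combines a high/low surrogate pair into the code point it encodes. */
    static int combine(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a well-formed surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        String ideograph = "\uD840\uDC00"; // one ideograph, two UTF-16 code units
        // Number of char values in the string.
        System.out.println(ideograph.length());
        // Number of Unicode code points in the string.
        System.out.println(ideograph.codePointCount(0, ideograph.length()));
        // The code point encoded by the pair, printed in hexadecimal.
        System.out.printf("U+%X%n", combine(ideograph.charAt(0), ideograph.charAt(1)));
    }
}
```

Neither code unit is a valid XML character by itself, which is presumably why a per-char validity check ends up escaping both halves.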

If the surrogate pair is well-formed and encodes a valid XML character, it should be written directly:

    private final XMLStringBuffer fStringBuffer = new XMLStringBuffer();
    /** Purify content. */
    @Override
    protected XMLString purifyText(XMLString text) {
        fStringBuffer.length = 0;
        for (int i = 0; i < text.length; i++) {
            char c = text.ch[text.offset+i];
            if (XMLChar.isInvalid(c)) {
              // A surrogate code unit is invalid on its own, but a well-formed
              // high/low pair encodes a valid supplementary character.
              boolean problem = true;
              if (i < text.length - 1) {
                char low = text.ch[text.offset + i + 1];
                if (XMLChar.isHighSurrogate(c) && XMLChar.isLowSurrogate(low)
                    && XMLChar.isValid(XMLChar.supplemental(c, low))) {
                  // EXM-30072 Valid surrogate case, write the surrogate pair directly.
                  problem = false;
                  fStringBuffer.append(c);
                  fStringBuffer.append(low);
                  i++; // the low surrogate has already been written
                }
              }
              if (problem) {
                // Lone surrogate or otherwise invalid XML character: keep escaping it.
                fStringBuffer.append("\\u" + toHexString(c, 4));
              }
            }
            else {
                fStringBuffer.append(c);
            }
        }
        return fStringBuffer;
    }
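
For trying the idea outside the importer, the same rule can be sketched without any Xerces or NekoHTML types. This is only an approximation under stated assumptions: the class SurrogateAwarePurifier and its methods are hypothetical, the validity check follows the Char production of XML 1.0, and the escape uses upper-case hex digits, which may differ from what the purifier's toHexString produces.

```java
public class SurrogateAwarePurifier {
    /** XML 1.0 Char production, applied to a whole code point. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies valid characters as-is and escapes everything else. */
    static String purify(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // codePointAt joins a well-formed surrogate pair into one code point;
            // an unpaired surrogate is returned unchanged and fails isXmlChar.
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Assumed escape format: backslash, 'u', four upper-case hex digits.
                out.append(String.format("\\u%04X", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Working on code points rather than on individual char values removes the need for the look-ahead, at the cost of not matching the XMLString-based signature of the real method.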

Discussion

  • Marc Guillemot

    Marc Guillemot - 2014-06-02

    Could you provide a unit test for this issue?

     
  • Radu Coravu

    Radu Coravu - 2014-06-02

    The test case would be something like:

    String in = "<html><p>\uD840\uDC00</p></html> ";
    String out = NEKO_API_HERE.import(new InputSource(
            new StringReader(in)));
    
    assertEquals("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + 
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n" + 
        "                      \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" + 
        "<html>\n" + 
        "    <head></head>\n" + 
        "    <body><p>\uD840\uDC00</p></body>\n" + 
        "</html>\n" + 
        "", out);
    
     
