#140 Writer does not mark up characters not supported by encoding

Milestone: 1.9.15
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-07-06
Created: 2012-07-06
Private: No

If your original XHTML includes Unicode character references, for example &#x3000;, the current implementation fails to re-encode them when writing to an encoding that does not support 16-bit characters (example: ISO-8859-1). Characters that the target encoding cannot represent should be written out as character references again, but they are not.
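
A minimal sketch of the behaviour the report asks for, not the attached fix: before writing each character, ask a CharsetEncoder for the target encoding whether it can encode that character, and emit a numeric character reference when it cannot. The class and method names (EntityEscaper, escape) are hypothetical and are not part of NekoHTML; the sketch uses only Java 5 APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of NekoHTML.
public class EntityEscaper {

    /**
     * Returns text with every character that the named encoding cannot
     * represent replaced by a hexadecimal character reference (&#x...;).
     */
    public static String escape(String text, String encoding) {
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
        StringBuffer out = new StringBuffer(text.length());
        int i = 0;
        while (i < text.length()) {
            // Work on whole code points so surrogate pairs stay together.
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String chunk = text.substring(i, i + length);
            if (encoder.canEncode(chunk)) {
                out.append(chunk);
            } else {
                // Note: an unpaired surrogate also ends up here; a real
                // writer would need to decide how to handle that case.
                out.append("&#x")
                   .append(Integer.toHexString(codePoint).toUpperCase())
                   .append(';');
            }
            i += length;
        }
        return out.toString();
    }
}
```

For example, EntityEscaper.escape("a\u3000b", "ISO-8859-1") returns "a&#x3000;b", while a character such as U+00E9, which ISO-8859-1 can represent, passes through unchanged.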

Discussion

  • Donald Fraser

    Donald Fraser - 2012-07-06

    Updated Writer class with bug fix.

     
  • Marc Guillemot

    Marc Guillemot - 2012-10-05

    Could you provide a unit test?

    It seems to me that your fix uses classes that have only been available since Java 6, which cannot be used with NekoHTML's current target Java version.

     
  • Donald Fraser

    Donald Fraser - 2012-10-08

    The code I provided compiles against Java 5 so there are no Java 6 dependencies.
    What parts do you think are Java 6 specific?
    I will produce a unit test case and post it later this week.

     
  • Donald Fraser

    Donald Fraser - 2012-10-08

    I have attached the unit test.
    Is there a reason to be supporting such old versions of Java?
    1.5 was no longer supported by Sun in 2009!

    I can't see how to attach further files to this bug, so I am inlining the test case.

    Method to be placed in class:
    WriterTest.java

    Package:
    package org.cyberneko.html.filters;

    /**
     * Regression test for bug: Writer does not handle characters that cannot
     * be represented by 7- or 8-bit character sets.
     * http://sourceforge.net/support/tracker.php?aid=3540875
     */
    public void testEncoderOutputOfEntities() throws Exception {

        final String content = "<html><head></head><body>Here is some HTML that includes an encoded entity &#x3000; </body></html>";
        // Use an explicit charset rather than the platform default when creating the input bytes
        final InputStream inputStream = new ByteArrayInputStream(content.getBytes("ISO-8859-1"));

        final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

        final XMLDocumentFilter[] filters = {
            new Purifier(),
            new org.cyberneko.html.filters.Writer(outputStream, "ISO-8859-1")
        };

        // create HTML parser
        final XMLParserConfiguration parser = new HTMLConfiguration();
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
        parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

        // there are no "ISO-8859-1" specific characters in the input, only ASCII, so any
        // ASCII-based encoding will do when reading the input
        final XMLInputSource source = new XMLInputSource(null, "currentUrl", null, inputStream, "ISO-8859-1");

        parser.parse(source);
        inputStream.close();

        // decode the written bytes with the same encoding the Writer was given
        final String test = outputStream.toString("ISO-8859-1");
        if (!test.equals(content)) {
            throw new Exception("Expecting parsed HTML content to be the same but they are not. Entity &#x3000; has become ?");
        }
    }
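
    The "?" mentioned in the failure message is what Java's standard encoders produce for unmappable characters. The following is a small sketch of that symptom in isolation, assuming the Writer filter ultimately writes through an encoder with the default replacement behaviour (the thread does not confirm how the filter is implemented). The class name LossyEncodingDemo and the method roundTrip are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical demo class, not part of NekoHTML.
public class LossyEncodingDemo {

    /** Writes text through an OutputStreamWriter and decodes the bytes again. */
    public static String roundTrip(String text, String encoding) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, encoding);
        writer.write(text);
        writer.close();
        return bytes.toString(encoding);
    }

    public static void main(String[] args) throws Exception {
        // U+3000 has no ISO-8859-1 representation, so the encoder
        // silently substitutes its replacement byte.
        System.out.println(roundTrip("x\u3000y", "ISO-8859-1")); // prints x?y
    }
}
```

    Because the substitution is silent, the original character cannot be recovered from the output, which is why the character has to be turned into a reference before it reaches the encoder.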

     
  • Donald Fraser

    Donald Fraser - 2012-10-08

    I did a quick check on the classes I have used. If I drop StringBuilder I can get support back to Java 1.4. Going back as far as 1.3 would mean a lot of work, but it would not be impossible...

     
