CyberNeko HTML Parser / Bugs / #140 Writer does not mark up characters not supported by encoding

#140 Writer does not mark up characters not supported by encoding

Milestone: 1.9.15

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2012-07-06

Created: 2012-07-06

Creator: Donald Fraser

Private: No

If your original XHTML included Unicode character entities, for example &#12288; (U+3000), the current Writer implementation fails to re-encode them as entities when writing to a target encoding that cannot represent those characters (example: ISO-8859-1). The character is written as ? instead.
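
To illustrate the behaviour I would expect, here is a minimal sketch. It is not the attached fix and not NekoHTML code; the class name EntityEscapeSketch and the helper escapeUnencodable are made up for illustration. It assumes the standard java.nio.charset.CharsetEncoder.canEncode check is an acceptable way to decide whether a character needs to be written as a numeric character reference.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EntityEscapeSketch {

    // Hypothetical helper (not part of NekoHTML): replaces every code point
    // that the target encoding cannot represent with a decimal character
    // reference such as &#12288;
    static String escapeUnencodable(String text, String encoding) {
        CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String piece = text.substring(i, i + length);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a\u3000b";
        Charset latin1 = Charset.forName("ISO-8859-1");

        // Encoding directly loses the character: the encoder substitutes
        // its replacement byte for anything it cannot map.
        // (String.getBytes(Charset) requires Java 6 or later.)
        System.out.println(new String(text.getBytes(latin1), latin1));

        // Escaping first keeps the information as a character reference.
        System.out.println(escapeUnencodable(text, "ISO-8859-1"));
    }
}
```

Characters that the target encoding can represent pass through unchanged, so for ASCII-only content the output is identical to the input.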

Discussion

Donald Fraser - 2012-07-06

Updated Writer class with bug fix.

Writer.java

Marc Guillemot - 2012-10-05

Could you provide a unit test?

It seems to me that your fix uses classes that were first introduced in Java 6, which makes it unusable for the current target of NekoHTML.

Donald Fraser - 2012-10-08

The code I provided compiles against Java 5, so there are no Java 6 dependencies.
Which parts do you think are Java 6 specific?
I will produce a unit test case and post it later in the week.

Donald Fraser - 2012-10-08

I have attached the unit test.
Is there a reason to be supporting such old versions of Java?
1.5 was no longer supported by Sun in 2009!

Can't see how one can attach further files to this bug so am in-lining the test-case.

Method to be placed in class
WriterTest.java

package:
package org.cyberneko.html.filters;

// Imports needed by the method below:
//   java.io.ByteArrayInputStream, java.io.ByteArrayOutputStream, java.io.InputStream
//   org.apache.xerces.xni.parser.XMLDocumentFilter
//   org.apache.xerces.xni.parser.XMLInputSource
//   org.apache.xerces.xni.parser.XMLParserConfiguration
//   org.cyberneko.html.HTMLConfiguration

/**
 * Regression test for bug: Writer does not escape characters that cannot
 * be represented by 7 or 8 bit character sets.
 * http://sourceforge.net/support/tracker.php?aid=3540875
 */
public void testEncoderOutputOfEntities() throws Exception {

    final String content = "<html><head></head><body>Here is some HTML that includes an encoded entity &#12288; </body></html>";
    // use an explicit charset so the test does not depend on the platform default
    final InputStream inputStream = new ByteArrayInputStream(content.getBytes("ISO-8859-1"));

    final ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

    final XMLDocumentFilter[] filters = {
        new Purifier(),
        new org.cyberneko.html.filters.Writer(outputStream, "ISO-8859-1")
    };

    // create HTML parser
    final XMLParserConfiguration parser = new HTMLConfiguration();
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

    // the input contains only ASCII characters, nothing specific to "ISO-8859-1",
    // so any ASCII based encoding will do when reading the input
    final XMLInputSource source = new XMLInputSource(null, "currentUrl", null, inputStream, "ISO-8859-1");

    try {
        parser.parse(source);
    } finally {
        inputStream.close();
    }

    final String test = outputStream.toString("ISO-8859-1");
    if (!test.equals(content)) {
        throw new Exception("Expecting parsed HTML content to be the same but they are not. Entity &#12288; has become ?");
    }
}

Donald Fraser - 2012-10-08

I did a quick check on the classes I have used. If I drop StringBuilder I can get support back to 1.4. Going back as far as 1.3 would mean a lot of work, but is not impossible...
