If your original XHTML included Unicode entities, for example,   the current implementation, when writing to an encoding that does not support 16-bit characters (for example, ISO-8859-1), fails to re-encode those entities that cannot be represented by the target encoding.
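To illustrate the behaviour being asked for, here is a minimal sketch. It is not the attached fix and not NekoHTML code: the class name EntityEscapeSketch and the method escapeUnencodable are hypothetical, and U+2003 is just an arbitrary character that ISO-8859-1 cannot represent. The idea is to ask the target charset's encoder whether it can encode each character and, if not, write a numeric character reference instead of letting the encoder substitute "?". It uses only Java 5 APIs.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for illustration only; not part of NekoHTML.
class EntityEscapeSketch {

    /**
     * Returns text with every character that the named charset cannot
     * encode replaced by a decimal numeric character reference.
     */
    static String escapeUnencodable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Work on whole code points so surrogate pairs stay together.
            int codePoint = text.codePointAt(i);
            int width = Character.charCount(codePoint);
            String piece = text.substring(i, i + width);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+2003 is not representable in ISO-8859-1, so it is escaped.
        System.out.println(escapeUnencodable("a\u2003b", "ISO-8859-1")); // prints a&#8195;b
    }
}
```

Characters the target encoding can represent pass through unchanged, so with ISO-8859-1 a character such as U+00E9 is written as-is and only the unrepresentable ones become references.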
Updated Writer class with bug fix.
Could you provide a unit test?
It seems to me that your fix uses classes that first became available in Java 6, which makes it unusable for NekoHTML's current target.
The code I provided compiles against Java 5 so there are no Java 6 dependencies.
What parts do you think are Java 6 specific?
I will produce a unit test case and post it later this week.
I have attached the unit test.
Is there a reason to be supporting such old versions of Java?
1.5 was no longer supported by Sun in 2009!
I can't see how to attach further files to this bug, so I am inlining the test case.
Method to be placed in class WriterTest (WriterTest.java), in package org.cyberneko.html.filters:
/**
 * Regression test for bug: Writer does not handle encoding characters that cannot
 * be represented by 7 or 8 bit character sets.
 * http://sourceforge.net/support/tracker.php?aid=3540875
 */
public void testEncoderOutputOfEntities() throws Exception {
    final String content = "<html><head></head><body>Here is some HTML that includes an encoded entity   </body></html>";
    // use an explicit charset so the test does not depend on the platform default
    final InputStream inputStream = new ByteArrayInputStream(content.getBytes("US-ASCII"));
    final ByteArrayOutputStream outputstream = new ByteArrayOutputStream();
    final XMLDocumentFilter[] filters = {
        new Purifier(),
        new org.cyberneko.html.filters.Writer(outputstream, "ISO-8859-1")
    };

    // create HTML parser
    final XMLParserConfiguration parser = new HTMLConfiguration();
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

    // there are no "ISO-8859-1" specific characters in the input, only ASCII, so any
    // ASCII-based encoding will do when reading the input
    final XMLInputSource source = new XMLInputSource(null, "currentUrl", null, inputStream, "ISO-8859-1");
    try {
        parser.parse(source);
    } finally {
        inputStream.close();
    }

    // the written output should round-trip to exactly the original markup
    final String test = outputstream.toString("ISO-8859-1");
    if (!test.equals(content)) {
        throw new Exception("Expecting parsed HTML content to be the same but they are not. Entity   has become ?");
    }
}
I did a quick check on the classes I have used. If I drop StringBuilder I can get support back to 1.4. Going back as far as 1.3 would mean a lot of work, but it is not impossible...