Please see this link first.
There seems to be an issue with recent versions of HtmlCleaner's DomSerializer (new DomSerializer(cp, false)). I clean some HTML documents and export an org.w3c.dom.Document using a DomSerializer constructed with the following cp and false.
cp.setTransResCharsToNCR(false);
cp.setRecognizeUnicodeChars(true);
cp.setTransSpecialEntitiesToNCR(false);
Now when I call textNode.getNodeValue() to get a text node's value, &nbsp; is returned (while I need its Unicode value). To fix this I construct the DomSerializer with true as its second parameter. Now another problem arises: the ZWNJ character, although it was present as its Unicode value in the original document, is returned as &zwnj; when I call textNode.getNodeValue(). This works correctly with htmlcleaner-2.2 but fails with all newer versions up to the current one (2.8). Also note that when I create new DomSerializer(cp, true), the following code
((TagNode) cleaned.evaluateXPath(xpath)[0]).getText()
returns the proper text without an escaped &zwnj;, but I use another XPath library to obtain text nodes, and it depends on textNode.getNodeValue().
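For anyone hitting the same behaviour, one possible stopgap is to post-process the exported DOM and turn the leftover entity text back into the characters it stands for. The following is only a minimal sketch using the JDK's DOM API; the class name TextNodeEntityFixer, the two-entry entity table, and the sample document are assumptions made for illustration and are not part of HtmlCleaner. Note that a plain string replace like this would also rewrite text in which a literal "&zwnj;" was intended.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of HtmlCleaner.
final class TextNodeEntityFixer {
    // Assumed entity table: only the two entities mentioned in this report.
    private static final String[][] ENTITIES = {
        {"&zwnj;", "\u200C"},   // zero width non-joiner
        {"&nbsp;", "\u00A0"},   // no-break space
    };

    // Replaces each known entity reference with its Unicode character.
    static String unescape(String text) {
        String out = text;
        for (String[] entity : ENTITIES) {
            out = out.replace(entity[0], entity[1]);
        }
        return out;
    }

    // Recursively rewrites the value of every text node below (and including) node.
    static void fixTextNodes(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(unescape(node.getNodeValue()));
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            fixTextNodes(child);
        }
    }

    // Builds a tiny document whose text node holds literal entity text,
    // standing in for the DOM that the serializer produced.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            p.appendChild(doc.createTextNode("a&zwnj;b&nbsp;c"));
            doc.appendChild(p);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        fixTextNodes(doc);
        String value = doc.getDocumentElement().getFirstChild().getNodeValue();
        // Print the code points so the invisible characters can be checked.
        value.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Calling fixTextNodes(document) once after the export would leave getNodeValue() returning the Unicode characters, so an XPath library that reads text nodes directly needs no changes.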
Thanks for the report Mohsen. I think there are several problems in how character encoding is handled, due to the various ways different people want to use the output. See also the discussion on this bug:
https://sourceforge.net/p/htmlcleaner/bugs/118/
Scott, thanks for the quick response. Actually I do not see the point of the second parameter of DomSerializer. I believe the cp parameter should suffice for a serializer. When I specify cp.setRecognizeUnicodeChars(true), I expect it to convert any non-Unicode representations (such as NCRs or special entities) to their Unicode values. Perhaps another parameter is required for a DomSerializer to specify its output encoding.
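To make that expectation concrete, here is a small sketch of what converting NCRs to their Unicode values means for a string. It uses only java.util.regex and is an illustration of the expected result, not HtmlCleaner's implementation; the class name NcrDecoder and the digit limits in the pattern are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not HtmlCleaner code.
final class NcrDecoder {
    // Matches hexadecimal (&#x200C;) and decimal (&#8204;) numeric character references.
    // Digit counts are capped so that Integer.parseInt cannot overflow.
    private static final Pattern NCR =
        Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    // Replaces every valid NCR with the character it denotes; anything else is left as-is.
    static String decode(String input) {
        Matcher m = NCR.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                ? Integer.parseInt(m.group(1), 16)
                : Integer.parseInt(m.group(2));
            String replacement = Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : m.group();   // out-of-range reference: keep the original text
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("a&#8204;b&#xA0;c");
        // Print the code points so the invisible characters can be checked.
        decoded.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

With behaviour like this applied during cleaning, textNode.getNodeValue() would be expected to return U+200C and U+00A0 directly, whatever the serializer's second argument is.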
I'm not quite sure either why DomSerializer has an escapeXml parameter - none of the other serializers use it, they just interpret the current properties.
I'll have a go at the test case using ZWNJ and see if I can trace where things are getting confused.
Here is a test case. Note that I cannot get rid of both &zwnj; and &nbsp; in the output. This test passes with htmlcleaner-2.8.
Scott, is there any good news on this issue? Should I open a ticket for it?