I'm using CSSBox to do comparison of style changes in a document
repository.
It's a fantastic tool!
My only problem is that we use unicode in strings extensively, like so:
When I run the html document through the ComputeStyles Demo or a
derivative applet, I get
Which for my comparison is perfect, except that the unicode char "–" in
the source doc has been replaced with its &#xnnn; entity in the output doc.
I can't find a doNotEscapeUnicode type function in the DomSource parser
where I think it is happening.
Any help would be appreciated.
thanks,
Scott
--
He who knows others is wise;
He who knows himself is enlightened.
Lao-tzu
Last edit: scott derrick 2014-03-06
I have done several tests and it seems that the unicode characters are parsed correctly (using nekohtml-1.9.19 parser) and even displayed correctly by the BoxBrowser demo.
However, there seems to be a problem in the output provided by the NormalOutput class. This is just a very simple implementation of the DOM tree serialization used for the demos only. For a serious application, you should probably use a different way of DOM serialization.
Radek,
The general advice is to replace the java.io.PrintStream with a custom output filter stream that converts specific Unicode chars into their &#xnnn; representation when writing the doc out. I think the NormalOutput class is doing its job as it should.
The other possibility is to subclass the InputStream used by the DefaultDocumentSource, see whether the escaping of &#xnnn; unicode char entities is happening there, and defeat it.
It would seem that stopping the escaping in the first place would be the best way, but maybe the Xerces parser expects a UTF-8 char stream?
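The filter-stream idea could be sketched roughly like this (EntityEscapingWriter is a hypothetical name, not a CSSBox class); it rewrites every character above U+007F as a &#xnnn; entity on the way out:

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical sketch: escape all non-ASCII characters as &#xnnn; entities
// while passing ASCII through unchanged.
public class EntityEscapingWriter extends FilterWriter {

    public EntityEscapingWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        if (c > 0x7F) {
            out.write("&#x" + Integer.toHexString(c) + ";");
        } else {
            out.write(c);
        }
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(cbuf[i]);
        }
    }

    @Override
    public void write(String str, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(str.charAt(i));
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sw = new StringWriter();
        try (Writer w = new EntityEscapingWriter(sw)) {
            w.write("a\u2013b"); // en dash between two letters
        }
        System.out.println(sw); // prints a&#x2013;b
    }
}
```

NormalOutput could then be handed such a wrapper instead of the raw stream, leaving the serializer itself untouched.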
Scott
Radek,
I ended up replacing NormalOutput with a new class, UnicodeOutput, which passes all the text nodes through a string replacement class. The replacement class has a map of unicode chars to their related &#xnnn; entities and replaces them in the text node String before handing it to the output PrintStream.
It's crude, but it gets me along further...
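A minimal sketch of that replacement step might look like the following (UnicodeReplacer and the particular map entries are my own illustration, not the actual class):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: map selected unicode characters to their &#xnnn;
// entities and replace them in a text node's string before printing it.
public class UnicodeReplacer {

    private final Map<Character, String> entities = new LinkedHashMap<>();

    public UnicodeReplacer() {
        entities.put('\u2013', "&#x2013;"); // en dash
        entities.put('\u2014', "&#x2014;"); // em dash
        entities.put('\u00A0', "&#xa0;");   // non-breaking space
    }

    public String replace(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String entity = entities.get(c);
            if (entity != null) {
                sb.append(entity);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The serializer would call replace() on each text node's value and pass the result to the PrintStream; characters not in the map go through untouched.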
thanks,
Scott
It seems that I have solved the issue by configuring the output streams and writers correctly (without using entities). Apparently, I had been using a combination of writers that caused the unicode chars to be replaced. I still don't understand this part of Java deeply, but the ComputeStyles demo now produces correct unicode output for me. The corresponding commit is here:
https://github.com/radkovo/CSSBox/commit/ff63da6c4cfb314fae97c47f0d1136f0b767751c
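For reference, the general pattern behind this kind of fix is to pass an explicit charset at every stream-to-writer boundary instead of relying on the platform default encoding, which can silently mangle characters it cannot represent. A small illustration of the pattern (not the actual commit):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        // Explicit UTF-8 at the stream/writer boundary: the en dash
        // survives as the three-byte UTF-8 sequence E2 80 93 instead of
        // being garbled by the platform default charset.
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(buf, StandardCharsets.UTF_8), true);
        out.print("a\u2013b");
        out.flush();
        String roundTripped = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals("a\u2013b")); // prints true
    }
}
```

Decoding the bytes back with the same charset round-trips the character exactly, which is the behaviour the demo output needed.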
Radek