CSSBox / Discussion / CSSBox Discussion: Stop escaping unicode chars in string

scott derrick - 2014-03-06

I'm using CSSBox to do comparison of style changes in a document
repository.

It's a fantastic tool!

My only problem is that we use Unicode characters in strings extensively, like so:

<span class="italic">Miscellaneous Writings 1883–1896</span>

When I run the HTML document through the ComputeStyles demo or a
derivative applet, I get:

<span class="italic" style="color: #000000;font-family: serif;font-style: italic;letter-spacing: normal;line-height: 1.4em;">Miscellaneous Writings 1883–1896</span>

Which for my comparison is perfect except the unicode char "–" in
the source doc has been replaced with "–" in the output doc.

I can't find a doNotEscapeUnicode-type function in the DomSource parser,
which is where I think it is happening.

Any help would be appreciated.

thanks,

Scott

--
He who knows others is wise;
He who knows himself is enlightened.
Lao-tzu

Last edit: scott derrick 2014-03-06
- Radek Burget - 2014-03-06
  
  I have done several tests and it seems that the unicode characters are parsed correctly (using nekohtml-1.9.19 parser) and even displayed correctly by the BoxBrowser demo.
  
  However, there seems to be a problem in the output provided by the NormalOutput class. This is just a very simple implementation of the DOM tree serialization used for the demos only. For a serious application, you should probably use a different way of DOM serialization.
  
  - scott derrick - 2014-03-06
    
    Radek,
    
    The general advice is to replace the java.io.PrintStream with a custom output filter stream that converts specific UTF-8 chars into their &#xnnn; representation when writing the doc out. I think the NormalOutput class is doing its job as it should.
    
    The other possibility is to subclass the InputStream used by the DefaultDocumentSource, see whether the escaping of &#xnnn; Unicode char entities is happening there, and defeat it.
    
    It would seem that stopping the escaping in the first place would be the best way, but maybe the Xerces parser expects a UTF-8 char stream?
    
    Scott
    

scott derrick - 2014-03-06

Radek,

I ended up replacing NormalOutput with a new class, UnicodeOutput, which passes all the text nodes through a string replacement class. The string replacement class has a map of Unicode chars to their related &#xnnn; entities and replaces them in the text node String before handing it to the output PrintStream.
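
The replacement step described above can be sketched roughly as follows. This is a minimal sketch, not the UnicodeOutput class itself: the class name EntityEscaper and the method escapeNonAscii are hypothetical. Instead of a hand-maintained map it applies one generic rule, turning every code point above U+007F into a hexadecimal &#x...; reference. It deliberately leaves &, < and > alone.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of CSSBox.
final class EntityEscaper {

    private EntityEscaper() {
    }

    // Replaces every non-ASCII code point in the text with a hexadecimal
    // numeric character reference such as &#x2013; and copies ASCII as-is.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Work on code points so that surrogate pairs become one reference.
            int cp = text.codePointAt(i);
            if (cp > 0x7F) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, EntityEscaper.escapeNonAscii("Miscellaneous Writings 1883\u20131896") returns "Miscellaneous Writings 1883&#x2013;1896". A map-based version, as described in the post, would give finer control over which characters are rewritten.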

It's crude, but it gets me along further...

thanks,

Scott


Radek Burget - 2014-03-07

It seems that I have solved the issue by configuring the output streams and writers correctly (without using entities). Apparently, I had been using a combination of writers that caused the Unicode chars to be replaced. I still don't understand this part of Java deeply, but the ComputeStyles demo now produces correct Unicode output for me. The corresponding commit is here:
https://github.com/radkovo/CSSBox/commit/ff63da6c4cfb314fae97c47f0d1136f0b767751c
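
For illustration, the general idea of fixing the charset on the writer chain can be sketched as below. This is an assumption about the kind of change involved, not a copy of the linked commit; the class name Utf8OutputSketch and the method render are hypothetical. The writer that turns chars into bytes is given UTF-8 explicitly instead of falling back to the platform default charset, which is a common cause of non-ASCII characters being replaced on output.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch for illustration only; not the code from the commit.
final class Utf8OutputSketch {

    private Utf8OutputSketch() {
    }

    // Writes the text through a PrintWriter whose underlying
    // OutputStreamWriter is pinned to UTF-8, and returns the raw bytes.
    static byte[] render(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(bytes, StandardCharsets.UTF_8));
        out.print(text);
        out.flush(); // push buffered chars through the encoder
        return bytes.toByteArray();
    }
}
```

In a real program the ByteArrayOutputStream would be replaced by the target stream, for example a FileOutputStream or System.out. Whatever reads the output must also decode it as UTF-8.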

Radek
