Stop escaping unicode chars in string

2014-03-06
2014-03-07
  • scott derrick

    scott derrick - 2014-03-06

    I'm using CSSBox to do comparison of style changes in a document
    repository.

    It's a fantastic tool!

    My only problem is that we use unicode chars in strings extensively, like so:

    <span class="italic">Miscellaneous Writings 1883&#x2013;1896</span>
    

    When I run the html document through the ComputeStyles Demo or a
    derivative applet, I get

    <span class="italic" style="color: #000000;font-family: 
    serif;font-style: italic;letter-spacing: normal;line-height: 
    1.4em;">Miscellaneous Writings 1883–1896</span>
    

    Which for my comparison is perfect, except that the unicode char
    "&#x2013;" in the source doc has been replaced with "–" in the output doc.

    I can't find a doNotEscapeUnicode type function in the DomSource parser
    where I think it is happening.

    Any help would be appreciated.

    thanks,

    Scott

    --
    He who knows others is wise;
    He who knows himself is enlightened.
    Lao-tzu

     

    Last edit: scott derrick 2014-03-06
    • Radek Burget

      Radek Burget - 2014-03-06

      I have done several tests and it seems that the unicode characters are parsed correctly (using nekohtml-1.9.19 parser) and even displayed correctly by the BoxBrowser demo.

      However, there seems to be a problem in the output provided by the NormalOutput class. This is just a very simple implementation of the DOM tree serialization used for the demos only. For a serious application, you should probably use a different way of DOM serialization.
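
      One possible alternative to NormalOutput is the standard javax.xml.transform API, which can serialize any org.w3c.dom.Document. The following is only a minimal sketch under that assumption: the class DomSerializer and its methods are hypothetical and not part of CSSBox. The requested output encoding decides whether a character such as the en dash is written as-is (UTF-8) or has to be expressed some other way because the encoding cannot represent it (US-ASCII).

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of CSSBox.
class DomSerializer {

    // Parses a well-formed XML/XHTML fragment into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // Serializes the DOM tree with the JAXP identity transformer,
    // using the given output encoding (for example "UTF-8" or "US-ASCII").
    static String serialize(Document doc, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(bytes));
            return bytes.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<span class=\"italic\">Miscellaneous Writings 1883&#x2013;1896</span>");
        // UTF-8 can represent the en dash directly.
        System.out.println(serialize(doc, "UTF-8"));
        // US-ASCII cannot, so the serializer must fall back to a character reference.
        System.out.println(serialize(doc, "US-ASCII"));
    }
}
```

      With the JDK's built-in serializer, choosing an ASCII-only output encoding is one way to get character references back without writing any replacement code, although the exact form of the reference is up to the serializer.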

       
      • scott derrick

        scott derrick - 2014-03-06

        Radek,

        The general advice is to replace the java.io.PrintStream with a custom output filter stream that converts specific UTF-8 chars into their &#xnnn; representation when writing the doc out. I think the NormalOutput class is doing its job as it should.
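
        A minimal sketch of such an output filter, written as a java.io.FilterWriter rather than a byte stream so that it can work on characters directly. The class name EntityEscapingWriter and the helper escape are hypothetical, and the sketch escapes every non-ASCII char rather than a selected set. It only handles chars in the Basic Multilingual Plane; supplementary characters (surrogate pairs) would need code point handling.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical class for illustration; not part of CSSBox.
class EntityEscapingWriter extends FilterWriter {

    EntityEscapingWriter(Writer out) {
        super(out);
    }

    @Override
    public void write(int c) throws IOException {
        if (c > 0x7F) {
            // Non-ASCII char: emit a hexadecimal character reference, e.g. &#x2013;
            out.write("&#x" + Integer.toHexString(c).toUpperCase() + ";");
        } else {
            out.write(c);
        }
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(buf[i]);
        }
    }

    @Override
    public void write(String s, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(s.charAt(i));
        }
    }

    // Convenience helper: runs a string through the filter and returns the result.
    static String escape(String text) {
        StringWriter sink = new StringWriter();
        try (Writer w = new EntityEscapingWriter(sink)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Miscellaneous Writings 1883\u20131896"));
    }
}
```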

        The other possibility is to subclass the InputStream used by the DefaultDocumentSource and see if the escaping of &#xnnn; unicode char entities is happening there and defeat it.

        It would seem that stopping the escaping in the first place would be the best way, but maybe the Xerces parser expects a UTF-8 char stream?

        Scott

         
  • scott derrick

    scott derrick - 2014-03-06

    Radek,

    I ended up replacing NormalOutput with a new class UnicodeOutput, which passes all the text nodes through a string replacement class. The string replacement class has a map of unicode chars to their related &#xnnn; entity, and replaces them in the text node String before handing that to the output PrintStream.

    It's crude but gets me along further...
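
    The UnicodeOutput source is not shown in this thread, so the following is only a guess at what a map-based string replacement class like the one described above might look like. The class name UnicodeEntityMap and the handful of map entries are assumptions chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the string replacement class; not the actual code.
class UnicodeEntityMap {

    // Map of unicode chars to their &#xnnn; entity (illustrative entries only).
    private static final Map<Character, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put('\u2013', "&#x2013;"); // en dash
        ENTITIES.put('\u2014', "&#x2014;"); // em dash
        ENTITIES.put('\u2018', "&#x2018;"); // left single quotation mark
        ENTITIES.put('\u2019', "&#x2019;"); // right single quotation mark
    }

    // Replaces every mapped char in a text node's string with its entity;
    // chars that are not in the map are copied through unchanged.
    static String replace(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String entity = ENTITIES.get(c);
            if (entity != null) {
                sb.append(entity);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("Miscellaneous Writings 1883\u20131896"));
    }
}
```

    A fixed map only restores the chars it lists, which keeps the output predictable for a comparison but means any unlisted char still comes out unescaped.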

    thanks,

    Scott

     
  • Radek Burget

    Radek Burget - 2014-03-07

    It seems that I have solved the issue by configuring the output streams and writers correctly (without using entities). Apparently, I had been using a combination of writers that caused the unicode chars to be replaced. I still don't understand this part of Java deeply, but the ComputeStyles demo now produces correct unicode output for me. The corresponding commit is here:
    https://github.com/radkovo/CSSBox/commit/ff63da6c4cfb314fae97c47f0d1136f0b767751c
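
    The linked commit is not reproduced here. As a general illustration of the idea only, the sketch below wraps an output stream in a PrintStream with an explicitly named charset instead of relying on the platform default. The class name Utf8OutputSketch is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical class for illustration; not taken from the CSSBox commit.
class Utf8OutputSketch {

    // Prints the text through a PrintStream that uses the named charset
    // and returns the bytes that were actually written.
    static byte[] print(String text, String charsetName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(bytes, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "Miscellaneous Writings 1883\u20131896";
        // With UTF-8 the en dash survives a round trip through the stream.
        byte[] utf8 = print(text, "UTF-8");
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));
        // A charset that cannot encode the en dash substitutes another char for it.
        byte[] ascii = print(text, "US-ASCII");
        System.out.println(new String(ascii, StandardCharsets.US_ASCII).equals(text));
    }
}
```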

    Radek

     
