
OutputDocument always encodes characters above 127 in attributes

Antony
2013-02-19
2013-02-21
  • Antony

    Antony - 2013-02-19

    I am fixing up some Thai and Chinese web pages that contain invalid links. The pages are encoded in UTF-16 and the text is in Thai or Chinese. When I change an href to add some Thai or Chinese text, the output is written entirely as numeric character references. I don't want this; I want the output to contain the Thai/Chinese characters themselves.

    The call stack is

    OutputDocumentSegment.appendTo
     AttributesOutputSegment.appendTo
      Attributes.appendHTML()
       Attribute.appendHTML()
        CharacterReference.appendEncode()
         CharacterReference.appendEncodeCheckForWhiteSpaceFormatting()
          CharacterReference.appendDecimalCharacterReferenceString()
    

    and I end up with a href="&#3617;&#3633;&#3621;..." rather than the Thai text มัล. Is it possible to avoid encoding these as numeric references? Why is it done this way? All other parts of the document are written without encoding, i.e. the data is written as is, and no existing attributes are re-encoded.
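
    To make the symptom concrete, here is a minimal sketch of what "encode everything above 127 as a decimal character reference" does to the Thai text มัล. The class and method names (CharRefSketch, encodeAbove127) are hypothetical and are not Jericho code; the sketch only mimics the observed output and does not escape markup characters such as & or ".

```java
// Hypothetical helper, NOT part of Jericho: it only imitates the observed
// behaviour of writing every character above 127 as a decimal reference.
final class CharRefSketch {

    // Returns the text with each code point above 127 replaced by &#NNNN;.
    // ASCII characters are copied through unchanged (no &, <, " escaping here).
    static String encodeAbove127(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 127) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u0E21\u0E31\u0E25 is the Thai text from the post (U+0E21, U+0E31, U+0E25).
        System.out.println(encodeAbove127("\u0E21\u0E31\u0E25"));
        // prints &#3617;&#3633;&#3621;
    }
}
```

    A browser decodes both forms to the same link target, so the output is not wrong, but it is much harder to read and edit by hand.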

     
  • Martin Jericho

    Martin Jericho - 2013-02-19

    Hi Antony,

    I have now fixed this issue in version 3.4.

    Until version 3.4 is officially released, the development version is available here:
    http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip

    The static Config.CurrentCharacterReferenceEncodingBehaviour property can be used to revert to the old behaviour, but the old behaviour was only really appropriate when using 7-bit ASCII encoding or non-Unicode character sets.
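
    As a rough illustration of why the right choice depends on the output encoding, the sketch below encodes a character as a numeric reference only when the target charset cannot represent it. This is an assumption-laden illustration using only the JDK; CharsetAwareEncoder and its encode method are hypothetical names, and this is not how Jericho implements or configures its behaviour.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, NOT Jericho's implementation: emit a numeric character
// reference only for code points the output charset cannot represent.
final class CharsetAwareEncoder {

    static String encode(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                sb.append(ch);                           // representable: write as is
            } else {
                sb.append("&#").append(cp).append(';');  // fall back to a reference
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String thai = "\u0E21\u0E31\u0E25";
        // With US-ASCII every Thai character needs a reference.
        System.out.println(encode(thai, StandardCharsets.US_ASCII));
        // With UTF-16 the characters can be written directly.
        System.out.println(encode(thai, StandardCharsets.UTF_16));
    }
}
```

    With a 7-bit ASCII target the references are the only way to keep the characters, which is the case the old behaviour suited; with a Unicode encoding such as UTF-16 they are unnecessary.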

    Cheers
    Martin

     
  • Antony

    Antony - 2013-02-21

    Thanks for the quick response Martin!
    Antony

     
