I am fixing up some Thai and Chinese web pages that contain invalid links. The pages are encoded as UTF-16 and the text is in Thai or Chinese. When I change an href to add some Thai or Chinese text, the new value is written out entirely as numeric character references. I don't want this; I want the output to contain the actual Thai/Chinese characters.
The call stack is
and I end up with href="&#3617;&#3633;&#3621;..." rather than the Thai text มัล. Is it possible to avoid encoding these as numeric character references? Why is it done this way? All other parts of the document are written without encoding, i.e. the data is written as-is and no existing attributes are re-encoded.
Hi Antony,
I have now fixed this issue in version 3.4.
Until version 3.4 is officially released, the development version is available here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip
The static Config.CurrentCharacterReferenceEncodingBehaviour property can be used to revert to the old behaviour, but that behaviour was only really appropriate when writing output in 7-bit ASCII or another non-Unicode character set.
Cheers
Martin
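For readers following along, the difference between the two behaviours discussed above can be illustrated with a small standalone sketch. This is not Jericho's code and does not use its API; the class name CharRefSketch and the methods encodeAsciiOnly and encodeUnicodeAware are hypothetical, and the rule "encode every code point above 0x7F" is only an assumption about what the old behaviour roughly amounted to.

```java
// Illustrative sketch only: NOT Jericho's implementation. All names are hypothetical.
class CharRefSketch {

    // Assumed "old" style: escape markup characters and turn every non-ASCII
    // code point into a decimal numeric character reference. This is safe when
    // the output encoding cannot represent the character (e.g. 7-bit ASCII).
    static String encodeAsciiOnly(String value) {
        StringBuilder out = new StringBuilder();
        value.codePoints().forEach(cp -> {
            if (!appendMarkupEscape(out, cp)) {
                if (cp > 0x7F) {
                    out.append("&#").append(cp).append(';');
                } else {
                    out.appendCodePoint(cp);
                }
            }
        });
        return out.toString();
    }

    // Assumed "new" style: escape only the characters that are significant in
    // markup and leave everything else untouched, which suits UTF-8/UTF-16 output.
    static String encodeUnicodeAware(String value) {
        StringBuilder out = new StringBuilder();
        value.codePoints().forEach(cp -> {
            if (!appendMarkupEscape(out, cp)) {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    // Appends an entity for &, <, > or " and reports whether it did so.
    private static boolean appendMarkupEscape(StringBuilder out, int cp) {
        switch (cp) {
            case '&': out.append("&amp;"); return true;
            case '<': out.append("&lt;"); return true;
            case '>': out.append("&gt;"); return true;
            case '"': out.append("&quot;"); return true;
            default: return false;
        }
    }

    public static void main(String[] args) {
        // Thai letters U+0E21 U+0E31 U+0E25, written as escapes so the source
        // file compiles regardless of its own encoding.
        String href = "\u0E21\u0E31\u0E25?a=1&b=2";
        System.out.println(encodeAsciiOnly(href));
        System.out.println(encodeUnicodeAware(href));
    }
}
```

With a UTF-16 page, the second form keeps the attribute value readable and consistent with the rest of the document, while still escaping the ampersand in the query string.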
Thanks for the quick response Martin!
Antony