From: Lars Lindner <lars.lindner@gm...> - 2005-01-16 02:07:34
Today I added code to load a DTD containing all HTML4
entities (Latin 1, special and Greek entities). About 40k
That's about 40k, but hey, I've cut the ChangeLog down
by about the same amount.
I think with this additional code all HTML escaping
cases are now covered. Anyone, please correct me if I'm wrong!
The cases I can think of:
ASCII text (with unescaped special chars, invalid XML)
  => allowed due to recovery mode
  => libxml2 removes all special chars
  => result is plain UTF-8 text (any HTML gets destroyed)
text in correct encoding/CDATA sections
  => handled by libxml2, converted to UTF-8
text with XML entities (double escaped HTML)
  => first level of escaping resolved by libxml2
  => second level of escaping resolved by unhtmlize()
     in the parsing routines
text with HTML entities (invalid XML)
  => handled by a special entity resolver, which
     will resolve/remove all these entities
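
To make the double-escaped and HTML-entity cases concrete, here is a
rough Python sketch. It is not the actual code (that is C on top of
libxml2): html.unescape() merely stands in for both the parser's entity
handling and unhtmlize(), and the function name resolve_title() and the
crude regex tag removal are assumptions made for illustration only.

```python
import html
import re

def resolve_title(raw: str) -> str:
    """Hypothetical stand-in for the title handling described above."""
    # First level: roughly what the XML parser resolves while parsing.
    once = html.unescape(raw)
    # Second level: roughly the unhtmlize() step for double escaped HTML.
    twice = html.unescape(once)
    # Crude tag removal so that only plain text remains (illustration only).
    return re.sub(r"<[^>]+>", "", twice).strip()

# Double escaped markup plus an HTML4 Greek entity:
print(resolve_title("Tom &amp;amp; Jerry &amp;lt;b&amp;gt;live&amp;lt;/b&amp;gt; &alpha;"))
```

Unescaping twice unconditionally is a simplification: a real
implementation has to know which level of escaping a given feed
actually used.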
This guarantees that the titles displayed in the feed and
item lists never contain escaped HTML or HTML entities.
To display them from the tree stores they are escaped again
and rendered as Pango layout markup.
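
The re-escaping step for markup can be sketched like this in Python.
Again, this is only an illustration under assumptions: to_markup() and
the bold-for-unread convention are hypothetical, and
xml.sax.saxutils.escape() is used here in place of whatever escaping
helper the real C code calls.

```python
from xml.sax.saxutils import escape

def to_markup(title: str, unread: bool = False) -> str:
    """Escape a plain-text title so it is safe inside markup."""
    # escape() replaces &, < and > so the title cannot be mistaken
    # for markup tags or entities.
    safe = escape(title)
    # Hypothetical convention: show unread items in bold.
    return f"<b>{safe}</b>" if unread else safe

print(to_markup("Tom & Jerry <live>", unread=True))
```

Because the stored title is already plain text, a single escape pass is
enough here, and any literal "&" or "<" in a headline survives the
round trip.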
If anyone has time, please test the new CVS code
and have a close look at the encoding of headlines
and item contents (especially for Atom feeds). Thanks!