From: Mayrgundter, P. <pma...@do...> - 2004-08-12 13:47:41
Hi again,

Looks like my last message was munged (at least for me) when I tried to include the #0 entity.

Anyway, JTidy converts a null char in source HTML to an invalid XML entity on output, thus breaking the XHTML and XML output modes for documents with that char. In fact, many control characters shouldn't be allowed into output XML documents.

Here's my planned fix. If in XML or XHTML output mode, allow only valid XML characters through the lexing stage, and drop everything else. The valid range for XML chars is:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

(see: http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char)

XML entities follow the same rule; they can only represent chars in this range.

Since invalid XML chars don't really have any purpose in well-formed XML and were probably included in the source document by mistake (e.g. the null char), dropping them seems fine.

I'll be preparing the patch soon and will make it available on this list.

Cheers,
Pablo Mayrgundter
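
For illustration only, the filtering rule described in the message above could be sketched roughly as follows. This is not the promised patch and not JTidy code: the class name XmlCharFilter and both method names are hypothetical, and the sketch assumes a Java version with code point support (Java 5 or later) rather than whatever JTidy targeted at the time. It simply checks each code point against the Char production quoted above and drops anything outside it.

```java
// Hypothetical helper, NOT part of JTidy: a sketch of the
// "drop anything that is not a valid XML char" rule.
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every invalid code point removed. */
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as-is and fails the range check.
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Under these assumptions, a call such as XmlCharFilter.stripInvalidXmlChars(text) would be applied only when XML or XHTML output mode is selected, leaving plain HTML output untouched. Where exactly such a check would hook into the lexer is left open here.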