From: Mayrgundter, P. <pma...@do...> - 2004-08-12 13:47:41

Hi again,

Looks like my last message was munged (at least for me) when I tried to include the #0 entity.

Anyways, JTidy converts a null char in source HTML to an invalid XML entity on output, thus breaking the XHTML and XML output modes for documents with that char. In fact, many control characters shouldn't be allowed into output XML documents.

Here's my planned fix. If in XML or XHTML output mode, allow only valid XML characters through the lexing stage, and drop everything else. The valid ranges for XML chars are:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

(see: http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char)

XML entities follow the same rule; they can only represent chars in this range.

Since invalid XML chars don't really have any purpose in well-formed XML and were probably included in the source document as a mistake (e.g. the null char), dropping them seems fine.

I'll be preparing the patch soon and will make it available on this list.

Cheers,
Pablo Mayrgundter
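As a minimal sketch of the rule described above, the Char production can be written as a standalone predicate plus a string filter. The class and method names here (XmlCharFilter, isValidXmlChar, stripInvalidXmlChars) are hypothetical and are not part of JTidy; the real fix would live inside the lexer rather than operate on strings.

```java
// Sketch only: XmlCharFilter and its methods are hypothetical names, not JTidy API.
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int c) {
        return (c >= 0x20 && c <= 0xD7FF)          // common case first
            || c == 0x9 || c == 0xA || c == 0xD    // allowed white-space
            || (c >= 0xE000 && c <= 0xFFFD)        // rest of the BMP, minus surrogates
            || (c >= 0x10000 && c <= 0x10FFFF);    // supplementary planes
    }

    /** Returns a copy of the input with every invalid code point dropped. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);            // an unpaired surrogate comes back as itself
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL and backspace are dropped; the tab survives.
        String dirty = "a\u0000b\u0008c\td";
        System.out.println(stripInvalidXmlChars(dirty));
    }
}
```

Iterating by code point rather than by char means a well-formed surrogate pair is tested against the supplementary range, while an unpaired surrogate falls into the excluded D800-DFFF gap and is dropped.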
From: Mayrgundter, P. <pma...@do...> - 2004-08-12 16:22:03

Here's my fix:

    // Allow only valid XML characters. See:
    // http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char
    if ((this.configuration.xmlOut || this.configuration.xHTML)
        && !((c >= 0x20 && c <= 0xD7FF) // Check the common-case first.
             || c == 0x9 || c == 0xA || c == 0xD // Then white-space.
             || (c >= 0xE000 && c <= 0xFFFD) // Then high-range unicode.
             || (c >= 0x10000 && c <= 0x10FFFF)))
    {
        return;
    }

Not sure where best to put this. I've got it as the first statement in Lexer.addCharToLexer(int) currently.
From: Mayrgundter, P. <pma...@do...> - 2004-08-12 16:50:29

Here's a similar fix that drops invalid numeric entities from XML mode (entities that map to invalid XML chars):

    if ((this.configuration.xmlOut || this.configuration.xHTML)
        && !((ch >= 0x20 && ch <= 0xD7FF) // Check the common-case first.
             || ch == 0x9 || ch == 0xA || ch == 0xD // Then white-space.
             || (ch >= 0xE000 && ch <= 0xFFFD)))
    {
        this.lexsize = start;
        return;
    }

This should be inserted into Lexer.parseEntity(short), right after:

    str = getString(this.lexbuf, start, this.lexsize - start);
    ch = EntityTable.getDefaultEntityTable().entityCode(str);

Using both of these fixes and the duplicate attribute fix I posted earlier on this list, I'm getting fairly robust conversion to XHTML.

Cheers,
Pablo

-----Original Message-----
From: jti...@li... on behalf of Mayrgundter, Pablo
Sent: Thu 8/12/2004 12:20 PM
To: jti...@li...
Subject: RE: [Jtidy-devel] Tidying doc with null char to XHTML outputs invalid XML entity.

Here's my fix:

    // Allow only valid XML characters. See:
    // http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char
    if ((this.configuration.xmlOut || this.configuration.xHTML)
        && !((c >= 0x20 && c <= 0xD7FF) // Check the common-case first.
             || c == 0x9 || c == 0xA || c == 0xD // Then white-space.
             || (c >= 0xE000 && c <= 0xFFFD) // Then high-range unicode.
             || (c >= 0x10000 && c <= 0x10FFFF)))
    {
        return;
    }

Not sure where best to put this. I've got it as the first statement in Lexer.addCharToLexer(int) currently.
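To illustrate the effect of the entity fix outside the lexer, here is a rough string-level sketch that drops numeric character references whose code point is not a valid XML char. XmlEntityFilter and dropInvalidNumericRefs are hypothetical names and not JTidy API; the actual patch works on the lexer buffer inside Lexer.parseEntity. Note one assumption that differs from the posted snippet: this sketch also accepts the [#x10000-#x10FFFF] range, as the Char production does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: XmlEntityFilter is a hypothetical helper, not part of JTidy.
final class XmlEntityFilter {

    // Matches decimal (&#65;) and hexadecimal (&#x41;) character references.
    private static final Pattern NUMERIC_REF =
        Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    private XmlEntityFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int c) {
        return (c >= 0x20 && c <= 0xD7FF)
            || c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    /** Removes numeric references that resolve to an invalid XML char. */
    static String dropInvalidNumericRefs(String in) {
        Matcher m = NUMERIC_REF.matcher(in);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean keep;
            try {
                int code = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
                keep = isValidXmlChar(code);
            } catch (NumberFormatException tooLarge) {
                keep = false;                      // value does not fit in an int
            }
            // Keep the reference verbatim, or replace it with nothing.
            m.appendReplacement(out, keep ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dropInvalidNumericRefs("a&#0;b&#x9;c&#xFFFF;d&#65;"));
    }
}
```

Named entities are left alone here; only the numeric forms are checked, since those are the ones that can name an arbitrary code point.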
From: Fabrizio G. <fg...@gm...> - 2004-08-17 22:53:00

Thanks for the patches, Pablo. I just committed your fixes to CVS.

I also recently committed a fix from the C version of Tidy for the duplicate attribute bug (it also allows joining "style" and "class" attributes instead of dropping duplicates). XML output should now be a lot more stable; you can try it in an rc8 nightly build.

Any other patches and bugfixes are welcome. I recommend using the SourceForge bug tracker to submit them instead of posting to the mailing list, and if possible also adding a JUnit test case that shows what the patch does (this takes only a few minutes, since you simply need to provide input, expected output, and configuration files; see http://cvs.sourceforge.net/viewcvs.py/jtidy/jtidy2/src/test/org/w3c/tidy/JTidyBugsTest.java?view=markup as an example).

fabrizio

On Thu, 12 Aug 2004 12:47:25 -0400, Mayrgundter, Pablo <pma...@do...> wrote:
>
> Here's a similar fix that drops invalid numeric entities from XML mode (entities that map to invalid XML chars):
>