From: Mayrgundter, P. <pma...@do...> - 2004-08-12 16:50:29
Here's a similar fix that drops invalid numeric entities from XML mode (entities that map to invalid XML chars):

    if ((this.configuration.xmlOut || this.configuration.xHTML)
        && !((ch >= 0x20 && ch <= 0xD7FF)              // Check the common-case first.
             || ch == 0x9 || ch == 0xA || ch == 0xD    // Then white-space.
             || (ch >= 0xE000 && ch <= 0xFFFD)))
    {
        this.lexsize = start;
        return;
    }

This should be inserted into Lexer.parseEntity(short), right after:

    str = getString(this.lexbuf, start, this.lexsize - start);
    ch = EntityTable.getDefaultEntityTable().entityCode(str);

Using both of these fixes and the duplicate attribute fix I posted earlier on this list, I'm getting fairly robust conversion to XHTML.

Cheers,
Pablo

-----Original Message-----
From: jti...@li... on behalf of Mayrgundter, Pablo
Sent: Thu 8/12/2004 12:20 PM
To: jti...@li...
Subject: RE: [Jtidy-devel] Tidying doc with null char to XHTML outputs invalid XML entity.

Here's my fix:

    // Allow only valid XML characters. See:
    // http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char
    if ((this.configuration.xmlOut || this.configuration.xHTML)
        && !((c >= 0x20 && c <= 0xD7FF)                // Check the common-case first.
             || c == 0x9 || c == 0xA || c == 0xD       // Then white-space.
             || (c >= 0xE000 && c <= 0xFFFD)           // Then high-range unicode.
             || (c >= 0x10000 && c <= 0x10FFFF)))
    {
        return;
    }

Not sure where best to put this. I've got it as the first statement in Lexer.addCharToLexer(int) currently.
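For anyone who wants to try the range check outside of JTidy, here is a minimal standalone sketch of the same test against the XML 1.0 Char production. The class and method names (XmlCharFilter, isValidXmlChar, stripInvalidXmlChars) are made up for illustration and are not part of JTidy; the ranges are the ones from the addCharToLexer fix, including the supplementary range up to 0x10FFFF.

```java
// Hypothetical helper, not part of JTidy: mirrors the range check in the fixes.
public final class XmlCharFilter {

    private XmlCharFilter() {
    }

    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isValidXmlChar(int c) {
        return (c >= 0x20 && c <= 0xD7FF)           // Check the common-case first.
            || c == 0x9 || c == 0xA || c == 0xD     // Then white-space.
            || (c >= 0xE000 && c <= 0xFFFD)         // Then high-range unicode.
            || (c >= 0x10000 && c <= 0x10FFFF);     // Then supplementary planes.
    }

    /** Returns a copy of the input with every invalid XML code point dropped. */
    public static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length();) {
            // codePointAt joins surrogate pairs; a lone surrogate comes back
            // as itself (0xD800-0xDFFF) and is rejected by the range check.
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling XmlCharFilter.stripInvalidXmlChars on a string containing a null char drops the null and keeps tabs and newlines, which is the behaviour the lexer patch aims for in xmlOut/xHTML mode.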