RE: [Jtidy-devel] Tidying doc with null char to XHTML outputs invalid XML entity.


Here's a similar fix that drops invalid numeric entities in XML mode
(entities that map to invalid XML chars):

        if ((this.configuration.xmlOut || this.configuration.xHTML)
            && !((ch >= 0x20 && ch <= 0xD7FF) // Check the common-case first.
                 || ch == 0x9 || ch == 0xA || ch == 0xD // Then white-space.
                 || (ch >= 0xE000 && ch <= 0xFFFD))) {
            this.lexsize = start;
            return;
        }

This should be inserted into Lexer.parseEntity(short), right after:

        str = getString(this.lexbuf, start, this.lexsize - start);
        ch = EntityTable.getDefaultEntityTable().entityCode(str);
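
To illustrate what resetting the buffer size back to "start" achieves, here is a minimal standalone sketch. It is not JTidy code: the class EntityDropSketch and all of its methods are hypothetical, a StringBuilder stands in for lexbuf/lexsize, and only decimal references of the form "&#NNN;" are handled. The range test is copied from the patch above.

```java
// Hypothetical sketch, not part of JTidy. A StringBuilder models lexbuf/lexsize.
final class EntityDropSketch {

    private EntityDropSketch() {
    }

    /** Same ranges as the patch above. */
    static boolean isXmlChar(int ch) {
        return (ch >= 0x20 && ch <= 0xD7FF)          // common case first
            || ch == 0x9 || ch == 0xA || ch == 0xD   // then white-space
            || (ch >= 0xE000 && ch <= 0xFFFD);
    }

    /** True if text[from, to) is non-empty and consists only of ASCII digits. */
    static boolean allDigits(String text, int from, int to) {
        if (from >= to) {
            return false;
        }
        for (int k = from; k < to; k++) {
            char d = text.charAt(k);
            if (d < '0' || d > '9') {
                return false;
            }
        }
        return true;
    }

    /** Parses a decimal code; returns -1 (never a valid XML char) on overflow. */
    static int parseCode(String digits) {
        try {
            return Integer.parseInt(digits);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Copies text, dropping decimal references that map to invalid XML chars. */
    static String dropInvalidNumericEntities(String text) {
        StringBuilder buf = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int semi = text.indexOf(';', i);
            if (text.startsWith("&#", i) && semi > i + 2 && allDigits(text, i + 2, semi)) {
                int start = buf.length();             // where the entity begins in the buffer
                buf.append(text, i, semi + 1);        // entity text is buffered first
                int ch = parseCode(text.substring(i + 2, semi));
                if (!isXmlChar(ch)) {
                    buf.setLength(start);             // same effect as "this.lexsize = start"
                }
                i = semi + 1;
            } else {
                buf.append(text.charAt(i++));
            }
        }
        return buf.toString();
    }
}
```

Valid references are left in the buffer untouched; only the ones that fail the range test are rolled back, so the surrounding text is unaffected.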

Using both of these fixes and the duplicate attribute fix I posted
earlier on this list, I'm getting fairly robust conversion to XHTML.

Cheers,
Pablo

-----Original Message-----
From: jti...@li... on behalf of Mayrgundter, Pablo
Sent: Thu 8/12/2004 12:20 PM
To: jti...@li...
Subject: RE: [Jtidy-devel] Tidying doc with null char to XHTML outputs invalid XML entity.

Here's my fix:

        // Allow only valid XML characters.  See:
        // http://www.w3.org/TR/2004/REC-xml-20040204/#NT-Char
        if ((this.configuration.xmlOut || this.configuration.xHTML)
            && !((c >= 0x20 && c <= 0xD7FF) // Check the common-case first.
                 || c == 0x9 || c == 0xA || c == 0xD // Then white-space.
                 || (c >= 0xE000 && c <= 0xFFFD) // Then high-range unicode.
                 || (c >= 0x10000 && c <= 0x10FFFF))) {
            return;
        }

Not sure where best to put this.  I've got it as the first statement in
Lexer.addCharToLexer(int) currently.
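
For anyone who wants to try the range check outside JTidy, here is a minimal standalone sketch of the same test against the XML 1.0 Char production. The class name XmlChars and both method names are made up for illustration and are not part of JTidy. It also assumes a Java version that has String.codePointAt (Java 5 or later).

```java
// Hypothetical helper, not part of JTidy: the same range test as the fix above.
final class XmlChars {

    private XmlChars() {
    }

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int c) {
        return (c >= 0x20 && c <= 0xD7FF)           // common case first
            || c == 0x9 || c == 0xA || c == 0xD     // then white-space
            || (c >= 0xE000 && c <= 0xFFFD)         // rest of the BMP, skipping surrogates
            || (c >= 0x10000 && c <= 0x10FFFF);     // supplementary planes
    }

    /** Returns a copy of the input without code points that are not valid XML chars. */
    static String stripInvalid(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);             // an unpaired surrogate comes back as itself
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling XmlChars.stripInvalid on a string containing a null char returns the string with that char removed, which is the behaviour the early return in the fix gives for each character passed in.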