This patch to HtmlStreamTokenizer.java CVS version 1.2
creates several public instance fields used as flags to
indicate how it should deal with some malformations in
HTML. It is in no way clean or perfect, but setting
all its flags to false allows parsing through most
malformed HTML (though perhaps not handling each
malformation the way you might like). Each flag
denotes whether < and > should be allowed inside of its
corresponding entity, or should end that entity. By
default, these are all set to true, providing identical
behavior to 1.2 by allowing < and > inside of anything
and thus searching for proper endings to each type of
entity:
allowTagsInComments - setting this to false means that
a < or > inside of a comment ends that comment
regardless of whether an exact --> terminates it.
Parsing then continues as if the < began a tag following
the comment, or treats the > as --> and resumes at
the next character
allowTagsInBangTags - same as for comments, but applied
to bang tags, i.e. any tag starting with <!
allowTagsInTags - setting this to false causes < to end
a tag as well as >
allowTagsInTagQuotes - setting this to false causes <
inside of single or double quotes inside of a tag, i.e.
an attribute value, to end that tag
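
To make the effect of one of these flags concrete, here is a minimal
standalone sketch. It is NOT the patched HtmlStreamTokenizer: the class
TagScannerSketch, its scan method and the "TAG:"/"TEXT:" token strings are
hypothetical names made up for illustration. Only the field name
allowTagsInTags is taken from the patch, and the sketch assumes the
behavior described above (false means a < inside a tag ends that tag).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not the patched HtmlStreamTokenizer. */
class TagScannerSketch {
    /** Same name as the patch's flag; true keeps the strict 1.2-style behavior. */
    public boolean allowTagsInTags = true;

    /** Splits html into "TAG:..." and "TEXT:..." tokens. */
    public List<String> scan(String html) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (!inTag) {
                if (c == '<') {
                    // A '<' in text flushes the pending text and opens a tag.
                    if (buf.length() > 0) tokens.add("TEXT:" + buf);
                    buf.setLength(0);
                    inTag = true;
                } else {
                    buf.append(c);
                }
            } else if (c == '>') {
                // Normal end of a tag.
                tokens.add("TAG:" + buf);
                buf.setLength(0);
                inTag = false;
            } else if (c == '<' && !allowTagsInTags) {
                // Lenient mode: '<' ends the current tag and opens a new one.
                tokens.add("TAG:" + buf);
                buf.setLength(0);
            } else {
                // Strict mode keeps '<' as part of the tag body.
                buf.append(c);
            }
        }
        // Flush whatever is left at end of input.
        if (buf.length() > 0) tokens.add((inTag ? "TAG:" : "TEXT:") + buf);
        return tokens;
    }
}
```

With the input <a href=x<b>bold</b>, the default setting swallows the
second < into the first tag, while setting allowTagsInTags to false splits
the input into separate a and b tags. The real patch may differ in detail,
for example in how it treats quoted attribute values.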
This patch should probably be cleaned up in terms of
providing a better interface for these options than
public data members, and, more importantly, tested to
make sure it is bug-free.
Logged In: YES
user_id=705615
Didn't notice this when I posted it, but apparently
sourceforge escapes entities in comments. The <'s and
>'s above were intended to be literal < (less than) and >
(greater than) characters.
Logged In: YES
user_id=705615
The initial patch I had posted here was a normal diff and
didn't even apply correctly for me, so I am posting an
updated one.
Logged In: YES
user_id=705615
See original comments for details on how to turn on the
malformed-HTML handling features included in this patch.
This is a unified context diff patch for CVS version 1.2. You
can apply it simply by changing to the
src/com/arthurdo/parser directory and doing
"patch < HtmlStreamTokenizer.patch"
This updated version of the patch also includes a public static
boolean flag called stripEscapeBadTermChar, which if set to
false prevents HtmlStreamTokenizer from eating any
punctuation after an escape sequence (e.g. consuming a .
that directly follows the sequence). See bug #1018298 for more details.
Hackish patch for dealing with some malformed HTML