#1 Hackish patch for dealing with some malformed HTML

open
nobody
None
5
2003-02-06
2003-02-06
No

This patch to HtmlStreamTokenizer.java (CVS version 1.2)
adds several public instance fields that act as flags
controlling how the tokenizer deals with some kinds of
malformed HTML. It is in no way clean or perfect, but
setting all of its flags to false allows parsing through
most malformed HTML (though perhaps not handling each
malformation the way you might like). Each flag denotes
whether < and > should be allowed inside its
corresponding entity, or should end that entity. By
default, all flags are set to true, which gives behavior
identical to 1.2: < and > are allowed inside anything,
and the tokenizer searches for the proper ending of each
type of entity:

allowTagsInComments - setting this to false means that
a < or > inside a comment ends that comment, regardless
of whether an exact --> has been seen. On <, parsing
continues as if the < began a tag following the comment;
on >, the > is treated as --> and parsing continues with
the next character

allowTagsInBangTags - the same as for comments, but
applied to bang tags, i.e. any tag starting with <!

allowTagsInTags - setting this to false causes < to end
a tag, just as > does

allowTagsInTagQuotes - setting this to false causes <
inside of single or double quotes inside of a tag, i.e.
an attribute value, to end that tag
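
The comment rule described for allowTagsInComments can be
sketched as a small standalone scanner. This is only an
illustration of the described behavior, not code from the
patch: the class CommentEndSketch and the method
findCommentEnd are hypothetical names, and the real
HtmlStreamTokenizer works on a character stream rather than
a String.

```java
// Hypothetical sketch; not part of HtmlStreamTokenizer or the patch.
class CommentEndSketch {
    /**
     * Scans a comment body starting at index `from` (the character
     * after "<!--") and returns the index where parsing should resume,
     * or -1 if no ending is found.
     *
     * With allowTagsInComments == true only an exact "-->" ends the
     * comment. With it set to false, a '<' ends the comment and is
     * left unconsumed so it can start the next tag, while a '>' is
     * consumed as if it were "-->".
     */
    static int findCommentEnd(String html, int from, boolean allowTagsInComments) {
        for (int i = from; i < html.length(); i++) {
            if (html.startsWith("-->", i)) {
                return i + 3;          // proper ending: resume after "-->"
            }
            if (!allowTagsInComments) {
                char c = html.charAt(i);
                if (c == '<') {
                    return i;          // resume at '<', treated as a new tag
                }
                if (c == '>') {
                    return i + 1;      // '>' stands in for "-->"
                }
            }
        }
        return -1;                     // comment never ended
    }
}
```

With the flag left at true, an unterminated comment such as
"<!-- broken <b>bold</b>" is never closed; with the flag set
to false, scanning stops at the < of <b>.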

This patch should probably be cleaned up in terms of
providing a better interface for these options than
public data members, and, more importantly, tested to
make sure it is bug-free.

Discussion

  • Eric C. Jensen

    Eric C. Jensen - 2004-08-28

    Logged In: YES
    user_id=705615

    Didn't notice this when I posted it, but apparently
    sourceforge escapes entities in comments. The &lt;'s and
    &gt;'s above were intended to be literal < (less than) and >
    (greater than) characters.

     
  • Eric C. Jensen

    Eric C. Jensen - 2004-08-28

    Logged In: YES
    user_id=705615

    The initial patch I had posted here was a normal diff and
    didn't even apply correctly for me, so I am posting an
    updated one.

     
  • Eric C. Jensen

    Eric C. Jensen - 2004-08-28

    Logged In: YES
    user_id=705615

    See original comments for details on how to turn on the
    malformed-HTML handling features included in this patch.
    This is a unified context diff patch for CVS version 1.2. You
    can apply it simply by changing to the
    src/com/arthurdo/parser directory and doing
    "patch < HtmlStreamTokenizer.patch"

    This updated version of the patch also includes a public static
    boolean flag called stripEscapeBadTermChar, which, if set to
    false, prevents HtmlStreamTokenizer from eating any
    punctuation after an escape sequence (e.g. consuming the .
    in &nbsp.). See bug #1018298 for more details.
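
    The effect of stripEscapeBadTermChar can be illustrated with
    a small standalone sketch. This is an assumption about the
    behavior based only on the description above, not the
    tokenizer's actual code: the class EscapeTerminatorSketch,
    its unescape method, and its tiny entity table are all
    hypothetical.

```java
import java.util.Map;

// Hypothetical sketch; not part of HtmlStreamTokenizer or the patch.
class EscapeTerminatorSketch {
    // Tiny illustrative entity table (assumption, not the real one).
    private static final Map<String, Character> ENTITIES =
            Map.of("nbsp", '\u00A0', "amp", '&', "lt", '<', "gt", '>');

    /**
     * Decodes named escapes in `text`. A proper ';' terminator is always
     * consumed. When the escape is followed by some other character,
     * that character is swallowed only if stripBadTermChar is true.
     */
    static String unescape(String text, boolean stripBadTermChar) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != '&') {
                out.append(c);
                i++;
                continue;
            }
            // Collect the escape name after '&'.
            int j = i + 1;
            while (j < text.length() && Character.isLetterOrDigit(text.charAt(j))) {
                j++;
            }
            Character decoded = ENTITIES.get(text.substring(i + 1, j));
            if (decoded == null) {
                out.append(c);         // unknown escape: keep '&' literally
                i++;
                continue;
            }
            out.append(decoded);
            if (j < text.length() && text.charAt(j) == ';') {
                j++;                   // proper terminator
            } else if (stripBadTermChar && j < text.length()) {
                j++;                   // old behavior: eat the bad terminator
            }
            i = j;
        }
        return out.toString();
    }
}
```

    In this sketch, decoding "a&nbsp.b" with the flag set to true
    loses the period, while setting it to false keeps it.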

     
  • Eric C. Jensen

    Eric C. Jensen - 2004-08-28

    Hackish patch for dealing with some malformed HTML

     
