[Htmlparser-developer] Annette's bug(Dirty HTML) fixed - Some intelligence added to the parser

Hi Folks,
    We seem to have a heroic parser now...
    You can check out the latest code from CVS.

    Here's the fix. As you know, if a tag contains an additional,
erroneous inverted comma, the parser cannot judge whether to treat it
as erroneous or valid. Now the parser has some amount of intelligence:
if it encounters an inverted comma and a close-tag character, it checks
whether it should treat this as an error or as a valid character.

    This decision-making process is facilitated by a strictVector,
which holds the tags for which it should not make allowances.
Currently there is only one, "INPUT" (should we have any more?). If the
tag being parsed is not a strict tag like INPUT, then it is assumed
that this is an erroneous tag and needs to be corrected.
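
    To make the decision above concrete, here is a minimal sketch of
the strict-tag check. It is not the code that is in CVS: the class
name, the method names, the use of a Set instead of a Vector, and the
case-insensitive comparison are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch only; names here are illustrative, not the parser's API.
final class StrictTagSketch {
    // Tags for which no allowances are made. The post names only INPUT.
    private static final Set<String> STRICT_TAGS =
            new HashSet<>(Arrays.asList("INPUT"));

    // Assumption: tag names are compared case-insensitively.
    static boolean isStrict(String tagName) {
        return STRICT_TAGS.contains(tagName.toUpperCase(Locale.ROOT));
    }

    // For a non-strict tag, a stray inverted comma is assumed to be an
    // error that needs correcting; for a strict tag it is left as is.
    static boolean treatQuoteAsError(String tagName) {
        return !isStrict(tagName);
    }

    public static void main(String[] args) {
        System.out.println(treatQuoteAsError("INPUT")); // false
        System.out.println(treatQuoteAsError("A"));     // true
    }
}
```

    Adding another strict tag would then only mean adding its name to
the set, which is why the question of whether more tags belong there
matters.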

    The correction process is validated by some testcases in HTMLTag,
particularly testStrictParsing. If you go through that testcase, you
will see that the attributes are also correctly retrieved.
    This solution doesn't break anything else - we have 82 testcases,
all passing.
    I'd be grateful if folks can test this version and let me know if
this solution is acceptable.

    Also, a general question: would you prefer something like nightly
drop packages for downloading, or is a request to check out from CVS
fine?

Thanks and Regards,
Somik