[Htmlparser-user] Annette's bug difficult to fix - need help (AI approaches ??)

Hi Folks,
    If you've been following the latest exchange on htmlparser-user,
Annette has shown us a crazy example of dirty html, which works in the
browser, but crashes the parser.
    The site is http://www.cia.gov
    Search for this string - <font face="Arial,"helvetica,"
    and you will find it in the html. Now this erroneous inverted comma
in front of helvetica should not be there.

    This has been captured in a test case in HTMLTagTest.java (you can
get it from CVS), and this test (testParsing()) fails.
    The problem is that the core parsing mechanism ignores anything
within inverted commas. This is critical so that angle brackets inside
inverted commas are accepted. If we remove this feature from the
parser, other tests will break.
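
To make the failure concrete, here is a minimal sketch of a quote-aware
scan for the end of a tag. It is not the actual htmlparser code: the
class name QuoteAwareScan, the method findTagEnd and both sample strings
are made up for illustration, and the broken font tag is assumed to
close with '>' right after the third inverted comma.

```java
// Hypothetical illustration only; not the real htmlparser implementation.
class QuoteAwareScan {

    /**
     * Returns the index of the '>' that closes the tag beginning at
     * {@code start}, skipping any '>' that sits inside inverted commas.
     * Returns -1 if no closing '>' is found outside inverted commas.
     */
    static int findTagEnd(String html, int start) {
        boolean inQuotes = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                // Every inverted comma toggles the state, stray or not.
                inQuotes = !inQuotes;
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Angle bracket inside inverted commas: handled correctly.
        System.out.println(findTagEnd("<a title=\"x > y\">link</a>", 0));
        // Odd number of inverted commas: every later '>' is ignored.
        System.out.println(
            findTagEnd("<font face=\"Arial,\"helvetica,\">text</font>", 0));
    }
}
```

With a well-formed attribute the scan stops at the real end of the tag.
With the stray inverted comma the scanner is still "inside quotes" when
it reaches the closing '>', so it runs off the end of the input.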

    So I need some suggestions on how we might modify our parsing. How
do we intelligently recognize that this is an error (easy as it is for
us humans to figure out)? It looks like linear approaches won't work
anymore... Maybe we need to build in some intelligence: if it's a font
tag, then this kind of stuff is most definitely an error, whereas if
it's a JSP tag, we can be stricter with our parsing. This will probably
cause a fundamental shift in our core parsing technique.
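
One possible shape for such a tag-dependent rule, offered only as a
sketch for discussion: keep the quote-aware scan, but for tags on a
"lenient" list, treat a '<' seen while still inside inverted commas as
evidence that a quote was stray, and fall back to the first '>' that was
skipped. The class TagAwareScan, the LENIENT set and the sample inputs
are all assumptions, not existing htmlparser code.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical heuristic; names and the lenient-tag list are assumptions.
class TagAwareScan {

    // Tags where a '<' inside an attribute value is taken to be an error.
    private static final Set<String> LENIENT = Set.of("font");

    /** Extracts the lower-cased tag name after '<'; empty for e.g. "<%". */
    static String tagName(String html, int start) {
        int i = start + 1;
        while (i < html.length() && Character.isLetterOrDigit(html.charAt(i))) {
            i++;
        }
        return html.substring(start + 1, i).toLowerCase(Locale.ROOT);
    }

    /** Returns the index of the '>' judged to close the tag, or -1. */
    static int findTagEnd(String html, int start) {
        boolean lenient = LENIENT.contains(tagName(html, start));
        boolean inQuotes = false;
        int firstQuotedGt = -1; // first '>' skipped because it was quoted
        for (int i = start + 1; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;
            } else if (c == '>') {
                if (!inQuotes) {
                    return i;
                }
                if (firstQuotedGt < 0) {
                    firstQuotedGt = i;
                }
            } else if (c == '<' && inQuotes && lenient && firstQuotedGt >= 0) {
                // A new tag seems to start inside "quotes": assume the
                // quote was stray and close at the skipped '>'.
                return firstQuotedGt;
            }
        }
        // Input ended while still quoted: lenient tags fall back as well.
        return (lenient && firstQuotedGt >= 0) ? firstQuotedGt : -1;
    }
}
```

Under these assumptions, a font tag with a stray inverted comma closes at
its first '>', while a JSP scriptlet such as <% String s = "a > b"; %>
is still scanned strictly, since its name is not on the lenient list.
The trade-off is that a lenient tag whose quoted value really contains
both '>' and '<' would be cut short.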

Regards,
Somik