[Htmlparser-developer] Annette's bug difficult to fix - need help (AI approaches ??)
From: Somik R. <so...@ya...> - 2002-05-02 03:11:27
Hi Folks,
If you've been following the latest exchange on htmlparser-user, Annette has shown us a crazy example of dirty HTML, which works in the browser but crashes the parser.
The site is http://www.cia.gov
Search for this string - <font face="Arial,"helvetica,"
and you will find it in the HTML. The erroneous inverted comma in front of helvetica should not be there.
This has been captured in a test case in HTMLTagTest.java (you can get it from CVS), and this test, testParsing(), fails.
The problem is that the core parsing mechanism ignores anything within inverted commas. This is critical for accepting angle brackets inside inverted commas. With the stray inverted comma, the quotes in this tag no longer balance, so the closing bracket of the tag gets ignored as well. If we remove this feature from the parser, other tests will break.
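To make the failure mode concrete, here is a minimal sketch of a quote-aware scan for the end of a tag. It is not the actual htmlparser code: the class name QuoteAwareScanner and the method findTagEnd are made up for illustration, and the rest of the font tag after the search string (">text</font>") is an assumed completion, since only the start of the tag is quoted above.

```java
// Hypothetical sketch, NOT the real htmlparser implementation.
final class QuoteAwareScanner {

    /**
     * Returns the index of the '>' that closes the tag beginning at 'start',
     * skipping any '>' that appears inside double quotes.
     * Returns -1 if no closing '>' is found.
     */
    static int findTagEnd(String html, int start) {
        boolean inQuotes = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // every quote toggles the state
            } else if (c == '>' && !inQuotes) {
                return i;                      // first '>' outside quotes ends the tag
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Legitimate case: an angle bracket inside a quoted attribute value.
        String ok = "<a title=\"x > y\">link</a>";
        // Dirty case: assumed completion of the tag from the search string.
        String dirty = "<font face=\"Arial,\"helvetica,\">text</font>";

        System.out.println(findTagEnd(ok, 0));     // prints 16
        System.out.println(findTagEnd(dirty, 0));  // prints -1
    }
}
```

The first input shows why skipping quoted text is needed. The second shows how a single stray inverted comma leaves the scanner "inside quotes" so that the real end of the tag is never seen.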
So I need some suggestions on how we might modify our parsing. How do we intelligently recognize that this is an error (it is so easy for us humans to figure this out)? It looks like linear approaches won't work anymore... Maybe we need to build in some intelligence: if it's a font tag, then this kind of stuff is most definitely an error, whereas if it's a JSP tag, we can be stricter with our parsing. This will probably cause a fundamental shift in our core parsing technique.
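One possible shape for the tag-dependent idea above, purely as a sketch for discussion: be strict for JSP-style tags, and for everything else only treat an inverted comma as the start of a value when it follows an '=' sign. All names here (LenientTagScanner, isStrictTag, followsEquals, findTagEnd) are hypothetical, the choice of "<%" and "<jsp:" as the strict prefixes is an assumption, and the heuristic itself is just one guess at what "intelligence" could mean, not an agreed design.

```java
// Hypothetical sketch of a tag-aware, lenient scan. Not the htmlparser API.
final class LenientTagScanner {

    /** Assumption: tags starting with "<%" or "<jsp:" get strict quote handling. */
    static boolean isStrictTag(String html, int start) {
        return html.startsWith("<%", start) || html.startsWith("<jsp:", start);
    }

    /** True if the nearest non-whitespace character before quoteIndex is '='. */
    static boolean followsEquals(String html, int quoteIndex) {
        int i = quoteIndex - 1;
        while (i >= 0 && Character.isWhitespace(html.charAt(i))) {
            i--;
        }
        return i >= 0 && html.charAt(i) == '=';
    }

    /** Returns the index of the closing '>' of the tag at 'start', or -1. */
    static int findTagEnd(String html, int start) {
        boolean strict = isStrictTag(html, start);
        boolean inQuotes = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                if (inQuotes) {
                    inQuotes = false;              // closing quote
                } else if (strict || followsEquals(html, i)) {
                    inQuotes = true;               // genuine opening quote
                }
                // Otherwise (lenient mode): a quote that does not start an
                // attribute value is treated as stray and ignored.
            } else if (c == '>' && !inQuotes) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The tail of the font tag is an assumed completion of the search string.
        String dirty = "<font face=\"Arial,\"helvetica,\">text</font>";
        String ok = "<a title=\"x > y\">link</a>";
        String jsp = "<jsp:x a=\"1,\"2,\">";

        System.out.println(findTagEnd(dirty, 0)); // prints 30
        System.out.println(findTagEnd(ok, 0));    // prints 16
        System.out.println(findTagEnd(jsp, 0));   // prints -1 (strict: unbalanced quotes)
    }
}
```

This stays a single linear pass, so it may not need the fundamental shift after all, but it would certainly misjudge other kinds of dirty HTML (for example unquoted values that contain an '=' sign), so it would have to be checked against the existing tests.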
Regards,
Somik