[Htmlparser-user] Annette's bug difficult to fix - need help (AI approaches ??)
From: Somik R. <so...@ya...> - 2002-05-02 03:11:27
Hi Folks,

If you've been following the latest exchange on htmlparser-user, Annette has shown us a crazy example of dirty HTML which works in the browser but crashes the parser. The site is http://www.cia.gov

Search for this string:

<font face="Arial,"helvetica,

and you will find it in the HTML. The erroneous inverted comma in front of helvetica should not be there.

This has been captured in a test case in HTMLTagTest.java (you can get it from CVS), and this test (testParsing()) fails.

The problem is that the core parsing mechanism ignores anything within inverted commas. This is critical for accepting angular brackets inside inverted commas. If we remove this feature from the parser, other tests will break.

So I need some suggestions on how we might modify our parsing. How do we intelligently recognize that this is an error (easy as it is for us humans to figure out)? Looks like linear approaches won't work anymore... Maybe we need to add some intelligence: if it's a font tag, then this kind of stuff is most definitely an error, whereas if it's a JSP tag, we can be stricter with our parsing. This will probably cause a fundamental shift in our core parsing technique.

Regards,
Somik
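
One possible starting point for the question above, as a minimal sketch only: keep ignoring angular brackets inside inverted commas, but decide by context whether an inverted comma really opens or closes a value. The class LenientTagScanner and the method findTagEnd are hypothetical names for illustration and are not part of the htmlparser code base. The two rules (a quote opens a value only right after '=', and closes it only when followed by whitespace, '>', '/' or end of input) are assumptions, not something the parser does today, and they will misjudge some inputs.

```java
/** Hypothetical helper for illustration; not part of the htmlparser API. */
final class LenientTagScanner {

    /**
     * Returns the index of the '>' that closes the tag starting at
     * position start (expected to point at '<'), or -1 if none is found.
     */
    static int findTagEnd(String html, int start) {
        boolean inQuotes = false;
        char lastSignificant = '<'; // last non-whitespace char seen outside quotes
        for (int i = start + 1; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '"') {
                if (!inQuotes) {
                    // Assumed rule: a quote opens a value only directly after '='.
                    // A quote anywhere else is treated as stray and skipped.
                    if (lastSignificant == '=') {
                        inQuotes = true;
                    }
                } else if (endsValue(html, i)) {
                    inQuotes = false;
                    lastSignificant = '"';
                }
                // Otherwise: a quote in the middle of a value, treated as stray.
            } else if (!inQuotes) {
                if (c == '>') {
                    return i;
                }
                if (!Character.isWhitespace(c)) {
                    lastSignificant = c;
                }
            }
        }
        return -1;
    }

    /** Assumed rule: a quote closes a value only before whitespace, '>', '/' or end of input. */
    private static boolean endsValue(String html, int quoteIndex) {
        int next = quoteIndex + 1;
        if (next >= html.length()) {
            return true;
        }
        char c = html.charAt(next);
        return Character.isWhitespace(c) || c == '>' || c == '/';
    }
}
```

Calling LenientTagScanner.findTagEnd(html, 0) on a tag that begins with the string quoted above should skip the stray inverted comma before helvetica, while a tag such as <a href="a>b"> should still keep the bracket inside the value. This stays a linear scan with one character of lookahead, so it does not address the per-tag strictness idea (font versus JSP tags).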