Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )
From: Sam J. <ga...@yh...> - 2002-12-24 17:05:00
Hi Somik,

Somik Raha wrote:
> Hi Sam,
>
>> Can you give me an example of the hard coded rules you are using now,
>> and a couple of examples of dirty html pages that cause them to be
>> sub-optimal.
>
> Here are some tags:
>
> [1] From neurogrid.com (debugging last year :)
> <a href="mailto:sa...@ne...?subject=Site Comments">Mail Us<a>
>
> [2] From freshmeat.net
> <a>revision</a>
>
> [3] From fedpage.com
> <a href="registration.asp?EventID=1272"><img border="0"
> src="\images\register.gif"</a>
>
> [4] From yahoo.com
> <a href=s/8741><img
> src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16
> width=16 border=0></img></td><td nowrap>
> <a href=s/7509><b>Yahoo! Movies</b></a>
>
> As you can see, dirty html hardly looks predictable. Especially when links
> are not closed correctly, the scanner has to guess when it should close the
> tag.
>
> And this is only for the link tag. For normal tags,
> [1] <sometag key1=value key2="value2 key3 = value3>
> [2] <sometag key1="<sometag>" key2="<!-- skdlskld -->">
>
> The above two tags demonstrate a classic dilemma. If we ignore inverted
> commas, we cannot handle case 2, where the content within inverted commas
> is valid text and not tags. All these examples are accepted by IE.
>
> All of these problems are currently handled by the parser, and I was
> looking to simplify the brain of the parser.

Could you explain how all of these examples are handled by the current
parser? Are you using some kind of specific rule to handle each case?
Perhaps you can cut and paste a bit of code to the list to illustrate. The
more precisely you can describe the operation of the existing parser when
handling these kinds of cases, the more likely I can come up with a learner
that will meet your needs.

CHEERS> SAM
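For concreteness, here is a minimal sketch (my own illustration, not the existing htmlparser code) of a quote-aware scan for the end of a tag, which is one way to keep case [2] above working without simply ignoring inverted commas. The class and method names are invented for the example.

    // Minimal illustration (not htmlparser code): find the '>' that really
    // closes a tag, treating '>' inside quoted attribute values as ordinary
    // text. This keeps case [2] intact:
    //   <sometag key1="<sometag>" key2="<!-- skdlskld -->">
    public class QuoteAwareTagScanner {

        /**
         * Returns the index just past the '>' that closes the tag starting
         * at 'start' (which must point at '<'), or -1 if no real close is
         * found and the caller has to guess.
         */
        public static int findTagEnd(String html, int start) {
            char quote = 0; // 0 means "not inside a quoted attribute value"
            for (int i = start; i < html.length(); i++) {
                char c = html.charAt(i);
                if (quote != 0) {
                    if (c == quote)          // closing quote of the value
                        quote = 0;
                } else if (c == '"' || c == '\'') {
                    quote = c;               // opening quote of a value
                } else if (c == '>') {
                    return i + 1;            // real end of the tag
                }
            }
            return -1;                       // unterminated tag
        }

        public static void main(String[] args) {
            String tag =
                "<sometag key1=\"<sometag>\" key2=\"<!-- skdlskld -->\"> text";
            int end = findTagEnd(tag, 0);
            // prints: <sometag key1="<sometag>" key2="<!-- skdlskld -->">
            System.out.println(tag.substring(0, end));
        }
    }

This only addresses the quoted-attribute dilemma; it says nothing about the unclosed link tags in [1]-[4], where the scanner still has to guess where the tag should end.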