Re: [Htmlparser-developer] HttpUnit etc. was (Re: Table Scanner )
From: Somik R. <so...@ya...> - 2002-12-24 07:28:40
Hi Sam,

> Can you give me an example of the hard coded rules you are using now,
> and a couple of examples of dirty html pages that cause them to be
> sub-optimal.

Here are some tags:

[1] From neurogrid.com (debugging last year :)
<a href="mailto:sa...@ne...?subject=Site Comments">Mail Us<a>

[2] From freshmeat.net
<a>revision</a>

[3] From fedpage.com
<a href="registration.asp?EventID=1272"><img border="0" src="\images\register.gif"</a>

[4] From yahoo.com
<a href=s/8741><img src="http://us.i1.yimg.com/us.yimg.com/i/i16/mov_popc.gif" height=16 width=16 border=0></img></td><td nowrap>
<a href=s/7509><b>Yahoo! Movies</b></a>

As you can see, dirty html is hardly predictable. Especially when links
are not closed correctly, the scanner has to guess where it should close
the tag. And this is only for the link tag. For normal tags:

[1] <sometag key1=value key2="value2 key3 = value3>
[2] <sometag key1="<sometag>" key2="<!-- skdlskld -->">

The above two tags demonstrate a classic dilemma. If we ignore inverted
commas, we cannot handle case 2, where the contents within the inverted
commas are valid text and not tags. If we respect them, the quote opened
in case 1 is never closed, so the rest of the tag is swallowed.

All these examples are accepted by IE. All of these problems are
currently handled by the parser, and I was looking to simplify the brain
of the parser.

> Using learning in a system to increase efficiency is usually very
> difficult to do well. Learning systems basically have more flexibility
> than other systems, but as a consequence you have more free parameters.
> It is easy to add a learning framework but then spend all your time
> just trying to adjust the system parameters, and then to discover that
> exploring the space of possible parameters for your learner is just too
> expensive.
>
> Nonetheless I am always fascinated by the problem of adding learning to a
> system, precisely because it is so difficult to do well. If you can
> give me some concrete examples, I will do my best to help you select an
> appropriate learning mechanism.

Thanks!

> Interesting. The ant scripts for neurogrid were originally made by Rick
> Knowles, and I'm still only just getting a really good feel for ant. I
> remember trying to set up something in my ng scripts that would add the
> date to the jar file name, like you sometimes do, and failing. My own
> fault really; I rarely read the manual and always try to learn by
> modifying the operation of an existing system (kind of an evolutionary
> approach ....)

That's what I did too - but I started with the examples on the ant
website (I think they believe in the approach too).

Cheers,
Somik
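
To make the inverted-comma dilemma from the two <sometag> examples concrete, here is a minimal sketch of one possible hard-coded rule: scan for the closing '>' while respecting quotes, and if a quote is never closed, fall back to the first '>' regardless of quotes. This is only an illustration and not necessarily the rule the parser uses today; the names TagEndFinder and findTagEnd are hypothetical.

```java
// Hypothetical helper, for illustration only; not part of HTMLParser.
final class TagEndFinder {

    /**
     * Returns the index of the '>' that ends the tag beginning at 'start',
     * or -1 if the input contains no '>' at or after 'start'.
     */
    static int findTagEnd(String html, int start) {
        char quote = 0; // 0 means "not inside a quoted attribute value"
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                // Inside quotes: everything is text until the matching quote.
                if (c == quote) {
                    quote = 0;
                }
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i; // first '>' outside quotes closes the tag
            }
        }
        // Assumed fallback: no '>' was found outside quotes (for example
        // because a quote was opened and never closed), so ignore quotes
        // and take the first '>'.
        return html.indexOf('>', start);
    }

    public static void main(String[] args) {
        String unclosedQuote = "<sometag key1=value key2=\"value2 key3 = value3>";
        String tagsInQuotes = "<sometag key1=\"<sometag>\" key2=\"<!-- skdlskld -->\">";
        System.out.println(findTagEnd(unclosedQuote, 0));
        System.out.println(findTagEnd(tagsInQuotes, 0));
    }
}
```

With this rule both example tags end at their final character. It still guesses wrong when an unclosed quote is followed by later tags that contain quotes of their own, which is exactly where the hand-written rules start to pile up.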