Re: [Htmlparser-user] how to deal with form tag following table tag
From: Somik R. <so...@ya...> - 2002-12-06 05:45:17
Hi Leslie,

Indeed, the <form> tag is a nightmare to work with. At one point, we had removed it from the basic set of scanners. We put it back in after our exception handling mechanism was in place, so now, if things get messy, you should get an exception. We can't possibly handle every bit of screwed-up HTML :), although we try really hard to.

> it would be better if the end-form tag could be 'assumed' so that the
> file could at least be parsed. that would mirror the behavior of commercial
> browsers.

I spent some time on this, tried it, and failed miserably. It turned out to be almost impossible to predict where a form tag should end, because of its expanse, and particularly because of its intermingling with <table>. It's quite possible I missed something, so if you have any innovative suggestions, they would be really helpful. I think we're in the realm of AI here :). Remember, there is a big constraint: we have a streaming, real-time parser, not a DOM-style parser where we have the whole document and can go back and forth. A good heuristic that really works would make our day.

By the way, maybe this discussion would be better held on the dev list. Feel free to join us as a dev (send me your SourceForge id).

Regards,
Somik

----- Original Message -----
From: "Leslie Rohde" <le...@op...>
To: <htm...@li...>
Sent: Thursday, December 05, 2002 5:25 PM
Subject: Re: [Htmlparser-user] how to deal with form tag following table tag

> actually, there are two problems in the case at hand, and i am not at all
> sure that the <table><form> construction is the worst of them.
>
> not only does hotbot produce this invalid sequence, but they also
> failed to close the form tag. it looks like HTMLFormScanner
> simply falls out of the loop at lines 136-154 looking for the end tag and
> throws an exception when not found.
>
> it would be better if the end-form tag could be 'assumed' so that the
> file could at least be parsed. that would mirror the behavior of commercial
> browsers.
>
> Leslie Rohde wrote:
>
> > the construction <table ...><form...> is not allowed in spec, but it
> > does occur in such places as the hotbot search engine results page.
> > currently, htmlparser delivers a flood of errors and exceptions when
> > parsing a hotbot results page.
> >
> > how best to handle this?
> >
>
> --
> Leslie Rohde
> mailto:le...@op...
> http://www.optitext.com
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
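
To make the "assumed end-form tag" idea from this thread concrete, here is a minimal sketch of one possible heuristic that respects the streaming constraint. It is not htmlparser code and does not reproduce what browsers do. The class name ImpliedFormEnd, the method closeForms, and the representation of the input as pre-split tag strings are all assumptions made for illustration. The sketch inserts a synthetic </form> in only two situations: when a second <form> start tag arrives while a form is still open (HTML forms may not nest), and when the input ends with a form still open. It makes no attempt to guess an earlier end point inside a <table>, which is the hard part described above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; none of these names exist in htmlparser.
class ImpliedFormEnd {
    static final String SYNTHETIC_END = "</form>";

    // Single forward pass over the tokens: no look-ahead and no backtracking,
    // which mirrors the streaming constraint mentioned in the thread.
    static List<String> closeForms(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean formOpen = false;
        while (tokens.hasNext()) {
            String token = tokens.next();
            String lower = token.toLowerCase(Locale.ROOT);
            if (isFormStart(lower)) {
                if (formOpen) {
                    // Forms cannot nest, so assume the previous one ended here.
                    out.add(SYNTHETIC_END);
                }
                formOpen = true;
            } else if (lower.equals("</form>")) {
                formOpen = false;
            }
            out.add(token);
        }
        if (formOpen) {
            // Input ended with the form still open: close it instead of failing.
            out.add(SYNTHETIC_END);
        }
        return out;
    }

    // Expects an already lower-cased token.
    static boolean isFormStart(String lower) {
        return lower.equals("<form>") || lower.startsWith("<form ");
    }
}
```

For a hotbot-like sequence such as <table>, <form action="/search">, <input>, </table>, this sketch appends the synthetic </form> after </table>, so the form ends up spanning the rest of the input. That is enough to avoid an exception, but it is only a fallback and not the "good heuristic" asked for above.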