Hi Leslie,
Indeed, the <form> tag is a nightmare to work with. At one point we
removed it from the basic set of scanners. We put it back in once our
exception handling mechanism was in place, so now, if things get messy, you
should get an exception. We can't possibly handle every bit of broken HTML
:), although we try really hard to.
> it would be better if the end-form tag could be 'assumed' so that the
> file could at least be parsed. that would mirror the behavior of
> commercial browsers.
I spent some time on this, tried it, and failed miserably. It turned out
to be almost impossible to predict where a form tag should end, because of
how much markup it can span, and particularly because of the way it
intermingles with <table>.
It's quite possible I missed something, so if you have any innovative
suggestions, they would be really helpful. I think we're in the realm
of AI here :). Remember, there is a big constraint: we have a streaming,
real-time parser, not a DOM-style parser that holds the whole document and
can go back and forth. A good heuristic that really works will make our day.
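To make the constraint concrete, here is a minimal sketch of the simplest
forward-only heuristic I can think of. It is not htmlparser code: the class
and method names are hypothetical, and it assumes the input has already been
split into tag tokens. It relies only on the fact that forms cannot nest, so
a second <form>, or the end of the stream, implies the previous form ended.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; none of these names come from htmlparser.
class ImplicitFormClose {

    // Single forward pass over tag tokens. The only state kept is one
    // boolean, so nothing here needs to look back at earlier input.
    static List<String> closeForms(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean inForm = false;
        while (tokens.hasNext()) {
            String token = tokens.next();
            String lower = token.toLowerCase();
            if (isFormStart(lower)) {
                if (inForm) {
                    // Forms cannot nest: assume the open one ended here.
                    out.add("</form>");
                }
                inForm = true;
            } else if (lower.equals("</form>")) {
                if (!inForm) {
                    continue; // stray end tag with no open form: drop it
                }
                inForm = false;
            }
            out.add(token);
        }
        if (inForm) {
            // End of stream with a form still open: close it.
            out.add("</form>");
        }
        return out;
    }

    static boolean isFormStart(String lower) {
        return lower.equals("<form>") || lower.startsWith("<form ");
    }
}
```

Note what this does not solve: it closes the form at the latest possible
point, so it says nothing about where inside a <table> the form really
ended. That placement is exactly the part I could not get right.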
By the way, this discussion might be better held on the dev list. Feel
free to join us as a dev (send me your sourceforge id).
Regards,
Somik
----- Original Message -----
From: "Leslie Rohde" <le...@op...>
To: <htm...@li...>
Sent: Thursday, December 05, 2002 5:25 PM
Subject: Re: [Htmlparser-user] how to deal with form tag following table tag
> actually, there are two problems in the case at hand, and i am not at all
> sure that the <table><form> construction is the worst of them.
>
> not only does hotbot produce this invalid sequence, but they also
> failed to close the form tag. it looks like HTMLFormScanner
> simply falls out of the loop at lines 136-154 looking for the end tag and
> throws an exception when not found.
>
> it would be better if the end-form tag could be 'assumed' so that the
> file could at least be parsed. that would mirror the behavior of
> commercial browsers.
>
> Leslie Rohde wrote:
>
> > the construction <table ...><form...> is not allowed in spec, but it
> > does occur in such places as the hotbot search engine results page.
> > currently, htmlparser delivers a flood of errors and exceptions when
> > parsing a hotbot results page.
> >
> > how best to handle this?
> >
>
> --
> Leslie Rohde
> mailto:le...@op...
> http://www.optitext.com