[Htmlparser-developer] General Bug Behavior
From: Claude D. <CD...@ar...> - 2002-08-01 18:17:56
We've found three documents over the last few days that cause the HTMLParser to hang. I will make sure they get into the bug database, but the issue centers around what should happen when the parser encounters ill-formed HTML. I would propose that the correct behavior is to throw an exception if the parser is unable to handle the syntax, but right now it just hangs. Clearly, more investigation is required to determine whether it's in a loop or waiting on the input. Since I'm not sure what a fix would entail, I thought it worth raising the issue as a general design question: what should be done when the parser encounters malformed HTML that goes beyond the realm of reasonable recovery?

BTW: The documents we encountered that hung the parser had the following artifacts:

1) Inclusion of the "<!-->" pattern, which is technically an invalid comment syntax.
2) Inclusion of the "<html><head><TITLE>" pattern twice at the beginning of the document.
3) Two opening "<TITLE>" tags with only one ending "</TITLE>" tag.

From our point of view, a hang is devastating in that it does not allow the application to move forward. An exception would be ideal in that it would identify the problem without breaking the application.
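
To make the proposal concrete, here is a minimal sketch of one way a scanning loop could turn a hang into an exception, assuming the hang is a loop that stops consuming input (which has not been confirmed). Every name in it (ProgressGuardSketch, MalformedMarkupException, nextTokenEnd, scan) is hypothetical and invented for illustration; none of it is taken from the HTMLParser code base, and the toy scanner is deliberately flawed so that an unterminated tag makes no progress.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the HTMLParser API.
final class ProgressGuardSketch {

    /** Hypothetical exception type reporting where scanning stalled. */
    static class MalformedMarkupException extends Exception {
        final int position;

        MalformedMarkupException(String message, int position) {
            super(message + " at offset " + position);
            this.position = position;
        }
    }

    /**
     * Toy scanner step: returns the offset just past the next token.
     * Deliberate flaw for the demo: an unterminated tag returns the same
     * offset, which would loop forever without the guard in scan().
     */
    static int nextTokenEnd(String html, int pos) {
        if (html.charAt(pos) == '<') {
            int close = html.indexOf('>', pos);
            return close < 0 ? pos : close + 1;
        }
        int next = html.indexOf('<', pos);
        return next < 0 ? html.length() : next;
    }

    /** Splits the input into tag and text tokens, failing fast on a stall. */
    static List<String> scan(String html) throws MalformedMarkupException {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int end = nextTokenEnd(html, pos);
            // Guard: every iteration must consume at least one character.
            if (end <= pos) {
                throw new MalformedMarkupException("Parser made no progress", pos);
            }
            tokens.add(html.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        try {
            System.out.println(scan("<html><b>hi</b>"));
            System.out.println(scan("<title>Broken <b"));
        } catch (MalformedMarkupException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

With a guard like this the caller can catch the exception, log the offending document, and move on to the next one. If the hang turns out to be a blocking read rather than a loop, a guard of this kind would not help, and a timeout around the whole parse would be the more likely remedy.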