[Htmlparser-developer] Pinpointing a bug in attribute parsing
From: Joseph R. <jmr...@tg...> - 2003-06-12 19:17:59
I've run into an issue using the parser, and I'd like to file a bug, but first I'd like some input as to which parts of the parser's behavior are really wrong.

What I'm actually doing is parsing a page more than once. I download a page, parse it to remove a few unwanted tags, and output the results by calling toHtml() on all the nodes I want to keep. Later in the application, I parse the page again, making some changes to some StringNodes. The problem is that these consecutive parses lead to mangled HTML: the output of the second parse doesn't match the output of the first.

In particular, the problem I'm having is with tags that have standalone attributes. For example, if I have a tag:

    <SOMETAG FOO1="BAR1" FOO2="BAR2" FOO3>

FOO3 is a standalone attribute, which is perfectly valid in HTML. Many common tags use standalone attributes, like CHECKED on a checkbox, DISABLED on a text input, or NOWRAP on a table cell.

The parser doesn't seem to like standalones, though, and assigns each one an empty string as a value, so calling toHtml() gets me the result:

    <SOMETAG FOO1="BAR1" FOO3="" FOO2="BAR2">

This seems wrong to me, but would not itself be disastrous. However, there also appears to be a bug in the parser such that it doesn't like empty-string values for attributes, and toHtml() chokes on them. If I run this first result through the parser again, calling toHtml() now gets me:

    <SOMETAG FOO1="BAR1" FOO3="">

This is clearly wrong. I've lost the attribute FOO2 and its value entirely.

It would seem to me that parsing a page and producing HTML output should be consistent over multiple runs. If I have output from the parser, the parser should consider all of that to be valid HTML, and should produce identical output if the results are run through it again.

Thoughts?

_____________________________________________________________
Joe Robins                      Tel:   212-918-5057
Thaumaturgix, Inc.              Fax:   212-918-5001
19 W. 44th St., 18th Floor      Email: jmr...@tg...
New York, NY 10036              http://www.tgix.com

thau'ma-tur-gy, n.  the working of miracles.
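
The round-trip property asked for above (parse, write out, parse again, get the same text) can be sketched as a small self-contained check. This is only an illustration under stated assumptions: TagRoundTrip, parseAttributes, render and roundTrip are hypothetical names made up for this sketch, not part of the htmlparser API, and the tokenizer handles just a single simple tag. The one design choice it shows is keeping a standalone attribute (stored as null) distinct from an explicit empty value (stored as ""), so neither form is rewritten into the other.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for a tag parser; NOT the htmlparser API. */
final class TagRoundTrip {

    /** Returns the tag name of a single tag such as <SOMETAG A="1" B>. */
    static String tagName(String tag) {
        String body = tag.substring(1, tag.length() - 1).trim();
        int end = 0;
        while (end < body.length() && !Character.isWhitespace(body.charAt(end))) end++;
        return body.substring(0, end);
    }

    /**
     * Parses the attributes of a single tag, preserving their order.
     * A standalone attribute maps to null; an explicit empty value maps to "".
     */
    static Map<String, String> parseAttributes(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String body = tag.substring(1, tag.length() - 1).trim();
        int n = body.length();
        int i = 0;
        // Skip the tag name.
        while (i < n && !Character.isWhitespace(body.charAt(i))) i++;
        while (i < n) {
            while (i < n && Character.isWhitespace(body.charAt(i))) i++;
            if (i >= n) break;
            int start = i;
            while (i < n && body.charAt(i) != '=' && !Character.isWhitespace(body.charAt(i))) i++;
            String name = body.substring(start, i);
            if (i < n && body.charAt(i) == '=') {
                i++; // skip '='
                String value;
                if (i < n && body.charAt(i) == '"') {
                    int close = body.indexOf('"', i + 1);
                    if (close < 0) close = n; // unterminated quote: take the rest
                    value = body.substring(i + 1, close);
                    i = Math.min(close + 1, n);
                } else {
                    int valueStart = i;
                    while (i < n && !Character.isWhitespace(body.charAt(i))) i++;
                    value = body.substring(valueStart, i);
                }
                attrs.put(name, value);
            } else {
                attrs.put(name, null); // standalone attribute, e.g. CHECKED
            }
        }
        return attrs;
    }

    /** Writes the tag back out; standalone attributes get no ="..." part. */
    static String render(String name, Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                sb.append("=\"").append(e.getValue()).append('"');
            }
        }
        return sb.append('>').toString();
    }

    /** One parse-and-write cycle. */
    static String roundTrip(String tag) {
        return render(tagName(tag), parseAttributes(tag));
    }

    public static void main(String[] args) {
        String original = "<SOMETAG FOO1=\"BAR1\" FOO2=\"BAR2\" FOO3>";
        String first = roundTrip(original);
        String second = roundTrip(first);
        System.out.println(first);
        System.out.println(second);
        System.out.println("stable: " + first.equals(second));
    }
}
```

A check of this shape, run against the real parser's output instead of roundTrip(), would make a compact test case to attach to a bug report: feed in a tag, write it out, feed the result back in, and compare the two outputs.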