[Htmlparser-developer] Pinpointing a bug in attribute parsing
From: Joseph R. <jmr...@tg...> - 2003-06-12 19:17:59
I've run into an issue using the parser, and I'd like to file a bug, but
I'd like some input as to which parts of the parser behavior are really
wrong.
What I'm actually doing is parsing a page more than once. I download a
page, parse it to remove a few unwanted tags, and output the results by
calling toHtml() on all the nodes I want to keep. Later in the
application, I parse the page again, making some changes to some
StringNodes.
The problem is that these consecutive parses lead to mangled HTML. The
output of the second parse doesn't match the output of the first. In
particular, the problem that I'm having is with tags which have
standalone attributes. For example, if I have a tag:
<SOMETAG FOO1="BAR1" FOO2="BAR2" FOO3>
FOO3 is a standalone attribute, which is perfectly valid in HTML. Many
common tags use standalone attributes, like CHECKED on a checkbox,
DISABLED on a text input, or NOWRAP on a table cell. The parser doesn't
seem to like standalone attributes, though, and assigns each one an empty
string as its value, so calling toHtml() gets me the result:
<SOMETAG FOO1="BAR1" FOO3="" FOO2="BAR2">
This seems wrong to me, but would not itself be disastrous. However,
there also appears to be a bug in the parser such that it doesn't like
empty-string values for attributes, and toHtml() chokes on them. If I
run this first result through the parser again, calling toHtml() now
gets me:
<SOMETAG FOO1="BAR1" FOO3="">
This is clearly wrong. I've lost the attribute FOO2 and its value entirely.
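To make the expected behavior concrete, here is a minimal sketch of an attribute model that keeps a standalone attribute (no value at all) distinct from an attribute with an empty-string value, and that keeps reading after an empty value. This is a toy for discussion only, not HTMLParser's code: the class TagRoundTrip and all of its methods are hypothetical names, and it only handles a single well-formed opening tag with double-quoted or unquoted values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model for discussion; NOT HTMLParser's implementation. All names are hypothetical.
class TagRoundTrip {

    // Text between '<' and '>' of a single opening tag.
    static String inner(String tag) {
        String t = tag.trim();
        return t.substring(1, t.length() - 1).trim();
    }

    // Tag name: everything up to the first whitespace.
    static String name(String tag) {
        String body = inner(tag);
        int end = 0;
        while (end < body.length() && !Character.isWhitespace(body.charAt(end))) end++;
        return body.substring(0, end);
    }

    // Attributes in source order. A standalone attribute maps to null;
    // an explicit empty value (FOO="") maps to "".
    static Map<String, String> attributes(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String body = inner(tag);
        int n = body.length();
        int i = name(tag).length();
        while (i < n) {
            while (i < n && Character.isWhitespace(body.charAt(i))) i++;
            if (i >= n) break;
            int start = i;
            while (i < n && body.charAt(i) != '=' && !Character.isWhitespace(body.charAt(i))) i++;
            String attrName = body.substring(start, i);
            if (i < n && body.charAt(i) == '=') {
                i++; // skip '='
                String value;
                if (i < n && body.charAt(i) == '"') {
                    int close = body.indexOf('"', i + 1);
                    if (close < 0) close = n; // unterminated quote: take the rest
                    value = body.substring(i + 1, close); // may be "", which is fine
                    i = Math.min(close + 1, n);
                } else {
                    int valueStart = i;
                    while (i < n && !Character.isWhitespace(body.charAt(i))) i++;
                    value = body.substring(valueStart, i);
                }
                attrs.put(attrName, value);
            } else {
                attrs.put(attrName, null); // standalone attribute, no value
            }
        }
        return attrs;
    }

    // Writes the tag back out; standalone attributes are emitted without ="".
    static String render(String name, Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) sb.append("=\"").append(e.getValue()).append('"');
        }
        return sb.append('>').toString();
    }

    static String roundTrip(String tag) {
        return render(name(tag), attributes(tag));
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<SOMETAG FOO1=\"BAR1\" FOO2=\"BAR2\" FOO3>"));
        System.out.println(roundTrip("<SOMETAG FOO1=\"BAR1\" FOO3=\"\" FOO2=\"BAR2\">"));
    }
}
```

Using null for "no value" and an insertion-ordered map is just one way to do it; the point is that both tags above come back with all three attributes, in their original order.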
It would seem to me that parsing a page and producing HTML output should
be consistent over multiple runs. If I have output from the parser, the
parser should consider all of that to be valid HTML, and should produce
identical output if the results are run through it again.
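The consistency property described above can be written down as a small check: run a pass once, run it again on its own output, and compare. The sketch below is only an illustration of that check. RoundTripCheck and isIdempotent are hypothetical names, and the two passes in main are stand-ins; a real test would wrap a parse followed by toHtml() on every node.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper for discussion; not part of HTMLParser.
class RoundTripCheck {

    // True if running the pass a second time leaves the first output unchanged.
    static boolean isIdempotent(UnaryOperator<String> pass, String input) {
        String once = pass.apply(input);
        String twice = pass.apply(once);
        return once.equals(twice);
    }

    public static void main(String[] args) {
        String tag = "<SOMETAG FOO1=\"BAR1\" FOO2=\"BAR2\" FOO3>";

        // Stand-in for a well-behaved pass: trimming twice equals trimming once.
        UnaryOperator<String> stable = String::trim;

        // Stand-in for a lossy pass: it drops one character on every run,
        // so the second output never matches the first.
        UnaryOperator<String> lossy = s -> s.isEmpty() ? s : s.substring(1);

        System.out.println("stable pass idempotent: " + isIdempotent(stable, tag));
        System.out.println("lossy pass idempotent:  " + isIdempotent(lossy, tag));
    }
}
```

Note that the check compares the first output with the second, not the second with the original input, so a parser that normalizes its input on the first pass can still pass it.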
Thoughts?
_____________________________________________________________
Joe Robins Tel: 212-918-5057
Thaumaturgix, Inc. Fax: 212-918-5001
19 W. 44th St., 18th Floor Email: jmr...@tg...
New York, NY 10036 http://www.tgix.com
thau'ma-tur-gy, n. the working of miracles.