Re: [Htmlparser-developer] Pinpointing a bug in attribute parsing
From: Somik R. <so...@ya...> - 2003-06-12 21:21:21
Sounds like an AttributeParser bug. You could write a test that provides this input, and asserts the expected output - in AttributeParserTest (look at the other tests). Then, file a bug report, and choose the category "AttributeParser". Kaarle Kaila will dive in to the rescue.

Regards,
Somik

----- Original Message -----
From: "Joseph Robins" <jmr...@tg...>
To: <htm...@li...>
Sent: Thursday, June 12, 2003 12:17 PM
Subject: [Htmlparser-developer] Pinpointing a bug in attribute parsing

> I've run into an issue using the parser, and I'd like to file a bug, but
> I'd like some input as to which parts of the parser behavior are really
> wrong.
>
> What I'm actually doing is parsing a page more than once. I download a
> page, parse it to remove a few unwanted tags, and output the results by
> calling toHtml() on all the nodes I want to keep. Later in the
> application, I parse the page again, making some changes to some
> StringNodes.
>
> The problem is that these consecutive parses lead to mangled HTML. The
> output of the second parse doesn't match the output of the first. In
> particular, the problem that I'm having is with tags which have
> standalone attributes. For example, if I have a tag:
>
>     <SOMETAG FOO1="BAR1" FOO2="BAR2" FOO3>
>
> FOO3 is a standalone attribute, which is perfectly valid in HTML. Many
> common tags use standalone attributes, like CHECKED on a checkbox,
> DISABLED on a text input, or NOWRAP on a table cell. The parser doesn't
> seem to like standalones, though, and assigns it an empty string as a
> value, so calling toHtml() gets me the result:
>
>     <SOMETAG FOO1="BAR1" FOO3="" FOO2="BAR2">
>
> This seems wrong to me, but would not itself be disastrous. However,
> there also appears to be a bug in the parser such that it doesn't like
> empty-string values for attributes, and toHtml() chokes on them. If I
> run this first result through the parser again, calling toHtml() now
> gets me:
>
>     <SOMETAG FOO1="BAR1" FOO3="">
>
> This is clearly wrong. I've lost the attribute FOO2 and its value
> entirely.
>
> It would seem to me that parsing a page and producing HTML output should
> be consistent over multiple runs. If I have output from the parser, the
> parser should consider all of that to be valid HTML, and should produce
> identical output if the results are run through it again.
>
> Thoughts?
>
> _____________________________________________________________
> Joe Robins                    Tel: 212-918-5057
> Thaumaturgix, Inc.            Fax: 212-918-5001
> 19 W. 44th St., 18th Floor    Email: jmr...@tg...
> New York, NY 10036            http://www.tgix.com
>
> thau'ma-tur-gy, n. the working of miracles.
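
The round-trip property described in the quoted message (parse, call toHtml(), parse again, and get identical output) can be written down as a small check. The sketch below is a standalone illustration only: it does not use htmlparser's AttributeParser or any other htmlparser class, and the names RoundTripSketch, parseAttributes, tagName, toHtml and roundTrip are hypothetical. It assumes a simplified tag syntax (double-quoted values, no nested quotes) and stores a standalone attribute as a null value so it can be told apart from an empty string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a tag parser; NOT htmlparser's API.
class RoundTripSketch {
    // NAME optionally followed by ="VALUE". Group 2 stays null when the
    // attribute is standalone, and is "" when the value is an empty string.
    private static final Pattern ATTR =
        Pattern.compile("([A-Za-z][A-Za-z0-9]*)(?:\\s*=\\s*\"([^\"]*)\")?");

    // Strip the angle brackets from a single tag such as <SOMETAG A="1" B>.
    private static String inner(String tag) {
        String t = tag.trim();
        return t.substring(1, t.length() - 1).trim();
    }

    static String tagName(String tag) {
        String in = inner(tag);
        int space = in.indexOf(' ');
        return space < 0 ? in : in.substring(0, space);
    }

    // LinkedHashMap keeps the attributes in source order.
    static Map<String, String> parseAttributes(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String in = inner(tag);
        int space = in.indexOf(' ');
        if (space < 0) {
            return attrs;
        }
        Matcher m = ATTR.matcher(in.substring(space + 1));
        while (m.find()) {
            attrs.put(m.group(1), m.group(2)); // null marks a standalone attribute
        }
        return attrs;
    }

    static String toHtml(String name, Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            sb.append(' ').append(e.getKey());
            if (e.getValue() != null) {
                sb.append("=\"").append(e.getValue()).append('"');
            }
        }
        return sb.append('>').toString();
    }

    static String roundTrip(String tag) {
        return toHtml(tagName(tag), parseAttributes(tag));
    }

    public static void main(String[] args) {
        String input = "<SOMETAG FOO1=\"BAR1\" FOO2=\"BAR2\" FOO3>";
        String once = roundTrip(input);
        String twice = roundTrip(once);
        System.out.println(once);
        System.out.println(twice);
        // The property under test: a second pass must not change the output.
        if (!once.equals(twice)) {
            throw new AssertionError("round trip changed the tag: " + twice);
        }
    }
}
```

A test added to AttributeParserTest, as suggested in the reply, would presumably assert the same two things against the real parser: that a standalone attribute survives the first pass, and that a tag containing an empty-string value such as FOO3="" keeps every following attribute on the second pass.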