From: SourceForge.net <no...@so...> - 2012-07-27 09:23:16
Bugs item #2909310, was opened at 2009-12-05 07:04
Message generated for change (Comment added) made by kohlschuetter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=952178&aid=2909310&group_id=195122

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: tag-balancer
Group: None
Status: Open
Resolution: None
>Priority: 7
Private: No
Submitted By: Christian Kohlschütter (kohlschuetter)
Assigned to: Marc Guillemot (mguillem)
Summary: unbalanced DIV inside A causing problems

Initial Comment:
Hi,

the following page causes nekohtml 1.9.13 to produce an incorrect DOM tree because of unbalanced tags:
http://abclocal.go.com/kgo/story?section=news/national_world&id=7153024
(in line 381 we have <div class="icon_digg"></a>)

You can easily reproduce this by using my boilerplate removal library "boilerpipe" ( http://code.google.com/p/boilerpipe/ ), which uses nekohtml 1.9.13 upstream.

A user complained about the results, but it turned out to be an HTML parsing issue -- after inserting the missing </div> tag into the source it works as expected.

Would be great if you could fix this problem.

Cheers,
Christian

----------------------------------------------------------------------

>Comment By: Christian Kohlschütter (kohlschuetter)
Date: 2012-07-27 02:23

Message:
Still broken with 1.9.16

----------------------------------------------------------------------

Comment By: Benjamin McCann (chengas123)
Date: 2010-11-21 13:32

Message:
Any update on this? Neko is placing links and divs inside of links, neither of which is valid. I would really like to see Neko match Firefox's behavior here.
Input: <html><a><div><a></html>
Neko: <HTML><HEAD/><BODY><A><DIV><A/></DIV></A></BODY></HTML>
Firefox: <HTML><HEAD/><BODY><A></A><DIV><A/></A></DIV></BODY></HTML>

----------------------------------------------------------------------

Comment By: Christian Kohlschütter (kohlschuetter)
Date: 2009-12-14 13:52

Message:
I am quite sure that nekohtml has more bugs than this one. Even if this is just the tip of the iceberg, I would like to see this fixed.

I have checked the test cases you mentioned and saw that nekohtml now fails a few tests. I have updated my patch and created a new canonical test for my example. All canonical tests now pass.

The new patch now only changes behaviour for the A tag, while keeping everything else intact (I added a new constructor with a "nestable" parameter that is only called for A).

----------------------------------------------------------------------

Comment By: Marc Guillemot (mguillem)
Date: 2009-12-14 12:58

Message:
I don't agree with the patch: it only solves the tip of the iceberg. The problem is more general and some unit tests like test-a_href-around-p.html are wrong: an <a> tag should not be able to wrap tags like <p> or <div>.

----------------------------------------------------------------------

Comment By: Christian Kohlschütter (kohlschuetter)
Date: 2009-12-11 07:51

Message:
Sure, here it is. However, there was no conflicting change compared to the previous patch. I have just taken the opportunity to remove some newline/encoding changes from the patch.

----------------------------------------------------------------------

Comment By: Marc Guillemot (mguillem)
Date: 2009-12-11 04:56

Message:
Your patch seems to be against release 1.13. Can you provide a patch against latest sources from SVN?

----------------------------------------------------------------------

Comment By: Christian Kohlschütter (kohlschuetter)
Date: 2009-12-10 01:52

Message:
I think I have tracked down the bug (patch provided).
NekoHTML currently aborts scanning the element stack if a parent element is a block-level element. This does not appear to be correct for elements that are defined to be non-nestable.

The patch does the following:
- Add a "nestable" attribute to the Element class. Its state is determined automatically by the arguments passed to the constructor (it is set to true if the element's code id is contained in the closes array).
- If "nestable" is true, the element stack gets traversed completely for startElement and endElement

Cheers,
Christian

----------------------------------------------------------------------

Comment By: Christian Kohlschütter (kohlschuetter)
Date: 2009-12-09 05:21

Message:
Sorry, there was a typo in the minimal example (missing slashes). I of course meant:

<a><div></a>
<a><div></a>

----------------------------------------------------------------------

Comment By: Christian Kohlschütter (kohlschuetter)
Date: 2009-12-09 05:20

Message:
And this is what Firefox 3.5.5 and Safari 4.0.4 make out of it -- determined with javascript:window.alert(document.body.outerHTML):

<body><a></a><div>
<a></a><div>
</div></div></body>

----------------------------------------------------------------------

Comment By: Christian Kohlschütter (kohlschuetter)
Date: 2009-12-09 04:49

Message:
Hi Marc,

yes. I think the minimal example is this:

<a><div><a>
<a><div><a>

which results in the following tree:

<HTML>
  <HEAD>
  </HEAD>
  <BODY>
    <A>
      <DIV>
        <A>
          <DIV>
          </DIV>
        </A>
      </DIV>
    </A>
  </BODY>
</HTML>

whereas <a><div><a> once yields the following:

<HTML>
  <HEAD>
  </HEAD>
  <BODY>
    <A>
      <DIV>
      </DIV>
    </A>
  </BODY>
</HTML>

This behaviour (for the first case) violates the HTML 4 Spec ( http://www.w3.org/TR/html4/struct/links.html#h-12.2.2 ) section 12.2.2 "Nested links are illegal".
Cheers,
Christian

----------------------------------------------------------------------

Comment By: Marc Guillemot (mguillem)
Date: 2009-12-07 12:20

Message:
Can you provide a *minimal* example of HTML code that is not correctly parsed?

----------------------------------------------------------------------
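
Editorial sketch of the idea behind the 2009-12-10 patch description. The patch itself is not part of this thread, so the code below is NOT NekoHTML code: the class NestableSketch, its fields, the BLOCK and NON_NESTABLE sets, and the fullTraversal flag are hypothetical names invented for illustration. It only contrasts two ways of searching a stack of open elements when a new <a> start tag arrives: stopping at the first block-level parent (the behaviour the comment attributes to NekoHTML) versus scanning the whole stack for a non-nestable element. The output of the second mode is well-nested and free of nested links, but it does not claim to match the tree Firefox builds.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Toy tag balancer, start tags only. All names here are hypothetical.
class NestableSketch {
    // Assumed stand-ins for a real element table.
    static final Set<String> BLOCK = Set.of("div", "p");
    static final Set<String> NON_NESTABLE = Set.of("a");

    private final Deque<String> open = new ArrayDeque<>(); // head = innermost element
    private final StringBuilder out = new StringBuilder();
    private final boolean fullTraversal;

    NestableSketch(boolean fullTraversal) {
        this.fullTraversal = fullTraversal;
    }

    void start(String name) {
        if (NON_NESTABLE.contains(name)) {
            // Close everything up to and including an already open element of the same name.
            int toClose = depthOfOpen(name);
            for (int i = 0; i < toClose; i++) {
                closeTop();
            }
        }
        open.push(name);
        out.append('<').append(name).append('>');
    }

    // Number of elements to pop so that the nearest open `name` is closed, or 0 if none is found.
    private int depthOfOpen(String name) {
        int depth = 0;
        for (String element : open) { // iterates from innermost to outermost
            depth++;
            if (element.equals(name)) {
                return depth;
            }
            if (!fullTraversal && BLOCK.contains(element)) {
                return 0; // give up at a block-level parent
            }
        }
        return 0;
    }

    private void closeTop() {
        out.append("</").append(open.pop()).append('>');
    }

    String finish() {
        while (!open.isEmpty()) {
            closeTop();
        }
        return out.toString();
    }

    // Convenience wrapper: feed a sequence of start tags and return the balanced markup.
    static String balance(boolean fullTraversal, String... startTags) {
        NestableSketch sketch = new NestableSketch(fullTraversal);
        for (String tag : startTags) {
            sketch.start(tag);
        }
        return sketch.finish();
    }
}
```

For the start-tag sequence a, div, a from the minimal example, NestableSketch.balance(false, "a", "div", "a") leaves the second link nested inside the first, while NestableSketch.balance(true, "a", "div", "a") closes the open <div> and <a> before opening the new link.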