Thread: [Htmlparser-user] Endless recursion bug in toHtml() ?
Brought to you by:
derrickoswald
From: Bram <br...@av...> - 2006-04-12 12:31:43
Attachments:
TestHtmlTransformer.java
|
Hello, I've isolated a problem, where sometimes htmlparser runs into an endless recursion loop when transforming links in HTML pages. Basically, I fetch & parse a document, change attributes on links, and reassemble it into a HTML page again. Without changing the attributes, the problem does not occur. See the attached JUnit test case. Best regards, Bram Avontuur -- Latest survey shows that 3 out of 4 people make up 75% of the world's population. |
From: Derrick O. <Der...@Ro...> - 2006-04-13 03:44:25
|
Bram, The mini <A/> tag in your test case is considered a content-less XML tag (see isEmptyXmlTag() in TagNode). This causes the parser to set the end tag reference to be the start tag - this is done so that there will always be an end tag, which may be a bad design decision, but it was thought to be better than inventing a non-existent tag, or returning null. When you add an HREF attribute in this case, the only attribute (the tag name with the slash) is no longer the last attribute and hence isEmptyXmlTag returns false, but the end tag reference is still pointing to the start tag, which causes the recursion. I guess the add attribute code could be smarter and detect this pathological situation, but I'm wondering if that's the real solution or just a band-aid. Was this discovered in the wild? Why is the XML syntax used for an empty link? Is this an XML file? Perhaps an XML parser would be a better choice. Derrick Bram wrote: >Hello, > >I've isolated a problem, where sometimes htmlparser runs into an >endless recursion loop when transforming links in HTML pages. > >Basically, I fetch & parse a document, change attributes on links, and >reassemble it into a HTML page again. Without changing the attributes, >the problem does not occur. > >See the attached JUnit test case. > >Best regards, >Bram Avontuur > > >------------------------------------------------------------------------ > >import java.util.logging.Logger; > >import org.htmlparser.Node; >import org.htmlparser.Parser; >import org.htmlparser.filters.TagNameFilter; >import org.htmlparser.nodes.TagNode; >import org.htmlparser.util.NodeList; >import org.htmlparser.util.ParserException; > >import junit.framework.TestCase; > >public class TestHtmlTransformer extends TestCase { > private static final Logger logger = Logger.global; > > public void testHtmlTransformer() { > String html = "<a/>"; > NodeList nodes = null; > > try { > Parser parser = Parser.createParser(html, "UTF-8"); > nodes = parser.parse(null); > } catch (ParserException e) { > logger.severe("ParserException: " + e.getMessage()); > fail("ParserException"); > } > > nodes.toHtml(); //this works fine. > logger.info("First html conversion went OK"); > > NodeList links = nodes.extractAllNodesThatMatch( > new TagNameFilter("a"), true); > > for (int i = 0; i < links.size(); i++) { > Node tmpNode = links.elementAt(i); > TagNode link = null; > if (tmpNode instanceof TagNode) { > link = (TagNode)links.elementAt(i); > } else { > continue; > } > > link.setAttribute("href", "http://foo.com/"); > } > > try { > nodes.toHtml(); //now it starts endless recursive loop > } catch (StackOverflowError e) { > //e.printStackTrace(); > fail("Endless recursion detected."); > } > logger.info("Passed second conversion"); //you never see this > } >} > > |
From: Bram <br...@av...> - 2006-04-13 08:13:37
Attachments:
TestHtmlTransformer.java
|
Hello Derrick, > The mini <A/> tag in your test case is considered a content-less XML > tag (see isEmptyXmlTag() in TagNode). > This causes the parser to set the end tag reference to be the start tag > - this is done so that there will always be an end tag, which may be a > bad design decision, but it was thought to be better than inventing a > non-existent tag, or returning null. > > When you add an HREF attribute in this case, the only attribute (the tag > name with the slash) is no longer the last attribute and hence > isEmptyXmlTag returns false, but the end tag reference is still pointing > to the start tag, which causes the recursion. > Thanks for the explanation. > I guess the add attribute code could be smarter and detect this > pathological situation, but I'm wondering if that's the real solution or > just a band-aid. > That sounds like a band-aid indeed, but it's a viable solution to prevent this problem until something better has been worked out, IMHO. > Was this discovered in the wild? Why is the XML syntax used for an > empty link? Is this an XML file? Perhaps an XML parser would be a > better choice. > Apparently I over-simplified the problem I occasionally occur. I tried to use the exact same test case on the URL that gave me the problems, and now it worked flawlessly. After a bit of tinkering I noticed I changed the name of the attribute (from 'onclick' to 'href') while trying to isolate the problem yesterday. Changing it back to 'onclick' will fail the test. This can be explained by the fact that the broken <a> tag already had a 'href' attribute, and simply changing an existing attribute doesn't trigger the problem. This is demonstrated in the newly attached JUnit test. Sadly, '<a ... />' tags do occur in the wild, so it would be very good to prevent this from happing, from a user's point of view. Bram -- Light travels faster than sound. This is why some people appear bright until you hear them speak. |