Menu

#146 2.13 does not correct table structure

v2.14
closed-fixed
nobody
None
5
2015-09-14
2015-08-06
No

Missing tbody and tr no longer added.
This page has a missing tr around the first table row at the top:
http://emits.sso.esa.int/emits/owa/emits_online.showao?typ1=6599&user=Anonymous

Previous releases of HtmlCleaner would add the missing tbody and tr but 2.13 closes the table immediately. Both browsers I have tested automatically add the missing tbody and tr.

Discussion

1 2 > >> (Page 1 of 2)
  • Scott Wilson

    Scott Wilson - 2015-08-06

    Yes, this is a side-effect with resolving the previous crtiical bug with infinite loops in required parent tags.

    The possible resolutions are:

    • Downgrade support for other required parent tags (i.e. revert to earlier functionality, but omit full support for many HTML5 tags)
    • Rewrite the engine to better handle all of these cases without going into a loop
     
  • Martin Denham

    Martin Denham - 2015-08-10

    By 'revert to earlier functionality' how much earlier do you mean. We are currently using 2.10 with a couple of earlier fixes from 2.11 (built from sourcefourge on 26th Feb) and it corrects table structure errors like this very well.

     
  • Scott Wilson

    Scott Wilson - 2015-08-10

    Prior to 2.13 is fine; in 2.13 I modified parse order; see the Release Note here: http://htmlcleaner.sourceforge.net/release.php

    Basically, in 2.13 the order of actions was changed as before this could lead to some odd situations where there was an infinite loop. Its hardly an ideal fix, but an OOME is more critical than poorer table support. Its definitely a problem though.

     
  • Scott Wilson

    Scott Wilson - 2015-08-10

    Martin - I've had another read of the HTML5 spec and I think this is a special case relating to unexpected markup when in table context - I've had a go at making a change to the rules in HC that should help - hopefully without other side-effects (well, all the tests pass at least).

    Can you do a build from the current trunk and try that out?

     
  • Martin Denham

    Martin Denham - 2015-08-12

    Hi Scott, I have built revision 418 from sourceforge and it successfully corrects the table problem in this thread, but does not correct any other issues e.g. ul structure. I have not tested this build on other pages yet but should be able to do that by tomorrow.

    Martin

     
  • Martin Denham

    Martin Denham - 2015-08-13

    I ran a full test last night and unfortunately received OOM errors.

    The hprof has a lot of htmlcleaner.TagNode objects.
    Here is the stack trace of one thread running when an OOM occurred. I don't know if this is the one which caused the error:

    "asyncExecutor-51" prio=5 tid=139 RUNNABLE
    at java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
    at org.htmlcleaner.TagNode.<init>(TagNode.java:58)
    Local Variable: org.htmlcleaner.TagNode#1506318
    at org.htmlcleaner.TagNode.<init>(TagNode.java:109)
    at org.htmlcleaner.HtmlCleaner.newTagNode(HtmlCleaner.java:650)
    at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:977)
    Local Variable: java.util.ArrayList$ListItr#25
    Local Variable: org.htmlcleaner.TagNode#1537585
    Local Variable: java.util.ArrayList#1546657
    at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
    at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552)
    Local Variable: java.lang.String#198058
    at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449)
    Local Variable: org.htmlcleaner.HtmlCleaner#7
    Local Variable: org.htmlcleaner.HtmlTokenizer#7
    Local Variable: org.htmlcleaner.CleanTimeValues#7
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)</init></init></init>

    Here is another I noticed:
    java.lang.OutOfMemoryError: Java heap space
    at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
    at java.util.HashMap.putVal(HashMap.java:630)
    at java.util.HashMap.putMapEntries(HashMap.java:514)
    at java.util.HashMap.putAll(HashMap.java:784)
    at org.htmlcleaner.TagNode.replaceAttributes(TagNode.java:230)
    at org.htmlcleaner.TagNode.setForeignMarkup(TagNode.java:835)
    at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:868)
    at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
    at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552)
    at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449)
    at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)
    

    I have an hdump if required.

     
  • Scott Wilson

    Scott Wilson - 2015-08-13

    Can you post the HTML that triggers the OOME?

     
  • Martin Denham

    Martin Denham - 2015-08-13

    Not immediately because our code does not catch Errors, only Exceptions, and we have a lot of threads running simultaneously so logs don't help much but I should be able to add an OutOfMemoryError catch statement and run the test again if necessary - it will take a while.

     
  • Martin Denham

    Martin Denham - 2015-08-13

    I have found a few urls that seemed to cause OOM but not all of the time. I tested the following url and it caused an OOM, but after a restart it seemed to process normally, then on the third attempt caused another OOM:
    http://www.medievalpottery.org.uk/contact.htm
    java.lang.OutOfMemoryError: GC overhead limit exceeded
    org.htmlcleaner.TagNode.attributesToLowerCase(TagNode.java:859)
    org.htmlcleaner.TagNode.getAttributesInLowerCase(TagNode.java:161)
    org.htmlcleaner.TagNode.setForeignMarkup(TagNode.java:835)
    org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:868)
    org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
    org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552)
    org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
    org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449)
    org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)
    org.htmlcleaner.Serializer.getAsString(Serializer.java:214)

    This is when using revision 418 of HtmlCleaner from Sourceforge.

    Other urls that sometimes seem to cause OOM are:
    http://www.wellbeingofwomen.org.uk/research/research-grants/?menu=1c
    http://www.icr-global.org/contact-us/
    http://www.ics.ac.uk/icf/research-and-achievements/grants-and-research-awards/gold-medal-award/
    http://www.marinemammals.gov.au/grants

     

    Last edit: Martin Denham 2015-08-13
  • Scott Wilson

    Scott Wilson - 2015-08-13

    Thanks Martin - I appreciate your help and patience with this.

    I've made a few tweaks to the HTML5 processor rules, and also added extra checks to the parser. I don't get any OOMs with those URLs now.

     
  • Martin Denham

    Martin Denham - 2015-08-14

    I have done some basic testing on revision 420 and the OOM has disappeared. I should be able to report back on more thorough tests next week.

     
  • Scott Wilson

    Scott Wilson - 2015-08-14

    Great! If all continues to be well I'll make a new release with the changes.

     
  • Martin Denham

    Martin Denham - 2015-08-18

    Apologies, for feeding back piecemeal but it would be difficult for me to analyze all potential problem pages at once, so I hope that by passing back possible issues as I notice them it fixes a lot of potential problem pages I haven't yet analyzed.

    There seems to be a possible issue with structures like <ul><div><li>..</li></div></ul> as seen on this page: http://eacea.ec.europa.eu/erasmus-plus/funding_en
    Looking at the xpath for "Selection results - Knowledge Alliances 2015 EAC/A04/2014" (top right). The latest HtmlCleaner removes the div between ul and li but Chrome and an older HtmlCleaner leave it there.

    A different inconsistency which I can't understand occurs on this page: http://indigoprojects.eu/funding/indigo-calls
    The xpath for the central text block below the title has changed.
    Old HtmlCleaner and Chrome is:

    //*[@id="content"]/div/div[1]/div[2]/section[1]/div/article/div

    Latest HtmlCleaner moves one of the divs but I can't see why resulting in an xpath of:

    //*[@id="content"]/div[1]/div[3]/section[1]/div[1]/article[1]/div[1]

    So /div[1]/div[2] has become div[3]

    I am making the assumption that if old HtmlCleaner and Chrome view pages similarly then they are correct.

     

    Last edit: Martin Denham 2015-08-18
  • Scott Wilson

    Scott Wilson - 2015-08-18

    On the EAC example: the HTML 4/5 processing rules are crystal clear on this one - you can only have LI elements as the direct children of a UL. So HC is correct to move the DIV outside of the UL tag. The only other option is to wrap the DIV inside an LI, but HC currently doesn't have a rule setting to do that. I guess Chrome is just being lenient in that case.

     
  • Scott Wilson

    Scott Wilson - 2015-08-18

    On th esecond case I don't see a div moved here - the structure looks the same in both Chrome and after using HC 2.14-SNAPSHOT), and the first XPath evaluates to the article DIV in each case.

    CORRECTION my mistake I was using the wrong file. OK, I've replicated your result. Now to discover what's happening

     

    Last edit: Scott Wilson 2015-08-18
  • Scott Wilson

    Scott Wilson - 2015-08-18

    OK, on the Indigo page, the issue is an unclosed DIV, and Chrome and HC seem to process that slightly differently, probably due to rule ordering. I'll take a look.

     
  • Scott Wilson

    Scott Wilson - 2015-08-18

    Right, its something to do with the tag provider rules. If you do:

    curl http://indigoprojects.eu/funding/indigo-calls | java -jar htmlcleaner-2.14-SNAPSHOT.jar htmlversion=4 dest=output.html
    

    Then the XPath expression works as before. So there's a problem somewhere in the Html5TagProvider class.

     
  • Scott Wilson

    Scott Wilson - 2015-08-18

    OK, once again its caused by Chrome taking the easy way out and letting you include a DIV within a UL, while in HC the initial DIV token is moved, causing the chain of consequences that ends up with the output looking even more broken as a result :(

    I think we may be better off doing two-pass cleaning; first a relatively tolerant token-based pass to build the tree, then a set of tree operations to move tag structures out of invalid positions. For now maybe I'll just get rid of content restrictions on list tags.

     
  • Martin Denham

    Martin Denham - 2015-08-19

    Do you think there may be situations where different users would want different appproaches to cleaning? At the moment I am steered by a desire for backward consistency that also often happens to be consistent with Chrome and some other browsers, but many might have a requirement for strict html conformance.

    So, another idea might be to have selectable cleaning profiles. You could have e.g. CHROME_PROFILE/RELAXED_PROFILE, STRICT_PROFILE/HTML_SPEC_PROFILE. These would consist of a few boolean values and settings that are used during the cleaning process.

     
  • Scott Wilson

    Scott Wilson - 2015-08-19

    You can actually do this to some extent already by using the TagProvider interface. HC ships with two different tag providers - HTML4 and HTML5. If you specify "htmlversion=4|5" on the command line it will switch to the one you specify. Within Java you can override the TagProvider with your own custom implementation - e.g. "Html5ChromeRelaxedTagProvider" - and then set it to be the one used in the cleaner prooperties. However I like the idea of simply having "strict" and "lax" versions as well as this deeper level of customisation.

     
  • Martin Denham

    Martin Denham - 2015-08-19

    If I set htmlversion=4 do you think the tidied pages will be more similar to the older version of HC we currently use (2.10) and also Chrome?

     
  • Scott Wilson

    Scott Wilson - 2015-08-19

    Well, YMMV. Its certainly less strict about lists and tables, but it also misses out the semantics of new elements such as ruby, nav etc. But it could be a good workaround until I make a new release with the strict/lax option

     
  • Scott Wilson

    Scott Wilson - 2015-08-24
    • status: open --> closed-fixed
     
  • Martin Denham

    Martin Denham - 2015-09-11

    Hi,
    Just wondering if you are still considering implementing a strict/lax option.
    Cheers
    Martin

     
  • Scott Wilson

    Scott Wilson - 2015-09-11

    Definitely still an option, though for the current release I've made the default HTML5 profile "lax".

     
1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB