Missing tbody and tr no longer added.
This page has a missing tr around the first table row at the top:
http://emits.sso.esa.int/emits/owa/emits_online.showao?typ1=6599&user=Anonymous
Previous releases of HtmlCleaner would add the missing tbody and tr but 2.13 closes the table immediately. Both browsers I have tested automatically add the missing tbody and tr.
Yes, this is a side effect of resolving the previous critical bug with infinite loops in required parent tags.
The possible resolutions are:
By 'revert to earlier functionality', how much earlier do you mean? We are currently using 2.10 with a couple of earlier fixes from 2.11 (built from SourceForge on 26th Feb) and it corrects table structure errors like this very well.
Prior to 2.13 is fine; in 2.13 I modified parse order; see the Release Note here: http://htmlcleaner.sourceforge.net/release.php
Basically, in 2.13 the order of actions was changed, because the previous order could lead to some odd situations where there was an infinite loop. It's hardly an ideal fix, but an OOME is more critical than poorer table support. It's definitely a problem though.
Martin - I've had another read of the HTML5 spec and I think this is a special case relating to unexpected markup when in table context - I've had a go at making a change to the rules in HC that should help - hopefully without other side-effects (well, all the tests pass at least).
Can you do a build from the current trunk and try that out?
Hi Scott, I have built revision 418 from sourceforge and it successfully corrects the table problem in this thread, but does not correct any other issues e.g. ul structure. I have not tested this build on other pages yet but should be able to do that by tomorrow.
Martin
I ran a full test last night and unfortunately received OOM errors.
The hprof has a lot of htmlcleaner.TagNode objects.
Here is the stack trace of one thread running when an OOM occurred. I don't know if this is the one which caused the error:
"asyncExecutor-51" prio=5 tid=139 RUNNABLE
at java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
at org.htmlcleaner.TagNode.<init>(TagNode.java:58)
Local Variable: org.htmlcleaner.TagNode#1506318
at org.htmlcleaner.TagNode.<init>(TagNode.java:109)
at org.htmlcleaner.HtmlCleaner.newTagNode(HtmlCleaner.java:650)
at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:977)
Local Variable: java.util.ArrayList$ListItr#25
Local Variable: org.htmlcleaner.TagNode#1537585
Local Variable: java.util.ArrayList#1546657
at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552)
Local Variable: java.lang.String#198058
at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449)
Local Variable: org.htmlcleaner.HtmlCleaner#7
Local Variable: org.htmlcleaner.HtmlTokenizer#7
Local Variable: org.htmlcleaner.CleanTimeValues#7
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)
I have an hdump if required.
Can you post the HTML that triggers the OOME?
Not immediately, because our code does not catch Errors, only Exceptions, and we have a lot of threads running simultaneously, so the logs don't help much. I should be able to add an OutOfMemoryError catch statement and run the test again if necessary, but it will take a while.
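A minimal sketch of the kind of catch block described above, assuming each page is cleaned through a single call per URL. The class name OomUrlLogger, the method cleanSafely and the `cleaner` function are hypothetical stand-ins, not part of HtmlCleaner or of the actual application code.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Function;

class OomUrlLogger {
    // URLs whose processing ended in an OutOfMemoryError (thread-safe list).
    static final List<String> failedUrls = new CopyOnWriteArrayList<>();

    // Runs `cleaner` on the page and catches OutOfMemoryError as well as
    // ordinary exceptions, so the URL being processed is recorded.
    // `cleaner` is a placeholder for the real cleaning call.
    static String cleanSafely(String url, String html, Function<String, String> cleaner) {
        try {
            return cleaner.apply(html);
        } catch (OutOfMemoryError e) {
            failedUrls.add(url);
            System.err.println("OOM while cleaning " + url + ": " + e.getMessage());
            return null;
        } catch (RuntimeException e) {
            System.err.println("Failed to clean " + url + ": " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        // Simulated cleaner that throws an OutOfMemoryError for one input.
        Function<String, String> fake = html -> {
            if (html.contains("bad")) throw new OutOfMemoryError("simulated");
            return html.trim();
        };
        System.out.println(cleanSafely("http://example.org/ok", " <p>ok</p> ", fake));
        System.out.println(cleanSafely("http://example.org/bad", "<p>bad</p>", fake));
        System.out.println("Failed URLs: " + failedUrls);
    }
}
```

With many threads sharing one heap, the thread that receives the error is not necessarily the one processing the problem page, so a URL recorded this way is a candidate to retest in isolation rather than a confirmed cause.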
I have found a few urls that seemed to cause OOM but not all of the time. I tested the following url and it caused an OOM, but after a restart it seemed to process normally, then on the third attempt caused another OOM:
http://www.medievalpottery.org.uk/contact.htm
java.lang.OutOfMemoryError: GC overhead limit exceeded
org.htmlcleaner.TagNode.attributesToLowerCase(TagNode.java:859)
org.htmlcleaner.TagNode.getAttributesInLowerCase(TagNode.java:161)
org.htmlcleaner.TagNode.setForeignMarkup(TagNode.java:835)
org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:868)
org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552)
org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449)
org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)
org.htmlcleaner.Serializer.getAsString(Serializer.java:214)
This is when using revision 418 of HtmlCleaner from Sourceforge.
Other urls that sometimes seem to cause OOM are:
http://www.wellbeingofwomen.org.uk/research/research-grants/?menu=1c
http://www.icr-global.org/contact-us/
http://www.ics.ac.uk/icf/research-and-achievements/grants-and-research-awards/gold-medal-award/
http://www.marinemammals.gov.au/grants
Last edit: Martin Denham 2015-08-13
Thanks Martin - I appreciate your help and patience with this.
I've made a few tweaks to the HTML5 processor rules, and also added extra checks to the parser. I don't get any OOMs with those URLs now.
I have done some basic testing on revision 420 and the OOM has disappeared. I should be able to report back on more thorough tests next week.
Great! If all continues to be well I'll make a new release with the changes.
Apologies for feeding back piecemeal, but it would be difficult for me to analyze all potential problem pages at once. I hope that passing back possible issues as I notice them fixes a lot of potential problem pages I haven't yet analyzed.
There seems to be a possible issue with structures like <ul><div><li>..</li></div></ul> as seen on this page: http://eacea.ec.europa.eu/erasmus-plus/funding_en
Looking at the xpath for "Selection results - Knowledge Alliances 2015 EAC/A04/2014" (top right). The latest HtmlCleaner removes the div between ul and li but Chrome and an older HtmlCleaner leave it there.
A different inconsistency which I can't understand occurs on this page: http://indigoprojects.eu/funding/indigo-calls
The xpath for the central text block below the title has changed.
Old HtmlCleaner and Chrome is:
Latest HtmlCleaner moves one of the divs but I can't see why resulting in an xpath of:
So /div[1]/div[2] has become div[3]
I am making the assumption that if old HtmlCleaner and Chrome view pages similarly then they are correct.
Last edit: Martin Denham 2015-08-18
On the EAC example: the HTML 4/5 processing rules are crystal clear on this one - you can only have LI elements as the direct children of a UL. So HC is correct to move the DIV outside of the UL tag. The only other option is to wrap the DIV inside an LI, but HC currently doesn't have a rule setting to do that. I guess Chrome is just being lenient in that case.
On the second case I don't see a div moved here - the structure looks the same in both Chrome and after using HC (2.14-SNAPSHOT), and the first XPath evaluates to the article DIV in each case.
CORRECTION: my mistake, I was using the wrong file. OK, I've replicated your result. Now to discover what's happening.
Last edit: Scott Wilson 2015-08-18
OK, on the Indigo page, the issue is an unclosed DIV, and Chrome and HC seem to process that slightly differently, probably due to rule ordering. I'll take a look.
Right, it's something to do with the tag provider rules. If you do:
Then the XPath expression works as before. So there's a problem somewhere in the Html5TagProvider class.
OK, once again it's caused by Chrome taking the easy way out and letting you include a DIV within a UL, while in HC the initial DIV token is moved, causing the chain of consequences that ends up with the output looking even more broken as a result :(
I think we may be better off doing two-pass cleaning; first a relatively tolerant token-based pass to build the tree, then a set of tree operations to move tag structures out of invalid positions. For now maybe I'll just get rid of content restrictions on list tags.
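To make the two-pass idea concrete, here is a rough sketch of what the second pass might look like for list content only. It is not HtmlCleaner code: Node is a minimal stand-in for a parsed element, and the repair policy (splice a wrapper's children into the list, otherwise wrap stray content in an LI) is just one assumed option. It only handles a single level of wrapper.

```java
import java.util.ArrayList;
import java.util.List;

class TwoPassSketch {
    // Minimal stand-in for a parsed element; not HtmlCleaner's TagNode.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node k : kids) children.add(k);
        }

        @Override public String toString() {
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (Node c : children) sb.append(c);
            return sb.append("</").append(name).append(">").toString();
        }
    }

    static boolean isList(Node n) {
        return n.name.equals("ul") || n.name.equals("ol");
    }

    static boolean hasLiChild(Node n) {
        for (Node c : n.children) if (c.name.equals("li")) return true;
        return false;
    }

    // Returns the node itself if it is an li, otherwise a new li wrapping it.
    static Node asListItem(Node n) {
        return n.name.equals("li") ? n : new Node("li", n);
    }

    // Second pass: runs on a tree that a tolerant first pass has already built,
    // and repairs the direct children of every ul/ol.
    static void fixLists(Node node) {
        if (isList(node)) {
            List<Node> repaired = new ArrayList<>();
            for (Node child : node.children) {
                if (!child.name.equals("li") && hasLiChild(child)) {
                    // Wrapper around list items: splice its children into the list.
                    for (Node grandchild : child.children) repaired.add(asListItem(grandchild));
                } else {
                    // Either already an li, or stray content that gets wrapped in one.
                    repaired.add(asListItem(child));
                }
            }
            node.children.clear();
            node.children.addAll(repaired);
        }
        for (Node child : node.children) fixLists(child);
    }

    public static void main(String[] args) {
        // Tree as a tolerant first pass might leave it: <ul><div><li></li></div></ul>
        Node ul = new Node("ul", new Node("div", new Node("li")));
        fixLists(ul);
        System.out.println(ul); // prints <ul><li></li></ul>
    }
}
```

Because the repair works on a finished tree, moving one node does not change how later tokens are parsed, which is the chain of consequences described above.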
Do you think there may be situations where different users would want different approaches to cleaning? At the moment I am steered by a desire for backward consistency that also often happens to be consistent with Chrome and some other browsers, but many might have a requirement for strict HTML conformance.
So, another idea might be to have selectable cleaning profiles. You could have e.g. CHROME_PROFILE/RELAXED_PROFILE, STRICT_PROFILE/HTML_SPEC_PROFILE. These would consist of a few boolean values and settings that are used during the cleaning process.
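As a rough illustration of what I mean, something like the sketch below. The profile names are the ones suggested above; the single flag and the mayStay helper are invented for the example and are not existing HtmlCleaner settings.

```java
class CleaningProfileSketch {
    // Hypothetical profiles. The flag is invented for illustration only.
    enum CleaningProfile {
        RELAXED_PROFILE(true),
        STRICT_PROFILE(false);

        // Whether elements other than li may stay directly inside ul/ol.
        final boolean allowNonListItemsInLists;

        CleaningProfile(boolean allowNonListItemsInLists) {
            this.allowNonListItemsInLists = allowNonListItemsInLists;
        }
    }

    // Decides whether a child tag may stay directly under its parent tag
    // for the given profile. Only the list rule is modelled here.
    static boolean mayStay(CleaningProfile profile, String parent, String child) {
        boolean parentIsList = parent.equals("ul") || parent.equals("ol");
        if (parentIsList && !child.equals("li")) {
            return profile.allowNonListItemsInLists;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(mayStay(CleaningProfile.RELAXED_PROFILE, "ul", "div")); // prints true
        System.out.println(mayStay(CleaningProfile.STRICT_PROFILE, "ul", "div"));  // prints false
    }
}
```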
You can actually do this to some extent already by using the TagProvider interface. HC ships with two different tag providers - HTML4 and HTML5. If you specify "htmlversion=4|5" on the command line it will switch to the one you specify. Within Java you can override the TagProvider with your own custom implementation - e.g. "Html5ChromeRelaxedTagProvider" - and then set it to be the one used in the cleaner properties. However, I like the idea of simply having "strict" and "lax" versions as well as this deeper level of customisation.
If I set htmlversion=4 do you think the tidied pages will be more similar to the older version of HC we currently use (2.10) and also Chrome?
Well, YMMV. It's certainly less strict about lists and tables, but it also misses out the semantics of new elements such as ruby, nav etc. But it could be a good workaround until I make a new release with the strict/lax option.
Hi,
Just wondering if you are still considering implementing a strict/lax option.
Cheers
Martin
Definitely still an option, though for the current release I've made the default HTML5 profile "lax".