HtmlCleaner / Bugs / #146 2.13 does not correct table structure

Scott Wilson - 2015-08-06

Yes, this is a side-effect with resolving the previous crtiical bug with infinite loops in required parent tags.

The possible resolutions are:

Downgrade support for other required parent tags (i.e. revert to earlier functionality, but omit full support for many HTML5 tags)

Rewrite the engine to better handle all of these cases without going into a loop
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-10

By 'revert to earlier functionality' how much earlier do you mean. We are currently using 2.10 with a couple of earlier fixes from 2.11 (built from sourcefourge on 26th Feb) and it corrects table structure errors like this very well.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-10

Prior to 2.13 is fine; in 2.13 I modified parse order; see the Release Note here: http://htmlcleaner.sourceforge.net/release.php

Basically, in 2.13 the order of actions was changed as before this could lead to some odd situations where there was an infinite loop. Its hardly an ideal fix, but an OOME is more critical than poorer table support. Its definitely a problem though.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-10

Martin - I've had another read of the HTML5 spec and I think this is a special case relating to unexpected markup when in table context - I've had a go at making a change to the rules in HC that should help - hopefully without other side-effects (well, all the tests pass at least).

Can you do a build from the current trunk and try that out?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-12

Hi Scott, I have built revision 418 from sourceforge and it successfully corrects the table problem in this thread, but does not correct any other issues e.g. ul structure. I have not tested this build on other pages yet but should be able to do that by tomorrow.

Martin

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-13

I ran a full test last night and unfortunately received OOM errors.

The hprof has a lot of htmlcleaner.TagNode objects.
Here is the stack trace of one thread running when an OOM occurred. I don't know if this is the one which caused the error:

"asyncExecutor-51" prio=5 tid=139 RUNNABLE
at java.lang.OutOfMemoryError.<init>(OutOfMemoryError.java:48)
at org.htmlcleaner.TagNode.<init>(TagNode.java:58)
Local Variable: org.htmlcleaner.TagNode#1506318
at org.htmlcleaner.TagNode.<init>(TagNode.java:109)
at org.htmlcleaner.HtmlCleaner.newTagNode(HtmlCleaner.java:650)
at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:977)
Local Variable: java.util.ArrayList$ListItr#25
Local Variable: org.htmlcleaner.TagNode#1537585
Local Variable: java.util.ArrayList#1546657
at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552)
Local Variable: java.lang.String#198058
at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449)
Local Variable: org.htmlcleaner.HtmlCleaner#7
Local Variable: org.htmlcleaner.HtmlTokenizer#7
Local Variable: org.htmlcleaner.CleanTimeValues#7
at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)</init></init></init>

Here is another I noticed: java.lang.OutOfMemoryError: Java heap space at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256) at java.util.HashMap.putVal(HashMap.java:630) at java.util.HashMap.putMapEntries(HashMap.java:514) at java.util.HashMap.putAll(HashMap.java:784) at org.htmlcleaner.TagNode.replaceAttributes(TagNode.java:230) at org.htmlcleaner.TagNode.setForeignMarkup(TagNode.java:835) at org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:868) at org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103) at org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552) at org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486) at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449) at org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)

I have an hdump if required.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-13

Can you post the HTML that triggers the OOME?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-13

Not immediately because our code does not catch Errors, only Exceptions, and we have a lot of threads running simultaneously so logs don't help much but I should be able to add an OutOfMemoryError catch statement and run the test again if necessary - it will take a while.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-13

I have found a few urls that seemed to cause OOM but not all of the time. I tested the following url and it caused an OOM, but after a restart it seemed to process normally, then on the third attempt caused another OOM:
http://www.medievalpottery.org.uk/contact.htm
java.lang.OutOfMemoryError: GC overhead limit exceeded
org.htmlcleaner.TagNode.attributesToLowerCase(TagNode.java:859)
org.htmlcleaner.TagNode.getAttributesInLowerCase(TagNode.java:161)
org.htmlcleaner.TagNode.setForeignMarkup(TagNode.java:835)
org.htmlcleaner.HtmlCleaner.makeTree(HtmlCleaner.java:868)
org.htmlcleaner.HtmlTokenizer.addToken(HtmlTokenizer.java:103)
org.htmlcleaner.HtmlTokenizer.tagStart(HtmlTokenizer.java:552)
org.htmlcleaner.HtmlTokenizer.start(HtmlTokenizer.java:486)
org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:449)
org.htmlcleaner.HtmlCleaner.clean(HtmlCleaner.java:359)
org.htmlcleaner.Serializer.getAsString(Serializer.java:214)

This is when using revision 418 of HtmlCleaner from Sourceforge.

Other urls that sometimes seem to cause OOM are:
http://www.wellbeingofwomen.org.uk/research/research-grants/?menu=1c
http://www.icr-global.org/contact-us/
http://www.ics.ac.uk/icf/research-and-achievements/grants-and-research-awards/gold-medal-award/
http://www.marinemammals.gov.au/grants

Last edit: Martin Denham 2015-08-13

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-13

Thanks Martin - I appreciate your help and patience with this.

I've made a few tweaks to the HTML5 processor rules, and also added extra checks to the parser. I don't get any OOMs with those URLs now.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-14

I have done some basic testing on revision 420 and the OOM has disappeared. I should be able to report back on more thorough tests next week.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-14

Great! If all continues to be well I'll make a new release with the changes.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-18

Apologies, for feeding back piecemeal but it would be difficult for me to analyze all potential problem pages at once, so I hope that by passing back possible issues as I notice them it fixes a lot of potential problem pages I haven't yet analyzed.

There seems to be a possible issue with structures like <ul><div><li>..</li></div></ul> as seen on this page: http://eacea.ec.europa.eu/erasmus-plus/funding_en
Looking at the xpath for "Selection results - Knowledge Alliances 2015 EAC/A04/2014" (top right). The latest HtmlCleaner removes the div between ul and li but Chrome and an older HtmlCleaner leave it there.

A different inconsistency which I can't understand occurs on this page: http://indigoprojects.eu/funding/indigo-calls
The xpath for the central text block below the title has changed.
Old HtmlCleaner and Chrome is:

//*[@id="content"]/div/div[1]/div[2]/section[1]/div/article/div

Latest HtmlCleaner moves one of the divs but I can't see why resulting in an xpath of:

//*[@id="content"]/div[1]/div[3]/section[1]/div[1]/article[1]/div[1]

So /div[1]/div[2] has become div[3]

I am making the assumption that if old HtmlCleaner and Chrome view pages similarly then they are correct.

Last edit: Martin Denham 2015-08-18

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-18

On the EAC example: the HTML 4/5 processing rules are crystal clear on this one - you can only have LI elements as the direct children of a UL. So HC is correct to move the DIV outside of the UL tag. The only other option is to wrap the DIV inside an LI, but HC currently doesn't have a rule setting to do that. I guess Chrome is just being lenient in that case.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-18

On th esecond case I don't see a div moved here - the structure looks the same in both Chrome and after using HC 2.14-SNAPSHOT), and the first XPath evaluates to the article DIV in each case.

CORRECTION my mistake I was using the wrong file. OK, I've replicated your result. Now to discover what's happening

Last edit: Scott Wilson 2015-08-18

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-18

OK, on the Indigo page, the issue is an unclosed DIV, and Chrome and HC seem to process that slightly differently, probably due to rule ordering. I'll take a look.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-18

Right, its something to do with the tag provider rules. If you do:

curl http://indigoprojects.eu/funding/indigo-calls | java -jar htmlcleaner-2.14-SNAPSHOT.jar htmlversion=4 dest=output.html

Then the XPath expression works as before. So there's a problem somewhere in the Html5TagProvider class.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-18

OK, once again its caused by Chrome taking the easy way out and letting you include a DIV within a UL, while in HC the initial DIV token is moved, causing the chain of consequences that ends up with the output looking even more broken as a result :(

I think we may be better off doing two-pass cleaning; first a relatively tolerant token-based pass to build the tree, then a set of tree operations to move tag structures out of invalid positions. For now maybe I'll just get rid of content restrictions on list tags.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-19

Do you think there may be situations where different users would want different appproaches to cleaning? At the moment I am steered by a desire for backward consistency that also often happens to be consistent with Chrome and some other browsers, but many might have a requirement for strict html conformance.

So, another idea might be to have selectable cleaning profiles. You could have e.g. CHROME_PROFILE/RELAXED_PROFILE, STRICT_PROFILE/HTML_SPEC_PROFILE. These would consist of a few boolean values and settings that are used during the cleaning process.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-19

You can actually do this to some extent already by using the TagProvider interface. HC ships with two different tag providers - HTML4 and HTML5. If you specify "htmlversion=4|5" on the command line it will switch to the one you specify. Within Java you can override the TagProvider with your own custom implementation - e.g. "Html5ChromeRelaxedTagProvider" - and then set it to be the one used in the cleaner prooperties. However I like the idea of simply having "strict" and "lax" versions as well as this deeper level of customisation.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-08-19

If I set htmlversion=4 do you think the tidied pages will be more similar to the older version of HC we currently use (2.10) and also Chrome?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-19

Well, YMMV. Its certainly less strict about lists and tables, but it also misses out the semantics of new elements such as ruby, nav etc. But it could be a good workaround until I make a new release with the strict/lax option

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-08-24

status: open --> closed-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Martin Denham - 2015-09-11

Hi,
Just wondering if you are still considering implementing a strict/lax option.
Cheers
Martin

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Scott Wilson - 2015-09-11

Definitely still an option, though for the current release I've made the default HTML5 profile "lax".

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

2.13 does not correct table structure

Group

Searches

Help

#146 2.13 does not correct table structure

Discussion