After upgrading from v2.22 to v2.24, we have unfortunately hit the following problem on a lot of websites. Version v2.23 is also affected.
From what I could understand by debugging, under certain conditions the cleaning produces an empty attribute name, and Utils.sanitizeXmlIdentifier() then throws a StringIndexOutOfBoundsException at line 627 when it calls attName.substring(0, 1).
However, I verified that on v2.22 the cleaning did not produce such an empty attribute name to begin with.
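To illustrate the failure mode only, here is a minimal, self-contained sketch in plain Java. It is not HtmlCleaner's code: the class EmptyAttributeNameDemo and both helper methods are hypothetical names made up for this example, and the guarded variant is just one possible defensive check, not a proposed patch.

```java
public class EmptyAttributeNameDemo {

    // Hypothetical stand-in for the failing step: takes the first
    // character of the name via substring(0, 1), with no length check.
    static String firstChar(String attName) {
        return attName.substring(0, 1);
    }

    // Hypothetical defensive variant: returns null for a null or empty
    // name instead of calling substring on it.
    static String firstCharOrNull(String attName) {
        if (attName == null || attName.isEmpty()) {
            return null;
        }
        return attName.substring(0, 1);
    }

    public static void main(String[] args) {
        try {
            firstChar(""); // an empty name, as produced by the cleaning
        } catch (StringIndexOutOfBoundsException e) {
            // The exact message text depends on the JDK version.
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println(firstCharOrNull(""));      // guarded: no exception
        System.out.println(firstCharOrNull("style")); // normal, non-empty name
    }
}
```

The sketch only shows that any empty name reaching substring(0, 1) fails in this way. Why v2.23 and v2.24 produce an empty name where v2.22 did not is still the open question.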
We have noticed this on many pages, especially when there is a wrong namespace URL, as in
<html><body><svg xmlns="<http://www.w3.org/2000/svg>"><defs></defs></svg></body></html>
or a disallowed character in the tag name, as in <font-size:>
The following test can be used to replicate the issue.
@Test
public void testStringIndexOutOfBoundsException() {
    // http://cnt.kingrecords.co.jp/high-resolution/news
    final var props = new CleanerProperties();
    props.setNamespacesAware(false);
    final String html = "<html><body><span style=\"text-decoration: underline;\"><font-size: \"medium;\"></span></body></html>";
    final Exception exception = assertThrows(StringIndexOutOfBoundsException.class, () -> {
        final TagNode tagNode = new HtmlCleaner(props).clean(html);
        final Document doc = new DomSerializer(props, false, false, false).createDOM(tagNode);
    });
    assertEquals("begin 0, end 1, length 0", exception.getMessage());
}
Other web pages where this happens are:
{"url": "http://aboutlincolncenter.org/contact-us/online"}
{"url": "http://bikereview.com.au/category/columns/"}
{"url": "http://cca.milfordct.com/WebForms/EvtListing.aspx?dbid2=ctmil&keyword=217054&class=E"}
{"url": "http://cnt.kingrecords.co.jp/high-resolution/news"}
{"url": "http://hobbyen.co.kr/news/newsList.php?&pagenum=1"}
{"url": "http://tech4tea.com/blog/category/tech-news/tickers/"}
{"url": "http://whitecenternow.com/categories/wildlife-2/"}
{"url": "http://www.artscape.co.za/sa-international-ballet-competition-postponed/"}
{"url": "http://www.athome.com/corporate-sustainability"}
{"url": "http://www.batesline.com/archives/cities/"}
It is worth noting that with
props.setOmitUnknownTags(true);
the issue does not happen.