#225 StringIndexOutOfBoundsException while sanitizeXmlIdentifier

v2.30
open
nobody
None
9
2023-06-19
2021-07-22
legrass
No

After upgrading from v2.22 to v2.24 we have unfortunately hit the following problem on a lot of websites. Version v2.23 is also affected.

From what I could understand by debugging, under certain conditions the cleaning produces an empty attributeName, and Utils.sanitizeXmlIdentifier() then raises a StringIndexOutOfBoundsException at line 627 when it calls attName.substring(0,1).
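The failure mode itself can be reproduced without HtmlCleaner. The sketch below is not the library's code: `firstChar` and `firstCharGuarded` are hypothetical helpers that only mimic the `attName.substring(0,1)` call described above, to show why an empty name throws and how a length check might avoid it.

```java
// Hypothetical stand-in for the substring call in Utils.sanitizeXmlIdentifier().
// This is NOT the HtmlCleaner source, only an illustration of the failure.
final class EmptyNameSketch {

    // Mimics the unguarded call: throws StringIndexOutOfBoundsException for "".
    static String firstChar(String attName) {
        return attName.substring(0, 1);
    }

    // One possible guard (an assumption, not the project's actual fix):
    // signal "no usable name" with null instead of throwing.
    static String firstCharGuarded(String attName) {
        if (attName == null || attName.isEmpty()) {
            return null;
        }
        return attName.substring(0, 1);
    }

    public static void main(String[] args) {
        try {
            firstChar("");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
        System.out.println("guarded: " + firstCharGuarded(""));
    }
}
```

Whether the caller should then skip the attribute or substitute a placeholder name is a design decision for the library; the sketch only shows that the empty case has to be handled before the substring call.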

However, I could verify that on v2.22 the cleaning did not produce such an empty attribute name in the first place.

We have noticed this on many pages, especially when there is an invalid namespace URL, as in

<html><body><svg xmlns="<http://www.w3.org/2000/svg>"><defs></defs></svg></body></html>

or when the tag name contains a disallowed character, as in <font-size:>.
The following test can be used to replicate the issue.

  @Test
  public void testStringIndexOutOfBoundsException() {
    // Seen on http://cnt.kingrecords.co.jp/high-resolution/news
    final var props = new CleanerProperties();
    props.setNamespacesAware(false);

    // Note the malformed <font-size: "medium;"> tag inside the span
    final String html = "<html><body><span style=\"text-decoration: underline;\"><font-size: \"medium;\"></span></body></html>";
    final Exception exception = assertThrows(StringIndexOutOfBoundsException.class, () -> {
      final TagNode tagNode = new HtmlCleaner(props).clean(html);
      // The returned Document is not needed; the call is expected to throw
      new DomSerializer(props, false, false, false).createDOM(tagNode);
    });
    assertEquals("begin 0, end 1, length 0", exception.getMessage());
  }
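As a side note on why the empty name cannot simply be passed through: the JDK's own DOM implementation rejects an empty attribute name as well. The sketch below uses only standard `javax.xml` and `org.w3c.dom` calls; the class and the helper `domAcceptsAttributeName` are hypothetical names chosen for illustration, and the behavior assumes the default DOM implementation bundled with the JDK.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustration only; independent of HtmlCleaner.
final class EmptyAttributeDomSketch {

    // Returns true if the DOM implementation accepts the attribute name,
    // false if setAttribute() rejects it with a DOMException.
    static boolean domAcceptsAttributeName(String name) {
        final Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
        Element span = doc.createElement("span");
        try {
            span.setAttribute(name, "value");
            return true;
        } catch (DOMException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("style accepted: " + domAcceptsAttributeName("style"));
        System.out.println("empty accepted: " + domAcceptsAttributeName(""));
    }
}
```

So a fix would presumably need to drop or rename such attributes before they reach the DOM, rather than only avoiding the substring call.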

Other web pages where this happens are:

{"url": "http://aboutlincolncenter.org/contact-us/online"}
{"url": "http://bikereview.com.au/category/columns/"}
{"url": "http://cca.milfordct.com/WebForms/EvtListing.aspx?dbid2=ctmil&keyword=217054&class=E"}
{"url": "http://cnt.kingrecords.co.jp/high-resolution/news"}
{"url": "http://hobbyen.co.kr/news/newsList.php?&pagenum=1"}
{"url": "http://tech4tea.com/blog/category/tech-news/tickers/"}
{"url": "http://whitecenternow.com/categories/wildlife-2/"}
{"url": "http://www.artscape.co.za/sa-international-ballet-competition-postponed/"}
{"url": "http://www.athome.com/corporate-sustainability"}
{"url": "http://www.batesline.com/archives/cities/"}

Discussion

  • legrass

    legrass - 2021-07-22

    It's worth noting that the issue does not occur when props.setOmitUnknownTags(true) is set.

     
  • Scott Wilson

    Scott Wilson - 2023-04-29
    • Group: v2.24 --> v2.29
     
  • Scott Wilson

    Scott Wilson - 2023-06-19
    • Group: v2.29 --> v2.30
     
