  • Seanster

    Seanster - 2015-10-16

    Wow, that's some really bad markup; no wonder you want to clean it. Duplicated, overlapping head and body tags. How does that even happen? Perhaps have a look at bug #155 and see if ignoring namespaces helps.

    I'd also try the obvious: getting rid of the extraneous head and body tags before cleaning.
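
    A rough sketch of that pre-cleaning step in plain Java, using regular expressions only. The class and method names (`ExtraTagStripper`, `keepOnly`, `stripExtraHeadAndBody`) are made up for illustration and are not part of HtmlCleaner. It assumes "extraneous" means everything except the first opening head/body tag, the first closing head tag, and the last closing body tag; real pages may need a different rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtraTagStripper {

    private static final int CI = Pattern.CASE_INSENSITIVE;

    /** Removes every match of the pattern except one: the last if keepLast, else the first. */
    static String keepOnly(String html, Pattern pattern, boolean keepLast) {
        Matcher m = pattern.matcher(html);
        int total = 0;
        while (m.find()) {
            total++;
        }
        int keepIndex = keepLast ? total - 1 : 0;

        m.reset();
        StringBuilder out = new StringBuilder();
        int last = 0;
        int i = 0;
        while (m.find()) {
            out.append(html, last, m.start());   // text before this tag
            if (i == keepIndex) {
                out.append(m.group());           // the one tag we keep
            }
            last = m.end();
            i++;
        }
        out.append(html, last, html.length());   // text after the final tag
        return out.toString();
    }

    /** Assumed rule: first opening head/body, first closing head, last closing body survive. */
    static String stripExtraHeadAndBody(String html) {
        String s = html;
        s = keepOnly(s, Pattern.compile("<head(\\s[^>]*)?>", CI), false);
        s = keepOnly(s, Pattern.compile("</head\\s*>", CI), false);
        s = keepOnly(s, Pattern.compile("<body(\\s[^>]*)?>", CI), false);
        s = keepOnly(s, Pattern.compile("</body\\s*>", CI), true);
        return s;
    }
}
```

    The result of `ExtraTagStripper.stripExtraHeadAndBody(html)` would then be passed to the cleaner instead of the raw page. The patterns deliberately do not match tags such as `<header>`. Note that attributes on the dropped duplicate tags are lost.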

  • Vaughn Dickson

    Vaughn Dickson - 2016-07-04

    Hmmm. Didn't get any notification on this topic. Thanks for the reply. We're crawling the web, so I don't have control over most of this html.

    Just ran into an OOME on "" too.

    The nodelist in HTMLCleaner just keeps growing. Seems to keep spinning in the tag, and I can't use pruneTags because it doesn't get past the parsing stage. Will check out that bug report and try ignoring namespaces.

  • Vaughn Dickson

    Vaughn Dickson - 2016-07-04

    I saw that interrupts are handled in HTMLCleaner now, so I just wrapped my code in an executor that times out after 5s to stop the nodelist growing uncontrollably.

    import java.util.concurrent.*
    import org.htmlcleaner.TagNode
    import org.w3c.dom.Document

    // Groovy-style syntax, as in the original post. Assumes the enclosing class
    // defines cleaner, domSerializer, MAX_PARSE_TIME and log.
    class HTMLParseTask implements Callable<Document> {
        String html
        HTMLParseTask(String html) {
            this.html = html
        }
        Document call() throws Exception {
            TagNode tagNode = cleaner.clean(html)
            return domSerializer.createDOM(tagNode)
        }
    }

    public Document clean(String html) {
        if (html == null) return null
        // limit the html cleaning to 5s, to avoid any bad html causing infinite loops
        ExecutorService executor = Executors.newSingleThreadExecutor()
        Future<Document> future = executor.submit(new HTMLParseTask(html))
        Document result = null
        try {
            result = future.get(MAX_PARSE_TIME, TimeUnit.SECONDS)
        } catch (TimeoutException ex) {
            log.error("!!!!!!!!!!!!! Error parsing HTML. Timed out after " + MAX_PARSE_TIME + " seconds")
            future.cancel(true) // interrupt the worker so the parser actually stops
        } finally {
            executor.shutdownNow() // release the worker thread
        }
        return result
    }
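
    For anyone who wants to try the pattern without HtmlCleaner on the classpath, here is a minimal stand-alone Java sketch of the same idea. `TimeLimited`, `callWithTimeout` and `runawayTask` are hypothetical names, and `runawayTask` is only a stand-in for a parser stuck in a loop. The point it illustrates: the timeout only frees the worker thread if the task reacts to the interrupt sent by `future.cancel(true)`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeLimited {

    /** Runs task on a worker thread; returns fallback if it fails or does not finish in time. */
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true);                 // sends an interrupt to the worker thread
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the task itself threw an exception
        } finally {
            executor.shutdownNow();              // never leave the worker thread behind
        }
    }

    /** Stand-in for a looping parser; it stops only because sleep() reacts to interrupts. */
    static String runawayTask() throws InterruptedException {
        while (true) {
            Thread.sleep(10);
        }
    }
}
```

    Calling `TimeLimited.callWithTimeout(TimeLimited::runawayTask, 200, TimeUnit.MILLISECONDS, "timed out")` gives back the fallback value once the limit is hit. A task that never checks for interrupts would keep running in the background even after the caller has moved on.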
  • Scott Wilson

    Scott Wilson - 2016-07-04

    Highly ironic we get into an infinite loop on a page about recursion :)

    Thanks for sharing your executor code - I'll use that for my next scraper!


