HtmlCleaner / Discussion / Help: Infinite loop on page

Vaughn Dickson - 2015-10-16

Hi,

I'm getting an infinite loop and hitting OutOfMemoryErrors while parsing: http://info.math4all.nl/website/view2.php?page=geschiedenis/wiskundigen/poisson

It's in HtmlTokenizer.start() somewhere...

Thanks,
Vaughn

Seanster - 2015-10-16

Wow that's some really bad stuff, no wonder you want to clean it. Duplicated overlapping head and body tags. How does that even happen? Perhaps have a look at bug #155 and see if ignoring namespaces helps.

http://sourceforge.net/p/htmlcleaner/bugs/155/

I'd also try the obvious.. getting rid of the extraneous head and body tags before cleaning.
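As a rough illustration of that pre-cleaning step, here is a minimal sketch that keeps only the first opening and first closing `head`/`body` tag and drops any later repeats before the string is handed to the cleaner. The class and method names are made up for this example, it uses a plain regex rather than anything from HtmlCleaner, and whether it actually avoids the loop on this particular page is untested.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner.
class DuplicateTagStripper {
    // Matches opening or closing <head>/<body> tags, with optional attributes.
    // The lookahead stops it from matching tags such as <header>.
    // Assumption: attribute values do not contain a literal '>'.
    private static final Pattern TAG = Pattern.compile(
            "<(/?)(head|body)(?=[\\s/>])[^>]*>", Pattern.CASE_INSENSITIVE);

    // Keeps the first <head>, </head>, <body> and </body>; removes later repeats.
    static String stripRepeatedTags(String html) {
        Set<String> seen = new HashSet<>();
        StringBuilder out = new StringBuilder(html.length());
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous tag and this one unchanged.
            out.append(html, last, m.start());
            // Key is e.g. "head", "/head", "body" or "/body".
            String key = m.group(1) + m.group(2).toLowerCase();
            if (seen.add(key)) {
                out.append(m.group()); // first occurrence: keep it
            }
            last = m.end();
        }
        out.append(html, last, html.length());
        return out.toString();
    }
}
```

The result of `DuplicateTagStripper.stripRepeatedTags(html)` would then be passed to `cleaner.clean(...)` in place of the raw page. Content that follows the first `</body>` is left where it is, on the assumption that a lenient parser copes with that better than with nested `body` tags.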

Vaughn Dickson - 2016-07-04

Hmmm. Didn't get any notification on this topic. Thanks for the reply. We're crawling the web, so I don't have control over most of this html.

Just ran into an OOME on "https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_1/Fundamentals_of_programming/Recursion" too.

The nodelist in HTMLCleaner just keeps growing. Seems to keep spinning in the $tag, and I can't use pruneTags because it doesn't get past the parsing stage. Will check out that bug report and try ignoring namespaces.

I saw that interrupts are handled in HTMLCleaner now, so I just wrapped my code in an executor that times out after 5s, to stop the nodelist growing uncontrollably.

    import java.util.concurrent.*
    import org.htmlcleaner.TagNode
    import org.w3c.dom.Document

    // Inner class: cleaner and domSerializer come from the enclosing class (not shown)
    class HTMLParseTask implements Callable<Document> {
        String html

        HTMLParseTask(String html) {
            this.html = html
        }

        @Override
        Document call() throws Exception {
            TagNode tagNode = cleaner.clean(html)
            return domSerializer.createDOM(tagNode)
        }
    }

    public Document clean(String html) {
        if (html == null) return null
        // limit the html cleaning to MAX_PARSE_TIME seconds (5s), to avoid any bad html causing infinite loops
        ExecutorService executor = Executors.newSingleThreadExecutor()
        // the task returns a Document, so the future is a Future<Document>
        Future<Document> future = executor.submit(new HTMLParseTask(html))
        Document result = null
        try {
            result = future.get(MAX_PARSE_TIME, TimeUnit.SECONDS)
        } catch (TimeoutException ex) {
            // interrupt the parsing thread so it stops growing the nodelist
            future.cancel(true)
            log.error("!!!!!!!!!!!!! Error parsing HTML. Timed out after " + MAX_PARSE_TIME + " seconds")
        } finally {
            executor.shutdownNow()
        }
        return result
    }
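For anyone who wants the same pattern outside this class, the following is a minimal, library-independent Java sketch of the timeout idea above. The `TimeLimited` class and `callWithTimeout` method are names invented for this example. The daemon thread is an addition that the snippet above does not have: it means a task that ignores interruption cannot keep the JVM alive. As noted above, the approach only frees memory if the task actually reacts to the interrupt.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper illustrating the executor-with-timeout pattern.
class TimeLimited {
    // Runs the task on a separate thread; returns null if it does not
    // finish within the given time (or if the caller is interrupted).
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-task");
            t.setDaemon(true); // do not block JVM exit if the task never stops
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException ex) {
            future.cancel(true); // interrupts the worker thread
            return null;
        } catch (InterruptedException ex) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt flag
            return null;
        } catch (ExecutionException ex) {
            // The task itself threw; surface the original cause.
            throw new IllegalStateException("task failed", ex.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With this helper, the `clean` method above would reduce to something like `TimeLimited.callWithTimeout(new HTMLParseTask(html), MAX_PARSE_TIME, TimeUnit.SECONDS)`. Creating a new executor per call keeps the sketch simple; a crawler parsing many pages might prefer a shared pool.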

Scott Wilson - 2016-07-04

Highly ironic we get into an infinite loop on a page about recursion :)

Thanks for sharing your executor code - I'll use that for my next scraper!
