Infinite loop on page

2015-10-16
2016-07-04
  • Seanster

    Seanster - 2015-10-16

    Wow, that's some really bad stuff, no wonder you want to clean it. Duplicated, overlapping head and body tags. How does that even happen? Perhaps have a look at bug #155 and see whether ignoring namespaces helps.

    http://sourceforge.net/p/htmlcleaner/bugs/155/

    I'd also try the obvious: getting rid of the extraneous head and body tags before cleaning.
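
    A rough sketch of that pre-cleaning step, in case it helps. It uses only the JDK, not HtmlCleaner, and the class and method names (ExtraTagStripper, stripExtraTags) are made up for illustration. It keeps the first opening and the first closing head/body tag and drops any repeats. A regex is a blunt tool for HTML, so treat this as a starting point rather than a robust fix:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlCleaner.
final class ExtraTagStripper {
    // Opening or closing <head>/<body> tags, with optional attributes.
    // The \b keeps tags such as <header> from matching.
    private static final Pattern TAG =
            Pattern.compile("<(/?)(head|body)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Keeps the first occurrence of each of <head>, </head>, <body>, </body>; removes repeats. */
    static String stripExtraTags(String html) {
        Set<String> seen = new HashSet<>();
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Key is e.g. "head", "/head", "body" or "/body".
            String key = m.group(1) + m.group(2).toLowerCase(Locale.ROOT);
            // Set.add returns false when this tag kind was already seen.
            String replacement = seen.add(key) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

    You would run the raw page through stripExtraTags before handing it to the cleaner. Only the tags themselves are removed; whatever was between the duplicate tags stays in place.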

     
  • Vaughn Dickson

    Vaughn Dickson - 2016-07-04

    Hmmm. I didn't get any notification on this topic. Thanks for the reply. We're crawling the web, so I don't have control over most of this HTML.

    I just ran into an OutOfMemoryError on "https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_1/Fundamentals_of_programming/Recursion" too.

    The node list in HtmlCleaner just keeps growing. It seems to keep spinning in the tag, and I can't use pruneTags because it never gets past the parsing stage. I'll check out that bug report and try ignoring namespaces.

     
  • Vaughn Dickson

    Vaughn Dickson - 2016-07-04

    I saw that interrupts are handled in HtmlCleaner now, so I just wrapped my code in an executor that times out after 5s, to stop the node list growing uncontrollably.

    import java.util.concurrent.*
    import org.htmlcleaner.TagNode
    import org.w3c.dom.Document

    // Groovy-style snippet. cleaner, domSerializer, log and MAX_PARSE_TIME (5, in seconds)
    // are assumed to be members of the enclosing class.
    class HTMLParseTask implements Callable<Document> {
        String html

        HTMLParseTask(String html) {
            this.html = html
        }

        @Override
        Document call() throws Exception {
            TagNode tagNode = cleaner.clean(html)
            return domSerializer.createDOM(tagNode)
        }
    }

    public Document clean(String html) {
        if (html == null) return null
        // Limit HTML cleaning to MAX_PARSE_TIME seconds so bad HTML can't cause an infinite loop.
        ExecutorService executor = Executors.newSingleThreadExecutor()
        Future<Document> future = executor.submit(new HTMLParseTask(html))
        Document result = null
        try {
            result = future.get(MAX_PARSE_TIME, TimeUnit.SECONDS)
        } catch (TimeoutException ex) {
            // Interrupt the worker thread so the parse actually stops.
            future.cancel(true)
            log.error("!!!!!!!!!!!!! Error parsing HTML. Timed out after " + MAX_PARSE_TIME + " seconds")
        } finally {
            executor.shutdownNow()
        }
        // Null when the input was null or the parse timed out.
        return result
    }
    
     
  • Scott Wilson

    Scott Wilson - 2016-07-04

    Highly ironic that we get into an infinite loop on a page about recursion :)

    Thanks for sharing your executor code - I'll use that for my next scraper!

     
