Hi,
I'm getting an infinite loop and hitting OutOfMemoryErrors while parsing: http://info.math4all.nl/website/view2.php?page=geschiedenis/wiskundigen/poisson
It's in HtmlTokenizer.start() somewhere...
Thanks,
Vaughn
Wow that's some really bad stuff, no wonder you want to clean it. Duplicated overlapping head and body tags. How does that even happen? Perhaps have a look at bug #155 and see if ignoring namespaces helps.
http://sourceforge.net/p/htmlcleaner/bugs/155/
I'd also try the obvious: getting rid of the extraneous head and body tags before cleaning.
Hmmm. Didn't get any notification on this topic. Thanks for the reply. We're crawling the web, so I don't have control over most of this HTML.
Just ran into an OOME on "https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_1/Fundamentals_of_programming/Recursion" too.
The nodelist in HTMLCleaner just keeps growing. Seems to keep spinning in the
I saw that interrupts are handled in HTMLCleaner now, so I just wrapped my code in an executor that times out after 5s to stop the nodelist growing uncontrollably.
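For anyone wanting to try the same workaround, here is a minimal sketch of that kind of timeout wrapper using only java.util.concurrent. It is not the poster's actual code: the names TimedParse and runWithTimeout are made up for illustration, and the HtmlCleaner call shown in the comment is an assumption about how you would plug the parser in. It only helps if the parsing code responds to thread interrupts, as described above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HtmlCleaner.
final class TimedParse {

    /**
     * Runs the task on a worker thread and gives up after the timeout.
     * Returns the task's result, or null if the timeout expired.
     */
    static <T> T runWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException {
        // Daemon thread, so a task that ignores interrupts cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the worker; the task must check for
            // interrupts for this to actually stop it.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // In a crawler the task would be something like (assumed usage):
        //   () -> new HtmlCleaner().clean(html)
        String ok = runWithTimeout(() -> "parsed", 5, TimeUnit.SECONDS);
        System.out.println(ok);

        // Stand-in for a parse that never finishes but does honour interrupts.
        String stuck = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 200, TimeUnit.MILLISECONDS);
        System.out.println(stuck);
    }
}
```

One executor per call keeps the sketch simple; a crawler handling many pages would probably share a pool instead. Note that the timeout only bounds the time spent, so memory already allocated by a runaway parse is reclaimed only once the worker actually stops.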
Highly ironic we get into an infinite loop on a page about recursion :)
Thanks for sharing your executor code - I'll use that for my next scraper!