Re: [Htmlunit-user] htmlunit 2.18 OSGi hangs on getting body text


Hi Wayne,
This happens because the regular expressions take too long on large text (500 KB in your case).
This has been fixed in SVN.
Please use the latest snapshot or build [1] once it succeeds.
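Until a fixed build is available, one possible workaround is to skip asText() and collapse whitespace in a single linear pass, which avoids the regex-based whitespace reduction that appears in the stack trace. This is only a minimal sketch, not the change that went into SVN. The class name WhitespaceCollapser and its collapse method are hypothetical, and the sketch assumes the raw text is obtained some other way, for example from getTextContent() on the body, whose output differs from asText() because it does not apply the same rendering rules.

```java
public final class WhitespaceCollapser {

    private WhitespaceCollapser() {
        // Utility class (hypothetical name), not meant to be instantiated.
    }

    /**
     * Collapses every run of whitespace (including non-breaking spaces)
     * into a single space and drops leading and trailing whitespace.
     * Runs in one pass over the input, with no regular expressions.
     */
    public static String collapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || c == '\u00A0') {
                // Remember the gap, but only if something was already emitted,
                // so that leading whitespace is dropped.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        // A trailing gap is never flushed, so trailing whitespace is dropped.
        return out.toString();
    }
}
```

With HtmlUnit this might be called as WhitespaceCollapser.collapse(aPage.getBody().getTextContent()), assuming the less polished text from getTextContent() is acceptable for your use.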
Thanks,
Ahmed

From: Wayne Xin <way...@ho...>
To: htm...@li...
Sent: Thursday, October 22, 2015 2:21 AM
Subject: [Htmlunit-user] htmlunit 2.18 OSGi hangs on getting body text

Hi,
My simple program hangs while retrieving the body text of this web site: http://yorkshire-digital.com/
I don’t see anything special about this web site, but my code hangs while retrieving the body text. I don’t think it will be difficult for anybody to reproduce the problem. The problem must be in their HTML; however, I would like to know whether there is a workaround:
    // Silence HtmlUnit and HttpClient logging.
    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);

    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setTimeout(timeout * 1000);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setPopupBlockerEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setRedirectEnabled(true);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);
    webClient.setJavaScriptTimeout(5000);
    webClient.setRefreshHandler(new HtmlunitWaitingRefreshHandler(timeout));
    webClient.waitForBackgroundJavaScript(5000);
    …
    HtmlPage aPage = webClient.getPage(urlWithProto);
    // The hang occurs inside asText().
    System.out.println("body text:" + aPage.getBody().asText());
The code hangs at the following stack:
        at java.util.regex.Pattern$Slice.match(Pattern.java:3867)
        at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
        at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
        at java.util.regex.Pattern$Start.match(Pattern.java:3408)
        at java.util.regex.Matcher.search(Matcher.java:1199)
        at java.util.regex.Matcher.find(Matcher.java:592)
        at java.util.regex.Matcher.replaceAll(Matcher.java:907)
        at com.gargoylesoftware.htmlunit.html.HtmlSerializer.reduceWhitespace(HtmlSerializer.java:84)
        at com.gargoylesoftware.htmlunit.html.HtmlSerializer.cleanUp(HtmlSerializer.java:70)
        at com.gargoylesoftware.htmlunit.html.HtmlSerializer.asText(HtmlSerializer.java:64)
        at com.gargoylesoftware.htmlunit.html.DomNode.asText(DomNode.java:770)
        at com.pan.utils.SingleCrawler.pagedumpWithHtmlunit(SingleCrawler.java:503)
I don’t own the page, so it may change later. The content, as captured by a browser, looks like the following. I suspect it’s a bug in HtmlSerializer. Any help would be appreciated.
Thanks.
-Wayne  
