From: Ahmed A. <asa...@ya...> - 2015-10-22 08:21:09
Hi Wayne,

This happens because the regular expressions take too long on large text (500 KB in your case). This has been fixed in SVN. Please use the latest snapshot, or build [1] once it is successful.

Thanks,
Ahmed

From: Wayne Xin <way...@ho...>
To: htm...@li...
Sent: Thursday, October 22, 2015 2:21 AM
Subject: [Htmlunit-user] htmlunit 2.18 OSGi hangs on getting body text

Hi,

My simple program hangs while retrieving the body text of this web site: http://yorkshire-digital.com/

I don’t see anything special about this web site, but my code hangs while retrieving the body text. I don’t think it will be difficult for anybody to reproduce the problem. The problem must be in their HTML; however, I would like to see if there is a workaround:

    java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
    java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setTimeout(timeout*1000);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setPopupBlockerEnabled(true);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setRedirectEnabled(true);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setUseInsecureSSL(true);
    webClient.getOptions().setPrintContentOnFailingStatusCode(false);
    webClient.setJavaScriptTimeout(5000);
    webClient.setRefreshHandler(new HtmlunitWaitingRefreshHandler(timeout));
    webClient.waitForBackgroundJavaScript(5000);
    …
    HtmlPage page = webClient.getPage(urlWithProto);
    System.out.println("body text:" + page.getBody().asText());

The code hangs with the following stack:

    at java.util.regex.Pattern$Slice.match(Pattern.java:3867)
    at java.util.regex.Pattern$Curly.match0(Pattern.java:4170)
    at java.util.regex.Pattern$Curly.match(Pattern.java:4132)
    at java.util.regex.Pattern$Start.match(Pattern.java:3408)
    at java.util.regex.Matcher.search(Matcher.java:1199)
    at java.util.regex.Matcher.find(Matcher.java:592)
    at java.util.regex.Matcher.replaceAll(Matcher.java:907)
    at com.gargoylesoftware.htmlunit.html.HtmlSerializer.reduceWhitespace(HtmlSerializer.java:84)
    at com.gargoylesoftware.htmlunit.html.HtmlSerializer.cleanUp(HtmlSerializer.java:70)
    at com.gargoylesoftware.htmlunit.html.HtmlSerializer.asText(HtmlSerializer.java:64)
    at com.gargoylesoftware.htmlunit.html.DomNode.asText(DomNode.java:770)
    at com.pan.utils.SingleCrawler.pagedumpWithHtmlunit(SingleCrawler.java:503)

I don’t own the page, and it may change later. The content looks like (captured by browser) the following. I suspect it’s a bug in HtmlSerializer.

Any help would be appreciated. Thanks.

-Wayne
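
For readers who hit the same stack trace: the hang is inside a regex-based whitespace reduction running over a very large string. The sketch below is only an illustration of why a single pass without regular expressions stays fast on input of that size. It is not the fix that went into HtmlUnit's SVN, and the class name WhitespaceCollapser and its collapse method are hypothetical names made up for this example. It also assumes that "whitespace" means whatever Character.isWhitespace accepts, which may differ from the rules HtmlSerializer applies.

```java
import java.util.Arrays;

// Hypothetical helper, not part of HtmlUnit.
final class WhitespaceCollapser {
    private WhitespaceCollapser() {}

    /**
     * Collapses every run of whitespace into one space and drops leading and
     * trailing whitespace. Each character is visited once, so the cost grows
     * linearly with the input length and no regex backtracking can occur.
     */
    static String collapse(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Remember a space only if output already exists (trims the start).
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            }
        }
        // A pending space at the end is never written (trims the end).
        return out.toString();
    }

    public static void main(String[] args) {
        // Roughly the size mentioned in the thread: 500,000 blanks between two words.
        char[] blanks = new char[500_000];
        Arrays.fill(blanks, ' ');
        String big = "start" + new String(blanks) + "end";

        System.out.println(collapse(big));                   // prints "start end"
        System.out.println(collapse("  body \n\t text  "));  // prints "body text"
    }
}
```

With a build that still has the problem, such a helper could only be applied to text obtained some other way than asText(), since the hang occurs inside asText() itself; upgrading to the snapshot Ahmed mentions remains the direct route.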