Following this discussion : http://markmail.org/message/zxojka3zwxnylqov
I am submitting a patch that migrates HTMLUnit to HttpClient 4.0. The patch currently is only for the main source code. The patch for all the jUnit tests is almost complete, but I will wait for your comments before I submit it.
A few remarks :
- My experience with HTMLUnit is still a bit limited
- HttpWebConnection and WebClient contain the core changes
- WebClient now encapsulate an HttpClient instance. Seems a more logically fit than HttpWebConnection for HttpClient 4.0.
Well, share your comments !
Regards,
Nicolas
Thanks for the patch! I'll have a look over the weekend, but in the meantime, can you make sure all of the code uses spaces for indentation? The tabs probably won't pass our formatting checks (mvn checkstyle:checkstyle) and they make the diff larger than necessary.
Also, while I have a look through the code, can you elaborate on your comment that WebClient is a better place for the HttpClient instance? From a design perspective, the various WebConnection implementations are supposed to abstract away the connection details (and make things like the MockWebConnection possible).
In HttpClient 4, http method objects are merely messages. They do not hold the response body and connection status. The HttpResponse object now encapsulate these things.
A quick comparison:
In HttpClient 3 :
GetMethod get = new GetMethod(url);
httpclient.executeMethod(method);
byte[] responseBody = method.getResponseBody();
HttpClient 4
HttpGet httpget = new HttpGet("http://localhost/");
HttpResponse response = httpclient.execute(httpget);
HttpEntity entity = response.getEntity();
InputStream content = entity.getContent();
Also, a lot of configuration options, like cookies, are now managed in httpClient.
So, in my humble opinion, HttpClient and WebClient are now closer in their purpose and WebClient should encapsulate an instance of HttpClient.
(The execution of methods still happends in HttpWebConnection.)
@nicolasbelisle
the comparison doesn't help me to understand why it would be valuable to have an HttpClient instance in WebClient :-(
Do you have other arguments?
In WebClient (from the patch), the HttpClient instance is used in mostly two places :
getCookieManager()
In HttpClient 4, the cookie policy is managed by the HttpClient object, not the method objects.
setUseInsecureSSL()
It now updates the schemeRegistry for the HttpClient instance. This replaces the Protocol.registerProtocol/unregisterProtocol in HttpClient 3.1.
AbstractHttpClient has more configuration options in HtpClient 4.0, similar to the configuration of WebClient.
Is it possible to manage the HttpClient instance in HttpWebConnection. Some information would have to be passed to HttpWebConnection, but nothing complicated.
If you are still not sure… I see no problem in updating my patch.
Hello again,
I just added an updated patch and the unit tests. Also added my pom.xml file.
The patch still uses an httpclient instance in webclient. See previous comment(s) for explanations.
Cool! I'll have a look in the next couple of days and post feedback.
BTW, I've opened an issue with the HttpClient guys regarding serialization of Credentials instances; if this makes it into HC 4.1, it would get rid of some of the serialization-related hacks in the HU code.
https://issues.apache.org/jira/browse/HTTPCLIENT-916
I had a quick look at v2 of the patch, but it's very hard to see what has really changed; there are lots of these sorts of changes:
} else {
This is just noise that keeps the patch from being readable... can you get rid of all of these pure formatting changes and resubmit? Thanks!
Added 2 new patches. Should look better !
Quick question: the ClientConnectionManager of HttpClient (v.4.0) should be shutdown() after usage. Would WebClient.closeAllWindows() be the proper place to do this ?
Thanks
Just added the shutdown call in WebClient.closeAllWindows().
See patch_main_v4.txt.
The patch is getting old now. I'm using it in Constellio (our open-source crawl/search engine).
Any interest in getting this added ?
Regards
Hi Nicolas,
Thanks for updating the patch, you can remove old versions.
I have the following comments:
- You can add @author of your name to all significantly changed files
- WebClient should never know about httpClient, so make the latter part of HttpWebConnection instead.
- I am inclined to make closing httpClient after each request, not in WebClient.closeAllWindows()
@Daniel: your feedback please?
Hi,
-- WebClient should never know about httpClient, so make the latter part of HttpWebConnection instead. --
From HttpClient documentation : http://hc.apache.org/httpcomponents-client/tutorial/html/connmgmt.html
"HttpClient has a complete control over the process of connection initialization and termination as well as I/O operations on active connections."
Creating and closing an HttpClient instance for each HttpWebConnection is not using HttpClient appropriately.
In the patch, connection opening/closing is still done in HttpWebConnection, but it can now benefit from HttpClient's connection pool and centralized configuraion.
HttpClient has changed and I think it's important to re-consider how it now fits into HtmlUnit.
By the way, HttpClient initialization is still lazy.
Also, if there's some design aspects you want to discuss, I'm available on Skype (nicolas.belisle).
Cheers !
Nicolas
Hi Nicolas,
I have no issue of delgating whatever supported by HttpClient (cookie management, open connections, etc).
By "WebClient should never know about httpClient", I meant that HttpClient is specific to HttpWebConnection, not to other types of connections (e.g. FalsifyingWebConnection), and when HttpClient is upgraded to future versions, only HttpWebConnection should be modified
When I said ""
I get it.
I've modified the patches.
A few things;
- I added code to implement a maximum content length for WebClient and a corresponding test case (WebClient4Test).
- There's a few tests that fail. I would need input on those. My knowledge of HtmlUnit lacks on certain aspects.
- I've added a few FIXME comments in the code.
Hi Nicolas,
The patch is not updated, I uploaded one against SVN trunk (httpclient4.patch).
Against r5703
src/main/java
src/test/java
maven project file
Hi,
I updated my patch files (see main_patch.txt and test_patch.txt below). Ran checkstyle and unit tests. Did some important changes.
WebClient3Test still fails because of a bug in HttpClient 4. I just filled an issue in their JIRA
https://issues.apache.org/jira/browse/HTTPCLIENT-939
WebClient.setMaxContentLength() what added so that users can restrict the maximum length for any content fetched by WebClient. It will throw an IOException in two cases:
1. The header specifies a larger than allowed content length, or
2. The content Inputstream reads farther the allowed content length.
I added "WebClient4Test.java" in the patch to test the new feature.
I added an updated "pom.xml" file with proper formatting.
There are few key changes:
- NTLM support in HttpClient 4 is not officially provided. See http://hc.apache.org/httpcomponents-client/ntlm.html
- WebClient. setUseInsecureSSL() cannot be statically registered, as with HttpClient 3. I’m using the "instanceOf" operator as I workaround which is a fragile approach.
- WebClient.closeAllWindows() has a similar problem for HttpWebConnection shutdown.
It's a lot of work so far, hope it's helpful.
Regards
Hi
I checked out the trunk now, installed core-js and when I'm installing htmlunit itself, I'm getting a lot of compilation errors like this:
HtmlPage.java:[33,49] package net.sourceforge.htmlunit.corejs.javascript does not exist
HTMLElement.java:[30,49] package net.sourceforge.htmlunit.corejs.javascript does not exist
I really need to use httpclient 4.0 together with htmlunit, could please anybody explain to me what to do?
Thank you, Joseph
Please use the user-list for such questions, this is not related to this Feature Request
Nicolas, your work speaks for itself, thanks so much for the effort.
Most of the patch is committed in SVN, with the following comments:
1- Huge amount of HttpClient debugging is automatically printed, to be switched off
2- NTLM authentication to be available (with allowing the possibility to use HttpClient unofficial mechanism)
3- Insecure SSL to be tested
Those three points must be verified/resolved before closing this issue
The following changes was made to pass all tests:
4- Custom httpClient_.setRedirectHandler() was used to resolve the HttpClient 'bug', see HttpWebConnection.getHttpClient()
5- HttpWebConnection.setUseInsecureSSL() code was moved to a package private class to allow GAE tests to pass.
Others:
6- setMaxContentLength() was removed as I don't find it useful, please use the dev-list for further discussion
7- XMLHttpRequest: there was an incorrectly hardcoded "username"/"password"
8- You can put your name in test cases if you wish, please provide a patch
Thanks again
Hi again,
1 HttpClient debugging is now resolved
3 Insecure SSL is not working, please verify
SSLInsecure Patch