GAE places several restrictions that prevent a vanilla HtmlUnit from running on it. Specifically, no threads can be started on GAE. GAE also does not allow many classes, such as URLStreamHandler, Applets, etc.
I have a somewhat hacky solution to get HtmlUnit to run on GAE. It builds on the dethreading patch, and has 3 main changes:
-- The EventLoop does not start a thread. Instead, it just accumulates jobs, which are run when the main thread calls pumpEventLoop(long timeout).
-- An AppEngine-compatible implementation of WebConnection called UrlFetchWebConnection.
-- A hack to avoid the use of URLStreamHandler. I rewrite a "javascript:<data>" URL as "http://javascript/<data>" and then, in UrlFetchWebConnection, I return the appropriate response.
In addition, I had to change some of the variables in WebClient.java, so that they are not statically initialized. I have attached a patch against a "recent version" of my dethreading patch, just to get the discussion started.
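To make the first change concrete, here is a minimal sketch of an event loop that queues jobs and only runs them when the caller pumps it. It is not the code from the patch: the class name DeferredEventLoop and the methods addJob and pendingJobs are hypothetical, and only the pumpEventLoop(long timeout) name comes from the description above.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a thread-free event loop; not HtmlUnit's actual class. */
class DeferredEventLoop {
    private final Queue<Runnable> jobs = new ArrayDeque<>();

    /** Queues a job instead of handing it to a background thread. */
    void addJob(Runnable job) {
        jobs.add(job);
    }

    /** Number of jobs still waiting to run. */
    int pendingJobs() {
        return jobs.size();
    }

    /**
     * Runs queued jobs on the calling thread until the queue is empty or the
     * timeout has elapsed. Returns the number of jobs that were executed.
     */
    int pumpEventLoop(long timeoutMillis) {
        long start = System.nanoTime();
        long budget = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        int executed = 0;
        while (!jobs.isEmpty() && System.nanoTime() - start < budget) {
            jobs.poll().run(); // a job may enqueue follow-up jobs
            executed++;
        }
        return executed;
    }
}
```

Because everything runs on the caller's thread, no Thread is ever created. The trade-off is that nothing happens between calls, so the caller has to pump after loading a page.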
Thanks for the patch. This is indeed a hack, but this is good enough to start the discussion.
In this first comment I don't want to go into details of the patch but rather about the target. I personally find GAE very interesting and would really welcome it if HtmlUnit could run on GAE, even if it is not the primary target.
I believe that the first thing that we have to define is what we want to see running on GAE. Currently HtmlUnit doesn't work at all and it would surely be very difficult to have the "full" HtmlUnit running on GAE. We have to define something between these two extrema as the target.
Once the target is defined, we need to ensure that it is reached and that future releases of HtmlUnit continue to reach it. Everything that can be tested by unit tests integrated in HtmlUnit's build has to be tested this way, but this isn't enough. GAE is different, and the only real test would be to deploy a project on GAE to run some tests (I've personally experienced that the dev mode is interesting but not good enough to give a definitive result). This means that we need to set up a new project, for instance in HtmlUnit's SVN, for this purpose and create a GAE app with HtmlUnit committers as developers (+ you? or you submit some other patches to become a committer ;-)).
What do you think? What would you see as initial set of tests that should pass?
Marc: Thanks for the comments. To start with, I would suggest adding a test that does the following, where urls is just a collection of popular JS heavy URLs:
for (String url : urls) {
    HtmlPage page = client.getPage(inputUrl);
    client.pumpEventLoop(10000);
    String pageAsString = page.asXml();
    assert(...); // assert pageAsString contains the rendered DOM.
}
To implement such a test, we will need to come up with a list of URLs and conditions for each URL that confirm whether HtmlUnit produced the DOM correctly or not. This test would also directly help any developers who aim to use HtmlUnit for crawling, as outlined in http://code.google.com/web/ajaxcrawling/docs/html-snapshot.html
Thoughts?
In the code fragment, the first line should be:
for (String inputUrl : urls) {
In addition, the client could be set up as:
WebClient client = new WebClient();
static {
    WebClient.setAppEngineMode();
}
A file containing a comprehensive list of classes used in HtmlUnit code, which can't be used on AppEngine.
Classes that can't be used on GAE.
Hmm, external URLs are surely interesting but I wouldn't use them as the first test. As we don't control them, we would have to regularly verify manually that:
- "normal" HtmlUnit still works with them
- get new reference DOM
What about a set of things that we can control, without external dependencies like HtmlUnit JS library tests?
Additional note: in your comment as well as in http://code.google.com/web/ajaxcrawling/docs/html-snapshot.html you use "new WebClient()", which is (currently) equivalent to "new WebClient(BrowserVersion.INTERNET_EXPLORER_7)". Is it intentional to simulate IE7 rather than another browser?
Almost all the HtmlUnit JS library tests "should" work out of the box, except that a pumpEventLoop(...) call would need to be added. However, I still think it would be worthwhile to have an integration test going -- the knob we have is in selecting the URLs and what conditions to check for them. Perhaps, the "new reference DOM" could be obtained by running HtmlUnit in non-appEngine mode. So the test would check whether the AppEngine-mode HtmlUnit produces the same output as the "default" HtmlUnit. Thoughts?
Regarding IE7, it is just an attempt to keep things simple. (plus, in most cases, it doesn't affect the output.) Maybe it is time to update the default browser in HtmlUnit :-).
@Marc: Marking it as "next release" and assigning it to you. Is that fine?
@Amit: please let committers assign bugs themselves. Remember that most of the time we work on HtmlUnit in our free time (gigs are welcome to speed things up ;-)) and that even if I would welcome GAE support, I can't foresee how much time I will have to work on it.
I find the idea interesting to take "normal HtmlUnit" as reference to test "HtmlUnit on GAE".
Concerning the default browser: if it wouldn't hurt so many users, I would prefer to remove the notion of default browser! ;-)
Marc: Sorry for assigning the bug to you.
Now that the dethreading patch is in, I will update the patch so that anyone can play with it. Let us also continue the discussion around what tests to add. So far, we have:
- Being able to run most of HtmlUnit JS tests, in the AppEngine mode.
- AppEngine integration tests, where we test the DOM produced by HtmlUnit in AppEngine mode against the DOM produced by HtmlUnit in default mode.
Did I miss anything?
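As a rough sketch of the second kind of test, the comparison could be driven by two renderers, one per mode, each mapping a URL to the serialized DOM. The class ModeComparison and the method findMismatches are hypothetical names; in a real test each renderer would presumably wrap client.getPage(url) followed by page.asXml(), with a pumpEventLoop(...) call added in AppEngine mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper that compares the output of two rendering modes. */
class ModeComparison {
    /**
     * Renders every URL with both renderers and returns the URLs for which
     * the two outputs differ. An empty list means the modes agree.
     */
    static List<String> findMismatches(List<String> urls,
                                       Function<String, String> defaultMode,
                                       Function<String, String> appEngineMode) {
        List<String> mismatches = new ArrayList<>();
        for (String url : urls) {
            String expected = defaultMode.apply(url);   // reference DOM
            String actual = appEngineMode.apply(url);   // DOM under test
            if (!expected.equals(actual)) {
                mismatches.add(url);
            }
        }
        return mismatches;
    }
}
```

Returning the list of mismatching URLs rather than failing on the first one makes it easier to see whether a regression affects a single page or all of them.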
I just want to chip in to say I too am very interested in running HtmlUnit on GAE. amitmanjhi mentions that the dethreading patch is in. Is there any way to get this version of HtmlUnit? Would it run on GAE?
(And thanks all for this great tool!)
@Phil: I believe this is currently at the "working proof of concept" stage; you can grab the sources in SVN trunk and apply this patch and probably get it to work, but it may be a while yet before you can just download HtmlUnit and use it in GAE. I guess if Amit and Marc continue to collaborate as quickly as they did for the dethreading patch, it might happen before the next version of HtmlUnit is released :-)
@Amit, @Marc: Cheers cheers! Go go go! Viva Open Source. :) More seriously, if there's anything I can do to help speed up the process, let me know; I could have some time to contribute myself.
Updated git patch against r5611. To see this in action, visit: http://ajax-crawler.appspot.com/
Built with GWT + HtmlUnit with this patch.
@Phil: you are welcome to try this patch. Apply it against a current htmlunit svn source (r5611 or higher). Build the HtmlUnit jars and try them on AppEngine.
Now that Marc is back, I hope things can move quickly on landing a version of this patch.
For info: I am working slowly on this. I've added a unit test (as NotYetImplemented) simulating the problems due to the GAE white list. I have an idea how to fix it cleanly (the proposed patch is a hack and can't be integrated this way). Once this is done, you can update the patch to make use of it.
Marc: I looked at r5610. It is a great approach for testing appEngine compatibility. Looking forward to the rest of the patch.
I've committed a fix that avoids the NoClassDefFoundError for URLStreamHandler. It uses a hack similar to the one in your patch for javascript, about and data URLs, BUT without the need to change WebClient.URL_ABOUT_BLANK or to make any change that would modify the normal behaviour of HtmlUnit.
Can you take these changes into account and simplify your patch?
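For readers following along, the rewriting idea can be sketched as below. A URL such as "javascript:alert(1)" cannot be turned into a java.net.URL without a custom URLStreamHandler, whereas "http://javascript/alert(1)" is handled by the built-in http handler. This is only an illustration of the mapping described earlier in the thread: the class PseudoUrlRewriter and its method names are hypothetical, and applying the same mapping to about: and data: URLs is an assumption, not necessarily what the committed fix does.

```java
/** Hypothetical sketch of the pseudo-URL rewriting; not the committed code. */
class PseudoUrlRewriter {
    // Schemes that would otherwise need a custom URLStreamHandler.
    private static final String[] SCHEMES = {"javascript", "about", "data"};

    /** Rewrites "scheme:rest" to "http://scheme/rest" for the schemes above. */
    static String toHttpForm(String url) {
        for (String scheme : SCHEMES) {
            String prefix = scheme + ":";
            // Scheme names are case-insensitive, so ignore case when matching.
            if (url.regionMatches(true, 0, prefix, 0, prefix.length())) {
                return "http://" + scheme + "/" + url.substring(prefix.length());
            }
        }
        return url; // ordinary URLs pass through unchanged
    }

    /** Reverses toHttpForm so the web connection can recover the original. */
    static String fromHttpForm(String url) {
        for (String scheme : SCHEMES) {
            String prefix = "http://" + scheme + "/";
            if (url.startsWith(prefix)) {
                return scheme + ":" + url.substring(prefix.length());
            }
        }
        return url;
    }
}
```

One known weakness of this approach is that a real host literally named "javascript", "about" or "data" would be misinterpreted, which is part of why it counts as a hack.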
I'd like to continue step by step, with dedicated tests all the time. The next problem with HtmlUnit on GAE is that the HttpWebConnection starts a thread. My plan is to write a test that reproduces this problem and then to use the web connection from your patch. I don't know when I'll be able to work on it. If you provide a patch for that, it could go faster.
Marc: I will take a look at the changes and update the patch.
Wouldn't HttpWebConnection fail because it tries to load the java.lang.Thread class somewhere?
Just checking the status on this... I'm still really interested in that patch and can't wait to try it on my project! Thanks to all of you, btw.
@Phil: you can try the previous patch I attached to this issue. You might have to use a previous svn revision. Let us know how it goes.
I applied the patch, appEngine.r5611.patch, against revision 5611, packaged and implemented AJAX crawling, as per http://code.google.com/web/ajaxcrawling/docs/html-snapshot.html, on a GWT application.
About 50% of the time, when an AJAX-crawled URL containing escaped_fragment is requested, a timeout exception is thrown; details below.
@amitmanjhi How did you structure http://ajax-crawler.appspot.com/ to allow much longer requests? That is, did you configure anything special regarding timeouts? What sequence of instructions did you use on WebClient? Are you willing to share the code?
Thanks
com.gargoylesoftware.htmlunit.UrlFetchWebConnection getResponse: Exception Timeout while fetching: http://someapp.appspot.com/#!about while trying to fetch, returning () for URL http://someapp.appspot.com/#!about
java.io.IOException: Timeout while fetching: http://someapp.appspot.com/#!about
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:108)
at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:39)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:404)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:283)
at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getResponseCode(URLFetchServiceStreamHandler.java:136)
at com.gargoylesoftware.htmlunit.UrlFetchWebConnection.getResponse(UrlFetchWebConnection.java:107)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1481)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1410)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
No, I did not configure anything special. Here is the prototype code I run:
static {
    WebClient.setAppEngineMode();
}

public String greetServer(String inputUrl) {
    WebClient client = new WebClient();
    client.setThrowExceptionOnScriptError(false);
    try {
        HtmlPage page = client.getPage(inputUrl);
        client.pumpEventLoop(10000);
        return page.asXml();
    } catch (FailingHttpStatusCodeException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (MalformedURLException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return "Unable to get a DOM for " + inputUrl;
}
} // closes the enclosing class (declaration not shown)
Here is the long-overdue updated patch against r5662. I was able to simplify it a lot, thanks to Marc's changes and after discarding some other unnecessary changes I had. One other change I needed to make, and which contributed to the delay, was that Marc's hack did not work in GAE's dev app server: http://code.google.com/p/googleappengine/issues/detail?id=3155
Instead, I had to use a GAE recommended way to find out whether we are in GAE mode: WebClient::isGaeMode(). Please apply the patch using 'patch -p1'
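The body of isGaeMode() is not shown in this thread. A minimal sketch, assuming it relies on the system property that the App Engine runtime sets ("com.google.appengine.runtime.environment", with the value "Production" or "Development"), might look like this. The class name GaeDetector and the isProduction helper are hypothetical.

```java
/** Hypothetical sketch of GAE detection via a system property. */
class GaeDetector {
    // Property set by the App Engine runtime, both in production and in the
    // dev app server; it is normally absent in a plain JVM.
    static final String ENVIRONMENT_KEY = "com.google.appengine.runtime.environment";

    /** True when running under App Engine (production or dev server). */
    static boolean isGaeMode() {
        return System.getProperty(ENVIRONMENT_KEY) != null;
    }

    /** True only on the production runtime, not in the dev app server. */
    static boolean isProduction() {
        return "Production".equals(System.getProperty(ENVIRONMENT_KEY));
    }
}
```

Checking a property rather than probing for restricted classes has the advantage of behaving the same way in the dev app server and in production.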
Amit:
- I'm surprised concerning GAE bug 3155. As far as I can remember, this is something I had successfully tested in GAE dev mode. Perhaps it changed between two GAE versions?
- what about unit tests?