
#162 Get HtmlUnit to run on Google App Engine (GAE)

closed
None
5
2014-08-18
2010-03-02
Amit Manjhi
No

GAE places several restrictions that prevent a vanilla HtmlUnit from running on it. Specifically, no threads can be started on GAE, and many classes, such as URLStreamHandler and the applet-related classes, are not allowed.

I have a somewhat hacky solution to get HtmlUnit to run on GAE. It builds on the dethreading patch, and has 3 main changes:
-- The EventLoop does not start a thread. Instead, it just accumulates jobs, which are run when the main thread calls pumpEventLoop(long timeout).
-- An AppEngine-compatible implementation of WebConnection called UrlFetchWebConnection.
-- A hack to avoid the use of URLStreamHandler. I rewrite a "javascript:<data>" URL as "http://javascript/<data>", and then UrlFetchWebConnection returns the appropriate response.
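The first change can be illustrated with a minimal sketch of a thread-free event loop. This is not the code from the patch: the class name PumpedEventLoop, the addJob method, and the return value of pumpEventLoop are assumptions made for illustration. Only the idea (accumulate jobs, then run them on the calling thread within a time budget) comes from the description above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a dethreaded event loop; not HtmlUnit's actual class. */
final class PumpedEventLoop {
    private final Deque<Runnable> pending = new ArrayDeque<>();

    /** Queues a job instead of handing it to a background thread. */
    void addJob(Runnable job) {
        pending.addLast(job);
    }

    /**
     * Runs queued jobs on the calling thread until the queue is empty or the
     * time budget is spent. A job that is already running is not interrupted.
     * Returns the number of jobs still pending (an assumption of this sketch).
     */
    int pumpEventLoop(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!pending.isEmpty() && System.nanoTime() - deadline < 0) {
            // A job may enqueue follow-up jobs; they run in the same pump call.
            pending.pollFirst().run();
        }
        return pending.size();
    }
}
```

Because nothing runs until the caller pumps the loop, code that used to rely on background JavaScript jobs has to call pumpEventLoop explicitly after loading a page.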

In addition, I had to change some of the variables in WebClient.java, so that they are not statically initialized. I have attached a patch against a "recent version" of my dethreading patch, just to get the discussion started.

Discussion

1 2 3 > >> (Page 1 of 3)
  • Marc Guillemot

    Marc Guillemot - 2010-03-03

    Thanks for the patch. This is indeed a hack, but this is good enough to start the discussion.

    In this first comment I don't want to go into details of the patch but rather about the target. I personally find GAE very interesting and would really welcome it if HtmlUnit could run on GAE, even if it is not the primary target.

    I believe that the first thing that we have to define is what we want to see running on GAE. Currently HtmlUnit doesn't work at all and it would surely be very difficult to have the "full" HtmlUnit running on GAE. We have to define something between these two extrema as the target.

    Once the target is defined, we need to ensure that it is reached and that future releases of HtmlUnit continue to reach it. Everything that can be tested by unit tests integrated in HtmlUnit's build has to be tested this way, but this isn't enough. GAE is different, and the only real test would be to deploy a project on GAE to run some tests (I've personally experienced that the dev mode is interesting but not good enough to give a definitive result). This means that we need to set up a new project, for instance in HtmlUnit's SVN, for this purpose and create a GAE app with HtmlUnit committers as developers (+ you? or you submit some other patches to become a committer ;-)).

    What do you think? What would you see as initial set of tests that should pass?

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-03

    Marc: Thanks for the comments. To start with, I would suggest adding a test that does the following, where urls is just a collection of popular JS heavy URLs:

    for (String url : urls) {
        HtmlPage page = client.getPage(inputUrl);
        client.pumpEventLoop(10000);
        String pageAsString = page.asXml();
        assert(...); // assert pageAsString contains the rendered DOM.
    }

    To implement such a test, we will need to come up with a list of URLs and conditions for each URL that confirm whether HtmlUnit produced the DOM correctly or not. This test would also directly help any developers who aim to use HtmlUnit for crawling, as outlined in http://code.google.com/web/ajaxcrawling/docs/html-snapshot.html

    Thoughts?

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-03

    In the code fragment, the first line should be:
    for (String inputUrl : urls) {

    In addition, the client could be set up as:
    WebClient client = new WebClient();

    static {
        WebClient.setAppEngineMode();
    }

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-04

    A file containing a comprehensive list of classes used in HtmlUnit code, which can't be used on AppEngine.

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-04

    Classes that can't be used on GAE.

     
  • Marc Guillemot

    Marc Guillemot - 2010-03-04

    Hmm, external URLs are surely interesting but I wouldn't use them as the first test. As we don't control them, we would have to regularly verify manually that:
    - "normal" HtmlUnit still works with them
    - get new reference DOM

    What about a set of things that we can control, without external dependencies like HtmlUnit JS library tests?

    Additional note: in your comment as well as in http://code.google.com/web/ajaxcrawling/docs/html-snapshot.html you use "new WebClient()", which is (currently) equivalent to "new WebClient(BrowserVersion.INTERNET_EXPLORER_7)". Is it intentional to simulate IE7 rather than another browser?

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-06

    Almost all the HtmlUnit JS library tests "should" work out of the box, except that a pumpEventLoop(...) call would need to be added. However, I still think it would be worthwhile to have an integration test going -- the knob we have is in selecting the URLs and what conditions to check for them. Perhaps, the "new reference DOM" could be obtained by running HtmlUnit in non-appEngine mode. So the test would check whether the AppEngine-mode HtmlUnit produces the same output as the "default" HtmlUnit. Thoughts?

    Regarding IE7, it is just an attempt to keep things simple. (plus, in most cases, it doesn't affect the output.) Maybe it is time to update the default browser in HtmlUnit :-).

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-08

    @Marc: Marking it as "next release" and assigning it to you. Is that fine?

     
  • Marc Guillemot

    Marc Guillemot - 2010-03-09

    @Amit: please let committers assign bugs themselves. Remember that most of the time we work on HtmlUnit in our free time (gigs are welcome to speed things up ;-)) and that even if I would welcome GAE support, I can't foresee how much time I will have to work on it.

    I find the idea of taking "normal HtmlUnit" as the reference for testing "HtmlUnit on GAE" interesting.

    Concerning the default browser: if it wouldn't hurt so many users, I would prefer to remove the notion of a default browser! ;-)

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-09

    Marc: Sorry for assigning the bug to you.

    Now that the dethreading patch is in, I will update the patch so that anyone can play with it. Let us also continue the discussion around what tests to add. So far, we have:
    - Being able to run most of HtmlUnit JS tests, in the AppEngine mode.
    - AppEngine integration tests, where we test the DOM produced by HtmlUnit in AppEngine mode against the DOM produced by HtmlUnit in default (non-AppEngine) mode.

    Did I miss anything?
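The second kind of test (AppEngine mode checked against default mode) could be organized roughly as below. This is a sketch under stated assumptions, not part of the patch: DomParityCheck and its mismatches method are hypothetical names, and the two renderers are passed in as plain functions from URL to serialized DOM so that the example does not depend on HtmlUnit itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper that compares two renderings of the same URLs. */
final class DomParityCheck {
    /**
     * Renders each URL with both functions and returns the URLs whose
     * outputs differ. An empty result means the two modes agree.
     */
    static List<String> mismatches(List<String> urls,
                                   Function<String, String> defaultMode,
                                   Function<String, String> appEngineMode) {
        List<String> differing = new ArrayList<>();
        for (String url : urls) {
            String expected = defaultMode.apply(url);  // reference DOM
            String actual = appEngineMode.apply(url);  // DOM under test
            if (!expected.equals(actual)) {
                differing.add(url);
            }
        }
        return differing;
    }
}
```

In a real test each function would wrap a WebClient configured for the corresponding mode and return page.asXml() after pumping the event loop. An exact string comparison is the strictest possible check and may need relaxing for pages whose content changes between requests.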

     
  • Anonymous

    Anonymous - 2010-03-12

    I just want to chip in to say I too am very interested in running HTML Unit on GAE. amitmanjhi mentions that the dethreading patch is in. Is there any way to get this version of HTML Unit? Would it run on GAE?

    (And thanks all for this great tool!)

     
  • Daniel Gredler

    Daniel Gredler - 2010-03-13

    @Phil: I believe this is currently at the "working proof of concept" stage; you can grab the sources in SVN trunk and apply this patch and probably get it to work, but it may be a while yet before you can just download HtmlUnit and use it in GAE. I guess if Amit and Marc continue to collaborate as quickly as they did for the dethreading patch, it might happen before the next version of HtmlUnit is released :-)

     
  • Anonymous

    Anonymous - 2010-03-14

    @Amit, @Marc: Cheers cheers! Go go go! Viva Open Source. :) More seriously, if there's anything I can do to help speedup the process, let me know, I could have some time to contribute myself.

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-26

    Updated git patch against r5611. To see this in action, visit: http://ajax-crawler.appspot.com/

    Built with GWT + HtmlUnit with this patch.

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-26

    @Phil: you are welcome to try this patch. Apply it against a current htmlunit svn source (r5611 or higher). Build the HtmlUnit jars and try them on AppEngine.

    Now that Marc is back, I hope things can move quickly on landing a version of this patch.

     
  • Marc Guillemot

    Marc Guillemot - 2010-03-26

    For info: I'm working slowly on this. I've added a unit test (as NotYetImplemented) simulating the problems caused by the GAE white list. I have an idea of how to fix it cleanly (the proposed patch is a hack and can't be integrated as is). Once this is done, you can update the patch to make use of it.

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-26

    Marc: I looked at r5610. It is a great approach for testing appEngine compatibility. Looking forward to the rest of the patch.

     
  • Marc Guillemot

    Marc Guillemot - 2010-03-29

    I've committed a fix that avoids the NoClassDefFoundError for URLStreamHandler. It uses a hack similar to the one in your patch for javascript, about and data URLs, BUT without the need to change WebClient.URL_ABOUT_BLANK or to make any change that would modify the normal behaviour of HtmlUnit.

    Can you take these changes into account and simplify your patch?

    I'd like to continue step by step, with dedicated tests all the time. The next problem with HtmlUnit on GAE is now that the HttpWebConnection starts a thread. My plan is to write a test that reproduces this problem and then to use the web connection of your patch. I don't know when I'll be able to work. If you provide a patch for that, it could go faster.
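For readers following the URL hack, here is a minimal sketch of the rewrite described in the original report ("javascript:<data>" becomes "http://javascript/<data>", and the web connection recognizes the fake host). It does not show the committed fix mentioned in this comment, whose details are not given in the thread. The class SchemeRewrite and its method names are hypothetical, and escaping of the data part is deliberately left out.

```java
/** Hypothetical illustration of the javascript: URL rewrite; not the patch's code. */
final class SchemeRewrite {
    private static final String SCHEME_PREFIX = "javascript:";
    private static final String FAKE_HOST_PREFIX = "http://javascript/";

    /** Rewrites javascript:<data> to http://javascript/<data>; other URLs pass through. */
    static String encode(String url) {
        return url.startsWith(SCHEME_PREFIX)
                ? FAKE_HOST_PREFIX + url.substring(SCHEME_PREFIX.length())
                : url;
    }

    /** True if the URL points at the fake host and must not be fetched over the network. */
    static boolean isRewritten(String url) {
        return url.startsWith(FAKE_HOST_PREFIX);
    }

    /** Restores the original javascript: URL from its rewritten form. */
    static String decode(String url) {
        return isRewritten(url)
                ? SCHEME_PREFIX + url.substring(FAKE_HOST_PREFIX.length())
                : url;
    }
}
```

The point of the trick is that an http URL can be constructed without registering a custom URLStreamHandler, which GAE does not allow. The cost is that a genuine host named "javascript" can no longer be addressed, which is one reason the approach is called a hack in this thread.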

     
  • Amit Manjhi

    Amit Manjhi - 2010-03-31

    Marc: I will take a look at the changes and update the patch.

    Wouldn't HttpWebConnection fail because it tries to load the java.lang.Thread class somewhere?

     
  • Anonymous

    Anonymous - 2010-04-13

    Just checking the status on this... I'm still really interested by that patch and can't wait to try it on my project! Thanks to all of you, btw.

     
  • Amit Manjhi

    Amit Manjhi - 2010-04-13

    @Phil: you can try the previous patch I attached to this issue. You might have to use a previous svn revision. Let us know how it goes.

     
  • Anonymous

    Anonymous - 2010-04-13

    I applied the patch, appEngine.r5611.patch, against revision 5611, packaged and implemented AJAX crawling, as per http://code.google.com/web/ajaxcrawling/docs/html-snapshot.html, on a GWT application.

    About 50% of the time, when an AJAX-crawled URL containing escaped_fragment is requested, a timeout exception is thrown; details are below.

    @amitmanjhi How did you structure http://ajax-crawler.appspot.com/ to allow much longer requests? That is, did you configure anything special regarding timeouts? What sequence of instructions did you use on WebClient? Are you willing to share the code?

    Thanks

    com.gargoylesoftware.htmlunit.UrlFetchWebConnection getResponse: Exception Timeout while fetching: http://someapp.appspot.com/#!about while trying to fetch, returning () for URL http://someapp.appspot.com/#!about
    java.io.IOException: Timeout while fetching: http://someapp.appspot.com/#!about
    at com.google.appengine.api.urlfetch.URLFetchServiceImpl.convertApplicationException(URLFetchServiceImpl.java:108)
    at com.google.appengine.api.urlfetch.URLFetchServiceImpl.fetch(URLFetchServiceImpl.java:39)
    at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.fetchResponse(URLFetchServiceStreamHandler.java:404)
    at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getInputStream(URLFetchServiceStreamHandler.java:283)
    at com.google.apphosting.utils.security.urlfetch.URLFetchServiceStreamHandler$Connection.getResponseCode(URLFetchServiceStreamHandler.java:136)
    at com.gargoylesoftware.htmlunit.UrlFetchWebConnection.getResponse(UrlFetchWebConnection.java:107)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1481)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1410)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)

     
  • Amit Manjhi

    Amit Manjhi - 2010-04-13

    No, I did not configure anything special. Here is the prototype code I run:

    static {
        WebClient.setAppEngineMode();
    }

    public String greetServer(String inputUrl) {
        WebClient client = new WebClient();
        client.setThrowExceptionOnScriptError(false);
        try {
            HtmlPage page = client.getPage(inputUrl);
            client.pumpEventLoop(10000);
            return page.asXml();
        } catch (FailingHttpStatusCodeException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (MalformedURLException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "Unable to get a DOM for " + inputUrl;
    }
    }

     
  • Amit Manjhi

    Amit Manjhi - 2010-04-26

    Here is the long-due updated patch against r5662. I was able to simplify it a lot, thanks to Marc's changes and after discarding some other unnecessary changes I had. One other change I needed to make and which contributed to the delay was that Marc's hack did not work in GAE's dev app server: http://code.google.com/p/googleappengine/issues/detail?id=3155

    Instead, I had to use a GAE recommended way to find out whether we are in GAE mode: WebClient::isGaeMode(). Please apply the patch using 'patch -p1'
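The thread does not show how isGaeMode() is implemented. One plausible approach, sketched here as an assumption, is to look at the system property com.google.appengine.runtime.environment, which the App Engine Java runtime sets to "Production" or "Development". The class name GaeDetection is hypothetical.

```java
/** Hypothetical sketch of GAE detection via a runtime system property. */
final class GaeDetection {
    // Set by the App Engine Java runtime ("Production" or "Development");
    // absent in an ordinary JVM.
    static final String ENVIRONMENT_PROPERTY = "com.google.appengine.runtime.environment";

    /** Returns true when the property is present, in production or in the dev server. */
    static boolean isGaeMode() {
        return System.getProperty(ENVIRONMENT_PROPERTY) != null;
    }
}
```

A property check of this kind gives the same answer in the dev app server and in production, unlike a probe that depends on which classes happen to be loadable.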

     
  • Marc Guillemot

    Marc Guillemot - 2010-04-27

    Amit:
    - I'm surprised concerning GAE bug 3155. As far as I can remember, this is something I had successfully tested in GAE dev mode. Perhaps it changed between two GAE versions?
    - what about unit tests?

     