-----Original Message-----
From: Hancock, David (DHANCOCK) [mailto:DHANCOCK@arinc.com]
Sent: Saturday, September 27, 2003 4:37 PM
To: webware-discuss@lists.sourceforge.net
Cc: Adamshick, Greg (GADAMSHI); Bugenhagen, John (JBUGENHA); Kancianic, Jennifer C. (JKANCIAN); Parangot, Reena M. (RMP)
Subject: [Webware-discuss] Webware/WebKit and load-balancing

Has anyone in Webware-land been successful in implementing a load-balancer between Apache and one or more WebKit instances?  I've been trying to do this for many weeks without success.  I wrote about my problems a while ago, but I still haven't had any luck.

Hmmm.  Do you really mean load balancing between a single instance of Apache and multiple WebKit instances?  Ordinarily I would assume the load balancer would go _before_ Apache and would balance between multiple Apache instances, each of which uses mod_webkit to talk to its own single instance of WebKit.
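For reference, in that arrangement each Apache's mod_webkit config just points at the WebKit appserver on the same box.  A sketch of the httpd.conf fragment (the module path, URL prefix, and port 8086 here are assumptions -- check the README that ships with your copy of mod_webkit for the exact directive spelling):

```apache
# Hypothetical fragment: this Apache forwards /webkit requests to the
# WebKit appserver running on the same machine, listening on port 8086.
LoadModule webkit_module libexec/mod_webkit.so

<Location /webkit>
    WKServer localhost 8086
    SetHandler webkit-handler
</Location>
```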

Pertinent information: Webware 0.8.1, Python 2.2, RedHat 7.3 (2.4.20something kernel), mod_webkit, DCOracle2 in use, also pymqi (Python bindings for MQSeries middleware).  We have two web/application servers, each running Apache and WebKit.  We use a Cisco LocalDirector listening on a virtual IP, load-balancing (and failing out) the Apache servers.  Once the LocalDirector binds to the real IP of a web/app server, that server's Apache and WebKit handle the request.  Sessions use the File store and live on an NFS mount, so either server can receive a request and handle the session.

I'm not sure the File store is "process-safe" -- is it possible for two processes to step on one another's sessions?  That's something you might want to look into.  (But it's probably not related to the wedging you're seeing.)
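If it turns out the File store isn't safe, one way to guard a file-based session store is an advisory lock around each read and write.  A minimal sketch (the helper names save_session/load_session are hypothetical, not Webware's actual SessionFileStore API):

```python
# Sketch: serializing access to a pickled session file with an advisory
# lock, so two appserver processes can't interleave a read and a write.
import fcntl
import pickle

def save_session(path, session):
    """Pickle a session dict to disk while holding an exclusive lock."""
    with open(path, "wb") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks until no other holder
        pickle.dump(session, f)
        fcntl.flock(f, fcntl.LOCK_UN)

def load_session(path):
    """Read a pickled session back while holding a shared lock."""
    with open(path, "rb") as f:
        fcntl.flock(f, fcntl.LOCK_SH)
        session = pickle.load(f)
        fcntl.flock(f, fcntl.LOCK_UN)
    return session
```

One caveat for your NFS setup: flock() locks historically do not propagate across NFS, so locking between the two servers would need fcntl-style (lockf) locks and a working lockd, or some other cross-host scheme.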

My main goal is to trap, essentially in real time, those cases where WebKit hangs but doesn't die.  The LocalDirector does an admirable (albeit expensive) job of handling hardware failure or stopped Apache servers.  But the hardware (knock wood) hasn't failed, and Apache is rock-solid.  WebKit, on the other hand, has on numerous occasions just "stopped."  Unfortunately, Apache continues to handle the incoming requests, pesters the dead WebKit port 10 times, and then returns 500 Server Error to the client.

If you're comfortable hacking the C code, you could change mod_webkit's behavior.  Its current behavior was intended to allow restarting WebKit without losing requests.  You could modify it to add fault recovery -- when it can't contact WebKit, it could attempt to restart WebKit, wait a few seconds, _then_ try again.
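The recovery logic would look something like the following, sketched in Python for clarity even though the real change would live in mod_webkit's C code.  The restart command is an assumption -- substitute whatever init script or watchdog command you use:

```python
# Sketch: connect to the appserver; if that fails, restart it, wait a
# few seconds, then try exactly once more before giving up.
import socket
import subprocess
import time

def connect_with_recovery(host, port, restart_cmd, wait=3.0, timeout=2.0):
    """Return a socket to the appserver, restarting it once on failure."""
    for attempt in range(2):
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            if attempt == 0:
                # e.g. restart_cmd = "/etc/init.d/webkit restart" (assumption)
                subprocess.call(restart_cmd, shell=True)
                time.sleep(wait)  # give WebKit a moment to bind its port
    raise ConnectionError("appserver did not come back after restart")
```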

Another possibility is to add load balancing and fault tolerance right into mod_webkit -- when it fails to connect to one appserver it could try another.
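In pseudocode-ish Python (again, the real implementation would be C inside mod_webkit, and the backend list is a made-up example), the failover loop is just:

```python
# Sketch: try each configured appserver in order and hand back a
# connection to the first one that answers.
import socket

def connect_any(backends, timeout=2.0):
    """Return a socket to the first reachable (host, port) backend."""
    last_err = None
    for host, port in backends:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:
            last_err = err   # remember the failure, fall through to next
    raise ConnectionError("no appserver reachable") from last_err
```

A smarter version would remember which backend worked and rotate the starting point for rough load balancing, but even this simple loop gets you failover.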

We've had good success load-balancing some outbound xmlrpc requests using first proxylb and then pythondirector.  But when I try either of these software load-balancers, I get a 500 Server Error response and "cannot scan servlet headers" in the Apache error log.  The mod_webkit.c code shows that this error occurs AFTER the request has gone to the WebKit port:

    . . .
    /* Now we get the response from the AppServer */
    ap_hard_timeout("wk_read",r);


    //  log_message("scanning for headers",r);
    //pull out headers
    if ((ret=ap_scan_script_header_err_buff(r, buffsocket, NULL))) {
        if( ret>=500 || ret < 0) {
            log_message("cannot scan servlet headers ", r);
            return 2;
        }
        r->status_line = NULL;
    }
    . . .

Again, is this load balancing between Apache and WebKit, or is it between the web client and Apache?  I assume the load balancers were only designed for the latter.

Any ideas?  I'd be grateful for (a) suggestions on which way to go with troubleshooting or (b) pointers to other solutions that have worked for failover.  We're already checking for one style of hung WebKit process and issuing a restart, but that hasn't handled every hang mode we've encountered.
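One thing that might catch more hang modes than a process check: a wedged appserver often still accepts TCP connections (the kernel's listen backlog does that for free) but never writes a response.  So probe end-to-end -- require bytes back within a deadline -- and restart when the probe fails.  A minimal sketch; note the probe payload here is a placeholder, since mod_webkit speaks its own marshalled request format, so in practice you'd point this at a cheap servlet through Apache or teach it the real protocol:

```python
# Sketch: health probe that demands an actual response, not just a
# successful connect, so "port open but wedged" counts as down.
import socket

def appserver_responds(host, port, probe=b"ping\n", timeout=5.0):
    """Return True iff the server sends any bytes back within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(probe)
            return bool(s.recv(1))   # any byte back counts as alive
    except OSError:
        return False                 # refused, timed out, or reset
```

Run it from cron every minute or so, and issue your existing restart when it returns False a couple of times in a row.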

- Geoff