Has anyone in Webware-land succeeded in putting a load balancer between Apache and one or more WebKit instances?  I've been trying to do this for many weeks without success.  I wrote about my problems a while ago, but I still haven't had any luck.

Pertinent information: Webware 0.8.1, Python 2.2, RedHat 7.3 (2.4.20something kernel), mod_webkit, DCOracle2 in use, also pymqi (Python binding for MQSeries middleware).  We have two web/application servers, each running both Apache and WebKit.  A Cisco LocalDirector listens on a virtual IP and load-balances (and fails out) the Apache servers.  Once the LocalDirector binds to the real IP of a web/app server, that server's Apache and WebKit handle the request.  Sessions use the File store in a directory shared over NFS, so either server can receive a request and handle the session.

My main goal is to trap, basically in real time, those cases where WebKit hangs but doesn't die.  The LocalDirector does an admirable (albeit expensive) job of handling hardware failure or stopped Apache servers.  So far, though, the hardware (knock wood) hasn't failed and Apache is rock-solid, while WebKit has on numerous occasions just "stopped."  Unfortunately, when that happens Apache keeps accepting incoming requests, pesters the dead WebKit port 10 times, and then returns 500 Server Error to the client.
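
To make the goal concrete, here is a minimal sketch of the kind of probe I have in mind. It is not something we run today. It assumes a modern Python 3 interpreter (not the Python 2.2 mentioned above), and the function name `check_webkit` and any URL passed to it are hypothetical. It fetches a servlet URL through Apache with a short timeout and distinguishes a 5xx answer from no answer at all:

```python
import http.client
import socket
import urllib.error
import urllib.request


def check_webkit(url, timeout=5.0):
    """Probe one URL once and return 'ok', 'error', or 'hung'.

    'error' covers 5xx responses and refused or broken connections;
    'hung' means nothing came back within `timeout` seconds.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # make sure the body arrives too
            return "ok"
    except urllib.error.HTTPError as exc:
        # 4xx still proves the app server answered; 5xx does not.
        return "error" if exc.code >= 500 else "ok"
    except (socket.timeout, TimeoutError):
        # Timed out while waiting for the response.
        return "hung"
    except urllib.error.URLError as exc:
        # A timeout during connect is wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "hung"
        return "error"
    except (OSError, http.client.HTTPException):
        return "error"
```

A cron job or watchdog loop could call this every few seconds against a trivial servlet on each server and trigger a restart (or pull the server out of rotation) whenever the result is not "ok".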

We've had good success load-balancing some outbound XML-RPC requests using (first) proxylb and then pythondirector.  But when I put either of these software load balancers between Apache and WebKit, I get a 500 Server Error response and "cannot scan servlet headers" in the Apache error log.  The mod_webkit.c code shows this error as coming AFTER the request has gone to the WebKit port:

    . . .
    /* Now we get the response from the AppServer */

    //  log_message("scanning for headers",r);
    //pull out headers
    if ((ret=ap_scan_script_header_err_buff(r, buffsocket, NULL))) {
        if( ret>=500 || ret < 0) {
            log_message("cannot scan servlet headers ", r);
            return 2;
        }
        r->status_line = NULL;
    . . .
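
One thing that may be worth ruling out (a guess, not something I have confirmed) is whether the intermediary passes a half-closed connection through. If the adapter signals end-of-request by shutting down its write side and the proxy treats that as a full close, the response would never come back and the header scan would fail. Below is a minimal diagnostic sketch in Python 3 of a single-connection relay that forwards the half-close and counts bytes in each direction. The names `pump` and `relay_once` are hypothetical and are not part of proxylb, pythondirector, or Webware:

```python
import socket
import threading


def pump(src, dst, counts, key):
    """Copy bytes from src to dst until EOF, then pass the EOF along."""
    while True:
        data = src.recv(4096)
        if not data:
            break
        counts[key] += len(data)
        dst.sendall(data)
    try:
        # Forward only the end-of-stream; keep dst readable.
        dst.shutdown(socket.SHUT_WR)
    except OSError:
        pass  # peer already gone


def relay_once(listener, backend_addr):
    """Accept one client, relay it to backend_addr, return byte counts."""
    client, _ = listener.accept()
    backend = socket.create_connection(backend_addr)
    counts = {"to_backend": 0, "to_client": 0}
    upstream = threading.Thread(
        target=pump, args=(client, backend, counts, "to_backend"))
    upstream.start()
    pump(backend, client, counts, "to_client")
    upstream.join()
    client.close()
    backend.close()
    return counts
```

Pointing Apache at a listener served by `relay_once`, with the real WebKit port as `backend_addr`, would show whether any response bytes flow back at all. A `to_client` count of zero would suggest the problem is on the response path rather than in the request.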

Any ideas?  I'd be grateful for either (a) suggestions on where to start troubleshooting or (b) pointers to other solutions that have worked for failover.  We're already checking for one style of hung WebKit process and issuing a restart, but that hasn't handled every hang mode we've encountered.

Thanks for any and all ideas; I'll summarize them to the list if they don't come in via the list.

David Hancock | dhancock@arinc.com | 410-266-4384