From: Changju G. <CG...@no...> - 2006-08-28 16:52:49
Hi, I have been testing EVMS on a 16-node cluster for some time. We found several problems in the EVMS daemon code that can force EVMS to time out in 10 minutes.

Problem #1: The daemon thread may reply to the wrong message if its worker has been killed.

Scenario: After node 1 sent close-engine requests to nodes 2 and 3, node 2 sent back an acknowledgment and proceeded with its own open-engine requests to nodes 1 and 3. The whole time, node 3 was still processing node 1's close-engine request, and it sent a status message back to node 1, which node 1 acknowledged promptly.

Let's take a close look at node 2. In the following code, when receive_response_for_command() returns an error (i.e., the pipe is broken because the worker was killed by node 2's open-engine request), it uses response_msg to reply to node 1. However, response_msg is a reply to a status message, so node 1 never receives the reply to its original close-engine request. The fix is to store a copy of the original message and use it to reply in case of error.

    static void process_api(const ece_msg_t * msg)
    {
        ece_msg_t * response_msg = get_msg(msg);
        int rc;
        u_int32_t net_rc;

        LOG_PROC_ENTRY();

        send_msg_to_worker(msg);

        /* Now it's OK to release the message. */
        sem_post(&msg_sem);

        rc = receive_response_for_command(response_msg);

        if (rc == 0) {
            SEND_MSG(response_msg);

            /* engine_free() handles NULL pointers. */
            engine_free(response_msg->msg);
        } else {
            evms_host_to_net(&net_rc, int_f, rc);
            response_msg->cmd |= COMMAND_RESPONSE;
            response_msg->size = sizeof(net_rc);
            response_msg->msg = &net_rc;
            SEND_MSG(response_msg);
        }

        free_msg(response_msg);

        LOG_PROC_EXIT_VOID();
    }
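As a sketch of that fix (not the literal patch; it assumes ece_msg_t can safely be copied by value and that the reply routing fields live in the struct), process_api() would take a private copy of the incoming message before handing it to the worker, and build the error reply from that copy:

    static void process_api(const ece_msg_t * msg)
    {
        ece_msg_t * response_msg = get_msg(msg);
        ece_msg_t orig_msg = *msg;    /* copied while msg is still valid */
        int rc;
        u_int32_t net_rc;

        LOG_PROC_ENTRY();

        send_msg_to_worker(msg);

        /* Now it's OK to release the message. */
        sem_post(&msg_sem);

        rc = receive_response_for_command(response_msg);

        if (rc == 0) {
            SEND_MSG(response_msg);

            /* engine_free() handles NULL pointers. */
            engine_free(response_msg->msg);
        } else {
            /*
             * Reply to the command we were actually given, not to
             * whatever response_msg happens to hold after the pipe
             * broke.  The stale payload pointer in orig_msg is
             * overwritten below and never dereferenced.
             */
            evms_host_to_net(&net_rc, int_f, rc);
            orig_msg.cmd |= COMMAND_RESPONSE;
            orig_msg.size = sizeof(net_rc);
            orig_msg.msg = &net_rc;
            SEND_MSG(&orig_msg);
        }

        free_msg(response_msg);

        LOG_PROC_EXIT_VOID();
    }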
Problem #2: The daemon thread may try to allocate a huge buffer.

Same scenario as Problem #1. The following code treats the first 4 bytes coming out of a (broken) pipe as the length of the incoming message from the worker. When a process is killed, what comes out of its broken pipe is not defined; sometimes it turns out to be a huge number. The fix is to set an upper limit on the size of messages from a worker. I set the limit to 4096 in the patch, but you can set it to whatever better suits your systems. (A sketch of the check appears at the end of this message.)

    static int receive_from_worker(ece_msg_t * response_msg)
    {
        msg_hdr_t output_msg_hdr;
        int bytes_read;

        LOG_PROC_ENTRY();

        pthread_mutex_lock(&worker->output_pipe_mutex);

        LOG_DEBUG("Request to read %zd bytes from fd %d.\n",
                  sizeof(output_msg_hdr), worker->output_pipe[0]);
        bytes_read = read(worker->output_pipe[0],
                          &output_msg_hdr, sizeof(output_msg_hdr));

        if (bytes_read == -1) {
            LOG_SERIOUS("Read of message failed with errno %d: %s.\n",
                        errno, strerror(errno));
            LOG_PROC_EXIT_INT(errno);
            return errno;
        } else {
            LOG_DEBUG("%d of %zd bytes read.\n",
                      bytes_read, sizeof(output_msg_hdr));
        }

        response_msg->cmd = output_msg_hdr.cmd;
        response_msg->size = output_msg_hdr.size;

        if (output_msg_hdr.size > 0) {
            LOG_DEBUG("Allocate %u bytes for response message.\n",
                      output_msg_hdr.size);
            response_msg->msg = engine_alloc(output_msg_hdr.size);
            if (response_msg->msg == NULL) {
                LOG_DEBUG("Failed to allocate memory for a receive buffer.\n");
                response_msg->size = 0;
                LOG_PROC_EXIT_INT(ENOMEM);
                return ENOMEM;
            }
            /* ...rest of function snipped... */

Problem #3: The previous worker is not killed.

Same scenario as Problem #1, but more specifically, the open-engine request from node 3 comes in while node 2 is shutting itself down in the following code. The first thing the function does is mark itself as not running. Because of that, node 3's open-engine request can create a new worker to take the old one's place, but it won't kill the previous worker (which would cause the previous thread to reply with an error to node 1's close-engine request). The fix is to move "worker_running = FALSE;" to the end of the function. (A sketch appears at the end of this message.)

    static void process_close_engine(ece_msg_t * msg)
    {
        worker_context_t * wrkr = worker;

        LOG_PROC_ENTRY();

        /*
         * Mark the worker thread stopped before process_api() releases
         * the daemon_router to handle more commands.  We don't want any
         * subsequent incoming commands to think that the worker is
         * running, since it will eventually be terminated.
         */
        worker_running = FALSE;

        /* Send the evms_close_engine() API to the worker. */
        process_api(msg);

        shutdown_worker(wrkr);

        LOG_PROC_EXIT_VOID();
    }

Problem #4: The daemon doesn't reply to multiple requests sent to its worker.

Scenario: Node 1 sent an open-engine request to nodes 2 and 3; at the same time, node 3 sent the same request to nodes 1 and 2. Node 2 replied to node 1 that the engine opened correctly, then created a new worker for node 3 to replace the one created for node 1. Apparently, both node 1 and node 3 failed to remotely open the engine on all the other nodes, so both sent a close request to node 2, and node 2 passed both requests to the worker to process. Meanwhile, after processing both node 1's and node 3's open/close requests, node 4 sent an open-engine request to nodes 1, 2, and 3. Node 2 created a new worker, which caused the daemon thread to reply with an error to either node 1 or node 3, but not both of them. The fix: add a mutex to guarantee that only one open/close thread can be waiting for a reply from a worker. (See the sketch below.)
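For Problem #2, here is a minimal sketch of the bounds check as a standalone helper; MAX_WORKER_MSG_SIZE and the choice of EPROTO are my own names for illustration, not necessarily what the patch uses:

    #include <errno.h>
    #include <sys/types.h>

    /* 4096 matches the limit in the patch; tune it to suit your systems. */
    #define MAX_WORKER_MSG_SIZE 4096u

    /*
     * Validate a size field read from the worker pipe before trusting it.
     * A broken pipe can deliver garbage in the header, so never hand an
     * unchecked size to the allocator.  Returns 0 if the size is
     * acceptable, EPROTO otherwise.
     */
    static int check_worker_msg_size(u_int32_t size)
    {
        if (size > MAX_WORKER_MSG_SIZE) {
            return EPROTO;
        }
        return 0;
    }

receive_from_worker() would call this on output_msg_hdr.size right before the engine_alloc() call, returning the error through the same path it already uses for ENOMEM.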
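For Problem #3, the reordered function; this is just the quoted code with worker_running cleared last, per the fix described above:

    static void process_close_engine(ece_msg_t * msg)
    {
        worker_context_t * wrkr = worker;

        LOG_PROC_ENTRY();

        /* Send the evms_close_engine() API to the worker. */
        process_api(msg);

        shutdown_worker(wrkr);

        /*
         * Only now mark the worker stopped.  If a new open-engine
         * request arrives while the close is still in flight, it will
         * see worker_running == TRUE and kill this worker, which makes
         * the waiting thread reply with an error to the original
         * close-engine request instead of leaving it unanswered.
         */
        worker_running = FALSE;

        LOG_PROC_EXIT_VOID();
    }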
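For Problem #4, a sketch of the serializing mutex; worker_wait_mutex and the wrapper function are illustrative names I made up, not the actual patch:

    #include <pthread.h>

    /* Allow only one open/close thread to wait on the worker's reply
     * at a time. */
    static pthread_mutex_t worker_wait_mutex = PTHREAD_MUTEX_INITIALIZER;

    static int receive_response_serialized(ece_msg_t * response_msg)
    {
        int rc;

        /*
         * If two daemon threads (say, handling node 1's and node 3's
         * close requests) wait concurrently and the worker is replaced,
         * only one of them gets the error reply.  Holding the mutex
         * across the wait ensures each waiting request is answered in
         * turn.
         */
        pthread_mutex_lock(&worker_wait_mutex);
        rc = receive_response_for_command(response_msg);
        pthread_mutex_unlock(&worker_wait_mutex);

        return rc;
    }

process_api() would then call receive_response_serialized() instead of calling receive_response_for_command() directly.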