From: Changju G. <CG...@no...> - 2006-08-28 16:52:49
Hi, I have been testing EVMS on a 16-node cluster for some time. We found several problems in the EVMS daemon code that can force EVMS to time out in 10 minutes.

Problem #1: The daemon thread may reply to the wrong message if its worker has been killed.

Scenario: After node 1 sent close-engine requests to nodes 2 and 3, node 2 sent back an acknowledgment and proceeded with its own open-engine requests to nodes 1 and 3. The whole time, node 3 was still processing node 1's close-engine request, and it sent a status message back to node 1, which node 1 acknowledged promptly.

Let's take a close look at node 2. In the following code, when receive_response_for_command() returns an error (i.e., the pipe is broken because the worker was killed by node 2's open-engine request), it uses response_msg to reply to node 1. However, response_msg is a reply to a status message, so node 1 never receives the reply to its original close-engine request. The fix is to store a copy of the original message and use it to reply in case of error.

    static void process_api(const ece_msg_t * msg)
    {
        ece_msg_t * response_msg = get_msg(msg);
        int rc;
        u_int32_t net_rc;

        LOG_PROC_ENTRY();

        send_msg_to_worker(msg);

        /* Now it's OK to release the message. */
        sem_post(&msg_sem);

        rc = receive_response_for_command(response_msg);

        if (rc == 0) {
            SEND_MSG(response_msg);

            /* engine_free() handles NULL pointers. */
            engine_free(response_msg->msg);
        } else {
            evms_host_to_net(&net_rc, int_f, rc);
            response_msg->cmd |= COMMAND_RESPONSE;
            response_msg->size = sizeof(net_rc);
            response_msg->msg = &net_rc;
            SEND_MSG(response_msg);
        }

        free_msg(response_msg);

        LOG_PROC_EXIT_VOID();
    }
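As a sketch of that fix (not the literal patch; it assumes ece_msg_t can safely be copied by value and that the reply routing fields live in the struct), process_api() would take a private copy of the incoming message before handing it to the worker, and build the error reply from that copy:

    static void process_api(const ece_msg_t * msg)
    {
        ece_msg_t * response_msg = get_msg(msg);
        ece_msg_t orig_msg = *msg;    /* copied while msg is still valid */
        int rc;
        u_int32_t net_rc;

        LOG_PROC_ENTRY();

        send_msg_to_worker(msg);

        /* Now it's OK to release the message. */
        sem_post(&msg_sem);

        rc = receive_response_for_command(response_msg);

        if (rc == 0) {
            SEND_MSG(response_msg);

            /* engine_free() handles NULL pointers. */
            engine_free(response_msg->msg);
        } else {
            /*
             * Reply to the command we were actually given, not to
             * whatever response_msg happens to hold after the pipe
             * broke.  The stale payload pointer in orig_msg is
             * overwritten below and never dereferenced.
             */
            evms_host_to_net(&net_rc, int_f, rc);
            orig_msg.cmd |= COMMAND_RESPONSE;
            orig_msg.size = sizeof(net_rc);
            orig_msg.msg = &net_rc;
            SEND_MSG(&orig_msg);
        }

        free_msg(response_msg);

        LOG_PROC_EXIT_VOID();
    }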
Problem #2: The daemon thread may try to allocate a huge buffer.

Same scenario as Problem #1. The following code treats the first 4 bytes coming out of a (broken) pipe as the length of the incoming message from the worker. When a process is killed, what comes out of its broken pipe is not defined; sometimes it turns out to be a huge number. The fix is to set an upper limit on the size of messages from a worker. I set the limit to 4096 in the patch, but you can set it to whatever better suits your systems. (A sketch of the check appears at the end of this message.)

    static int receive_from_worker(ece_msg_t * response_msg)
    {
        msg_hdr_t output_msg_hdr;
        int bytes_read;

        LOG_PROC_ENTRY();

        pthread_mutex_lock(&worker->output_pipe_mutex);

        LOG_DEBUG("Request to read %zd bytes from fd %d.\n",
                  sizeof(output_msg_hdr), worker->output_pipe[0]);
        bytes_read = read(worker->output_pipe[0],
                          &output_msg_hdr, sizeof(output_msg_hdr));

        if (bytes_read == -1) {
            LOG_SERIOUS("Read of message failed with errno %d: %s.\n",
                        errno, strerror(errno));
            LOG_PROC_EXIT_INT(errno);
            return errno;
        } else {
            LOG_DEBUG("%d of %zd bytes read.\n",
                      bytes_read, sizeof(output_msg_hdr));
        }

        response_msg->cmd = output_msg_hdr.cmd;
        response_msg->size = output_msg_hdr.size;

        if (output_msg_hdr.size > 0) {
            LOG_DEBUG("Allocate %u bytes for response message.\n",
                      output_msg_hdr.size);
            response_msg->msg = engine_alloc(output_msg_hdr.size);
            if (response_msg->msg == NULL) {
                LOG_DEBUG("Failed to allocate memory for a receive buffer.\n");
                response_msg->size = 0;
                LOG_PROC_EXIT_INT(ENOMEM);
                return ENOMEM;
            }
            /* ...rest of function snipped... */

Problem #3: The previous worker is not killed.

Same scenario as Problem #1, but more specifically, the open-engine request from node 3 comes in while node 2 is shutting itself down in the following code. The first thing the function does is mark itself as not running. Because of that, node 3's open-engine request can create a new worker to take the old one's place, but it won't kill the previous worker (which would cause the previous thread to reply with an error to node 1's close-engine request). The fix is to move "worker_running = FALSE;" to the end of the function. (A sketch appears at the end of this message.)

    static void process_close_engine(ece_msg_t * msg)
    {
        worker_context_t * wrkr = worker;

        LOG_PROC_ENTRY();

        /*
         * Mark the worker thread stopped before process_api() releases
         * the daemon_router to handle more commands.  We don't want any
         * subsequent incoming commands to think that the worker is
         * running, since it will eventually be terminated.
         */
        worker_running = FALSE;

        /* Send the evms_close_engine() API to the worker. */
        process_api(msg);

        shutdown_worker(wrkr);

        LOG_PROC_EXIT_VOID();
    }

Problem #4: The daemon doesn't reply to multiple requests sent to its worker.

Scenario: Node 1 sent an open-engine request to nodes 2 and 3; at the same time, node 3 sent the same request to nodes 1 and 2. Node 2 replied to node 1 that the engine opened correctly, then created a new worker for node 3 to replace the one created for node 1. Apparently, both node 1 and node 3 failed to remotely open the engine on all the other nodes, so both sent a close request to node 2, and node 2 passed both requests to the worker to process. Meanwhile, after processing both node 1's and node 3's open/close requests, node 4 sent an open-engine request to nodes 1, 2, and 3. Node 2 created a new worker, which caused the daemon thread to reply with an error to either node 1 or node 3, but not both of them. The fix: add a mutex to guarantee that only one open/close thread can be waiting for a reply from a worker. (See the sketch below.)
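For Problem #2, here is a minimal sketch of the bounds check as a standalone helper; MAX_WORKER_MSG_SIZE and the choice of EPROTO are my own names for illustration, not necessarily what the patch uses:

    #include <errno.h>
    #include <sys/types.h>

    /* 4096 matches the limit in the patch; tune it to suit your systems. */
    #define MAX_WORKER_MSG_SIZE 4096u

    /*
     * Validate a size field read from the worker pipe before trusting it.
     * A broken pipe can deliver garbage in the header, so never hand an
     * unchecked size to the allocator.  Returns 0 if the size is
     * acceptable, EPROTO otherwise.
     */
    static int check_worker_msg_size(u_int32_t size)
    {
        if (size > MAX_WORKER_MSG_SIZE) {
            return EPROTO;
        }
        return 0;
    }

receive_from_worker() would call this on output_msg_hdr.size right before the engine_alloc() call, returning the error through the same path it already uses for ENOMEM.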
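For Problem #3, the reordered function; this is just the quoted code with worker_running cleared last, per the fix described above:

    static void process_close_engine(ece_msg_t * msg)
    {
        worker_context_t * wrkr = worker;

        LOG_PROC_ENTRY();

        /* Send the evms_close_engine() API to the worker. */
        process_api(msg);

        shutdown_worker(wrkr);

        /*
         * Only now mark the worker stopped.  If a new open-engine
         * request arrives while the close is still in flight, it will
         * see worker_running == TRUE and kill this worker, which makes
         * the waiting thread reply with an error to the original
         * close-engine request instead of leaving it unanswered.
         */
        worker_running = FALSE;

        LOG_PROC_EXIT_VOID();
    }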
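For Problem #4, a sketch of the serializing mutex; worker_wait_mutex and the wrapper function are illustrative names I made up, not the actual patch:

    #include <pthread.h>

    /* Allow only one open/close thread to wait on the worker's reply
     * at a time. */
    static pthread_mutex_t worker_wait_mutex = PTHREAD_MUTEX_INITIALIZER;

    static int receive_response_serialized(ece_msg_t * response_msg)
    {
        int rc;

        /*
         * If two daemon threads (say, handling node 1's and node 3's
         * close requests) wait concurrently and the worker is replaced,
         * only one of them gets the error reply.  Holding the mutex
         * across the wait ensures each waiting request is answered in
         * turn.
         */
        pthread_mutex_lock(&worker_wait_mutex);
        rc = receive_response_for_command(response_msg);
        pthread_mutex_unlock(&worker_wait_mutex);

        return rc;
    }

process_api() would then call receive_response_serialized() instead of calling receive_response_for_command() directly.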