Re: [Dpcl-develop] re: daemon termination
From: Dave W. <dwo...@us...> - 2004-09-29 16:27:44
Steve,

The DAEMON_TERMINATE message is not responsible for terminating the DPCL daemon. It may have something to do with why dpclSD doesn't terminate.

The basic flow here is that for some reason the DPCL client has terminated, implicitly closing its socket connection. The dpcld daemon recognizes the socket close and treats it as EOF on the socket, driving an invocation of default_cb. Part of the processing in default_cb is to send the DAEMON_TERMINATE message to the dpclSD process. The purpose of the message is to inform dpclSD that one of the DPCL client connections has been closed. The dpclSD process keeps track of all DPCL client connections on the node, and this message decrements the connection count. When the connection count is zero, dpclSD can terminate. The dpclSD process never terminates a DPCL daemon itself, which eliminates the possibility of a dpcld process being incorrectly killed because dpclSD got its connection count wrong.

There is additional processing after this in dpcld which, among other things, decrements dpcld's count of connected clients. When that count becomes zero, that dpcld process can safely terminate. I suspect something is going wrong in the remaining processing after the message is sent.

Your comment about the shared memory message polling process being "very active" is interesting, since I have been looking at a problem where probe expressions send many messages using Ais_send at a high rate, with one effect being a hung dpcld. Can you clarify what you mean? The polling you see is driven by a timer which triggers every 0.4 seconds and is never shut down until the daemon terminates. If there are messages waiting to be sent to the client, the polling loop attempts to send them to the client.

In the case I am looking at, I see a message logged about a 'SIGPIPE' signal being raised. Do you happen to see this as well? If you do, it might be a clue as to what's going on (strictly a guess).
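The two-level bookkeeping described above can be sketched as a small model. This is only an illustration under stated assumptions: the class and method names (`SuperDaemon`, `CommDaemon`, `accept_client`, `on_daemon_terminate`) are hypothetical and are not taken from the DPCL source; only `default_cb` and the DAEMON_TERMINATE message are named in the thread.

```python
# Hypothetical model of the connection-count bookkeeping described above.
# None of these class or method names come from the DPCL source.


class SuperDaemon:
    """Stands in for dpclSD: counts DPCL client connections on the node."""

    def __init__(self):
        self.connections = 0
        self.exited = False

    def client_connected(self):
        self.connections += 1

    def on_daemon_terminate(self):
        # DAEMON_TERMINATE only reports that one client connection closed.
        # The super daemon never kills a dpcld in response to it.
        self.connections -= 1
        if self.connections == 0:
            self.exited = True


class CommDaemon:
    """Stands in for one dpcld process."""

    def __init__(self, super_daemon):
        self.super_daemon = super_daemon
        self.clients = 0
        self.exited = False

    def accept_client(self):
        self.clients += 1
        self.super_daemon.client_connected()

    def default_cb(self):
        # Driven by EOF on a client socket.
        self.super_daemon.on_daemon_terminate()  # step 1: notify dpclSD
        self.clients -= 1                        # step 2: local bookkeeping
        if self.clients == 0:
            self.exited = True                   # dpcld exits on its own


if __name__ == "__main__":
    sd = SuperDaemon()
    daemon = CommDaemon(sd)
    daemon.accept_client()
    daemon.accept_client()
    daemon.default_cb()
    print(daemon.exited, sd.exited)  # one client still connected
    daemon.default_cb()
    print(daemon.exited, sd.exited)  # last client gone
```

In this model each process decides for itself when to exit, so a wrong count in one place can leave a process running but cannot kill the other one.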
If something is causing the client connection to be closed from the dpcld side because of an error detected in dpcld, then the dpcld bookkeeping associated with closing the socket results in the code which drives default_cb() never getting invoked. This means the connection count may not be properly decremented, with the result being a hung dpcld process.

If you still can't get anywhere with this, you can send the dpcld log to me and I can spend a few minutes looking at it. Send it directly to me instead of the mailing list to avoid problems with the list's attachment size limits. Also, can you provide me with some explanation of what the testcase is doing, including the probe module? If the testcase is small, send that too.

Dave

Steve Collins <sl...@sg...>
Sent by: dpc...@ww...
09/28/2004 02:42 PM
To: dpc...@ww...
cc: sl...@sg...
Subject: [Dpcl-develop] re: daemon termination

Thanks to DaveW for the tip about stdout. I can now see that SSM_dsend is being entered when trying to send the DAEMON_TERMINATE message to the SD. Thanks also for JamesW's confirmation of my initial analysis.

The reason I suggested that the Hybrid <might not> be the culprit is that whenever I get dpclSD/dpcld daemons 'hanging around', I notice that the SIGALRM mechanism for 'polling' of messages in shared memory is still active. Very active.

I have two simple DPCL tests, one of which (the probe module test) always presents 'hanging daemons' and the other of which (the one_shot probe) only once in a great while leaves 'hanging daemons'. I'm all but certain there is a timing issue here.

What I am seeing in the .stdout file for the case where the daemons hang is as follows: SSM_dsend is entered and enters the 'nonblocking' code, which winds up going to AisMsgbuff::WriteDMsgv, where it is noted that the message queue is NOT full and the message (the DAEMON_TERMINATE message!) is queued for sending.
The 'poll_for_message' code of the comm daemon is then entered REPEATEDLY until it just quits, somehow.

What I am seeing in the .stdout file for the case where the daemons do NOT hang (the one_shot probe test) is as follows: SSM_dsend is entered but satisfies the 'isBlockRWSocket' criteria, and the DAEMON_TERMINATE message is delivered correctly to the SD, to wit:

    daissd: SdDaemonTerminateCB entered.
    daissd: inside removeDaisClient
    daissd: sending terminate ack
    daissd: last client, can exit
    daissd: Client List is Empty
    daissd: Super daemon exiting-before pthread_join

Then the TERMINATE_ACK reaches the comm daemon and everything shuts down just fine.

Right now I suspect the SIGALRM 'polling' mechanism is somehow getting in the way of the DAEMON_TERMINATE message being delivered to the SD. Something is blocking or not blocking. A timing issue is clear, because the case that usually works (the one_shot probe) does occasionally fail as well, with the same symptoms.

Thanks again James and DaveW for all your help!!

SteveC - SGI Tools

_______________________________________________
Dpcl-develop mailing list
Dpc...@ww...
http://www-124.ibm.com/developerworks/oss/mailman/listinfo/dpcl-develop
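The difference between the two traces Steve describes (message queued on the non-blocking path versus written directly on the blocking path) can be modelled in a few lines. This is a sketch of the reported symptom only, not of the DPCL implementation: `OutQueue`, `run`, and the `writable` flag are hypothetical, and the thread does not establish why the queued message fails to go out.

```python
from collections import deque

# Hypothetical model of the two send paths described in the thread.
# Names here are illustrative and do not come from the DPCL source.


class OutQueue:
    """A send path that either writes at once or queues for a later poll."""

    def __init__(self, blocking):
        self.blocking = blocking
        self.pending = deque()
        self.delivered = []

    def send(self, msg):
        if self.blocking:
            # Blocking path: the message is written before send() returns.
            self.delivered.append(msg)
        else:
            # Non-blocking path: the message is only queued here.
            self.pending.append(msg)

    def poll(self, writable):
        # One timer tick: flush queued messages if the peer can accept them.
        # "writable" is an assumed stand-in for whatever gates delivery.
        while writable and self.pending:
            self.delivered.append(self.pending.popleft())


def run(blocking, ticks_writable):
    """Send DAEMON_TERMINATE, run the given poll ticks, report the outcome."""
    queue = OutQueue(blocking)
    queue.send("DAEMON_TERMINATE")
    for writable in ticks_writable:
        queue.poll(writable)
    return queue.delivered, list(queue.pending)


if __name__ == "__main__":
    print(run(True, []))                     # blocking path
    print(run(False, [False, False, False])) # polls never manage to flush
    print(run(False, [False, True]))         # a later poll flushes the queue
```

In the second case the message is still sitting in the queue when polling stops, which matches the symptom of dpclSD never seeing DAEMON_TERMINATE; whether that is what actually happens in dpcld is exactly what the logs would need to confirm.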