From: Stephens, A. <all...@wi...> - 2009-04-29 16:09:01
Felix wrote:

> 1) Does the topology service do anything special when it
> detects link congestion? Or do the messages just get dropped
> as appears to be happening in this case?

If the topology service is unable to send a name table update because of link congestion, the update is lost. TIPC *will* issue a warning to the system log to indicate that this has happened, so you can check for the string "distribution failure" to see if this is actually occurring.

> 2) If 100's of processes go down at almost the same time (i.e.
> during a node reset), the topology service on another node
> will subsequently flood the link with withdrawn messages won't
> it? I just wanted to determine what the expected behavior is.

No. In TIPC 1.5.12, name table updates are only issued by the node which created the named socket. This means that if a node fails ungracefully (i.e. it just dies), no update messages are sent by any node in the network; instead, each of the other nodes detects that it has lost contact with the failed node (i.e. it no longer has any working link to that node) and automatically forces its topology service to purge any name table entries published by the failed node.

If a node fails gracefully (i.e. it shuts down all of its processes and then dies) then it is possible that some of the "withdraw" messages it tries to send may be lost due to link congestion; however, once the node dies the automatic cleanup done by the other nodes will have the same effect as if the lost messages had arrived.

> 3) Is there any way that an application can determine that a
> link congestion situation is occurring? We can see from
> using tipc-config -ls that link congestion has occurred. I
> assume that the number we're seeing refers to the number of
> occurrences of link congestion.

An application using TIPC's socket API can tell if link congestion is occurring by doing a non-blocking send (i.e. by using the MSG_DONTWAIT flag on its send operations); if the send fails and errno is set to EWOULDBLOCK, then this indicates that the message could not be sent due to link congestion.

> It is very important that our application knows what
> processes are up/down so we're very sensitive to the
> situation where we lose these topology service messages. Any
> advice or pointers would be very helpful for this situation.

If you are having issues with applications not receiving topology service events properly, I suggest that you try to identify if the name table entries on the node in question are correct. If the name table entries are incorrect then the problem lies in the name table code rather than the topology service; in this case, issues like link congestion need to be considered. On the other hand, if the name table entries are correct then the problem lies in the topology service; in this case, there might be issues with the event messages being sent to the applications by the topology service.

One scenario that could explain the problems you are encountering is that name table updates are being lost when a failed node restarts (rather than when it fails). Whenever node A joins the network, all of the other nodes will attempt to dump their name tables to it "en masse". I can imagine that some updates from node B may not reach node A due to link congestion, causing node A's name table to be incomplete and the applications on node A to not receive "publish" events that they should have received.

Regards,
Al

> -----Original Message-----
> From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> Sent: Tuesday, April 28, 2009 6:41 PM
> To: tip...@li...
> Cc: Natarajan Ramesh-A17988
> Subject: [tipc-discussion] Link congestion with the topology service
>
> All,
> I'm running TIPC 1.5.12 on Linux 2.6.21.
>
> We appear to be seeing link congestion occurring with our
> topology service notifications (publications and withdrawals).
> It appears that when we have a situation where one of our
> nodes goes down under a heavy load, the other nodes are
> getting inundated with withdrawn events and some of these
> events are not making it to their intended destination
> because of link congestion. There are hundreds of processes
> running on each of 10 nodes plus one cluster manager.
> I believe the topology service opens up its socket with
> critical importance and so do our applications. So there is
> no problem in enqueuing the messages on the socket receive
> queue, but we don't let the transmit queue grow indefinitely
> though. I can increase the window size from its default of
> 50 to the maximum 150 which will let us go from 96
> fragments/messages to 300 fragments/messages for critical
> messages. I'm not sure how much this will help.
>
> So I've got a few questions:

... moved above ...

>
> Thanks,
> Felix
> _______________________________________________
> tipc-discussion mailing list
> tip...@li...
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
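
As a rough illustration of the non-blocking send check described in the reply to question 3, here is a minimal Python sketch. It is not from the original thread: the helper names try_send and count_sends_until_blocked are hypothetical, and a local socketpair() stands in for a TIPC socket so that the example runs on a machine without TIPC configured. A real application would presumably create its socket with the AF_TIPC address family instead; the MSG_DONTWAIT / EWOULDBLOCK check itself would look the same.

```python
import socket


def try_send(sock, payload):
    """Attempt one non-blocking send.

    Returns True if the kernel accepted the data, False if the send
    would have blocked (errno EWOULDBLOCK/EAGAIN), which the reply
    above describes as the sign of congestion on a TIPC socket.
    """
    try:
        sock.send(payload, socket.MSG_DONTWAIT)
        return True
    except BlockingIOError:
        # Python raises BlockingIOError when errno is EWOULDBLOCK or EAGAIN.
        return False


def count_sends_until_blocked(sock, payload, limit=100000):
    """Send repeatedly until a send would block; return the number accepted."""
    sent = 0
    while sent < limit and try_send(sock, payload):
        sent += 1
    return sent


if __name__ == "__main__":
    # Assumption: a local socket pair is used as a stand-in for a TIPC socket.
    # Nothing reads from the peer, so the send buffer eventually fills up.
    sender, receiver = socket.socketpair()
    accepted = count_sends_until_blocked(sender, b"x" * 4096)
    print("sends accepted before EWOULDBLOCK:", accepted)
    sender.close()
    receiver.close()
```

An application that sees try_send return False could log the event or back off and retry, rather than blocking inside the send call.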