From: Nayman Felix-Q. <Fel...@mo...> - 2009-04-28 22:41:18
All,

I'm running TIPC 1.5.12 on Linux 2.6.21. We appear to be seeing link congestion with our topology service notifications (publications and withdrawals). When one of our nodes goes down under heavy load, the other nodes are inundated with withdrawn events, and some of these events do not reach their intended destination because of link congestion. There are hundreds of processes running on each of 10 nodes, plus one cluster manager.

I believe the topology service opens its socket with critical importance, and so do our applications. So there is no problem enqueuing the messages on the socket receive queue, but we don't let the transmit queue grow indefinitely. I can increase the window size from its default of 50 to the maximum of 150, which will let us go from 96 fragments/messages to 300 fragments/messages for critical messages. I'm not sure how much this will help.

So I've got a few questions:

1) Does the topology service do anything special when it detects link congestion? Or do the messages just get dropped, as appears to be happening in this case?

2) If hundreds of processes go down at almost the same time (i.e. during a node reset), the topology service on another node will subsequently flood the link with withdrawn messages, won't it? I just want to determine what the expected behavior is.

3) Is there any way that an application can determine that link congestion is occurring? We can see from tipc-config -ls that link congestion has occurred. I assume that the number we're seeing refers to the number of occurrences of link congestion.

It is very important that our application knows which processes are up/down, so we're very sensitive to losing these topology service messages.

Any advice or pointers would be very helpful.

Thanks,
Felix
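
To put rough numbers on the concern above, here is a toy model in Python of a burst of withdrawn events hitting a bounded transmit queue. It is only a sketch and does not reproduce TIPC's actual link behavior. The burst size of 500 is a hypothetical stand-in for "hundreds of processes", the function name simulate_burst is made up for illustration, and the model assumes one message per withdrawal, no draining of the queue during the burst, and that anything over the limit is lost (which is exactly what question 1 asks about). The limits 96 and 300 are the figures quoted in the post.

```python
def simulate_burst(num_events, queue_limit):
    """Toy model of a burst arriving at a bounded transmit queue.

    Assumes all events arrive at once and nothing is sent while the
    burst is arriving. Returns (queued, rejected).
    """
    queued = min(num_events, queue_limit)
    rejected = num_events - queued
    return queued, rejected


if __name__ == "__main__":
    burst = 500  # hypothetical: one withdrawn event per process on the failed node
    for limit in (96, 300):  # queue limits quoted for window sizes 50 and 150
        queued, rejected = simulate_burst(burst, limit)
        print(f"limit={limit}: queued={queued}, rejected={rejected}")
```

Under these assumptions, raising the window reduces the number of events that do not fit but does not remove the problem whenever the burst is larger than the queue limit.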