From: Natarajan Ramesh-A. <ra...@mo...> - 2009-04-29 16:39:58
Hi Allan,

We are seeing this message in syslog. What does this mean?

Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_4 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_9 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_1 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_10 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_12 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_4 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_11 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_1 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_12 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_9 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_10 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_11 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007

thanks
Ramesh

-----Original Message-----
From: Stephens, Allan [mailto:all...@wi...]
Sent: Wednesday, April 29, 2009 11:09 AM
To: Nayman Felix-QA5535; tip...@li...
Cc: Natarajan Ramesh-A17988
Subject: RE: [tipc-discussion] Link congestion with the topology service

Felix wrote:

> 1) Does the topology service do anything special when it detects link
> congestion? Or do the messages just get dropped, as appears to be
> happening in this case?

If the topology service is unable to send a name table update because of
link congestion, the update is lost. TIPC *will* issue a warning to the
system log when this happens, so you can check for the string
"distribution failure" to see whether it is actually occurring.

> 2) If hundreds of processes go down at almost the same time (i.e.
> during a node reset), the topology service on another node will
> subsequently flood the link with withdrawn messages, won't it?
> I just wanted to determine what the expected behavior is.

No. In TIPC 1.5.12, name table updates are only issued by the node which
created the named socket. This means that if a node fails ungracefully
(i.e. it just dies), no update messages are sent by any node in the
network; instead, each of the other nodes detects that it has lost contact
with the failed node (i.e. it no longer has any working link to that node)
and automatically forces its topology service to purge any name table
entries published by the failed node. If a node fails gracefully (i.e. it
shuts down all of its processes and then dies), some of the "withdraw"
messages it tries to send may be lost due to link congestion; however,
once the node dies, the automatic cleanup done by the other nodes has the
same effect as if the lost messages had arrived.

> 3) Is there any way that an application can determine that a link
> congestion situation is occurring? We can see from using tipc-config
> -ls that link congestion has occurred. I assume that the number we're
> seeing refers to the number of occurrences of link congestion.

An application using TIPC's socket API can tell if link congestion is
occurring by doing a non-blocking send (i.e. using the MSG_DONTWAIT flag
on its send operations); if the send fails with errno set to EWOULDBLOCK,
the message could not be sent due to link congestion.
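For illustration, the check described above looks roughly like this in C.
This is a minimal sketch rather than code from this thread: the service
type 1000 and instance 1 are placeholder values, and it assumes a Linux
system with the <linux/tipc.h> header available.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    /* Try to send a message to a named TIPC port without blocking.
     * Returns 0 on success, -1 if the link (or destination port) is
     * congested, -2 on any other error. */
    static int try_send(int sd, const void *buf, size_t len)
    {
        struct sockaddr_tipc dst;

        memset(&dst, 0, sizeof(dst));
        dst.family = AF_TIPC;
        dst.addrtype = TIPC_ADDR_NAME;
        dst.addr.name.name.type = 1000;   /* placeholder service type */
        dst.addr.name.name.instance = 1;  /* placeholder instance */
        dst.addr.name.domain = 0;         /* look the name up anywhere */

        if (sendto(sd, buf, len, MSG_DONTWAIT,
                   (struct sockaddr *)&dst, sizeof(dst)) >= 0)
            return 0;
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return -1;  /* congestion: message not sent; retry later */
        perror("sendto");
        return -2;
    }

The socket would typically be created with socket(AF_TIPC, SOCK_RDM, 0);
on a -1 return the caller decides whether to retry later or drop the
message.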
> It is very important that our application knows what processes are
> up/down, so we're very sensitive to the situation where we lose these
> topology service messages. Any advice or pointers would be very
> helpful for this situation.

If you are having issues with applications not receiving topology service
events properly, I suggest you first determine whether the name table
entries on the node in question are correct. If the name table entries are
incorrect, the problem lies in the name table code rather than the
topology service; in this case, issues like link congestion need to be
considered. On the other hand, if the name table entries are correct, the
problem lies in the topology service itself; in this case, there might be
issues with the event messages it sends to the applications.

One scenario that could explain the problems you are encountering is name
table updates being lost when a failed node restarts (rather than when it
fails). Whenever node A joins the network, all of the other nodes attempt
to dump their name tables to it "en masse". Some updates from node B might
not reach node A because of link congestion, leaving node A's name table
incomplete and causing applications on node A to miss "publish" events
they should have received.

Regards, Al
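For anyone tracking process up/down state this way: the topology service
events discussed above are obtained by connecting to the node's topology
server and sending a subscription, after which each matching publish or
withdraw arrives as a separate event message. A minimal sketch, using the
subscription structures from <linux/tipc.h>; the service type 1000 is a
placeholder value:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    int main(void)
    {
        struct sockaddr_tipc topsrv;
        struct tipc_subscr sub;
        struct tipc_event evt;
        int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

        if (sd < 0) {
            perror("socket");
            return 1;
        }

        /* Connect to the node's topology server. */
        memset(&topsrv, 0, sizeof(topsrv));
        topsrv.family = AF_TIPC;
        topsrv.addrtype = TIPC_ADDR_NAME;
        topsrv.addr.name.name.type = TIPC_TOP_SRV;
        topsrv.addr.name.name.instance = TIPC_TOP_SRV;
        if (connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0) {
            perror("connect");
            return 1;
        }

        /* Watch every instance of one (placeholder) service type. */
        memset(&sub, 0, sizeof(sub));
        sub.seq.type = 1000;          /* placeholder service type */
        sub.seq.lower = 0;
        sub.seq.upper = ~0U;
        sub.timeout = TIPC_WAIT_FOREVER;
        sub.filter = TIPC_SUB_PORTS;  /* event per port, not per name */
        if (send(sd, &sub, sizeof(sub), 0) != sizeof(sub)) {
            perror("send subscription");
            return 1;
        }

        /* Each publish/withdraw arrives as one tipc_event message. */
        while (recv(sd, &evt, sizeof(evt), 0) == sizeof(evt))
            printf("%s <%u,%u> port <%x:%u>\n",
                   evt.event == TIPC_PUBLISHED ? "published" : "withdrawn",
                   evt.s.seq.type, evt.found_lower,
                   evt.port.node, evt.port.ref);
        return 0;
    }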
> -----Original Message-----
> From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> Sent: Tuesday, April 28, 2009 6:41 PM
> To: tip...@li...
> Cc: Natarajan Ramesh-A17988
> Subject: [tipc-discussion] Link congestion with the topology service
>
> All,
> I'm running TIPC 1.5.12 on linux 2.6.21.
>
> We appear to be seeing link congestion occurring with our topology
> service notifications (publications and withdrawals). It appears that
> when we have a situation where one of our nodes goes down under a
> heavy load, the other nodes are getting inundated with withdrawn
> events, and some of these events are not making it to their intended
> destination because of link congestion. There are hundreds of
> processes running on each of 10 nodes plus one cluster manager.
>
> I believe the topology service opens its socket with critical
> importance, and so do our applications, so there is no problem
> enqueuing the messages on the socket receive queue; we don't let the
> transmit queue grow indefinitely, though. I can increase the window
> size from its default of 50 to the maximum of 150, which will let us
> go from 96 to 300 fragments/messages for critical messages. I'm not
> sure how much this will help.
>
> So I've got a few questions: ... moved above ...
>
> Thanks,
> Felix
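As background for the "critical importance" remark in Felix's message: an
application sets a TIPC socket's message importance with setsockopt(). A
minimal sketch, assuming the constants from <linux/tipc.h>; raising the
importance makes rejection under congestion less likely but does not
guarantee delivery:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    int main(void)
    {
        int sd = socket(AF_TIPC, SOCK_RDM, 0);
        int imp = TIPC_CRITICAL_IMPORTANCE;

        if (sd < 0) {
            perror("socket");
            return 1;
        }
        /* Critical messages are the last to be rejected under
         * congestion, but they can still be lost. */
        if (setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE,
                       &imp, sizeof(imp)) < 0)
            perror("setsockopt TIPC_IMPORTANCE");
        return 0;
    }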