From: Stephens, A. <all...@wi...> - 2009-04-29 17:47:20
Hi Ramesh:

This message indicates that a number of different nodes, when asked to remove a publication by node <1.1.7> from their name tables, could not find that publication. This is the sort of thing one would expect to see if <1.1.7>'s earlier publication update was lost and never reached the other nodes.

Regards,
Al

> -----Original Message-----
> From: Natarajan Ramesh-A17988 [mailto:ra...@mo...]
> Sent: Wednesday, April 29, 2009 12:39 PM
> To: Stephens, Allan; Nayman Felix-QA5535; tip...@li...
> Subject: RE: [tipc-discussion] Link congestion with the topology service
>
> Hi Allan,
>
> We are seeing this message in syslog. What does this mean?
>
> Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_4 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_9 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_1 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_10 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_12 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_4 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_11 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_1 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_12 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_9 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_10 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_11 kernel: TIPC: Unable to remove publication by node 0x1001007
> Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007
>
> thanks
> Ramesh
>
> -----Original Message-----
> From: Stephens, Allan [mailto:all...@wi...]
> Sent: Wednesday, April 29, 2009 11:09 AM
> To: Nayman Felix-QA5535; tip...@li...
> Cc: Natarajan Ramesh-A17988
> Subject: RE: [tipc-discussion] Link congestion with the topology service
>
> Felix wrote:
>
> > 1) Does the topology service do anything special when it detects link
> > congestion? Or do the messages just get dropped, as appears to be
> > happening in this case?
>
> If the topology service is unable to send a name table update because of
> link congestion, the update is lost. TIPC *will* issue a warning to the
> system log to indicate that this has happened, so you can check for the
> string "distribution failure" to see if this is actually occurring.
>
> > 2) If hundreds of processes go down at almost the same time (i.e. during
> > a node reset), won't the topology service on another node subsequently
> > flood the link with withdrawn messages? I just wanted to determine what
> > the expected behavior is.
>
> No. In TIPC 1.5.12, name table updates are only issued by the node which
> created the named socket. This means that if a node fails ungracefully
> (i.e. it just dies), no update messages are sent by any node in the
> network; instead, each of the other nodes detects that it has lost contact
> with the failed node (i.e. it no longer has any working link to that node)
> and automatically forces its topology service to purge any name table
> entries published by the failed node. If a node fails gracefully (i.e. it
> shuts down all of its processes and then dies), then it is possible that
> some of the "withdraw" messages it tries to send may be lost due to link
> congestion; however, once the node dies, the automatic cleanup done by the
> other nodes has the same effect as if the lost messages had arrived.
>
> > 3) Is there any way that an application can determine that a link
> > congestion situation is occurring? We can see from using tipc-config -ls
> > that link congestion has occurred. I assume that the number we're seeing
> > refers to the number of occurrences of link congestion.
>
> An application using TIPC's socket API can tell if link congestion is
> occurring by doing a non-blocking send (i.e. using the MSG_DONTWAIT flag
> on its send operations); if the send fails and errno is set to
> EWOULDBLOCK, the message could not be sent due to link congestion.
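>
> For illustration, a sketch of such a check (not from the TIPC sources; it
> assumes "sd" is an existing TIPC socket and "dest" already holds the
> destination address):
>
> #include <errno.h>
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <linux/tipc.h>
>
> /* Attempt a non-blocking send; report congestion instead of blocking. */
> int send_or_report(int sd, const struct sockaddr_tipc *dest,
>                    const void *buf, size_t len)
> {
>     ssize_t n = sendto(sd, buf, len, MSG_DONTWAIT,
>                        (const struct sockaddr *)dest, sizeof(*dest));
>
>     if (n < 0 && (errno == EWOULDBLOCK || errno == EAGAIN)) {
>         /* Message was not sent: the link (or port) is congested. */
>         fprintf(stderr, "TIPC send deferred: congestion\n");
>         return -1;
>     }
>     return (n < 0) ? -1 : 0;
> }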
> > It is very important that our application knows what processes are
> > up/down, so we're very sensitive to the situation where we lose these
> > topology service messages. Any advice or pointers would be very helpful
> > for this situation.
>
> If you are having issues with applications not receiving topology service
> events properly, I suggest that you first try to determine whether the
> name table entries on the node in question are correct. If the name table
> entries are incorrect, then the problem lies in the name table code rather
> than the topology service; in this case, issues like link congestion need
> to be considered. On the other hand, if the name table entries are
> correct, then the problem lies in the topology service; in this case,
> there might be issues with the event messages being sent to the
> applications by the topology service.
>
> One possible scenario that could explain the problems you are encountering
> would be if name table updates are being lost when a failed node restarts
> (rather than when it fails). Whenever node A joins the network, all of the
> other nodes will attempt to dump their name tables to it "en masse". I can
> imagine that some updates from node B might not reach node A due to link
> congestion, causing node A's name table to be incomplete and the
> applications on node A to never receive "publish" events that they should
> have received.
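>
> For reference, an application receives these "publish"/"withdrawn" events
> by subscribing to the topology service. A minimal subscriber sketch
> (untested; it relies on the linux/tipc.h definitions, glosses over error
> handling and version-specific byte-order details, and the name type 18888
> is just a placeholder):
>
> #include <stdio.h>
> #include <string.h>
> #include <sys/socket.h>
> #include <linux/tipc.h>
>
> #ifndef AF_TIPC
> #define AF_TIPC 30  /* from linux/socket.h, if the libc headers lack it */
> #endif
>
> int main(void)
> {
>     struct sockaddr_tipc topsrv;
>     struct tipc_subscr sub;
>     struct tipc_event evt;
>     int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);
>
>     /* Connect to the node's topology service. */
>     memset(&topsrv, 0, sizeof(topsrv));
>     topsrv.family = AF_TIPC;
>     topsrv.addrtype = TIPC_ADDR_NAME;
>     topsrv.addr.name.name.type = TIPC_TOP_SRV;
>     topsrv.addr.name.name.instance = TIPC_TOP_SRV;
>     if (sd < 0 || connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0)
>         return 1;
>
>     /* Subscribe to all instances of name type 18888 (placeholder). */
>     memset(&sub, 0, sizeof(sub));
>     sub.seq.type = 18888;
>     sub.seq.lower = 0;
>     sub.seq.upper = ~0u;
>     sub.timeout = TIPC_WAIT_FOREVER;
>     sub.filter = TIPC_SUB_PORTS;  /* one event per matching publication */
>     if (send(sd, &sub, sizeof(sub), 0) != sizeof(sub))
>         return 1;
>
>     while (recv(sd, &evt, sizeof(evt), 0) == sizeof(evt))
>         printf("%s <%u,%u>\n",
>                evt.event == TIPC_PUBLISHED ? "published" : "withdrawn",
>                evt.found_lower, evt.found_upper);
>     return 0;
> }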
> Regards,
> Al
>
> > -----Original Message-----
> > From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> > Sent: Tuesday, April 28, 2009 6:41 PM
> > To: tip...@li...
> > Cc: Natarajan Ramesh-A17988
> > Subject: [tipc-discussion] Link congestion with the topology service
> >
> > All,
> > I'm running TIPC 1.5.12 on Linux 2.6.21.
> >
> > We appear to be seeing link congestion occurring with our topology
> > service notifications (publications and withdrawals). It appears that
> > when one of our nodes goes down under a heavy load, the other nodes are
> > getting inundated with withdrawn events, and some of these events are
> > not making it to their intended destination because of link congestion.
> > There are hundreds of processes running on each of 10 nodes plus one
> > cluster manager.
> >
> > I believe the topology service opens its socket with critical
> > importance, and so do our applications, so there is no problem in
> > enqueuing the messages on the socket receive queue; we don't let the
> > transmit queue grow indefinitely, though. I can increase the window size
> > from its default of 50 to the maximum of 150, which will let us go from
> > 96 fragments/messages to 300 fragments/messages for critical messages.
> > I'm not sure how much this will help.
> >
> > So I've got a few questions:
> ... moved above ...
> >
> > Thanks,
> > Felix
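
For illustration of the "critical importance" sockets mentioned above, a
sketch of how an application requests that level on a TIPC socket (it
assumes the linux/tipc.h socket options; it is not taken from the poster's
code):

    #include <sys/socket.h>
    #include <linux/tipc.h>

    #ifndef AF_TIPC
    #define AF_TIPC 30    /* from linux/socket.h, if libc headers lack it */
    #endif
    #ifndef SOL_TIPC
    #define SOL_TIPC 271  /* from linux/socket.h */
    #endif

    /* Create a reliable-datagram socket whose messages are sent at
     * critical importance, so they are the last to be rejected when a
     * link becomes congested. */
    int make_critical_socket(void)
    {
        int sd = socket(AF_TIPC, SOCK_RDM, 0);
        int imp = TIPC_CRITICAL_IMPORTANCE;

        if (sd >= 0)
            setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE, &imp, sizeof(imp));
        return sd;
    }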