From: Natarajan Ramesh-A. <ra...@mo...> - 2009-04-29 16:39:58
Hi Allan,

We are seeing this message in syslog. What does this mean?

Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_4 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_9 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_1 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_10 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_12 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_4 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_11 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_1 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_12 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_9 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_10 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_3 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_11 kernel: TIPC: Unable to remove publication by node 0x1001007
Apr 28 21:41:43 slot0_13 kernel: TIPC: Unable to remove publication by node 0x1001007

thanks
Ramesh

-----Original Message-----
From: Stephens, Allan [mailto:all...@wi...]
Sent: Wednesday, April 29, 2009 11:09 AM
To: Nayman Felix-QA5535; tip...@li...
Cc: Natarajan Ramesh-A17988
Subject: RE: [tipc-discussion] Link congestion with the topology service

Felix wrote:

> 1) Does the topology service do anything special when it detects link
> congestion? Or do the messages just get dropped, as appears to be
> happening in this case?

If the topology service is unable to send a name table update because of
link congestion, the update is lost. TIPC *will* issue a warning to the
system log when this happens, so you can check for the string
"distribution failure" to see whether it is actually occurring.

> 2) If hundreds of processes go down at almost the same time (i.e.
> during a node reset), the topology service on another node will
> subsequently flood the link with withdrawn messages, won't it?
> I just wanted to determine what the expected behavior is.

No. In TIPC 1.5.12, name table updates are only issued by the node which
created the named socket. This means that if a node fails ungracefully
(i.e. it just dies), no update messages are sent by any node in the
network; instead, each of the other nodes detects that it has lost contact
with the failed node (i.e. it no longer has any working link to that node)
and automatically forces its topology service to purge any name table
entries published by the failed node. If a node fails gracefully (i.e. it
shuts down all of its processes and then dies), some of the "withdraw"
messages it tries to send may be lost due to link congestion; however,
once the node dies, the automatic cleanup done by the other nodes has the
same effect as if the lost messages had arrived.

> 3) Is there any way that an application can determine that a link
> congestion situation is occurring? We can see from using tipc-config
> -ls that link congestion has occurred. I assume that the number we're
> seeing refers to the number of occurrences of link congestion.

An application using TIPC's socket API can tell if link congestion is
occurring by doing a non-blocking send (i.e. using the MSG_DONTWAIT flag
on its send operations); if the send fails with errno set to EWOULDBLOCK,
the message could not be sent due to link congestion.
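For illustration, the check described above looks roughly like this in C.
This is a minimal sketch rather than code from this thread: the service
type 1000 and instance 1 are placeholder values, and it assumes a Linux
system with the <linux/tipc.h> header available.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    /* Try to send a message to a named TIPC port without blocking.
     * Returns 0 on success, -1 if the link (or destination port) is
     * congested, -2 on any other error. */
    static int try_send(int sd, const void *buf, size_t len)
    {
        struct sockaddr_tipc dst;

        memset(&dst, 0, sizeof(dst));
        dst.family = AF_TIPC;
        dst.addrtype = TIPC_ADDR_NAME;
        dst.addr.name.name.type = 1000;   /* placeholder service type */
        dst.addr.name.name.instance = 1;  /* placeholder instance */
        dst.addr.name.domain = 0;         /* look the name up anywhere */

        if (sendto(sd, buf, len, MSG_DONTWAIT,
                   (struct sockaddr *)&dst, sizeof(dst)) >= 0)
            return 0;
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return -1;  /* congestion: message not sent; retry later */
        perror("sendto");
        return -2;
    }

The socket would typically be created with socket(AF_TIPC, SOCK_RDM, 0);
on a -1 return the caller decides whether to retry later or drop the
message.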
> It is very important that our application knows what processes are
> up/down, so we're very sensitive to the situation where we lose these
> topology service messages. Any advice or pointers would be very
> helpful for this situation.

If you are having issues with applications not receiving topology service
events properly, I suggest you first determine whether the name table
entries on the node in question are correct. If the name table entries are
incorrect, the problem lies in the name table code rather than the
topology service; in this case, issues like link congestion need to be
considered. On the other hand, if the name table entries are correct, the
problem lies in the topology service itself; in this case, there might be
issues with the event messages it sends to the applications.

One scenario that could explain the problems you are encountering is name
table updates being lost when a failed node restarts (rather than when it
fails). Whenever node A joins the network, all of the other nodes attempt
to dump their name tables to it "en masse". Some updates from node B might
not reach node A because of link congestion, leaving node A's name table
incomplete and causing applications on node A to miss "publish" events
they should have received.

Regards, Al
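For anyone tracking process up/down state this way: the topology service
events discussed above are obtained by connecting to the node's topology
server and sending a subscription, after which each matching publish or
withdraw arrives as a separate event message. A minimal sketch, using the
subscription structures from <linux/tipc.h>; the service type 1000 is a
placeholder value:

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    int main(void)
    {
        struct sockaddr_tipc topsrv;
        struct tipc_subscr sub;
        struct tipc_event evt;
        int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

        if (sd < 0) {
            perror("socket");
            return 1;
        }

        /* Connect to the node's topology server. */
        memset(&topsrv, 0, sizeof(topsrv));
        topsrv.family = AF_TIPC;
        topsrv.addrtype = TIPC_ADDR_NAME;
        topsrv.addr.name.name.type = TIPC_TOP_SRV;
        topsrv.addr.name.name.instance = TIPC_TOP_SRV;
        if (connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0) {
            perror("connect");
            return 1;
        }

        /* Watch every instance of one (placeholder) service type. */
        memset(&sub, 0, sizeof(sub));
        sub.seq.type = 1000;          /* placeholder service type */
        sub.seq.lower = 0;
        sub.seq.upper = ~0U;
        sub.timeout = TIPC_WAIT_FOREVER;
        sub.filter = TIPC_SUB_PORTS;  /* event per port, not per name */
        if (send(sd, &sub, sizeof(sub), 0) != sizeof(sub)) {
            perror("send subscription");
            return 1;
        }

        /* Each publish/withdraw arrives as one tipc_event message. */
        while (recv(sd, &evt, sizeof(evt), 0) == sizeof(evt))
            printf("%s <%u,%u> port <%x:%u>\n",
                   evt.event == TIPC_PUBLISHED ? "published" : "withdrawn",
                   evt.s.seq.type, evt.found_lower,
                   evt.port.node, evt.port.ref);
        return 0;
    }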
> -----Original Message-----
> From: Nayman Felix-QA5535 [mailto:Fel...@mo...]
> Sent: Tuesday, April 28, 2009 6:41 PM
> To: tip...@li...
> Cc: Natarajan Ramesh-A17988
> Subject: [tipc-discussion] Link congestion with the topology service
>
> All,
> I'm running TIPC 1.5.12 on linux 2.6.21.
>
> We appear to be seeing link congestion occurring with our topology
> service notifications (publications and withdrawals). It appears that
> when we have a situation where one of our nodes goes down under a
> heavy load, the other nodes are getting inundated with withdrawn
> events, and some of these events are not making it to their intended
> destination because of link congestion. There are hundreds of
> processes running on each of 10 nodes plus one cluster manager.
>
> I believe the topology service opens its socket with critical
> importance, and so do our applications, so there is no problem
> enqueuing the messages on the socket receive queue; we don't let the
> transmit queue grow indefinitely, though. I can increase the window
> size from its default of 50 to the maximum of 150, which will let us
> go from 96 to 300 fragments/messages for critical messages. I'm not
> sure how much this will help.
>
> So I've got a few questions: ... moved above ...
>
> Thanks,
> Felix
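As background for the "critical importance" remark in Felix's message: an
application sets a TIPC socket's message importance with setsockopt(). A
minimal sketch, assuming the constants from <linux/tipc.h>; raising the
importance makes rejection under congestion less likely but does not
guarantee delivery:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    int main(void)
    {
        int sd = socket(AF_TIPC, SOCK_RDM, 0);
        int imp = TIPC_CRITICAL_IMPORTANCE;

        if (sd < 0) {
            perror("socket");
            return 1;
        }
        /* Critical messages are the last to be rejected under
         * congestion, but they can still be lost. */
        if (setsockopt(sd, SOL_TIPC, TIPC_IMPORTANCE,
                       &imp, sizeof(imp)) < 0)
            perror("setsockopt TIPC_IMPORTANCE");
        return 0;
    }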