From: Erik H. <eri...@er...> - 2012-10-25 13:48:17
There is a problem with SOCK_RDM messaging in TIPC. If the receiving port of
an RDM packet is unable to handle the message (socket rcvbuf full), the
message can be:

A. Dropped, if the DEST_DROPPABLE flag is true.
B. Rejected back to the sender, if the DEST_DROPPABLE flag is false.

For B, if one or more messages have been rejected, subsequent messages that
are still in transit can still be delivered to the socket if it has managed
to drain its rcvqueue enough for them to be buffered. This breaks the
guaranteed sequentiality of RDM message delivery and applies to both unicast
and multicast messaging.

The easy way out of this is to force applications to use SEQPACKET/STREAM
sockets if they need guaranteed message delivery. But we already have
(almost) all the infrastructure in place to implement blocking send for RDM
if one or more of the recipients cannot handle the packet.

Suppose we add a 'congested' state to the publication struct:
http://lxr.free-electrons.com/source/net/tipc/name_table.h#L68

Every time we send an RDM message, we resolve a publication through either
tipc_nametbl_translate or tipc_nametbl_mc_translate. If one or more of the
destination ports that have published the given name are congested, this is
reported back to the port layer, and the send() call can block or return
-EAGAIN. If the node/portid is explicitly specified (sockaddr_tipc of type
TIPC_ADDR_ID), we would probably need to add a reverse publication lookup in
tipc_send2port to check the 'congested' state.

When a message is received on an RDM socket, we check the buffer fill level.
If it exceeds TIPC_RDM_HIGH_WMARK, we send out a TIPC broadcast message with
a (new?) TIPC protocol primitive stating that port X on node a.b.c is
overloaded. Upon receiving this message, all nodes in the cluster update the
corresponding publication and set 'congested'=true.

The WMARK limits should probably be based directly on the socket rcvbuf
size, maybe:

#define TIPC_RDM_LOW_WMARK(sk)  (sk->sk_rcvbuf / 4)
#define TIPC_RDM_HIGH_WMARK(sk) (sk->sk_rcvbuf - TIPC_RDM_LOW_WMARK(sk))

When the buffer fill level drops below TIPC_RDM_LOW_WMARK, we send out
another TIPC broadcast message, reporting that port X on node a.b.c is no
longer congested.

Will this work?
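[Editor's note: to make the watermark hysteresis above concrete, here is a
minimal user-space C model of the proposed receive-side behaviour. It is
only a sketch under the assumptions stated in the mail: the rdm_sock struct,
the rdm_rcvbuf_check() helper and the broadcast_port_state() call are
hypothetical stand-ins for the real socket and NAME_DISTR machinery, not
actual TIPC kernel code.]

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for the receiving socket: only the fields the check needs. */
    struct rdm_sock {
            int sk_rcvbuf;          /* configured receive buffer limit */
            int rmem_alloc;         /* bytes currently queued */
            bool congested;         /* advertised state, per the proposal */
    };

    /* Watermarks derived from the rcvbuf size, as suggested above. */
    #define TIPC_RDM_LOW_WMARK(s)   ((s)->sk_rcvbuf / 4)
    #define TIPC_RDM_HIGH_WMARK(s)  ((s)->sk_rcvbuf - TIPC_RDM_LOW_WMARK(s))

    /* Placeholder for the proposed cluster-wide congestion broadcast. */
    static void broadcast_port_state(bool congested)
    {
            printf("broadcast: port is %s\n", congested ? "CONGESTED" : "OK");
    }

    /*
     * Called whenever the queued byte count changes.  Crossing the high
     * watermark advertises congestion once; dropping below the low
     * watermark withdraws it.  The gap between the two marks provides
     * the hysteresis that avoids flapping.
     */
    static void rdm_rcvbuf_check(struct rdm_sock *s)
    {
            if (!s->congested && s->rmem_alloc > TIPC_RDM_HIGH_WMARK(s)) {
                    s->congested = true;
                    broadcast_port_state(true);
            } else if (s->congested && s->rmem_alloc < TIPC_RDM_LOW_WMARK(s)) {
                    s->congested = false;
                    broadcast_port_state(false);
            }
    }

    int main(void)
    {
            struct rdm_sock s = { .sk_rcvbuf = 65536 };
            int fill[] = { 10000, 50000, 60000, 30000, 15000, 8000 };

            for (unsigned i = 0; i < sizeof(fill) / sizeof(fill[0]); i++) {
                    s.rmem_alloc = fill[i];
                    rdm_rcvbuf_check(&s);
            }
            return 0;
    }

In the real kernel path this check would presumably sit wherever the RDM
receive queue is filled and drained, and the "broadcast" would be the new
protocol primitive the mail proposes.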
From: Ying X. <yin...@wi...> - 2012-10-26 08:56:49
Hi Erik,

I understood your idea.

First, I think the solution might be workable, but I believe it would not be
easy to implement a stronger version. As you know, in a distributed system it
is hard to maintain a shared variable (i.e. the "congested" flag you
mention). Keeping that variable synchronized across all nodes of the cluster
in a timely way takes a lot of effort.

Furthermore, as I understand it, you are really adding a simple flow control
algorithm for RDM sockets at the port layer to prevent the receive queue from
overloading. I believe such a flow control mechanism can reduce the rate of
overflow, but it may not eliminate the risk entirely, because:

1. In a distributed system, we cannot easily estimate which watermark value
   for signalling that a receiver is entering the congested state is best.
2. As I stated before, at the port layer we cannot drop a message once we
   have received it, even if we know the receive buffer will be overloaded.
3. Even for the same skb, the skb's truesize may differ between hosts.

Therefore, we cannot 100% prevent the socket receive buffer from being
overloaded with the TM algorithm. As you pointed out, the TM patches I sent
you previously cannot prevent the issue from happening either.

In all, I cannot think of a better method to resolve the issue.

Regards,
Ying
From: Erik H. <eri...@er...> - 2012-11-05 16:18:11
The unconnected nature of RDM means that we will never know whether the
published ports/nodes will actually receive a copy of the message. Maybe the
socket has been close()'d, but the sender's name table is not yet updated?
So we cannot provide any guarantee of delivery for RDM messages to the ports
bound to a specific name. Basically, we don't have enough 'state' in the RDM
protocol to base any ack/retransmit decisions on.

My understanding is that, for unconnected transmission methods, the best we
can do is to introduce a feature that communicates to the peer nodes'
publication tables when a remote port cannot handle more data.

The method I proposed will work for the trivial case with one sender and one
receiver. It becomes more complicated in the second case, with multiple
senders and a single receiver, since all senders can potentially send a big
burst of data and exhaust the receiver's buffer. The third case is when we
have multiple ports with overlapping names that should receive an RDM
message. If any of the recipient ports cannot handle the message, we should
block the send() call.

If we have reached an RDM port congestion state, it will eventually be
'unblocked' (when LOW_WMARK is reached) and a broadcast indication is sent to
the peer nodes. Now all peers may send a huge burst of data to the recipient
port, and we can potentially run into an overload condition again. This
overload will of course also trigger the HIGH_WMARK message to be sent out
again, telling the senders to back off.

If I interpret the tipc_port_recv_mcast/tipc_port_recv_msg code correctly,
messages will be rejected for both multicast and unicast messaging if a port
cannot handle a message.

What we gain with this is that when a server port gets overloaded and rejects
a set of messages, we will no longer receive any 'out of sequence' messages,
and based on the rejected messages the sending ports will always know that
the server received everything up to message X before it got choked.

This may introduce a 'stop and go' behaviour for heavily loaded servers,
which might be less than ideal, but I hope you agree with me that it's better
than receiving out-of-sequence data.

//E

On Fri, Nov 02, 2012 at 11:10:07AM -0400, Andrew Booth wrote:
> I agree with Ying, I don't think this approach handles all cases, though
> it might help with some common cases.
> Maybe enhance the RDM code to allow for acknowledgement and retransmission
> per-port? Not trivial, but I'm under the impression that is the approach
> taken by other reliable transports (tcp, etc).
> Andrew
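[Editor's note: the sender-side rule described above for the
multi-destination case ("if any of the recipient ports cannot handle the
message, we should block the send() call") can be illustrated with a small
user-space model. The publication struct and the rdm_check_destinations()
helper below are hypothetical; in the real proposal the check would live in
the name table lookup (tipc_nametbl_translate / tipc_nametbl_mc_translate)
and in tipc_send2port.]

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Minimal stand-in for a name table publication entry. */
    struct publication {
            unsigned int node;
            unsigned int port;
            bool congested;         /* proposed new state */
    };

    /*
     * Model of the proposed send-side rule for names with several
     * publications: if any matching destination port has advertised
     * congestion, the whole send is refused (or the caller blocks),
     * so that no destination can be pushed out of sequence.
     */
    static int rdm_check_destinations(const struct publication *pubs, int npubs)
    {
            for (int i = 0; i < npubs; i++) {
                    if (pubs[i].congested)
                            return -EAGAIN; /* caller may block and retry */
            }
            return 0;               /* all destinations can accept the message */
    }

    int main(void)
    {
            struct publication pubs[] = {
                    { .node = 0x1001001, .port = 42, .congested = false },
                    { .node = 0x1001002, .port = 43, .congested = true  },
            };

            if (rdm_check_destinations(pubs, 2) == -EAGAIN)
                    printf("send deferred: a destination is congested\n");
            return 0;
    }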
From: Ying X. <yin...@wi...> - 2012-11-06 02:58:50
I agree with Andrew's proposal. It is an interesting idea, because it shows
how we could drop messages when the socket receive buffer overflows (the
sender can simply retransmit them), without affecting the link layer state.
Although the link layer already has a reliable flow control algorithm, and
adding another reliable flow control mechanism at the port layer may seem
redundant, I believe it is necessary if we want to enhance the RDM socket.
With that approach, the issues I listed earlier are easy to address.

Which reliable flow control mechanism is best for us?

The first algorithm I can think of is the one already deployed at the
multicast link layer. If we implement it again at the port layer, I think it
should at least work reliably for multicast communication.

I also noticed Erik's enhanced solution, which now seems workable :-) But I
would point out that sending rejected messages back to their senders is
itself unreliable: in particular, when a link is congested the rejected
message may be dropped silently.

Maybe we can find other algorithms by looking into some papers. In the end
we also need to implement some prototypes to prove which one is best for us.

In addition, I am looking forward to hearing Jon's comments :-)

Regards,
Ying
From: Jon M. <ma...@do...> - 2012-11-07 08:53:36
On 11/05/2012 09:58 PM, Ying Xue wrote:
> [...] Which reliable flow control mechanism is best for us?
>
> The first algorithm I can think of is the one already deployed at the
> multicast link layer. If we implement it again at the port layer, I think
> it should at least work reliably for multicast communication.

The problem with this is that the sender port doesn't know the number of
receivers. And anyway, that number may change from one moment to another. I
fear it would be extremely challenging to make such a protocol work
reliably, given the problems we have seen with the presumably simpler
link-layer broadcast. But I am open to suggestions. Maybe I am wrong.

> I also noticed Erik's enhanced solution, which now seems workable :-) But
> I would point out that sending rejected messages back to their senders is
> itself unreliable: in particular, when a link is congested the rejected
> message may be dropped silently.
>
> [...] In addition, I am looking forward to hearing Jon's comments :-)

First, I want to say that what Andrew and others are asking for is in
reality impossible. SOCK_RDM is by its POSIX definition a datagram protocol
mode, i.e. we may have multiple senders and multiple receivers
simultaneously. Also, by definition it is not required to deliver messages
in sequence, while I am less certain about cardinality. When I see such
requests, I suspect that what they really want is SOCK_SEQPACKET, which
basically has all the properties they ask for.

But who are we if we don't try to achieve the impossible for our users ;-)

I like Erik's idea (it is not new) to attach a load level indicator to the
publication items. The thought I had about this was to reserve two bits and
use them to indicate load level, e.g. 00 = <70%, 01 = 70-85%, 10 = 85-95%,
11 = >95% (and full stop). I assume that the load levels (when >70%) are
broadcast out in a new NAME_DISTR message type, since there is no more space
in the current publication items (yes, I have checked), and the NAME_DISTR
message header cannot be used for this. This would be fully backwards
compatible.

We could then extend the name lookup algorithm to find an instance with an
acceptable load, and if there is none we either block the sender until the
overload abates (as we do with link congestion now), or we return EAGAIN.

This would, as Erik said, not guarantee against buffer overflow, but it
should reduce the risk significantly.

There is a risk with this, however. What happens in a heavily overloaded
system, if a lot of destinations start to broadcast overload messages at a
massive scale? We may end up worsening the situation instead of improving
it. It is possible that we should make this feature an opt-in service, set
via a socket option at the server side. But I would prefer to make it
default and transparent, if possible.

I think only some prototyping and experimenting during high load can give
the answer here.
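[Editor's note: to make the two-bit load-level idea a bit more concrete,
here is a small user-space sketch of how a name lookup might prefer a
lightly loaded instance and refuse the send only when every instance is at
"full stop". The enum, the publication struct and the pick_instance() helper
are hypothetical illustrations; the real lookup would live in
tipc_nametbl_translate and would also have to fit in with TIPC's existing
instance selection rules.]

    #include <stddef.h>
    #include <stdio.h>

    /*
     * Two-bit load level per publication, as suggested in the mail:
     * 00 = <70%, 01 = 70-85%, 10 = 85-95%, 11 = >95% (full stop).
     */
    enum pub_load {
            PUB_LOAD_LOW  = 0,      /* < 70%  */
            PUB_LOAD_MED  = 1,      /* 70-85% */
            PUB_LOAD_HIGH = 2,      /* 85-95% */
            PUB_LOAD_STOP = 3,      /* > 95%, do not use */
    };

    /* Minimal stand-in for a name table publication entry. */
    struct publication {
            unsigned int node;
            unsigned int port;
            enum pub_load load;
    };

    /*
     * Pick the least-loaded instance bound to a name; return NULL if
     * every instance is at full stop, in which case the caller would
     * block or return EAGAIN, as for link congestion today.
     */
    static const struct publication *
    pick_instance(const struct publication *pubs, size_t npubs)
    {
            const struct publication *best = NULL;

            for (size_t i = 0; i < npubs; i++) {
                    if (pubs[i].load == PUB_LOAD_STOP)
                            continue;
                    if (!best || pubs[i].load < best->load)
                            best = &pubs[i];
            }
            return best;
    }

    int main(void)
    {
            struct publication pubs[] = {
                    { .node = 0x1001001, .port = 100, .load = PUB_LOAD_HIGH },
                    { .node = 0x1001002, .port = 101, .load = PUB_LOAD_MED  },
                    { .node = 0x1001003, .port = 102, .load = PUB_LOAD_STOP },
            };
            const struct publication *p = pick_instance(pubs, 3);

            if (p)
                    printf("selected port %u (load level %d)\n", p->port, p->load);
            else
                    printf("all instances congested: block or return EAGAIN\n");
            return 0;
    }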
From: Erik H. <eri...@er...> - 2012-11-07 15:05:02
> But who are we if we don't try to achieve the impossible for our
> users ;-)
>
> I like Erik's idea (it is not new) to attach a load level indicator to
> the publication items. The thought I had about this was to reserve two
> bits and use them to indicate load level, e.g. 00 = <70%, 01 = 70-85%,
> 10 = 85-95%, 11 = >95% (and full stop).

Do we need this level of detail? I thought it would be enough with a single
flag indicating 'full stop'.

> I assume that the load levels (when >70%) are broadcast out in a new
> NAME_DISTR message type, since there is no more space in the current
> publication items (yes, I have checked), and the NAME_DISTR message
> header cannot be used for this. This would be fully backwards compatible.
>
> We could then extend the name lookup algorithm to find an instance with
> an acceptable load, and if there is none we either block the sender until
> the overload abates (as we do with link congestion now), or we return
> EAGAIN.

This raises a question for me: maybe we should make it possible for the
sending side to control the send() behaviour when one or more receiving
ports are congested, via a socket option? TIPC_ALLOW_PARTIAL_MCAST
(boolean 1/0), default 0. If this is set and one or more of the recipient
ports are currently overloaded, we send the message only to the ports that
are not (potentially zero ports, if all of them are overloaded).

> This would, as Erik said, not guarantee against buffer overflow, but it
> should reduce the risk significantly.

Yes, and most importantly, the sending side can know exactly which messages
have been delivered to the receiver (based on the rejected messages).

> There is a risk with this, however. What happens in a heavily overloaded
> system, if a lot of destinations start to broadcast overload messages at
> a massive scale? We may end up worsening the situation instead of
> improving it.

Maybe we could limit the broadcast messages indicating that a port's
overload has ceased to one per port per link timeout period... but that
would require keeping track of all congested ports in a separate list, or
iterating through the name table to find congested ports every timeout
interval. Not very good.

> It is possible that we should make this feature an opt-in service, set
> via a socket option at the server side. But I would prefer to make it
> default and transparent, if possible.

What if we keep it on by default, but make it opt-out with a socket option?

> I think only some prototyping and experimenting during high load can give
> the answer here.
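[Editor's note: as a user-space illustration of the proposed opt-in, a
sending application might enable the behaviour roughly as below. Note that
TIPC_ALLOW_PARTIAL_MCAST is only being proposed in this mail; it does not
exist in the kernel, so the option name, its numeric value and the fallback
defines in the sketch are all hypothetical, and on a real system the
setsockopt() call would simply fail.]

    #include <stdio.h>
    #include <sys/socket.h>

    #ifndef AF_TIPC
    #define AF_TIPC 30              /* TIPC address family */
    #endif
    #ifndef SOL_TIPC
    #define SOL_TIPC 271            /* TIPC socket option level */
    #endif

    /* Hypothetical option number, placeholder value for this sketch only. */
    #ifndef TIPC_ALLOW_PARTIAL_MCAST
    #define TIPC_ALLOW_PARTIAL_MCAST 140
    #endif

    int main(void)
    {
            int on = 1;
            int sd = socket(AF_TIPC, SOCK_RDM, 0);

            if (sd < 0) {
                    perror("socket");
                    return 1;
            }

            /*
             * Opt in to partial multicast delivery: with the proposed
             * option set, send() would skip congested destinations
             * instead of blocking or returning -EAGAIN for the whole
             * destination group.
             */
            if (setsockopt(sd, SOL_TIPC, TIPC_ALLOW_PARTIAL_MCAST,
                           &on, sizeof(on)) < 0)
                    perror("setsockopt(TIPC_ALLOW_PARTIAL_MCAST)");

            return 0;
    }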