From: Jon M. <jon...@er...> - 2019-12-02 03:42:47
Just so there is no misunderstanding: this is a completely new version of the variable congestion window algorithm, comprising slow start, congestion avoidance and fast recovery. Maybe I should have added a 'v5' tag, since it is similar to that one... Performance is consistently better than in the previous versions.
///jon

> -----Original Message-----
> From: Jon Maloy <jon...@er...>
> Sent: 1-Dec-19 19:32
> To: Jon Maloy <jon...@er...>; Jon Maloy <ma...@do...>
> Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>;
> par...@gm...; Tung Quang Nguyen <tun...@de...>; Hoang
> Huu Le <hoa...@de...>; Tuong Tong Lien <tuo...@de...>; Gordan
> Mihaljevic <gor...@de...>; yin...@wi...; tipc-
> dis...@li...
> Subject: [net-next 0/3] tipc: introduce variable window congestion control
>
> We improve throughput greatly by introducing a variant of the Reno
> congestion control algorithm at the link level.
>
> This algorithm is similar to one Xin developed last summer, but is
> re-adapted to the latest changes at the link level and leverages the
> special features of the TIPC link protocol, such as the explicit
> nacks sent when a packet loss is detected.
>
> Jon Maloy (3):
>   tipc: eliminate gap indicator from ACK messages
>   tipc: eliminate more unnecessary nacks and retransmissions
>   tipc: introduce variable window congestion control
>
>  net/tipc/bcast.c     |  11 +--
>  net/tipc/bearer.c    |  11 +--
>  net/tipc/bearer.h    |   6 +-
>  net/tipc/eth_media.c |   3 +-
>  net/tipc/ib_media.c  |   5 +-
>  net/tipc/link.c      | 191 ++++++++++++++++++++++++++++++++++++---------------
>  net/tipc/link.h      |   9 +--
>  net/tipc/node.c      |  16 +++--
>  net/tipc/udp_media.c |   3 +-
>  9 files changed, 172 insertions(+), 83 deletions(-)
>
> --
> 2.1.4
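The cover letter above describes a Reno-like scheme with slow start, congestion avoidance and fast recovery at the link level. As a rough illustration of that scheme (a hedged sketch only: the struct, function names, and thresholds here are made up for the example and are not the actual TIPC link fields), the window adjustment could look like this:

```c
#include <assert.h>

/* Illustrative Reno-style window state; not the real tipc_link fields */
struct link_win {
	unsigned int cwnd;      /* current congestion window, in packets */
	unsigned int ssthresh;  /* slow start threshold */
	unsigned int min_win;   /* lower bound on the window */
	unsigned int max_win;   /* upper bound on the window */
	unsigned int cong_acks; /* ACK counter used in congestion avoidance */
};

/* Called for each ACK that releases packets from the transmit queue */
static void win_on_ack(struct link_win *w, unsigned int acked)
{
	if (w->cwnd < w->ssthresh) {
		/* Slow start: grow by roughly one packet per acked packet */
		w->cwnd += acked;
	} else {
		/* Congestion avoidance: grow by one packet per full window */
		w->cong_acks += acked;
		if (w->cong_acks >= w->cwnd) {
			w->cong_acks = 0;
			w->cwnd += 1;
		}
	}
	if (w->cwnd > w->max_win)
		w->cwnd = w->max_win;
}

/* Called when a retransmission signals loss: enter fast recovery */
static void win_on_loss(struct link_win *w)
{
	w->ssthresh = w->cwnd / 2;
	if (w->ssthresh < w->min_win)
		w->ssthresh = w->min_win;
	w->cwnd = w->ssthresh;
	w->cong_acks = 0;
}
```

The three phases the cover letter names map onto the two callbacks: slow start and congestion avoidance live in the ACK path, while loss detection halves the window (down to a floor) instead of resetting it to the minimum.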
From: Jon M. <ma...@do...> - 2019-12-04 18:06:58
Hi Ying,
(cc-ing tipc-discussion, since your original mail seems to have been dropped somewhere, and I want everybody to be able to follow the discussion)

Actually we do have a kind of SACK mechanism already, but maybe too simple. See below.

On Wednesday, December 4, 2019, 11:45:37 AM GMT-5, Xue, Ying <yin...@wi...> wrote:

I don't know why I received lots of complaints from Outlook indicating that my email below was not sent to you. I am just trying to resend it. If you received it, please ignore this one.

Thanks,
Ying

-----Original Message-----
From: Ying Xue [mailto:yin...@wi...]
Sent: Thursday, December 5, 2019 12:39 AM
To: Jon Maloy; Jon Maloy
Cc: moh...@er...; par...@gm...; tun...@de...; hoa...@de...; tuo...@de...; gor...@de...; tip...@li...
Subject: Re: [net-next 1/3] tipc: eliminate gap indicator from ACK messages

On 12/2/19 8:32 AM, Jon Maloy wrote:
> When we increase the link send window we sometimes observe the
> following scenario:
>
> 1) A packet #N arrives out of order far ahead of a sequence of older
>    packets which are still under way. The packet is added to the
>    deferred queue.
> 2) The missing packets arrive in sequence, and for each 16th of them
>    an ACK is sent back to the sender, as it should be.
> 3) When building those ACK messages, it is checked if there is a gap
>    between the link's 'rcv_nxt' and the first packet in the deferred
>    queue. This is always the case until packet number #N-1 arrives, and
>    a 'gap' indicator is added, effectively turning them into NACK
>    messages.
> 4) When those NACKs arrive at the sender, all the requested
>    retransmissions are done, since it is a first-time request.
>
> This sometimes leads to a huge amount of redundant retransmissions,
> causing a drop in max throughput. This problem gets worse when we
> in a later commit introduce variable window congestion control,
> since it drops the link back to 'slow start' much more often
> than necessary.
I think the essential thing is how we should precisely detect whether a packet is really lost. When we can identify that a message is really lost, that is the right moment to ask the sender to retransmit the lost message to the receiver. But it's very difficult to identify this. When the receiver finds a missing message in its receipt queue, the message can be in one of two states:

1. The missing message is just out of order.
2. The missing message is really lost.

For the first case, if the receiver detects the missing message and sends a retransmission request before the out-of-order message arrives, it will cause a redundant retransmission. To avoid this situation, we should send the retransmission request as late as possible.

For the second case, when the missing message is really lost, lots of messages will pile up in the send buffer on the sender side if the receiver doesn't send a retransmission request in time. To avoid this situation, we should send the retransmission request as early as possible.

As a consequence, the two conclusions we drew above conflict with each other, which means we cannot solve this problem perfectly with just one method. So let's take SCTP as an example of how to handle this situation. In SCTP, there are two different retransmission mechanisms: one is the normal retransmission mechanism, and the other is called fast retransmission. In the normal retransmission algorithm, a timer is set to detect whether a message is lost or not. When a message's timer expires, the message is assumed to be lost and is then retransmitted.

[jon] This is what we are missing, and as a result first-time retransmission requests are always honored. I have already observed that >90% of those are useless under normal circumstances, since they end up as duplicates.
I was able to somewhat improve this figure with my first two patches in the series, but there are still a lot of duplicates. I am sure we can do better here. On a general note, I strongly dislike timers, especially if they are set per packet. Maybe we could just set the control block time stamp to e.g. N ms, and then on each link timeout check the age of the first (and oldest) buffer in the transmit queue. If the age exceeds N ms, we retransmit, and go on to check the next buffer, etc. But remember, nobody is using TIPC across the internet, so delay times are short, and it is far likelier that the receiver will detect a gap and send a NACK long before this limit has expired. So, I suspect such a mechanism would be of limited value. Also note that in the last patch we actually introduce something similar in tipc_link_timeout(), although the limit will in practice be 375 ms, and I never once saw it being triggered during my testing.

In the fast retransmission algorithm, when three SACK messages indicate that the same message is missing, the message is immediately retransmitted, before its retransmission timer expires.

[jon] We have a similar mechanism, but we base it on the skb timestamp. If a packet has been retransmitted once, we don't allow it to be retransmitted again until 1 ms has elapsed, irrespective of the number of NACKs. We could of course base this on the already existing retransmission counter instead. I will try that.

In my understanding, the normal retransmission mechanism handles the first case, sending the retransmission request as late as possible. The fast retransmission algorithm handles the second case, sending the retransmission request earlier than the retransmission timer expiration. Therefore, we can probably use these ideas for our TIPC cases:

1. We can introduce a retransmission timer to perform delayed retransmission.
2. We should retain the gap indicator. But when the receiver reports a gap to the sender, it doesn't send a NACK to the sender.
Instead, it should send a SACK to the sender.

[jon] All our NACKs are in reality SACKs. It is up to the receiver to interpret them correctly.

Whenever the sender receives a NACK, it will immediately retransmit the missed messages, which probably causes lots of duplicated messages. SACK, by contrast, means Selective ACK, and it contains gap info. When a sender receives a SACK with gap info, it doesn't immediately retransmit the missed message. Instead, it releases the messages in its send queue which have been acknowledged by the SACK message. Meanwhile, once three SACK messages report that the same message is missing, the sender can immediately retransmit that message, which is exactly what the fast retransmission algorithm does.

Of course, the two suggested approaches are much more complex than our current TIPC link traffic control algorithm. If we decide to introduce a retransmission timer, we have to introduce another algorithm to measure the RTT. I am not sure whether we should introduce a retransmission timer, but in my opinion introducing SACK might be a good idea for TIPC.

[jon] Once we have an RTT measuring mechanism in place, we should be able to calculate an optimal expiration time according to the above suggestion. But I am skeptical about its practical value, at least the way TIPC is used now. I do agree, though, that we need a mechanism for limiting the number of first retransmission attempts, which constitute the vast majority of retransmissions, and of which the vast majority are useless. I think I will first try leveraging the skb retransmission counter here, and see what happens.
///jon

About SCTP SACK, please refer to: https://tools.ietf.org/html/rfc4960#page-34

Thanks,
Ying

> We now fix this by not sending any 'gap' indicator in regular ACK
> messages. We already have a mechanism for sending explicit NACKs
> in place, and this is sufficient to keep up the packet flow.
>
> Signed-off-by: Jon Maloy <jon...@er...>
> ---
>  net/tipc/link.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 24d4d10..6d86446 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1521,7 +1521,8 @@ static int tipc_link_build_nack_msg(struct tipc_link *l,
>  				    struct sk_buff_head *xmitq)
>  {
>  	u32 def_cnt = ++l->stats.deferred_recv;
> -	u32 defq_len = skb_queue_len(&l->deferdq);
> +	struct sk_buff_head *dfq = &l->deferdq;
> +	u32 defq_len = skb_queue_len(dfq);
>  	int match1, match2;
>
>  	if (link_is_bc_rcvlink(l)) {
> @@ -1532,8 +1533,12 @@ static int tipc_link_build_nack_msg(struct tipc_link *l,
>  		return 0;
>  	}
>
> -	if (defq_len >= 3 && !((defq_len - 3) % 16))
> -		tipc_link_build_proto_msg(l, STATE_MSG, 0, 0, 0, 0, 0, xmitq);
> +	if (defq_len >= 3 && !((defq_len - 3) % 16)) {
> +		u16 rcvgap = buf_seqno(skb_peek(dfq)) - l->rcv_nxt;
> +
> +		tipc_link_build_proto_msg(l, STATE_MSG, 0, 0,
> +					  rcvgap, 0, 0, xmitq);
> +	}
>  	return 0;
>  }
>
> @@ -1631,7 +1636,7 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
>  	if (!tipc_link_is_up(l) && (mtyp == STATE_MSG))
>  		return;
>
> -	if (!skb_queue_empty(dfq))
> +	if ((probe || probe_reply) && !skb_queue_empty(dfq))
>  		rcvgap = buf_seqno(skb_peek(dfq)) - l->rcv_nxt;
>
>  	skb = tipc_msg_create(LINK_PROTOCOL, mtyp, INT_H_SIZE,
> @@ -2079,7 +2084,6 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
>  		if (rcvgap || reply)
>  			tipc_link_build_proto_msg(l, STATE_MSG, 0, reply,
>  						  rcvgap, 0, 0, xmitq);
> -
>  		rc |= tipc_link_advance_transmq(l, ack, gap, ga, xmitq);
>
>  		/* If NACK, retransmit will now start at right position */
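Jon's timer-less alternative above (no per-packet timer; instead, on each link timeout, walk the oldest-first transmit queue and retransmit anything older than N ms) can be sketched as follows. This is a hypothetical illustration with made-up types and names, not the actual tipc_link_timeout() code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for an skb on the transmit queue */
struct txbuf {
	unsigned long sent_ms;   /* time stamp set when (re)transmitted */
	struct txbuf *next;
};

/* On each link timeout, retransmit every buffer older than limit_ms.
 * Returns the number of buffers retransmitted this round. */
static int link_timeout_retransmit(struct txbuf *transmq,
				   unsigned long now_ms,
				   unsigned long limit_ms)
{
	int retransmitted = 0;
	struct txbuf *skb;

	for (skb = transmq; skb; skb = skb->next) {
		/* Queue is oldest-first: once one buffer is young
		 * enough, all the following ones are too */
		if (now_ms - skb->sent_ms < limit_ms)
			break;
		/* ...a clone of skb would be queued for
		 * retransmission here... */
		skb->sent_ms = now_ms;  /* restart the age check */
		retransmitted++;
	}
	return retransmitted;
}
```

Because the queue is ordered oldest-first, the scan stops at the first sufficiently young buffer, so the per-timeout cost is proportional only to the number of overdue buffers, which is what makes this cheaper than one timer per packet.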
From: Jon M. <jon...@er...> - 2019-12-04 20:14:37
I tried different varieties of the solutions discussed below:

1) Ignoring the first 1, 2 or 3 NACKs, and relying on the time stamp for repeated retransmissions.
2) Ignoring the 1st, 2nd, 4th, 5th and all other NACKs which are not a multiple of 3.

Some gave the same throughput as with the patches I posted, but none was definitely better, as far as I could see. However, I didn't do any long series, so there might still be some small percentage improvement I have missed.
///jon

From: Jon Maloy <ma...@do...>
Sent: 4-Dec-19 13:07
To: Jon Maloy <jon...@er...>; Xue, Ying <Yin...@wi...>; xi...@re...; Tuong Tong Lien <tuo...@de...>; Tung Quang Nguyen <tun...@de...>; Hoang Huu Le <hoa...@de...>; John Rutherford <joh...@de...>; tip...@li...; tip...@de...
Subject: Re: [net-next 1/3] tipc: eliminate gap indicator from ACK messages

[Quoted text identical to the previous message; snipped.]
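Variant 2 from Jon's experiment above (honor only NACKs whose count is a multiple of 3, mirroring the "three duplicate ACKs" fast-retransmit heuristic from TCP/SCTP) amounts to a small per-packet gate. A hedged sketch, with hypothetical names, not the TIPC code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-packet retransmission state */
struct pkt_retr {
	unsigned int nacks;  /* NACKs seen for this packet so far */
};

/* Returns true if this NACK should trigger a retransmission */
static bool honor_nack(struct pkt_retr *p)
{
	p->nacks++;
	/* NACKs 1, 2, 4, 5, ... are ignored; 3, 6, 9, ... fire.
	 * Waiting for the 3rd filters out NACKs caused by simple
	 * packet reordering, at the cost of some extra latency when
	 * the packet really is lost. */
	return (p->nacks % 3) == 0;
}
```

This is the trade-off Ying describes earlier in the thread: ignoring early NACKs suppresses redundant retransmissions for merely reordered packets, but delays recovery for genuinely lost ones.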
From: Xue, Y. <Yin...@wi...> - 2019-12-05 15:37:38
Thanks Jon for your comments. I fully agree with you. To introduce a retransmission timer for every message might be not a good idea, but introducing a retransmission timer for a block of messages might be a good idea. In most of time tipc network condition is quite good and TIPC message doesn’t need to cross internet, so RTT value should be slightly constant between two nodes. But different TIPC nodes have different HW capability, which means the RTT values might be quite different between different tipc unicast links. Particularly node’s workload is often changed from time to time, which also has impact on RTT value. In short, it looks finally we really need to dynamically measure RTT and dynamically detect message is lost or not with a retransmission timer by comparing with the adjusted RTT although you and I both dislike this approach. Using retransmission timer to detect TIPC message loss probably should be the majority approach. By contrast, SACK or NACK retransmission mechanism might be a supplementary method like fast retransmission in SCTP or TCP world. Lastly, I don’t object to this series. Instead, this series is quite good because it can bring us a large throughput improvement. We can submit request to merge the series to upstream first. Please add my ack-by if you want. After this series, we can do more experiments under quite different network environments to validate whether it’s worth introducing retransmission timer. Thanks, Ying From: Jon Maloy [mailto:jon...@er...] Sent: Thursday, December 5, 2019 4:14 AM To: Jon Maloy; Xue, Ying; xi...@re...; Tuong Tong Lien; Tung Quang Nguyen; Hoang Huu Le; John Rutherford; tip...@li...; tip...@de... Subject: RE: [net-next 1/3] tipc: eliminate gap indicator from ACK messages I tried different varieties of the solutions discussed below. 1) Ignoring 1st, 2d or 3d first NACKs, and rely on the time stamp for repeated retransmissions. 
2) Ignoring 1st,2d,4th,5th and all other NACKS which are not a multiple of 3 Some gave the same throughput as with the patches I posted, but none was definitely better, as far as I could see. However, I didn’t do any long series, so there might still be some small percentage improvement I have missed. ///jon From: Jon Maloy <ma...@do...> Sent: 4-Dec-19 13:07 To: Jon Maloy <jon...@er...>; Xue, Ying <Yin...@wi...>; xi...@re...; Tuong Tong Lien <tuo...@de...>; Tung Quang Nguyen <tun...@de...>; Hoang Huu Le <hoa...@de...>; John Rutherford <joh...@de...>; tip...@li...; tip...@de... Subject: Re: [net-next 1/3] tipc: eliminate gap indicator from ACK messages Hi Ying, (cc-ing tipc-discussion, since yout original mail seems to have been dropped somewhere, and I want everybody to be able follow the discussion) Actually we do have a kind of SACK mechanism already, but maybe too simple. See below. On Wednesday, December 4, 2019, 11:45:37 AM GMT-5, Xue, Ying <yin...@wi...<mailto:yin...@wi...>> wrote: I don’t know why I received lots of complains from outlook which indicate my below email was not sent to you. I just try to resend. If you received, please ignore this one. Thanks, Ying -----Original Message----- From: Ying Xue [mailto:yin...@wi...<mailto:yin...@wi...>] Sent: Thursday, December 5, 2019 12:39 AM To: Jon Maloy; Jon Maloy Cc: moh...@er...<mailto:moh...@er...>; par...@gm...<mailto:par...@gm...>; tun...@de...<mailto:tun...@de...>; hoa...@de...<mailto:hoa...@de...>; tuo...@de...<mailto:tuo...@de...>; gor...@de...<mailto:gor...@de...>; tip...@li...<mailto:tip...@li...> Subject: Re: [net-next 1/3] tipc: eliminate gap indicator from ACK messages On 12/2/19 8:32 AM, Jon Maloy wrote: > When we increase the link send window we sometimes observe the > following scenario: > > 1) A packet #N arrives out of order far ahead of a sequence of older > packets which are still under way. The packet is added to the > deferred queue. 
> 2) The missing packets arrive in sequence, and for each 16th of them > an ACK is sent back to the receiver, as it should be. > 3) When building those ACK messages, it is checked if there is a gap > between the link's 'rcv_nxt' and the first packet in the deferred > queue. This is always the case until packet number #N-1 arrives, and > a 'gap' indicator is added, effectively turning them into NACK > messages. > 4) When those NACKs arrive at the sender, all the requested > retransmissions are done, since it is a first-time request. > > This sometimes leads to a huge amount of redundant retransmissions, > causing a drop in max throughput. This problem gets worse when we > in a later commit introduce variable window congestion control, > since it drops the link back to 'slow start' much more often > than necessary. > I think the essential thing is about how we should precisely detect if package is really lost. When we can identify a message is really lost, that will be good moment to ask sender to retransmit the lost message to receiver. But it's very difficult to identify this thing. When receiver finds a missing message in its receipt queue, the message has two statuses: 1. The missing message is just out of order. 2. The missing message is really lost. For the first case, when receiver detects the missing message and it sends a retransmission request before the out-of-order message arrives at the receiver, It will cause a redundant retransmission. In order to avoid this situation, we should send retransmission request as late as possible. For the second case, when the missing message is really lost, it will cause one bad thing that lots of messages are queued in sending buffer of sender side if Receiver doesn't timely send retransmission request. In order to avoid this situation, we should send retransmission request as early as possible. As a consequence, the two conclusions we drew above conflict each other. 
It means we cannot just use one method to perfectly solve this problem. So let's take SCTP as an example to understand how it handles this situation: In SCTP, there are two different retransmission mechanisms: one is normal retransmission mechanism, and another is called fast retransmission. In the normal retransmission algorithm, it sets a timer to detect whether message is lost or not. When a message's timer is expired, it supposes the message Is lost and the message is then retransmitted. [jon] This is what we are missing, and as a result first retransmission requests are always done. I have already observed that >90% of those are useless under normal circumstances, since they end up as duplicates. I was able to somewhat improve this figure with my first two patches in the series, but there is still a lot of duplicates. I am sure we can do better here. On a general note, I strongly dislike timers, especially if they are set per packet. Maybe we could just set the control block time stamp to e.g. N ms, and then on each link timeout check the age of the first (and oldest) buffer in the transmit queue. If the age exceeds N ms, we retransmit, and go on to check the next buffer etc. But remember, nobody is using TIPC across the internet, so delay times are short, and it is far likelier that the receiver will detect a gap and send a NACK long before this limit has expired. So, I suspect such a mechanism would be of limited value. Also note that in the last patch we actually introduce something similar in tipc_link_timeout(), although the limit will in practice be 375 ms, and I never once saw it being trigged during my testing. In the fast retransmission algorithm, three SACK messages indicate the same message is missed, the message will be immediately retransmitted before its retransmission timer is expired. [jon] We have a similar mechanism, but we based it on the skb timestamp. 
If a packet has been retransmitted once, we don't allow it to be retransmitted until 1 ms has elapsed, irrespective of number of NACKs. We could of course base this on the already existing retransmission counter instead. I will try that. Regarding my understanding, the normal retransmission mechanism is used to handle the first case to send retransmission request as late as possible. The fast retransmission algorithm is used to handle the second case to send retransmission request earlier than retransmission timer expiration. Therefore, probably we can use its ideas to handle our TIPC cases: 1. We can introduce a retransmission timer to perform delayed retransmission. 2. We should remain the gap indicator. But when receiver reports a gap to sender, it doesn't send NACK to sender. Instead it should SACK to sender. [jon] All our NACKs are in reality SACKs. It is up to the receiver to interpret them correctly. Whenever sender receives NACK, it will immediately retransmit missed messages, which probably cause lots of duplicated messages. But for SACK, It means Selective ACK which contains gap info. But when a sender receives SACK and gap info, it doesn't immediately retransmit the missed message. Instead, it releases the messages in its sending queue which have been acknowledged by SACK message. Meanwhile, once three SACK messages report the same message Is missed, the sender can immediately retransmit the message, which is exactly what the fast retransmission algorithm is doing. Of course, the two suggested approaches are much more complex than our current TIPC link traffic control algorithm. If we decide to introduce retransmission timer, we have to introduce another algorithm to measure RTT time. I am not sure whether we should introduce retransmission timer, but in my opinion, introducing SACK might be a good idea to TIPC. 
[jon] Once we have an RTT measuring mechanism in place, we should be able to calculate an optimal expiration time according to the above suggestion. But I am skeptical of its practical value, at least the way TIPC is used now. I do agree, though, that we need a mechanism for limiting the number of first-time retransmission attempts, which constitute the vast majority of retransmissions, and of which the vast majority are useless. I think I will first try leveraging the skb retransmission counter here, and see what happens.

///jon

About SCTP SACK, please refer to: https://tools.ietf.org/html/rfc4960#page-34

Thanks,
Ying

> We now fix this by not sending any 'gap' indicator in regular ACK
> messages. We already have a mechanism for sending explicit NACKs
> in place, and this is sufficient to keep up the packet flow.
>
> Signed-off-by: Jon Maloy <jon...@er...>
> ---
>  net/tipc/link.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index 24d4d10..6d86446 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1521,7 +1521,8 @@ static int tipc_link_build_nack_msg(struct tipc_link *l,
>                                      struct sk_buff_head *xmitq)
>  {
>      u32 def_cnt = ++l->stats.deferred_recv;
> -    u32 defq_len = skb_queue_len(&l->deferdq);
> +    struct sk_buff_head *dfq = &l->deferdq;
> +    u32 defq_len = skb_queue_len(dfq);
>      int match1, match2;
>
>      if (link_is_bc_rcvlink(l)) {
> @@ -1532,8 +1533,12 @@ static int tipc_link_build_nack_msg(struct tipc_link *l,
>          return 0;
>      }
>
> -    if (defq_len >= 3 && !((defq_len - 3) % 16))
> -        tipc_link_build_proto_msg(l, STATE_MSG, 0, 0, 0, 0, 0, xmitq);
> +    if (defq_len >= 3 && !((defq_len - 3) % 16)) {
> +        u16 rcvgap = buf_seqno(skb_peek(dfq)) - l->rcv_nxt;
> +
> +        tipc_link_build_proto_msg(l, STATE_MSG, 0, 0,
> +                                  rcvgap, 0, 0, xmitq);
> +    }
>      return 0;
>  }
>
> @@ -1631,7 +1636,7 @@ static void tipc_link_build_proto_msg(struct tipc_link *l, int mtyp, bool probe,
>      if (!tipc_link_is_up(l) && (mtyp == STATE_MSG))
>          return;
>
> -    if (!skb_queue_empty(dfq))
> +    if ((probe || probe_reply) && !skb_queue_empty(dfq))
>          rcvgap = buf_seqno(skb_peek(dfq)) - l->rcv_nxt;
>
>      skb = tipc_msg_create(LINK_PROTOCOL, mtyp, INT_H_SIZE,
> @@ -2079,7 +2084,6 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
>      if (rcvgap || reply)
>          tipc_link_build_proto_msg(l, STATE_MSG, 0, reply,
>                                    rcvgap, 0, 0, xmitq);
> -
>      rc |= tipc_link_advance_transmq(l, ack, gap, ga, xmitq);
>
>      /* If NACK, retransmit will now start at right position */
>
|
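The fast-retransmission rule Ying describes, retransmitting only once three SACKs have reported the same missing packet, can be sketched in a few lines. This is a user-space illustration with hypothetical names (`sack_state`, `sack_rcv`), not actual net/tipc code:

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the TCP-style fast-retransmit trigger described above:
 * on every SACK the sender releases acknowledged packets, but only
 * retransmits a reported hole once three SACKs have named the same
 * missing sequence number (the classic dupthresh == 3).
 * All names here are hypothetical, not from net/tipc.
 */
struct sack_state {
	uint16_t last_gap_seqno;  /* first missing seqno in the last SACK */
	int dup_cnt;              /* how many SACKs have reported it */
};

/* Returns true when the sender should fast-retransmit 'gap_seqno' */
static bool sack_rcv(struct sack_state *s, uint16_t gap_seqno)
{
	if (gap_seqno != s->last_gap_seqno) {
		/* a new hole is reported: start counting from one */
		s->last_gap_seqno = gap_seqno;
		s->dup_cnt = 1;
		return false;
	}
	/* the same hole is reported again */
	return ++s->dup_cnt == 3;
}
```

Everything before the third report only releases acked packets; the third duplicate report triggers the retransmission, which is the behavior Ying contrasts with TIPC's retransmit-on-first-NACK.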
From: Tuong L. T. <tuo...@de...> - 2020-03-17 10:55:58
|
Hi Jon,

For the "variable window congestion control" patch, if I remember correctly, it is for the unicast link only? Why did you apply it to the broadcast link, a mistake or ...? It now causes user messages to be delivered out of order on the receiving side, because on the sending side the broadcast link's window is suddenly increased to 300 (i.e. max_t(u16, l->window / 2, 300)) at a packet retransmission, leaving some gaps between the link's 'transmq' & 'backlogq' unexpectedly... Will we fix this by removing it?

@@ -1160,7 +1224,6 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r,
 			continue;
 		if (more(msg_seqno(hdr), to))
 			break;
-
 		if (time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr))
 			continue;
 		TIPC_SKB_CB(skb)->nxt_retr = TIPC_BC_RETR_LIM;
@@ -1173,11 +1236,12 @@ static int tipc_link_bc_retrans(struct tipc_link *l, struct tipc_link *r,
 		_skb->priority = TC_PRIO_CONTROL;
 		__skb_queue_tail(xmitq, _skb);
 		l->stats.retransmitted++;
-
+		retransmitted++;
 		/* Increase actual retrans counter & mark first time */
 		if (!TIPC_SKB_CB(skb)->retr_cnt++)
 			TIPC_SKB_CB(skb)->retr_stamp = jiffies;
 	}
+	tipc_link_update_cwin(l, 0, retransmitted); // ???
 	return 0;
 }

+static void tipc_link_update_cwin(struct tipc_link *l, int released,
+				  bool retransmitted)
+{
+	int bklog_len = skb_queue_len(&l->backlogq);
+	struct sk_buff_head *txq = &l->transmq;
+	int txq_len = skb_queue_len(txq);
+	u16 cwin = l->window;
+
+	/* Enter fast recovery */
+	if (unlikely(retransmitted)) {
+		l->ssthresh = max_t(u16, l->window / 2, 300);
+		l->window = l->ssthresh;
+		return;
+	}

BR/Tuong

-----Original Message-----
From: Jon Maloy <jon...@er...>
Sent: Monday, December 2, 2019 7:33 AM
To: Jon Maloy <jon...@er...>; Jon Maloy <ma...@do...>
Cc: moh...@er...; par...@gm...; tun...@de...; hoa...@de...; tuo...@de...; gor...@de...; yin...@wi...; tip...@li...
Subject: [net-next 3/3] tipc: introduce variable window congestion control

We introduce a simple variable window congestion control for links.
The algorithm is inspired by the Reno algorithm, covering the 'slow
start', 'congestion avoidance', and 'fast recovery' modes.

- We introduce hard lower and upper window limits per link, still
  different and configurable per bearer type.

- We introduce a 'slow start threshold' variable, initially set to
  the maximum window size.

- We let a link start at the minimum congestion window, i.e. in slow
  start mode, and then let it grow rapidly (+1 per received ACK) until
  it reaches the slow start threshold and enters congestion avoidance
  mode.

- In congestion avoidance mode we increment the congestion window for
  each window_size number of acked packets, up to a possible maximum
  equal to the configured maximum window.

- For each non-duplicate NACK received, we drop back to fast recovery
  mode, by setting both the slow start threshold and the congestion
  window to (current_congestion_window / 2).

- If the timeout handler finds that the transmit queue has not moved
  since the last timeout, it drops the link back to slow start and
  forces a probe containing the last sent sequence number to be sent
  to the peer.

This change in reality has effect only on unicast ethernet transport,
as we have seen that there is no room whatsoever for increasing the
max window size for the UDP bearer. For now, we also choose to keep
the limits for the broadcast link unchanged and equal.

This algorithm seems to give a 50-100% throughput improvement for
messages larger than MTU.

Suggested-by: Xin Long <luc...@gm...>
Acked-by: Xin Long <luc...@gm...>
Signed-off-by: Jon Maloy <jon...@er...>
---
 net/tipc/bcast.c     |  11 ++--
 net/tipc/bearer.c    |  11 ++--
 net/tipc/bearer.h    |   6 +-
 net/tipc/eth_media.c |   3 +-
 net/tipc/ib_media.c  |   5 +-
 net/tipc/link.c      | 175 +++++++++++++++++++++++++++++++++++----------------
 net/tipc/link.h      |   9 +--
 net/tipc/node.c      |  16 ++--
 net/tipc/udp_media.c |   3 +-
 9 files changed, 160 insertions(+), 79 deletions(-)
|
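The bullet list in the commit message maps onto a small state machine. The following is a simplified user-space model of that behavior (slow start, congestion avoidance, fast recovery) with invented names; it is not the code in net/tipc/link.c, and the min_win floor in the NACK path is an assumption of this sketch:

```c
#include <stdint.h>

/* User-space model of the per-link congestion window described in the
 * commit message. min_win/max_win are the hard per-bearer limits;
 * ssthresh starts at max_win, so a new link begins in slow start.
 */
struct cwin {
	uint16_t window;    /* current congestion window */
	uint16_t ssthresh;  /* slow start threshold */
	uint16_t min_win;
	uint16_t max_win;
	uint16_t cong_acks; /* acks counted in congestion avoidance */
};

static void cwin_init(struct cwin *c, uint16_t min_win, uint16_t max_win)
{
	c->min_win = min_win;
	c->max_win = max_win;
	c->window = min_win;   /* start in slow start mode */
	c->ssthresh = max_win;
	c->cong_acks = 0;
}

/* Called once per received ACK covering 'acked' packets */
static void cwin_on_ack(struct cwin *c, uint16_t acked)
{
	if (c->window < c->ssthresh) {
		c->window++;  /* slow start: +1 per ACK */
	} else {
		/* congestion avoidance: +1 per window's worth of acks */
		c->cong_acks += acked;
		if (c->cong_acks >= c->window) {
			c->cong_acks = 0;
			c->window++;
		}
	}
	if (c->window > c->max_win)
		c->window = c->max_win;
}

/* Called on a non-duplicate NACK: drop back to fast recovery.
 * The floor at min_win is this sketch's assumption, not in the text. */
static void cwin_on_nack(struct cwin *c)
{
	c->ssthresh = c->window / 2;
	if (c->ssthresh < c->min_win)
		c->ssthresh = c->min_win;
	c->window = c->ssthresh;
}
```

Starting at min_win = 50 with max_win = 500, a hundred ACKs in slow start grow the window to 150; a NACK then halves both window and ssthresh to 75, after which growth continues at roughly one packet per window of acks.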
From: Jon M. <jm...@re...> - 2020-03-17 18:38:09
|
On 3/17/20 6:55 AM, Tuong Lien Tong wrote:
> Hi Jon,
>
> For the "variable window congestion control" patch, if I remember correctly,
> it is for unicast link only? Why did you apply it for broadcast link, a
> mistake or ...?
I did it so the code would be the same everywhere. Then, by setting both
min_win and max_win to the same value BC_LINK_WIN_DEFAULT (==50) in the
broadcast send link, this window should never change.

> It now causes user messages disordered on the receiving side, because on the
> sending side, the broadcast link's window is suddenly increased to 300 (i.e.
> max_t(u16, l->window / 2, 300)) at a packet retransmission, leaving some
> gaps between the link's 'transmq' & 'backlogq' unexpectedly... Will we fix
> this by removing it?
That is clearly a bug that breaks the above stated limitation.
It should be sufficient to check that also l->ssthresh never exceeds
l->max_win to remedy this.

///jon

> +	/* Enter fast recovery */
> +	if (unlikely(retransmitted)) {
> +		l->ssthresh = max_t(u16, l->window / 2, 300);
> +		l->window = l->ssthresh;
> +		return;
> +	}
>
> [... rest of the quoted patch snippet and commit message trimmed;
> they are quoted in full earlier in the thread ...]
|
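Jon's proposed remedy, capping l->ssthresh at l->max_win, can be illustrated with a toy model. The struct and function names below are invented for illustration; only the max_t(u16, window / 2, 300) expression comes from the quoted patch:

```c
#include <stdint.h>

/* Model of the fast-recovery step under discussion. The broadcast
 * link is configured with min_win == max_win == 50
 * (BC_LINK_WIN_DEFAULT), so its window must never change; yet
 * max_t(u16, window / 2, 300) jumps it to 300. The remedy suggested
 * above is to also cap ssthresh at the link's max_win.
 * Names here are hypothetical, not from net/tipc.
 */
static uint16_t u16_max(uint16_t a, uint16_t b) { return a > b ? a : b; }
static uint16_t u16_min(uint16_t a, uint16_t b) { return a < b ? a : b; }

struct link_win {
	uint16_t window;
	uint16_t ssthresh;
	uint16_t max_win;
};

/* The version quoted in the thread: window can exceed max_win */
static void enter_fast_recovery_buggy(struct link_win *l)
{
	l->ssthresh = u16_max(l->window / 2, 300);
	l->window = l->ssthresh;
}

/* With the suggested max_win cap on ssthresh */
static void enter_fast_recovery_capped(struct link_win *l)
{
	l->ssthresh = u16_min(u16_max(l->window / 2, 300), l->max_win);
	l->window = l->ssthresh;
}
```

With the broadcast configuration (window = ssthresh = max_win = 50), the quoted code jumps the window to 300 on a retransmission, while the capped variant leaves it at 50, preserving the stated invariant.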
From: Tuong L. T. <tuo...@de...> - 2020-03-18 04:50:23
|
Hi Jon,

Ok, that makes sense (but we should have covered the case where a broadcast packet is released too...). However, I have another concern about the logic here:

> +	/* Enter fast recovery */
> +	if (unlikely(retransmitted)) {
> +		l->ssthresh = max_t(u16, l->window / 2, 300);
> +		l->window = l->ssthresh;
> +		return;
> +	}

What will happen if we have a retransmission while the link is still in the slow-start phase? For example:

l->ssthresh = 300
l->window = 60

==> retransmitted = true, then: l->ssthresh = 300; l->window = 300???

This looks incorrect. Should it be:

> +	/* Enter fast recovery */
> +	if (unlikely(retransmitted)) {
> +		l->ssthresh = max_t(u16, l->window / 2, 300);
> -		l->window = l->ssthresh;
> +		l->window = min_t(u16, l->window, l->ssthresh);
> +		return;
> +	}

Will that fix the issue with the broadcast case as well?

BR/Tuong

-----Original Message-----
From: Jon Maloy <jm...@re...>
Sent: Wednesday, March 18, 2020 1:38 AM
To: Tuong Lien Tong <tuo...@de...>; 'Jon Maloy' <jon...@er...>; 'Jon Maloy' <ma...@do...>
Cc: tip...@li...; moh...@er...
Subject: Re: [tipc-discussion] [net-next 3/3] tipc: introduce variable window congestion control

On 3/17/20 6:55 AM, Tuong Lien Tong wrote:
> Hi Jon,
>
> For the "variable window congestion control" patch, if I remember correctly,
> it is for unicast link only? Why did you apply it for broadcast link, a
> mistake or ...?
I did it so the code would be the same everywhere. Then, by setting both
min_win and max_win to the same value BC_LINK_WIN_DEFAULT (==50)
in the broadcast send link this window should never change.

> It now causes user messages disordered on the receiving side, because on the
> sending side, the broadcast link's window is suddenly increased to 300 (i.e.
> max_t(u16, l->window / 2, 300)) at a packet retransmission, leaving some
> gaps between the link's 'transmq' & 'backlogq' unexpectedly... Will we fix
> this by removing it?
That is clearly a bug that breaks the above stated limitation.
It should be sufficient to check that also l->ssthresh never exceeds
l->max_win to remedy this.

///jon

> [... rest of the quoted patch snippet and commit message trimmed;
> they are quoted in full earlier in the thread ...]
|
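Tuong's min_t() variant can be checked with the same kind of toy model. The names are hypothetical; the arithmetic follows the example above (ssthresh = 300, window = 60 at the moment of retransmission):

```c
#include <stdint.h>

/* The point raised above: entering fast recovery while still in slow
 * start must not *raise* the window. With window = 60 and
 * ssthresh = 300, 'l->window = l->ssthresh' would jump the window
 * from 60 to 300; taking min(window, ssthresh) leaves it at 60.
 * Hypothetical names, user-space model only.
 */
struct win_state {
	uint16_t window;
	uint16_t ssthresh;
};

static void fast_recovery_fixed(struct win_state *l)
{
	uint16_t half = l->window / 2;

	l->ssthresh = half > 300 ? half : 300;
	/* never grow the window on a retransmission */
	l->window = l->window < l->ssthresh ? l->window : l->ssthresh;
}
```

In the slow-start example the window stays at 60, while a link already past ssthresh (say window = 800) is still halved to 400, which is exactly the Reno fast-recovery behavior intended.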
From: Jon M. <jm...@re...> - 2020-03-18 14:47:34
|
On 3/18/20 12:50 AM, Tuong Lien Tong wrote:
> Hi Jon,
>
> Ok, that makes sense (but we should have covered the case a broadcast packet is released too...).
> However, I have another concern about the logic here:
>
>> +	/* Enter fast recovery */
>> +	if (unlikely(retransmitted)) {
>> +		l->ssthresh = max_t(u16, l->window / 2, 300);
>> +		l->window = l->ssthresh;
>> +		return;
>> +	}
> What will happen if we have a retransmission when it's still in the slow-start phase? For example:
> l->ssthresh = 300
> l->window = 60
> ==> retransmitted = true, then: l->ssthresh = 300; l->window = 300???
>
> This looks incorrect. Should it be:
>
>> +	/* Enter fast recovery */
>> +	if (unlikely(retransmitted)) {
>> +		l->ssthresh = max_t(u16, l->window / 2, 300);
>> -		l->window = l->ssthresh;
>> +		l->window = min_t(u16, l->window, l->ssthresh);
>> +		return;
>> +	}
> Will that fix the issue with the broadcast case as well?
Yes, this would fix both issues, so I think it is a good suggestion.
To my surprise I have realized that this code has not been released yet
(I only find it in 5.6-rc1 and later versions), but maybe that is just
as well ;-)

///jon

> [... rest of the quoted thread trimmed; it is quoted in full earlier ...]
|
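For completeness, the two remedies discussed in this thread, capping ssthresh at max_win and never letting fast recovery raise the window, compose naturally. This is a sketch under those assumptions with invented names, not the eventual upstream patch:

```c
#include <stdint.h>

/* Composition of the two remedies from the thread: cap ssthresh at
 * the link's max_win (fixes the broadcast link, where
 * min_win == max_win == 50), and never let fast recovery raise the
 * window (fixes a retransmission during slow start). A sketch only;
 * the eventual upstream fix may be shaped differently.
 */
struct l_state {
	uint16_t window;
	uint16_t ssthresh;
	uint16_t max_win;
};

static void enter_fast_recovery(struct l_state *l)
{
	uint16_t thresh = l->window / 2;

	if (thresh < 300)
		thresh = 300;
	if (thresh > l->max_win)      /* never exceed max_win */
		thresh = l->max_win;
	l->ssthresh = thresh;
	if (l->window > l->ssthresh)  /* never raise the window */
		l->window = l->ssthresh;
}
```

The three scenarios from the thread then all behave as intended: the broadcast link (50/50/50) is left untouched, a unicast link in slow start (window 60, ssthresh 300) keeps its window, and a link deep in congestion avoidance (window 900) is halved to 450.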