From: Jon M. <jon...@er...> - 2019-07-19 15:35:41
See below.

> -----Original Message-----
> From: tung quang nguyen <tun...@de...>
> Sent: 19-Jul-19 01:41
> To: Jon Maloy <jon...@er...>; 'Jon Maloy' <ma...@do...>; tip...@li...; yin...@wi...
> Cc: Hoang Huu Le <hoa...@de...>; Tuong Tong Lien <tuo...@de...>; John Rutherford <joh...@de...>
> Subject: RE: [net 1/1] tipc: reduce risk of wakeup queue starvation
>
> Hi Jon,
>
> See below.
>
> Best regards,
> Tung Nguyen
>
> -----Original Message-----
> From: Jon Maloy <jon...@er...>
> Sent: Friday, July 19, 2019 10:05 AM
> To: Tung Quang Nguyen <tun...@de...>; 'Jon Maloy' <ma...@do...>; tip...@li...; yin...@wi...
> Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>; par...@gm...; Hoang Huu Le <hoa...@de...>; Canh Duc Luu <can...@de...>; Tuong Tong Lien <tuo...@de...>; Gordan Mihaljevic <gor...@de...>
> Subject: RE: [net 1/1] tipc: reduce risk of wakeup queue starvation
>
> Hi Tung,
> I did of course do some measurements before sending out the patch yesterday, but I saw no significant difference in performance between the methods.
> The results were within the range of normal stochastic variation, and making many runs only confirmed that.
> This was also what I was expecting.
>
> I have now made a couple of changes to my patch:
> 1) I took into account that backlog[imp].len is often larger than backlog[imp].limit (because of the oversubscription) when calculating the available number of backlog queue slots, resulting in a negative avail[imp] value. This would have the effect that we sometimes keep submitting wakeup jobs when there are no slots left in the backlog queue. Those jobs will be launched, run, immediately find that we are at congestion again, and be suspended a second time. That is a gross waste of CPU resources. (This is a weakness of your algorithm, too.)
> [Tung]: Exactly, but the lists of skbs are added to the link backlog queues after this "Those jobs will be launched, run, immediately find that we are at congestion again". That is the reason I chose to schedule all wakeup messages right after the condition is met. All sleeping sockets will be woken up and send their messages to the link backlog queues before being put into the sleeping state again. With that, we can reduce the risk of starvation.

True. But doing it this way is a very aggressive approach, as the backlog queues may grow *very* long. The use of oversubscription in practice means overriding the backlog queue limit, and my fear is that those queues can accumulate to insane lengths. If you look at the statistics in my previous mail, you will see that 94 wakeup jobs for 65536-byte messages mean that 94 * 46 = 4324 packets have been added to the queue *beyond* the backlog queue limit. And nothing stops it from growing further if new sockets come along and start transmitting.

My algorithm is an attempt to limit this effect, so that we only wake up as many jobs as we think there is space for in the backlog queue. The algorithm is far from perfect. E.g., when counting free slots in the backlog queue we count packets, while each job we launch may send multi-packet messages. I.e., if in the example above we have freed up 10 slots in the backlog queue, we launch 10 jobs, each of which will add 46 packets to the queue, a total of 460 packets. The remaining 84 jobs will have to wait, maybe with another risk of starvation. We should try to figure out if this can be done in a smarter way somehow.
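A minimal standalone sketch of the clamped slot accounting described in change 1) above, with made-up queue limits and lengths; this is not the kernel code, the real counters live in l->backlog[imp] in net/tipc/link.c:

/*
 * With oversubscription a level's len may exceed its limit, so the raw
 * "limit - len" goes negative and must be clamped to zero, otherwise we
 * would keep scheduling wakeup jobs for a level that has no room left.
 */
#include <stdio.h>

#define LEVELS 5	/* importance levels 0..TIPC_SYSTEM_IMPORTANCE */

int main(void)
{
	int limit[LEVELS] = { 50, 100, 150, 200, 300 };	/* made-up limits */
	int len[LEVELS]   = { 30, 146, 150,  10,   0 };	/* level 1 oversubscribed */
	int imp;

	for (imp = 0; imp < LEVELS; imp++) {
		int raw = limit[imp] - len[imp];
		int avail = raw > 0 ? raw : 0;	/* clamp at zero */

		printf("imp %d: limit %3d, len %3d, raw %4d, avail %3d\n",
		       imp, limit[imp], len[imp], raw, avail);
	}
	return 0;
}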
> Your patch V2 was exactly my first patch except the prepending of wakeup messages to the queue tail. But I thought more about the above-mentioned scenario and changed to the patch I sent to you, to avoid the risk completely.
>
> 2) After adding some statistics counters, I realized that there are practically never any wakeup messages in the input queue when we run prepare_wakeup(), so we can just as well skip the slightly expensive step of counting and rescheduling those jobs. This may lead to us issuing too many wakeup jobs on rare occasions, but it seems to be worth the cost.
>
> 3) I added the wakeup jobs to the tail of the input queue, instead of to the head, so they will be run after regular input queue messages. I felt slightly uncomfortable with prepending the wakeup messages, as it might lead to surprising new behaviors. This is in practice what you are doing, too, since you check the wakeup queue and call tipc_sk_wakeup_rcv() after we have checked the input queue and called tipc_sk_rcv().
>
> I also re-ran my measurements with the new version, many times, and the pattern was always the same.
> [Tung]: I verified and confirm it. It is better than the first version, which showed very bad results in all my rounds of execution.
> Let's wait and see if this patch could fix the issue.

Yes. If it doesn't, we will have to eliminate the "available backlog slot" limitation again, like you did in your algorithm, but I would prefer to avoid that. In my view, chances are good that this will work; I think the real culprit was the 10-limit counter, which was accumulated across all importance levels.

BR
///jon

> Below is a sample.
>
> No patch:
> -------------
>
> node1 ~ # bmc -t -c100
> ****** TIPC Benchmark Client Started ******
> Transferring 64000 messages in TIPC Throughput Benchmark
> | Msg Size [octets] | # Conns | # Msgs/Conn | Elapsed [ms] | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] |
> |                64 |     100 |       64000 |         5635 |       1135714 |          581 |               5 |
> |               256 |     100 |       32000 |         4221 |        758006 |         1552 |              15 |
> |              1024 |     100 |       16000 |         9023 |        177316 |         1452 |              14 |
> |              4096 |     100 |        8000 |        10238 |         78139 |         2560 |              25 |
> |             16384 |     100 |        4000 |        15651 |         25557 |         3349 |              33 |
> |             65536 |     100 |        2000 |        29197 |          6849 |         3590 |              35 |
> Completed Throughput Benchmark
> ****** TIPC Benchmark Client Finished ******
> node1 ~ #
>
> Tung's patch:
> -----------------
> node1 ~ # bmc -t -c100
> ****** TIPC Benchmark Client Started ******
> Transferring 64000 messages in TIPC Throughput Benchmark
> | Msg Size [octets] | # Conns | # Msgs/Conn | Elapsed [ms] | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] |
> |                64 |     100 |       64000 |         5906 |       1083527 |          554 |               5 |
> |               256 |     100 |       32000 |         3510 |        911531 |         1866 |              18 |
> |              1024 |     100 |       16000 |         9594 |        166755 |         1366 |              13 |
> |              4096 |     100 |        8000 |         9738 |         82144 |         2691 |              26 |
> |             16384 |     100 |        4000 |        15760 |         25379 |         3326 |              33 |
> |             65536 |     100 |        2000 |        30615 |          6532 |         3424 |              34 |
> Completed Throughput Benchmark
> ****** TIPC Benchmark Client Finished ******
> node1 ~ #
>
> Jon's patch, v2:
> -----------------
> node1 ~ # bmc -t -c100
> | Msg Size [octets] | # Conns | # Msgs/Conn | Elapsed [ms] | Total [Msg/s] | Total [Mb/s] | Per Conn [Mb/s] |
> |                64 |     100 |       64000 |         5465 |       1171064 |          599 |               5 |
> |               256 |     100 |       32000 |         3270 |        978490 |         2003 |              20 |
> |              1024 |     100 |       16000 |         6987 |        228964 |         1875 |              18 |
> |              4096 |     100 |        8000 |         9562 |         83657 |         2741 |              27 |
> |             16384 |     100 |        4000 |        15603 |         25635 |         3360 |              33 |
> |             65536 |     100 |        2000 |        28385 |          7045 |         3693 |              36 |
> Completed Throughput Benchmark
> ****** TIPC Benchmark Client Finished ******
> node1 ~ #
>
> BR
> ///jon
>
> > -----Original Message-----
> > From: tung quang nguyen <tun...@de...>
> > Sent: 18-Jul-19 06:03
> > To: Jon Maloy <jon...@er...>; 'Jon Maloy' <ma...@do...>; tip...@li...; yin...@wi...
> > Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>; par...@gm...; Hoang Huu Le <hoa...@de...>; Canh Duc Luu <can...@de...>; Tuong Tong Lien <tuo...@de...>; Gordan Mihaljevic <gor...@de...>
> > Subject: RE: [net 1/1] tipc: reduce risk of wakeup queue starvation
> >
> > Hi Jon,
> >
> > Your patch is too complex. There will be a lot of overhead when grabbing/releasing the locks (3 times) and using 2 loops.
> > As a result, the performance of the benchmark test is reduced significantly, while my patch shows the same performance as the original code.
> >
> > This is the comparison between my patch and yours after running the benchmark using 100 connections (message size in bytes: x% better):
> > - 64: 23.5%
> > - 256: 30.2%
> > - 1024: 19.5%
> > - 4096: 14.9%
> > - 16384: 6.7%
> > - 65536: 2.4%
> >
> > So, I do not think your patch would solve the issue.
> >
> > Thanks.
> >
> > Best regards,
> > Tung Nguyen
> >
> > -----Original Message-----
> > From: Jon Maloy <jon...@er...>
> > Sent: Thursday, July 18, 2019 4:22 AM
> > To: Jon Maloy <ma...@do...>
> > Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>; par...@gm...; Tung Quang Nguyen <tun...@de...>; Hoang Huu Le <hoa...@de...>; Canh Duc Luu <can...@de...>; Tuong Tong Lien <tuo...@de...>; Gordan Mihaljevic <gor...@de...>; yin...@wi...; tipc-dis...@li...
> > Subject: RE: [net 1/1] tipc: reduce risk of wakeup queue starvation
> >
> > Hi Tung,
> > After thinking more about this, I realized that the problem is a little more complex than I first imagined.
> > We must also take into account that there may still be old wakeup messages in the input queue when we are trying to add new ones. Those may be scattered around in the input queue because new data messages have arrived before they were scheduled.
> > So, we must make sure that those are still placed at the head of the input queue, before any new wakeup messages, which should be before any data messages.
> > Those messages should also be counted when we calculate the available space in the backlog queue, so that there are never more pending wakeup jobs than there are available slots in that queue.
> > So, I ended up writing my own patch for this; I hope you don't mind.
> > I tested it as far as I could, but I assume you have a more relevant test program for this than I do.
> >
> > BR
> > ///jon
> >
> >
> > > -----Original Message-----
> > > From: Jon Maloy
> > > Sent: 17-Jul-19 16:56
> > > To: Jon Maloy <jon...@er...>; Jon Maloy <ma...@do...>
> > > Cc: Mohan Krishna Ghanta Krishnamurthy <moh...@er...>; par...@gm...; Tung Quang Nguyen <tun...@de...>; Hoang Huu Le <hoa...@de...>; Canh Duc Luu <can...@de...>; Tuong Tong Lien <tuo...@de...>; Gordan Mihaljevic <gor...@de...>; yin...@wi...; tipc-dis...@li...
> > > Subject: [net 1/1] tipc: reduce risk of wakeup queue starvation
> > >
> > > In commit 365ad353c256 ("tipc: reduce risk of user starvation during link congestion") we allowed senders to add exactly one list of extra buffers to the link backlog queues during link congestion (aka "oversubscription"). However, the criterion for when to stop adding wakeup messages to the input queue when the overload abates is inaccurate, and may cause starvation problems during very high load.
> > >
> > > Currently, we stop adding wakeup messages after 10 total failed attempts where we find that there is no space left in the backlog queue for a certain importance level. The counter for this is accumulated across all levels, which may lead the algorithm to leave the loop prematurely, although there may still be plenty of space available at some levels.
> > > The result is sometimes that messages near the wakeup queue tail are not added to the input queue as they should be.
> > >
> > > We now introduce a more exact algorithm, where we don't stop adding messages until all backlog queue levels have saturated or there are no more wakeup messages to dequeue.
> > >
> > > Fixes: 365ad35 ("tipc: reduce risk of user starvation during link congestion")
> > > Reported-by: Tung Nguyen <tun...@de...>
> > > Signed-off-by: Jon Maloy <jon...@er...>
> > > ---
> > >  net/tipc/link.c | 43 +++++++++++++++++++++++++++++++++++--------
> > >  1 file changed, 35 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/net/tipc/link.c b/net/tipc/link.c
> > > index 66d3a07..1f41523 100644
> > > --- a/net/tipc/link.c
> > > +++ b/net/tipc/link.c
> > > @@ -853,18 +853,45 @@ static int link_schedule_user(struct tipc_link *l, struct tipc_msg *hdr)
> > >   */
> > >  static void link_prepare_wakeup(struct tipc_link *l)
> > >  {
> > > +	struct sk_buff_head *wakeupq = &l->wakeupq;
> > > +	struct sk_buff_head *inputq = l->inputq;
> > >  	struct sk_buff *skb, *tmp;
> > > -	int imp, i = 0;
> > > +	struct sk_buff_head tmpq;
> > > +	int avail[5] = {0,};
> > > +	int imp = 0;
> > > +
> > > +	__skb_queue_head_init(&tmpq);
> > > +
> > > +	for (; imp <= TIPC_SYSTEM_IMPORTANCE; imp++)
> > > +		avail[imp] = l->backlog[imp].limit - l->backlog[imp].len;
> > >
> > > -	skb_queue_walk_safe(&l->wakeupq, skb, tmp) {
> > > +	/* Already pending wakeup messages in inputq must come first */
> > > +	spin_lock_bh(&inputq->lock);
> > > +	skb_queue_walk_safe(inputq, skb, tmp) {
> > > +		if (msg_user(buf_msg(skb)) != SOCK_WAKEUP)
> > > +			continue;
> > > +		__skb_unlink(skb, inputq);
> > > +		__skb_queue_tail(&tmpq, skb);
> > >  		imp = TIPC_SKB_CB(skb)->chain_imp;
> > > -		if (l->backlog[imp].len < l->backlog[imp].limit) {
> > > -			skb_unlink(skb, &l->wakeupq);
> > > -			skb_queue_tail(l->inputq, skb);
> > > -		} else if (i++ > 10) {
> > > -			break;
> > > -		}
> > > +		if (avail[imp])
> > > +			avail[imp]--;
> > >  	}
> > > +	spin_unlock_bh(&inputq->lock);
> > > +
> > > +	spin_lock_bh(&wakeupq->lock);
> > > +	skb_queue_walk_safe(wakeupq, skb, tmp) {
> > > +		imp = TIPC_SKB_CB(skb)->chain_imp;
> > > +		if (!avail[imp])
> > > +			continue;
> > > +		avail[imp]--;
> > > +		__skb_unlink(skb, wakeupq);
> > > +		__skb_queue_tail(&tmpq, skb);
> > > +	}
> > > +	spin_unlock_bh(&wakeupq->lock);
> > > +
> > > +	spin_lock_bh(&inputq->lock);
> > > +	skb_queue_splice(&tmpq, inputq);
> > > +	spin_unlock_bh(&inputq->lock);
> > >  }
> > >
> > >  void tipc_link_reset(struct tipc_link *l)
> > > --
> > > 2.1.4
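For illustration, a standalone sketch of the packets-versus-messages overshoot discussed earlier in the thread. It uses the figures from the mail (10 free slots, 65536-byte messages); the ~46 fragments per message is an assumption about fragmentation, not a number taken from the kernel:

/*
 * Free backlog slots are counted in packets, but each woken-up job may
 * queue a whole multi-packet message, so launching one job per free slot
 * overshoots the backlog limit.
 */
#include <stdio.h>

int main(void)
{
	int free_slots = 10;	/* packet slots left at this importance level */
	int pkts_per_msg = 46;	/* approx. fragments per 65536-byte message */
	int jobs = free_slots;	/* one wakeup job launched per free slot */
	int added = jobs * pkts_per_msg;

	printf("%d jobs add %d packets, i.e. %d beyond the %d free slots\n",
	       jobs, added, added - free_slots, free_slots);
	return 0;
}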