From: Stephens, A. <all...@wi...> - 2010-08-31 20:11:48
Hi Jon:

I agree that the proposed solution is not ideal, but it does appear to be an acceptable interim solution to the problem at hand. As far as I can tell, there are no new side effects introduced by the act of ignoring an "unprocessable" first fragment message -- TIPC simply ends up doing exactly what it would have done if the message had actually been lost while passing over the bearer. I also acknowledge that performance will take a hit whenever this condition arises -- potentially stalling the link until the problematic fragment can be processed. However, I don't see this as being any different from what happens in other cases where a TIPC network isn't really configured for the traffic it's being asked to carry (e.g. a link can also get stalled when its transmit queue length threshold kicks in for other reasons).

A couple of things to keep in mind:

1) This fix doesn't have any impact unless a node runs out of sk_buffs for reassembling incoming fragmented messages, so it shouldn't affect the vast majority of TIPC users. (Max Filippov's recent email on this topic indicates that he doesn't see this condition on Linux-based TIPC nodes.)

2) Since TIPC touts itself as a reliable transport mechanism, users expect their messages to get through (if at all possible). I think users will be happier with a system that delivers all messages at a lower rate than one which has better performance but which loses messages periodically. (Especially if the performance issue can be solved by tuning the node's configuration settings.)

Eventually, of course, it would be nice to enhance TIPC so that there is no need to allocate any additional sk_buffs for message reassembly. (I don't think your suggestion of chaining together incoming fragments and then allocating a new buffer once they are all present really solves anything -- it merely delays the inevitable discard that will have to occur if no such buffer can be obtained.) Until that time, I think Howard's suggested fix is a relatively quick solution to an observed problem.

Regards, Al

________________________________
From: Jon Maloy [mailto:jon...@er...]
Sent: Tuesday, August 31, 2010 3:45 PM
To: Stephens, Allan
Cc: tip...@li...
Subject: RE: Solution to problem of losing fragmented messages

Hi Allan,

I am sceptical of this solution. Reverting the link state is something I have always tried to avoid. What do we do with the fragment reception counter? Are there more side effects?

One drawback is clearly that if somebody sends a 66 KB message and fails to allocate the big buffer up front, the whole link will in practice be stalled until the memory squeeze abates. That is, all other users of the link, with potentially important message interchange, will be suffering.

Why not just chain up the received fragments *without* allocating the buffer from the beginning? Then we can allocate and reassemble at the last fragment. This may require some metadata (about reassembly state) to be kept, but that can probably be added as a regular member of the link structure, or maybe even as "pseudo header fields" in the first buffer, similar to what we do today.

(Later on, when we feel comfortable with this solution, we can move the defragmentation where it belongs: to the socket layer. That is, the whole buffer chain is sent up to the socket layer, and the reassembly goes directly to user memory, within the context of a read() call. Thus there will be no need for any intermediate buffer at all.)

Regards
///jon

Jon Maloy, M.Sc. EE
Researcher
Ericsson Canada
Broadband and Systems Research
8400 Decarie, H4P 2N2, Montreal, Quebec, Canada
Phone: +1 514 345-7900 x42056
Mobile: +1 514 591-5578
jon...@er...
www.ericsson.com
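For illustration, here is a minimal sketch of the chaining idea Jon describes: queue each arriving fragment without allocating the reassembly buffer, and perform the single large allocation only when the last fragment arrives. The struct reasm_state and chain_recv_fragment() names are hypothetical and do not exist in TIPC; buf_msg(), msg_data(), msg_data_sz() and buf_acquire() are the real helpers that appear in the code quoted later in this thread.

    /*
     * Hypothetical sketch only -- "chain first, allocate last".
     * st is assumed to have been initialized from the first
     * fragment's header (expected fragment count, total size).
     */
    struct reasm_state {
            struct sk_buff *frag_head;      /* fragments received so far */
            struct sk_buff *frag_tail;
            u32 expected_frags;             /* fragments still outstanding */
            u32 total_sz;                   /* size of reassembled message */
    };

    /*
     * Returns 1 when *rbuf holds the complete message, 0 if more
     * fragments are needed, -1 if the single final allocation failed
     * (the chain is kept intact, so a later retry remains possible).
     */
    static int chain_recv_fragment(struct reasm_state *st,
                                   struct sk_buff *frag,
                                   struct sk_buff **rbuf)
    {
            struct sk_buff *crs;
            struct tipc_msg *msg;
            u32 offset = 0;

            /* Chain the fragment; no large buffer is needed yet */
            frag->next = NULL;
            if (st->frag_tail)
                    st->frag_tail->next = frag;
            else
                    st->frag_head = frag;
            st->frag_tail = frag;

            if (--st->expected_frags != 0)
                    return 0;               /* keep waiting */

            /* Last fragment arrived: allocate once, copy the chain */
            *rbuf = buf_acquire(st->total_sz);
            if (*rbuf == NULL)
                    return -1;              /* memory squeeze */

            for (crs = st->frag_head; crs != NULL; crs = crs->next) {
                    msg = buf_msg(crs);
                    skb_copy_to_linear_data_offset(*rbuf, offset,
                                                   msg_data(msg),
                                                   msg_data_sz(msg));
                    offset += msg_data_sz(msg);
            }
            /* caller can now discard the fragment chain */
            return 1;
    }

Note that buf_acquire() can still fail at the last fragment, which is the scenario Al's reply above refers to when he says the chaining merely delays the inevitable discard.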
________________________________
From: Stephens, Allan [mailto:all...@wi...]
Sent: August-31-10 14:57
To: Jon Maloy
Cc: tip...@li...
Subject: Solution to problem of losing fragmented messages

Hi Jon:

It looks like Howard (the Wind River TIPC developer in China) is working on a solution to the long-standing problem of what to do if TIPC is unable to begin reassembling a fragmented message because there isn't an available sk_buff to hold the reassembled message. Basically, his idea is to simply ignore the first fragment completely -- as if the message had never been received -- and let the link code take care of retransmitting it later on. I think his solution for unicast links is more or less correct, but his solution for the broadcast link isn't; I've suggested a solution that should work there too.

Can you take a look at Howard's email (at the bottom), and then my comments (above Howard's email), and let me know if you see any problems with either area?

I can provide more details on the defect that Howard is trying to fix, if needed. Basically, someone was blasting large SOCK_RDM messages down a link and found that between 1% and 10% of the messages were being lost because there weren't enough available buffers to allow them to be reassembled.

Regards, Al

________________________________
From: Stephens, Allan
Sent: Tuesday, August 31, 2010 2:47 PM
To: Xu, Hao (Howard)
Cc: Kang, Qi; Xue, Ying
Subject: RE: WIND00229427

Hi Howard:

[Note: This email contains comments on this defect only; I don't want to discuss multiple different issues in a single email, as it gets too confusing. I'll send separate responses for the other defects you've asked about.]

I like the approach you've taken to dealing with this issue, but more work needs to be done to have a complete solution.

1) It would probably be more Linux-like to have tipc_link_recv_fragment() return -1 if an error occurred, rather than 2. This way, you'd have the following set of return values:

     0 - fragment received, message not complete
     1 - fragment received, message complete
    -1 - fragment not received, message not started

You'll need to modify tipc_recv_msg() accordingly to look for -1 instead of 2.

2) The change you've made to tipc_bclink_recv_pkt() doesn't do anything to force the discarded message fragment to be retransmitted, which is essential if you don't want to lose messages. However, it doesn't look like this can be done as easily as simply decrementing "next_in_no" the way you did for unicast links. Basically, you need to look at all the code that executes after the "if (seqno == next_in)" check and make sure nothing important happens before you check the result of tipc_link_recv_fragment(); this will allow you to bail out if a problem occurs, with the same effect as if the message had never been received, which is what you want.
I think you probably want to code something like this:

    if (seqno == next_in) {
    receive:
            /* Deliver message to destination */

            if (likely(msg_isdata(msg))) {
                    spin_lock_bh(&bc_lock);
                    bclink_accept_pkt(...);
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_port_recv_mcast(buf, NULL);
            } else if (msg_user(msg) == MSG_BUNDLER) {
                    spin_lock_bh(&bc_lock);
                    bclink_accept_pkt(...);
                    bcl->stats.recv_bundles++;
                    bcl->stats.recv_bundled += msg_msgcnt(msg);
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_link_recv_bundle(buf);
            } else if (msg_user(msg) == MSG_FRAGMENTER) {
                    ret = tipc_link_recv_fragment(&node->bclink.defragm,
                                                  &buf, &msg);
                    if (ret < 0) {
                            /* unable to start reassembly -- discard
                             * fragment & wait for retransmission */
                            goto unlock;
                    }
                    spin_lock_bh(&bc_lock);
                    bclink_accept_pkt(...);
                    bcl->stats.recv_fragments++;
                    if (ret > 0) {
                            bcl->stats.recv_fragmented++;
                            msg_set_destnode_cache(msg, tipc_own_addr);
                    }
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_net_route_msg(buf);
            } else if (msg_user(msg) == NAME_DISTRIBUTOR) {
                    ... <and so on>

where bclink_accept_pkt(...) does:

    bclink_update_last_sent(node, seqno);
    node->bclink.last_in = seqno;
    node->bclink.oos_state = 0;
    bcl->stats.recv_info++;

    /*
     * Unicast an ACK periodically, ensuring that
     * all nodes in the cluster don't ACK at the same time
     */
    if (((seqno - tipc_own_addr) % TIPC_MIN_LINK_WIN) == 0) {
            tipc_link_send_proto_msg(
                    node->active_links[node->elm.addr & 1],
                    STATE_MSG, 0, 0, 0, 0, 0, 0);
            bcl->stats.sent_acks++;
    }

I realize that this stuff is probably confusing to you -- indeed, I find the broadcast link code confusing too! -- so if you've got any questions, feel free to ask.

-- Al
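As a sketch of how point 1) above would look on the unicast side, here is the MSG_FRAGMENTER case from Howard's tipc_recv_msg() change (quoted below) with the suggested -1 return value substituted for 2; the logic is otherwise unchanged:

    case MSG_FRAGMENTER:
            l_ptr->stats.recv_fragments++;
            ret = tipc_link_recv_fragment(&l_ptr->defragm_buf,
                                          &buf, &msg);
            if (ret == 1) {
                    /* complete message reassembled -- deliver it */
                    l_ptr->stats.recv_fragmented++;
                    goto deliver;
            }
            if (ret == -1) {
                    /* no buffer available for reassembly: pretend the
                     * packet was never received, so the link protocol
                     * will have it retransmitted */
                    l_ptr->next_in_no--;
            }
            break;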
2. WIND00229427

To the issue, I have already provided a workaround to the customer (the changes marked with ----->>> below, shown in red in the original) and am waiting for feedback. Could you also review the modification? The description in the defect is detailed, so please refer to the defect.

a. In tipc_link.c

    void tipc_recv_msg(struct sk_buff *head, struct tipc_bearer *tb_ptr)
    {
            read_lock_bh(&tipc_net_lock);
            while (head) {
                    struct bearer *b_ptr = (struct bearer *)tb_ptr;
                    struct tipc_node *n_ptr;
                    struct link *l_ptr;
                    struct sk_buff *crs;
                    struct sk_buff *buf;
                    struct tipc_msg *msg;
                    u32 type;
                    u32 seq_no;
                    u32 ackd;
                    u32 released;
                    u32 ret;
    [...]
                    case ROUTE_DISTRIBUTOR:
                            tipc_node_unlock(n_ptr);
                            tipc_route_recv(buf);
                            continue;
                    case CONN_MANAGER:
                            /* route message normally */
                            break;
                    case MSG_FRAGMENTER:
                            l_ptr->stats.recv_fragments++;
                            ret = tipc_link_recv_fragment(&l_ptr->defragm_buf,
                                                          &buf, &msg);
                            if (ret == 1) {
                                    l_ptr->stats.recv_fragmented++;
                                    goto deliver;
                            }
                            if (ret == 2) {
                                    l_ptr->next_in_no--;
                                    ----->>> When memory is exhausted, decrease
                                    l_ptr->next_in_no to request retransmission of
                                    the packet. Thus the packet will not be lost.
                            }
                            break;
    #ifdef CONFIG_TIPC_MULTIPLE_LINKS
                    case CHANGEOVER_PROTOCOL:
                            type = msg_type(msg);
    [...]
    }

    int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
                                struct tipc_msg **m)
    {
            struct sk_buff *pbuf = *pending;
            struct sk_buff *fbuf = *fb;
            struct sk_buff *prev = NULL;
            struct tipc_msg *fragm = buf_msg(fbuf);
            u32 long_msg_orig = msg_orignode(fragm);
            u32 long_msg_seq_no = msg_fragm_msg_no(fragm);
    [....]
            pbuf = buf_acquire(msg_size(imsg));
            if (pbuf != NULL) {
                    pbuf->next = *pending;
                    *pending = pbuf;
                    skb_copy_to_linear_data(pbuf, imsg, msg_data_sz(fragm));

                    /* Prepare buffer for subsequent fragments */
                    set_long_msg_orig(pbuf, long_msg_orig);
                    set_long_msg_seqno(pbuf, long_msg_seq_no);
                    set_fragm_size(pbuf, fragm_sz);
                    set_expected_frags(pbuf, exp_fragm_cnt - 1);
                    reset_timer_cnt(pbuf);
            } else {
                    warn("Link unable to reassemble fragmented message\n");
                    buf_discard(fbuf);
                    return 2;
                    ----->>>> When memory is exhausted, the returned value is 2.
            }
            buf_discard(fbuf);
            return 0;
    [....]
    }

b. In tipc_bcast.c

    void tipc_bclink_recv_pkt(struct sk_buff *buf)
    {
            struct tipc_msg *msg;
            struct tipc_node *node;
            u32 seqno;
            u32 next_in;
            int deferred;
    [...]
            if (likely(msg_isdata(msg))) {
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_port_recv_mcast(buf, NULL);
            } else if (msg_user(msg) == MSG_BUNDLER) {
                    bcl->stats.recv_bundles++;
                    bcl->stats.recv_bundled += msg_msgcnt(msg);
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_link_recv_bundle(buf);
            } else if (msg_user(msg) == MSG_FRAGMENTER) {
                    bcl->stats.recv_fragments++;
                    if (tipc_link_recv_fragment(&node->bclink.defragm,
                                                &buf, &msg) == 1) {
                            bcl->stats.recv_fragmented++;
                            msg_set_destnode_cache(msg, tipc_own_addr);
                    }
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_net_route_msg(buf);
            } else if (msg_user(msg) == NAME_DISTRIBUTOR) {
    [...]
    }
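For context on why the unicast workaround in (a) recovers the discarded fragment: the link accepts only the packet whose sequence number matches next_in_no, so decrementing next_in_no after the discard reopens that gap, and the link protocol then treats subsequent packets as out of sequence and asks the peer to retransmit from the fragment onward. A simplified sketch of that gap check follows; the real logic is spread across tipc_recv_msg() and its helpers, and everything here other than the mod() macro, msg_seqno() and link_handle_out_of_seq_msg() is condensed for illustration.

    /* Simplified sketch -- condensed from the link receive path */
    seq_no = msg_seqno(msg);
    if (seq_no == mod(l_ptr->next_in_no)) {
            l_ptr->next_in_no++;    /* packet accepted in sequence */
            /* ... deliver; on reassembly failure next_in_no is
             * decremented again, reopening the gap ... */
    } else {
            /* out of sequence: defer the packet and NACK the peer,
             * requesting retransmission starting at next_in_no */
            link_handle_out_of_seq_msg(l_ptr, buf);
    }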