From: Stephens, A. <all...@wi...> - 2010-08-31 18:57:10
Hi Jon:

It looks like Howard (the Wind River TIPC developer in China) is working on a solution to the long-standing problem of what to do if TIPC is unable to begin reassembling a fragmented message because there isn't an available sk_buff to hold the reassembled message. Basically, his idea is to simply ignore the first fragment completely -- as if the message was never received -- and let the link code take care of retransmitting it later on. I think his solution for unicast links is more or less correct, but his solution for the broadcast link isn't; I've suggested a solution that should work there too.

Can you take a look at Howard's email (at the bottom), and then my comments (above Howard's email), and let me know if you see any problems with either area?

I can provide more details on the defect that Howard is trying to fix, if needed. Basically, someone was blasting large SOCK_RDM messages down a link and found that between 1% and 10% of the messages were being lost because there weren't enough available buffers to allow them to be reassembled.

Regards,
Al

________________________________
From: Stephens, Allan
Sent: Tuesday, August 31, 2010 2:47 PM
To: Xu, Hao (Howard)
Cc: Kang, Qi; Xue, Ying
Subject: RE: WIND00229427

Hi Howard:

[Note: This email contains comments on this defect only; I don't want to discuss multiple different issues in a single email, as it gets too confusing. I'll send separate responses for the other defects you've asked about.]

I like the approach you've taken to dealing with this issue, but more work needs to be done to have a complete solution.

1) It would probably be more Linux-like to have tipc_link_recv_fragment() return -1 if an error occurred, rather than 2. This way, you'd have the following set of return values:

     0 - fragment received, message not complete
     1 - fragment received, message complete
    -1 - fragment not received, message not started

You'll need to modify tipc_recv_msg() accordingly to look for -1 instead of 2.

2) The change you've made to tipc_bclink_recv_pkt() doesn't do anything to force the discarded message fragment to be retransmitted, which is essential if you don't want to lose messages. However, it doesn't look like this can be done as easily as simply decrementing "next_in_no" like you did for unicast links. Basically, you need to look at all the code that runs after the "if (seqno == next_in)" check and make sure nothing important happens before you check the result of tipc_link_recv_fragment(); this will allow you to bail out if a problem occurs, with the same effect as if the message had never been received, which is what you want.
I think you probably want to code something like this:

    if (seqno == next_in) {
    receive:
            /* Deliver message to destination */

            if (likely(msg_isdata(msg))) {
                    spin_lock_bh(&bc_lock);
                    bclink_accept_pkt(...);
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_port_recv_mcast(buf, NULL);
            } else if (msg_user(msg) == MSG_BUNDLER) {
                    spin_lock_bh(&bc_lock);
                    bclink_accept_pkt(...);
                    bcl->stats.recv_bundles++;
                    bcl->stats.recv_bundled += msg_msgcnt(msg);
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_link_recv_bundle(buf);
            } else if (msg_user(msg) == MSG_FRAGMENTER) {
                    ret = tipc_link_recv_fragment(&node->bclink.defragm,
                                                  &buf, &msg);
                    if (ret < 0) {
                            /* unable to start reassembly --
                               discard fragment & wait for retransmission */
                            goto unlock;
                    }
                    spin_lock_bh(&bc_lock);
                    bclink_accept_pkt(...);
                    bcl->stats.recv_fragments++;
                    if (ret > 0) {
                            bcl->stats.recv_fragmented++;
                            msg_set_destnode_cache(msg, tipc_own_addr);
                    }
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_net_route_msg(buf);
            } else if (msg_user(msg) == NAME_DISTRIBUTOR) {
                    ... <and so on>

where bclink_accept_pkt(...) does:

    bclink_update_last_sent(node, seqno);
    node->bclink.last_in = seqno;
    node->bclink.oos_state = 0;
    bcl->stats.recv_info++;

    /*
     * Unicast an ACK periodically, ensuring that
     * all nodes in the cluster don't ACK at the same time
     */
    if (((seqno - tipc_own_addr) % TIPC_MIN_LINK_WIN) == 0) {
            tipc_link_send_proto_msg(
                    node->active_links[node->elm.addr & 1],
                    STATE_MSG, 0, 0, 0, 0, 0, 0);
            bcl->stats.sent_acks++;
    }

I realize that this stuff is probably confusing to you -- indeed, I find the broadcast link code confusing too! -- so if you've got any questions, feel free to ask.

-- Al

2. WIND00229427

Regarding this issue, I already provided a workaround (in red) to the customer and am waiting for feedback. Could you also review the modification? The description in the defect report is detailed, so please refer to the defect.

a. In tipc_link.c

    void tipc_recv_msg(struct sk_buff *head, struct tipc_bearer *tb_ptr)
    {
            read_lock_bh(&tipc_net_lock);
            while (head) {
                    struct bearer *b_ptr = (struct bearer *)tb_ptr;
                    struct tipc_node *n_ptr;
                    struct link *l_ptr;
                    struct sk_buff *crs;
                    struct sk_buff *buf;
                    struct tipc_msg *msg;
                    u32 type;
                    u32 seq_no;
                    u32 ackd;
                    u32 released;
                    u32 ret;
    [...]
                    case ROUTE_DISTRIBUTOR:
                            tipc_node_unlock(n_ptr);
                            tipc_route_recv(buf);
                            continue;
                    case CONN_MANAGER:
                            /* route message normally */
                            break;
                    case MSG_FRAGMENTER:
                            l_ptr->stats.recv_fragments++;
                            ret = tipc_link_recv_fragment(&l_ptr->defragm_buf,
                                                          &buf, &msg);
                            if (ret == 1) {
                                    l_ptr->stats.recv_fragmented++;
                                    goto deliver;
                            }
                            if (ret == 2) {
                                    l_ptr->next_in_no--;
                            }
                            ----->>> When memory is exhausted, decrease the
                            value of l_ptr->next_in_no to request
                            retransmission of the packet; thus the packet
                            will not be lost.
                            break;
    #ifdef CONFIG_TIPC_MULTIPLE_LINKS
                    case CHANGEOVER_PROTOCOL:
                            type = msg_type(msg);
    [...]
    }

    int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
                                struct tipc_msg **m)
    {
            struct sk_buff *pbuf = *pending;
            struct sk_buff *fbuf = *fb;
            struct sk_buff *prev = NULL;
            struct tipc_msg *fragm = buf_msg(fbuf);
            u32 long_msg_orig = msg_orignode(fragm);
            u32 long_msg_seq_no = msg_fragm_msg_no(fragm);
    [....]
            pbuf = buf_acquire(msg_size(imsg));
            if (pbuf != NULL) {
                    pbuf->next = *pending;
                    *pending = pbuf;
                    skb_copy_to_linear_data(pbuf, imsg, msg_data_sz(fragm));

                    /* Prepare buffer for subsequent fragments. */
                    set_long_msg_orig(pbuf, long_msg_orig);
                    set_long_msg_seqno(pbuf, long_msg_seq_no);
                    set_fragm_size(pbuf, fragm_sz);
                    set_expected_frags(pbuf, exp_fragm_cnt - 1);
                    reset_timer_cnt(pbuf);
            } else {
                    warn("Link unable to reassemble fragmented message\n");
                    buf_discard(fbuf);
                    return 2;
                    ----->>>> When memory is exhausted, the returned
                    value is 2.
            }
            buf_discard(fbuf);
            return 0;
    [....]
    }

b. In tipc_bcast.c

    void tipc_bclink_recv_pkt(struct sk_buff *buf)
    {
            struct tipc_msg *msg;
            struct tipc_node *node;
            u32 seqno;
            u32 next_in;
            int deferred;
    [...]
            if (likely(msg_isdata(msg))) {
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_port_recv_mcast(buf, NULL);
            } else if (msg_user(msg) == MSG_BUNDLER) {
                    bcl->stats.recv_bundles++;
                    bcl->stats.recv_bundled += msg_msgcnt(msg);
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_link_recv_bundle(buf);
            } else if (msg_user(msg) == MSG_FRAGMENTER) {
                    bcl->stats.recv_fragments++;
                    if (tipc_link_recv_fragment(&node->bclink.defragm,
                                                &buf, &msg) == 1) {
                            bcl->stats.recv_fragmented++;
                            msg_set_destnode_cache(msg, tipc_own_addr);
                    }
                    spin_unlock_bh(&bc_lock);
                    tipc_node_unlock(node);
                    tipc_net_route_msg(buf);
            } else if (msg_user(msg) == NAME_DISTRIBUTOR) {
    [...]
    }
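Taking Al's point 1 together with Howard's unicast patch above, the MSG_FRAGMENTER case in tipc_recv_msg() would end up looking roughly like this (a sketch only; the identifiers are those from Howard's excerpt, with the return value 2 replaced by -1 as Al suggests):

    case MSG_FRAGMENTER:
            l_ptr->stats.recv_fragments++;
            ret = tipc_link_recv_fragment(&l_ptr->defragm_buf,
                                          &buf, &msg);
            if (ret == 1) {
                    l_ptr->stats.recv_fragmented++;
                    goto deliver;
            }
            if (ret < 0) {
                    /* No sk_buff to start reassembly: pretend the
                     * fragment was never received, so the normal link
                     * protocol retransmits it once memory frees up.
                     */
                    l_ptr->next_in_no--;
            }
            break;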
From: Max F. <jcm...@gm...> - 2010-08-31 19:35:16
> Hi Jon:
>
> [...]
>
> I can provide more details on the defect that Howard is trying to fix,
> if needed. Basically, someone was blasting large SOCK_RDM messages down
> a link and found that between 1% and 10% of the messages were being lost
> because there weren't enough available buffers to allow them to be
> reassembled.
>
> Regards,
> Al

Hello.

It was me who stress-tested TIPC on vxWorks in this way. I've also evaluated the patch at the bottom of the letter, and although it solves the message loss problem, it severely affects link performance: while I observed up to 100000 non-fragmented messages per second over a gigabit link, with this patch I see only about 600 fragmented (into two parts) messages per second over that same link.

I also tested other combinations of host OS and never noticed message loss between two Linux machines, or when vxWorks sent messages to Linux. The problem shows up only when vxWorks acts as the receiving side.

> [...]

Thanks.
-- Max
From: Stephens, A. <all...@wi...> - 2010-08-31 19:45:58
Hi Max:

I'm not totally surprised to hear that you see a significant performance hit if messages are being discarded due to a lack of buffers. One of TIPC's fundamental axioms is that message loss is a rare event, and the link code is designed around that principle.

A couple of suggestions that could help improve throughput:

1) Change the configuration settings for TIPC on VxWorks so that it has a larger pool of large-size message buffers. This should prevent messages from being discarded, and avoid the performance hit associated with message retransmission.

2) Change the configuration settings for TIPC so that the link window size is reduced. This should help flow-control the stream of messages arriving at the VxWorks node, increasing the chances that there will always be the necessary message buffers in place to reassemble the incoming fragmented traffic.

Regards,
Al

> -----Original Message-----
> From: Max Filippov [mailto:jcm...@gm...]
> Sent: Tuesday, August 31, 2010 3:35 PM
> To: tip...@li...
> Cc: Stephens, Allan; jon...@er...
> Subject: Re: [tipc-discussion] Solution to problem of losing fragmented messages
>
> [...]
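Rough arithmetic behind suggestion 2, assuming roughly 1400-byte fragment payloads on Ethernet and the Linux TIPC constants (TIPC_DEF_LINK_WIN = 50, TIPC_MIN_LINK_WIN = 16; the VxWorks build's values may differ):

    66000-byte maximum message / ~1400 bytes per fragment ~= 47 fragments
    window = 50 packets -> an entire long message can be in flight
                           unacknowledged, so every long message demands a
                           ~66 kB reassembly buffer as soon as its first
                           fragment arrives
    window = 16 packets -> at most about a third of that backlog, giving
                           the receiver correspondingly more time to free
                           large buffers between bursts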
From: Jon M. <jon...@er...> - 2010-08-31 19:45:02
Hi Allan,

I am sceptical of this solution. Reverting the link state is something I have always tried to avoid. What do we do with the fragment reception counter? Are there more side effects?

One drawback is clearly that if somebody sends a 66 k message and fails to allocate the big buffer up front, the whole link will in practice be stalled until the memory squeeze abates. That is, all other users of the link, with potentially important message interchange, will be suffering.

Why not just chain up the received fragments *without* allocating the buffer from the beginning? Then we can allocate and reassemble at the last fragment. This may require some reassembly-state metadata to be kept, but that can probably be added as a regular member of the link structure, or maybe even as "pseudo header fields" in the first buffer, similar to what we do today.

(Later on, when we feel comfortable with this solution, we can move the defragmentation where it belongs: to the socket layer. That is, the whole buffer chain is sent up to the socket layer, and the reassembly goes directly to user memory, within the context of a read() call. Thus there will be no need for any intermediate buffer at all.)

Regards
///jon

Jon Maloy M.Sc. EE
Researcher, Ericsson Canada
Broadband and Systems Research
8400 Decarie, H4P 2N2, Montreal, Quebec, Canada
Phone + 1 514 345-7900 x42056
Mobile + 1 514 591-5578
jon...@er...
www.ericsson.com

________________________________
From: Stephens, Allan [mailto:all...@wi...]
Sent: August-31-10 14:57
To: Jon Maloy
Cc: tip...@li...
Subject: Solution to problem of losing fragmented messages

[...]
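A minimal sketch of the "chain first, allocate last" approach Jon describes, to make the proposal concrete. All names and state here are hypothetical -- this is not existing TIPC code; buf_acquire() is the allocator seen in Howard's excerpt:

    /* Queue fragments as they arrive and defer the big allocation until
     * the final fragment, so a failed allocation at worst delays one
     * message rather than stalling the link on its first fragment.
     */
    struct reasm_state {
            struct sk_buff *frags;      /* fragments received so far */
            u32 expected;               /* fragments still missing */
            u32 total_sz;               /* size of reassembled message */
    };

    static int frag_chain_recv(struct reasm_state *st, struct sk_buff *fbuf)
    {
            struct sk_buff *mbuf;
            struct sk_buff *f;

            fbuf->next = st->frags;     /* cheap: no large buffer needed */
            st->frags = fbuf;
            if (--st->expected != 0)
                    return 0;           /* message not yet complete */

            mbuf = buf_acquire(st->total_sz);   /* allocate only now */
            if (mbuf == NULL)
                    return -1;          /* chain stays intact; retry later,
                                         * or drop it and rely on link
                                         * retransmission */

            for (f = st->frags; f != NULL; f = f->next) {
                    /* copy each fragment's payload to its offset in mbuf,
                     * using the fragment number carried in its header */
            }
            return 1;                   /* reassembly complete */
    }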
From: Stephens, A. <all...@wi...> - 2010-08-31 20:11:48
Hi Jon:

I agree that the proposed solution is not ideal, but it does appear to be an acceptable interim solution to the problem at hand. As far as I can tell, there are no new side effects introduced by the act of ignoring an "unprocessable" first fragment message -- TIPC simply ends up doing exactly what it would have done if the message had actually been lost while passing over the bearer.

I also acknowledge that performance will take a hit whenever this condition arises -- potentially stalling the link until the problematic fragment can be processed -- however, I don't see this as being any different than what happens in other cases where a TIPC network isn't really configured for the traffic it's being asked to carry (e.g. a link can also get stalled because its transmit queue length threshold kicks in).

A couple of things to keep in mind:

1) This fix doesn't have any impact unless a node runs out of sk_buffs for reassembling incoming fragmented messages, so it shouldn't affect the vast majority of TIPC users. (Max Filippov's recent email on this topic indicates that he doesn't see this condition on Linux-based TIPC nodes.)

2) Since TIPC touts itself as a reliable transport mechanism, users expect their messages to get through (if at all possible). I think users will be happier with a system that delivers all messages at a lower rate than one which has better performance but which loses messages periodically. (Especially if the performance issue can be solved by tuning the node's configuration settings.)

Eventually, of course, it would be nice to enhance TIPC so that there is no need to allocate any additional sk_buffs for message reassembly. (I don't think your suggestion of chaining together incoming fragments and then allocating a new buffer once they are all present really solves anything -- it merely delays the inevitable discard that will have to occur if no such buffer can be obtained.) Until that time, I think Howard's suggested fix is a relatively quick solution to an observed problem.

Regards,
Al

________________________________
From: Jon Maloy [mailto:jon...@er...]
Sent: Tuesday, August 31, 2010 3:45 PM
To: Stephens, Allan
Cc: tip...@li...
Subject: RE: Solution to problem of losing fragmented messages

[...]
From: Randy M. <rwm...@gm...> - 2010-09-01 03:30:12
Hi,

I agree with Jon's goals of zero copy and only copying when the user does a read, as I'm sure you do, Allan. Consider two other use cases:

1. A jumbo frame network with an MTU of, say, 9600 bytes. Many (well-behaved) NICs will present MTU-sized packets to TIPC as non-linear skbs. Currently TIPC linearizes these, and that's fine until there's a shortage of memory. I submitted a patch some time ago to deal with the payload packets without linearizing them. Is it worth re-basing and re-submitting this work?

2. The Kerlabs kernel patch uses TIPC to do transparent process migration from one node to another. This is usually done when memory is scarce. Louis discussed the need for a native API for zero-copy send to handle this situation. (btw, they have an easily set-up demo using qemu on Ubuntu)

I know this doesn't help with the short-term issues, but I think we have to keep the "right" design in mind and in the mailing list.

Back to the fire of the week,

-- ../Randy/..
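For context on Randy's point 1: a non-linear skb keeps part of its data in paged fragments, and skb_linearize() has to allocate one contiguous buffer -- exactly the resource that is scarce here. The standard way to inspect a header without linearizing is skb_header_pointer(); a generic sketch follows (this is not Randy's actual patch, which isn't shown here, and HDR_MAX is a local scratch size rather than a TIPC constant):

    #include <linux/errno.h>
    #include <linux/skbuff.h>

    #define HDR_MAX 60      /* scratch size for header inspection */

    static int peek_header(struct sk_buff *skb)
    {
            unsigned char hdr_buf[HDR_MAX];
            unsigned char *hdr;

            /* Returns a pointer into the skb if the first HDR_MAX bytes
             * are linear, otherwise copies them into hdr_buf -- no large
             * allocation either way.
             */
            hdr = skb_header_pointer(skb, 0, HDR_MAX, hdr_buf);
            if (hdr == NULL)
                    return -EINVAL; /* packet shorter than HDR_MAX */

            /* ... parse hdr; the payload can later be copied piecewise
             * with skb_copy_bits() instead of being linearized ...
             */
            return 0;
    }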
From: Jon M. <jon...@er...> - 2010-09-01 15:11:03
Hi Allan,

I can buy that, as a purely temporary work-around, this is acceptable. After all, there seems to be nobody willing to put resources into implementing a real solution to this. I looked at Howard's and your suggestions, and I can't see any problem with them.

///jon

Jon Maloy M.Sc. EE
Researcher, Ericsson Canada
Broadband and Systems Research
8400 Decarie, H4P 2N2, Montreal, Quebec, Canada
Phone + 1 514 345-7900 x42056
Mobile + 1 514 591-5578
jon...@er...
www.ericsson.com

________________________________
From: Stephens, Allan [mailto:all...@wi...]
Sent: August-31-10 16:12
To: Jon Maloy
Cc: tip...@li...
Subject: RE: Solution to problem of losing fragmented messages

[...]
Then we can allocate and reassemble at the last fragment. This may require some metadata buffer (about reassembly state) to be kept, but that can probably be added as a regular member of the link structure, or maybe even as "pseudo header fields" in the first buffer, similar to what we do today. (Later on, when we feel comfortable with this solution, we can move the defragmentation where it belongs: to the socket layer, i.e. the whole buffer chain is sent up to the socket layer, and the reassembly goes directly to user memory, within the context of a read() call. Thus there will be no need for any intermediate buffer at all.) Regards ///jon Jon Maloy M.Sc. EE Researcher Ericsson Canada Broadband and Systems Research 8400 Decarie H4P 2N2, Montreal, Quebec, Canada Phone + 1 514 345-7900 x42056 Mobile + 1 514 591-5578 jon...@er... www.ericsson.com ________________________________ From: Stephens, Allan [mailto:all...@wi...] Sent: August-31-10 14:57 To: Jon Maloy Cc: tip...@li... Subject: Solution to problem of losing fragmented messages Hi Jon: It looks like Howard (the Wind River TIPC developer in China) is working on a solution to the long-standing problem of what to do if TIPC is unable to begin reassembling a fragmented message because there isn't an available sk_buff to hold the reassembled message. Basically, his idea is to simply ignore the first fragment completely -- as if the message was never received -- and let the link code take care of retransmitting it later on. I think his solution for unicast links is more or less correct, but his solution for the broadcast link isn't; I've suggested a solution that should work there too. Can you take a look at Howard's email (at the bottom), and then my comments (above Howard's email), and let me know if you see any problems with either area? I can provide more details on the defect that Howard is trying to fix, if needed. Basically, someone was blasting large SOCK_RDM messages down a link and found that between 1% and 10% of the messages were being lost because there weren't enough available buffers to allow them to be reassembled. Regards, Al ________________________________ From: Stephens, Allan Sent: Tuesday, August 31, 2010 2:47 PM To: Xu, Hao (Howard) Cc: Kang, Qi; Xue, Ying Subject: RE: WIND00229427 Hi Howard: [Note: This email contains comments on this defect only; I don't want to discuss multiple different issues in a single email, as it gets too confusing. I'll send separate responses for the other defects you've asked about.] I like the approach you've taken to dealing with this issue, but more work needs to be done to have a complete solution. 1) It would probably be more Linux-like to have tipc_link_recv_fragment() return -1 if an error occurred, rather than 2. This way, you'd have the following set of return values: 0 - fragment received, message not complete, 1 - fragment received, message complete, -1 - fragment not received, message not started You'll need to modify tipc_recv_msg() accordingly to look for -1 instead of 2. 2) The change you've made to tipc_bclink_recv_pkt() doesn't do anything to force the discarded message fragment to be retransmitted, which is essential if you don't want to lose messages. However, it doesn't look like this can be done as easily as simply decrementing "next_in_no" like you did for unicast links. 
________________________________
From: Stephens, Allan [mailto:all...@wi...]
Sent: August-31-10 14:57
To: Jon Maloy
Cc: tip...@li...
Subject: Solution to problem of losing fragmented messages

[...]

int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
                            struct tipc_msg **m)
{
        [....]
        pbuf = buf_acquire(msg_size(imsg));
        if (pbuf != NULL) {
                pbuf->next = *pending;
                *pending = pbuf;
                skb_copy_to_linear_data(pbuf, imsg, msg_data_sz(fragm));

                /* Prepare buffer for subsequent fragments. */
                set_long_msg_orig(pbuf, long_msg_orig);
                set_long_msg_seqno(pbuf, long_msg_seq_no);
                set_fragm_size(pbuf, fragm_sz);
                set_expected_frags(pbuf, exp_fragm_cnt - 1);
                reset_timer_cnt(pbuf);
        } else {
                warn("Link unable to reassemble fragmented message\n");
                buf_discard(fbuf);
                return 2;   ----->>>> When memory is exhausted, the returned value is 2.
        }
        buf_discard(fbuf);
        return 0;
        [....]
}

b. In tipc_bcast.c

void tipc_bclink_recv_pkt(struct sk_buff *buf)
{
        struct tipc_msg *msg;
        struct tipc_node *node;
        u32 seqno;
        u32 next_in;
        int deferred;
        [...]
        if (likely(msg_isdata(msg))) {
                spin_unlock_bh(&bc_lock);
                tipc_node_unlock(node);
                tipc_port_recv_mcast(buf, NULL);
        } else if (msg_user(msg) == MSG_BUNDLER) {
                bcl->stats.recv_bundles++;
                bcl->stats.recv_bundled += msg_msgcnt(msg);
                spin_unlock_bh(&bc_lock);
                tipc_node_unlock(node);
                tipc_link_recv_bundle(buf);
        } else if (msg_user(msg) == MSG_FRAGMENTER) {
                bcl->stats.recv_fragments++;
                if (tipc_link_recv_fragment(&node->bclink.defragm,
                                            &buf, &msg) == 1) {
                        bcl->stats.recv_fragmented++;
                        msg_set_destnode_cache(msg, tipc_own_addr);
                }
                spin_unlock_bh(&bc_lock);
                tipc_node_unlock(node);
                tipc_net_route_msg(buf);
        } else if (msg_user(msg) == NAME_DISTRIBUTOR) {
        [...]
}
|
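(A sketch -- an assumption, not code from the thread -- of how Howard's error path looks once Al's suggested "-1 instead of 2" convention is adopted. One detail worth noting: tipc_recv_msg() above declares ret as u32, so a test for a negative return would never fire; ret must become a signed int for this to work:)

    /* In tipc_link_recv_fragment(): report the failure with -1 */
    } else {
            warn("Link unable to reassemble fragmented message\n");
            buf_discard(fbuf);
            return -1;      /* fragment not received, reassembly not started */
    }

    /* In tipc_recv_msg(): declare 'int ret;' (not u32) and test for < 0 */
    case MSG_FRAGMENTER:
            l_ptr->stats.recv_fragments++;
            ret = tipc_link_recv_fragment(&l_ptr->defragm_buf, &buf, &msg);
            if (ret == 1) {
                    l_ptr->stats.recv_fragmented++;
                    goto deliver;
            }
            if (ret < 0)
                    l_ptr->next_in_no--;    /* behave as if the packet was
                                             * never received; the link
                                             * protocol will have it
                                             * retransmitted */
            break;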
From: Stephens, A. <all...@wi...> - 2010-09-01 15:25:32
|
OK, we'll continue to move forward on this course for the time being, and hope for a better solution to emerge someday ...

-- Al

________________________________
From: Jon Maloy [mailto:jon...@er...]
Sent: Wednesday, September 01, 2010 11:11 AM
To: Stephens, Allan
Cc: tip...@li...
Subject: RE: Solution to problem of losing fragmented messages

Hi Allan,

I can buy that as a purely temporary work-around this is acceptable. After all, there seems to be nobody willing to put resources into implementing a real solution to this. I looked at Howard's and your suggestions, and I can't see any problem with them.

///jon

________________________________
From: Stephens, Allan [mailto:all...@wi...]
Sent: August-31-10 16:12
To: Jon Maloy
Cc: tip...@li...
Subject: RE: Solution to problem of losing fragmented messages

Hi Jon:

I agree that the proposed solution is not ideal, but it does appear to be an acceptable interim solution to the problem at hand. As far as I can tell, there are no new side effects introduced by the act of ignoring an "unprocessable" first fragment message -- TIPC simply ends up doing exactly what it would have done if the message had actually been lost while passing over the bearer.

I also acknowledge that performance will take a hit whenever this condition arises -- potentially stalling the link until the problematic fragment can be processed -- however, I don't see this as being any different than what happens in other cases where a TIPC network isn't really configured for the traffic it's being asked to carry (e.g. a link can get stalled because its transmit queue length threshold kicks in, for other reasons, too).

A couple of things to keep in mind:

1) This fix doesn't have any impact unless a node runs out of sk_buffs for reassembling incoming fragmented messages, so it shouldn't affect the vast majority of TIPC users. (Max Fillipov's recent email on this topic indicates that he doesn't see this condition on Linux-based TIPC nodes.)

2) Since TIPC touts itself as a reliable transport mechanism, users expect their messages to get through (if at all possible). I think users will be happier with a system that delivers all messages at a lower rate than one which has better performance but which loses messages periodically. (Especially if the performance issue can be solved by tuning the node's configuration settings.)

Eventually, of course, it would be nice to enhance TIPC so that there is no need to allocate any additional sk_buffs for message reassembly. (I don't think your suggestion of chaining together incoming fragments and then allocating a new buffer once they are all present really solves anything -- it merely delays the inevitable discard that will have to occur if no such buffer can be obtained.) Until that time, I think Howard's suggested fix is a relatively quick solution to an observed problem.

Regards, Al

________________________________
From: Jon Maloy [mailto:jon...@er...]
Sent: Tuesday, August 31, 2010 3:45 PM
To: Stephens, Allan
Cc: tip...@li...
Subject: RE: Solution to problem of losing fragmented messages

Hi Allan,

I am sceptical of this solution. Reverting the link state is something I have always tried to avoid. What do we do with the fragment reception counter? Are there more side effects?
One drawback is clearly that if somebody sends a 66 k message and fails to allocate the big buffer up front, the whole link will in practice be stalled until the memory squeeze abates. That is, all other users of the link, with potentially important message interchange, will be suffering.

Why not just chain up the received fragments *without* allocating the buffer from the beginning? [...]

Regards ///jon

________________________________
From: Stephens, Allan [mailto:all...@wi...]
Sent: August-31-10 14:57
To: Jon Maloy
Cc: tip...@li...
Subject: Solution to problem of losing fragmented messages

[...]
|
From: Stephens, A. <all...@wi...> - 2010-09-01 15:23:59
|
Randy wrote:
> I agree with Jon's goals of zero copy and only copying when the user
> does a read, as I'm sure you do Allan.

Absolutely.

> Consider two other use cases:
>
> 1. Jumbo frame network with MTU of say 9600 bytes.
> Many (well behaving) NICs will present MTU sized packets to TIPC as
> non-linear skbs. Currently TIPC linearizes these, and that's fine
> until there's a shortage of memory.
>
> I submitted a patch some time ago to deal with the payload packets
> without linearizing them. Is it worth re-basing and re-submitting
> this work?

If you've got something that can eliminate unnecessary linearization, it sounds like it would be worthwhile to rebase this against TIPC 2.0 and resubmit it. The trick is to ensure that any such work handles *all* of the paths in TIPC that need to process incoming messages, and not just the basic one in which a data message is passed to a socket receive queue; for example, the code needs to handle bundled messages, message fragments, name table messages, and so on.

Another question to ask is how such a change impacts TIPC's native API. Right now, code using the native API is guaranteed that any callback routine that is given a message can access the contents of the message as a linear array. Is this something that we need to continue supporting?

Regards, Al
|
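(Sketch only -- not Randy's actual patch, which isn't shown here. The idea would be to keep payload skbs non-linear, pull up just the TIPC header for parsing, and linearize lazily on the paths that genuinely need flat data, e.g. before invoking a native-API callback that promises a linear array. skb_is_nonlinear(), pskb_may_pull() and skb_linearize() are standard kernel helpers; both function names below and the 24-byte minimum-header assumption are mine:)

    static int tipc_hdr_pullup(struct sk_buff *skb)
    {
            /* Make only the header linear; 24 bytes is assumed to be the
             * shortest TIPC header, enough to read the real header size */
            if (!pskb_may_pull(skb, 24))
                    return -1;
            if (!pskb_may_pull(skb, msg_hdr_sz(buf_msg(skb))))
                    return -1;
            return 0;
    }

    static int deliver_linear(struct sk_buff *skb)
    {
            /* Linearize only where a consumer actually requires it */
            if (skb_is_nonlinear(skb)) {
                    int err = skb_linearize(skb);   /* can fail under memory
                                                     * pressure, like today */
                    if (err)
                            return err;     /* caller decides: defer or drop */
            }
            /* ... hand the now-linear message to the callback ... */
            return 0;
    }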