From: Erik H. <eri...@er...> - 2012-10-30 16:38:22

So, after the problems with packet loss due to full receive buffers were out
of the way, I believe we have discovered a bug in the TIPC link layer.

Using the testcase from:
http://thread.gmane.org/gmane.network.tipc.general/3354
and inserting a piece of code that dumps the first few octets of the packet
payload after reassembly has finished at the link layer:

> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index a79c755..917e3a3 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1741,6 +1741,10 @@ deliver:
> 							&buf, &msg);
> 			if (ret == 1) {
> 				l_ptr->stats.recv_fragmented++;
> +				unsigned int *test = (unsigned int *)msg_data(msg);
> +				u32 onode = msg_orignode(msg);
> +				pr_debug("[link] from %d.%d.%d recv id:\t %u\n",
> +					tipc_zone(onode), tipc_cluster(onode), tipc_node(onode), *test);
> 				goto deliver;
> 			}
> 			if (ret == -1)

I found that reassembly sometimes fails (never finishes) due to lost
fragments, reproduced on a 12 node cluster. The test server reports after a
while:

> received msg 1511 but expected 1510 fd 7
> ^C

The link level trace confirms that it's lost on the link layer:
....
> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.657400] tipc: [link] from 1.1.5 recv id: 1506
> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.789924] tipc: [link] from 1.1.5 recv id: 1507
> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.962455] tipc: [link] from 1.1.5 recv id: 1508
> Oct 30 15:55:43 Megatron user.warn kernel: [ 1969.117469] tipc: [link] from 1.1.5 recv id: 1509
> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.023229] tipc: [link] from 1.1.5 recv id: 1511
> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.107994] tipc: [link] from 1.1.5 recv id: 1512
> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.211754] tipc: [link] from 1.1.5 recv id: 1513
....

Pcap can be found here:
https://t.co/PYqNQDor

Node 5 reports:
Link <1.1.5:eth0-1.1.1:eth0>
  ACTIVE  MTU:1500  Priority:10  Tolerance:1500 ms  Window:50 packets
  RX packets:428 fragments:0/0 bundles:0/0
  TX packets:151296 fragments:151293/3219 bundles:0/0
  TX profile sample:831 packets  average:4859 octets
  0-64:0% -256:0% -1024:2% -4096:92% -16384:0% -32768:0% -66000:5%
  RX states:56134 probes:2959 naks:41862 defs:0 dups:0
  TX states:9016 probes:2914 naks:0 acks:3143 dups:156035
  Congestion bearer:0 link:3215  Send queue max:96 avg:6

Node 1:
Link <1.1.1:eth0-1.1.5:eth0>
  ACTIVE  MTU:1500  Priority:10  Tolerance:1500 ms  Window:50 packets
  RX packets:151296 fragments:151293/3218 bundles:0/0
  TX packets:428 fragments:0/0 bundles:0/0
  TX profile sample:83 packets  average:82 octets
  0-64:96% -256:0% -1024:0% -4096:4% -16384:0% -32768:0% -66000:0%
  RX states:8270 probes:2893 naks:0 defs:290484 dups:90125
  TX states:56106 probes:2945 naks:41862 acks:25868 dups:0
  Congestion bearer:0 link:0  Send queue max:2 avg:0

Node 5 reports 3219 transmitted fragmented messages, but node 1 reports only
3218 reassembled; the number of individual fragments sent and received does
match, though. The pcap shows that for fragmented message buffer id 3194,
fragment number 7 is never sent out by node 5. Probably, the reassembly
buffer is still held by node 1, waiting for fragment #7 to arrive, which it
never does.

I thought I'd give you a heads up on this; I'll continue troubleshooting
tomorrow.

//E
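For reference, the test traffic has roughly this shape: each message carries
an incrementing 32-bit id in its first four octets (this is what the
pr_debug() added above prints as "recv id"), and every message is large
enough to be fragmented at the link layer. The sketch below is only a rough
reconstruction, not the actual test case from the linked thread; the socket
type, service type/instance and message size are guesses, and the real
server side also checks that the ids arrive in sequence (hence "received msg
1511 but expected 1510").

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#ifndef AF_TIPC
#define AF_TIPC 30
#endif

int main(void)
{
	/* ~33k payload: well above the 1500 byte link MTU, so the link layer
	 * must split each message into FIRST_FRAGMENT/FRAGMENT/LAST_FRAGMENT */
	static char payload[33000];
	struct sockaddr_tipc srv = {
		.family = AF_TIPC,
		.addrtype = TIPC_ADDR_NAME,
		.addr.name.name.type = 18888,	/* made-up service type */
		.addr.name.name.instance = 17,	/* made-up instance */
		.addr.name.domain = 0,
	};
	unsigned int id = 0;
	int sd = socket(AF_TIPC, SOCK_RDM, 0);

	if (sd < 0) {
		perror("socket");
		return 1;
	}
	for (;;) {
		/* the first four octets carry the message id that the
		 * pr_debug() above reads back as "recv id" */
		memcpy(payload, &id, sizeof(id));
		if (sendto(sd, payload, sizeof(payload), 0,
			   (struct sockaddr *)&srv, sizeof(srv)) < 0) {
			perror("sendto");
			break;
		}
		id++;
	}
	close(sd);
	return 0;
}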
From: Ying X. <yin...@wi...> - 2012-10-31 03:16:48

Erik Hugne wrote:
> So, after the problems with packet loss due to full receive buffers
> were out of the way, I believe we have discovered a bug in the TIPC
> link layer.
> [... debug patch, link-level trace and link statistics snipped ...]
> Node 5 reports 3219 transmitted fragmented messages, but node 1 reports
> only 3218 reassembled; the number of individual fragments sent and
> received does match, though.
> The pcap shows that for fragmented message buffer id 3194, fragment
> number 7 is never sent out by node 5.

I guess the sender may be causing the issue. So I suggest we add more
logging in link_send_sections_long() to print a few variables, such as the
fragment number, the packet sequence number, and the message type
(FIRST_FRAGMENT, FRAGMENT, LAST_FRAGMENT etc). These logs may help us find
where the root cause is.

Regards,
Ying

> Probably, the reassembly buffer is still held by node 1, waiting for
> fragment #7 to arrive, which it never does.
>
> I thought I'd give you a heads up on this; I'll continue troubleshooting
> tomorrow.
>
> //E
From: Erik H. <eri...@er...> - 2012-10-31 08:11:10

> I guess the sender may be causing the issue. So I suggest we add more
> logging in link_send_sections_long() to print a few variables, such as
> the fragment number, the packet sequence number, and the message type
> (FIRST_FRAGMENT, FRAGMENT, LAST_FRAGMENT etc). These logs may help us
> find where the root cause is.

I think it looks like a receiver fault. It's strange that the receiver
(Node 1) acks seq. no 19009 with sequence gap 2 in packet #7392. This
packet (seq. no 19009) should correspond to fragment #7, but Node 1 cannot
have received that packet, since it's not visible in the pcap.

//E
From: Ying X. <yin...@wi...> - 2012-10-31 09:41:01

Erik Hugne wrote:
> I think it looks like a receiver fault. It's strange that the receiver
> (Node 1) acks seq. no 19009 with sequence gap 2 in packet #7392. This
> packet (seq. no 19009) should correspond to fragment #7, but Node 1
> cannot have received that packet, since it's not visible in the pcap.

You are right, it's very strange. But I want to know where the pcap file
was captured: on node 1, on node 5, or even on a router, etc. I suggest you
capture TIPC packets on both the sender and the receiver.

Regards,
Ying

> //E
From: Erik H. <eri...@er...> - 2012-10-31 10:39:33

> You are right, it's very strange. But I want to know where the pcap file
> was captured: on node 1, on node 5, or even on a router, etc. I suggest
> you capture TIPC packets on both the sender and the receiver.

I'm running in a virtual environment on my laptop with 12 KVM instances,
connected with a network bridge. The packets are captured on this bridge.

The problem seems to appear under conditions with large amounts of
retransmission. Right now I'm suspecting that the code handling deferred
sending in tipc_link_push_packet is somehow blocked from sending out the
deferred packets (fragments), and that's why we never see them on the
'wire'. But that still doesn't explain why the receiver acked messages that
it never received.

//E
From: Erik H. <eri...@er...> - 2012-10-31 16:07:49

I think I'm closing in on the problem.

During retransmission of fragments, if we have not yet allocated a
reassembly buffer (that is only done when a FIRST_FRAGMENT is received) and
we receive a fragment that is not a FIRST_FRAGMENT, we fall through the
tipc_link_recv_fragment function, free the fragment buffer and return zero:
http://lxr.free-electrons.com/source/net/tipc/link.c#L2506

tipc_recv_msg (in the link layer) will then continue, believing that the
fragment has been put in a reassembly queue, and the (now lost) message
will be acked.

What I have yet to discover is _why_ this occurs; packets should always be
put on the deferred queue if they are not 'next_in_no', and the deferred
packets should only be replayed back onto the receive loop when we have all
previous packets.

//E
From: Ying X. <yin...@wi...> - 2012-11-01 08:47:58

Erik Hugne wrote:
> I think I'm closing in on the problem.
> During retransmission of fragments, if we have not yet allocated a
> reassembly buffer (that is only done when a FIRST_FRAGMENT is received)
> and we receive a fragment that is not a FIRST_FRAGMENT, we fall through
> the tipc_link_recv_fragment function, free the fragment buffer and
> return zero:
> http://lxr.free-electrons.com/source/net/tipc/link.c#L2506

Sorry, I probably don't understand the issue you describe. From my
understanding, the sequence number of a FIRST_FRAGMENT packet is always
less than that of the following FRAGMENT packets. If FRAGMENT packets
arrive earlier than the FIRST_FRAGMENT packet, we have to put them into the
deferred queue. So in this case, I cannot understand why
tipc_link_recv_fragment() would be called and the fragment buffer freed.
Please give some details.

Thanks,
Ying

> tipc_recv_msg (in the link layer) will then continue, believing that the
> fragment has been put in a reassembly queue, and the (now lost) message
> will be acked.
>
> What I have yet to discover is _why_ this occurs; packets should always
> be put on the deferred queue if they are not 'next_in_no', and the
> deferred packets should only be replayed back onto the receive loop when
> we have all previous packets.
>
> //E
From: Erik H. <eri...@er...> - 2012-11-01 07:32:44

> What I have yet to discover is _why_ this occurs; packets should always
> be put on the deferred queue if they are not 'next_in_no', and the
> deferred packets should only be replayed back onto the receive loop when
> we have all previous packets.

When the link supervision timer kicks in, we will flush ongoing
reassemblies if they have not successfully received a fragment within
[link tolerance]*4 seconds. This then leads to the problem explained in the
previous mail, since we have obviously dropped the buffer.

I think this is a fault from which we cannot recover nicely, since the
dropped buffer is built up of packets that have already been acked on the
link. So I'm suggesting that we reset the link immediately and print a
helpful error message when this occurs. The problem can be mitigated by
increasing the link window size and/or tolerance levels.

Jon: do you remember the reasoning for flushing stale reassembly buffers
every fourth link timeout, if we have not yet received a fragment for the
ongoing reassembly?
http://lxr.free-electrons.com/source/net/tipc/link.c#L2584

//E
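To make the failure sequence concrete, below is a small stand-alone
user-space model of it. This is not TIPC code: the function names only
mirror the kernel functions discussed in this thread, a single pending
reassembly stands in for the defragm_buf list, and 3194 is simply the
fragmented message id seen in the pcap.

#include <stdio.h>

/* Mirrors the TIPC fragment message types */
enum { FIRST_FRAGMENT, FRAGMENT, LAST_FRAGMENT };

/* One pending reassembly at a time is enough for the illustration */
static int pending_id = -1;	/* id of the message being reassembled */
static int pending_cnt;		/* fragments collected so far */

/*
 * Mirrors the return contract of tipc_link_recv_fragment():
 *   1 - message completely reassembled, deliver it
 *   0 - fragment consumed (queued for reassembly... or silently freed)
 * The final "return 0" is the problem branch: no reassembly buffer exists
 * for the fragment, it is dropped, yet the caller still sees 0, treats the
 * fragment as accepted and acks the packet.
 */
static int recv_fragment(int type, int msg_id)
{
	if (type == FIRST_FRAGMENT) {
		pending_id = msg_id;
		pending_cnt = 1;
		return 0;
	}
	if (msg_id == pending_id) {
		pending_cnt++;
		if (type == LAST_FRAGMENT) {
			pending_id = -1;
			return 1;
		}
		return 0;
	}
	/* no reassembly buffer found: drop the fragment, report "ok" */
	return 0;
}

/* Mirrors the [link tolerance]*4 purge done by link_check_defragm_bufs() */
static void supervision_flush(void)
{
	pending_id = -1;
	pending_cnt = 0;
}

int main(void)
{
	int i;

	/* fragments 1..6 of message 3194 arrive and are acked on the link */
	recv_fragment(FIRST_FRAGMENT, 3194);
	for (i = 2; i <= 6; i++)
		recv_fragment(FRAGMENT, 3194);

	/* the remaining fragments are delayed long enough (e.g. by heavy
	 * retransmission) for the periodic check to purge the half-built
	 * reassembly buffer */
	supervision_flush();

	/* when a later fragment does arrive... */
	if (recv_fragment(FRAGMENT, 3194) == 0)
		printf("fragment 'accepted' (rc 0), but the reassembly state "
		       "is gone - message 3194 is silently lost\n");
	return 0;
}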
From: Ying X. <yin...@wi...> - 2012-11-01 09:24:15

Erik Hugne wrote:
>> What I have yet to discover is _why_ this occurs; packets should always
>> be put on the deferred queue if they are not 'next_in_no', and the
>> deferred packets should only be replayed back onto the receive loop
>> when we have all previous packets.
>
> When the link supervision timer kicks in, we will flush ongoing
> reassemblies if they have not successfully received a fragment within
> [link tolerance]*4 seconds. This then leads to the problem explained in
> the previous mail, since we have obviously dropped the buffer.
>
> I think this is a fault from which we cannot recover nicely, since the
> dropped buffer is built up of packets that have already been acked on
> the link. So I'm suggesting that we reset the link immediately and print
> a helpful error message when this occurs. The problem can be mitigated
> by increasing the link window size and/or tolerance levels.

Resetting the link is not a good idea. If we are not in the worst case, we
definitely should not do that. I also cannot understand why
link_check_defragm_bufs() is used. We should not flush acked packets in
general, except when the link itself is being reset.

> Jon: do you remember the reasoning for flushing stale reassembly buffers
> every fourth link timeout, if we have not yet received a fragment for
> the ongoing reassembly?
> http://lxr.free-electrons.com/source/net/tipc/link.c#L2584
>
> //E
From: Erik H. <eri...@er...> - 2012-11-01 09:45:24

> Resetting the link is not a good idea. If we are not in the worst case,
> we definitely should not do that.

This is actually a worst case. We have dropped packets that were already
acked before they could be delivered to the user (not all fragments have
been received yet); we cannot recover from this situation.

> I also cannot understand why link_check_defragm_bufs() is used. We
> should not flush acked packets in general, except when the link itself
> is being reset.

Either we have to reset the link, or we remove the checking of stale
defragm_bufs.

//E
From: Ying X. <yin...@wi...> - 2012-11-01 09:58:48

Erik Hugne wrote:
>> Resetting the link is not a good idea. If we are not in the worst case,
>> we definitely should not do that.
>
> This is actually a worst case. We have dropped packets that were already
> acked before they could be delivered to the user (not all fragments have
> been received yet); we cannot recover from this situation.
>
>> I also cannot understand why link_check_defragm_bufs() is used. We
>> should not flush acked packets in general, except when the link itself
>> is being reset.
>
> Either we have to reset the link, or we remove the checking of stale
> defragm_bufs.

It's better to remove the checking. When the link is reset because the
number of link timeouts reaches its abort limit, the defragm_bufs will be
purged anyway. So it's unnecessary to check defragm_bufs periodically.

> //E
From: Erik H. <eri...@er...> - 2012-11-01 11:24:53

> It's better to remove the checking. When the link is reset because the
> number of link timeouts reaches its abort limit, the defragm_bufs will
> be purged anyway. So it's unnecessary to check defragm_bufs
> periodically.

Yes, it should be safe to remove it.
Jon, do you agree?

//E
From: Jon M. <ma...@do...> - 2012-11-01 12:20:19

On 11/01/2012 07:24 AM, Erik Hugne wrote:
>> It's better to remove the checking. When the link is reset because the
>> number of link timeouts reaches its abort limit, the defragm_bufs will
>> be purged anyway. So it's unnecessary to check defragm_bufs
>> periodically.
>
> Yes, it should be safe to remove it.
> Jon, do you agree?

Yes, I agree. I think it was put there during one of the attempts to
introduce multi-cluster systems, where packets may be routed, reordered or
lost without necessarily forcing a reset of the link. Clearly, we forgot to
remove it when that effort was abandoned.

///jon
From: <eri...@er...> - 2012-11-01 12:08:13

From: Erik Hugne <eri...@er...>

There is a periodic per-link job running every [link tolerance] seconds.
This job performs a check on the active reassembly buffers, and if any of
these have not received a packet during the last 4 runs it will be purged.
This causes an inconsistency on the link level, since the individual
fragments from [1...(n-x)] have already been acked and retransmission will
only be done for packets [(n-x)...n]. These "orphaned" fragments are freed
when we cannot find an active reassembly buffer for them. The end result is
that the message is lost silently on the link level.

The selected solution is to remove the periodic check for stale reassembly
buffers. This is safe because the buffers are freed when a link is reset,
and repeated reassembly failures will always cause a link to reset. The
reassembly timeout mechanism was originally introduced as part of the
(abandoned) effort to introduce multi-cluster functionality, where messages
could be routed between links.

Signed-off-by: Erik Hugne <eri...@er...>
---
 net/tipc/link.c |   34 ----------------------------------
 1 files changed, 0 insertions(+), 34 deletions(-)

diff --git a/net/tipc/link.c b/net/tipc/link.c
index a79c755..bf81c7d 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -97,7 +97,6 @@ static int link_send_sections_long(struct tipc_port *sender,
 				   struct iovec const *msg_sect,
 				   u32 num_sect, unsigned int total_len,
 				   u32 destnode);
-static void link_check_defragm_bufs(struct tipc_link *l_ptr);
 static void link_state_event(struct tipc_link *l_ptr, u32 event);
 static void link_reset_statistics(struct tipc_link *l_ptr);
 static void link_print(struct tipc_link *l_ptr, const char *str);
@@ -269,7 +268,6 @@ static void link_timeout(struct tipc_link *l_ptr)
 	}
 
 	/* do all other link processing performed on a periodic basis */
-	link_check_defragm_bufs(l_ptr);
 
 	link_state_event(l_ptr, TIMEOUT_EVT);
 
@@ -2575,38 +2573,6 @@ int tipc_link_recv_fragment(struct sk_buff **pending, struct sk_buff **fb,
 	return 0;
 }
 
-/**
- * link_check_defragm_bufs - flush stale incoming message fragments
- * @l_ptr: pointer to link
- */
-static void link_check_defragm_bufs(struct tipc_link *l_ptr)
-{
-	struct sk_buff *prev = NULL;
-	struct sk_buff *next = NULL;
-	struct sk_buff *buf = l_ptr->defragm_buf;
-
-	if (!buf)
-		return;
-	if (!link_working_working(l_ptr))
-		return;
-	while (buf) {
-		u32 cnt = get_timer_cnt(buf);
-
-		next = buf->next;
-		if (cnt < 4) {
-			incr_timer_cnt(buf);
-			prev = buf;
-		} else {
-			if (prev)
-				prev->next = buf->next;
-			else
-				l_ptr->defragm_buf = buf->next;
-			kfree_skb(buf);
-		}
-		buf = next;
-	}
-}
-
 static void link_set_supervision_props(struct tipc_link *l_ptr, u32 tolerance)
 {
 	if ((tolerance < TIPC_MIN_LINK_TOL) || (tolerance > TIPC_MAX_LINK_TOL))
-- 
1.7.5.4
From: Ying X. <yin...@wi...> - 2012-11-02 10:46:48

eri...@er... wrote:
> From: Erik Hugne <eri...@er...>
>
> There is a periodic per-link job running every [link tolerance] seconds.
> This job performs a check on the active reassembly buffers, and if any
> of these have not received a packet during the last 4 runs it will be
> purged. This causes an inconsistency on the link level, since the
> individual fragments from [1...(n-x)] have already been acked and
> retransmission will only be done for packets [(n-x)...n]. These
> "orphaned" fragments are freed when we cannot find an active reassembly
> buffer for them. The end result is that the message is lost silently on
> the link level.
>
> The selected solution is to remove the periodic check for stale
> reassembly buffers. This is safe because the buffers are freed when a
> link is reset, and repeated reassembly failures will always cause a link
> to reset. The reassembly timeout mechanism was originally introduced as
> part of the (abandoned) effort to introduce multi-cluster functionality,
> where messages could be routed between links.
>
> Signed-off-by: Erik Hugne <eri...@er...>

Acked-by: Ying Xue <yin...@wi...>