[tipc-discussion] Link level packet loss

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

So, after the problems with packet loss due to full receive buffers 
where out of the way, I believe we have discovered a bug in the TIPC 
link layer.
Using the testcase from:
http://thread.gmane.org/gmane.network.tipc.general/3354
and inserting a piece of code dumping the first few octets from the 
packet payload after finished reassembly at link layer:

> diff --git a/net/tipc/link.c b/net/tipc/link.c
> index a79c755..917e3a3 100644
> --- a/net/tipc/link.c
> +++ b/net/tipc/link.c
> @@ -1741,6 +1741,10 @@ deliver:
>                                                 &buf, &msg);
>                                         if (ret == 1) {
>                                                 l_ptr->stats.recv_fragmented++;
> +                                               unsigned int *test = (unsigned int*)msg_data(msg);
> +                                               u32 onode = msg_orignode(msg);
> +                                               pr_debug("[link] from %d.%d.%d recv id:\t %u\n",
> +                                               tipc_zone(onode), tipc_cluster(onode), tipc_node(onode), *test);
>                                                 goto deliver;
>                                         }
>                                         if (ret == -1)

I found that reassembly sometimes fails (never finishes) due to lost 
fragments, reproduced on a 12 node cluster.
The test server reports after a while:

> received msg 1511 but expected 1510 fd 7
> ^C

The link level trace confirms that it's lost on the link layer:
....
> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.657400] tipc: [link] from 1.1.5 recv id:	 1506
> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.789924] tipc: [link] from 1.1.5 recv id:	 1507
> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.962455] tipc: [link] from 1.1.5 recv id:	 1508
> Oct 30 15:55:43 Megatron user.warn kernel: [ 1969.117469] tipc: [link] from 1.1.5 recv id:	 1509
> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.023229] tipc: [link] from 1.1.5 recv id:	 1511
> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.107994] tipc: [link] from 1.1.5 recv id:	 1512
> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.211754] tipc: [link] from 1.1.5 recv id:	 1513
....

Pcap can be found here:
https://t.co/PYqNQDor

Node 5 reports:
Link <1.1.5:eth0-1.1.1:eth0>
   ACTIVE  MTU:1500  Priority:10  Tolerance:1500 ms  Window:50 packets
   RX packets:428 fragments:0/0 bundles:0/0
   TX packets:151296 fragments:151293/3219 bundles:0/0
   TX profile sample:831 packets  average:4859 octets
   0-64:0% -256:0% -1024:2% -4096:92% -16384:0% -32768:0% -66000:5%
   RX states:56134 probes:2959 naks:41862 defs:0 dups:0
   TX states:9016 probes:2914 naks:0 acks:3143 dups:156035
   Congestion bearer:0 link:3215  Send queue max:96 avg:6

Node 1:
Link <1.1.1:eth0-1.1.5:eth0>
   ACTIVE  MTU:1500  Priority:10  Tolerance:1500 ms  Window:50 packets
   RX packets:151296 fragments:151293/3218 bundles:0/0
   TX packets:428 fragments:0/0 bundles:0/0
   TX profile sample:83 packets  average:82 octets
   0-64:96% -256:0% -1024:0% -4096:4% -16384:0% -32768:0% -66000:0%
   RX states:8270 probes:2893 naks:0 defs:290484 dups:90125
   TX states:56106 probes:2945 naks:41862 acks:25868 dups:0
   Congestion bearer:0 link:0  Send queue max:2 avg:0

Node 1 reports to have reassembled 1 less buffer from node5 than it has 
transmitted (3219 vs 3218), but the amount of transmitted/received 
fragments match.
Pcap shows that from fragment message buffer id 3194, fragment number 7 
is never sent out by node 5..

Probably, the reassembly buffer is still held by node1, waiting for 
fragment #7 to arrive, which it never does..

I thought i'd give you a heads up on this, i'll continue troubleshooting 
tomorrow.

//E

[tipc-discussion] Link level packet loss

Cluster wide IPC providing datagram, connection, and bus messaging

[tipc-discussion] Link level packet loss