From: Ying X. <yin...@wi...> - 2012-10-31 03:16:48
Erik Hugne wrote:
> So, after the problems with packet loss due to full receive buffers
> were out of the way, I believe we have discovered a bug in the TIPC
> link layer.
> Using the testcase from:
> http://thread.gmane.org/gmane.network.tipc.general/3354
> and inserting a piece of code that dumps the first few octets of the
> packet payload after reassembly finishes at the link layer:
>
>> diff --git a/net/tipc/link.c b/net/tipc/link.c
>> index a79c755..917e3a3 100644
>> --- a/net/tipc/link.c
>> +++ b/net/tipc/link.c
>> @@ -1741,6 +1741,10 @@ deliver:
>>                                          &buf, &msg);
>>                  if (ret == 1) {
>>                          l_ptr->stats.recv_fragmented++;
>> +                        unsigned int *test = (unsigned int *)msg_data(msg);
>> +                        u32 onode = msg_orignode(msg);
>> +                        pr_debug("[link] from %d.%d.%d recv id:\t %u\n",
>> +                                 tipc_zone(onode), tipc_cluster(onode), tipc_node(onode), *test);
>>                          goto deliver;
>>                  }
>>                  if (ret == -1)
>
> I found that reassembly sometimes fails (never finishes) due to lost
> fragments, reproduced on a 12-node cluster.
> The test server reports after a while:
>
>> received msg 1511 but expected 1510 fd 7
>> ^C
>
> The link-level trace confirms that it is lost at the link layer:
> ....
>> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.657400] tipc: [link] from 1.1.5 recv id: 1506
>> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.789924] tipc: [link] from 1.1.5 recv id: 1507
>> Oct 30 15:55:42 Megatron user.warn kernel: [ 1968.962455] tipc: [link] from 1.1.5 recv id: 1508
>> Oct 30 15:55:43 Megatron user.warn kernel: [ 1969.117469] tipc: [link] from 1.1.5 recv id: 1509
>> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.023229] tipc: [link] from 1.1.5 recv id: 1511
>> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.107994] tipc: [link] from 1.1.5 recv id: 1512
>> Oct 30 15:55:45 Megatron user.warn kernel: [ 1971.211754] tipc: [link] from 1.1.5 recv id: 1513
> ....
>
> Pcap can be found here:
> https://t.co/PYqNQDor
>
> Node 5 reports:
> Link <1.1.5:eth0-1.1.1:eth0>
>   ACTIVE  MTU:1500  Priority:10  Tolerance:1500 ms  Window:50 packets
>   RX packets:428 fragments:0/0 bundles:0/0
>   TX packets:151296 fragments:151293/3219 bundles:0/0
>   TX profile sample:831 packets  average:4859 octets
>   0-64:0% -256:0% -1024:2% -4096:92% -16384:0% -32768:0% -66000:5%
>   RX states:56134 probes:2959 naks:41862 defs:0 dups:0
>   TX states:9016 probes:2914 naks:0 acks:3143 dups:156035
>   Congestion bearer:0 link:3215  Send queue max:96 avg:6
>
> Node 1:
> Link <1.1.1:eth0-1.1.5:eth0>
>   ACTIVE  MTU:1500  Priority:10  Tolerance:1500 ms  Window:50 packets
>   RX packets:151296 fragments:151293/3218 bundles:0/0
>   TX packets:428 fragments:0/0 bundles:0/0
>   TX profile sample:83 packets  average:82 octets
>   0-64:96% -256:0% -1024:0% -4096:4% -16384:0% -32768:0% -66000:0%
>   RX states:8270 probes:2893 naks:0 defs:290484 dups:90125
>   TX states:56106 probes:2945 naks:41862 acks:25868 dups:0
>   Congestion bearer:0 link:0  Send queue max:2 avg:0
>
> Node 1 reports having reassembled one buffer fewer from node 5 than
> node 5 has transmitted (3219 vs 3218), but the numbers of transmitted
> and received fragments match.
> Pcap shows that from fragment message buffer id 3194, fragment number
> 7 is never sent out by node 5..

I guess the sender may be causing the issue. So I suggest we add more
logging to link_send_sections_long() to print several variables, such as
the fragment number, the packet sequence number, and the message type
(FIRST_FRAGMENT, FRAGMENT, LAST_FRAGMENT, etc.). These logs may help us
find the root cause.

Regards,
Ying

> Probably, the reassembly buffer is still held by node 1, waiting for
> fragment #7 to arrive, which it never does..
>
> I thought I'd give you a heads-up on this; I'll continue
> troubleshooting tomorrow.
>
> //E