#6 [Dual Hub] A direct spoke to spoke connection breaks down if the primary Hub fails

Milestone: v1.0_(example)
Status: open
Owner: None
Priority: 5
Created: 2018-02-23
Updated: 2018-02-23
Creator: Vladislav
Private: No

Greetings!
I believe I have found a serious problem in opennhrp's NHRP handling.
I will try to describe the problem in detail.

Topology:
1) My test environment has two hubs and two spokes (mGRE 10.10.10.0/24).
2) Hub1: mGRE IP - 10.10.10.100, NBMA IP - 172.16.100.2 (behind static DNAT, pre NAT IP - 100.100.100.10)
3) Hub2: mGRE IP - 10.10.10.200, NBMA IP - 172.16.200.2 (behind static DNAT, pre NAT IP - 100.100.200.20)
4) Spoke1: mGRE IP - 10.10.10.1, NBMA IP - 172.16.1.2 (behind static DNAT, pre NAT IP - 100.100.1.2)
5) Spoke2: mGRE IP - 10.10.10.2, NBMA IP - 172.16.2.2 (without NAT)
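For reference, each node's mgre0 is a multipoint GRE tunnel, which can be created with iproute2 roughly as follows (a sketch for Spoke1; the GRE key and TTL values are assumptions for illustration, not taken from the lab):

```shell
# Multipoint GRE: no fixed remote endpoint is configured; opennhrp
# fills in the protocol-to-NBMA mappings at runtime.
ip tunnel add mgre0 mode gre local 172.16.1.2 key 1234 ttl 64
ip addr add 10.10.10.1/24 dev mgre0
ip link set mgre0 up
```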

OS information:
Debian 7 (kernel: 3.2.0-4-amd64 #1 SMP Debian 3.2.81-2 x86_64 GNU/Linux)

Problem:
If Hub1 fails, then after a while the traffic between Spoke1 and Spoke2 starts to go through Hub2 instead of flowing directly spoke to spoke.

Details (on Spoke1):
1) Hub1 and Hub2 are online:

root@Spoke1:~#
opennhrp[7546]: [10.10.10.100] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.100 is-at 172.16.100.2
opennhrp[7546]: Sending Registration Request to 10.10.10.100 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: [10.10.10.200] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.200 is-at 172.16.200.2
opennhrp[7546]: Sending Registration Request to 10.10.10.200 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.200 (nbma 172.16.200.2)
opennhrp[7546]: Received Registration Reply from 10.10.10.100: success
opennhrp[7546]: NAT detected: our real NBMA address is 172.16.1.2
opennhrp[7546]: Sending Purge Request (of local routes) to 10.10.10.100
opennhrp[7546]: Sending packet 5, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: [10.10.10.100] Peer inserted to multicast list
NHS is UP
opennhrp[7546]: Received Registration Reply from 10.10.10.200: success
opennhrp[7546]: NAT detected: our real NBMA address is 172.16.1.2
opennhrp[7546]: Sending Purge Request (of local routes) to 10.10.10.200
opennhrp[7546]: Sending packet 5, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.200 (nbma 172.16.200.2)
opennhrp[7546]: [10.10.10.200] Peer inserted to multicast list
NHS is UP


NOTE: As you can see, Spoke1 registered successfully with both Hub1 and Hub2.


opennhrp[7546]: NL-ARP(mgre0) who-has 10.10.10.2
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 is-at 172.16.100.2
opennhrp[7546]: Adding incomplete 10.10.10.2/32 dev mgre0
opennhrp[7546]: Sending Resolution Request to 10.10.10.2
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.200.2)
opennhrp[7546]: Forwarding packet from nbma src 100.100.100.10, proto src 10.10.10.100 to proto dst 192.168.1.100, hop count 0
opennhrp[7546]: Sending packet 7, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: Received Resolution Reply 10.10.10.2/32 is at proto 10.10.10.2 nbma 172.16.2.2
opennhrp[7546]: [10.10.10.2] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 is-at 172.16.2.2


root@Spoke1:~# ip n
192.168.1.100 dev eth1 lladdr 00:50:79:66:68:00 REACHABLE
100.100.1.1 dev eth0 lladdr 00:a0:63:37:85:01 REACHABLE
10.10.10.2 dev mgre0 lladdr 172.16.2.2 REACHABLE
10.10.10.200 dev mgre0 lladdr 172.16.200.2 STALE
10.10.10.100 dev mgre0 lladdr 172.16.100.2 STALE


NOTE: As you can see, NHRP added the correct mapping on Spoke1 for Spoke2: mGRE IP 10.10.10.2 to NBMA IP 172.16.2.2 (see Topology).

2) Hub1 went offline (failure); Hub2 is still online:

opennhrp[7546]: Failed to register to 10.10.10.100: timeout (65535)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending Registration Request to 10.10.10.200 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.200 (nbma 172.16.200.2)
opennhrp[7546]: Received Registration Reply from 10.10.10.200: success
opennhrp[7546]: NAT detected: our real NBMA address is 172.16.1.2
opennhrp[7546]: [10.10.10.200] Peer inserted to multicast list
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending Resolution Request to 10.10.10.2
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Failed to resolve 10.10.10.2: timeout (65535)
opennhrp[7546]: Removing cached 10.10.10.2/32 nbma 172.16.2.2 dev mgre0 used up expires_in 2:49
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 not-reachable
Delete link from 10.10.10.1 (100.100.1.2) to 10.10.10.2 (172.16.2.2)
RTNETLINK answers: No such process
opennhrp[7546]: NL-ARP(mgre0) who-has 10.10.10.2
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 is-at 172.16.200.2
opennhrp[7546]: Adding incomplete 10.10.10.2/32 dev mgre0
opennhrp[7546]: Sending Resolution Request to 10.10.10.2
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: [10.10.10.100] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.100 is-at 172.16.100.2
opennhrp[7546]: Sending Registration Request to 10.10.10.100 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: Failed to resolve 10.10.10.2: timeout (65535)
opennhrp[7546]: Removing incomplete 10.10.10.2/32 dev mgre0 used


root@Spoke1:~# opennhrpctl show

opennhrp[7546]: Admin: show
Status: ok

Interface: mgre0
Type: local
Protocol-Address: 10.10.10.255/32
Alias-Address: 10.10.10.1
Flags: up

Interface: mgre0
Type: local
Protocol-Address: 10.10.10.1/32
Flags: up

Interface: mgre0
Type: static
Protocol-Address: 10.10.10.200/24
NBMA-Address: 172.16.200.2
Flags: up

Interface: mgre0
Type: static
Protocol-Address: 10.10.10.100/24
NBMA-Address: 172.16.100.2
Flags: used up # why is this still up?


root@Spoke1:~# ip n
192.168.1.100 dev eth1 lladdr 00:50:79:66:68:00 REACHABLE
100.100.1.1 dev eth0 lladdr 00:a0:63:37:85:01 REACHABLE
10.10.10.2 dev mgre0 lladdr 172.16.200.2 REACHABLE # incorrect mapping: should be 172.16.2.2, not 172.16.200.2
10.10.10.100 dev mgre0 lladdr 172.16.100.2 REACHABLE
root@Spoke1:~#


NOTE: As you can see,
- NHRP cannot resolve 10.10.10.2 even though Hub2 is online;
- the Hub1 entry still has "used up" flags even though Hub1 is offline;
- the neighbor table contains an incorrect mapping.

Discussion

  • Vladislav

    Vladislav - 2018-02-23

    opennhrp configurations:
    1) Hub1:
    root@Hub1:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.200/24 172.16.200.2
    multicast dynamic
    holding-time 600
    cisco-authentication secret
    #redirect
    non-caching
    2) Hub2:
    root@Hub2:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.100/24 172.16.100.2
    multicast dynamic
    holding-time 600
    cisco-authentication secret
    #redirect
    non-caching

    3) Spoke1:
    root@Spoke1:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.100/24 172.16.100.2 register
    map 10.10.10.200/24 172.16.200.2 register
    multicast nhs
    holding-time 600
    cisco-authentication secret
    non-caching

    4) Spoke2:
    root@Spoke2:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.100/24 172.16.100.2 register
    map 10.10.10.200/24 172.16.200.2 register
    multicast nhs
    holding-time 600
    cisco-authentication secret
    non-caching

    Do you need any additional information about the problem?
    Can you help me?

     
  • Timo Teras

    Timo Teras - 2018-02-23

    To start off I recommend using Quagga/NHRP or FRR/NHRP if possible.

I am not sure how IPsec is configured, but that is likely the cause, because NHRP does not detect peer liveness itself; it depends on IPsec to do it.

If IPsec is not in use, that alone would cause this issue.

If IPsec is in use, then racoon's phase1_dead hook is likely not configured, or the script is not working. With ipsec-tools/opennhrp, dead peer detection works as follows: ipsec-tools executes a dead-peer hook, which should be a script that runs opennhrpctl to inform opennhrp which node died. (In Quagga/FRR NHRP there is strong API-level integration with strongSwan for this.)
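To illustrate the idea of such a hook (a minimal sketch only; opennhrp ships a sample script for this, and the exact racoon hook interface and the admin-command syntax below should be checked against your opennhrp/ipsec-tools versions):

```shell
#!/bin/sh
# Sketch of a racoon dead-peer hook: when IPsec declares a peer dead,
# tell opennhrp to purge the cache entries for that peer's NBMA
# address, so NHRP re-resolves instead of keeping stale mappings.
# Assumption: $1 carries the dead peer's NBMA (public) address.
opennhrpctl purge nbma "$1"
```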

     

    Last edit: Timo Teras 2018-02-23
    • Vladislav

      Vladislav - 2018-02-23

      Thanks for the answer!
1) I cannot use the latest versions of Quagga or FRR because there are no .deb packages for Debian 7.
2) I do use IPsec, but not racoon/strongSwan; I use a proprietary IPsec implementation from S-Terra CSP (a Russian vendor with GOST cipher algorithms).
I assumed that the NHRP protocol should not depend on IPsec. Is this assumption wrong?
If the problem I described can be reproduced without using IPsec, does that make it a bug in opennhrp?

       
