#6 [Dual Hub] A direct spoke to spoke connection breaks down if the primary Hub fails

Milestone: v1.0_(example)
Status: open
Owner: None
Priority: 5
Created: 2018-02-23
Updated: 2018-02-23
Creator: Vladislav
Private: No

Greetings!
I believe I have found a serious problem in opennhrp's NHRP handling.
I will try to describe the problem in detail.

Topology:
1) My test environment has two hubs and two spokes (mGRE 10.10.10.0/24).
2) Hub1: mGRE IP - 10.10.10.100, NBMA IP - 172.16.100.2 (behind static DNAT, pre NAT IP - 100.100.100.10)
3) Hub2: mGRE IP - 10.10.10.200, NBMA IP - 172.16.200.2 (behind static DNAT, pre NAT IP - 100.100.200.20)
4) Spoke1: mGRE IP - 10.10.10.1, NBMA IP - 172.16.1.2 (behind static DNAT, pre NAT IP - 100.100.1.2)
5) Spoke2: mGRE IP - 10.10.10.2, NBMA IP - 172.16.2.2 (without NAT)
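For reference, each node's mgre0 is a multipoint GRE tunnel, which can be created with iproute2 roughly as follows (a sketch for Spoke1; the GRE key and TTL values are assumptions for illustration, not taken from the lab):

```shell
# Multipoint GRE: no fixed remote endpoint is configured; opennhrp
# fills in the protocol-to-NBMA mappings at runtime.
ip tunnel add mgre0 mode gre local 172.16.1.2 key 1234 ttl 64
ip addr add 10.10.10.1/24 dev mgre0
ip link set mgre0 up
```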

OS information:
Debian 7 (kernel: 3.2.0-4-amd64 #1 SMP Debian 3.2.81-2 x86_64 GNU/Linux)

Problem:
If Hub1 fails, then after a while the traffic between Spoke1 and Spoke2 starts to go through Hub2 instead of flowing directly spoke to spoke.

Details (on Spoke1):
1) Hub1 and Hub2 are online:

root@Spoke1:~#
opennhrp[7546]: [10.10.10.100] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.100 is-at 172.16.100.2
opennhrp[7546]: Sending Registration Request to 10.10.10.100 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: [10.10.10.200] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.200 is-at 172.16.200.2
opennhrp[7546]: Sending Registration Request to 10.10.10.200 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.200 (nbma 172.16.200.2)
opennhrp[7546]: Received Registration Reply from 10.10.10.100: success
opennhrp[7546]: NAT detected: our real NBMA address is 172.16.1.2
opennhrp[7546]: Sending Purge Request (of local routes) to 10.10.10.100
opennhrp[7546]: Sending packet 5, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: [10.10.10.100] Peer inserted to multicast list
NHS is UP
opennhrp[7546]: Received Registration Reply from 10.10.10.200: success
opennhrp[7546]: NAT detected: our real NBMA address is 172.16.1.2
opennhrp[7546]: Sending Purge Request (of local routes) to 10.10.10.200
opennhrp[7546]: Sending packet 5, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.200 (nbma 172.16.200.2)
opennhrp[7546]: [10.10.10.200] Peer inserted to multicast list
NHS is UP


NOTE: As you can see, Spoke1 registered successfully with both Hub1 and Hub2.


opennhrp[7546]: NL-ARP(mgre0) who-has 10.10.10.2
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 is-at 172.16.100.2
opennhrp[7546]: Adding incomplete 10.10.10.2/32 dev mgre0
opennhrp[7546]: Sending Resolution Request to 10.10.10.2
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.200.2)
opennhrp[7546]: Forwarding packet from nbma src 100.100.100.10, proto src 10.10.10.100 to proto dst 192.168.1.100, hop count 0
opennhrp[7546]: Sending packet 7, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: Received Resolution Reply 10.10.10.2/32 is at proto 10.10.10.2 nbma 172.16.2.2
opennhrp[7546]: [10.10.10.2] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 is-at 172.16.2.2


root@Spoke1:~# ip n
192.168.1.100 dev eth1 lladdr 00:50:79:66:68:00 REACHABLE
100.100.1.1 dev eth0 lladdr 00:a0:63:37:85:01 REACHABLE
10.10.10.2 dev mgre0 lladdr 172.16.2.2 REACHABLE
10.10.10.200 dev mgre0 lladdr 172.16.200.2 STALE
10.10.10.100 dev mgre0 lladdr 172.16.100.2 STALE


NOTE: As you can see, NHRP added the correct mapping on Spoke1 for Spoke2: mGRE IP 10.10.10.2 to NBMA IP 172.16.2.2 (see Topology).

2) Hub1 went offline (failure); Hub2 is still online:

opennhrp[7546]: Failed to register to 10.10.10.100: timeout (65535)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending Registration Request to 10.10.10.200 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.200 (nbma 172.16.200.2)
opennhrp[7546]: Received Registration Reply from 10.10.10.200: success
opennhrp[7546]: NAT detected: our real NBMA address is 172.16.1.2
opennhrp[7546]: [10.10.10.200] Peer inserted to multicast list
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending Resolution Request to 10.10.10.2
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Failed to resolve 10.10.10.2: timeout (65535)
opennhrp[7546]: Removing cached 10.10.10.2/32 nbma 172.16.2.2 dev mgre0 used up expires_in 2:49
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 not-reachable
Delete link from 10.10.10.1 (100.100.1.2) to 10.10.10.2 (172.16.2.2)
RTNETLINK answers: No such process
opennhrp[7546]: NL-ARP(mgre0) who-has 10.10.10.2
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.2 is-at 172.16.200.2
opennhrp[7546]: Adding incomplete 10.10.10.2/32 dev mgre0
opennhrp[7546]: Sending Resolution Request to 10.10.10.2
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: Multicast from 10.10.10.1 to 224.0.0.5
opennhrp[7546]: Sending packet 1, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.2 (nbma 172.16.100.2)
opennhrp[7546]: [10.10.10.100] Peer up script: success
opennhrp[7546]: NL-ARP(mgre0) 10.10.10.100 is-at 172.16.100.2
opennhrp[7546]: Sending Registration Request to 10.10.10.100 (my mtu=0)
opennhrp[7546]: Sending packet 3, from: 10.10.10.1 (nbma 100.100.1.2), to: 10.10.10.100 (nbma 172.16.100.2)
opennhrp[7546]: Failed to resolve 10.10.10.2: timeout (65535)
opennhrp[7546]: Removing incomplete 10.10.10.2/32 dev mgre0 used


root@Spoke1:~# opennhrpctl show

opennhrp[7546]: Admin: show
Status: ok

Interface: mgre0
Type: local
Protocol-Address: 10.10.10.255/32
Alias-Address: 10.10.10.1
Flags: up

Interface: mgre0
Type: local
Protocol-Address: 10.10.10.1/32
Flags: up

Interface: mgre0
Type: static
Protocol-Address: 10.10.10.200/24
NBMA-Address: 172.16.200.2
Flags: up

Interface: mgre0
Type: static
Protocol-Address: 10.10.10.100/24
NBMA-Address: 172.16.100.2
Flags: used up # why is this still up?


root@Spoke1:~# ip n
192.168.1.100 dev eth1 lladdr 00:50:79:66:68:00 REACHABLE
100.100.1.1 dev eth0 lladdr 00:a0:63:37:85:01 REACHABLE
10.10.10.2 dev mgre0 lladdr 172.16.200.2 REACHABLE # incorrect mapping: should be 172.16.2.2, not 172.16.200.2
10.10.10.100 dev mgre0 lladdr 172.16.100.2 REACHABLE
root@Spoke1:~#


NOTE: As you can see,
- NHRP cannot resolve 10.10.10.2 even though Hub2 is online;
- the Hub1 entry still has "used up" flags even though Hub1 is offline;
- the neighbor table contains an incorrect mapping.

Discussion

  • Vladislav

    Vladislav - 2018-02-23

    opennhrp configurations:
    1) Hub1:
    root@Hub1:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.200/24 172.16.200.2
    multicast dynamic
    holding-time 600
    cisco-authentication secret
    #redirect
    non-caching
    2) Hub2:
    root@Hub2:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.100/24 172.16.100.2
    multicast dynamic
    holding-time 600
    cisco-authentication secret
    #redirect
    non-caching

    3) Spoke1:
    root@Spoke1:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.100/24 172.16.100.2 register
    map 10.10.10.200/24 172.16.200.2 register
    multicast nhs
    holding-time 600
    cisco-authentication secret
    non-caching

    4) Spoke2:
    root@Spoke2:~# cat /etc/opennhrp/opennhrp.conf
    interface mgre0
    map 10.10.10.100/24 172.16.100.2 register
    map 10.10.10.200/24 172.16.200.2 register
    multicast nhs
    holding-time 600
    cisco-authentication secret
    non-caching

    Do you need any additional information about the problem?
    Can you help me?

     
  • Timo Teras

    Timo Teras - 2018-02-23

    To start off I recommend using Quagga/NHRP or FRR/NHRP if possible.

I am not sure how IPsec is configured, but that is likely the cause, because NHRP does not detect peer liveness itself; it depends on IPsec to do it.

If IPsec is not in use, that alone would cause this issue.

If IPsec is in use, then racoon's phase1_dead hook is likely not configured, or the script is not working. With ipsec-tools/opennhrp, dead peer detection works as follows: ipsec-tools executes a dead-peer hook, which should be a script that runs opennhrpctl to inform opennhrp which node died. (In Quagga/FRR NHRP there is strong API-level integration with strongSwan for this.)
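To illustrate the idea of such a hook (a minimal sketch only; opennhrp ships a sample script for this, and the exact racoon hook interface and the admin-command syntax below should be checked against your opennhrp/ipsec-tools versions):

```shell
#!/bin/sh
# Sketch of a racoon dead-peer hook: when IPsec declares a peer dead,
# tell opennhrp to purge the cache entries for that peer's NBMA
# address, so NHRP re-resolves instead of keeping stale mappings.
# Assumption: $1 carries the dead peer's NBMA (public) address.
opennhrpctl purge nbma "$1"
```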

     

    Last edit: Timo Teras 2018-02-23
    • Vladislav

      Vladislav - 2018-02-23

      Thanks for the answer!
1) I cannot use the latest versions of Quagga or FRR because there are no .deb packages for Debian 7.
2) I do use IPsec, but not racoon/strongSwan; I use a proprietary IPsec implementation from S-Terra CSP (a Russian vendor with GOST cipher algorithms).
I assumed that the NHRP protocol should not depend on IPsec. Is this assumption wrong?
If the problem I described can be reproduced without using IPsec, does that make it a bug in opennhrp?

       
