Re: [opennhrp-devel] Random Disconnects from Static Peers
From: Timo T. <tim...@ik...> - 2013-12-11 06:41:22
Hi,

As a first comment: it sounds like your kernel is missing the
CONFIG_ARPD option, and all the woes are a result of that. Can you
confirm your kernel configuration first?

A few additional comments inline.

On Tue, 10 Dec 2013 10:39:10 -0500
"Hegner, Travis" <TH...@tr...> wrote:

> I've discovered an issue where a statically mapped peer seemingly
> randomly loses connectivity over the mGRE interface. Even during the
> connectivity issue, OSPF still keeps perfect adjacencies with the
> disconnected peer, and racoon is able to keep re-negotiating new SAs.
> No other connections seem to work over the GRE interfaces. In many
> searches I've stumbled on to other people describing a similar issue
> without resolve. I think I may have discovered the issue as it
> affects my environment, and have a proposed solution with a patch,
> which you should find attached. First I will go into some detail
> about my environment, and the nature of the problem.

OSPF (= multicast) packets are actually handled by opennhrp directly,
and opennhrp bypasses the kernel's ARP table when it sends individual
copies of the packets to each NBMA target. The same applies to NHRP
traffic. Thus the difference.

> In troubleshooting this issue, I have stripped off all of the many
> layers that comprise a DMVPN in order to nail down a single,
> reproducible problem. I landed with the following configuration:
>
> OS: Ubuntu Server 12.04.3 LTS (64bit Virtualized)
> Kernel: 3.8.0-34-generic (latest Ubuntu stock kernel)
> OpenNHRP: 0.14.1

Checking your kernel config:

$ wget http://launchpadlibrarian.net/156432117/linux-image-3.8.0-34-generic_3.8.0-34.49_amd64.deb
$ ar x linux-image-3.8.0-34-generic_3.8.0-34.49_amd64.deb
$ tar xjvf data.tar.bz2 ./boot/config-3.8.0-34-generic
$ grep ARPD boot/config-3.8.0-34-generic
# CONFIG_ARPD is not set

So, uh, I'm 99.9% sure that your problems are because of this. You
really need a kernel with this enabled, as is explained in the README
file. Please read the README.
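For a kernel that is already installed, the same check can be run
against the local config file instead of unpacking the .deb. The
following is only a minimal sketch: the function name arpd_status is
made up for illustration, and the /boot/config-<release> location used
in the usage note is the Debian/Ubuntu convention, which other
distributions may not follow.

```shell
#!/bin/sh
# Hypothetical helper (not part of opennhrp): report how a kernel
# config file treats CONFIG_ARPD.
# Prints one of: enabled, disabled, absent, unknown.
arpd_status() {
    config_file=$1

    # No readable config file at this path: nothing can be concluded.
    if [ ! -r "$config_file" ]; then
        echo "unknown"
        return 2
    fi

    # Option explicitly built in.
    if grep -q '^CONFIG_ARPD=y' "$config_file"; then
        echo "enabled"
        return 0
    fi

    # Option known to this kernel but switched off (the case shown above).
    if grep -q '^# CONFIG_ARPD is not set' "$config_file"; then
        echo "disabled"
        return 1
    fi

    # Option not mentioned at all, e.g. a kernel that no longer has it.
    echo "absent"
    return 3
}
```

On a Debian or Ubuntu style system this could be called as
arpd_status "/boot/config-$(uname -r)". A result of "disabled" matches
the stock 3.8.0-34 config above; "absent" needs to be read together
with the kernel version.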
Fortunately, beginning with Linux 3.12.x this functionality is always
enabled (the config option no longer exists), so that fixes it.

> My version of OpenNHRP was built from here:
> https://github.com/darkskiez/opennhrp-debian which simply wraps the
> debian build scripts around the vanilla version. The deb package was
> built using "debuild -uc -us -B" and then installed with "dpkg -i". I
> was able to reproduce the issue much more consistently by removing
> our routing protocol and all encryption altogether. The test
> consisted of two machines on separate routable subnets, and the
> opennhrp.conf file had only a short holding-time, and a single map to
> the opposing endpoint. This configuration actually never sends NHRP
> packets over the network, yet the issue remained. After the
> connectivity failure, an opennhrpctl purge on both ends would resolve
> the issue, whether a full purge or just for the static map.

Yep. A missing CONFIG_ARPD has exactly these symptoms. This has been
discussed several times on this list over the years, and the answer
has always been "enable CONFIG_ARPD".

> In another test I configured one endpoint to register with the other.
> In this case the node with the static map would lose connectivity,
> yet still continually update its registration with the other node
> successfully. During the connectivity loss, the affected node would
> get a destination unreachable reply to an ICMP ping, while the node
> with the dynamic mapping would actually send the packets out and not
> get a response.
>
> In this environment, it seems that the kernel is marking the static
> peer neighbor entries as stale, and then removing them from the
> neighbor table completely. This is even with a continual ICMP ping
> going across the connection. I am not sure what indications this
> kernel uses to classify a neighbor entry as in use, but the ICMP ping
> isn't enough. Coupled with the idle nature of a lab setup, this issue
> had been affecting us severely.
> From what I can tell, there doesn't
> seem to be any execution path which will re-add the static map back
> into the neighbor table. It actually appears that the code assumes it
> will always be there.
>
> With that said, the proposed solution would be to insert the neighbor
> entry for static maps with a "NUD_PERMANENT" flag, rather than a
> "NUD_REACHABLE" flag. In this scenario, it prevents the kernel from
> being able to remove the affected entry, and allows opennhrp to fully
> manage the entry itself. It also avoids having to do any periodic
> refreshes from within the application. In my humble opinion, that
> behavior is more appropriate given the nature of a static mapping
> anyway.

This is debatable. However, the patch you propose fixes only a
fraction of the problem. You will still get problems with dynamic
peers without CONFIG_ARPD. So I'm not convinced that static mappings
benefit from this. In fact, it might mislead you into thinking that
your kernel is OK when things work with some peers but not with
others.

There is a certain 'equivalence' between static mappings and
NUD_PERMANENT. But I think the end result is more harm than benefit,
even if on a conceptual level it would make sense.

> The attached patch tries to be minimally invasive, and simply adds a
> condition check for peer->type during
> nhrp_peer_script_peer_up_done(), and then calls a modified version of
> kernel_inject_neighbor(), with an added (pseudo) Boolean flag for
> whether it should be injected with "NUD_PERMANENT". I'm sure that
> there are much more appropriate and cleaner ways to implement the
> solution, so I'll leave it up to the core developers to decide that.
> The first few that come to mind are getting the peer->type directly
> if possible, having only one kernel_inject function (requires
> modifying all calls to that function), and utilizing the passed int
> (pseudo Boolean) as a whole set of bit flags for optionally
> controlling the injected neighbor in other future ways.
> If it's
> agreed that the behavior is appropriate, it may be worth exploring
> whether other peer types should also be injected with the
> "NUD_PERMANENT" flag.

There is a code path to do it. But it's enabled only with CONFIG_ARPD.

> If the behavior is deemed too environment specific, then I would need
> a way to inject the neighbor during the peer-up script, that won't be
> over-written by opennhrp itself.
>
> Thanks, and I hope this solution, or at least this discovery, proves
> to be useful for someone.

I think it is a hack that fixes only a fraction of the problems that
you experience due to using an unsuitably configured kernel. I'd
prefer that you use a kernel that fulfills the requirements outlined
in the README.

Thanks,
Timo