netRefreshIGMP - stop leaving multicast group

2016-08-05
  • Luke Pyzowski

    Luke Pyzowski - 2016-08-05

    I have a new ptp client using sfptpd on a box using their hardware. I started to notice in my logs the following:

    Aug 5 11:05:09 daemon.warning lnxr02 sfptpd: failed to receive DelayResp for DelayReq sequence number 1603

    After a lot of investigation, I finally found the issue - the ptpd master leaves the multicast group every 60 seconds, and if that coincides with a DelayReq from a client, the master misses that message.

    Looking at the source code, in src/dep/net.c we have netRefreshIGMP - and in a refresh we call netShutdownMulticast - which leaves the multicast group.

    If one is refreshing multicast, you don't need to leave the group to send another join. By leaving, you now have a ~100 millisecond window where you will not be responding to ptp clients, and that does not seem optimal. Why does a refresh involve a shutdown of the multicast group?

     
  • Wojciech Owczarek

    Hi Luke,

    Unfortunately this is not as straightforward as it seems. Naturally, it should not be leaving the group for no apparent reason. However, this is the only way to force re-sending an IGMP join on Linux - or at least it was. This can be disabled - obviously the kernel manages its own joins and responds to IGMP queries, so you don't necessarily need to re-join. PTPd refreshes the joins this way when running as master so that when you change an IP address for example or when a link flaps, you re-join soon enough. The side effect is the leave. You can disable the refresh, it does not disable the join, just the periodic flap.

    We cannot simply change this in the code, it needs more research. PTPd works across a vast range of kernel versions and distributions. We can consider this for the next version. You can raise an issue on git: https://github.com/ptpd/ptpd/issues

    As a workaround, you can either:

    • disable the "IGMP refresh": ptpengine:igmp_refresh=n - the host should still respond to your querier as long as a join was sent once and the process didn't die,
    • switch to IGMPv1, which IGMPv2 is backwards-compatible with and which does not support explicit leaves - though there may be IGMPv2 dependencies on your network. My guess is that your environment uses much more multicast than just PTPd,
    • change to hybrid mode (Enterprise Profile) - then your master does not need to process multicast DelayReq. sfptpd supports this (automatically - it tries it before multicast) - the benefit is that you don't need bi-directional multicast when you do this, the slaves only need to receive multicast sync/fup/announce; delayreq/resp is unicast.
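For the first option, the setting as written above would look like this in the ptpd configuration file. This is only a sketch that assumes the file takes the same key=value form used in this thread; check the documentation for your ptpd version.

```
ptpengine:igmp_refresh=n
```

With the refresh disabled, ptpengine:master_igmp_refresh_interval should no longer matter, since there is no periodic leave and re-join left to schedule.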

    Finally, to be honest, the one missed reply every 60 seconds (default of ptpengine:master_igmp_refresh_interval) will give you warnings, but it will not cause any performance issues. Mean Path Delay is so heavily filtered that it becomes nearly constant even if the network PDV is high. Honestly, one or two or five missed replies do not change much.
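As a rough back-of-the-envelope check, take the ~100 millisecond window reported above and the 60 second default refresh interval. The fraction of time the master is outside the group is

$$
p \approx \frac{0.1\ \text{s}}{60\ \text{s}} = \frac{1}{600} \approx 0.17\%.
$$

Assuming, purely for illustration, that a client sends one DelayReq per second at a random phase relative to the refresh, the expected number of lost requests per refresh interval is

$$
60 \times \frac{1}{600} = 0.1,
$$

which works out to roughly one missed DelayResp every ten minutes per client. The DelayReq rate is an assumption, not a figure from this thread; a faster request rate or a longer window scales the estimate proportionally.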

    Hope this helps.

    Thanks,
    Wojciech

     
  • Luke Pyzowski

    Luke Pyzowski - 2016-08-05

    For our purposes "ptpengine:igmp_refresh=n" is actually sufficient to resolve the issue.

    You are correct on your last point. It seemed very odd that ptpd was not a functional server for a small period of time, but I understand your reasoning for implementing it that way. We did not consider it a critical issue, and we understood the impact of periodically missing a message, but it was a curious thing to run into while investigating the errors, so we wanted a better handle on the reasoning behind this behavior and implementation.

    Thank you for the explanation, it was highly insightful.

     
