Menu

PTP panic mode & failover to NTP

Help
brian w
2015-02-24
2015-02-27
  • brian w

    brian w - 2015-02-24

    I have been testing ptpd failover to NTP during panic mode, and resuming PTP sync after panic mode expires. I have noticed that regardless of panic_mode_exit_threshold, ptpd will resume sync after panic period expires.

    I am running ptpd 2.3.0-1.el6 on RHEL 6.5.

    When I set the master to an arbitrary date in 2019, the slave panics. However, despite the master date remaining greater than 10,000,000 nanoseconds from the current time, ptpd still syncs and sets the slave to 2019.

    Log:
    2015-02-24 13:56:46.462420 ptpd2[1915].eth1 (notice) () Shutdown on close signal
    2015-02-24 13:56:47.582184 ptpd2[1957].startup (info) (
    ) Info: Now running as a daemon
    2015-02-24 13:56:47.582276 ptpd2[1957].startup (info) () Successfully acquired lock on /var/run/ptpd2.lock
    2015-02-24 13:56:47.582297 ptpd2[1957].startup (info) (
    ) Successfully bound ptpd2 to CPU core 0
    2015-02-24 13:56:47.582309 ptpd2[1957].startup (notice) (___) PTPDv2 started successfully on eth1 using "slaveonly" preset (PID 1957)
    2015-02-24 13:56:47.582701 ptpd2[1957].eth1 (info) (?) Found NTPd running and controlling the clock
    2015-02-24 13:56:47.582717 ptpd2[1957].eth1 (info) (?) Attempting to disable local NTPd
    2015-02-24 13:56:47.582768 ptpd2[1957].eth1 (notice) (?) Succesfully disabled local NTPd
    2015-02-24 13:56:47.584055 ptpd2[1957].eth1 (info) (init) Observed drift loaded from /etc/ptpd2_kernelclock.drift: -79401
    2015-02-24 13:56:47.647604 ptpd2[1957].eth1 (notice) (lstn_init) Now in state: PTP_LISTENING
    2015-02-24 13:56:49.084482 ptpd2[1957].eth1 (info) (lstn_init) New best master selected: 080027fffe920555(unknown)/01
    2015-02-24 13:56:49.084599 ptpd2[1957].eth1 (info) (lstn_init) About to enter SLAVE state - checking local NTPd
    2015-02-24 13:56:49.084650 ptpd2[1957].eth1 (notice) (slv) Now in state: PTP_SLAVE, Best master: 080027fffe920555(unknown)/01
    2015-02-24 13:56:50.084352 ptpd2[1957].eth1 (notice) (slv) Received first Sync from Master
    2015-02-24 13:56:52.960859 ptpd2[1957].eth1 (notice) (slv) Received first Delay Response from Master
    2015-02-24 13:57:14.084345 ptpd2[1957].eth1 (notice) (slv) PTP engine now calibrated - enabling clock control
    2015-02-24 13:57:51.772177 ptpd2[1957].eth1 (info) (slv) Servo: Ignoring delayResp because of large OFM
    2015-02-24 13:57:52.080098 ptpd2[1957].eth1 (critical) (slv) Offset above 1 second - entering panic mode
    2015-02-24 13:58:22.023011 ptpd2[1957].eth1 (notice) (slv) NTP failover triggered
    2015-02-24 13:58:22.023162 ptpd2[1957].eth1 (info) (slv) Found NTPd running and not controlling the clock
    2015-02-24 13:58:22.023187 ptpd2[1957].eth1 (info) (slv) Attempting to fail over to local NTPd
    2015-02-24 13:58:22.023334 ptpd2[1957].eth1 (notice) (slv) Succesfully failed over to NTP
    2015-02-24 13:59:52.079956 ptpd2[1957].eth1 (critical) (slv) Panic mode timeout - accepting current offset. Clock will jump
    2015-02-24 13:59:52.080047 ptpd2[1957].eth1 (info) (slv) Found NTPd running and controlling the clock
    2015-02-24 13:59:52.080072 ptpd2[1957].eth1 (info) (slv) Attempting to disable local NTPd
    2015-02-24 13:59:52.080130 ptpd2[1957].eth1 (notice) (slv) Succesfully disabled local NTPd
    2019-11-28 00:02:01.827811 ptpd2[1957].eth1 (warning) (slv) Stepped the system clock to: 11/28/19 00:02:01.827798134
    2019-11-28 00:02:01.895325 ptpd2[1957].eth1 (notice) (lstn_reset) Now in state: PTP_LISTENING
    2019-11-28 00:02:02.828516 ptpd2[1957].eth1 (info) (lstn_reset) New best master selected: 080027fffe920555(unknown)/01
    2019-11-28 00:02:02.828674 ptpd2[1957].eth1 (info) (lstn_reset) About to enter SLAVE state - checking local NTPd
    2019-11-28 00:02:02.828766 ptpd2[1957].eth1 (notice) (slv) Now in state: PTP_SLAVE, Best master: 080027fffe920555(unknown)/01
    2019-11-28 00:02:03.828463 ptpd2[1957].eth1 (notice) (slv) Received first Sync from Master
    2019-11-28 00:02:06.708287 ptpd2[1957].eth1 (notice) (slv) Received first Delay Response from Master
    2019-11-28 00:02:27.768776 ptpd2[1957].eth1 (notice) (slv) PTP engine now calibrated - enabling clock control

    ptpd2.conf:
    ; ==============================================================================
    ; This is a recommended configuration for a PTPv2 slave
    ; For a full list of options run ptpd2 -H or see the documentation and man pages
    ; ==============================================================================

    ; interface has to be specified
    ptpengine:interface=eth1

    ; PTP domain
    ptpengine:domain=0

    ; available presets are slaveonly, masteronly and masterslave (full IEEE 1588 implementation)
    ptpengine:preset=slaveonly

    ; multicast for both sync and delay requests - use hybrid for unicast delay requests
    ptpengine:ip_mode=multicast

    ; when enabled, instead of sockets, libpcap is used to receive (sniff) and send (inject) packets.
    ; on low latency hardware such as 10GE NICs this can provide results close to hardware-assisted PTP
    ptpengine:use_libpcap=n

    ; go into panic mode for 10 minutes instead of resetting the clock
    ptpengine:panic_mode=y
    ptpengine:panic_mode_duration=2

    ; uncomment this to enable outlier filters
    ptpengine:sync_outlier_filter_enable=y
    ptpengine:delay_outlier_filter_enable=y

    ; store observed drift in a file
    clock:drift_handling=file

    ; update online statistics every 5 seconds
    global:statistics_update_interval=5

    ; wait 5 statistics intervals for one-way delay to stabilise
    ptpengine:calibration_delay=5

    ; log file, event log only. if timing statistics are needed, see statistics_file
    global:log_file=/var/log/ptpd2.log
    ; log file up to 5M
    global:log_file_max_size=5000
    ; rotate logs up to 5 files
    global:log_file_max_files=5

    ; status file providing an overview of ptpd's operation and statistics
    global:log_status=y

    ; required if ip_mode is set to hybrid
    ;ptpengine:log_delayreq_interval=0

    ; uncomment this to log a timing log like in previous ptpd versions
    global:statistics_file=/var/log/ptpd2.stats

    global:statistics_file_max_size=50000

    ; on multi-core systems it is recommended to bind ptpd to a single core
    ;global:cpuaffinity_cpucore=0

    ; use DSCP 46 for expedited forwarding over ipv4 networks
    ptpengine:ip_dscp=46

    ; NTP failover
    ntpengine:enabled=y
    ntpengine:control_enabled=y
    ntpengine:key_id=101
    ntpengine:key=password
    ptpengine:ntp_failover=y
    ptpengine:ntp_failover_timeout=30
    ptpengine:panic_mode_ntp=y
    ptpengine:panic_mode_exit_threshold=10000000

    ; always keep a new line in the end

     
    • Wojciech Owczarek

      Hi Brian,

      What you are experiencing is the panic mode ending. The setting is:

      ; default is 2 minutes
      ptpengine:panic_mode_duration = [minutes ]

      Check the timestamps - exactly 2 minutes:


      015-02-24 13:57:52.080098 ptpd2[1957].eth1 (critical) (slv) Offset above 1 second - entering panic mode
      2015-02-24 13:58:22.023011 ptpd2[1957].eth1 (notice) (slv) NTP failover triggered
      2015-02-24 13:58:22.023162 ptpd2[1957].eth1 (info) (slv) Found NTPd running and not controlling the clock
      2015-02-24 13:58:22.023187 ptpd2[1957].eth1 (info) (slv) Attempting to fail over to local NTPd
      2015-02-24 13:58:22.023334 ptpd2[1957].eth1 (notice) (slv) Succesfully failed over to NTP
      2015-02-24 13:*59:52.079956 ptpd2[1957].eth1 (critical) (slv) Panic mode timeout

      This is intended. Maybe the man page needs updated to explain the behaviour. If you increase the panic mode duration AND set the exit threshold, this should work.

      The panic mode duration is an unconditional exit out of the panic mode, so that we don't end up with a locked up slave that never restores sync.

      Panic mode duration should be set to be long enough to allow for the offset to drop.

      The algorithm is:

      1. If panic mode duration exceeded, exit panic mode
      2. If offset < 1 second and threshold not set, exit panic mode
      3. If offset < 1 second and offset below threshold, exit panic mode

      Hope this helps.

      Regards,
      Wojciech

       
  • brian w

    brian w - 2015-02-24

    We have two grandmaster symmetricom devices. The issue came when we had one of them (primary - set via priority1) went haywire and started publishing incorrect time (2019).

    We are looking for the client to ideally find the second PTP master device (it can see both and fails over perfectly when the PTP master falls off the network), or if not - failover to NTP indefinitely, or until the PTP device comes back within a defined offset.

     
    • Wojciech Owczarek

      See, not sure if we can help here. Unfortunately - this is a "flaw" of IEEE 1588 - clocks are only judged by the BMCA evaluation - by the various flags in the Announce message.

      However, not sure if you have any captures or if you can do a post-mortem in any way, but how did the Announce look like while the GM was sending incorrect time?

      Was this a cold start problem, or did this happen spontaneously?

      I'm asking because you should have priority 1 set to the same value on both GMs. This is because lower P1 makes a GM unconditionally "best", even if it drops time reference or whatever.

      The BMCA order (shortened - this is incomplete but I left the meaningful parts only) is:

      1. compare P1
        ...
      2. compare ClockClass <- will be worse during holdover or cold start etc.
        ...
      3. compare P2

      So as you see, if your P1 is lower on one GM, the other GM will only ever take over if the primary is down, and that's not what it's all about.

      So there is a chance that the GM suffering from a momentary lapse of reason would actually not be best, had P1 not been lower.

       
      • Wojciech Owczarek

        I see P1 only as a manual pre-emption mechanism.

        If P1 is equal, then ClockClass (== reference state really) decides, and P2 sets the GM order under normal conditions.

        Regardless of what happened in your environment, the secondary GM didn't even have a chance to take over.

         

        Last edit: Wojciech Owczarek 2015-02-24
  • Wojciech Owczarek

    What models are the Symmetricoms - S3xx series, TimeProvider, XLi or something else?

     
  • brian w

    brian w - 2015-02-27

    Ah that makes sense. We have P1 set lower on one of the clocks for all of the clients to default to that one. A bit of an oversight on my part.

    The issue was spontaneous - basically the clock lost GPS signal. My understanding now is that the clock will realize this, and it's clock class will adjust. When the clients see the worsened clock class (assuming P1 is equal), they will failover to the other master.

    We are using TimeProvider 2700s. They have also acknowledged that they may have a bug somewhere that caused the clock to spit out a 2019 date as a result of an open circuit/lost antenna connection.

    Thank you for your help!

     
  • brian w

    brian w - 2015-02-27

    Ah that makes sense. We have P1 set lower on one of the clocks for all of the clients to default to that one. A bit of an oversight on my part.

    The issue was spontaneous - basically the clock lost GPS signal. My understanding now is that the clock will realize this, and it's clock class will adjust. When the clients see the worsened clock class (assuming P1 is equal), they will failover to the other master.

    We are using TimeProvider 2700s. They have also acknowledged that they may have a bug somewhere that caused the clock to spit out a 2019 date as a result of an open circuit/lost antenna connection.

    Thank you for your help!

     

Log in to post a comment.