I have been testing ptpd failover to NTP during panic mode, and resuming PTP sync after panic mode expires. I have noticed that regardless of panic_mode_exit_threshold, ptpd will resume sync after panic period expires.
I am running ptpd 2.3.0-1.el6 on RHEL 6.5.
When I set the master to an arbitrary date in 2019, the slave panics. However, despite the master date remaining greater than 10,000,000 nanoseconds from the current time, ptpd still syncs and sets the slave to 2019.
Log:
2015-02-24 13:56:46.462420 ptpd2[1915].eth1 (notice) () Shutdown on close signal
2015-02-24 13:56:47.582184 ptpd2[1957].startup (info) () Info: Now running as a daemon
2015-02-24 13:56:47.582276 ptpd2[1957].startup (info) () Successfully acquired lock on /var/run/ptpd2.lock
2015-02-24 13:56:47.582297 ptpd2[1957].startup (info) () Successfully bound ptpd2 to CPU core 0
2015-02-24 13:56:47.582309 ptpd2[1957].startup (notice) (___) PTPDv2 started successfully on eth1 using "slaveonly" preset (PID 1957)
2015-02-24 13:56:47.582701 ptpd2[1957].eth1 (info) (?) Found NTPd running and controlling the clock
2015-02-24 13:56:47.582717 ptpd2[1957].eth1 (info) (?) Attempting to disable local NTPd
2015-02-24 13:56:47.582768 ptpd2[1957].eth1 (notice) (?) Succesfully disabled local NTPd
2015-02-24 13:56:47.584055 ptpd2[1957].eth1 (info) (init) Observed drift loaded from /etc/ptpd2_kernelclock.drift: -79401
2015-02-24 13:56:47.647604 ptpd2[1957].eth1 (notice) (lstn_init) Now in state: PTP_LISTENING
2015-02-24 13:56:49.084482 ptpd2[1957].eth1 (info) (lstn_init) New best master selected: 080027fffe920555(unknown)/01
2015-02-24 13:56:49.084599 ptpd2[1957].eth1 (info) (lstn_init) About to enter SLAVE state - checking local NTPd
2015-02-24 13:56:49.084650 ptpd2[1957].eth1 (notice) (slv) Now in state: PTP_SLAVE, Best master: 080027fffe920555(unknown)/01
2015-02-24 13:56:50.084352 ptpd2[1957].eth1 (notice) (slv) Received first Sync from Master
2015-02-24 13:56:52.960859 ptpd2[1957].eth1 (notice) (slv) Received first Delay Response from Master
2015-02-24 13:57:14.084345 ptpd2[1957].eth1 (notice) (slv) PTP engine now calibrated - enabling clock control
2015-02-24 13:57:51.772177 ptpd2[1957].eth1 (info) (slv) Servo: Ignoring delayResp because of large OFM
2015-02-24 13:57:52.080098 ptpd2[1957].eth1 (critical) (slv) Offset above 1 second - entering panic mode
2015-02-24 13:58:22.023011 ptpd2[1957].eth1 (notice) (slv) NTP failover triggered
2015-02-24 13:58:22.023162 ptpd2[1957].eth1 (info) (slv) Found NTPd running and not controlling the clock
2015-02-24 13:58:22.023187 ptpd2[1957].eth1 (info) (slv) Attempting to fail over to local NTPd
2015-02-24 13:58:22.023334 ptpd2[1957].eth1 (notice) (slv) Succesfully failed over to NTP
2015-02-24 13:59:52.079956 ptpd2[1957].eth1 (critical) (slv) Panic mode timeout - accepting current offset. Clock will jump
2015-02-24 13:59:52.080047 ptpd2[1957].eth1 (info) (slv) Found NTPd running and controlling the clock
2015-02-24 13:59:52.080072 ptpd2[1957].eth1 (info) (slv) Attempting to disable local NTPd
2015-02-24 13:59:52.080130 ptpd2[1957].eth1 (notice) (slv) Succesfully disabled local NTPd
2019-11-28 00:02:01.827811 ptpd2[1957].eth1 (warning) (slv) Stepped the system clock to: 11/28/19 00:02:01.827798134
2019-11-28 00:02:01.895325 ptpd2[1957].eth1 (notice) (lstn_reset) Now in state: PTP_LISTENING
2019-11-28 00:02:02.828516 ptpd2[1957].eth1 (info) (lstn_reset) New best master selected: 080027fffe920555(unknown)/01
2019-11-28 00:02:02.828674 ptpd2[1957].eth1 (info) (lstn_reset) About to enter SLAVE state - checking local NTPd
2019-11-28 00:02:02.828766 ptpd2[1957].eth1 (notice) (slv) Now in state: PTP_SLAVE, Best master: 080027fffe920555(unknown)/01
2019-11-28 00:02:03.828463 ptpd2[1957].eth1 (notice) (slv) Received first Sync from Master
2019-11-28 00:02:06.708287 ptpd2[1957].eth1 (notice) (slv) Received first Delay Response from Master
2019-11-28 00:02:27.768776 ptpd2[1957].eth1 (notice) (slv) PTP engine now calibrated - enabling clock control
ptpd2.conf:
; ==============================================================================
; This is a recommended configuration for a PTPv2 slave
; For a full list of options run ptpd2 -H or see the documentation and man pages
; ==============================================================================
; interface has to be specified
ptpengine:interface=eth1
; PTP domain
ptpengine:domain=0
; available presets are slaveonly, masteronly and masterslave (full IEEE 1588 implementation)
ptpengine:preset=slaveonly
; multicast for both sync and delay requests - use hybrid for unicast delay requests
ptpengine:ip_mode=multicast
; when enabled, instead of sockets, libpcap is used to receive (sniff) and send (inject) packets.
; on low latency hardware such as 10GE NICs this can provide results close to hardware-assisted PTP
ptpengine:use_libpcap=n
; go into panic mode instead of resetting the clock (duration set below, in minutes)
ptpengine:panic_mode=y
ptpengine:panic_mode_duration=2
; enable outlier filters
ptpengine:sync_outlier_filter_enable=y
ptpengine:delay_outlier_filter_enable=y
; store observed drift in a file
clock:drift_handling=file
; update online statistics every 5 seconds
global:statistics_update_interval=5
; wait 5 statistics intervals for one-way delay to stabilise
ptpengine:calibration_delay=5
; log file, event log only. if timing statistics are needed, see statistics_file
global:log_file=/var/log/ptpd2.log
; log file up to 5M
global:log_file_max_size=5000
; rotate logs up to 5 files
global:log_file_max_files=5
; status file providing an overview of ptpd's operation and statistics
global:log_status=y
; required if ip_mode is set to hybrid
;ptpengine:log_delayreq_interval=0
; log timing statistics like in previous ptpd versions
global:statistics_file=/var/log/ptpd2.stats
global:statistics_file_max_size=50000
; on multi-core systems it is recommended to bind ptpd to a single core
;global:cpuaffinity_cpucore=0
; use DSCP 46 for expedited forwarding over ipv4 networks
ptpengine:ip_dscp=46
; NTP failover
ntpengine:enabled=y
ntpengine:control_enabled=y
ntpengine:key_id=101
ntpengine:key=password
ptpengine:ntp_failover=y
ptpengine:ntp_failover_timeout=30
ptpengine:panic_mode_ntp=y
ptpengine:panic_mode_exit_threshold=10000000
; always keep a new line in the end
Hi Brian,
What you are experiencing is the panic mode ending. The setting is:
; default is 2 minutes
ptpengine:panic_mode_duration = [minutes]
Check the timestamps - exactly 2 minutes:
2015-02-24 13:57:52.080098 ptpd2[1957].eth1 (critical) (slv) Offset above 1 second - entering panic mode
2015-02-24 13:58:22.023011 ptpd2[1957].eth1 (notice) (slv) NTP failover triggered
2015-02-24 13:58:22.023162 ptpd2[1957].eth1 (info) (slv) Found NTPd running and not controlling the clock
2015-02-24 13:58:22.023187 ptpd2[1957].eth1 (info) (slv) Attempting to fail over to local NTPd
2015-02-24 13:58:22.023334 ptpd2[1957].eth1 (notice) (slv) Succesfully failed over to NTP
2015-02-24 13:59:52.079956 ptpd2[1957].eth1 (critical) (slv) Panic mode timeout
This is intended, though perhaps the man page needs to be updated to explain the behaviour. If you increase the panic mode duration AND set the exit threshold, this should work.
The panic mode duration is an unconditional exit from panic mode, so that we don't end up with a locked-up slave that never restores sync.
The panic mode duration should be set long enough to allow the offset to drop.
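For example, a hedged sketch (values illustrative only, not a recommendation; option names as in the posted config):

```ini
; allow up to 10 minutes in panic mode before the unconditional exit
ptpengine:panic_mode_duration=10
; leave panic mode early only once the offset drops below 10 ms (10,000,000 ns)
ptpengine:panic_mode_exit_threshold=10000000
```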
The algorithm is:
If panic mode duration exceeded, exit panic mode
If offset < 1 second and threshold not set, exit panic mode
If offset < 1 second and offset below threshold, exit panic mode
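A minimal Python sketch of this exit decision (a hypothetical helper for illustration, not ptpd's actual code; the 1-second figure is the panic threshold from the "Offset above 1 second" log line):

```python
ONE_SECOND_NS = 1_000_000_000  # "Offset above 1 second" panic threshold

def should_exit_panic(elapsed_s, duration_min, offset_ns, exit_threshold_ns=None):
    """Return True when a slave in panic mode should leave it (sketch)."""
    # 1. Unconditional exit: panic_mode_duration exceeded, clock will jump.
    if elapsed_s >= duration_min * 60:
        return True
    # 2. Offset back under 1 second and no exit threshold configured.
    if offset_ns < ONE_SECOND_NS and exit_threshold_ns is None:
        return True
    # 3. Offset under 1 second and also under panic_mode_exit_threshold.
    if offset_ns < ONE_SECOND_NS and exit_threshold_ns is not None \
            and offset_ns < exit_threshold_ns:
        return True
    return False
```

This also shows why the slave in the log still jumped to 2019: with a 2-minute duration, rule 1 fires after two minutes regardless of the 10,000,000 ns exit threshold.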
Hope this helps.
Regards,
Wojciech
We have two Symmetricom grandmaster devices. The issue arose when one of them (the primary, selected via priority1) went haywire and started publishing an incorrect time (2019).
Ideally we would like the client to find the second PTP master (it can see both, and fails over perfectly when the PTP master falls off the network), or failing that, to fail over to NTP indefinitely, or until the PTP device comes back within a defined offset.
Hmm, not sure if we can help here. Unfortunately this is a "flaw" of IEEE 1588: clocks are only judged by the BMCA evaluation, i.e. by the various fields in the Announce message.
However, if you have any captures or can do a post-mortem in any way: what did the Announce messages look like while the GM was sending incorrect time?
Was this a cold start problem, or did this happen spontaneously?
I'm asking because you should have priority 1 set to the same value on both GMs. This is because lower P1 makes a GM unconditionally "best", even if it drops time reference or whatever.
The BMCA order (shortened - this is incomplete but I left the meaningful parts only) is:
compare P1
...
compare ClockClass <- will be worse during holdover or cold start etc.
...
compare P2
So as you see, if your P1 is lower on one GM, the other GM will only ever take over if the primary is down, and that's not what it's all about.
So there is a chance that the GM suffering from a momentary lapse of reason would actually not be best, had P1 not been lower.
I see P1 only as a manual pre-emption mechanism.
If P1 is equal, then ClockClass (== reference state really) decides, and P2 sets the GM order under normal conditions.
Regardless of what happened in your environment, the secondary GM didn't even have a chance to take over.
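The pre-emption effect can be sketched in Python like this (a simplified comparison over a hypothetical field subset; the full BMCA data set comparison in IEEE 1588-2008 also weighs clockAccuracy, variance and clock identity):

```python
from dataclasses import dataclass

@dataclass
class AnnounceData:
    priority1: int    # manual pre-emption; lower wins unconditionally
    clock_class: int  # worsens in holdover / after losing the GPS reference
    priority2: int    # sets the GM order under normal conditions

def best_master(candidates):
    """Lower value wins at each step, compared in BMCA order (sketch)."""
    return min(candidates, key=lambda a: (a.priority1, a.clock_class, a.priority2))

# GM A lost GPS (worse clock class) but has the lower priority1...
gm_a = AnnounceData(priority1=127, clock_class=52, priority2=128)
gm_b = AnnounceData(priority1=128, clock_class=6, priority2=127)
# ...so it is still selected: the lower P1 pre-empts the clock class comparison,
# and best_master([gm_a, gm_b]) returns gm_a.
```

With equal priority1 values, the comparison falls through to clock class, and the healthy GM would win.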
Last edit: Wojciech Owczarek 2015-02-24
What models are the Symmetricoms - S3xx series, TimeProvider, XLi or something else?
Ah that makes sense. We have P1 set lower on one of the clocks for all of the clients to default to that one. A bit of an oversight on my part.
The issue was spontaneous - basically the clock lost GPS signal. My understanding now is that the clock will realize this, and its clock class will adjust. When the clients see the worsened clock class (assuming P1 is equal), they will fail over to the other master.
We are using TimeProvider 2700s. They have also acknowledged that they may have a bug somewhere that caused the clock to spit out a 2019 date as a result of an open circuit/lost antenna connection.
Thank you for your help!