Re: [Linuxptp-users] [External] Re: clockcheck jumps forwards and backwards
PTP IEEE 1588 stack for Linux
Brought to you by:
rcochran
From: Ian T. <Ian...@pg...> - 2017-04-05 14:14:06
|
All Here’s a log with phc2sys output. This board ran for 11 hours without an error before this happened … Apr 5 09:18:21 localhost user.info phc2sys: [38796.187] rms 28 max 65 freq -242 +/- 1 delay 1099 +/- 11 Apr 5 09:20:21 localhost user.info phc2sys: [38916.211] rms 29 max 73 freq -242 +/- 3 delay 1100 +/- 12 Apr 5 09:20:40 localhost user.info ptp4l: [38934.698] rms 29 max 76 freq -242 +/- 6 delay 2029 +/- 4 Apr 5 09:20:52 localhost user.err ptp4l: [38946.961] timed out while polling for tx timestamp Apr 5 09:20:52 localhost user.err ptp4l: [38946.961] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug Apr 5 09:20:52 localhost user.err ptp4l: [38946.962] Failure in transport_send(to)() Apr 5 09:20:52 localhost user.err ptp4l: [38946.962] port 1: send delay request failed Apr 5 09:20:52 localhost user.notice ptp4l: [38946.962] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED) Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] port 2cee26.fffe.000189-1 changed state Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] reconfiguring after port state change Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] selecting eth0 for synchronization Apr 5 09:20:52 localhost user.info phc2sys: [38947.222] nothing to synchronize Apr 5 09:21:08 localhost user.notice ptp4l: [38962.962] port 1: FAULTY to LISTENING on INIT_COMPLETE Apr 5 09:21:08 localhost user.warn ptp4l: [38963.198] clockcheck: clock jumped backward or running slower than expected! Apr 5 09:21:10 localhost user.notice ptp4l: [38964.698] port 1: new foreign master 000cec.fffe.0a0f8d-1 Apr 5 09:21:14 localhost user.notice ptp4l: [38968.698] selected best master clock 000cec.fffe.0a0f8d Apr 5 09:21:14 localhost user.notice ptp4l: [38968.698] port 1: LISTENING to UNCALIBRATED on RS_SLAVE Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] port 2cee26.fffe.000189-1 changed state Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] reconfiguring after port state change Apr 5 09:21:14 localhost user.info phc2sys: [38969.224] master clock not ready, waiting... Apr 5 09:21:21 localhost user.notice ptp4l: [38975.698] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED Apr 5 09:21:21 localhost user.info phc2sys: [38976.229] port 2cee26.fffe.000189-1 changed state Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] reconfiguring after port state change Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] selecting CLOCK_REALTIME for synchronization Apr 5 09:21:21 localhost user.info phc2sys: [38976.230] selecting eth0 as the master clock Apr 5 09:22:50 localhost user.info phc2sys: [39065.260] rms 34 max 88 freq -241 +/- 5 delay 1100 +/- 10 Apr 5 09:24:50 localhost user.info phc2sys: [39185.288] rms 28 max 67 freq -242 +/- 1 delay 1099 +/- 11 Apr 5 09:25:23 localhost user.info ptp4l: [39217.698] rms 3270369174 max 37000003713 freq -238 +/- 95 delay 2030 +/- 8 Apr 5 09:26:50 localhost user.info phc2sys: [39305.330] rms 29 max 70 freq -241 +/- 2 delay 1099 +/- 10 Apr 5 09:28:50 localhost user.info phc2sys: [39425.376] rms 29 max 71 freq -242 +/- 1 delay 1100 +/- 11 Apr 5 09:29:39 localhost user.info ptp4l: [39473.698] rms 29 max 75 freq -243 +/- 2 delay 2023 +/- 9 Ian T. From: David Mirabito [mailto:dav...@gm...] Sent: Tuesday, April 04, 2017 6:18 PM To: Ian Thompson Cc: lin...@li... Subject: [External] Re: [Linuxptp-users] clockcheck jumps forwards and backwards Hi, The device is a: 00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03) Using bash-4.3# ethtool -i ma2 driver: igb version: 5.3.0-k firmware-version: 0.0.0 expansion-rom-version: bus-info: 0000:00:14.1 supports-statistics: yes supports-test: yes supports-eeprom-access: yes supports-register-dump: yes supports-priv-flags: no And: # ethtool -T ma2 Time stamping parameters for ma2: Capabilities: hardware-transmit (SOF_TIMESTAMPING_TX_HARDWARE) software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE) hardware-receive (SOF_TIMESTAMPING_RX_HARDWARE) software-receive (SOF_TIMESTAMPING_RX_SOFTWARE) software-system-clock (SOF_TIMESTAMPING_SOFTWARE) hardware-raw-clock (SOF_TIMESTAMPING_RAW_HARDWARE) PTP Hardware Clock: 1 Hardware Transmit Timestamp Modes: off (HWTSTAMP_TX_OFF) on (HWTSTAMP_TX_ON) Hardware Receive Filter Modes: none (HWTSTAMP_FILTER_NONE) all (HWTSTAMP_FILTER_ALL) The config is: [global] slaveOnly 1 summary_interval 6 priority1 255 [ma2] And running as: /usr/sbin/ptp4l -f /etc/ptp4l.conf /usr/sbin/phc2sys -a -r -u 64 -n 5 We are running version 1.8, downloaded from the sourceforge mirror. It's built with openembedede/bitbake and their recipie defines some extra cflags, I can look iwhy these were deemed to be necessary or if they could affect anything: EXTRA_OEMAKE = "'CFLAGS=-D_GNU_SOURCE -DHAVE_CLOCK_ADJTIME -DHAVE_POSIX_SPAWN -DHAVE_ONESTEP_SYNC'" I will look into obtaining more verbose logs. For what it's worth, this exact same setup works elsewhere it is just this one physical setup that exhibits this, although unclear if the cause a physical fault or something about the network/master outside. Additionally, since Ian brought it up a) We do sometimes see tx timestamp timeouts too b) We also occasionally see UNEXPECTED_SYSWRAP messages from igb My understanding is that b) is an intel bug (bad per-device assumptions made in code regarding default state of PPS IRQ) on this HW and seems to be generally treated as benign. I do have a slight suspicion that a and b may be somehow related (backing out of the unexpected wrap IRQ 'forgets' to notice the available tx timestamp being ready?) but I have some digging to to on that front. I currently expect (although happy to be proven wrong) that both a) and b) are unrelated to the clockcheck jumps, since a+b happens readily and doesn't affect sync *too* badly, whereas constant clockcheck aborts happens only in one place and is apparently disastrous to sync quality. Cheers, and thanks for your replies, Dave On Wed, Apr 5, 2017 at 1:45 AM, Ian Thompson <Ian...@pg...<mailto:Ian...@pg...>> wrote: Possibly following on from David’s post. We have a system with 18 boards in a rack, each board has a Altera SoC with the STM Ethernet MAC connected via gigabit Ethernet to an Arista ptp-aware switch and then a Spectracom GrandMaster. The boards are running Linux kernel 3.15.0. They lock quickly after boot and can remain locked for several hours but usually any one of the boards may do the following … Apr 4 13:42:04 localhost user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=> ptp4l: [537.164] rms 123 max 599 freq +255 +/- 39 delay 7362 +/- 48 Apr 4 13:42:29 localhost user.err ptp4l: [561.387] timed out while polling for tx timestamp Apr 4 13:42:29 localhost user.err ptp4l: [561.387] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug Apr 4 13:42:29 localhost user.err ptp4l: [561.387] port 1: send delay request failed Apr 4 13:42:29 localhost user.notice ptp4l: [561.387] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED) Apr 4 13:42:45 localhost user.notice ptp4l: [577.388] port 1: FAULTY to LISTENING on FAULT_CLEARED Apr 4 13:42:45 localhost user.warn ptp4l: [577.414] clockcheck: clock jumped backward or running slower than expected! Apr 4 13:42:45 localhost user.notice ptp4l: [577.414] port 1: new foreign master 000cec.fffe.0a085d-1 Apr 4 13:42:47 localhost user.notice ptp4l: [579.414] selected best master clock 000cec.fffe.0a085d Apr 4 13:42:47 localhost user.notice ptp4l: [579.414] port 1: LISTENING to UNCALIBRATED on RS_SLAVE Apr 4 13:42:54 localhost user.notice ptp4l: [587.164] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED Apr 4 13:46:46 localhost user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=> ptp4l: [818.414] rms 2312500092 max 37000001557 freq +246 +/- 250 delay 7358 +/- 46 Apr 4 13:51:02 localhost user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=> ptp4l: [1074.413] rms 116 max 681 freq +256 +/- 48 delay 7373 +/- 88 Does this imply that one lost delay request can do this, or is there a retry mechanism? Notice that the system recovers but we can’t afford the large timing glitch that gets introduced. We have a lot of traffic leaving the boards but only PTP traffic coming in. As we increase the off board transfer rates the problem seems to occur more often. Thanks for any help, Ian T. |