Re: [Linuxptp-users] [External] Re: clockcheck jumps forwards and backwards

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

All

Here’s a log with phc2sys output. This board ran for 11 hours without an error before this happened …

Apr  5 09:18:21 localhost user.info phc2sys: [38796.187] rms   28 max   65 freq   -242 +/-   1 delay  1099 +/-  11
Apr  5 09:20:21 localhost user.info phc2sys: [38916.211] rms   29 max   73 freq   -242 +/-   3 delay  1100 +/-  12
Apr  5 09:20:40 localhost user.info ptp4l: [38934.698] rms   29 max   76 freq   -242 +/-   6 delay  2029 +/-   4
Apr  5 09:20:52 localhost user.err ptp4l: [38946.961] timed out while polling for tx timestamp
Apr  5 09:20:52 localhost user.err ptp4l: [38946.961] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
Apr  5 09:20:52 localhost user.err ptp4l: [38946.962] Failure in transport_send(to)()
Apr  5 09:20:52 localhost user.err ptp4l: [38946.962] port 1: send delay request failed
Apr  5 09:20:52 localhost user.notice ptp4l: [38946.962] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] port 2cee26.fffe.000189-1 changed state
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] reconfiguring after port state change
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] selecting eth0 for synchronization
Apr  5 09:20:52 localhost user.info phc2sys: [38947.222] nothing to synchronize
Apr  5 09:21:08 localhost user.notice ptp4l: [38962.962] port 1: FAULTY to LISTENING on INIT_COMPLETE
Apr  5 09:21:08 localhost user.warn ptp4l: [38963.198] clockcheck: clock jumped backward or running slower than expected!
Apr  5 09:21:10 localhost user.notice ptp4l: [38964.698] port 1: new foreign master 000cec.fffe.0a0f8d-1
Apr  5 09:21:14 localhost user.notice ptp4l: [38968.698] selected best master clock 000cec.fffe.0a0f8d
Apr  5 09:21:14 localhost user.notice ptp4l: [38968.698] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
Apr  5 09:21:14 localhost user.info phc2sys: [38969.224] port 2cee26.fffe.000189-1 changed state
Apr  5 09:21:14 localhost user.info phc2sys: [38969.224] reconfiguring after port state change
Apr  5 09:21:14 localhost user.info phc2sys: [38969.224] master clock not ready, waiting...
Apr  5 09:21:21 localhost user.notice ptp4l: [38975.698] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Apr  5 09:21:21 localhost user.info phc2sys: [38976.229] port 2cee26.fffe.000189-1 changed state
Apr  5 09:21:21 localhost user.info phc2sys: [38976.230] reconfiguring after port state change
Apr  5 09:21:21 localhost user.info phc2sys: [38976.230] selecting CLOCK_REALTIME for synchronization
Apr  5 09:21:21 localhost user.info phc2sys: [38976.230] selecting eth0 as the master clock
Apr  5 09:22:50 localhost user.info phc2sys: [39065.260] rms   34 max   88 freq   -241 +/-   5 delay  1100 +/-  10
Apr  5 09:24:50 localhost user.info phc2sys: [39185.288] rms   28 max   67 freq   -242 +/-   1 delay  1099 +/-  11
Apr  5 09:25:23 localhost user.info ptp4l: [39217.698] rms 3270369174 max 37000003713 freq   -238 +/-  95 delay  2030 +/-   8
Apr  5 09:26:50 localhost user.info phc2sys: [39305.330] rms   29 max   70 freq   -241 +/-   2 delay  1099 +/-  10
Apr  5 09:28:50 localhost user.info phc2sys: [39425.376] rms   29 max   71 freq   -242 +/-   1 delay  1100 +/-  11
Apr  5 09:29:39 localhost user.info ptp4l: [39473.698] rms   29 max   75 freq   -243 +/-   2 delay  2023 +/-   9

Ian T.

From: David Mirabito [mailto:dav...@gm...]
Sent: Tuesday, April 04, 2017 6:18 PM
To: Ian Thompson
Cc: lin...@li...
Subject: [External] Re: [Linuxptp-users] clockcheck jumps forwards and backwards

Hi,

The device is a:
  00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)

Using
  bash-4.3# ethtool -i ma2
  driver: igb
  version: 5.3.0-k
  firmware-version: 0.0.0
  expansion-rom-version:
  bus-info: 0000:00:14.1
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: no

And:
 # ethtool -T ma2
 Time stamping parameters for ma2:
 Capabilities:
     hardware-transmit     (SOF_TIMESTAMPING_TX_HARDWARE)
     software-transmit     (SOF_TIMESTAMPING_TX_SOFTWARE)
     hardware-receive      (SOF_TIMESTAMPING_RX_HARDWARE)
     software-receive      (SOF_TIMESTAMPING_RX_SOFTWARE)
     software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
     hardware-raw-clock    (SOF_TIMESTAMPING_RAW_HARDWARE)
 PTP Hardware Clock: 1
 Hardware Transmit Timestamp Modes:
     off                   (HWTSTAMP_TX_OFF)
     on                    (HWTSTAMP_TX_ON)
 Hardware Receive Filter Modes:
     none                  (HWTSTAMP_FILTER_NONE)
     all                   (HWTSTAMP_FILTER_ALL)

The config is:

[global]
slaveOnly 1
summary_interval 6
priority1 255

[ma2]

And running as:
   /usr/sbin/ptp4l -f /etc/ptp4l.conf
   /usr/sbin/phc2sys -a -r -u 64 -n 5

We are running version 1.8, downloaded from the sourceforge mirror. It's built with openembedede/bitbake and their recipie defines some extra cflags, I can look iwhy these were deemed to be necessary or if they could affect anything:
EXTRA_OEMAKE = "'CFLAGS=-D_GNU_SOURCE -DHAVE_CLOCK_ADJTIME -DHAVE_POSIX_SPAWN -DHAVE_ONESTEP_SYNC'"

I will look into obtaining more verbose logs.

For what it's worth, this exact same setup works elsewhere it is just this one physical setup that exhibits this, although unclear if the cause a physical fault or something about the network/master outside.

Additionally, since Ian brought it up
   a) We do sometimes see tx timestamp timeouts too
   b) We also occasionally see UNEXPECTED_SYSWRAP messages from igb
My understanding is that b) is an intel bug (bad per-device assumptions made in code regarding default state of PPS IRQ) on this HW and seems to be generally treated as benign. I do have a slight suspicion that a and b may be somehow related (backing out of the unexpected wrap IRQ 'forgets' to notice the available tx timestamp being ready?) but I have some digging to to on that front.

I currently expect (although happy to be proven wrong) that both a) and b) are unrelated to the clockcheck jumps, since a+b happens readily and doesn't affect sync *too* badly, whereas constant clockcheck aborts happens only in one place and is apparently disastrous to sync quality.

Cheers, and thanks for your replies,
Dave

On Wed, Apr 5, 2017 at 1:45 AM, Ian Thompson <Ian...@pg...<mailto:Ian...@pg...>> wrote:
Possibly following on from David’s post.

We have a system with 18 boards in a rack, each board has a Altera SoC with the STM Ethernet MAC connected via gigabit Ethernet to an Arista ptp-aware switch and then a Spectracom GrandMaster.
The boards are running Linux kernel 3.15.0.

They lock quickly after boot and can remain locked for several hours but usually any one of the boards may do the following …

Apr  4 13:42:04 localhost user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=> ptp4l: [537.164] rms  123 max  599 freq   +255 +/-  39 delay  7362 +/-  48
Apr  4 13:42:29 localhost user.err ptp4l: [561.387] timed out while polling for tx timestamp
Apr  4 13:42:29 localhost user.err ptp4l: [561.387] increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
Apr  4 13:42:29 localhost user.err ptp4l: [561.387] port 1: send delay request failed
Apr  4 13:42:29 localhost user.notice ptp4l: [561.387] port 1: SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
Apr  4 13:42:45 localhost user.notice ptp4l: [577.388] port 1: FAULTY to LISTENING on FAULT_CLEARED
Apr  4 13:42:45 localhost user.warn ptp4l: [577.414] clockcheck: clock jumped backward or running slower than expected!
Apr  4 13:42:45 localhost user.notice ptp4l: [577.414] port 1: new foreign master 000cec.fffe.0a085d-1
Apr  4 13:42:47 localhost user.notice ptp4l: [579.414] selected best master clock 000cec.fffe.0a085d
Apr  4 13:42:47 localhost user.notice ptp4l: [579.414] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
Apr  4 13:42:54 localhost user.notice ptp4l: [587.164] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Apr  4 13:46:46 localhost user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=> ptp4l: [818.414] rms 2312500092 max 37000001557 freq   +246 +/- 250 delay  7358 +/-  46
Apr  4 13:51:02 localhost user.info<https://urldefense.proofpoint.com/v2/url?u=http-3A__user.info&d=DwMFaQ&c=KV_I7O14pmwRcmAVyJ1eg4Jwb8Y2JAxuL5YgMGHpjcQ&r=zdHnydvzOnwuGQ--L90nq9WdYaiUdEVnfAroj9WKyYs&m=YrWhtlXVxxCzYLBrGRVRx46qRxB7vLHf6gtzNDah7es&s=fbUeH1NnQuOsYg0ca_GQ4d7QbQdPD0Q3K4UU8ucBTZY&e=> ptp4l: [1074.413] rms  116 max  681 freq   +256 +/-  48 delay  7373 +/-  88

Does this imply that one lost delay request can do this, or is there a retry mechanism?
Notice that the system recovers but we can’t afford the large timing glitch that gets introduced.
We have a lot of traffic leaving the boards but only PTP traffic coming in. As we increase the off board transfer rates the problem seems to occur more often.

Thanks for any help,
Ian T.

Re: [Linuxptp-users] [External] Re: clockcheck jumps forwards and backwards

PTP IEEE 1588 stack for Linux

Re: [Linuxptp-users] [External] Re: clockcheck jumps forwards and backwards