[Linuxptp-users] clockcheck jumps forwards and backwards
From: David M. <dav...@gm...> - 2017-04-04 08:52:29
Hello List,

I understand that PTP in general needs stable/symmetrical round-trip times to be at its best, and that things like non-PTP-aware switches will introduce noise. It makes sense that as the network gets less deterministic and less symmetrical, PTP will necessarily perform worse - probably both in terms of the self-reported rms/max offsets and as measured against some external "correct" reference. Is it safe to assume that, given a crappy PTP network, ptp4l would at worst just degrade to NTP-like performance? Or are there other thresholds or sanity checks which would cause it to throw in the towel in situations where NTP, being designed for such conditions and necessarily more robust, would keep trucking?

I have one machine which is admittedly on a not-ideal PTP network, but with some bizarre logs I'm having difficulty understanding:

Mar 19 09:06:50 user.notice ptp4l: [1337049.193] selected best master clock 94bc40
Mar 19 09:06:50 user.warning ptp4l: [1337049.193] running in a temporal vortex
Mar 19 09:06:50 user.notice ptp4l: [1337049.193] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
Mar 19 09:06:51 user.info phc2sys: [1337049.328] port d24202-1 changed state
Mar 19 09:06:51 user.info phc2sys: [1337049.328] reconfiguring after port state change
Mar 19 09:06:51 user.info phc2sys: [1337049.328] master clock not ready, waiting...
Mar 19 09:06:52 user.warning ptp4l: [1337050.328] clockcheck: clock jumped backward or running slower than expected!
Mar 19 09:06:52 user.warning ptp4l: [1337050.520] clockcheck: clock jumped forward or running faster than expected!
Mar 19 09:06:53 user.notice ptp4l: [1337051.884] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Mar 19 09:06:54 user.info phc2sys: [1337052.328] port d24202-1 changed state
Mar 19 09:06:54 user.info phc2sys: [1337052.328] reconfiguring after port state change
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting CLOCK_REALTIME for synchronization
Mar 19 09:06:54 user.info phc2sys: [1337052.328] selecting eth2 as the master clock
Mar 19 09:06:56 user.warning ptp4l: [1337054.329] clockcheck: clock jumped backward or running slower than expected!
Mar 19 09:06:56 user.notice ptp4l: [1337054.331] port 1: SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT
Mar 19 09:06:56 user.warning ptp4l: [1337054.562] clockcheck: clock jumped forward or running faster than expected!
Mar 19 09:06:57 user.info phc2sys: [1337055.329] port d24202-1 changed state
Mar 19 09:06:57 user.info phc2sys: [1337055.329] reconfiguring after port state change
Mar 19 09:06:57 user.info phc2sys: [1337055.329] master clock not ready, waiting...
Mar 19 09:06:57 user.notice ptp4l: [1337055.887] port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
Mar 19 09:06:58 user.info phc2sys: [1337056.329] port d24202-1 changed state
Mar 19 09:06:58 user.info phc2sys: [1337056.329] reconfiguring after port state change
Mar 19 09:06:58 user.info phc2sys: [1337056.329] selecting CLOCK_REALTIME for synchronization
Mar 19 09:06:58 user.info phc2sys: [1337056.329] selecting eth2 as the master clock

Is this a possible failure mode of a network / master that is just too poor to survive? In particular the clockcheck differences - swinging by +-20% from the system time *within the same second*?
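To check my mental model of what that warning actually compares, the sketch below is roughly how I picture the clockcheck: sample the disciplined clock and a free-running reference, work out the apparent frequency offset over the interval, and complain if it is outside a limit. This is only an illustration, not the real clockcheck.c - I'm guessing that the reference is CLOCK_MONOTONIC_RAW and that the 20% (200000000 ppb) limit corresponds to the sanity_freq_limit default, and I've used CLOCK_REALTIME as the checked clock only so the snippet is self-contained (on this box the checked clock would presumably be the PHC behind eth2):

/* Illustration only, not linuxptp code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

#define LIMIT_PPB 200000000.0  /* 20%, my guess at the sanity_freq_limit default */

static int64_t read_ns(clockid_t id)
{
        struct timespec ts;
        clock_gettime(id, &ts);
        return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        /* Reference and checked clock at the start of the window. */
        int64_t mono0 = read_ns(CLOCK_MONOTONIC_RAW);
        int64_t real0 = read_ns(CLOCK_REALTIME);

        sleep(1);

        int64_t mono1 = read_ns(CLOCK_MONOTONIC_RAW);
        int64_t real1 = read_ns(CLOCK_REALTIME);

        /* Apparent frequency offset of the checked clock vs the reference. */
        double ppb = ((double)(real1 - real0) - (double)(mono1 - mono0))
                        / (double)(mono1 - mono0) * 1e9;

        printf("apparent frequency offset: %.0f ppb\n", ppb);
        if (ppb > LIMIT_PPB)
                printf("running faster than expected\n");
        else if (ppb < -LIMIT_PPB)
                printf("running slower than expected\n");

        return 0;
}

If that picture is even roughly right, seeing the "faster" and "slower" variants fire within the same second is what I'm struggling to explain.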
Given we are also running phc2sys, I am wondering if we've hit some case where the system clock is lagging the PHC and there is some resonance between this lag and the swings of the PHC. The system clock would speed up to try to catch the PHC and be at its fastest just as the PHC has turned around and is at its slowest, so the system clock then slows down. I have an inkling this could be additive over time, the swings getting larger and larger until each is around 10% in adjusted frequency away from "true", which puts them ~20% apart from one another, and clockcheck bites. Is this a reasonable explanation? Alternatively, are there other things which could cause near-simultaneous "faster" and "slower" clockcheck complaints?

The config is pretty vanilla, just UDP in slave-only mode on a single port (a sketch of what I mean is at the bottom of this mail). The network is not ideal for PTP, but I was hoping it would at least truck along, even if not with record-breaking accuracy. Analysing ~5000 status log messages from ptp4l (in between state machine resets) shows:

* freq: mean 42714, stddev 406
* delay: mean 3577, stddev 40
* err_rms: mean 1005, stddev 10058 (NB: mean/stddev of an already-RMS term might not be meaningful, and this is super-spiky, probably due to always hitting fault mode and restarting)

I'm happy to provide more info or run tests if there are any ideas or suggestions.

Thanks,
David
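P.S. The sketch of what I mean by a "vanilla" config - the interface name and option values below are illustrative rather than copied verbatim from the machine:

# ptp4l.conf (sketch)
[global]
slaveOnly           1
network_transport   UDPv4
delay_mechanism     E2E

with the daemons started along the lines of:

ptp4l -f /etc/ptp4l.conf -i eth2
phc2sys -s eth2 -w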