Thread: [Linuxptp-users] Filter occasional spikes in offset
PTP IEEE 1588 stack for Linux
|
From: Oleg O. <leo...@fb...> - 2022-05-03 15:18:25
|
Hi team,

In large distributed networks, many factors can lead to a short-term spike in offset, primarily network equipment without Transparent Clock support (even a single such device). Path delay calculations have a filtering buffer which helps to mitigate synchronous changes in path delay; however, this doesn’t help if, for example, only syncs are affected. We often end up in a situation like this (for demonstration we set delay_filter_length = 1):

Apr 21 09:21:29 ptp4l[1732497.662]: master offset 17 s2 freq -12361 path delay 4070
Apr 21 09:21:30 ptp4l[1732498.662]: master offset 0 s2 freq -12373 path delay 4074
Apr 21 09:21:31 ptp4l[1732499.662]: master offset 37 s2 freq -12336 path delay 4067
Apr 21 09:21:32 ptp4l[1732500.662]: master offset 3 s2 freq -12359 path delay 4067
Apr 21 09:21:33 ptp4l[1732501.662]: master offset -122 s2 freq -12483 path delay 4193
Apr 21 09:21:34 ptp4l[1732502.662]: master offset 119 s2 freq -12279 path delay 4068
Apr 21 09:21:35 ptp4l[1732503.662]: master offset -25 s2 freq -12387 path delay 4110
Apr 21 09:21:36 ptp4l[1732504.662]: master offset 57 s2 freq -12313 path delay 4063
Apr 21 09:21:37 ptp4l[1732505.662]: master offset -18 s2 freq -12371 path delay 4063
Apr 21 09:21:38 ptp4l[1732506.662]: master offset 13 s2 freq -12345 path delay 4068
Apr 21 09:21:39 ptp4l[1732507.662]: master offset -76 s2 freq -12430 path delay 4107
Apr 21 09:21:40 ptp4l[1732508.662]: master offset -24 s2 freq -12401 path delay 4107
Apr 21 09:21:41 ptp4l[1732509.662]: master offset 279231 s2 freq +266847 path delay 4070
Apr 21 09:21:42 ptp4l[1732510.662]: master offset -454738 s2 freq -383353 path delay 179782
Apr 21 09:21:43 ptp4l[1732511.662]: master offset 258063 s2 freq +193027 path delay -162110
Apr 21 09:21:44 ptp4l[1732512.662]: master offset 52769 s2 freq +65152 path delay -162110
Apr 21 09:21:45 ptp4l[1732513.662]: master offset -221568 s2 freq -193355 path delay 34721
Apr 21 09:21:46 ptp4l[1732514.662]: master offset 19170 s2 freq -19087 path delay -25061
Apr 21 09:21:47 ptp4l[1732515.662]: master offset 25906 s2 freq -6600 path delay -25061
Apr 21 09:21:48 ptp4l[1732516.662]: master offset -10978 s2 freq -35712 path delay 6064
Apr 21 09:21:49 ptp4l[1732517.662]: master offset 12336 s2 freq -15692 path delay 6064
Apr 21 09:21:50 ptp4l[1732518.662]: master offset 18310 s2 freq -6017 path delay 3439
Apr 21 09:21:51 ptp4l[1732519.662]: master offset 11139 s2 freq -7695 path delay 4247
Apr 21 09:21:52 ptp4l[1732520.662]: master offset 5108 s2 freq -10384 path delay 5614
Apr 21 09:21:53 ptp4l[1732521.662]: master offset 3093 s2 freq -10867 path delay 5614
Apr 21 09:21:54 ptp4l[1732522.662]: master offset 2945 s2 freq -10087 path delay 4281
Apr 21 09:21:55 ptp4l[1732523.662]: master offset 205 s2 freq -11943 path delay 4700
Apr 21 09:21:56 ptp4l[1732524.662]: master offset -212 s2 freq -12299 path delay 4700
Apr 21 09:21:57 ptp4l[1732525.662]: master offset 325 s2 freq -11825 path delay 4079
Apr 21 09:21:58 ptp4l[1732526.662]: master offset -414 s2 freq -12467 path delay 4287
Apr 21 09:21:59 ptp4l[1732527.662]: master offset -142 s2 freq -12319 path delay 4098
Apr 21 09:22:00 ptp4l[1732528.662]: master offset -236 s2 freq -12456 path delay 4171
Apr 21 09:22:01 ptp4l[1732529.662]: master offset -182 s2 freq -12473 path delay 4171
Apr 21 09:22:02 ptp4l[1732530.662]: master offset 83 s2 freq -12262 path delay 4028
Apr 21 09:22:03 ptp4l[1732531.662]: master offset -113 s2 freq -12433 path delay 4126
Apr 21 09:22:04 ptp4l[1732532.662]: master offset 11 s2 freq -12343 path delay 4057
Apr 21 09:22:05 ptp4l[1732533.662]: master offset -94 s2 freq -12445 path delay 4125
Apr 21 09:22:06 ptp4l[1732534.662]: master offset 73 s2 freq -12306 path delay 4049
Apr 21 09:22:07 ptp4l[1732535.662]: master offset -23 s2 freq -12380 path delay 4077

As you can see, the “regular” path delay is around 4100 ns with an offset within ±200 ns. Then the offset suddenly jumps to 279231 for a very short time (in fact for a single sample) and everything goes crazy: the servo reacts to the spike, and it takes roughly 15 seconds for the offset to settle again.

The issue is further complicated because delay_req/resp may not be affected when syncs are (different queues, fabric paths, etc.). So with delay_filter_length set to 10 (the default) there may be short-term asymmetry for literally 1 packet.

Looking at the ptp4l config I didn’t find anything to overcome this situation and ignore this 1 bad outlier. I implemented a quick patch https://gist.github.com/leoleovich/5a4dff7e089bd429c5d208d9276e1683 which can mitigate this, and it works very well:

May 2 14:34:26 ptp4l[2772335.049]: master offset -9 s2 freq -10406 path delay 3957
May 2 14:34:27 ptp4l[2772336.049]: master offset 0 s2 freq -10399 path delay 3957
May 2 14:34:28 ptp4l[2772337.049]: master offset -7 s2 freq -10406 path delay 3957
May 2 14:34:30 ptp4l[2772338.805]: master offset 7 s2 freq -10395 path delay 3957
May 2 14:34:30 ptp4l[2772339.049]: master offset -6 s2 freq -10405 path delay 3957
May 2 14:34:31 ptp4l[2772340.049]: master offset -16 s2 freq -10417 path delay 3957
May 2 14:34:32 ptp4l[2772341.049]: skip 1/2 large offset (>20000) 486196
May 2 14:34:33 ptp4l[2772342.049]: master offset 26 s2 freq -10380 path delay 3956
May 2 14:34:34 ptp4l[2772343.049]: master offset 20 s2 freq -10378 path delay 3956
May 2 14:34:35 ptp4l[2772344.049]: master offset 14 s2 freq -10378 path delay 3956
May 2 14:34:36 ptp4l[2772345.049]: master offset -21 s2 freq -10409 path delay 3956
May 2 14:34:37 ptp4l[2772346.049]: master offset 3 s2 freq -10391 path delay 3955

It prevents unnecessary tuning of the servo for a short period of time by using a padding technique (simply filling with previous values).

The bottom line is - we need to find a way to ignore outliers in a locked state, where short-term large jumps in offset are not expected. Please check this out and let me know if there is a better way to handle this situation or if this patch can inspire any other ideas…

Thank you in advance,
Oleg.
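A minimal sketch of the skip logic described in the message above, written in Python for illustration. It is reconstructed only from the log line "skip 1/2 large offset (>20000) 486196" and the option names max_offset_locked and max_offset_locked_skip mentioned in this thread; it is not the code from the gist. The class name, the `accept` method and the reading of 0 as "filter disabled" are assumptions.

```python
class LockedOffsetFilter:
    """Skip up to `max_offset_locked_skip` consecutive large offsets while locked."""

    def __init__(self, max_offset_locked=20000, max_offset_locked_skip=2):
        self.max_offset = max_offset_locked      # ns; 0 is assumed to disable the filter
        self.max_skip = max_offset_locked_skip   # consecutive samples that may be skipped
        self.skipped = 0                         # samples skipped in the current run

    def accept(self, offset_ns, locked=True):
        """Return True if the sample should be passed on to the servo."""
        enabled = self.max_offset > 0 and self.max_skip > 0
        too_large = abs(offset_ns) > self.max_offset
        if enabled and locked and too_large and self.skipped < self.max_skip:
            self.skipped += 1
            print(f"skip {self.skipped}/{self.max_skip} large offset "
                  f"(>{self.max_offset}) {offset_ns}")
            return False
        # Either a normal sample, or the skip budget is used up: let it through.
        self.skipped = 0
        return True


# A single spike is dropped and the next sample is used as normal.
f = LockedOffsetFilter()
single_spike = [f.accept(o) for o in (-16, 486196, 26)]

# A persistent excursion is only ignored for two samples, then passed on.
g = LockedOffsetFilter()
long_excursion = [g.accept(o) for o in (486196, 486196, 486196)]
```

The skip budget is what keeps a genuine, lasting offset change from being ignored forever; with the values above the servo sees it on the third sample.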
|
From: Miroslav L. <mli...@re...> - 2022-05-04 10:34:10
|
On Tue, May 03, 2022 at 02:26:21PM +0000, Oleg Obleukhov via Linuxptp-users wrote:
> Hi team,
> In large distributed networks very many factors can lead to a short term spike in offset. Primarily network equipment without Transparent Clock support (even on a single device).

PTP was designed for networks with constant delay. On switched networks that requires full on-path PTP support. If you don't have that, you should be looking at NTP or another protocol designed for networks with variable delays, where more effective filtering can be implemented.

Of course, that doesn't mean linuxptp couldn't try to do better in these suboptimal conditions. The question is if it's in the scope of the project. As you seem to have found out, the main issue with the current design is that dropping samples can lead to servo instability.

> Looking at ptp4l config I didn’t find anything to overcome this situation and ignore this 1 bad outlier.
> I implemented a quick patch https://gist.github.com/leoleovich/5a4dff7e089bd429c5d208d9276e1683 which can mitigate this and it works very well:
> Preventing unnecessary tuning of the servo for a short period of time by using a padding technique (simply filling with previous values).

That patch seems to be dropping the sample and there is a different output shown in the example. Is there a newer version of the patch you didn't publish?

> The bottom line is - we need to find a way to ignore outliers in a locked state where it’s not expected to have short term large jumps in offset.
> Please check this out and let me know if there is a better way to handle this situation or if this patch can inspire any other ideas…

If a spike filter needs to be implemented, I think it would be better if the threshold was automatically adjusted based on the jitter. For an example, see the "Popcorn spike suppressor" in RFC5905 (NTPv4).

--
Miroslav Lichvar
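To make the jitter-based threshold idea concrete, here is a rough Python sketch loosely modelled on the popcorn spike suppressor in RFC 5905, which compares the change between the last and the current offset against a multiple (SGATE, 3 in the RFC's reference code) of the measured jitter. Everything else here is an assumption for illustration: the exponentially averaged jitter estimate, the 100 ns jitter floor, the averaging constant, the limit of one consecutive rejection, and all names. It is not linuxptp code.

```python
import math


class PopcornSuppressor:
    """Reject a sample whose jump from the last accepted offset exceeds gate * jitter."""

    def __init__(self, gate=3.0, min_jitter_ns=100.0, avg=8, max_rejects=1):
        self.gate = gate                  # SGATE-like multiplier
        self.min_jitter = min_jitter_ns   # assumed floor so early samples are not rejected
        self.avg = avg                    # assumed averaging constant for the jitter estimate
        self.max_rejects = max_rejects    # consecutive rejections allowed
        self.last = None                  # last accepted offset
        self.jitter_sq = 0.0              # running mean of squared successive differences
        self.rejects = 0

    def accept(self, offset_ns):
        if self.last is None:
            self.last = offset_ns
            return True
        diff = offset_ns - self.last
        jitter = max(math.sqrt(self.jitter_sq), self.min_jitter)
        if abs(diff) > self.gate * jitter and self.rejects < self.max_rejects:
            self.rejects += 1             # treat as a spike; leave the estimate untouched
            return False
        self.rejects = 0
        self.jitter_sq += (diff * diff - self.jitter_sq) / self.avg
        self.last = offset_ns
        return True


# Offsets from the first log in this thread up to the spike, followed by one
# hypothetical "back to normal" sample (5) that is not in the original log.
offsets = [17, 0, 37, 3, -122, 119, -25, 57, -18, 13, -76, -24, 279231, 5]
s = PopcornSuppressor()
decisions = [s.accept(o) for o in offsets]
```

With these assumed constants the ordinary samples pass, the 279231 sample is rejected, and the following sample is accepted again. Because the threshold follows the observed jitter, no fixed limit such as 20000 has to be chosen per deployment.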
|
From: Oleg O. <leo...@fb...> - 2022-05-04 11:39:40
|
> On 4 May 2022, at 11:33, Miroslav Lichvar <mli...@re...> wrote:

Hi Miroslav,
Thank you for your response. I appreciate it.

> On Tue, May 03, 2022 at 02:26:21PM +0000, Oleg Obleukhov via Linuxptp-users wrote:
>> Hi team,
>> In large distributed networks very many factors can lead to a short term spike in offset. Primarily network equipment without Transparent Clock support (even on a single device).
>
> PTP was designed for networks with constant delay. On switched
> networks that requires full on-path PTP support. If you don't have
> that, you should be looking at NTP or another protocol designed for
> networks with variable delays, where more effective filtering can be
> implemented.

While we are phasing out old equipment, the reality is that there will always be some percentage of misbehaving/old switches in large distributed systems with thousands of switches on the path. During congestion which only lasts several microseconds we may be affected, and we need to survive it.

> Of course, that doesn't mean linuxptp couldn't try to do better in
> these suboptimal conditions. The question is if it's in the scope of
> the project. As you seem to have found out, the main issue with the
> current design is that dropping samples can lead to servo instability.
>
>> Looking at ptp4l config I didn’t find anything to overcome this situation and ignore this 1 bad outlier.
>> I implemented a quick patch https://gist.github.com/leoleovich/5a4dff7e089bd429c5d208d9276e1683 which can mitigate this and it works very well:
>
>> Preventing unnecessary tuning of the servo for a short period of time by using a padding technique (simply filling with previous values).

The patch I proposed simply doesn’t pass the offset to the servo - so it shouldn’t be too bad. For example, with default ptp4l settings we can tolerate several missed syncs in a row. But I am open to suggestions, of course.

> That patch seems to be dropping the sample and there is a different
> output shown in the example. Is there a newer version of the patch you
> didn't publish?

The code I suggested matches the output. When occasional spikes arise, it simply prints something like:
skip 1/2 large offset (>20000) -248483
The only difference is that max_offset_locked and max_offset_locked_skip should be set to 0, and currently they are at 20000 and 2 respectively.

>> The bottom line is - we need to find a way to ignore outliers in a locked state where it’s not expected to have short term large jumps in offset.
>> Please check this out and let me know if there is a better way to handle this situation or if this patch can inspire any other ideas…
>
> If a spike filter needs to be implemented, I think it would better if
> the threshold was automatically adjusted based on the jitter. For an
> example, see the "Popcorn spike suppressor" in RFC5905 (NTPv4).

An automatically adjusted filter would be even better. If you are open to such an idea, we can discuss this as well. I wanted to start somewhere.

> --
> Miroslav Lichvar

Thank you,
Oleg.
|
From: Miroslav L. <mli...@re...> - 2022-05-10 14:30:02
|
This should be discussed at linuxptp-devel.

On Wed, May 04, 2022 at 11:39:19AM +0000, Oleg Obleukhov wrote:
> >> I implemented a quick patch https://gist.github.com/leoleovich/5a4dff7e089bd429c5d208d9276e1683 which can mitigate this and it works very well:
> >
> >> Preventing unnecessary tuning of the servo for a short period of time by using a padding technique (simply filling with previous values).
> The patch I proposed simply doesn’t pass the offset to a servo - so it shouldn’t be too bad. For example with default ptp4l settings we can tolerate several missed syncs in a row. But I am open for suggestions of course.

I think the default configuration of the PI servo can tolerate only one missed update at a time. If you repeatedly drop two samples and pass one, it will be unstable and oscillations will grow until it is updated at a higher rate. However unlikely that is to happen in the real world, your patch doesn't seem to prevent it.

> > That patch seems to be dropping the sample and there is a different
> > output shown in the example. Is there a newer version of the patch you
> > didn't publish?
> The code I suggested matches the output. It simply prints something like:
> skip 1/2 large offset (>20000) -248483
> When occasional spikes arise. The only difference is max_offset_locked and max_offset_locked_skip should be set to 0 and currently they are at 20000 and 2 respectively.

The example output posted here didn't have the SYNCHRONIZATION_FAULT messages, so I assumed you were doing something with the servo.

--
Miroslav Lichvar
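The stability concern can be illustrated with a toy simulation in Python. It is a sketch under stated assumptions, not a model of ptp4l itself: the gains kp = 0.7 and ki = 0.3 are assumed to correspond to the PI servo's hardware-timestamping defaults at a 1 s sync interval, the clock is assumed to run at the last commanded frequency between updates, and measurement noise and delay are ignored. The function name and parameters are hypothetical.

```python
def simulate(update_every, steps=60, kp=0.7, ki=0.3, initial_offset=100.0):
    """Return |offset| in ns after `steps` seconds when the servo only
    sees every `update_every`-th sample (the others are dropped)."""
    offset = initial_offset   # ns
    drift = 0.0               # integrator state, ppb
    adj = 0.0                 # frequency correction currently applied, ppb
    for k in range(steps):
        if k % update_every == 0:
            # PI update with gains tuned for a 1 s interval.
            drift += ki * offset
            adj = kp * offset + drift
        # One second at `adj` ppb moves the offset by `adj` ns.
        offset -= adj
    return abs(offset)


# 1: every sample used, 2: one dropped each time, 3: two dropped each time.
results = {n: simulate(n) for n in (1, 2, 3)}
for n, final in results.items():
    print(f"update every {n} s: |offset| after 60 s = {final:.3g} ns")
```

In this simplified model the loop still converges when every second sample reaches the servo, but the offset grows without bound when only every third sample does, which is in line with the statement above that the default PI servo tolerates one missed update but not two in a row.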