Thread: [Linuxptp-devel] [PATCH 0/3] Fixes for sanity clock check
PTP IEEE 1588 stack for Linux
Brought to you by:
rcochran
From: Miroslav L. <mli...@re...> - 2021-05-25 12:28:31
|
These patches should make the clock check more reliable, e.g. when the clock is controlled externally by another process, or when there are multiple clocks in the jbod mode as was discussed recently on this list. The first patch is not strictly related to the clock check, e.g. it also fixes the path delay after switching the slave port. Miroslav Lichvar (3): clock: Reset state when switching port with same best clock. clock: Reset clock check on port state change. port: Don't check timestamps from non-slave ports. clock.c | 3 ++- clockcheck.c | 9 ++++++++- clockcheck.h | 6 ++++++ port.c | 5 ++++- 4 files changed, 20 insertions(+), 3 deletions(-) -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-25 12:28:32
|
When the best port is changed, but the ID of the best clock doesn't change (e.g. a passive port is activated on link failure), reset the current delay and other master/link-specific state to avoid the switch throwing the clock off. Signed-off-by: Miroslav Lichvar <mli...@re...> --- clock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/clock.c b/clock.c index e545a9b..a073575 100644 --- a/clock.c +++ b/clock.c @@ -1947,7 +1947,7 @@ static void handle_state_decision_event(struct clock *c) cid2str(&best_id)); } - if (!cid_eq(&best_id, &c->best_id)) { + if (!cid_eq(&best_id, &c->best_id) || best != c->best) { clock_freq_est_reset(c); tsproc_reset(c->tsproc, 1); if (!tmv_is_zero(c->initial_delay)) -- 2.26.3 |
From: Keller, J. E <jac...@in...> - 2021-05-25 20:34:21
|
> -----Original Message----- > From: Miroslav Lichvar <mli...@re...> > Sent: Tuesday, May 25, 2021 5:28 AM > To: lin...@li... > Subject: [Linuxptp-devel] [PATCH 1/3] clock: Reset state when switching port with > same best clock. > > When the best port is changed, but the ID of the best clock doesn't > change (e.g. a passive port is activated on link failure), reset the > current delay and other master/link-specific state to avoid the switch > throwing the clock off. > Right. Since we are on a new port, we are going to have different characteristics, even if the clock is still the same. Makes sense. Reviewed-by: Jacob Keller <jac...@in...> > Signed-off-by: Miroslav Lichvar <mli...@re...> > --- > clock.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/clock.c b/clock.c > index e545a9b..a073575 100644 > --- a/clock.c > +++ b/clock.c > @@ -1947,7 +1947,7 @@ static void handle_state_decision_event(struct clock > *c) > cid2str(&best_id)); > } > > - if (!cid_eq(&best_id, &c->best_id)) { > + if (!cid_eq(&best_id, &c->best_id) || best != c->best) { > clock_freq_est_reset(c); > tsproc_reset(c->tsproc, 1); > if (!tmv_is_zero(c->initial_delay)) > -- > 2.26.3 > > > > _______________________________________________ > Linuxptp-devel mailing list > Lin...@li... > https://lists.sourceforge.net/lists/listinfo/linuxptp-devel |
From: Miroslav L. <mli...@re...> - 2021-05-25 12:28:35
|
Don't perform the sanity check on receive timestamps from ports in non-slave states to avoid false positives in the jbod mode, where the timestamps can be generated by different clocks. Reported-by: Amar Subramanyam <asu...@al...> Signed-off-by: Miroslav Lichvar <mli...@re...> --- port.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/port.c b/port.c index 10bb9e1..fb420fb 100644 --- a/port.c +++ b/port.c @@ -2744,7 +2744,10 @@ static enum fsm_event bc_event(struct port *p, int fd_index) } if (msg_sots_valid(msg)) { ts_add(&msg->hwts.ts, -p->rx_timestamp_offset); - clock_check_ts(p->clock, tmv_to_nanoseconds(msg->hwts.ts)); + if (p->state == PS_SLAVE) { + clock_check_ts(p->clock, + tmv_to_nanoseconds(msg->hwts.ts)); + } } switch (msg_type(msg)) { -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-25 12:28:37
|
Reset the clock check to avoid false positives when switching between slave and non-slave state and the clock is controlled by an external process (e.g. phc2sys -rr). Signed-off-by: Miroslav Lichvar <mli...@re...> --- clock.c | 1 + clockcheck.c | 9 ++++++++- clockcheck.h | 6 ++++++ 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/clock.c b/clock.c index a073575..69f2b66 100644 --- a/clock.c +++ b/clock.c @@ -1949,6 +1949,7 @@ static void handle_state_decision_event(struct clock *c) if (!cid_eq(&best_id, &c->best_id) || best != c->best) { clock_freq_est_reset(c); + clockcheck_reset(c->sanity_check); tsproc_reset(c->tsproc, 1); if (!tmv_is_zero(c->initial_delay)) tsproc_set_delay(c->tsproc, c->initial_delay); diff --git a/clockcheck.c b/clockcheck.c index d48a578..d0b4714 100644 --- a/clockcheck.c +++ b/clockcheck.c @@ -47,9 +47,16 @@ struct clockcheck *clockcheck_create(int freq_limit) if (!cc) return NULL; cc->freq_limit = freq_limit; + clockcheck_reset(cc); + return cc; +} + +void clockcheck_reset(struct clockcheck *cc) +{ + cc->freq_known = 0; cc->max_freq = -CHECK_MAX_FREQ; cc->min_freq = CHECK_MAX_FREQ; - return cc; + cc->last_ts = 0; } int clockcheck_sample(struct clockcheck *cc, uint64_t ts) diff --git a/clockcheck.h b/clockcheck.h index 78aca48..1ff86eb 100644 --- a/clockcheck.h +++ b/clockcheck.h @@ -33,6 +33,12 @@ struct clockcheck; */ struct clockcheck *clockcheck_create(int freq_limit); +/** + * Reset a clock check. + * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). + */ +void clockcheck_reset(struct clockcheck *cc); + /** * Perform the sanity check on a time stamp. * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). -- 2.26.3 |
From: Keller, J. E <jac...@in...> - 2021-05-25 20:35:50
|
> -----Original Message----- > From: Miroslav Lichvar <mli...@re...> > Sent: Tuesday, May 25, 2021 5:28 AM > To: lin...@li... > Subject: [Linuxptp-devel] [PATCH 2/3] clock: Reset clock check on port state > change. > > Reset the clock check to avoid false positives when switching between > slave and non-slave state and the clock is controlled by an external > process (e.g. phc2sys -rr). > The way this is worded, I expected clockcheck_reset to only be called if the clock is controlled by an external process, but we seem to call this every time. > Signed-off-by: Miroslav Lichvar <mli...@re...> > --- > clock.c | 1 + > clockcheck.c | 9 ++++++++- > clockcheck.h | 6 ++++++ > 3 files changed, 15 insertions(+), 1 deletion(-) > > diff --git a/clock.c b/clock.c > index a073575..69f2b66 100644 > --- a/clock.c > +++ b/clock.c > @@ -1949,6 +1949,7 @@ static void handle_state_decision_event(struct clock > *c) > > if (!cid_eq(&best_id, &c->best_id) || best != c->best) { > clock_freq_est_reset(c); > + clockcheck_reset(c->sanity_check); > tsproc_reset(c->tsproc, 1); > if (!tmv_is_zero(c->initial_delay)) > tsproc_set_delay(c->tsproc, c->initial_delay); > diff --git a/clockcheck.c b/clockcheck.c > index d48a578..d0b4714 100644 > --- a/clockcheck.c > +++ b/clockcheck.c > @@ -47,9 +47,16 @@ struct clockcheck *clockcheck_create(int freq_limit) > if (!cc) > return NULL; > cc->freq_limit = freq_limit; > + clockcheck_reset(cc); > + return cc; > +} > + > +void clockcheck_reset(struct clockcheck *cc) > +{ > + cc->freq_known = 0; > cc->max_freq = -CHECK_MAX_FREQ; > cc->min_freq = CHECK_MAX_FREQ; > - return cc; > + cc->last_ts = 0; > } > > int clockcheck_sample(struct clockcheck *cc, uint64_t ts) > diff --git a/clockcheck.h b/clockcheck.h > index 78aca48..1ff86eb 100644 > --- a/clockcheck.h > +++ b/clockcheck.h > @@ -33,6 +33,12 @@ struct clockcheck; > */ > struct clockcheck *clockcheck_create(int freq_limit); > > +/** > + * Reset a clock check. > + * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). > + */ > +void clockcheck_reset(struct clockcheck *cc); > + > /** > * Perform the sanity check on a time stamp. > * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). > -- > 2.26.3 > > > > _______________________________________________ > Linuxptp-devel mailing list > Lin...@li... > https://lists.sourceforge.net/lists/listinfo/linuxptp-devel |
From: Miroslav L. <mli...@re...> - 2021-05-26 07:46:12
|
On Tue, May 25, 2021 at 08:35:35PM +0000, Keller, Jacob E wrote: > > Reset the clock check to avoid false positives when switching between > > slave and non-slave state and the clock is controlled by an external > > process (e.g. phc2sys -rr). > > > > > The way this is worded, I expected clockcheck_reset to only be called if the clock is controlled by an external process, but we seem to call this every time. Detecting that would be difficult and unreliable. I'll reword the message to me it more clear. > > if (!cid_eq(&best_id, &c->best_id) || best != c->best) { > > clock_freq_est_reset(c); > > + clockcheck_reset(c->sanity_check); And here is a missing check for NULL in case it is disabled. Thanks, -- Miroslav Lichvar |
From: Miroslav L. <mli...@re...> - 2021-05-26 08:23:52
|
v2 - improved commit message - added missing NULL check These patches should make the clock check more reliable, e.g. when the clock is controlled externally by another process, or when there are multiple clocks in the jbod mode as was discussed recently on this list. The first patch is not strictly related to the clock check, e.g. it also fixes the path delay after switching the slave port. Miroslav Lichvar (3): clock: Reset state when switching port with same best clock. clock: Reset clock check on best clock/port change. port: Don't check timestamps from non-slave ports. clock.c | 4 +++- clockcheck.c | 9 ++++++++- clockcheck.h | 6 ++++++ port.c | 5 ++++- 4 files changed, 21 insertions(+), 3 deletions(-) -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-26 08:23:49
|
When the best port is changed, but the ID of the best clock doesn't change (e.g. a passive port is activated on link failure), reset the current delay and other master/link-specific state to avoid the switch throwing the clock off. Reviewed-by: Jacob Keller <jac...@in...> Signed-off-by: Miroslav Lichvar <mli...@re...> --- clock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/clock.c b/clock.c index e545a9b..a073575 100644 --- a/clock.c +++ b/clock.c @@ -1947,7 +1947,7 @@ static void handle_state_decision_event(struct clock *c) cid2str(&best_id)); } - if (!cid_eq(&best_id, &c->best_id)) { + if (!cid_eq(&best_id, &c->best_id) || best != c->best) { clock_freq_est_reset(c); tsproc_reset(c->tsproc, 1); if (!tmv_is_zero(c->initial_delay)) -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-26 08:23:50
|
Don't perform the sanity check on receive timestamps from ports in non-slave states to avoid false positives in the jbod mode, where the timestamps can be generated by different clocks. Reported-by: Amar Subramanyam <asu...@al...> Signed-off-by: Miroslav Lichvar <mli...@re...> --- port.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/port.c b/port.c index 10bb9e1..fb420fb 100644 --- a/port.c +++ b/port.c @@ -2744,7 +2744,10 @@ static enum fsm_event bc_event(struct port *p, int fd_index) } if (msg_sots_valid(msg)) { ts_add(&msg->hwts.ts, -p->rx_timestamp_offset); - clock_check_ts(p->clock, tmv_to_nanoseconds(msg->hwts.ts)); + if (p->state == PS_SLAVE) { + clock_check_ts(p->clock, + tmv_to_nanoseconds(msg->hwts.ts)); + } } switch (msg_type(msg)) { -- 2.26.3 |
From: Keller, J. E <jac...@in...> - 2021-05-26 08:53:40
|
> -----Original Message----- > From: Miroslav Lichvar <mli...@re...> > Sent: Wednesday, May 26, 2021 1:24 AM > To: lin...@li... > Subject: [Linuxptp-devel] [PATCH v2 3/3] port: Don't check timestamps from non- > slave ports. > > Don't perform the sanity check on receive timestamps from ports in > non-slave states to avoid false positives in the jbod mode, where > the timestamps can be generated by different clocks. > Makes sense. Reviewed-by: Jacob Keller <jac...@in...> > Reported-by: Amar Subramanyam <asu...@al...> > Signed-off-by: Miroslav Lichvar <mli...@re...> > --- > port.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/port.c b/port.c > index 10bb9e1..fb420fb 100644 > --- a/port.c > +++ b/port.c > @@ -2744,7 +2744,10 @@ static enum fsm_event bc_event(struct port *p, int > fd_index) > } > if (msg_sots_valid(msg)) { > ts_add(&msg->hwts.ts, -p->rx_timestamp_offset); > - clock_check_ts(p->clock, tmv_to_nanoseconds(msg->hwts.ts)); > + if (p->state == PS_SLAVE) { > + clock_check_ts(p->clock, > + tmv_to_nanoseconds(msg->hwts.ts)); > + } > } > > switch (msg_type(msg)) { > -- > 2.26.3 > > > > _______________________________________________ > Linuxptp-devel mailing list > Lin...@li... > https://lists.sourceforge.net/lists/listinfo/linuxptp-devel |
From: Miroslav L. <mli...@re...> - 2021-05-26 08:23:54
|
Reset the clock check when the best clock or port changes, together with the other state like current estimated delay and frequency. This avoids false positives if the clock is controlled by an external process when not synchronized by PTP (e.g. phc2sys -rr). Signed-off-by: Miroslav Lichvar <mli...@re...> --- clock.c | 2 ++ clockcheck.c | 9 ++++++++- clockcheck.h | 6 ++++++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/clock.c b/clock.c index a073575..2dd7ef9 100644 --- a/clock.c +++ b/clock.c @@ -1949,6 +1949,8 @@ static void handle_state_decision_event(struct clock *c) if (!cid_eq(&best_id, &c->best_id) || best != c->best) { clock_freq_est_reset(c); + if (c->sanity_check) + clockcheck_reset(c->sanity_check); tsproc_reset(c->tsproc, 1); if (!tmv_is_zero(c->initial_delay)) tsproc_set_delay(c->tsproc, c->initial_delay); diff --git a/clockcheck.c b/clockcheck.c index d48a578..d0b4714 100644 --- a/clockcheck.c +++ b/clockcheck.c @@ -47,9 +47,16 @@ struct clockcheck *clockcheck_create(int freq_limit) if (!cc) return NULL; cc->freq_limit = freq_limit; + clockcheck_reset(cc); + return cc; +} + +void clockcheck_reset(struct clockcheck *cc) +{ + cc->freq_known = 0; cc->max_freq = -CHECK_MAX_FREQ; cc->min_freq = CHECK_MAX_FREQ; - return cc; + cc->last_ts = 0; } int clockcheck_sample(struct clockcheck *cc, uint64_t ts) diff --git a/clockcheck.h b/clockcheck.h index 78aca48..1ff86eb 100644 --- a/clockcheck.h +++ b/clockcheck.h @@ -33,6 +33,12 @@ struct clockcheck; */ struct clockcheck *clockcheck_create(int freq_limit); +/** + * Reset a clock check. + * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). + */ +void clockcheck_reset(struct clockcheck *cc); + /** * Perform the sanity check on a time stamp. * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). -- 2.26.3 |
From: Keller, J. E <jac...@in...> - 2021-05-26 08:53:20
|
> -----Original Message----- > From: Miroslav Lichvar <mli...@re...> > Sent: Wednesday, May 26, 2021 1:24 AM > To: lin...@li... > Subject: [Linuxptp-devel] [PATCH v2 2/3] clock: Reset clock check on best > clock/port change. > > Reset the clock check when the best clock or port changes, together with > the other state like current estimated delay and frequency. This avoids > false positives if the clock is controlled by an external process when > not synchronized by PTP (e.g. phc2sys -rr). > Thanks! The intent is more clear now! Reviewed-by: Jacob Keller <jac...@in...> > Signed-off-by: Miroslav Lichvar <mli...@re...> > --- > clock.c | 2 ++ > clockcheck.c | 9 ++++++++- > clockcheck.h | 6 ++++++ > 3 files changed, 16 insertions(+), 1 deletion(-) > > diff --git a/clock.c b/clock.c > index a073575..2dd7ef9 100644 > --- a/clock.c > +++ b/clock.c > @@ -1949,6 +1949,8 @@ static void handle_state_decision_event(struct clock > *c) > > if (!cid_eq(&best_id, &c->best_id) || best != c->best) { > clock_freq_est_reset(c); > + if (c->sanity_check) > + clockcheck_reset(c->sanity_check); > tsproc_reset(c->tsproc, 1); > if (!tmv_is_zero(c->initial_delay)) > tsproc_set_delay(c->tsproc, c->initial_delay); > diff --git a/clockcheck.c b/clockcheck.c > index d48a578..d0b4714 100644 > --- a/clockcheck.c > +++ b/clockcheck.c > @@ -47,9 +47,16 @@ struct clockcheck *clockcheck_create(int freq_limit) > if (!cc) > return NULL; > cc->freq_limit = freq_limit; > + clockcheck_reset(cc); > + return cc; > +} > + > +void clockcheck_reset(struct clockcheck *cc) > +{ > + cc->freq_known = 0; > cc->max_freq = -CHECK_MAX_FREQ; > cc->min_freq = CHECK_MAX_FREQ; > - return cc; > + cc->last_ts = 0; > } > > int clockcheck_sample(struct clockcheck *cc, uint64_t ts) > diff --git a/clockcheck.h b/clockcheck.h > index 78aca48..1ff86eb 100644 > --- a/clockcheck.h > +++ b/clockcheck.h > @@ -33,6 +33,12 @@ struct clockcheck; > */ > struct clockcheck *clockcheck_create(int freq_limit); > > +/** > + * Reset a clock check. > + * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). > + */ > +void clockcheck_reset(struct clockcheck *cc); > + > /** > * Perform the sanity check on a time stamp. > * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). > -- > 2.26.3 > > > > _______________________________________________ > Linuxptp-devel mailing list > Lin...@li... > https://lists.sourceforge.net/lists/listinfo/linuxptp-devel |
From: Amar S. <asu...@al...> - 2021-05-26 14:05:35
|
Hi Miroslav, I am afraid, the patches that was intended to fix the reported issue doesn't do so. The issue that we reported was not due to timestamps from non-slave port being compared in jbod mode. But due to mono_interval (CLOCK_MONOTIC interval b/w two successive calls to clock_check_sample()) becoming high due to the execution BMCA algorithm between two successive calls to clock_check_sample() function. We believe that this is due to execution time being higher, and solution is to increase the default value of sanity_freq_limit accordingly. Please find the logs we collected with your patches applied. Here we can see that mono_interval (132256969) increases when compared to the interval between hardware timestamp of successive sync packets (102726564). Please note that port2(sriov1) is the current active port, and clock check is never done on port1. Log: ptp4l[13295.936]: selected best master clock 001122.fffe.33f801 ptp4l[13295.936]: updating UTC offset to 37 ptp4l[13295.936]: port :: port 2 (sriov1) ptp4l[13295.936]: Mono Interval :: 132256969 ptp4l[13295.936]: Interval :: 102726564 ptp4l[13295.936]: clockcheck: clock jumped backward or running slower than expected! ptp4l[13295.936]: port 2 (sriov1): SLAVE to UNCALIBRATED on SYNCHRONIZATION_FAULT ptp4l[13296.202]: selected best master clock 001122.fffe.33f801 ptp4l[13296.202]: updating UTC offset to 37 ptp4l[13296.319]: selected best master clock 001122.fffe.33f801 ptp4l[13296.319]: updating UTC offset to 37 ptp4l[13296.515]: selected best master clock 001122.fffe.33f801 ptp4l[13296.515]: updating UTC offset to 37 ptp4l[13296.725]: selected best master clock 001122.fffe.33f801 ptp4l[13296.725]: updating UTC offset to 37 ptp4l[13296.873]: port 2 (sriov1): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED ptp4l[13296.963]: port :: port 2 (sriov1) ptp4l[13296.964]: Mono Interval :: 1027848806 ptp4l[13296.964]: Interval :: 1057330889 ptp4l[13297.013]: selected best master clock 001122.fffe.33f801 ptp4l[13297.013]: updating UTC offset to 37 ptp4l[13297.059]: port :: port 2 (sriov1) ptp4l[13297.062]: rms 8 max 14 freq -2600 +/- 9 delay 729 +/- 0 Commands used : ptp4l -f ~/config_ptp_8275_1.conf -H -m -s --boundary_clock_jbod=1 -i sriov0 -i sriov1 phc2sys -a -r -m -R 16 -n 24 Thanks, Amar B S > -----Original Message----- > From: Miroslav Lichvar [mailto:mli...@re...] > Sent: 26 May 2021 13:54 > To: lin...@li... > Cc: Amar Subramanyam <asu...@al...> > Subject: [PATCH v2 3/3] port: Don't check timestamps from non-slave ports. > > CAUTION: This email originated from outside of Altiostar. Do not click on links > or open attachments unless you recognize the sender and you are sure the > content is safe. You will never be asked to reset your Altiostar password via > email. > > > Don't perform the sanity check on receive timestamps from ports in non-slave > states to avoid false positives in the jbod mode, where the timestamps can be > generated by different clocks. > > Reported-by: Amar Subramanyam <asu...@al...> > Signed-off-by: Miroslav Lichvar <mli...@re...> > --- > port.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/port.c b/port.c > index 10bb9e1..fb420fb 100644 > --- a/port.c > +++ b/port.c > @@ -2744,7 +2744,10 @@ static enum fsm_event bc_event(struct port *p, int > fd_index) > } > if (msg_sots_valid(msg)) { > ts_add(&msg->hwts.ts, -p->rx_timestamp_offset); > - clock_check_ts(p->clock, tmv_to_nanoseconds(msg->hwts.ts)); > + if (p->state == PS_SLAVE) { > + clock_check_ts(p->clock, > + tmv_to_nanoseconds(msg->hwts.ts)); > + } > } > > switch (msg_type(msg)) { > -- > 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-26 14:32:37
|
On Wed, May 26, 2021 at 02:05:01PM +0000, Amar Subramanyam wrote: > Hi Miroslav, > > I am afraid, the patches that was intended to fix the reported issue doesn't do so. The issue that we reported was not due to timestamps from non-slave port being compared in jbod mode. But due to mono_interval (CLOCK_MONOTIC interval b/w two successive calls to clock_check_sample()) becoming high due to the execution BMCA algorithm between two successive calls to clock_check_sample() function. We believe that this is due to execution time being higher, and solution is to increase the default value of sanity_freq_limit accordingly. The execution of the BMCA took 30 milliseconds? In that case the fix would be to make the minimum interval longer. Can you post a strace -tt log containing the last second leading to the fault? I suspect it's rather the PHC interval that's wrong. There might be an unexpected interaction with phc2sys. Can you reproduce the issue with no phc2sys running? -- Miroslav Lichvar |
From: Miroslav L. <mli...@re...> - 2021-05-27 09:44:35
|
On Wed, May 26, 2021 at 05:45:29PM +0000, Amar Subramanyam wrote: > Hi Miroslav, > > We were able to reproduce the issue even without running phc2sys. > Please find the attached strace and ptp4l logs when this issue is seen. Ok, thanks. That's very helpful. > From the ptp4l log, we can see that BMCA took 148 msec to run(Mono Interval :: 148445967). > The same can be observed from strace logs. In the attached strace log, BMCA is executed between the timestamps 00:07:27.047830 (line number 53) to 00:07:27.196275(line number 102). I don't see that in the log. > 00:07:27.142942 close(4) = 0 > 00:07:27.153733 close(15) = 0 > 00:07:27.167783 socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 > 00:07:27.167829 ioctl(4, SIOCGIFHWADDR, {ifr_name="sriov0", ifr_hwaddr={sa_family=ARPHRD_ETHER, sa_data=64:4c:36:12:55:e0}}) = 0 > 00:07:27.167878 close(4) = 0 > 00:07:27.167964 socket(AF_PACKET, SOCK_RAW, 768) = 4 > 00:07:27.168023 ioctl(4, SIOCGIFINDEX, {ifr_name="sriov0", }) = 0 > 00:07:27.168095 bind(4, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("sriov0"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0 > 00:07:27.178781 setsockopt(4, SOL_SOCKET, SO_BINDTODEVICE, "sriov0", 6) = 0 > 00:07:27.178830 setsockopt(4, SOL_SOCKET, SO_ATTACH_FILTER, {len=12, filter=0x635ae0}, 16) = 0 > 00:07:27.178875 setsockopt(4, SOL_PACKET, PACKET_ADD_MEMBERSHIP, {mr_ifindex=if_nametoindex("sriov0"), mr_type=PACKET_MR_MULTICAST, mr_alen=6, mr_address=011b19000000}, 16) = 0 > 00:07:27.179151 setsockopt(4, SOL_PACKET, PACKET_ADD_MEMBERSHIP, {mr_ifindex=if_nametoindex("sriov0"), mr_type=PACKET_MR_MULTICAST, mr_alen=6, mr_address=0180c200000e}, 16) = 0 > 00:07:27.179415 socket(AF_PACKET, SOCK_RAW, 768) = 15 > 00:07:27.179482 ioctl(15, SIOCGIFINDEX, {ifr_name="sriov0", }) = 0 > 00:07:27.179518 bind(15, {sa_family=AF_PACKET, sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("sriov0"), sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = 0 > 00:07:27.193807 setsockopt(15, SOL_SOCKET, SO_BINDTODEVICE, "sriov0", 6) = 0 What I see is that it's the closing and binding of the raw sockets that is slowing down ptp4l so much that a received message waits for up to ~40 milliseconds before the clock check gets its timestamp. ptp4l is renewing the transport on announce/sync timeout, which according to the git log is needed in the client-only mode to not get multicast sockets stuck when the link goes down. I think the fix here is to avoid renewing the raw transport. It shouldn't be necessary if I understand it correctly. Can you please verify that the following change fixes the problem for you? --- a/port.c +++ b/port.c @@ -1806,6 +1806,12 @@ static int port_renew_transport(struct port *p) if (!port_is_enabled(p)) { return 0; } + + /* Closing and binding of raw sockets is too slow and unnecessary */ + if (transport_type(p->trp) == TRANS_IEEE_802_3) { + return 0; + } + transport_close(p->trp, &p->fda); port_clear_fda(p, FD_FIRST_TIMER); res = transport_open(p->trp, p->iface, &p->fda, p->timestamping); -- Miroslav Lichvar |
From: Amar S. <asu...@al...> - 2021-05-28 09:26:36
|
Hi Miroslav, Please find our response inline. Thanks, Amar B S > -----Original Message----- > From: Miroslav Lichvar [mailto:mli...@re...] > Sent: 27 May 2021 15:14 > To: Amar Subramanyam <asu...@al...> > Cc: lin...@li... > Subject: Re: [PATCH v2 3/3] port: Don't check timestamps from non-slave ports. > > CAUTION: This email originated from outside of Altiostar. Do not click on links > or open attachments unless you recognize the sender and you are sure the > content is safe. You will never be asked to reset your Altiostar password via > email. > > > On Wed, May 26, 2021 at 05:45:29PM +0000, Amar Subramanyam wrote: > > Hi Miroslav, > > > > We were able to reproduce the issue even without running phc2sys. > > Please find the attached strace and ptp4l logs when this issue is seen. > > Ok, thanks. That's very helpful. > > > From the ptp4l log, we can see that BMCA took 148 msec to run(Mono Interval > :: 148445967). > > The same can be observed from strace logs. In the attached strace log, BMCA > is executed between the timestamps 00:07:27.047830 (line number 53) to > 00:07:27.196275(line number 102). > > I don't see that in the log. > > > 00:07:27.142942 close(4) = 0 > > 00:07:27.153733 close(15) = 0 > > 00:07:27.167783 socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4 > > 00:07:27.167829 ioctl(4, SIOCGIFHWADDR, {ifr_name="sriov0", > ifr_hwaddr={sa_family=ARPHRD_ETHER, sa_data=64:4c:36:12:55:e0}}) = 0 > > 00:07:27.167878 close(4) = 0 > > 00:07:27.167964 socket(AF_PACKET, SOCK_RAW, 768) = 4 > > 00:07:27.168023 ioctl(4, SIOCGIFINDEX, {ifr_name="sriov0", }) = 0 > > 00:07:27.168095 bind(4, {sa_family=AF_PACKET, > > sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("sriov0"), > > sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = > > 0 > > 00:07:27.178781 setsockopt(4, SOL_SOCKET, SO_BINDTODEVICE, "sriov0", > > 6) = 0 > > 00:07:27.178830 setsockopt(4, SOL_SOCKET, SO_ATTACH_FILTER, {len=12, > > filter=0x635ae0}, 16) = 0 > > 00:07:27.178875 setsockopt(4, SOL_PACKET, PACKET_ADD_MEMBERSHIP, > > {mr_ifindex=if_nametoindex("sriov0"), mr_type=PACKET_MR_MULTICAST, > > mr_alen=6, mr_address=011b19000000}, 16) = 0 > > 00:07:27.179151 setsockopt(4, SOL_PACKET, PACKET_ADD_MEMBERSHIP, > > {mr_ifindex=if_nametoindex("sriov0"), mr_type=PACKET_MR_MULTICAST, > > mr_alen=6, mr_address=0180c200000e}, 16) = 0 > > 00:07:27.179415 socket(AF_PACKET, SOCK_RAW, 768) = 15 > > 00:07:27.179482 ioctl(15, SIOCGIFINDEX, {ifr_name="sriov0", }) = 0 > > 00:07:27.179518 bind(15, {sa_family=AF_PACKET, > > sll_protocol=htons(ETH_P_ALL), sll_ifindex=if_nametoindex("sriov0"), > > sll_hatype=ARPHRD_NETROM, sll_pkttype=PACKET_HOST, sll_halen=0}, 20) = > > 0 > > 00:07:27.193807 setsockopt(15, SOL_SOCKET, SO_BINDTODEVICE, "sriov0", > > 6) = 0 > > What I see is that it's the closing and binding of the raw sockets that is slowing > down ptp4l so much that a received message waits for up to ~40 milliseconds > before the clock check gets its timestamp. > > ptp4l is renewing the transport on announce/sync timeout, which according to > the git log is needed in the client-only mode to not get multicast sockets stuck > when the link goes down. I think the fix here is to avoid renewing the raw > transport. It shouldn't be necessary if I understand it correctly. > > Can you please verify that the following change fixes the problem for you? > > --- a/port.c > +++ b/port.c > @@ -1806,6 +1806,12 @@ static int port_renew_transport(struct port *p) > if (!port_is_enabled(p)) { > return 0; > } > + > + /* Closing and binding of raw sockets is too slow and unnecessary */ > + if (transport_type(p->trp) == TRANS_IEEE_802_3) { > + return 0; > + } > + > transport_close(p->trp, &p->fda); > port_clear_fda(p, FD_FIRST_TIMER); > res = transport_open(p->trp, p->iface, &p->fda, p->timestamping); > Yes, With this patch applied, the problem seems to be fixed. We even did a cable removal test and link status was correctly detected. We could see that ptp4l already has RTNL functionality and hence this port_renew_transport() seems to be unnecessary redundancy. Should bypassing the function entirely not just for 802_3 but also for UDPv4 and UDPv6 transports be the right solution? Let us know your comments. > -- > Miroslav Lichvar |
From: Richard C. <ric...@gm...> - 2021-05-28 16:51:05
|
On Fri, May 28, 2021 at 09:26:07AM +0000, Amar Subramanyam via Linuxptp-devel wrote: > We could see that ptp4l already has RTNL functionality and hence this port_renew_transport() seems to be unnecessary redundancy. The link notifications from the kernel are best effort and are not guaranteed. So renewing the transport is the safe thing to do. Thanks, Richard |
From: Miroslav L. <mli...@re...> - 2021-05-31 09:08:17
|
v3 - added patch to avoid slow renewal of raw sockets - added patch to increase the minimum check interval v2 - improved commit message - added missing NULL check These patches make the clock check more reliable in several different cases. The first patch is not strictly related to the clock check, e.g. it also fixes the path delay after switching the slave port. Miroslav Lichvar (5): clock: Reset state when switching port with same best clock. clock: Reset clock check on best clock/port change. port: Don't check timestamps from non-slave ports. port: Don't renew raw transport. clockcheck: Increase minimum interval. clock.c | 4 +++- clockcheck.c | 11 +++++++++-- clockcheck.h | 6 ++++++ port.c | 11 ++++++++++- 4 files changed, 28 insertions(+), 4 deletions(-) -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-31 09:08:17
|
Renewing of the transport on announce/sync timeout is needed in the client-only mode to avoid getting stuck with a broken multicast socket when the link goes down. This shouldn't be necessary with the raw transport. Closing and binding of raw sockets can apparently be so slow that it triggers a false positive in the clock check. Reported-by: Amar Subramanyam <asu...@al...> Signed-off-by: Miroslav Lichvar <mli...@re...> --- port.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/port.c b/port.c index fb420fb..6bf0684 100644 --- a/port.c +++ b/port.c @@ -1806,6 +1806,12 @@ static int port_renew_transport(struct port *p) if (!port_is_enabled(p)) { return 0; } + + /* Closing and binding of raw sockets is too slow and unnecessary */ + if (transport_type(p->trp) == TRANS_IEEE_802_3) { + return 0; + } + transport_close(p->trp, &p->fda); port_clear_fda(p, FD_FIRST_TIMER); res = transport_open(p->trp, p->iface, &p->fda, p->timestamping); -- 2.26.3 |
From: Keller, J. E <jac...@in...> - 2021-06-02 01:07:45
|
> -----Original Message----- > From: Miroslav Lichvar <mli...@re...> > Sent: Monday, May 31, 2021 2:08 AM > To: lin...@li... > Subject: [Linuxptp-devel] [PATCH v3 4/5] port: Don't renew raw transport. > > Renewing of the transport on announce/sync timeout is needed in the > client-only mode to avoid getting stuck with a broken multicast socket > when the link goes down. > > This shouldn't be necessary with the raw transport. Closing and binding > of raw sockets can apparently be so slow that it triggers a false > positive in the clock check. > > Reported-by: Amar Subramanyam <asu...@al...> > Signed-off-by: Miroslav Lichvar <mli...@re...> > --- Makes sense. Reviewed-by: Jacob Keller <jac...@in...> > port.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/port.c b/port.c > index fb420fb..6bf0684 100644 > --- a/port.c > +++ b/port.c > @@ -1806,6 +1806,12 @@ static int port_renew_transport(struct port *p) > if (!port_is_enabled(p)) { > return 0; > } > + > + /* Closing and binding of raw sockets is too slow and unnecessary */ > + if (transport_type(p->trp) == TRANS_IEEE_802_3) { > + return 0; > + } > + > transport_close(p->trp, &p->fda); > port_clear_fda(p, FD_FIRST_TIMER); > res = transport_open(p->trp, p->iface, &p->fda, p->timestamping); > -- > 2.26.3 > > > > _______________________________________________ > Linuxptp-devel mailing list > Lin...@li... > https://lists.sourceforge.net/lists/listinfo/linuxptp-devel |
From: Miroslav L. <mli...@re...> - 2021-05-31 09:08:15
|
Don't perform the sanity check on receive timestamps from ports in non-slave states to avoid false positives in the jbod mode, where the timestamps can be generated by different clocks. Reviewed-by: Jacob Keller <jac...@in...> Signed-off-by: Miroslav Lichvar <mli...@re...> --- port.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/port.c b/port.c index 10bb9e1..fb420fb 100644 --- a/port.c +++ b/port.c @@ -2744,7 +2744,10 @@ static enum fsm_event bc_event(struct port *p, int fd_index) } if (msg_sots_valid(msg)) { ts_add(&msg->hwts.ts, -p->rx_timestamp_offset); - clock_check_ts(p->clock, tmv_to_nanoseconds(msg->hwts.ts)); + if (p->state == PS_SLAVE) { + clock_check_ts(p->clock, + tmv_to_nanoseconds(msg->hwts.ts)); + } } switch (msg_type(msg)) { -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-31 09:08:15
|
Reset the clock check when the best clock or port changes, together with the other state like current estimated delay and frequency. This avoids false positives if the clock is controlled by an external process when not synchronized by PTP (e.g. phc2sys -rr). Reviewed-by: Jacob Keller <jac...@in...> Signed-off-by: Miroslav Lichvar <mli...@re...> --- clock.c | 2 ++ clockcheck.c | 9 ++++++++- clockcheck.h | 6 ++++++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/clock.c b/clock.c index a073575..2dd7ef9 100644 --- a/clock.c +++ b/clock.c @@ -1949,6 +1949,8 @@ static void handle_state_decision_event(struct clock *c) if (!cid_eq(&best_id, &c->best_id) || best != c->best) { clock_freq_est_reset(c); + if (c->sanity_check) + clockcheck_reset(c->sanity_check); tsproc_reset(c->tsproc, 1); if (!tmv_is_zero(c->initial_delay)) tsproc_set_delay(c->tsproc, c->initial_delay); diff --git a/clockcheck.c b/clockcheck.c index d48a578..d0b4714 100644 --- a/clockcheck.c +++ b/clockcheck.c @@ -47,9 +47,16 @@ struct clockcheck *clockcheck_create(int freq_limit) if (!cc) return NULL; cc->freq_limit = freq_limit; + clockcheck_reset(cc); + return cc; +} + +void clockcheck_reset(struct clockcheck *cc) +{ + cc->freq_known = 0; cc->max_freq = -CHECK_MAX_FREQ; cc->min_freq = CHECK_MAX_FREQ; - return cc; + cc->last_ts = 0; } int clockcheck_sample(struct clockcheck *cc, uint64_t ts) diff --git a/clockcheck.h b/clockcheck.h index 78aca48..1ff86eb 100644 --- a/clockcheck.h +++ b/clockcheck.h @@ -33,6 +33,12 @@ struct clockcheck; */ struct clockcheck *clockcheck_create(int freq_limit); +/** + * Reset a clock check. + * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). + */ +void clockcheck_reset(struct clockcheck *cc); + /** * Perform the sanity check on a time stamp. * @param cc Pointer to a clock check obtained via @ref clockcheck_create(). -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-31 09:08:17
|
When the best port is changed, but the ID of the best clock doesn't change (e.g. a passive port is activated on link failure), reset the current delay and other master/link-specific state to avoid the switch throwing the clock off. Reviewed-by: Jacob Keller <jac...@in...> Signed-off-by: Miroslav Lichvar <mli...@re...> --- clock.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/clock.c b/clock.c index e545a9b..a073575 100644 --- a/clock.c +++ b/clock.c @@ -1947,7 +1947,7 @@ static void handle_state_decision_event(struct clock *c) cid2str(&best_id)); } - if (!cid_eq(&best_id, &c->best_id)) { + if (!cid_eq(&best_id, &c->best_id) || best != c->best) { clock_freq_est_reset(c); tsproc_reset(c->tsproc, 1); if (!tmv_is_zero(c->initial_delay)) -- 2.26.3 |
From: Miroslav L. <mli...@re...> - 2021-05-31 09:08:23
|
Increase the minimum check interval to 1 second to measure the frequency offset more accurately and with default configuration make false positives less likely due to a heavily overloaded system. Signed-off-by: Miroslav Lichvar <mli...@re...> --- clockcheck.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/clockcheck.c b/clockcheck.c index d0b4714..f0141be 100644 --- a/clockcheck.c +++ b/clockcheck.c @@ -23,7 +23,7 @@ #include "clockcheck.h" #include "print.h" -#define CHECK_MIN_INTERVAL 100000000 +#define CHECK_MIN_INTERVAL 1000000000 #define CHECK_MAX_FREQ 900000000 struct clockcheck { -- 2.26.3 |