We are testing the mstpd daemon and everything is working pretty fine until now. However we have found an issue when changing the system hour.
I took a look to the code and I found out that the reference time in the main loop is an increment of the hour at the beginning of the execution. That leads to an unexpected behaviour of the mstp state machine if the user change the system hour.
Hi Jorge!
After some consideration I came to the conclusion that [r44] isn't a good solution. It has a problem - interval between calls to the bridge_one_second() is greater than one second, and sometimes considerably greater, because it includes time spent in all event handlers and in the bridge_one_second() itself.
Please, test/review the following patch (applied on top of the revision [r44]) which (I hope) solves the issue, and tell me if it works for you:
Index: epoll_loop.c===================================================================--- epoll_loop.c (revision 44)+++ epoll_loop.c (working copy)@@ -94,7 +94,6 @@
static inline void run_timeouts(void)
{
bridge_one_second();
- gettimeofday(&nexttimeout, NULL);
++(nexttimeout.tv_sec);
}
@@ -116,6 +115,20 @@
if(timeout < 0 || timeout > 1000)
{
run_timeouts();
+ /* Check if system time has changed.+ * NOTE: we can not differentiate reliably if system+ * time has changed or we have spent too much time+ * inside event handlers and run_timeouts().+ * Fix: use clock_gettime(CLOCK_MONOTONIC, ) instead of+ * gettimeofday, if it is available.+ * If it is not available on given system -+ * the following is the best we can do.+ */+ timeout = time_diff(&nexttimeout, &tv);+ if(timeout < -3000 || timeout > 1000)+ { /* Most probably, system time has changed */+ nexttimeout.tv_sec = tv.tv_sec + 1;+ }
timeout = 0;
}
I took my some time to understand the difference between version 44 and this new patch. Finally I realized that interval between the beginning of two consecutive executions of function bridge_one_second() must be one second, but in version 44 this one second interval correspond to the difference between the end of one execution to the start of the next one, isn't it?
I was testing version 44 over a very simple topology, so the execution time of bridge_one_second had not influence over final result. I guess that in a complex topology it would become critical.
This is my version of the patch. When updating nexttimeout, I take in count tv_usec as well as tv_sec (I am not completly sure about if it's necesary)
@@ -116,6 +115,27 @@
if(timeout < 0 || timeout > 1000)
{
run_timeouts();
+ /
+ * Check if system time has changed.
+ * NOTE: we can not differentiate reliably if system
+ * time has changed or we have spent too much time
+ * inside event handlers and run_timeouts().
+ * Fix: use clock_gettime(CLOCK_MONOTONIC, ) instead of
+ * gettimeofday, if it is available.
+ * If it is not available on given system -
+ * the following is the best we can do.
+ /
+
+ /
+ * Recompute of timeout is not required. tv struct does not change and
+ * nextimeout struct is known
+ /
+ if(timeout < -4000 || timeout > 1000)
+ {
+ / Most probably, system time has changed /
+ nexttimeout.tv_usec = tv.tv_usec;
+ nexttimeout.tv_sec = tv.tv_sec + 1;
+ }
timeout = 0;
}
~~~~~~~
Regards
Jorge
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I took my some time to understand the difference between version 44 and this new patch. Finally I realized that interval between the beginning of two consecutive executions of function bridge_one_second() must be one second, but in version 44 this one second interval correspond to the difference between the end of one execution to the start of the next one, isn't it?
Exactly!
I was testing version 44 over a very simple topology, so the execution time of bridge_one_second had not influence over final result. I guess that in a complex topology it would become critical.
Yes. Slow CPU + many ports in bridge = seconds begin to count wrongly. Not sure how critical is it, but anyway we can eliminate that accumulative error in counting seconds, so let's do it.
This is my version of the patch. When updating nexttimeout, I take in count tv_usec as well as tv_sec (I am not completly sure about if it's necesary)
Thanks! Applied as revision [r45] with some minor formatting/whitespace issues. Also removed the "timeout" comment as it is discussing patch decision (why you removed the line "timeout = time_diff(&nexttimeout, &tv)"), not adding much to the resulting code. I hope, it is not a problem :)
Hello,
We are testing the mstpd daemon and everything is working pretty fine until now. However we have found an issue when changing the system hour.
I took a look to the code and I found out that the reference time in the main loop is an increment of the hour at the beginning of the execution. That leads to an unexpected behaviour of the mstp state machine if the user change the system hour.
I think the next patch can resolve the problem:
I have not a deep knowledge about the whole daemon code, so maybe this is not the right place to make the change.
Hope you find this helpful.
Regards
Jorge
Applied as revision [r44] with the ommision of unused variable cmd.
Thanks, Gastón!
BTW, it's funny, the other person (Rodolfo Giometti) had reported the same bug at the same time :) Do you know each other by any chance?
Related
Commit: [r44]
Hi,
I don't have the pleasure of knowing Rodolfo Giometti. Anyway, I am glad to hear that more people are working with mstpd.
Sorry about the cmd variable. I used it for debugging purposes and I forgot to remove it.
Thanks for your fast answer.
Best regards
Jorge
Hi Jorge!
After some consideration I came to the conclusion that [r44] isn't a good solution. It has a problem - interval between calls to the bridge_one_second() is greater than one second, and sometimes considerably greater, because it includes time spent in all event handlers and in the bridge_one_second() itself.
Please, test/review the following patch (applied on top of the revision [r44]) which (I hope) solves the issue, and tell me if it works for you:
Related
Commit: [r44]
Hi Vitalii!
I tested this new patch and it works fine.
I took my some time to understand the difference between version 44 and this new patch. Finally I realized that interval between the beginning of two consecutive executions of function bridge_one_second() must be one second, but in version 44 this one second interval correspond to the difference between the end of one execution to the start of the next one, isn't it?
I was testing version 44 over a very simple topology, so the execution time of bridge_one_second had not influence over final result. I guess that in a complex topology it would become critical.
This is my version of the patch. When updating nexttimeout, I take in count tv_usec as well as tv_sec (I am not completly sure about if it's necesary)
~~~~~~
--- mstpd-44.orig/epoll_loop.c
+++ mstpd-43.rev13/epoll_loop.c
@@ -94,7 +94,6 @@
static inline void run_timeouts(void)
{
bridge_one_second();
- gettimeofday(&nexttimeout, NULL);
++(nexttimeout.tv_sec);
}
@@ -116,6 +115,27 @@
if(timeout < 0 || timeout > 1000)
{
run_timeouts();
+ /
+ * Check if system time has changed.
+ * NOTE: we can not differentiate reliably if system
+ * time has changed or we have spent too much time
+ * inside event handlers and run_timeouts().
+ * Fix: use clock_gettime(CLOCK_MONOTONIC, ) instead of
+ * gettimeofday, if it is available.
+ * If it is not available on given system -
+ * the following is the best we can do.
+ /
+
+ /
+ * Recompute of timeout is not required. tv struct does not change and
+ * nextimeout struct is known
+ /
+ if(timeout < -4000 || timeout > 1000)
+ {
+ / Most probably, system time has changed /
+ nexttimeout.tv_usec = tv.tv_usec;
+ nexttimeout.tv_sec = tv.tv_sec + 1;
+ }
timeout = 0;
}
~~~~~~~
Regards
Jorge
Hi Jorge!
Great, thanks!
Exactly!
Yes. Slow CPU + many ports in bridge = seconds begin to count wrongly. Not sure how critical is it, but anyway we can eliminate that accumulative error in counting seconds, so let's do it.
Thanks! Applied as revision [r45] with some minor formatting/whitespace issues. Also removed the "timeout" comment as it is discussing patch decision (why you removed the line "timeout = time_diff(&nexttimeout, &tv)"), not adding much to the resulting code. I hope, it is not a problem :)
Related
Commit: [r45]