Thread: Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of (Page 2)
PTP IEEE 1588 stack for Linux
Brought to you by:
rcochran
From: Vladimir O. <ol...@gm...> - 2020-06-18 20:18:17
|
On Thu, 18 Jun 2020 at 23:02, Jacob Keller <jac...@in...> wrote: > > > > On 6/18/2020 12:56 PM, Vladimir Oltean wrote: > >> Could this use the pr facility, so that it honors the printing options > >> to log to syslog and manages the level? > >> > >> I'm not sure what I would mark these as: debug because they no longer > >> impact the syncing process, or true error because they are unexpected? > >> > >> Thanks, > >> Jake > >> > > > > Thanks a lot for looking at this. > > Good point about redirecting to syslog, I did not think about that. > > The reason why I did not use pr_err was due to the automatic line > > ending, which would have unnecessarily complicated this function by > > having me print each line of the packet to a temporary buffer first. > > But now that I see there's no way around it, I guess I'll have to do > > that. > What about implementing this as part of pr.c? > > Thanks, > Jake Not really sure what you mean. What would the prototype of the function be (i.e. what would it do)? -Vladimir |
From: Jacob K. <jac...@in...> - 2020-06-18 20:43:34
|
On 6/18/2020 1:17 PM, Vladimir Oltean wrote: > On Thu, 18 Jun 2020 at 23:02, Jacob Keller <jac...@in...> wrote: >> >> >> >> On 6/18/2020 12:56 PM, Vladimir Oltean wrote: >>>> Could this use the pr facility, so that it honors the printing options >>>> to log to syslog and manages the level? >>>> >>>> I'm not sure what I would mark these as: debug because they no longer >>>> impact the syncing process, or true error because they are unexpected? >>>> >>>> Thanks, >>>> Jake >>>> >>> >>> Thanks a lot for looking at this. >>> Good point about redirecting to syslog, I did not think about that. >>> The reason why I did not use pr_err was due to the automatic line >>> ending, which would have unnecessarily complicated this function by >>> having me print each line of the packet to a temporary buffer first. >>> But now that I see there's no way around it, I guess I'll have to do >>> that. >> What about implementing this as part of pr.c? >> >> Thanks, >> Jake > > Not really sure what you mean. What would the prototype of the > function be (i.e. what would it do)? > > -Vladimir > I was thinking like something that would do a hexdump to a given print level. Perhaps that's overkill since not many places would need it... Thanks, Jake |
From: Richard C. <ric...@gm...> - 2020-06-19 11:44:07
|
On Thu, Jun 18, 2020 at 12:42:55PM -0700, Jacob Keller wrote: > > So it is a bug if poll() wakes up the process for data availability on a > > socket's receive queue, and then recvmsg(), called immediately > > afterwards, goes to sleep trying to retrieve it. Yes. > Makes sense. Using MSG_DONTWAIT seems reasonable since we already know > there should be data available. I disagree on this point. There are (at least) three ways to avoid blocking on receive. 1. poll() first. The guarantees data are available, and for SDGAM sockets those data comprise at least on message. 2. use the MSG_DONTWAIT flag 3. set the SO_RCVTIMEO socket option This patch would mix methods 1 and 2. I understand that it still works with the DONTWAIT, but it sets a bad example. I want this program to be an example of correct programming patterns. Our industry has plenty enough guess work and sloppy copy and paste as it is. Thanks, Richard |
From: Vladimir O. <ol...@gm...> - 2020-06-19 13:02:16
|
On Fri, 19 Jun 2020 at 15:47, Richard Cochran <ric...@gm...> wrote: > > On Fri, Jun 19, 2020 at 03:41:09PM +0300, Vladimir Oltean wrote: > > Ok, and this patch does not have a place in linuxptp because? > > It is a matter of style, as I already said. > > Thanks, Again, I would have accepted any technical argument you could bring. Like, for example, the fact that this sleep on recv() is becoming impossible after patch "[PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets" (which _is_ fixing a ptp4l application bug), even without MSG_DONTWAIT in place. But you prefer to bring the style argument. So, as a matter of personal style, I think the only right thing for me to do is to retreat now. Feel free to use any portion of the patches I submitted so far in any combination you see desirable. Thanks, -Vladimir |
From: Richard C. <ric...@gm...> - 2020-06-19 14:22:05
|
On Fri, Jun 19, 2020 at 04:01:57PM +0300, Vladimir Oltean wrote: > Again, I would have accepted any technical argument you could bring. Thinking more about this, there is in fact a technical reason to leave the code as is. For DGRAM sockets, the behavior of recvmsg() in sk_receive() is the same with and without DONTWAIT. But this is only by chance. Consider a STREAM socket with fixed message sizes. In that case, poll indicates only that the socket is readable, and not that an entire message is available. When recvmsg(flags=DONTWAIT) returns, the length may be less than a full message, and the code in sk_receive() would have to read the socket again (and again) until a whole message arrives. If sk_receive(flags=DONTWAIT) is allowed, then that method would have to cover the case of reading a partial message. But sk_receive() does not implement that programming pattern. Instead the caller knows a message is available, and then sk_receive() "blocks" in recvmsg() for that one message. That is the design pattern. > So, as a matter of personal style, I think the only right thing for me > to do is to retreat now. Feel free to use any portion of the patches I > submitted so far in any combination you see desirable. I will likely apply patch 3. For patch 2, could you please comment on the diff I posted as an alternate solution? Thanks, Richard |
From: Vladimir O. <ol...@gm...> - 2020-06-19 14:50:10
|
On Fri, 19 Jun 2020 at 17:21, Richard Cochran <ric...@gm...> wrote: > > On Fri, Jun 19, 2020 at 04:01:57PM +0300, Vladimir Oltean wrote: > > Again, I would have accepted any technical argument you could bring. > > Thinking more about this, there is in fact a technical reason to leave > the code as is. For DGRAM sockets, the behavior of recvmsg() in > sk_receive() is the same with and without DONTWAIT. But this is only > by chance. > > Consider a STREAM socket with fixed message sizes. In that case, poll > indicates only that the socket is readable, and not that an entire > message is available. When recvmsg(flags=DONTWAIT) returns, the > length may be less than a full message, and the code in sk_receive() > would have to read the socket again (and again) until a whole message > arrives. If sk_receive(flags=DONTWAIT) is allowed, then that method > would have to cover the case of reading a partial message. > A technical argument which implies that IEEE 1588 synchronization may work over connection-oriented sockets is a stylistic argument, really. I don't buy it. > But sk_receive() does not implement that programming pattern. Instead > the caller knows a message is available, and then sk_receive() > "blocks" in recvmsg() for that one message. That is the design > pattern. > No, sk_receive() does not "block" in recvmsg() for that one message. Any kernel tracing program (I recommend kernelshark, it has a nice GUI) will tell you that. > > So, as a matter of personal style, I think the only right thing for me > > to do is to retreat now. Feel free to use any portion of the patches I > > submitted so far in any combination you see desirable. > > I will likely apply patch 3. For patch 2, could you please comment on > the diff I posted as an alternate solution? > Yes, I will provide my feedback on that. > Thanks, > Richard Thanks, -Vladimir |
From: Vladimir O. <ol...@gm...> - 2020-06-19 11:52:15
|
Hi Richard, On Fri, 19 Jun 2020 at 14:44, Richard Cochran <ric...@gm...> wrote: > > On Thu, Jun 18, 2020 at 12:42:55PM -0700, Jacob Keller wrote: > > > So it is a bug if poll() wakes up the process for data availability on a > > > socket's receive queue, and then recvmsg(), called immediately > > > afterwards, goes to sleep trying to retrieve it. > > Yes. > > > Makes sense. Using MSG_DONTWAIT seems reasonable since we already know > > there should be data available. > > I disagree on this point. There are (at least) three ways to avoid > blocking on receive. > > 1. poll() first. The guarantees data are available, and for SDGAM > sockets those data comprise at least on message. > > 2. use the MSG_DONTWAIT flag > > 3. set the SO_RCVTIMEO socket option > > This patch would mix methods 1 and 2. I understand that it still > works with the DONTWAIT, but it sets a bad example. I want this > program to be an example of correct programming patterns. Our > industry has plenty enough guess work and sloppy copy and paste as it > is. > > Thanks, > Richard Could you please give a concrete example of why it isn't desirable for MSG_DONTWAIT to be specified on a socket where data availability has been previously confirmed through poll()? I'm not sure what the patch has to do with the industry having "plenty enough guess work and sloppy copy and paste as it is". Thanks, -Vladimir |
From: Richard C. <ric...@gm...> - 2020-06-19 11:52:21
|
On Mon, Jun 15, 2020 at 06:23:20PM +0300, Vladimir Oltean wrote: > Printing them to the user is optional (and helpful), but reading them is > not. With this patch, even with extraneous data delivered by a buggy > kernel (which the application now loudly complains about), the > synchronization keeps chugging along. Otherwise the application starts > reordering packets in recvmsg() due to misinterpreting which socket > queue has data available. Again, I see this stack as production code and not a debugging tool. Hexdumping frames is more for wireshark or tools like linux/tools/testing/selftests/net/[rx\|tx]timestamp in the kernel source. In fact, I think that tool already dumps frames. Thanks, Richard |
From: Vladimir O. <ol...@gm...> - 2020-06-19 11:55:35
|
On Fri, 19 Jun 2020 at 14:52, Richard Cochran <ric...@gm...> wrote: > > On Mon, Jun 15, 2020 at 06:23:20PM +0300, Vladimir Oltean wrote: > > Printing them to the user is optional (and helpful), but reading them is > > not. With this patch, even with extraneous data delivered by a buggy > > kernel (which the application now loudly complains about), the > > synchronization keeps chugging along. Otherwise the application starts > > reordering packets in recvmsg() due to misinterpreting which socket > > queue has data available. > > Again, I see this stack as production code and not a debugging tool. > Hexdumping frames is more for wireshark or tools like > > linux/tools/testing/selftests/net/[rx\|tx]timestamp > > in the kernel source. In fact, I think that tool already dumps > frames. > > Thanks, > Richard Sure. In fact, printing the data on the error queue is totally optional and I could just as well remove it, no problem with that. But reading it isn't. I don't think you've addressed that point. -Vladimir |
From: Vladimir O. <ol...@gm...> - 2020-06-24 16:37:06
|
Hi Richard, On Wed, 24 Jun 2020 at 19:27, Richard Cochran <ric...@gm...> wrote: > > On Fri, Jun 19, 2020 at 10:45:23PM +0300, Vladimir Oltean wrote: > > Yes, much better. > > May I add your tag to the commit message, like this? I think the commit description is slightly lossy and a bit misleading. > > Catch unexpected socket polling errors. This part is ok. > > The poll(2) system call may set POLLERR in the returned events. Normally > no errors are returned unless specifically requested by setting an > appropriate socket option. Socket option which _is_ set, that's how we get our TX timestamps, not sure why you see the need to mention this. > Nevertheless, the poll(2) API is quite generic, > and there is no guarantee that the kernel networking stack might push an > error event one day. This patch adds defensive code in order to catch any > unexpected error condition. > I am reading this as: "let's be defensive even in the case where the kernel decides to go nuts and push us a packet on the error queue even if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because we need TX timestamps. So we need to be correct, and play by the API's rules, which means treat the POLLERR returned event. It is a correctness issue, not a defense issue. > Suggested-by: Vladimir Oltean <ol...@gm...> > Signed-off-by: Richard Cochran <ric...@gm...> > > Thoughts? -Vladimir |
From: Richard C. <ric...@gm...> - 2020-06-24 16:58:35
|
On Wed, Jun 24, 2020 at 07:36:47PM +0300, Vladimir Oltean wrote: > I am reading this as: "let's be defensive even in the case where the > kernel decides to go nuts and push us a packet on the error queue even > if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's > going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because > we need TX timestamps. But, at this point in the program, we know that no tx time stamp is outstanding. We always send one Tx event, then immediately fetch the time stamp. This is carefully synchronized by the program. It is important to do this because many drivers support exactly one Tx time stamp at a time. The kernel must not fabricate transmit time stamps! That would be breaking the contract. > So we need to be correct, and play by the API's > rules, which means treat the POLLERR returned event. It is a > correctness issue, not a defense issue. I think you are splitting hairs here, but I disagree that the program was incorrect. There is no reason _today_ for poll to return a POLLERR event from this call, but, in general, I don't believe this is guaranteed by anything. Would you prefer me leaving your name off the commit message? Thanks, Richard |
From: Richard C. <ric...@gm...> - 2020-06-19 11:58:56
|
On Fri, Jun 19, 2020 at 02:51:56PM +0300, Vladimir Oltean wrote: > Could you please give a concrete example of why it isn't desirable for > MSG_DONTWAIT to be specified on a socket where data availability has > been previously confirmed through poll()? I'm not sure what the patch > has to do with the industry having "plenty enough guess work and > sloppy copy and paste as it is". This is how the sockets API works. Either you use poll or DONTWAIT, not both. Setting both shows that the author does not understand the API, and that is not a signal that I am willing to send out into the world. So this isn't desirable simply as a matter of style, but style does matter. I appreciate that you (and I) spent long hours searching for a subtle kernel bug, and that was very frustrating. But it was a kernel bug, and a singular one at that. Production code does not and should not need special instrumentation to catch rare kernel bugs. Thanks, Richard |
From: Vladimir O. <ol...@gm...> - 2020-06-19 12:13:22
|
On Fri, 19 Jun 2020 at 14:58, Richard Cochran <ric...@gm...> wrote: > > On Fri, Jun 19, 2020 at 02:51:56PM +0300, Vladimir Oltean wrote: > > Could you please give a concrete example of why it isn't desirable for > > MSG_DONTWAIT to be specified on a socket where data availability has > > been previously confirmed through poll()? I'm not sure what the patch > > has to do with the industry having "plenty enough guess work and > > sloppy copy and paste as it is". > > This is how the sockets API works. Either you use poll or DONTWAIT, > not both. Setting both shows that the author does not understand the > API, and that is not a signal that I am willing to send out into the > world. > > So this isn't desirable simply as a matter of style, but style does > matter. > Let's keep it technical. Being rather inexperienced compared to others who've contributed to the project and to Linux networking in general, I do respect style, but I would also like to learn. The "man recv" page says: MSG_DONTWAIT (since Linux 2.2) Enables nonblocking operation; if the operation would block, the call fails with the error EAGAIN or EWOULDBLOCK. And this is exactly what I want: a fail-fast system. So I need a technical explanation about what's wrong with the patch. Sure this is a production-quality application stack, but before it goes into production it goes into tons and tons of testing first, testing which is done by people, so if anything, I would see it as an improvement that ptp4l detects runtime errors and doesn't just try to remain oblivious to them. And a bug is a bug, doesn't matter if it's in the kernel or in the application stack as far as I'm concerned. Whoever could detect it should report it. The kernel does its part and the application stack should do its part too. > I appreciate that you (and I) spent long hours searching for a subtle > kernel bug, and that was very frustrating. But it was a kernel bug, > and a singular one at that. Production code does not and should not > need special instrumentation to catch rare kernel bugs. > > Thanks, > Richard See, this is not instrumentation to catch rare kernel bugs, you're reading the patches in totally the wrong key. This is simply extra sanity checking. Perhaps the most frustrating thing to me, after debugging that system issue, is that ptp4l had all means necessary to detect a system issue, but it preferred to remain oblivious to it. Thanks, -Vladimir |
From: Richard C. <ric...@gm...> - 2020-06-19 12:14:00
|
On Fri, Jun 19, 2020 at 02:55:17PM +0300, Vladimir Oltean wrote: > Sure. In fact, printing the data on the error queue is totally > optional and I could just as well remove it, no problem with that. But > reading it isn't. I don't think you've addressed that point. Yes, here, for correctness sake, the program should be robust to POLLERR being set unexpectedly. I don't know of any situation that would set POLLERR on the kind of sockets in use here, but even so, this is a generic flag that might appear one day. But the patch does not handle the situation correctly. This code would dump the frame and carry on as if nothing had happened. The correct way would be to treat this as a fault on the port in question. Thanks, Richard |
From: Richard C. <ric...@gm...> - 2020-06-19 12:16:05
|
On Fri, Jun 19, 2020 at 03:12:57PM +0300, Vladimir Oltean wrote: > Perhaps the most frustrating thing to me, after debugging that system > issue, is that ptp4l had all means necessary to detect a system issue, > but it preferred to remain oblivious to it. This is a production PTP stack and not an operating system validation tool. Thanks, Richard |
From: Vladimir O. <ol...@gm...> - 2020-06-19 12:27:07
|
On Fri, 19 Jun 2020 at 15:15, Richard Cochran <ric...@gm...> wrote: > > On Fri, Jun 19, 2020 at 03:12:57PM +0300, Vladimir Oltean wrote: > > Perhaps the most frustrating thing to me, after debugging that system > > issue, is that ptp4l had all means necessary to detect a system issue, > > but it preferred to remain oblivious to it. > > This is a production PTP stack and not an operating system validation > tool. > > Thanks, > Richard So that's how you want to play it. Why does it contain this code then? if (flags == MSG_ERRQUEUE) { struct pollfd pfd = { fd, sk_events, 0 }; res = poll(&pfd, 1, sk_tx_timeout); if (res < 1) { pr_err(res ? "poll for tx timestamp failed: %m" : "timed out while polling for tx timestamp"); pr_err("increasing tx_timestamp_timeout may correct " "this issue, but it is likely caused by a driver bug"); return -errno; What business has ptp4l to say that TX timestamps not being delivered within 1 ms to the application are indicative of a driver bug? You and I know full well that there's a lot of hardware which simply can't deliver timestamps that fast, and that there's nothing wrong with it. No need to shame here, really. Or this check, right below: } else if (!(pfd.revents & sk_revents)) { pr_err("poll for tx timestamp woke up on non ERR event"); return -1; } } I mean, shouldn't happen, and if it does, it's not ptp4l's fault, right? And what is with this hack? check_fup_sync Because of packet reordering that can occur in the network, in the hardware, or in the networking stack, a follow up message can appear to arrive in the application before the matching sync message. As this is a normal occurrence, and the sequenceID message field ensures proper matching, the ptp4l program accepts out of order packets. This option adds an additional check using the software time stamps from the networking stack to verify that the sync message did arrive first. This option is only useful if you do not trust the sequence IDs generated by the master. The default is 0 (disabled). Why would the network stack need additional validation via software timestamps? I mean, it's not ptp4l's fault that timestamps were not delivered in the right order. I thought the linuxptp suite is trying to be helpful instead of pedantic, I'm starting to change my opinion. Thanks, -Vladimir |
From: Richard C. <ric...@gm...> - 2020-06-19 12:37:02
|
On Fri, Jun 19, 2020 at 03:26:48PM +0300, Vladimir Oltean wrote: > if (flags == MSG_ERRQUEUE) { > struct pollfd pfd = { fd, sk_events, 0 }; > res = poll(&pfd, 1, sk_tx_timeout); > if (res < 1) { > pr_err(res ? "poll for tx timestamp failed: %m" : > "timed out while polling for tx timestamp"); > pr_err("increasing tx_timestamp_timeout may correct " > "this issue, but it is likely caused by a driver bug"); > return -errno; > > What business has ptp4l to say that TX timestamps not being delivered > within 1 ms to the application are indicative of a driver bug? You and > I know full well that there's a lot of hardware which simply can't > deliver timestamps that fast, and that there's nothing wrong with it. Correct, and the timeout is configurable beyond the 1 ms default. > No need to shame here, really. This is printing a helpful message to let the user set an appropriate timeout value for their system. > Why would the network stack need additional validation via software > timestamps? I mean, it's not ptp4l's fault that timestamps were not > delivered in the right order. I thought the linuxptp suite is trying > to be helpful instead of pedantic, I'm starting to change my opinion. You are certainly entitled to your opinions! Thanks, Richard |
From: Richard C. <ric...@gm...> - 2020-06-19 12:51:53
|
On Fri, Jun 19, 2020 at 05:13:44AM -0700, Richard Cochran wrote: > But the patch does not handle the situation correctly. This code > would dump the frame and carry on as if nothing had happened. The > correct way would be to treat this as a fault on the port in question. Something like this... diff --git a/clock.c b/clock.c index f43cc2a..b081dc2 100644 --- a/clock.c +++ b/clock.c @@ -1559,6 +1559,13 @@ int clock_poll(struct clock *c) LIST_FOREACH(p, &c->ports, list) { /* Let the ports handle their events. */ for (i = 0; i < N_POLLFD; i++) { + if (cur[i].revents & POLLERR) { + pr_err("port %d: unexpected socket error", + port_number(p)); + event = EV_FAULT_DETECTED; + port_dispatch(p, event, 0); + continue; + } if (cur[i].revents & (POLLIN|POLLPRI)) { event = port_event(p, i); if (EV_STATE_DECISION_EVENT == event) { |
From: Vladimir O. <ol...@gm...> - 2020-06-24 17:21:42
|
On Wed, 24 Jun 2020 at 19:58, Richard Cochran <ric...@gm...> wrote: > > On Wed, Jun 24, 2020 at 07:36:47PM +0300, Vladimir Oltean wrote: > > I am reading this as: "let's be defensive even in the case where the > > kernel decides to go nuts and push us a packet on the error queue even > > if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's > > going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because > > we need TX timestamps. > > But, at this point in the program, we know that no tx time stamp is > outstanding. We always send one Tx event, then immediately fetch the > time stamp. This is carefully synchronized by the program. It is > important to do this because many drivers support exactly one Tx time > stamp at a time. > > The kernel must not fabricate transmit time stamps! That would be > breaking the contract. > Luckily the author of that socket option, and 1 of the 2 people on CC on that patch, are present in this email thread: https://patchwork.ozlabs.org/project/netdev/patch/201...@je.../ So, I think you or Jacob would be in a good position to answer: What contract, who said that this control channel is _only_ for TX timestamps, and for how many TX timestamps it is? If a future kernel decided to send more data to programs who opted in to the error queue, and that kernel decision were to break ptp4l, could you honestly say it's the kernel's fault? I mean, those packets are already associated with a cmsg, you're supposed to only filter the one you're interested in, and ignore everything else. But ptp4l is clearly not doing that. > > So we need to be correct, and play by the API's > > rules, which means treat the POLLERR returned event. It is a > > correctness issue, not a defense issue. > > I think you are splitting hairs here, but I disagree that the program > was incorrect. There is no reason _today_ for poll to return a > POLLERR event from this call, but, in general, I don't believe this is > guaranteed by anything. > I am splitting the hairs I have left, really. Life would have been _so_much_simpler_ if ptp4l just had this POLLERR check. Who am I to guess what's going on if ptp4l gets a POLLERR event and treats it like a POLLIN? > Would you prefer me leaving your name off the commit message? > If we can't reconcile the fact that this is fixing an API misuse, then you can take my name off of it, sure. I care more about seeing the code modification go in, than I do about seeing my name on it. > Thanks, > Richard > > > Thanks, -Vladimir |
From: Keller, J. E <jac...@in...> - 2020-06-24 17:53:03
|
> -----Original Message----- > From: Richard Cochran <ric...@gm...> > Sent: Wednesday, June 24, 2020 9:58 AM > To: Vladimir Oltean <ol...@gm...> > Cc: lin...@li...; Keller, Jacob E > <jac...@in...> > Subject: Re: [PATCH v2 2/3] clock: dump unexpected packets received on the > error queues of sockets > > On Wed, Jun 24, 2020 at 07:36:47PM +0300, Vladimir Oltean wrote: > > I am reading this as: "let's be defensive even in the case where the > > kernel decides to go nuts and push us a packet on the error queue even > > if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's > > going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because > > we need TX timestamps. > > But, at this point in the program, we know that no tx time stamp is > outstanding. We always send one Tx event, then immediately fetch the > time stamp. This is carefully synchronized by the program. It is > important to do this because many drivers support exactly one Tx time > stamp at a time. > > The kernel must not fabricate transmit time stamps! That would be > breaking the contract. Sure, but the POLLERR could (in theory, not in current practice) return other events/messages besides timestamps. It was invented/created prior to the Tx timestamps, wasn't it? > > > So we need to be correct, and play by the API's > > rules, which means treat the POLLERR returned event. It is a > > correctness issue, not a defense issue. > > I think you are splitting hairs here, but I disagree that the program > was incorrect. There is no reason _today_ for poll to return a > POLLERR event from this call, but, in general, I don't believe this is > guaranteed by anything. > Right. Today there's no other messages that I am aware of. > Would you prefer me leaving your name off the commit message? > > Thanks, > Richard > > |
From: Richard C. <ric...@gm...> - 2020-06-24 17:58:09
|
On Wed, Jun 24, 2020 at 05:52:39PM +0000, Keller, Jacob E wrote: > Sure, but the POLLERR could (in theory, not in current practice) return other events/messages besides timestamps. > > It was invented/created prior to the Tx timestamps, wasn't it? Yes, and this is the point! Thanks, Richard |
From: Richard C. <ric...@gm...> - 2020-06-24 17:56:08
|
On Wed, Jun 24, 2020 at 08:21:22PM +0300, Vladimir Oltean wrote: > What contract, who said that this control channel is _only_ for TX > timestamps, and for how many TX timestamps it is? Um, there can't be more time stamps than transmitted frames. > If a future kernel decided to send more data to programs who opted in > to the error queue, and that kernel decision were to break ptp4l, > could you honestly say it's the kernel's fault? You are missing the point entirely. This is about poll(2) and not about the socket error queue. POLLERR might possibly one day acquire some kind of push semantics (i.e. unwanted by user space) on the kind of DGRAM sockets we use. In addition, if ever a pipe-like transport appears (and I think it not unlikely), then POLLERR will definitely appear, and this patch prepares for that day. If a given driver is spewing out extra time stamps, then it is most definitely the kernel's fault! Thanks, Richard |
From: Vladimir O. <ol...@gm...> - 2020-06-19 14:58:36
|
On Fri, 19 Jun 2020 at 15:51, Richard Cochran <ric...@gm...> wrote: > > On Fri, Jun 19, 2020 at 05:13:44AM -0700, Richard Cochran wrote: > > But the patch does not handle the situation correctly. This code > > would dump the frame and carry on as if nothing had happened. The > > correct way would be to treat this as a fault on the port in question. > > Something like this... > > diff --git a/clock.c b/clock.c > index f43cc2a..b081dc2 100644 > --- a/clock.c > +++ b/clock.c > @@ -1559,6 +1559,13 @@ int clock_poll(struct clock *c) > LIST_FOREACH(p, &c->ports, list) { > /* Let the ports handle their events. */ > for (i = 0; i < N_POLLFD; i++) { > + if (cur[i].revents & POLLERR) { > + pr_err("port %d: unexpected socket error", > + port_number(p)); > + event = EV_FAULT_DETECTED; > + port_dispatch(p, event, 0); You need to recv() the data. Otherwise, it will remain forever there, ptp4l will be toast, since this will keep dispatching the fault ad infinitum. > + continue; > + } > if (cur[i].revents & (POLLIN|POLLPRI)) { > event = port_event(p, i); > if (EV_STATE_DECISION_EVENT == event) { Thanks, -Vladimir |
From: Richard C. <ric...@gm...> - 2020-06-19 15:47:36
|
On Fri, Jun 19, 2020 at 05:58:17PM +0300, Vladimir Oltean wrote: > You need to recv() the data. Otherwise, it will remain forever there, > ptp4l will be toast, since this will keep dispatching the fault ad > infinitum. The fault handling path closes all sockets and opens new ones. Thanks, Richard |
From: Vladimir O. <ol...@gm...> - 2020-06-24 18:01:51
|
On Wed, 24 Jun 2020 at 20:56, Richard Cochran <ric...@gm...> wrote: > > On Wed, Jun 24, 2020 at 08:21:22PM +0300, Vladimir Oltean wrote: > > > What contract, who said that this control channel is _only_ for TX > > timestamps, and for how many TX timestamps it is? > > Um, there can't be more time stamps than transmitted frames. > > > If a future kernel decided to send more data to programs who opted in > > to the error queue, and that kernel decision were to break ptp4l, > > could you honestly say it's the kernel's fault? > > You are missing the point entirely. This is about poll(2) and not > about the socket error queue. POLLERR might possibly one day acquire > some kind of push semantics (i.e. unwanted by user space) on the kind > of DGRAM sockets we use. In addition, if ever a pipe-like transport > appears (and I think it not unlikely), then POLLERR will definitely > appear, and this patch prepares for that day. > > If a given driver is spewing out extra time stamps, then it is most > definitely the kernel's fault! > > Thanks, > Richard > > I think it boils down to something very simple. The kernel is waking up a user process with revents = POLLIN | POLLERR. That has a certain meaning. Then the process says "Oh, yay, I have revents = POLLIN. Let me process that!". That has an entirely different meaning. Timestamps are just excuses here. Thanks, -Vladimir |