Thread: Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of (Page 2)

PTP IEEE 1588 stack for Linux

Brought to you by: rcochran

linuxptp-devel

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Vladimir O. <ol...@gm...> - 2020-06-18 20:18:17

On Thu, 18 Jun 2020 at 23:02, Jacob Keller <jac...@in...> wrote:
>
>
>
> On 6/18/2020 12:56 PM, Vladimir Oltean wrote:
> >> Could this use the pr facility, so that it honors the printing options
> >> to log to syslog and manages the level?
> >>
> >> I'm not sure what I would mark these as: debug because they no longer
> >> impact the syncing process, or true error because they are unexpected?
> >>
> >> Thanks,
> >> Jake
> >>
> >
> > Thanks a lot for looking at this.
> > Good point about redirecting to syslog, I did not think about that.
> > The reason why I did not use pr_err was due to the automatic line
> > ending, which would have unnecessarily complicated this function by
> > having me print each line of the packet to a temporary buffer first.
> > But now that I see there's no way around it, I guess I'll have to do
> > that.
> What about implementing this as part of pr.c?
>
> Thanks,
> Jake

Not really sure what you mean. What would the prototype of the
function be (i.e. what would it do)?

-Vladimir

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Jacob K. <jac...@in...> - 2020-06-18 20:43:34


On 6/18/2020 1:17 PM, Vladimir Oltean wrote:
> On Thu, 18 Jun 2020 at 23:02, Jacob Keller <jac...@in...> wrote:
>>
>>
>>
>> On 6/18/2020 12:56 PM, Vladimir Oltean wrote:
>>>> Could this use the pr facility, so that it honors the printing options
>>>> to log to syslog and manages the level?
>>>>
>>>> I'm not sure what I would mark these as: debug because they no longer
>>>> impact the syncing process, or true error because they are unexpected?
>>>>
>>>> Thanks,
>>>> Jake
>>>>
>>>
>>> Thanks a lot for looking at this.
>>> Good point about redirecting to syslog, I did not think about that.
>>> The reason why I did not use pr_err was due to the automatic line
>>> ending, which would have unnecessarily complicated this function by
>>> having me print each line of the packet to a temporary buffer first.
>>> But now that I see there's no way around it, I guess I'll have to do
>>> that.
>> What about implementing this as part of pr.c?
>>
>> Thanks,
>> Jake
> 
> Not really sure what you mean. What would the prototype of the
> function be (i.e. what would it do)?
> 
> -Vladimir
> 

I was thinking like something that would do a hexdump to a given print
level.

Perhaps that's overkill since not many places would need it...

Thanks,
Jake

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Richard C. <ric...@gm...> - 2020-06-19 11:44:07

On Thu, Jun 18, 2020 at 12:42:55PM -0700, Jacob Keller wrote:
> > So it is a bug if poll() wakes up the process for data availability on a
> > socket's receive queue, and then recvmsg(), called immediately
> > afterwards, goes to sleep trying to retrieve it.

Yes.

> Makes sense. Using MSG_DONTWAIT seems reasonable since we already know
> there should be data available.

I disagree on this point.  There are (at least) three ways to avoid
blocking on receive.

1. poll() first.  The guarantees data are available, and for SDGAM
   sockets those data comprise at least on message.

2. use the MSG_DONTWAIT flag

3. set the SO_RCVTIMEO socket option

This patch would mix methods 1 and 2.  I understand that it still
works with the DONTWAIT, but it sets a bad example.  I want this
program to be an example of correct programming patterns.  Our
industry has plenty enough guess work and sloppy copy and paste as it
is.

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Vladimir O. <ol...@gm...> - 2020-06-19 13:02:16

On Fri, 19 Jun 2020 at 15:47, Richard Cochran <ric...@gm...> wrote:
>
> On Fri, Jun 19, 2020 at 03:41:09PM +0300, Vladimir Oltean wrote:
> > Ok, and this patch does not have a place in linuxptp because?
>
> It is a matter of style, as I already said.
>
> Thanks,

Again, I would have accepted any technical argument you could bring.
Like, for example, the fact that this sleep on recv() is becoming
impossible after patch "[PATCH v2 2/3] clock: dump unexpected packets
received on the error queues of sockets" (which _is_ fixing a ptp4l
application bug), even without MSG_DONTWAIT in place. But you prefer
to bring the style argument.
So, as a matter of personal style, I think the only right thing for me
to do is to retreat now. Feel free to use any portion of the patches I
submitted so far in any combination you see desirable.

Thanks,
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Richard C. <ric...@gm...> - 2020-06-19 14:22:05

On Fri, Jun 19, 2020 at 04:01:57PM +0300, Vladimir Oltean wrote:
> Again, I would have accepted any technical argument you could bring.

Thinking more about this, there is in fact a technical reason to leave
the code as is.  For DGRAM sockets, the behavior of recvmsg() in
sk_receive() is the same with and without DONTWAIT.  But this is only
by chance.

Consider a STREAM socket with fixed message sizes.  In that case, poll
indicates only that the socket is readable, and not that an entire
message is available.  When recvmsg(flags=DONTWAIT) returns, the
length may be less than a full message, and the code in sk_receive()
would have to read the socket again (and again) until a whole message
arrives.  If sk_receive(flags=DONTWAIT) is allowed, then that method
would have to cover the case of reading a partial message.

But sk_receive() does not implement that programming pattern. Instead
the caller knows a message is available, and then sk_receive()
"blocks" in recvmsg() for that one message.  That is the design
pattern.

> So, as a matter of personal style, I think the only right thing for me
> to do is to retreat now. Feel free to use any portion of the patches I
> submitted so far in any combination you see desirable.

I will likely apply patch 3.  For patch 2, could you please comment on
the diff I posted as an alternate solution?

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Vladimir O. <ol...@gm...> - 2020-06-19 14:50:10

On Fri, 19 Jun 2020 at 17:21, Richard Cochran <ric...@gm...> wrote:
>
> On Fri, Jun 19, 2020 at 04:01:57PM +0300, Vladimir Oltean wrote:
> > Again, I would have accepted any technical argument you could bring.
>
> Thinking more about this, there is in fact a technical reason to leave
> the code as is.  For DGRAM sockets, the behavior of recvmsg() in
> sk_receive() is the same with and without DONTWAIT.  But this is only
> by chance.
>
> Consider a STREAM socket with fixed message sizes.  In that case, poll
> indicates only that the socket is readable, and not that an entire
> message is available.  When recvmsg(flags=DONTWAIT) returns, the
> length may be less than a full message, and the code in sk_receive()
> would have to read the socket again (and again) until a whole message
> arrives.  If sk_receive(flags=DONTWAIT) is allowed, then that method
> would have to cover the case of reading a partial message.
>

A technical argument which implies that IEEE 1588 synchronization may
work over connection-oriented sockets is a stylistic argument, really.
I don't buy it.

> But sk_receive() does not implement that programming pattern. Instead
> the caller knows a message is available, and then sk_receive()
> "blocks" in recvmsg() for that one message.  That is the design
> pattern.
>

No, sk_receive() does not "block" in recvmsg() for that one message.
Any kernel tracing program (I recommend kernelshark, it has a nice
GUI) will tell you that.

> > So, as a matter of personal style, I think the only right thing for me
> > to do is to retreat now. Feel free to use any portion of the patches I
> > submitted so far in any combination you see desirable.
>
> I will likely apply patch 3.  For patch 2, could you please comment on
> the diff I posted as an alternate solution?
>

Yes, I will provide my feedback on that.

> Thanks,
> Richard

Thanks,
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Vladimir O. <ol...@gm...> - 2020-06-19 11:52:15

Hi Richard,

On Fri, 19 Jun 2020 at 14:44, Richard Cochran <ric...@gm...> wrote:
>
> On Thu, Jun 18, 2020 at 12:42:55PM -0700, Jacob Keller wrote:
> > > So it is a bug if poll() wakes up the process for data availability on a
> > > socket's receive queue, and then recvmsg(), called immediately
> > > afterwards, goes to sleep trying to retrieve it.
>
> Yes.
>
> > Makes sense. Using MSG_DONTWAIT seems reasonable since we already know
> > there should be data available.
>
> I disagree on this point.  There are (at least) three ways to avoid
> blocking on receive.
>
> 1. poll() first.  The guarantees data are available, and for SDGAM
>    sockets those data comprise at least on message.
>
> 2. use the MSG_DONTWAIT flag
>
> 3. set the SO_RCVTIMEO socket option
>
> This patch would mix methods 1 and 2.  I understand that it still
> works with the DONTWAIT, but it sets a bad example.  I want this
> program to be an example of correct programming patterns.  Our
> industry has plenty enough guess work and sloppy copy and paste as it
> is.
>
> Thanks,
> Richard

Could you please give a concrete example of why it isn't desirable for
MSG_DONTWAIT to be specified on a socket where data availability has
been previously confirmed through poll()? I'm not sure what the patch
has to do with the industry having "plenty enough guess work and
sloppy copy and paste as it is".

Thanks,
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Richard C. <ric...@gm...> - 2020-06-19 11:52:21

On Mon, Jun 15, 2020 at 06:23:20PM +0300, Vladimir Oltean wrote:
> Printing them to the user is optional (and helpful), but reading them is
> not. With this patch, even with extraneous data delivered by a buggy
> kernel (which the application now loudly complains about), the
> synchronization keeps chugging along. Otherwise the application starts
> reordering packets in recvmsg() due to misinterpreting which socket
> queue has data available.

Again, I see this stack as production code and not a debugging tool.
Hexdumping frames is more for wireshark or tools like

   linux/tools/testing/selftests/net/[rx\|tx]timestamp

in the kernel source.  In fact, I think that tool already dumps
frames.

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Vladimir O. <ol...@gm...> - 2020-06-19 11:55:35

On Fri, 19 Jun 2020 at 14:52, Richard Cochran <ric...@gm...> wrote:
>
> On Mon, Jun 15, 2020 at 06:23:20PM +0300, Vladimir Oltean wrote:
> > Printing them to the user is optional (and helpful), but reading them is
> > not. With this patch, even with extraneous data delivered by a buggy
> > kernel (which the application now loudly complains about), the
> > synchronization keeps chugging along. Otherwise the application starts
> > reordering packets in recvmsg() due to misinterpreting which socket
> > queue has data available.
>
> Again, I see this stack as production code and not a debugging tool.
> Hexdumping frames is more for wireshark or tools like
>
>    linux/tools/testing/selftests/net/[rx\|tx]timestamp
>
> in the kernel source.  In fact, I think that tool already dumps
> frames.
>
> Thanks,
> Richard

Sure. In fact, printing the data on the error queue is totally
optional and I could just as well remove it, no problem with that. But
reading it isn't. I don't think you've addressed that point.

-Vladimir

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Vladimir O. <ol...@gm...> - 2020-06-24 16:37:06

Hi Richard,

On Wed, 24 Jun 2020 at 19:27, Richard Cochran <ric...@gm...> wrote:
>
> On Fri, Jun 19, 2020 at 10:45:23PM +0300, Vladimir Oltean wrote:
> > Yes, much better.
>
> May I add your tag to the commit message, like this?

I think the commit description is slightly lossy and a bit misleading.

>
>     Catch unexpected socket polling errors.

This part is ok.

>
>     The poll(2) system call may set POLLERR in the returned events.  Normally
>     no errors are returned unless specifically requested by setting an
>     appropriate socket option.

Socket option which _is_ set, that's how we get our TX timestamps, not
sure why you see the need to mention this.

>     Nevertheless, the poll(2) API is quite generic,
>     and there is no guarantee that the kernel networking stack might push an
>     error event one day.  This patch adds defensive code in order to catch any
>     unexpected error condition.
>

I am reading this as: "let's be defensive even in the case where the
kernel decides to go nuts and push us a packet on the error queue even
if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's
going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because
we need TX timestamps. So we need to be correct, and play by the API's
rules, which means treat the POLLERR returned event. It is a
correctness issue, not a defense issue.

>     Suggested-by: Vladimir Oltean <ol...@gm...>
>     Signed-off-by: Richard Cochran <ric...@gm...>
>
>

Thoughts?
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Richard C. <ric...@gm...> - 2020-06-24 16:58:35

On Wed, Jun 24, 2020 at 07:36:47PM +0300, Vladimir Oltean wrote:
> I am reading this as: "let's be defensive even in the case where the
> kernel decides to go nuts and push us a packet on the error queue even
> if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's
> going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because
> we need TX timestamps.

But, at this point in the program, we know that no tx time stamp is
outstanding.  We always send one Tx event, then immediately fetch the
time stamp.  This is carefully synchronized by the program.  It is
important to do this because many drivers support exactly one Tx time
stamp at a time.

The kernel must not fabricate transmit time stamps!  That would be
breaking the contract.

> So we need to be correct, and play by the API's
> rules, which means treat the POLLERR returned event. It is a
> correctness issue, not a defense issue.

I think you are splitting hairs here, but I disagree that the program
was incorrect.  There is no reason _today_ for poll to return a
POLLERR event from this call, but, in general, I don't believe this is
guaranteed by anything.

Would you prefer me leaving your name off the commit message?

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Richard C. <ric...@gm...> - 2020-06-19 11:58:56

On Fri, Jun 19, 2020 at 02:51:56PM +0300, Vladimir Oltean wrote:
> Could you please give a concrete example of why it isn't desirable for
> MSG_DONTWAIT to be specified on a socket where data availability has
> been previously confirmed through poll()? I'm not sure what the patch
> has to do with the industry having "plenty enough guess work and
> sloppy copy and paste as it is".

This is how the sockets API works.  Either you use poll or DONTWAIT,
not both.  Setting both shows that the author does not understand the
API, and that is not a signal that I am willing to send out into the
world.

So this isn't desirable simply as a matter of style, but style does
matter.

I appreciate that you (and I) spent long hours searching for a subtle
kernel bug, and that was very frustrating.  But it was a kernel bug,
and a singular one at that.  Production code does not and should not
need special instrumentation to catch rare kernel bugs.

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Vladimir O. <ol...@gm...> - 2020-06-19 12:13:22

On Fri, 19 Jun 2020 at 14:58, Richard Cochran <ric...@gm...> wrote:
>
> On Fri, Jun 19, 2020 at 02:51:56PM +0300, Vladimir Oltean wrote:
> > Could you please give a concrete example of why it isn't desirable for
> > MSG_DONTWAIT to be specified on a socket where data availability has
> > been previously confirmed through poll()? I'm not sure what the patch
> > has to do with the industry having "plenty enough guess work and
> > sloppy copy and paste as it is".
>
> This is how the sockets API works.  Either you use poll or DONTWAIT,
> not both.  Setting both shows that the author does not understand the
> API, and that is not a signal that I am willing to send out into the
> world.
>
> So this isn't desirable simply as a matter of style, but style does
> matter.
>

Let's keep it technical. Being rather inexperienced compared to others
who've contributed to the project and to Linux networking in general,
I do respect style, but I would also like to learn. The "man recv"
page says:

       MSG_DONTWAIT (since Linux 2.2)
              Enables nonblocking operation; if the operation would
block, the call fails with the error EAGAIN or EWOULDBLOCK.

And this is exactly what I want: a fail-fast system. So I need a
technical explanation about what's wrong with the patch. Sure this is
a production-quality application stack, but before it goes into
production it goes into tons and tons of testing first, testing which
is done by people, so if anything, I would see it as an improvement
that ptp4l detects runtime errors and doesn't just try to remain
oblivious to them. And a bug is a bug, doesn't matter if it's in the
kernel or in the application stack as far as I'm concerned. Whoever
could detect it should report it. The kernel does its part and the
application stack should do its part too.

> I appreciate that you (and I) spent long hours searching for a subtle
> kernel bug, and that was very frustrating.  But it was a kernel bug,
> and a singular one at that.  Production code does not and should not
> need special instrumentation to catch rare kernel bugs.
>
> Thanks,
> Richard

See, this is not instrumentation to catch rare kernel bugs, you're
reading the patches in totally the wrong key. This is simply extra
sanity checking.
Perhaps the most frustrating thing to me, after debugging that system
issue, is that ptp4l had all means necessary to detect a system issue,
but it preferred to remain oblivious to it.

Thanks,
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Richard C. <ric...@gm...> - 2020-06-19 12:14:00

On Fri, Jun 19, 2020 at 02:55:17PM +0300, Vladimir Oltean wrote:
> Sure. In fact, printing the data on the error queue is totally
> optional and I could just as well remove it, no problem with that. But
> reading it isn't. I don't think you've addressed that point.

Yes, here, for correctness sake, the program should be robust to
POLLERR being set unexpectedly.  I don't know of any situation that
would set POLLERR on the kind of sockets in use here, but even so,
this is a generic flag that might appear one day.

But the patch does not handle the situation correctly.  This code
would dump the frame and carry on as if nothing had happened.  The
correct way would be to treat this as a fault on the port in question.

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Richard C. <ric...@gm...> - 2020-06-19 12:16:05

On Fri, Jun 19, 2020 at 03:12:57PM +0300, Vladimir Oltean wrote:
> Perhaps the most frustrating thing to me, after debugging that system
> issue, is that ptp4l had all means necessary to detect a system issue,
> but it preferred to remain oblivious to it.

This is a production PTP stack and not an operating system validation
tool.

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Vladimir O. <ol...@gm...> - 2020-06-19 12:27:07

On Fri, 19 Jun 2020 at 15:15, Richard Cochran <ric...@gm...> wrote:
>
> On Fri, Jun 19, 2020 at 03:12:57PM +0300, Vladimir Oltean wrote:
> > Perhaps the most frustrating thing to me, after debugging that system
> > issue, is that ptp4l had all means necessary to detect a system issue,
> > but it preferred to remain oblivious to it.
>
> This is a production PTP stack and not an operating system validation
> tool.
>
> Thanks,
> Richard

So that's how you want to play it.
Why does it contain this code then?

    if (flags == MSG_ERRQUEUE) {
        struct pollfd pfd = { fd, sk_events, 0 };
        res = poll(&pfd, 1, sk_tx_timeout);
        if (res < 1) {
            pr_err(res ? "poll for tx timestamp failed: %m" :
                         "timed out while polling for tx timestamp");
            pr_err("increasing tx_timestamp_timeout may correct "
                   "this issue, but it is likely caused by a driver bug");
            return -errno;

What business has ptp4l to say that TX timestamps not being delivered
within 1 ms to the application are indicative of a driver bug? You and
I know full well that there's a lot of hardware which simply can't
deliver timestamps that fast, and that there's nothing wrong with it.
No need to shame here, really.

Or this check, right below:

        } else if (!(pfd.revents & sk_revents)) {
            pr_err("poll for tx timestamp woke up on non ERR event");
            return -1;
        }
    }

I mean, shouldn't happen, and if it does, it's not ptp4l's fault, right?

And what is with this hack?

check_fup_sync
              Because  of packet reordering that can occur in the network, in
              the hardware, or in the networking stack, a follow up
message can appear to
              arrive in the application before the matching sync message. As
              this is a normal occurrence, and the sequenceID message field
              ensures proper matching, the ptp4l program accepts out of order
              packets. This option  adds  an  additional check  using  the
              software time stamps from the networking stack to verify that the
              sync message did arrive first. This option is only useful if you
              do not trust the sequence IDs generated by the master.
               The default is 0 (disabled).

Why would the network stack need additional validation via software
timestamps? I mean, it's not ptp4l's fault that timestamps were not
delivered in the right order. I thought the linuxptp suite is trying
to be helpful instead of pedantic, I'm starting to change my opinion.

Thanks,
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 1/3] ptp4l: call recvmsg() with the MSG_DONTWAIT flag

From: Richard C. <ric...@gm...> - 2020-06-19 12:37:02

On Fri, Jun 19, 2020 at 03:26:48PM +0300, Vladimir Oltean wrote:
>     if (flags == MSG_ERRQUEUE) {
>         struct pollfd pfd = { fd, sk_events, 0 };
>         res = poll(&pfd, 1, sk_tx_timeout);
>         if (res < 1) {
>             pr_err(res ? "poll for tx timestamp failed: %m" :
>                          "timed out while polling for tx timestamp");
>             pr_err("increasing tx_timestamp_timeout may correct "
>                    "this issue, but it is likely caused by a driver bug");
>             return -errno;
> 
> What business has ptp4l to say that TX timestamps not being delivered
> within 1 ms to the application are indicative of a driver bug? You and
> I know full well that there's a lot of hardware which simply can't
> deliver timestamps that fast, and that there's nothing wrong with it.

Correct, and the timeout is configurable beyond the 1 ms default.

> No need to shame here, really.

This is printing a helpful message to let the user set an appropriate
timeout value for their system.

> Why would the network stack need additional validation via software
> timestamps? I mean, it's not ptp4l's fault that timestamps were not
> delivered in the right order. I thought the linuxptp suite is trying
> to be helpful instead of pedantic, I'm starting to change my opinion.

You are certainly entitled to your opinions!

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Richard C. <ric...@gm...> - 2020-06-19 12:51:53

On Fri, Jun 19, 2020 at 05:13:44AM -0700, Richard Cochran wrote:
> But the patch does not handle the situation correctly.  This code
> would dump the frame and carry on as if nothing had happened.  The
> correct way would be to treat this as a fault on the port in question.

Something like this...

diff --git a/clock.c b/clock.c
index f43cc2a..b081dc2 100644
--- a/clock.c
+++ b/clock.c
@@ -1559,6 +1559,13 @@ int clock_poll(struct clock *c)
 	LIST_FOREACH(p, &c->ports, list) {
 		/* Let the ports handle their events. */
 		for (i = 0; i < N_POLLFD; i++) {
+			if (cur[i].revents & POLLERR) {
+				pr_err("port %d: unexpected socket error",
+				       port_number(p));
+				event = EV_FAULT_DETECTED;
+				port_dispatch(p, event, 0);
+				continue;
+			}
 			if (cur[i].revents & (POLLIN|POLLPRI)) {
 				event = port_event(p, i);
 				if (EV_STATE_DECISION_EVENT == event) {

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Vladimir O. <ol...@gm...> - 2020-06-24 17:21:42

On Wed, 24 Jun 2020 at 19:58, Richard Cochran <ric...@gm...> wrote:
>
> On Wed, Jun 24, 2020 at 07:36:47PM +0300, Vladimir Oltean wrote:
> > I am reading this as: "let's be defensive even in the case where the
> > kernel decides to go nuts and push us a packet on the error queue even
> > if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's
> > going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because
> > we need TX timestamps.
>
> But, at this point in the program, we know that no tx time stamp is
> outstanding.  We always send one Tx event, then immediately fetch the
> time stamp.  This is carefully synchronized by the program.  It is
> important to do this because many drivers support exactly one Tx time
> stamp at a time.
>
> The kernel must not fabricate transmit time stamps!  That would be
> breaking the contract.
>

Luckily the author of that socket option, and 1 of the 2 people on CC
on that patch, are present in this email thread:

https://patchwork.ozlabs.org/project/netdev/patch/201...@je.../

So, I think you or Jacob would be in a good position to answer:

What contract, who said that this control channel is _only_ for TX
timestamps, and for how many TX timestamps it is?
If a future kernel decided to send more data to programs who opted in
to the error queue, and that kernel decision were to break ptp4l,
could you honestly say it's the kernel's fault?
I mean, those packets are already associated with a cmsg, you're
supposed to only filter the one you're interested in, and ignore
everything else. But ptp4l is clearly not doing that.

> > So we need to be correct, and play by the API's
> > rules, which means treat the POLLERR returned event. It is a
> > correctness issue, not a defense issue.
>
> I think you are splitting hairs here, but I disagree that the program
> was incorrect.  There is no reason _today_ for poll to return a
> POLLERR event from this call, but, in general, I don't believe this is
> guaranteed by anything.
>

I am splitting the hairs I have left, really. Life would have been
_so_much_simpler_ if ptp4l just had this POLLERR check. Who am I to
guess what's going on if ptp4l gets a POLLERR event and treats it like
a POLLIN?

> Would you prefer me leaving your name off the commit message?
>

If we can't reconcile the fact that this is fixing an API misuse, then
you can take my name off of it, sure. I care more about seeing the
code modification go in, than I do about seeing my name on it.

> Thanks,
> Richard
>
>
>

Thanks,
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Keller, J. E <jac...@in...> - 2020-06-24 17:53:03


> -----Original Message-----
> From: Richard Cochran <ric...@gm...>
> Sent: Wednesday, June 24, 2020 9:58 AM
> To: Vladimir Oltean <ol...@gm...>
> Cc: lin...@li...; Keller, Jacob E
> <jac...@in...>
> Subject: Re: [PATCH v2 2/3] clock: dump unexpected packets received on the
> error queues of sockets
> 
> On Wed, Jun 24, 2020 at 07:36:47PM +0300, Vladimir Oltean wrote:
> > I am reading this as: "let's be defensive even in the case where the
> > kernel decides to go nuts and push us a packet on the error queue even
> > if we didn't enable SO_SELECT_ERR_QUEUE". But that isn't at all what's
> > going on. As stated, we opted in this SO_SELECT_ERR_QUEUE game because
> > we need TX timestamps.
> 
> But, at this point in the program, we know that no tx time stamp is
> outstanding.  We always send one Tx event, then immediately fetch the
> time stamp.  This is carefully synchronized by the program.  It is
> important to do this because many drivers support exactly one Tx time
> stamp at a time.
> 
> The kernel must not fabricate transmit time stamps!  That would be
> breaking the contract.

Sure, but the POLLERR could (in theory, not in current practice) return other events/messages besides timestamps.

It was invented/created prior to the Tx timestamps, wasn't it?

> 
> > So we need to be correct, and play by the API's
> > rules, which means treat the POLLERR returned event. It is a
> > correctness issue, not a defense issue.
> 
> I think you are splitting hairs here, but I disagree that the program
> was incorrect.  There is no reason _today_ for poll to return a
> POLLERR event from this call, but, in general, I don't believe this is
> guaranteed by anything.
> 

Right. Today there's no other messages that I am aware of.

> Would you prefer me leaving your name off the commit message?
> 
> Thanks,
> Richard
> 
>

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Richard C. <ric...@gm...> - 2020-06-24 17:58:09

On Wed, Jun 24, 2020 at 05:52:39PM +0000, Keller, Jacob E wrote:
> Sure, but the POLLERR could (in theory, not in current practice) return other events/messages besides timestamps.
> 
> It was invented/created prior to the Tx timestamps, wasn't it?

Yes, and this is the point!

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Richard C. <ric...@gm...> - 2020-06-24 17:56:08

On Wed, Jun 24, 2020 at 08:21:22PM +0300, Vladimir Oltean wrote:

> What contract, who said that this control channel is _only_ for TX
> timestamps, and for how many TX timestamps it is?

Um, there can't be more time stamps than transmitted frames.

> If a future kernel decided to send more data to programs who opted in
> to the error queue, and that kernel decision were to break ptp4l,
> could you honestly say it's the kernel's fault?

You are missing the point entirely.  This is about poll(2) and not
about the socket error queue.  POLLERR might possibly one day acquire
some kind of push semantics (i.e. unwanted by user space) on the kind
of DGRAM sockets we use.  In addition, if ever a pipe-like transport
appears (and I think it not unlikely), then POLLERR will definitely
appear, and this patch prepares for that day.

If a given driver is spewing out extra time stamps, then it is most
definitely the kernel's fault!

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Vladimir O. <ol...@gm...> - 2020-06-19 14:58:36

On Fri, 19 Jun 2020 at 15:51, Richard Cochran <ric...@gm...> wrote:
>
> On Fri, Jun 19, 2020 at 05:13:44AM -0700, Richard Cochran wrote:
> > But the patch does not handle the situation correctly.  This code
> > would dump the frame and carry on as if nothing had happened.  The
> > correct way would be to treat this as a fault on the port in question.
>
> Something like this...
>
> diff --git a/clock.c b/clock.c
> index f43cc2a..b081dc2 100644
> --- a/clock.c
> +++ b/clock.c
> @@ -1559,6 +1559,13 @@ int clock_poll(struct clock *c)
>         LIST_FOREACH(p, &c->ports, list) {
>                 /* Let the ports handle their events. */
>                 for (i = 0; i < N_POLLFD; i++) {
> +                       if (cur[i].revents & POLLERR) {
> +                               pr_err("port %d: unexpected socket error",
> +                                      port_number(p));
> +                               event = EV_FAULT_DETECTED;
> +                               port_dispatch(p, event, 0);

You need to recv() the data. Otherwise, it will remain forever there,
ptp4l will be toast, since this will keep dispatching the fault ad
infinitum.

> +                               continue;
> +                       }
>                         if (cur[i].revents & (POLLIN|POLLPRI)) {
>                                 event = port_event(p, i);
>                                 if (EV_STATE_DECISION_EVENT == event) {

Thanks,
-Vladimir

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Richard C. <ric...@gm...> - 2020-06-19 15:47:36

On Fri, Jun 19, 2020 at 05:58:17PM +0300, Vladimir Oltean wrote:
> You need to recv() the data. Otherwise, it will remain forever there,
> ptp4l will be toast, since this will keep dispatching the fault ad
> infinitum.

The fault handling path closes all sockets and opens new ones.

Thanks,
Richard

Re: [Linuxptp-devel] [PATCH v2 2/3] clock: dump unexpected packets received on the error queues of sockets

From: Vladimir O. <ol...@gm...> - 2020-06-24 18:01:51

On Wed, 24 Jun 2020 at 20:56, Richard Cochran <ric...@gm...> wrote:
>
> On Wed, Jun 24, 2020 at 08:21:22PM +0300, Vladimir Oltean wrote:
>
> > What contract, who said that this control channel is _only_ for TX
> > timestamps, and for how many TX timestamps it is?
>
> Um, there can't be more time stamps than transmitted frames.
>
> > If a future kernel decided to send more data to programs who opted in
> > to the error queue, and that kernel decision were to break ptp4l,
> > could you honestly say it's the kernel's fault?
>
> You are missing the point entirely.  This is about poll(2) and not
> about the socket error queue.  POLLERR might possibly one day acquire
> some kind of push semantics (i.e. unwanted by user space) on the kind
> of DGRAM sockets we use.  In addition, if ever a pipe-like transport
> appears (and I think it not unlikely), then POLLERR will definitely
> appear, and this patch prepares for that day.
>
> If a given driver is spewing out extra time stamps, then it is most
> definitely the kernel's fault!
>
> Thanks,
> Richard
>
>

I think it boils down to something very simple.
The kernel is waking up a user process with revents = POLLIN |
POLLERR. That has a certain meaning.
Then the process says "Oh, yay, I have revents = POLLIN. Let me
process that!". That has an entirely different meaning.
Timestamps are just excuses here.

Thanks,
-Vladimir

<< < 1 2 3 > >> (Page 2 of 3)