Thread: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

PTP IEEE 1588 stack for Linux

[Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Mario M. <mar...@we...> - 2012-10-30 15:47:21

Hello Richard,

I observed the following error messages, followed by a new GM clock search, during my flood ping test:
 
 ptp4l[23768.970]: recvmsg tx timestamp failed: Resource temporarily unavailable
 ptp4l[23768.975]: port 1: send delay request failed
 ptp4l[23768.975]: port 1: SLAVE to FAULTY on FAULT_DETECTED
 ptp4l[23784.059]: port 1: FAULTY to LISTENING on FAULT_CLEARED
 ptp4l[23784.414]: port 1: new foreign master 0050c2.fffe.c2dfc3-1
 ptp4l[23788.424]: selected best master clock 0050c2.fffe.c2dfc3
 ptp4l[23788.425]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
 ptp4l[23790.092]: port 1: minimum delay request interval 2^3
 ptp4l[23790.688]: master offset       -329 s2 adj  -13761 path delay       1626
 ptp4l[23790.708]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
 
 Flood ping test script:
 while true; do sudo ping -f -c 1000 -s $RANDOM <IP of PTP Module>; done
 
I instrumented the ptp4l code and could see that part of the problem was incorrect error handling in the function sk_receive(). recvmsg() sometimes returns EAGAIN, and the try_again variable was not incremented in that case.
After I changed this, the error message no longer appears during the GM clock search in my flood ping test, and everything works very well.
 
 My code changes:
 --- a/sk.c	
 +++ b/sk.c	
 
  		}
  		if (errno == EINTR) {
  			try_again++;
 -		} else if (errno == EAGAIN) {
 +		} else if ((errno == EAGAIN ) || (errno == EWOULDBLOCK)) {
  			usleep(1);
 +			try_again++;
  		} else {
  			break;
  		}  
 
 
Do you have any idea why these EAGAIN errors occur? I could not find a reason for the non-blocking behavior.
 
 Best regards,
 Mario

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Jacob K. <jac...@in...> - 2012-10-30 16:54:44

On 10/30/2012 08:47 AM, Mario Molitor wrote:
>
> Hallo Richard,
>
> I observed following error message with a new GM Clock search during my flood ping test:
>
>   ptp4l[23768.970]: recvmsg tx timestamp failed: Resource temporarily unavailable
>   ptp4l[23768.975]: port 1: send delay request failed
>   ptp4l[23768.975]: port 1: SLAVE to FAULTY on FAULT_DETECTED
>   ptp4l[23784.059]: port 1: FAULTY to LISTENING on FAULT_CLEARED
>   ptp4l[23784.414]: port 1: new foreign master 0050c2.fffe.c2dfc3-1
>   ptp4l[23788.424]: selected best master clock 0050c2.fffe.c2dfc3
>   ptp4l[23788.425]: port 1: LISTENING to UNCALIBRATED on RS_SLAVE
>   ptp4l[23790.092]: port 1: minimum delay request interval 2^3
>   ptp4l[23790.688]: master offset       -329 s2 adj  -13761 path delay       1626
>   ptp4l[23790.708]: port 1: UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
>
>   flood ping test script:
>   while true; do sudo ping -f -c 1000 -s $RANDOM <IP of PTP Module> ; done.
>
> I have instrument the ptp4l code and I could see that a part of problem was a not correct error handling in the function sk_receive(). The recvmsg() returns sometime a EAGAIN and try-again variable was not increment.
> I have changed this and now disappears this error message with GM Clock search during my flood ping test and it works all very well.
>
>   My code changes:
>   --- a/sk.c	
>   +++ b/sk.c	
>
>    		}
>    		if (errno == EINTR) {
>    			try_again++;
>   -		} else if (errno == EAGAIN) {
>   +		} else if ((errno == EAGAIN ) || (errno == EWOULDBLOCK)) {
>    			usleep(1);
>   +			try_again++;
>    		} else {
>    			break;
>    		}
>
>
> Do you have an idea why these EAGAIN errors occur? I cloud not find a reason for non-blocking.
>
>   Best regards,
>   Mario

That loop is there because of the way hardware time stamps are returned from
the network stack to ptp4l. They are looped back on the socket error
queue and then picked up by ptp4l. The current design doesn't want to wait
indefinitely, because a time stamp can go missing. try_again is
deliberately not incremented on EAGAIN, as that would make it loop an
infinite number of times.
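
For illustration, a minimal sketch of such a bounded retry loop in C. The helper name errqueue_recv_bounded and its parameters are hypothetical; this is not the actual sk_receive() code. As far as I know, a read with MSG_ERRQUEUE never blocks, even on a blocking socket, which would explain the EAGAIN results seen above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the real sk_receive()): read one message from
 * the socket error queue, retrying at most max_retries times on EAGAIN.
 * Returns the byte count from recvmsg(), or -1 with errno set.
 */
static ssize_t errqueue_recv_bounded(int fd, void *buf, size_t len,
                                     int max_retries)
{
    char control[256];
    int retries_left = max_retries;

    assert(buf != NULL && max_retries >= 0);

    for (;;) {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg;
        ssize_t cnt;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;      /* time stamps arrive as control data */
        msg.msg_controllen = sizeof(control);

        cnt = recvmsg(fd, &msg, MSG_ERRQUEUE);
        if (cnt >= 0)
            return cnt;
        if (errno == EINTR)
            continue;                   /* interrupted: retry without counting */
        if (errno == EAGAIN && retries_left > 0) {
            retries_left--;             /* nothing queued yet: bounded retry */
            usleep(1);
            continue;
        }
        return -1;                      /* retries used up, or a hard error */
    }
}
```

With an empty error queue the helper gives up after max_retries attempts and leaves errno set to EAGAIN, which corresponds to the "Resource temporarily unavailable" message in the log.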

You could try increasing the tx_timestamp_retries value in the config
file and see if this fixes the issue. I believe we should have a higher
default in this field, because the drivers I've tested all have trouble
returning the timestamp within the very short window that the default
allows (2 retries, each after a usleep(1)).

If it's a problem due to the regular receive that might be an entirely 
different issue.

I believe the true correct answer is to completely re-architect the 
tx_hwtstamp to be asynchronous, so that it just waits until it receives 
the timestamp for a complete sequence of events. That design is 
significantly more difficult to write though.

- Jake

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-10-30 17:05:59

On Tue, Oct 30, 2012 at 04:47:09PM +0100, Mario Molitor wrote:
>  
> I have instrument the ptp4l code and I could see that a part of problem was a not correct error handling in the function sk_receive(). The recvmsg() returns sometime a EAGAIN and try-again variable was not increment.
> I have changed this and now disappears this error message with GM Clock search during my flood ping test and it works all very well. 
>  
>  My code changes:
>  --- a/sk.c	
>  +++ b/sk.c	
>  
>   		}
>   		if (errno == EINTR) {
>   			try_again++;
>  -		} else if (errno == EAGAIN) {
>  +		} else if ((errno == EAGAIN ) || (errno == EWOULDBLOCK)) {

This does not accomplish anything since:

--- /usr/include/asm-generic/errno.h ---

#define	  EWOULDBLOCK	 EAGAIN	  /* Operation would block */


>   			usleep(1);
>  +			try_again++;

This is the wrong solution. The right way is to set the
tx_timestamp_retries configuration variable to a higher number, like
200 or 2000 instead of the default of 2.
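
For example, a configuration file fragment along these lines would raise the limit. The value 200 is just the lower of the two numbers suggested above, and placing the option in the [global] section is my assumption about the file layout:

```
[global]
tx_timestamp_retries 200
```

The file is then passed to ptp4l with the -f option.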

HTH,
Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Miroslav L. <mli...@re...> - 2012-10-30 17:17:31

On Tue, Oct 30, 2012 at 06:05:38PM +0100, Richard Cochran wrote:
> This does not accomplish anything since:
> 
> --- /usr/include/asm-generic/errno.h ---
> 
> #define	  EWOULDBLOCK	 EAGAIN	  /* Operation would block */
> 
> 
> >   			usleep(1);
> >  +			try_again++;
> 
> This is the wrong solution. The right way is to set the
> tx_timestamp_retries configuration variable to a higher number, like
> 200 or 2000 instead of the default of 2.

Would it make sense to specify a timeout instead of number of retries
and use select()?

-- 
Miroslav Lichvar

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-31 00:02:37

> -----Original Message-----
> From: Miroslav Lichvar [mailto:mli...@re...]
> Sent: Tuesday, October 30, 2012 10:17 AM
> To: lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> On Tue, Oct 30, 2012 at 06:05:38PM +0100, Richard Cochran wrote:
> > This does not accomplish anything since:
> >
> > --- /usr/include/asm-generic/errno.h ---
> >
> > #define	  EWOULDBLOCK	 EAGAIN	  /* Operation would block */
> >
> >
> > >   			usleep(1);
> > >  +			try_again++;
> >
> > This is the wrong solution. The right way is to set the
> > tx_timestamp_retries configuration variable to a higher number, like
> > 200 or 2000 instead of the default of 2.
> 
> Would it make sense to specify a timeout instead of number of retries
> and use select()?
> 
> --
> Miroslav Lichvar

I did some digging in the kernel to figure out why the exceptfds parameter of select() doesn't get woken up when a message appears on the error queue. It turns out that this is because in fs/select.c the POLLEX_SET mask includes only POLLPRI and not POLLERR. I tried a modified kernel that defined POLLEX_SET as (POLLPRI | POLLERR), and it enabled the use case we want. Sadly, this would add a further dependency, and it probably can't really be changed.

I don't know why exceptfds, which is documented as "wake a socket on error", doesn't check the POLLERR flag. It seems really silly.
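
For comparison, a minimal sketch of waiting with poll() instead of select(). poll() always reports POLLERR in revents, whether or not it was requested, and to my understanding a datagram socket raises POLLERR while its error queue is non-empty. The helper name wait_for_errqueue is hypothetical, and this approach was not proposed in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: wait up to timeout_ms for an error condition on fd.
 * Returns 1 if POLLERR is pending, 0 on timeout, -1 on failure.
 */
static int wait_for_errqueue(int fd, int timeout_ms)
{
    /* events = 0: we only care about POLLERR, which is always reported */
    struct pollfd pfd = { .fd = fd, .events = 0, .revents = 0 };
    int res;

    assert(timeout_ms >= 0);

    do {
        /* note: a signal restarts the full timeout in this sketch */
        res = poll(&pfd, 1, timeout_ms);
    } while (res < 0 && errno == EINTR);

    if (res <= 0)
        return res;
    return (pfd.revents & POLLERR) ? 1 : 0;
}
```

Because no input events are requested, ordinary incoming packets do not wake the caller, so the wait is effectively restricted to error conditions.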

- Jake

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-10-30 17:24:13

On Tue, Oct 30, 2012 at 09:54:30AM -0700, Jacob Keller wrote:
> 
> I believe the true correct answer is to completely re-architect the 
> tx_hwtstamp to be asynchronous, so that it just waits until it receives 
> the timestamp for a complete sequence of events. That design is 
> significantly more difficult to write though.

But even if we did it that way, it would not really be a better
solution. Think about your own Intel cards. They would end up missing
Tx time stamps and possibly mixing them up due to the hardware
limitation of having a Tx time stamp FIFO of depth one.

And it is not just Intel cards that have this issue. I think the
majority of the current hardware offerings all have this same
limitation.  So we really must wait for the Tx time stamp after
sending an event message before going on with the protocol, simply
to function on most of the hardware out there.

Thanks,
Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-10-30 17:26:02

On Tue, Oct 30, 2012 at 06:17:18PM +0100, Miroslav Lichvar wrote:
> On Tue, Oct 30, 2012 at 06:05:38PM +0100, Richard Cochran wrote:
> > This does not accomplish anything since:
> > 
> > --- /usr/include/asm-generic/errno.h ---
> > 
> > #define	  EWOULDBLOCK	 EAGAIN	  /* Operation would block */
> > 
> > 
> > >   			usleep(1);
> > >  +			try_again++;
> > 
> > This is the wrong solution. The right way is to set the
> > tx_timestamp_retries configuration variable to a higher number, like
> > 200 or 2000 instead of the default of 2.
> 
> Would it make sense to specify a timeout instead of number of retries
> and use select()?

That doesn't work because you can't restrict the select to the error
queue only.

Thanks,
Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-30 17:29:00

> -----Original Message-----
> From: Richard Cochran [mailto:ric...@gm...]
> Sent: Tuesday, October 30, 2012 10:24 AM
> To: Keller, Jacob E
> Cc: lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> On Tue, Oct 30, 2012 at 09:54:30AM -0700, Jacob Keller wrote:
> >
> > I believe the true correct answer is to completely re-architect the
> > tx_hwtstamp to be asynchronous, so that it just waits until it receives
> > the timestamp for a complete sequence of events. That design is
> > significantly more difficult to write though.
> 
> But even if we did that way, it would not really be a better
> solution. Think about your own Intel cards. They would end up missing
> Tx time stamps and possibly mixing them up due to the hardware
> limitation of having a Tx time stamp FIFO of depth one.
> 
> And it is not just Intel cards that have this issue. I think the
> majority of the current hardware offerings all have this same
> limitation.  So we really must wait for the Tx time stamp after
> sending an event message before going on with the protocol, simply
> to function on most of the hardware out there.
> 
> Thanks,
> Richard

I think we can get ptp4l to work right even under those scenarios, but right now it is horribly annoying that practically everyone has to change the value as soon as the system is under stress while PTP is running.

I like the idea of architecting it to use select and a delay value, so I'll try to see how difficult that would be.

- Jake

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-30 17:34:23

> -----Original Message-----
> From: Richard Cochran [mailto:ric...@gm...]
> Sent: Tuesday, October 30, 2012 10:26 AM
> To: Miroslav Lichvar
> Cc: lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> > >
> > > This is the wrong solution. The right way is to set the
> > > tx_timestamp_retries configuration variable to a higher number, like
> > > 200 or 2000 instead of the default of 2.
> >
> > Would it make sense to specify a timeout instead of number of retries
> > and use select()?
> 
> That doesn't work because you can't restrict the select to the error
> queue only.
> 

But select() tells us how long it waited (on Linux it updates the timeout argument with the time remaining), and I believe we can do something like:

while (nothing_in_errorqueue or timeleft) {

  timeleft -= select(timeleft)

}
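
A minimal C sketch of that loop, which keeps going only while no time stamp has arrived and time is left. It relies on the Linux-specific behavior that select() rewrites the timeout with the time remaining (the return value itself is the number of ready descriptors). The name wait_tx_timestamp and all parameters are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/uio.h>

/*
 * Hypothetical sketch: returns 0 once a message was read from the error
 * queue, or -1 on timeout (errno = ETIMEDOUT) or on any other error.
 */
static int wait_tx_timestamp(int fd, void *buf, size_t len, long timeout_us)
{
    struct timeval left;
    char control[256];

    assert(fd >= 0 && fd < FD_SETSIZE && timeout_us >= 0);
    left.tv_sec = timeout_us / 1000000;
    left.tv_usec = timeout_us % 1000000;

    for (;;) {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg;
        fd_set rfds;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0)
            return 0;                   /* something was on the error queue */
        if (errno != EAGAIN)
            return -1;
        if (left.tv_sec == 0 && left.tv_usec == 0) {
            errno = ETIMEDOUT;          /* time budget used up */
            return -1;
        }

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        /* Linux only: 'left' is updated to the time not slept */
        if (select(fd + 1, &rfds, NULL, NULL, &left) < 0)
            return -1;                  /* includes EINTR in this sketch */
        /* regular packets that woke us up would have to be handled here */
    }
}
```

The sketch watches the read set, so ordinary incoming packets also wake it up. If they are not consumed, the loop spins until the time budget runs out.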


> Thanks,
> Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-10-30 17:45:22

On Tue, Oct 30, 2012 at 05:28:46PM +0000, Keller, Jacob E wrote:

> I think we can get PTP4l to work right even under those scenarios,
> but right now it is horribly annoying that practically everyone has
> to change the value if they ever have stress when ptp is running.

[ I don't have this problem on the hardware that I use most often.
  We can increase the default if you want. ]

We can get ptp4l to transmit asynchronously, but we can't fix the
hardware.  For this reason I remain convinced that we must block and
wait for the time stamp.  If you accept this premise, then you also
must accept the idea of a timeout or retry count, since time stamps
can get lost. There is a trade off between wanting to wait long enough
(hardware specific) and wanting to give up ASAP when a time stamp goes
missing.

We could change the counter to a time value to be compared against
CLOCK_MONOTONIC, but that is only a cosmetic change.
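
As a rough sketch of that variant: the retry counter is replaced by a deadline taken from CLOCK_MONOTONIC, while the polling itself stays the same. The names monotonic_ns and errqueue_recv_deadline are hypothetical and not part of ptp4l.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>
#include <unistd.h>

/* Current CLOCK_MONOTONIC reading in nanoseconds. */
static int64_t monotonic_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical sketch: poll the error queue until a message arrives or
 * timeout_ns has elapsed. Returns 0 on success, -1 with errno set
 * (ETIMEDOUT when the deadline passed).
 */
static int errqueue_recv_deadline(int fd, void *buf, size_t len,
                                  int64_t timeout_ns)
{
    const int64_t deadline = monotonic_ns() + timeout_ns;
    char control[256];

    assert(buf != NULL && timeout_ns >= 0);

    for (;;) {
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct msghdr msg;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0)
            return 0;
        if (errno != EAGAIN && errno != EINTR)
            return -1;
        if (monotonic_ns() >= deadline) {
            errno = ETIMEDOUT;          /* give up: time stamp went missing */
            return -1;
        }
        usleep(1);                      /* same short sleep as the retry loop */
    }
}
```

The trade-off described above is unchanged: the timeout still has to be long enough for the slowest hardware and short enough to notice a lost time stamp quickly.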

> I like the idea of architecting it to use select and a delay value,
> so I'll try to see how difficult that would be.

Be my guest ;)

I tried once to fixup ptpd like this, and I didn't get very far.

Thanks,
Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-10-30 17:46:40

On Tue, Oct 30, 2012 at 05:33:45PM +0000, Keller, Jacob E wrote:
> > That doesn't work because you can't restrict the select to the error
> > queue only.
> > 
> 
> But select returns the time it waited, and I believe we can do something like:
> 
> while (nothing_in_errorqueue or timeleft) {

/* handle new, incoming packets here? */
 
>   timeleft -= select(timeleft)
> 
> }

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-30 17:53:03

> -----Original Message-----
> From: Richard Cochran [mailto:ric...@gm...]
> Sent: Tuesday, October 30, 2012 10:46 AM
> To: Keller, Jacob E
> Cc: Miroslav Lichvar; lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> On Tue, Oct 30, 2012 at 05:33:45PM +0000, Keller, Jacob E wrote:
> > > That doesn't work because you can't restrict the select to the error
> > > queue only.
> > >
> >
> > But select returns the time it waited, and I believe we can do something
> like:
> >
> > while (nothing_in_errorqueue or timeleft) {
> 
> /* handle new, incoming packets here? */
> 

Basically that's the idea.

- Jake

> >   timeleft -= select(timeleft)
> >
> > }

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-10-30 17:57:01

On Tue, Oct 30, 2012 at 05:52:51PM +0000, Keller, Jacob E wrote:
> > >
> > > while (nothing_in_errorqueue or timeleft) {
> > 
> > /* handle new, incoming packets here? */
> > 
> 
> Basically that's the idea.

But don't forget about the other ports!

Thanks,
Richard

> > >   timeleft -= select(timeleft)
> > >
> > > }

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Stephan G. <ste...@gm...> - 2012-10-30 19:54:46

>>   --- a/sk.c	
>>   +++ b/sk.c	
>>
>>    		}
>>    		if (errno == EINTR) {
>>    			try_again++;
>>   -		} else if (errno == EAGAIN) {
>>   +		} else if ((errno == EAGAIN ) || (errno == EWOULDBLOCK)) {
>
> This does not accomplish anything since:
>
> --- /usr/include/asm-generic/errno.h ---

The man page for recvmsg suggests checking both, just for portability. But
maybe the whole thing is so Linux dependent that it makes no sense to
distinguish between the two.
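
If portability did matter, one way to write the check without a redundant comparison is sketched below. The helper name would_block is hypothetical; on Linux, where both macros have the same value, only the first branch is compiled.

```c
#include <errno.h>

/* Hypothetical helper: nonzero if err means "try again later". */
static int would_block(int err)
{
#if EAGAIN == EWOULDBLOCK
    /* Linux and most other systems: one value, one comparison */
    return err == EAGAIN;
#else
    /* Systems where the two codes differ: accept either one */
    return err == EAGAIN || err == EWOULDBLOCK;
#endif
}
```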

Regards,

Stephan

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-30 20:00:44

> -----Original Message-----
> From: Stephan Gatzka [mailto:ste...@gm...]
> Sent: Tuesday, October 30, 2012 12:35 PM
> To: Richard Cochran
> Cc: lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> 
> >>   --- a/sk.c
> >>   +++ b/sk.c
> >>
> >>    		}
> >>    		if (errno == EINTR) {
> >>    			try_again++;
> >>   -		} else if (errno == EAGAIN) {
> >>   +		} else if ((errno == EAGAIN ) || (errno == EWOULDBLOCK)) {
> >
> > This does not accomplish anything since:
> >
> > --- /usr/include/asm-generic/errno.h ---
> 
> The man page for recvmsg suggest to check both just for portability. But
> maybe the whole stuff is so much Linux dependent that it probably makes
> no sense to distinguish both.
> 

It is completely, 100% Linux dependent (and dependent on very recent kernels!). It makes no sense to attempt to be more portable, because the interfaces for PTP and hardware time stamps are completely non-portable.

- Jake

> Regards,
> 
> Stephan

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-10-30 20:48:41

On Tue, Oct 30, 2012 at 06:23:53PM +0100, Richard Cochran wrote:
> On Tue, Oct 30, 2012 at 09:54:30AM -0700, Jacob Keller wrote:
> > 
> > I believe the true correct answer is to completely re-architect the 
> > tx_hwtstamp to be asynchronous, so that it just waits until it receives 
> > the timestamp for a complete sequence of events. That design is 
> > significantly more difficult to write though.
> 
> But even if we did that way, it would not really be a better
> solution. Think about your own Intel cards. They would end up missing
> Tx time stamps and possibly mixing them up due to the hardware
> limitation of having a Tx time stamp FIFO of depth one.

This may be the wrong list, but this reminds me of an issue with the
Intel hardware that I have been meaning to ask you about.  The igb
driver has always had the following comment regarding transmit time
stamps:

 * If we were asked to do hardware stamping and such a time stamp is
 * available, then it must have been for this skb here because we only
 * allow only one such packet into the queue.

This statement wasn't actually true up until recently, when Matthew
Vick added some code that enforced the one packet limit.

If I am not mistaken, the ixgbe driver would also need some kind of guard
against the case when a user program sends two or more event packets
in a row, would it not?

Thanks,
Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-30 21:23:28

> -----Original Message-----
> From: Richard Cochran [mailto:ric...@gm...]
> Sent: Tuesday, October 30, 2012 1:48 PM
> To: Keller, Jacob E
> Cc: lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> On Tue, Oct 30, 2012 at 06:23:53PM +0100, Richard Cochran wrote:
> > On Tue, Oct 30, 2012 at 09:54:30AM -0700, Jacob Keller wrote:
> > >
> > > I believe the true correct answer is to completely re-architect the
> > > tx_hwtstamp to be asynchronous, so that it just waits until it receives
> > > the timestamp for a complete sequence of events. That design is
> > > significantly more difficult to write though.
> >
> > But even if we did that way, it would not really be a better
> > solution. Think about your own Intel cards. They would end up missing
> > Tx time stamps and possibly mixing them up due to the hardware
> > limitation of having a Tx time stamp FIFO of depth one.
> 
> This may be the wrong list, but this reminds me of an issue with the
> Intel hardware that I have been meaning to ask you about.  The igb
> driver has always had the following comment regarding transmit time
> stamps:
> 
>  * If we were asked to do hardware stamping and such a time stamp is
>  * available, then it must have been for this skb here because we only
>  * allow only one such packet into the queue.
> 
> This statement wasn't actually true up until recently, when Matthew
> Vick added some code that enforced the one packet limit.
> 
> If I am not mistaken, the ixgb also would need some kind of guard
> against the case when a user program sends two or more event packets
> in a row, would it not?

Short answer: that limit is enforced by the hardware (it disables time stamping as long as the RXTSTMP register is locked), except in the mode that puts time stamp directly into packet buffer.

Long answer:

That comment actually refers to the hardware design of the 82576 device. Basically, when a packet is time stamped, the hardware stores the time stamp in the RXHWTSTMP registers and sets the bit in the descriptor plus the bit in TSYNCRXCTL.

No more than one packet will have the bit set in the descriptor, because time stamping is disabled while there is a valid stamp in the RXHWTSTAMP registers, so that packet must match the timestamp in the registers.

There was some queuing code, but it turned out to be bogus and did nothing of value, and I've petitioned to have it removed.

For 10GbE, I added the ptp_match function to guard against the case where a time stamped packet is dropped.

The one-per-queue basically occurs because hardware design timestamps the packet, puts timestamp in registers, and indicates which packet got time stamped. There's no need for more correlation because the descriptor indicates which packet got time stamped, and as long as you don't read the RXTSTMP registers they remain locked and hardware won't timestamp another packet until you unlock the RXTSTMP registers. The ptp_match is necessary in the very rare case that a time stamped ptp packet never reaches the driver. (it will find the next ptp packet that should have been time stamped according to the timestamp mode, and then clear timestamps so that the error case causing timestamps to stop forever is avoided)

For the 82580 part timestamps are stored in the packet buffer avoiding the issue entirely.

So, the only guard necessary is the ptp_match function to prevent that condition. If there is a timestamp in the registers, hardware doesn't timestamp again until the user reads the timestamp value out. Two rapid event packets in succession will cause the first arrived to be time stamped and the second to not be time stamped.
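
The locking behavior described above can be pictured with a small toy model in C. All names here (ts_latch, latch_capture, latch_read) are hypothetical; this models only the described behavior and is not driver code or the real register layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Toy model of a depth-one time stamp register that locks when written. */
struct ts_latch {
    bool locked;        /* true while an unread time stamp is held */
    uint64_t stamp;     /* the held time stamp */
};

/* "Hardware" side: stamp a packet unless the register is still locked. */
static bool latch_capture(struct ts_latch *l, uint64_t now)
{
    assert(l != NULL);
    if (l->locked)
        return false;   /* second packet in a row goes unstamped */
    l->stamp = now;
    l->locked = true;
    return true;
}

/* "Driver" side: reading the time stamp out unlocks the register. */
static bool latch_read(struct ts_latch *l, uint64_t *out)
{
    assert(l != NULL && out != NULL);
    if (!l->locked)
        return false;   /* nothing to read */
    *out = l->stamp;
    l->locked = false;
    return true;
}
```

In this model two captures in a row stamp only the first packet, and a later capture succeeds again only after latch_read() has been called.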

> 
> Thanks,
> Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-30 23:03:12

> -----Original Message-----
> From: Richard Cochran [mailto:ric...@gm...]
> Sent: Tuesday, October 30, 2012 10:45 AM
> To: Keller, Jacob E
> Cc: lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> > I like the idea of architecting it to use select and a delay value,
> > so I'll try to see how difficult that would be.
> 
> Be my guest ;)
> 
> I tried once to fixup ptpd like this, and I didn't get very far.
> 

You're right. I looked at implementing it and to make it work we really need to
have select only check on the errqueue which it doesn't seem to be able to do...

It would take an insane amount of work to move to a model that allows receive handling inside sk.c, and I believe it isn't worth the effort. I would, however, like to increase the default tx_timestamp_retries value, as 2 tries rarely works on anything I've tested and generates a lot of false positive errors. What hardware are you using that doesn't have issues at 2 retries? And have you tried testing it under moderate stress?

I also like the idea of using CLOCK_MONOTONIC for a "timeout". That seems to make more sense from a user's point of view than a retry count: users are more likely to understand what a timeout is intended to do, whereas tx_timestamp_retries might be confusing.

Thanks

- Jake

> Thanks,
> Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Jonatan W. <jw...@ne...> - 2012-10-31 07:49:26


On 10/30/2012 10:23 PM, Keller, Jacob E wrote:
> 
> 
>> -----Original Message-----
>> From: Richard Cochran [mailto:ric...@gm...]
>> Sent: Tuesday, October 30, 2012 1:48 PM
>> To: Keller, Jacob E
>> Cc: lin...@li...
>> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
>> errors during flood ping test
>> 
>> On Tue, Oct 30, 2012 at 06:23:53PM +0100, Richard Cochran wrote:
>>> On Tue, Oct 30, 2012 at 09:54:30AM -0700, Jacob Keller wrote:
>>>> 
>>>> I believe the true correct answer is to completely
>>>> re-architect the tx_hwtstamp to be asynchronous, so that it
>>>> just waits until it receives
>>>> the timestamp for a complete sequence of events. That design
>>>> is significantly more difficult to write though.
>>> 
>>> But even if we did that way, it would not really be a better 
>>> solution. Think about your own Intel cards. They would end up
>>> missing Tx time stamps and possibly mixing them up due to the
>>> hardware limitation of having a Tx time stamp FIFO of depth
>>> one.
>> 
>> This may be the wrong list, but this reminds me of an issue with
>> the Intel hardware that I have been meaning to ask you about.
>> The igb driver has always had the following comment regarding
>> transmit time stamps:
>> 
>>  * If we were asked to do hardware stamping and such a time stamp is
>>  * available, then it must have been for this skb here because we only
>>  * allow only one such packet into the queue.
>> 
>> This statement wasn't actually true up until recently, when
>> Matthew Vick added some code that enforced the one packet limit.
>> 
>> If I am not mistaken, the ixgb also would need some kind of
>> guard against the case when a user program sends two or more
>> event packets in a row, would it not?
> 
> 
> Short answer: that limit is enforced by the hardware (it disables
> time stamping as long as the RXTSTMP register is locked), except in
> the mode that puts time stamp directly into packet buffer.
> 
> 
> Long answer:
> 
> That comment actually refers to hardware design for the 82576
> device. Basically, a packet is time stamped and the register stores
> RXHWTSTMP and sets the bit in the descriptor plus the bit in
> TSYNCRXCTL.
> 
> No more than one packet will have the bit set in the descriptor,
> because time stamping is disabled when there is a valid stamp in
> the RXHWTSTAMP registers, so that packet must match the timestamp
> in the registers.
> 
> There was some queuing code but this actually turns out to be bogus
> and did nothing of value, and I've petitioned to have it removed.
> 
> for 10Gbe, I added the ptp_match function to prevent the case where
> a time stamped packet is dropped.
> 
> The one-per-queue basically occurs because hardware design
> timestamps the packet, puts timestamp in registers, and indicates
> which packet got time stamped. There's no need for more correlation
> because the descriptor indicates which packet got time stamped, and
> as long as you don't read the RXTSTMP registers they remain locked
> and hardware won't timestamp another packet until you unlock the
> RXTSTMP registers. The ptp_match is necessary in the very rare case
> that a time stamped ptp packet never reaches the driver. (it will
> find the next ptp packet that should have been time stamped
> according to the timestamp mode, and then clear timestamps so that
> the error case causing timestamps to stop forever is avoided)
> 
> For the 82580 part timestamps are stored in the packet buffer
> avoiding the issue entirely.
> 
> So, the only guard necessary is the ptp_match function to prevent
> that condition. If there is a timestamp in the registers, hardware
> doesn't timestamp again until the user reads the timestamp value
> out. Two rapid event packets in succession will cause the first
> arrived to be time stamped and the second to not be time stamped.
> 
>> 
>> Thanks, Richard
> 

Now that we are on the subject, have I understood correctly that the
82599(ES) suffers from the same hardware design as the 82576? That is,
time stamps are kept in a separate register rather than being attached
to each packet in the queue?

Are there any 10GbE cards with packet buffer time stamping?

// jwalck

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-10-31 19:14:04


> -----Original Message-----
> From: Jonatan Walck [mailto:jw...@ne...]
> Sent: Wednesday, October 31, 2012 12:30 AM
> To: lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> 
> On 10/30/2012 10:23 PM, Keller, Jacob E wrote:
> >
> >
> >> -----Original Message-----
> >> From: Richard Cochran [mailto:ric...@gm...]
> >> Sent: Tuesday, October 30, 2012 1:48 PM
> >> To: Keller, Jacob E
> >> Cc: lin...@li...
> >> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> >> errors during flood ping test
> >>
> >> On Tue, Oct 30, 2012 at 06:23:53PM +0100, Richard Cochran wrote:
> >>> On Tue, Oct 30, 2012 at 09:54:30AM -0700, Jacob Keller wrote:
> >>>>
> >>>> I believe the true correct answer is to completely
> >>>> re-architect the tx_hwtstamp to be asynchronous, so that it
> >>>> just waits until it receives
> >>>> the timestamp for a complete sequence of events. That design
> >>>> is significantly more difficult to write though.
> >>>
> >>> But even if we did that way, it would not really be a better
> >>> solution. Think about your own Intel cards. They would end up
> >>> missing Tx time stamps and possibly mixing them up due to the
> >>> hardware limitation of having a Tx time stamp FIFO of depth
> >>> one.
> >>
> >> This may be the wrong list, but this reminds me of an issue with
> >> the Intel hardware that I have been meaning to ask you about.
> >> The igb driver has always had the following comment regarding
> >> transmit time stamps:
> >>
> >> * If we were asked to do hardware stamping and such a time stamp is
> >> * available, then it must have been for this skb here because we only
> >> * allow only one such packet into the queue.
> >>
> >> This statement wasn't actually true up until recently, when
> >> Matthew Vick added some code that enforced the one packet limit.
> >>
> >> If I am not mistaken, the ixgb also would need some kind of
> >> guard against the case when a user program sends two or more
> >> event packets in a row, would it not?
> >
> >
> > Short answer: that limit is enforced by the hardware (it disables
> > time stamping as long as the RXTSTMP register is locked), except in
> > the mode that puts time stamp directly into packet buffer.
> >
> >
> > Long answer:
> >
> > That comment actually refers to hardware design for the 82576
> > device. Basically, a packet is time stamped and the register stores
> > RXHWTSTMP and sets the bit in the descriptor plus the bit in
> > TSYNCRXCTL.
> >
> > No more than one packet will have the bit set in the descriptor,
> > because time stamping is disabled when there is a valid stamp in
> > the RXHWTSTAMP registers, so that packet must match the timestamp
> > in the registers.
> >
> > There was some queuing code but this actually turns out to be bogus
> > and did nothing of value, and I've petitioned to have it removed.
> >
> > For 10GbE, I added the ptp_match function to prevent the case where
> > a time stamped packet is dropped.
> >
> > The one-per-queue basically occurs because hardware design
> > timestamps the packet, puts timestamp in registers, and indicates
> > which packet got time stamped. There's no need for more correlation
> > because the descriptor indicates which packet got time stamped, and
> > as long as you don't read the RXTSTMP registers they remain locked
> > and hardware won't timestamp another packet until you unlock the
> > RXTSTMP registers. The ptp_match is necessary in the very rare case
> > that a time stamped ptp packet never reaches the driver. (it will
> > find the next ptp packet that should have been time stamped
> > according to the timestamp mode, and then clear timestamps so that
> > the error case causing timestamps to stop forever is avoided)
> >
> > For the 82580 part timestamps are stored in the packet buffer
> > avoiding the issue entirely.
> >
> > So, the only guard necessary is the ptp_match function to prevent
> > that condition. If there is a timestamp in the registers, hardware
> > doesn't timestamp again until the user reads the timestamp value
> > out. Two rapid event packets in succession will cause the first
> > arrived to be time stamped and the second to not be time stamped.
> >
> >>
> >> Thanks, Richard
> >
> 
> Now that we are on the subject, have I understood correctly that
> 82599(ES) suffers from the same hardware design as the 82576? That is,
> timestamps in a separate register rather than the possibility to be
> strapped on to each packet in the queue?
> 
> Are there any 10GE cards with packet buffer timestamping?


At this time, there are no 10Gb cards with per-packet timestamping in the buffer.

- Jake
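
For anyone trying to picture the register locking described in the quoted text above, here is a toy model in C. It is only a sketch of the behavior as described in this thread, not driver code: the names ts_latch, latch_stamp and latch_read are made up for illustration and do not correspond to real igb or ixgbe symbols or registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model (hypothetical names): a single time stamp register that
 * locks when hardware writes it and unlocks when the driver reads it. */
struct ts_latch {
	bool locked;	/* true while an unread time stamp is held */
	uint64_t stamp;	/* the held time stamp value */
};

/* Hardware side: try to stamp a packet at time 'now'.
 * Returns true if this packet got the time stamp, false if the
 * register was still locked and the packet went out unstamped. */
static bool latch_stamp(struct ts_latch *l, uint64_t now)
{
	if (l->locked)
		return false;
	l->stamp = now;
	l->locked = true;
	return true;
}

/* Driver side: read the time stamp, which unlocks the register.
 * Returns false if there was no valid time stamp to read. */
static bool latch_read(struct ts_latch *l, uint64_t *out)
{
	if (!l->locked)
		return false;
	*out = l->stamp;
	l->locked = false;
	return true;
}
```

In this model two event packets in quick succession behave as described: the first call to latch_stamp() succeeds, the second returns false until latch_read() has been called.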

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Richard C. <ric...@gm...> - 2012-11-01 08:53:18

On Tue, Oct 30, 2012 at 11:02:21PM +0000, Keller, Jacob E wrote:
> 
> It would take an insane amount of work to move to a model that
> allows receive handling inside sk.c, and I believe it isn't worth
> the effort. I would however like to increase the default
> tx_timestamp_retries value, as 2 tries rarely works for anything
> I've tested, as it generates false positive errors a lot.

Okay, what value is a safe default, in your experience?

> What hardware are you using that doesn't have issues at 2 retries?

The PHYTER almost never loses a Tx time stamp, and the TI CPTS seems
to be working perfectly, too. I don't have the Freescale eTSEC (gianfar)
for testing, but I remember that it always worked, since the Tx time
stamp is delivered into packet buffer's padding.

With the IGB, sometimes it seemed that 2 retries is okay, but
sometimes I needed to ramp this up.

> And have you attempted testing this under moderate stress?

I just retested the PHYTER under a ping flood, and there were no
hiccups. I will test the CPTS again when I get a chance.

Thanks,
Richard

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Jacob K. <jac...@in...> - 2012-11-01 16:54:28

On 11/01/2012 01:53 AM, Richard Cochran wrote:
> On Tue, Oct 30, 2012 at 11:02:21PM +0000, Keller, Jacob E wrote:
>>
>> It would take an insane amount of work to move to a model that
>> allows receive handling inside sk.c, and I believe it isn't worth
>> the effort. I would however like to increase the default
>> tx_timestamp_retries value, as 2 tries rarely works for anything
>> I've tested, as it generates false positive errors a lot.
>
> Okay, what value is a safe default, in your experience?
>
>> What hardware are you using that doesn't have issues at 2 retries?
>
> The PHYTER almost never loses a Tx time stamp, and the TI CPTS seems
> to be working perfectly, too. I don't have the Freescale eTSEC (gianfar)
> for testing, but I remember that it always worked, since the Tx time
> stamp is delivered into packet buffer's padding.
>
> With the IGB, sometimes it seemed that 2 retries is okay, but
> sometimes I needed to ramp this up.
>
>> And have you attempted testing this under moderate stress?
>
> I just retested the PHYTER under a ping flood, and there were no
> hiccups. I will test the CPTS again when I get a chance.
>
> Thanks,
> Richard
>

On a 10GbE link, I've only needed to go up to about 20 or 25 retries to
remove most of the contention. Our validation team increased this to 200,
as they really didn't want to see false positives and it seemed better to
wait longer.

For the 1Gb ones I don't have a figure personally, but I will ask Matthew.

I am ok with needing to ramp up the value sometimes, but I would much
prefer a default which meant fewer users had to change something, as the
value seemed cryptic to the people I had to explain it to.

An interesting thought I just had: how difficult would it be for the
software to automatically increase the retry count if it misses a bunch of
tx timestamps in a row? It would increase a counter as it missed tx
timestamps, so that it would start low but increase to the value the
hardware needs to respond in time. I'm thinking it would only trigger if
you missed a few in a row, as a one-time miss isn't really a big deal.
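
A rough sketch of that idea in C, just to make it concrete. Nothing here exists in linuxptp: the struct and function names are hypothetical, and doubling the budget on each trigger is only one possible growth policy.

```c
#include <assert.h>

/* Hypothetical state for an adaptive Tx time stamp retry budget. */
struct tx_retry_state {
	int retries;		/* current retry budget */
	int max_retries;	/* cap so the budget cannot grow forever */
	int miss_streak;	/* consecutive sends without a time stamp */
	int miss_limit;		/* streak length that triggers an increase */
};

static void tx_retry_init(struct tx_retry_state *s, int start, int max,
			  int limit)
{
	assert(start > 0 && max >= start && limit > 0);
	s->retries = start;
	s->max_retries = max;
	s->miss_streak = 0;
	s->miss_limit = limit;
}

/* Call once per transmitted event message.
 * got_ts is nonzero if the Tx time stamp arrived within the budget. */
static void tx_retry_update(struct tx_retry_state *s, int got_ts)
{
	if (got_ts) {
		s->miss_streak = 0;	/* a single miss is forgiven */
		return;
	}
	if (++s->miss_streak < s->miss_limit)
		return;
	/* Several misses in a row: double the budget, up to the cap. */
	s->retries *= 2;
	if (s->retries > s->max_retries)
		s->retries = s->max_retries;
	s->miss_streak = 0;
}
```

Starting at 2 retries with a limit of 3 misses in a row and a cap of 25, an isolated miss changes nothing, while three consecutive misses raise the budget to 4.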

Thanks

- Jake

Re: [Linuxptp-users] “Resource temporarily unavailable” errors during flood ping test

From: Delio B. <dbr...@au...> - 2012-11-01 17:32:39

Hello Richard,

On Nov 1, 2012, at 9:53 AM, Richard Cochran <ric...@gm...> wrote:

> On Tue, Oct 30, 2012 at 11:02:21PM +0000, Keller, Jacob E wrote:
>> 
>> It would take an insane amount of work to move to a model that
>> allows receive handling inside sk.c, and I believe it isn't worth
>> the effort. I would however like to increase the default
>> tx_timestamp_retries value, as 2 tries rarely works for anything
>> I've tested, as it generates false positive errors a lot.
> 
> Okay, what value is a safe default, in your experience?
> 
>> What hardware are you using that doesn't have issues at 2 retries?
> 
> The PHYTER almost never loses a Tx time stamp, and the TI CPTS seems
> to be working perfectly, too. I don't have the Freescale eTSEC (gianfar)

Unfortunately I can trigger this issue (TX timestamp loss) by transferring a large file via ssh on my DM8148 based board (CPSW+CPTS) using the default value for sk_tx_retries. I can enable extra debug flags and investigate if the timestamp is lost at the CPTS level if it's any help.

Regards
--
Delio Brignoli
Audioscience Inc

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Stephan G. <ste...@gm...> - 2012-11-03 09:45:26

>
> I just retested the PHYTER under a ping flood, and there were no
> hiccups. I will test the CPTS again when I get a chance.
>

We also have this National Phyter, but connected to a 400MHz MPC5200b.
We see these hiccups with the default value.

Regards,

Stephan

Re: [Linuxptp-users] Fw: “Resource temporarily unavailable” errors during flood ping test

From: Keller, J. E <jac...@in...> - 2012-11-05 17:44:41

I tested igb and ixgbe using a debug print that told me how many retries each time stamp was taking.

We actually saw most of them taking 2-5 retries, with a few large anomalies.

I want to change the default to maybe 20 or 25, as this should reduce false positives without taking significantly more time in the longer cases.
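
To illustrate the kind of counting I mean, here is a minimal sketch. It is not the actual instrumentation and not the real sk_receive() code: fetch_with_retries, ts_fetch_fn and the fake_source helper are made-up names, and the callback merely stands in for the recvmsg() call so the loop can be exercised without hardware.

```c
#include <errno.h>

/* Hypothetical callback: returns 0 when a time stamp was fetched,
 * or -1 with errno set, as recvmsg() would. */
typedef int (*ts_fetch_fn)(void *ctx);

/* Try once, then retry up to max_retries times on EAGAIN or EINTR.
 * *used reports how many retries were consumed, which is the number
 * one would print when looking for a sensible default. */
static int fetch_with_retries(ts_fetch_fn fetch, void *ctx,
			      int max_retries, int *used)
{
	int tries = 0;

	for (;;) {
		if (fetch(ctx) == 0) {
			*used = tries;
			return 0;
		}
		if (errno != EAGAIN && errno != EINTR)
			break;		/* hard error, give up at once */
		if (tries >= max_retries)
			break;		/* retry budget exhausted */
		tries++;
	}
	*used = tries;
	return -1;
}

/* Stand-in for slow hardware: fails fail_left times, then succeeds. */
struct fake_source {
	int fail_left;
};

static int fake_fetch(void *ctx)
{
	struct fake_source *src = ctx;

	if (src->fail_left > 0) {
		src->fail_left--;
		errno = EAGAIN;
		return -1;
	}
	return 0;
}
```

With a fake source that needs four retries, a budget of 2 reports a failure, while a budget of 25 succeeds and reports 4 retries used.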

- Jake

> -----Original Message-----
> From: Stephan Gatzka [mailto:ste...@gm...]
> Sent: Saturday, November 03, 2012 2:45 AM
> To: Richard Cochran
> Cc: Keller, Jacob E; lin...@li...
> Subject: Re: [Linuxptp-users] Fw: “Resource temporarily unavailable”
> errors during flood ping test
> 
> >
> > I just retested the PHYTER under a ping flood, and there were no
> > hiccups. I will test the CPTS again when I get a chance.
> >
> 
> We also have this National Phyter, but connected to a 400MHz MPC5200b.
> We see these hiccups with the default value.
> 
> Regards,
> 
> Stephan