Thread: [Linuxptp-users] Polling ptp clock not allowed?
PTP IEEE 1588 stack for Linux
Brought to you by:
rcochran
From: frank <sou...@gm...> - 2016-02-12 11:09:34
|
Hi, i have an application which polls ptp clock and tries to execute some activitites at specific times. During printf debugging I have seen large jumps in the timespecs returned by clock_gettime(). Using a Intel i5 with a "Intel Corporation Ethernet Connection I217-V (rev 05)" network card around 100-200 calls to clock_gettime() in a loop are enough. I have stripped my problem down to the following code, which fails if between two inner loops the ptp time changes by more than a second: Is this expected? kind regards Frank Using linuxptp-1.6 and kernel 3.16 $ phc2sys -v 1.6-00005-g8a79480 $ uname -v #1 SMP Debian 3.16.7-ckt20-1+deb8u3 (2016-01-17) #include <time.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <assert.h> int main(int argc, char *argv[]) { static const int CLOCKFD = 3; int fd = open("/dev/ptp0", O_RDONLY); clockid_t clkid = ((~(clockid_t)(fd) << 3) | CLOCKFD); struct timespec oldts; struct timespec newts; clock_gettime(clkid, &oldts); for(;;) { clock_gettime(clkid, &newts); time_t delta = newts.tv_sec - oldts.tv_sec; oldts = newts; assert(delta >= 0); assert(delta <= 1); } } |
From: Richard C. <ric...@gm...> - 2016-02-12 14:03:31
|
On Fri, Feb 12, 2016 at 12:09:25PM +0100, frank wrote: > During printf debugging I have seen large jumps in the timespecs > returned by clock_gettime(). > > Using a Intel i5 with a "Intel Corporation Ethernet Connection I217-V > (rev 05)" > network card around 100-200 calls to clock_gettime() in a loop are enough. This uses the e1000e driver, right? That driver has a work around for a HW bug when reading the time, but it is restricted to 82574/82583 devices: drivers/net/ethernet/intel/e1000e/netdev.c:e1000e_cyclecounter_read(): /* errata for 82574/82583 possible bad bits read from SYSTIMH/L * check to see that the time is incrementing at a reasonable * rate and is a multiple of incvalue */ Maybe the i217 is also affected? You could try enabling this work around for i217 as well. > struct timespec oldts; > struct timespec newts; > clock_gettime(clkid, &oldts); > > for(;;) > { > clock_gettime(clkid, &newts); > time_t delta = newts.tv_sec - oldts.tv_sec; > oldts = newts; > assert(delta >= 0); > assert(delta <= 1); Here it would help to see the values of oldts and newts when the assertion fails. > } > } This sounds like the same old hw bug. I would ask intel and/or on the e1000 list. Thanks, Richard |
From: Keller, J. E <jac...@in...> - 2016-02-16 19:53:52
|
On Fri, 2016-02-12 at 15:03 +0100, Richard Cochran wrote: > On Fri, Feb 12, 2016 at 12:09:25PM +0100, frank wrote: > > During printf debugging I have seen large jumps in the timespecs > > returned by clock_gettime(). > > > > Using a Intel i5 with a "Intel Corporation Ethernet Connection > > I217-V > > (rev 05)" > > network card around 100-200 calls to clock_gettime() in a loop are > > enough. > > This uses the e1000e driver, right? That driver has a work around > for > a HW bug when reading the time, but it is restricted to 82574/82583 > devices: > > drivers/net/ethernet/intel/e1000e/netdev.c:e1000e_cyclecounter_read() > : > /* errata for 82574/82583 possible bad bits read from > SYSTIMH/L > * check to see that the time is incrementing at a > reasonable > * rate and is a multiple of incvalue > */ > > Maybe the i217 is also affected? You could try enabling this work > around for i217 as well. > > > struct timespec oldts; > > struct timespec newts; > > clock_gettime(clkid, &oldts); > > > > for(;;) > > { > > clock_gettime(clkid, &newts); > > time_t delta = newts.tv_sec - oldts.tv_sec; > > oldts = newts; > > assert(delta >= 0); > > assert(delta <= 1); > > Here it would help to see the values of oldts and newts when the > assertion fails. > > > } > > } > > This sounds like the same old hw bug. I would ask intel and/or on > the > e1000 list. > > Thanks, > Richard It is possible this is the same bug. Let me know your results, I can forward it to the team responsible. Regards, Jake |
From: frank <sou...@gm...> - 2016-02-24 14:48:19
|
Hi, some updates > Here it would help to see the values of oldts and newts when the > assertion fails. The counter values when the assert happens are: newts.tv_sec = 56c923a3 oldts.tv_sec= 56c810c2 delta=000112e1 The delta at error time is either 000112e1 or 000112e0 The check code looks now: #include <time.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> int main(int argc, char* argv[]) { static const int CLOCKFD = 3; int fd = open("/dev/ptp0", O_RDONLY); if(fd == -1) { printf("open failed\n"); exit(1); } clockid_t clkid = ((~(clockid_t)(fd) << 3) | CLOCKFD); struct timespec oldts; struct timespec newts; int r = clock_gettime(clkid, &oldts); if(r == -1) { printf("clock_gettime failed\n"); exit(1); } for (;;) { int r = clock_gettime(clkid, &newts); if(r == -1) { printf("clock_gettime failed\n"); exit(1); } time_t delta = newts.tv_sec - oldts.tv_sec; if(delta < 0) { printf("newts.tv_sec = %08x\n oldts.tv_sec= %08x\n delta=%08x\n", newts.tv_sec , oldts.tv_sec, delta); exit(1); } if(delta > 1) { printf("newts.tv_sec = %08x\n oldts.tv_sec= %08x\n delta=%08x\n", newts.tv_sec , oldts.tv_sec, delta); exit(1); } oldts = newts; } } wrt: > drivers/net/ethernet/intel/e1000e/netdev.c:e1000e_cyclecounter_read(): The workaround implemented there does not seem to work correctly. Using a card where the workaround jumps in, the problem appears too. There are various versions of the workaround in linux and we backported the linux 4.4 workaround to 3.19, but still the same error happens. Finally a collegue used the patch below and the problem vanished(above test code did run a long time), however it is unclear if the timestamps produced are still 'correct' and useful for ptp. The fix still looks surprising.. kind regards Frank patch against 3.19 kernel. --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -4150,11 +4150,29 @@ static cycle_t e1000e_cyclecounter_read( struct e1000_adapter *adapter = container_of(cc, struct e1000_adapter, cc); struct e1000_hw *hw = &adapter->hw; + u32 systimel_1, systimel_2, systimeh; cycle_t systim, systim_next; - - /* latch SYSTIMH on read of SYSTIML */ - systim = (cycle_t)er32(SYSTIML); - systim |= (cycle_t)er32(SYSTIMH) << 32; + /* SYSTIMH latching upon SYSTIML read does not work well. + * This means that if SYSTIML overflows after we read it but before + * we read SYSTIMH, the value of SYSTIMH has been incremented and we + * will experience a huge non linear increment in the systime value + * to fix that we test for overflow and if true, we re-read systime. + */ + systimel_1 = er32(SYSTIML); + systimeh = er32(SYSTIMH); + systimel_2 = er32(SYSTIML); + /* Check for overflow. If there was no overflow, use the values */ + if (systimel_1 < systimel_2) { + systim = (cycle_t)systimel_1; + systim |= (cycle_t)systimeh << 32; + } else { + /* There was an overflow, read again SYSTIMH, and use + * systimel_2 + */ + systimeh = er32(SYSTIMH); + systim = (cycle_t)systimel_2; + systim |= (cycle_t)systimeh << 32; + } if ((hw->mac.type == e1000_82574) || (hw->mac.type == e1000_82583)) { u64 incvalue, time_delta, rem, temp; @@ -4167,8 +4185,21 @@ static cycle_t e1000e_cyclecounter_read( incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK; for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) { /* latch SYSTIMH on read of SYSTIML */ - systim_next = (cycle_t)er32(SYSTIML); - systim_next |= (cycle_t)er32(SYSTIMH) << 32; + systimel_1 = er32(SYSTIML); + systimeh = er32(SYSTIMH); + systimel_2 = er32(SYSTIML); + /* Check for overflow. If there was no overflow, use the values */ + if (systimel_1 < systimel_2) { + systim_next = (cycle_t)systimel_1; + systim_next |= (cycle_t)systimeh << 32; + } else { + /* There was an overflow, read again SYSTIMH, and use + * systimel_2 + */ + systimeh = er32(SYSTIMH); + systim_next = (cycle_t)systimel_2; + systim_next |= (cycle_t)systimeh << 32; + } time_delta = systim_next - systim; temp = time_delta; |
From: Keller, J. E <jac...@in...> - 2016-02-24 17:23:55
|
Hi, > -----Original Message----- > From: frank [mailto:sou...@gm...] > Sent: Wednesday, February 24, 2016 6:48 AM > To: lin...@li... > Subject: Re: [Linuxptp-users] Polling ptp clock not allowed? > > Hi, > > some updates <snip> > Finally a collegue used the patch below and the problem vanished(above > test code did run a long time), > however it is unclear if the timestamps produced are still 'correct' and > useful for ptp. > The fix still looks surprising.. > > kind regards > Frank > This is the work-around I was talking about below. I am surprised it is not in the 4.4 kernel? This is a known hardware errata on some of the parts (I am not sure which ones) and it should have had the workaround pushed upstream. I am guessing your collegue got this from the out-of-tree driver on SourceForge? Regards, Jake |
From: frank <sou...@gm...> - 2016-02-25 08:02:00
|
On 02/24/2016 06:23 PM, Keller, Jacob E wrote: >> -----Original Message----- >> Finally a collegue used the patch below and the problem vanished(above >> test code did run a long time), >> however it is unclear if the timestamps produced are still 'correct' and >> useful for ptp. >> The fix still looks surprising.. >> >> kind regards >> Frank >> > This is the work-around I was talking about below. I am surprised it is not in the 4.4 kernel? > > This is a known hardware errata on some of the parts (I am not sure which ones) and it should have had the workaround pushed upstream. I am guessing your collegue got this from the out-of-tree driver on SourceForge Hi, no, we looked at the 4.4 Code and backported it to 3.19, but our testcase still showed the problem. Thought "this workaround cannot work if..." and created the patch below ourselfes. Need to check sourceforge.. We were surprised that it helped, basically we did not understand what kind of hardware bug might cause more than one overflow in such a small amount of time. It is not clear to me if the "time" produced by the intel chips using our patch is "valid" at all, or if it would be better just to remove the PTP clock from the affected Intel drivers completely. Backgrund: What I see in the setup currently are spikes in the PTP time distributed by linuxptp. We investigated the components and found a problem in the intel driver and a testcase to reproduce it, developed a patch for the testcase, but the spikes are still there... (We have a beagleboard in the system too and suspect problems which might be related to parallel NFS traffic , but found no simple testcase for this.) What our testcase confirms is that the results from clock_gettime() jumped wildly around before our patch and did not do this after applying the patch. A constant counter or somethings with a jitter having an amplitude below 1s would pass our testcase too... Any opinion very much appreciated. kind regards Frank |
From: frank <sou...@gm...> - 2016-02-25 14:26:33
|
On 02/25/2016 09:01 AM, frank wrote: > On 02/24/2016 06:23 PM, Keller, Jacob E wrote: >>> -----Original Message----- >>> Finally a collegue used the patch below and the problem vanished(above >>> test code did run a long time), >>> however it is unclear if the timestamps produced are still 'correct' and >>> useful for ptp. >>> The fix still looks surprising.. >>> >>> kind regards >>> Frank >>> >> This is the work-around I was talking about below. I am surprised it is not in the 4.4 kernel? >> >> This is a known hardware errata on some of the parts (I am not sure which ones) and it should have had the workaround pushed upstream. I am guessing your collegue got this from the out-of-tree driver on SourceForge > Hi, > > no, we looked at the 4.4 Code and backported it to 3.19, but our > testcase still showed the problem. > Thought "this workaround cannot work if..." and created the patch below > ourselfes. > Need to check sourceforge.. > Hm, the code in e1000e-3.3.3.tar.gz looks like the one in the 4.4 kernel, so I suspect it will fail using the testcase too. regards Frank |
From: Keller, J. E <jac...@in...> - 2016-02-25 17:11:44
|
On Thu, 2016-02-25 at 15:26 +0100, frank wrote: > > On 02/25/2016 09:01 AM, frank wrote: > > On 02/24/2016 06:23 PM, Keller, Jacob E wrote: > > > > -----Original Message----- > > > > Finally a collegue used the patch below and the problem > > > > vanished(above > > > > test code did run a long time), > > > > however it is unclear if the timestamps produced are still > > > > 'correct' and > > > > useful for ptp. > > > > The fix still looks surprising.. > > > > > > > > kind regards > > > > Frank > > > > > > > This is the work-around I was talking about below. I am surprised > > > it is not in the 4.4 kernel? > > > > > > This is a known hardware errata on some of the parts (I am not > > > sure which ones) and it should have had the workaround pushed > > > upstream. I am guessing your collegue got this from the out-of- > > > tree driver on SourceForge > > Hi, > > > > no, we looked at the 4.4 Code and backported it to 3.19, but our > > testcase still showed the problem. > > Thought "this workaround cannot work if..." and created the patch > > below > > ourselfes. > > Need to check sourceforge.. > > > Hm, the code in e1000e-3.3.3.tar.gz looks like the one in the 4.4 > kernel, so I suspect it will fail > using the testcase too. > > regards > Frank How fast are you trying to read the clock in your test case? I suspect that polling too rapidly may be what causes part of the issue? I'm not certain, though. I don't know that particular hardware very well. Thanks, Jake |
From: Keller, J. E <jac...@in...> - 2016-02-25 17:15:53
|
On Thu, 2016-02-25 at 15:26 +0100, frank wrote: > > On 02/25/2016 09:01 AM, frank wrote: > > On 02/24/2016 06:23 PM, Keller, Jacob E wrote: > > > > -----Original Message----- > > > > Finally a collegue used the patch below and the problem > > > > vanished(above > > > > test code did run a long time), > > > > however it is unclear if the timestamps produced are still > > > > 'correct' and > > > > useful for ptp. > > > > The fix still looks surprising.. > > > > > > > > kind regards > > > > Frank > > > > > > > This is the work-around I was talking about below. I am surprised > > > it is not in the 4.4 kernel? > > > > > > This is a known hardware errata on some of the parts (I am not > > > sure which ones) and it should have had the workaround pushed > > > upstream. I am guessing your collegue got this from the out-of- > > > tree driver on SourceForge > > Hi, > > > > no, we looked at the 4.4 Code and backported it to 3.19, but our > > testcase still showed the problem. > > Thought "this workaround cannot work if..." and created the patch > > below > > ourselfes. > > Need to check sourceforge.. > > > Hm, the code in e1000e-3.3.3.tar.gz looks like the one in the 4.4 > kernel, so I suspect it will fail > using the testcase too. > > regards > Frank You can use testptp from Documentation/ptp to perform some sanity checks. Set the clock, wait for a predetermined time and read it. If you modify the program you can even do this within the program to remove human error from the commands. It isn't 100% perfect but it's a good start to make sure things are behaving as expected. Regards, Jake |