Thread: [Linuxptp-users] ptp4l and network connectivity interruption
PTP IEEE 1588 stack for Linux
From: Brian W. <br...@wa...> - 2015-12-08 16:52:51
Sorry if this has been asked before. The archives are unreachable on SourceForge; I keep getting an "Error 403 Read access required" when trying to view the list archives.

I am having an issue with the ptp4l client and network connectivity. The client works just fine and syncs the hardware clock on an Intel e1000 device. However, if anything interrupts that connectivity for a couple of seconds, the clock seems to drop the fact that it is synced to a TAI time source with a leap second offset. It will panic that it is behind and jump forward 36 seconds (the current leap second offset). Then, a few seconds later, when connectivity is restored and the clock has resynced, it realizes it is now 36 seconds fast and takes 20 minutes or more to work back to the correct time.

I am able to reproduce this by temporarily blocking access to the 1588 UDP ports 319 and 320 through iptables. Wait a few seconds and the clock will jump ahead by the leap second offset. Unblock the UDP ports and the clock begins the long process of adjusting back to the actual time.

Is there a setting that I have missed or something I have overlooked? The ptp4l client does not have many options. I would think that the clock should maintain the last known offset during the brief interruption.

Thanks,
Brian
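To put the reported numbers in perspective, here is a minimal sketch of the average frequency correction implied by working off a 36 s error in about 20 minutes. It uses only the two figures from the message above; the function name `average_slew_ppm` is made up for illustration and is not part of linuxptp, and a real servo does not slew at a constant rate.

```python
def average_slew_ppm(error_s: float, recovery_s: float) -> float:
    """Average frequency offset, in parts per million, needed to remove
    error_s seconds of clock error within recovery_s seconds by slewing."""
    if recovery_s <= 0:
        raise ValueError("recovery_s must be positive")
    return error_s * 1e6 / recovery_s


if __name__ == "__main__":
    # 36 s of error corrected over 20 minutes (figures from the report).
    print(f"{average_slew_ppm(36.0, 20 * 60):.0f} ppm")
```

For the reported case this works out to roughly 30000 ppm, that is, the clock would have to run about 3% slow on average for the whole 20 minutes.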
From: Richard C. <ric...@gm...> - 2015-12-10 09:25:44
On Tue, Dec 08, 2015 at 11:27:29AM -0500, Brian Walsh wrote:
> Sorry if this has been asked before. The archives are unreachable on
> sourceforge. I keep getting an "Error 403 Read access required" when trying
> to view the list archives.

Yes, SF does have issues, and I want to move away from there, eventually. In the meantime, you can use the archives on Gmane:

http://news.gmane.org/gmane.comp.linux.ptp.user
http://news.gmane.org/gmane.comp.linux.ptp.devel

> I am having an issue with the ptp4l client and network connectivity. The
> client works just fine and syncs the hardware clock on an Intel e1000
> device.

Which device?

> However, if anything interrupts that connectivity for a couple of
> seconds the clock seems to drop the fact that it is synced to a TAI time
> source with a leap second offset. It will panic that it is behind and jump
> forward 36 seconds (the current leap second offset). Then a few seconds
> later when connectivity is restored and resynced, it realizes it is now 36
> seconds fast and takes 20 minutes or more to work back to the correct time.

IIRC, this problem is due to the fact that the e1000 HW and driver require a complete reset when the link goes down. The old time values get lost, and the driver simply initializes the clock with the current system time.

> I am able to reproduce this by temporarily blocking access to 1588 udp
> ports 319 and 320 through iptables. Wait a few seconds and the clock will
> jump ahead by the leap second offset. Unblock the udp ports and then the
> clock begins the long process of adjusting back to the actual time.

Hm, I wouldn't expect that behavior, but it does sound like the link loss symptom.

> Is there a setting that I have missed or something I have overlooked? The
> ptp4l client does not have many options. I would think that the clock
> should maintain the last known offset during the brief interruption.

I think the source of the jump is not in ptp4l but rather in the driver or HW.

Thanks,
Richard
From: Brian W. <br...@wa...> - 2015-12-10 17:06:46
On Thu, Dec 10, 2015 at 4:25 AM, Richard Cochran <ric...@gm...> wrote:
>> I am having an issue with the ptp4l client and network connectivity. The
>> client works just fine and syncs the hardware clock on an Intel e1000
>> device.
>
> Which device?

It is an Intel 82574L. 8086:10d3

>> However, if anything interrupts that connectivity for a couple of
>> seconds the clock seems to drop the fact that it is synced to a TAI time
>> source with a leap second offset. It will panic that it is behind and jump
>> forward 36 seconds (the current leap second offset). Then a few seconds
>> later when connectivity is restored and resynced, it realizes it is now 36
>> seconds fast and takes 20 minutes or more to work back to the correct time.
>
> IIRC, this problem is due to the fact that the e1000 HW and driver require
> a complete reset when the link goes down. The old time values get
> lost, and the driver simply initializes the clock with the current
> system time.
>
>> I am able to reproduce this by temporarily blocking access to 1588 udp
>> ports 319 and 320 through iptables. Wait a few seconds and the clock will
>> jump ahead by the leap second offset. Unblock the udp ports and then the
>> clock begins the long process of adjusting back to the actual time.
>
> Hm, I wouldn't expect that behavior, but it does sound like the link
> loss symptom.
>
>> Is there a setting that I have missed or something I have overlooked? The
>> ptp4l client does not have many options. I would think that the clock
>> should maintain the last known offset during the brief interruption.
>
> I think the source of the jump is not in ptp4l but rather in the
> driver or HW.

I am running tests using kernel version 4.1.7. I will try and trace it down some more.

Looking again, it appears it may be the opposite of what I thought. ptp4l is maintaining the offset value while the hardware clock has switched back to UTC time. I am not seeing anywhere that ptp4l is resetting the offset to 0 during this state.

Connectivity working:

root@host:~> phc_ctl eth0 cmp get
phc_ctl[92833.880]: offset from CLOCK_REALTIME is -36000012151ns
phc_ctl[92833.880]: clock time is 1449766596.912500774 or Thu Dec 10 16:56:36 2015

Ports blocked:

root@host:~> phc_ctl eth0 cmp get
phc_ctl[92834.718]: offset from CLOCK_REALTIME is 7518ns
phc_ctl[92834.719]: clock time is 1449766561.750694117 or Thu Dec 10 16:56:01 2015
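The two `phc_ctl ... cmp` readings above can be told apart mechanically. The following is a rough sketch, not part of linuxptp: it parses the "offset from CLOCK_REALTIME" line in the format shown above and reports whether the hardware clock looks like it is on TAI (about 36 s away from the system clock) or has fallen back to system time. The helper names, the 36 s constant, and the 1 ms tolerance are assumptions chosen for illustration; the magnitude of the offset is used so that the sketch does not depend on the sign convention.

```python
import re

# TAI-UTC offset quoted in this thread (valid for late 2015); an assumption
# for any other date.
TAI_UTC_OFFSET_NS = 36 * 1_000_000_000

_OFFSET_RE = re.compile(r"offset from CLOCK_REALTIME is (-?\d+)ns")


def parse_offset_ns(line: str) -> int:
    """Extract the offset in nanoseconds from one line of phc_ctl output."""
    match = _OFFSET_RE.search(line)
    if match is None:
        raise ValueError(f"no offset found in: {line!r}")
    return int(match.group(1))


def classify(offset_ns: int, tolerance_ns: int = 1_000_000) -> str:
    """Return 'tai', 'system' or 'unknown' for a PHC-vs-system-clock offset."""
    if abs(abs(offset_ns) - TAI_UTC_OFFSET_NS) <= tolerance_ns:
        return "tai"      # PHC differs from the system clock by ~36 s
    if abs(offset_ns) <= tolerance_ns:
        return "system"   # PHC matches the system clock: it was reset
    return "unknown"


if __name__ == "__main__":
    for line in (
        "phc_ctl[92833.880]: offset from CLOCK_REALTIME is -36000012151ns",
        "phc_ctl[92834.718]: offset from CLOCK_REALTIME is 7518ns",
    ):
        print(classify(parse_offset_ns(line)))
```

Run against the two samples above, the first reading is classified as TAI and the second as system time, which matches the observation that the hardware clock, not ptp4l, lost the offset.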
From: Richard C. <ric...@gm...> - 2015-12-11 15:30:21
On Thu, Dec 10, 2015 at 12:06:19PM -0500, Brian Walsh wrote:
> It is an Intel 82574L. 8086:10d3

Ok, I have that card. The driver is the e1000e (and not the e1000). Can you send me your iptables script so that I can try and reproduce the problem?

> Looking again it appears it may be the opposite of what I thought.
> ptp4l is maintaining the offset value while the hardware clock has
> switched back to UTC time. I am not seeing anywhere that ptp4l is
> resetting the offset to 0 during this state.

Right, it is in the driver or HW. I remember that card resetting the clock after link loss. I complained about this, but Intel said it was a HW limitation, IIRC.

However, I wouldn't expect this to happen just from the action of the firewall. That sounds more like a driver bug.

Thanks,
Richard
From: Brian W. <br...@wa...> - 2015-12-11 20:40:26
On Fri, 11 Dec 2015, Richard Cochran wrote:
> > It is an Intel 82574L. 8086:10d3
>
> Ok, I have that card. The driver is the e1000e (and not the e1000).
> Can you send me your iptables script so that I can try and reproduce
> the problem?

I am just dropping UDP packets on INPUT for ports 319 and 320:

    iptables -A INPUT -p udp --dport 319 -j DROP
    iptables -A INPUT -p udp --dport 320 -j DROP

After a few seconds I just delete those rules.

> > Looking again it appears it may be the opposite of what I thought.
> > ptp4l is maintaining the offset value while the hardware clock has
> > switched back to UTC time. I am not seeing anywhere that ptp4l is
> > resetting the offset to 0 during this state.
>
> Right, it is in the driver or HW. I remember that card resetting the
> clock after link loss. I complained about this, but Intel said it was
> a HW limitation, IIRC.
>
> However, I wouldn't expect this to happen just from the action of the
> firewall. That sounds more like a driver bug.

I was looking at the linuxptp code to see if it could possibly detect the condition. It does detect the initial jump when the hardware starts receiving packets again. Maybe it could check the jump against the last known offset value: if the jump is close to the offset, have it wait for a few packets while the device settles before trusting that jump.

Brian
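The reproduction described above (add the two DROP rules, wait, delete them) could be scripted roughly as follows. This is a sketch only: `rule_cmd` and `block_ptp` are hypothetical helper names, the wait time is an arbitrary choice, and the deletion uses `iptables -D` with the same rule specification, which is an assumption about how the rules were removed in the original test. Actually running it requires root.

```python
import subprocess
import time

PTP_UDP_PORTS = (319, 320)  # PTP event and general message ports


def rule_cmd(action, port):
    """Build the iptables command that appends ('-A') or deletes ('-D')
    a DROP rule for inbound UDP traffic to the given port."""
    return ["iptables", action, "INPUT", "-p", "udp",
            "--dport", str(port), "-j", "DROP"]


def block_ptp(seconds, run=subprocess.run):
    """Drop inbound PTP traffic for `seconds`, then remove the rules.

    `run` is injectable so the command sequence can be inspected
    without touching the firewall.
    """
    for port in PTP_UDP_PORTS:
        run(rule_cmd("-A", port), check=True)
    try:
        time.sleep(seconds)
    finally:
        # Remove the rules even if the wait is interrupted.
        for port in PTP_UDP_PORTS:
            run(rule_cmd("-D", port), check=True)
```

Calling `block_ptp(5)` as root would block both ports for five seconds; comparing `phc_ctl eth0 cmp` before, during, and after the block would then show whether the hardware clock was reset.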
From: Richard C. <ric...@gm...> - 2015-12-12 17:50:55
On Fri, Dec 11, 2015 at 03:09:56PM -0500, Brian Walsh wrote:
> I was looking at the linuxptp code to see if it could possibly detect the
> condition. It does detect the initial jump when the hardware starts
> receiving packets again. Maybe it could check the jump against the last
> known offset value. Have it wait for a few packets while the device
> settles before trusting that jump if it is close to the offset.

This is definitely a driver bug.

Looking at drivers/net/ethernet/intel/e1000e/netdev.c, in the function e1000e_config_hwtstamp(), the time is reset whenever time stamping is activated. That doesn't make any sense.

It looks like the calls to e1000e_get_base_timinca() and timecounter_init() are misplaced. They should go into the probe function instead.

Thanks,
Richard
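The effect of that misplaced reset can be illustrated with a toy model. The sketch below is not the e1000e code; `PhcModel` and its methods are invented for illustration and only track the difference between the hardware clock and the system clock. It assumes, as described above, that the buggy driver re-initializes the clock from system time every time time stamping is configured, while the proposed behavior initializes it once at probe time.

```python
TAI_UTC_NS = 36 * 1_000_000_000  # TAI-UTC offset quoted in the thread


class PhcModel:
    """Toy model of a driver-maintained PTP hardware clock.

    Only the offset between the PHC and the system clock is tracked.
    """

    def __init__(self, reset_on_hwtstamp_config):
        self.reset_on_hwtstamp_config = reset_on_hwtstamp_config
        # Probe: the clock starts out equal to the system time.
        self.offset_ns = 0

    def step(self, delta_ns):
        """A PTP daemon steps the clock, e.g. onto the TAI timescale."""
        self.offset_ns += delta_ns

    def config_hwtstamp(self):
        """Time stamping is (re)configured on the interface."""
        if self.reset_on_hwtstamp_config:
            # Buggy behavior: the clock is re-initialized from system time,
            # discarding whatever the daemon had set.
            self.offset_ns = 0


if __name__ == "__main__":
    for buggy in (True, False):
        phc = PhcModel(reset_on_hwtstamp_config=buggy)
        phc.step(TAI_UTC_NS)    # daemon synchronizes the PHC to TAI
        phc.config_hwtstamp()   # time stamping gets configured again
        print("buggy" if buggy else "fixed", phc.offset_ns)
```

In the buggy variant the 36 s offset is gone after the reconfiguration, which is the jump seen in the report; in the fixed variant the offset survives.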
From: Brian W. <br...@wa...> - 2015-12-12 20:48:15
On Sat, Dec 12, 2015 at 06:50:45PM +0100, Richard Cochran wrote:
> This is definitely a driver bug.
>
> Looking at drivers/net/ethernet/intel/e1000e/netdev.c, in the function
> e1000e_config_hwtstamp(), the time is reset whenever time stamping is
> activated. That doesn't make any sense.
>
> It looks like the calls to e1000e_get_base_timinca() and
> timecounter_init() are misplaced. They should go into the probe
> function instead.

I see that. Comparing that code to what happens in the ixgbe driver, it looks like resetting the clock should be part of e1000e_ptp_init. Then the e1000e_ptp_init code should be called in the device open function, so that the clock is initialized whenever the device is made active. Pull ptp_init out of the probe function.

I will see what I can put together to test, based on the ixgbe code.

Brian
From: Richard C. <ric...@gm...> - 2015-12-12 20:50:45
On Sat, Dec 12, 2015 at 03:18:01PM -0500, Brian Walsh wrote:
> I see that. Comparing that code to what happens in the ixgbe driver it
> looks like resetting the clock should be part of e1000e_ptp_init. Then
> the e1000e_ptp_init code should be called in the device open function to
> initialize whenever the device is made active. Pull ptp_init out of the
> probe function.

Sorry, I mixed up the Intel cards WRT the unfortunate HW limitation. The 82574 does not need to reset the clock at link loss, or at least it doesn't appear to need it.

I wouldn't follow ixgbe, because putting the reset in the 'open' method means that the clock will be reset during ifup/ifdown. For the ixgbe this is necessary, IIRC, but I wouldn't do it unless you are absolutely forced to by some HW quirk.

I would try something like the following untested patch...

Thanks,
Richard

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 0a854a4..1823148 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3732,16 +3732,6 @@ static int e1000e_config_hwtstamp(struct e1000_adapter *adapter,
 	er32(RXSTMPH);
 	er32(TXSTMPH);
 
-	/* Get and set the System Time Register SYSTIM base frequency */
-	ret_val = e1000e_get_base_timinca(adapter, &regval);
-	if (ret_val)
-		return ret_val;
-	ew32(TIMINCA, regval);
-
-	/* reset the ns time counter */
-	timecounter_init(&adapter->tc, &adapter->cc,
-			 ktime_to_ns(ktime_get_real()));
-
 	return 0;
 }
 
@@ -6980,6 +6970,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	u16 eeprom_data = 0;
 	u16 eeprom_apme_mask = E1000_EEPROM_APME;
 	s32 rval = 0;
+	u32 regval;
 
 	if (ei->flags2 & FLAG2_DISABLE_ASPM_L0S)
 		aspm_disable_flag = PCIE_LINK_STATE_L0S;
@@ -7270,6 +7261,16 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/* carrier off reporting is important to ethtool even BEFORE open */
 	netif_carrier_off(netdev);
 
+	/* Get and set the System Time Register SYSTIM base frequency */
+	err = e1000e_get_base_timinca(adapter, &regval);
+	if (err)
+		goto err_register;
+	ew32(TIMINCA, regval);
+
+	/* reset the ns time counter */
+	timecounter_init(&adapter->tc, &adapter->cc,
+			 ktime_to_ns(ktime_get_real()));
+
 	/* init PTP hardware clock */
 	e1000e_ptp_init(adapter);
 
From: Brian W. <br...@wa...> - 2015-12-12 21:00:50
On Sat, Dec 12, 2015 at 09:50:32PM +0100, Richard Cochran wrote:
> I wouldn't follow ixgbe, because putting the reset in the 'open'
> method means that the clock will be reset during ifup/ifdown. For
> the ixgbe this is necessary, IIRC, but I wouldn't do it unless you are
> absolutely forced to by some HW quirk.

I was not sure whether having it reset during ifup makes more sense. Does the clock go away when the interface is down? I can't test that right now. It is my primary interface, so it is always up on my device.

> I would try something like the following untested patch...

I just finished an initial test after quickly making the same changes you sent. It looks like it fixes the problem I was seeing.

Makes sense. Stop resetting the clock and it will not reset.

Brian
From: Richard C. <ric...@gm...> - 2015-12-12 21:09:57
On Sat, Dec 12, 2015 at 03:58:31PM -0500, Brian Walsh wrote:
> I was not sure if having it reset during ifup makes more sense. Does the
> clock go away when the interface is down? I can't test that right now.
> It is my primary interface so it is always up on my device.

The /dev/ptpX persists from ptp_clock_register() until ptp_clock_unregister(). Ideally, the clock should appear when the device is probed and stay running until either the HW is unplugged or the driver gets unloaded.

There are some HW designs out there that cause the clock to go away or become unusable when the link state changes, but I think the 82574 does not have those kinds of issues.

> Makes sense. Stop resetting the clock and it will not reset.

Yup.

Thanks,
Richard
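Whether the clock device stays registered across ifdown/ifup can be checked from user space. The sketch below lists the /dev/ptpX nodes associated with a network interface by looking under sysfs. It assumes the common layout /sys/class/net/<iface>/device/ptp/ptpN, which may not hold for every driver; `phc_devices` is a hypothetical helper, and the `sysfs_root` parameter exists only so the function can be exercised against a fake directory tree.

```python
from pathlib import Path


def phc_devices(iface, sysfs_root="/sys"):
    """Return the /dev/ptpN paths registered for `iface`, or an empty
    list if the interface has no PTP clock (or does not exist)."""
    ptp_dir = Path(sysfs_root) / "class" / "net" / iface / "device" / "ptp"
    if not ptp_dir.is_dir():
        return []
    return sorted("/dev/" + entry.name for entry in ptp_dir.iterdir())


if __name__ == "__main__":
    # "eth0" is just the interface name used earlier in this thread.
    print(phc_devices("eth0"))
```

Running this once with the interface up and once with it down would show whether the same clock device is still present in both states.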