Thread: [Linuxptp-users] ptp4l and network connectivity interruption

[Linuxptp-users] ptp4l and network connectivity interruption

From: Brian W. <br...@wa...> - 2015-12-08 16:52:51

Sorry if this has been asked before. The archives are unreachable on
sourceforge. I keep getting an "Error 403 Read access required" when trying
to view the list archives.

I am having an issue with the ptp4l client and network connectivity. The
client works just fine and syncs the hardware clock on an Intel e1000
device. However, if anything interrupts that connectivity for a couple of
seconds the clock seems to drop the fact that it is synced to a TAI time
source with a leap second offset. It will panic that it is behind and jump
forward 36 seconds (the current leap second offset). Then a few seconds
later when connectivity is restored and resynced, it realizes it is now 36
seconds fast and takes 20 minutes or more to work back to the correct time.

I am able to reproduce this by temporarily blocking access to 1588 udp
ports 319 and 320 through iptables. Wait a few seconds and the clock will
jump ahead by the leap second offset. Unblock the udp ports and then the
clock begins the long process of adjusting back to the actual time.

Is there a setting that I have missed or something I have overlooked? The
ptp4l client does not have many options. I would think that the clock
should maintain the last known offset during the brief interruption.

Thanks,
Brian

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Richard C. <ric...@gm...> - 2015-12-10 09:25:44

On Tue, Dec 08, 2015 at 11:27:29AM -0500, Brian Walsh wrote:
> Sorry if this has been asked before. The archives are unreachable on
> sourceforge. I keep getting an "Error 403 Read access required" when trying
> to view the list archives.

Yes, SF does have issues, and I want to move away from there,
eventually.  In the meantime, you can use the archives on Gmane:

  http://news.gmane.org/gmane.comp.linux.ptp.user
  http://news.gmane.org/gmane.comp.linux.ptp.devel
 
> I am having an issue with the ptp4l client and network connectivity. The
> client works just fine and syncs the hardware clock on an Intel e1000
> device.

Which device?

> However, if anything interrupts that connectivity for a couple of
> seconds the clock seems to drop the fact that it is synced to a TAI time
> source with a leap second offset. It will panic that it is behind and jump
> forward 36 seconds (the current leap second offset). Then a few seconds
> later when connectivity is restored and resynced, it realizes it is now 36
> seconds fast and takes 20 minutes or more to work back to the correct time.

IIRC, this problem is due to the fact the e1000 HW and driver require
a complete reset when the link goes down.  The old time values get
lost, and the driver simply initializes the clock with the current
system time.
 
> I am able to reproduce this by temporarily blocking access to 1588 udp
> ports 319 and 320 through iptables. Wait a few seconds and the clock will
> jump ahead by the leap second offset. Unblock the udp ports and then the
> clock begins the long process of adjusting back to the actual time.

Hm, I wouldn't expect that behavior, but it does sound like the link
loss symptom.
 
> Is there a setting that I have missed or something I have overlooked? The
> ptp4l client does not have many options. I would think that the clock
> should maintain the last known offset during the brief interruption.

I think the source of the jump is not in ptp4l but rather in the
driver or HW.

Thanks,
Richard

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Brian W. <br...@wa...> - 2015-12-10 17:06:46

On Thu, Dec 10, 2015 at 4:25 AM, Richard Cochran
<ric...@gm...> wrote:
>> I am having an issue with the ptp4l client and network connectivity. The
>> client works just fine and syncs the hardware clock on an Intel e1000
>> device.
>
> Which device?

It is an Intel 82574L. 8086:10d3

>> However, if anything interrupts that connectivity for a couple of
>> seconds the clock seems to drop the fact that it is synced to a TAI time
>> source with a leap second offset. It will panic that it is behind and jump
>> forward 36 seconds (the current leap second offset). Then a few seconds
>> later when connectivity is restored and resynced, it realizes it is now 36
>> seconds fast and takes 20 minutes or more to work back to the correct time.
>
> IIRC, this problem is due to the fact the e1000 HW and driver require
> a complete reset when the link goes down.  The old time values get
> lost, and the driver simply initializes the clock with the current
> system time.
>
>> I am able to reproduce this by temporarily blocking access to 1588 udp
>> ports 319 and 320 through iptables. Wait a few seconds and the clock will
>> jump ahead by the leap second offset. Unblock the udp ports and then the
>> clock begins the long process of adjusting back to the actual time.
>
> Hm, I wouldn't expect that behavior, but it does sound like the link
> loss symptom.
>
>> Is there a setting that I have missed or something I have overlooked? The
>> ptp4l client does not have many options. I would think that the clock
>> should maintain the last known offset during the brief interruption.
>
> I think the source of the jump is not in ptp4l but rather in the
> driver or HW.

I am running tests using kernel version 4.1.7. I will try and trace it
down some more.

Looking again it appears it may be the opposite of what I thought.
ptp4l is maintaining the offset value while the hardware clock has
switched back to UTC time. I am not seeing anywhere that ptp4l is
resetting the offset to 0 during this state.

Connectivity working:
root@host:~> phc_ctl eth0 cmp get
phc_ctl[92833.880]: offset from CLOCK_REALTIME is -36000012151ns
phc_ctl[92833.880]: clock time is 1449766596.912500774 or Thu Dec 10 16:56:36 2015

Ports blocked:
root@host:~> phc_ctl eth0 cmp get
phc_ctl[92834.718]: offset from CLOCK_REALTIME is 7518ns
phc_ctl[92834.719]: clock time is 1449766561.750694117 or Thu Dec 10 16:56:01 2015
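The two readings above can be told apart mechanically. The following is a minimal C sketch, not part of linuxptp, that classifies such an offset. It assumes the value phc_ctl prints is CLOCK_REALTIME minus the PHC time (which is what the two samples imply), that CLOCK_REALTIME is on UTC, and that the TAI-UTC offset is known to the caller (36 s in this thread). The names classify_phc_offset and phc_timescale are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical labels for what the PHC appears to be tracking. */
enum phc_timescale { PHC_ON_UTC, PHC_ON_TAI, PHC_UNKNOWN };

/*
 * Classify an offset reading.
 * realtime_minus_phc_ns: assumed to be CLOCK_REALTIME - PHC, in ns.
 * utc_offset_s:          TAI - UTC in seconds, supplied by the caller.
 * tolerance_ns:          how close a reading must be to count as a match.
 */
enum phc_timescale classify_phc_offset(int64_t realtime_minus_phc_ns,
                                       int utc_offset_s,
                                       int64_t tolerance_ns)
{
    /* A PHC on TAI is ahead of UTC, so REALTIME - PHC is negative. */
    int64_t tai_gap_ns = -(int64_t)utc_offset_s * 1000000000LL;

    assert(tolerance_ns >= 0);

    if (llabs((long long)realtime_minus_phc_ns) <= tolerance_ns)
        return PHC_ON_UTC;
    if (llabs((long long)(realtime_minus_phc_ns - tai_gap_ns)) <= tolerance_ns)
        return PHC_ON_TAI;
    return PHC_UNKNOWN;
}
```

With a 1 ms tolerance, the first reading (-36000012151 ns) classifies as TAI and the second (7518 ns) as UTC, which matches the observation that the hardware clock fell back to system time while the ports were blocked.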

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Richard C. <ric...@gm...> - 2015-12-11 15:30:21

On Thu, Dec 10, 2015 at 12:06:19PM -0500, Brian Walsh wrote:
> It is an Intel 82574L. 8086:10d3

Ok, I have that card.  The driver is the e1000e (and not the e1000).
Can you send me your iptables script so that I can try and reproduce
the problem?

> Looking again it appears it may be the opposite of what I thought.
> ptp4l is maintaining the offset value while the hardware clock has
> switched back to UTC time. I am not seeing anywhere that ptp4l is
> resetting the offset to 0 during this state.

Right, it is in the driver or HW.  I remember that card resetting the
clock after link loss.  I complained about this, but Intel said it was
a HW limitation, IIRC.

However, I wouldn't expect this to happen just from the action of the
firewall.  That sounds more like a driver bug.

Thanks,
Richard

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Brian W. <br...@wa...> - 2015-12-11 20:40:26

On Fri, 11 Dec 2015, Richard Cochran wrote:
> > It is an Intel 82574L. 8086:10d3
> 
> Ok, I have that card.  The driver is the e1000e (and not the e1000).
> Can you send me your iptables script so that I can try and reproduce
> the problem?

I am just dropping udp packets on INPUT for ports 319 and 320

iptables -A INPUT -p udp --dport 319 -j DROP
iptables -A INPUT -p udp --dport 320 -j DROP

After a few seconds I just delete those rules.
 
> > Looking again it appears it may be the opposite of what I thought.
> > ptp4l is maintaining the offset value while the hardware clock has
> > switched back to UTC time. I am not seeing anywhere that ptp4l is
> > resetting the offset to 0 during this state.
> 
> Right, it is in the driver or HW.  I remember that card resetting the
> clock after link loss.  I complained about this, but Intel said it was
> a HW limitation, IIRC.
> 
> However, I wouldn't expect this to happen just from the action of the
> firewall.  That sounds more like a driver bug.
> 

I was looking at the linuxptp code to see if it could possibly detect the
condition. It does detect the initial jump when the hardware starts
receiving packets again. Maybe it could check the jump against the last
known offset value. Have it wait for a few packets while the device
settles before trusting that jump if it is close to the offset.

Brian
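The hold-off idea in this message could be sketched roughly as follows. This is a hypothetical illustration in C and not linuxptp code: the struct jump_guard, its fields, and the thresholds are all assumptions. A large offset whose magnitude is close to the last known UTC offset is treated as suspicious and withheld for a few samples before it is trusted; any other large offset is passed through right away.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum jump_verdict { JUMP_NONE, JUMP_HOLD, JUMP_ACCEPT };

/* Hypothetical state for the hold-off check; none of this exists in linuxptp. */
struct jump_guard {
    int64_t utc_offset_ns;     /* last known TAI-UTC offset, in ns */
    int64_t tolerance_ns;      /* how close to that offset counts as suspicious */
    int64_t step_threshold_ns; /* offsets below this are ordinary servo work */
    int hold_samples;          /* suspicious samples to sit out before trusting */
    int held;                  /* suspicious samples withheld so far */
};

enum jump_verdict jump_guard_check(struct jump_guard *g, int64_t offset_ns)
{
    int64_t mag;

    assert(g != NULL);
    mag = (int64_t)llabs((long long)offset_ns);

    if (mag < g->step_threshold_ns) {
        g->held = 0;            /* back to normal, forget any pending jump */
        return JUMP_NONE;
    }
    if (llabs((long long)(mag - g->utc_offset_ns)) > g->tolerance_ns) {
        g->held = 0;            /* large, but not leap-offset shaped */
        return JUMP_ACCEPT;
    }
    if (g->held < g->hold_samples) {
        g->held++;              /* looks like a lost UTC offset: wait */
        return JUMP_HOLD;
    }
    return JUMP_ACCEPT;         /* it persisted, so act on it */
}
```

A caller would run the check on each offset sample and skip the clock step while the verdict is JUMP_HOLD. Whether a guard like this belongs in the servo at all is a design question; it only masks the symptom if the clock is being reset underneath ptp4l.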

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Richard C. <ric...@gm...> - 2015-12-12 17:50:55

On Fri, Dec 11, 2015 at 03:09:56PM -0500, Brian Walsh wrote:
> I was looking at the linuxptp code to see if it could possibly detect the
> condition. It does detect the initial jump when the hardware starts
> receiving packets again. Maybe it could check the jump against the last
> known offset value. Have it wait for a few packets while the device
> settles before trusting that jump if it is close to the offset.

This is definitely a driver bug.

Looking at drivers/net/ethernet/intel/e1000e/netdev.c, in the function
e1000e_config_hwtstamp(), the time is reset whenever time stamping is
activated.  That doesn't make any sense.

It looks like the calls to e1000e_get_base_timinca() and
timecounter_init() are misplaced.  They should go into the probe
function instead.

Thanks,
Richard
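The effect of resetting the time counter on every time stamping activation can be modeled in user space. The sketch below is a toy model under stated assumptions, not driver code: struct toy_phc and the toy_* functions are hypothetical stand-ins, system time is frozen for simplicity, and the 36 s step stands for ptp4l moving the clock onto TAI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Toy stand-in for the driver's software time counter. */
struct toy_phc {
    int64_t ns;
};

/* Probe: seed the clock once from system (UTC) time. */
void toy_probe(struct toy_phc *phc, int64_t system_utc_ns)
{
    assert(phc != NULL);
    phc->ns = system_utc_ns;
}

/* What a PTP daemon does when it steps the clock. */
void toy_step(struct toy_phc *phc, int64_t delta_ns)
{
    assert(phc != NULL);
    phc->ns += delta_ns;
}

/* Behavior described above: activation re-seeds the clock from system time. */
void toy_config_hwtstamp_reset(struct toy_phc *phc, int64_t system_utc_ns)
{
    assert(phc != NULL);
    phc->ns = system_utc_ns;
}

/* Alternative: activation leaves the running clock untouched. */
void toy_config_hwtstamp_keep(struct toy_phc *phc, int64_t system_utc_ns)
{
    assert(phc != NULL);
    (void)system_utc_ns; /* intentionally unused */
}
```

In this model, probing at 1000 s, stepping by 36 s, and then calling toy_config_hwtstamp_reset() leaves the clock at 1000 s again, so the 36 s are lost. With toy_config_hwtstamp_keep() the clock stays at 1036 s.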

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Brian W. <br...@wa...> - 2015-12-12 20:48:15

On Sat, Dec 12, 2015 at 06:50:45PM +0100, Richard Cochran wrote:
> This is definitely a driver bug.
> 
> Looking at drivers/net/ethernet/intel/e1000e/netdev.c, in the function
> e1000e_config_hwtstamp(), the time is reset whenever time stamping is
> activated.  That doesn't make any sense.
> 
> It looks like the calls to e1000e_get_base_timinca() and
> timecounter_init() are misplaced.  They should go into the probe
> function instead.

I see that. Comparing that code to what happens in the ixgbe driver it
looks like resetting the clock should be part of e1000e_ptp_init. Then
the e1000e_ptp_init code should be called in the device open function to
initialize whenever the device is made active. Pull ptp_init out of the
probe function.

I will see what I can put together to test based off of the ixgbe code.

Brian

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Richard C. <ric...@gm...> - 2015-12-12 20:50:45

On Sat, Dec 12, 2015 at 03:18:01PM -0500, Brian Walsh wrote:
> I see that. Comparing that code to what happens in the ixgbe driver it
> looks like resetting the clock should be part of e1000e_ptp_init. Then
> the e1000e_ptp_init code should be called in the device open function to
> initialize whenever the device is made active. Pull ptp_init out of the
> probe function.

Sorry, I mixed up the Intel cards WRT the unfortunate HW limitation.
The 82574 does not need to reset the clock at link loss, or at least
it doesn't appear to need it.

I wouldn't follow ixgbe, because putting the reset in the 'open'
method means that the clock will be reset during ifup/ifdown.  For
the ixgbe this is necessary, IIRC, but I wouldn't do it unless you are
absolutely forced to by some HW quirk.

I would try something like the following untested patch...

Thanks,
Richard


diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 0a854a4..1823148 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -3732,16 +3732,6 @@ static int e1000e_config_hwtstamp(struct e1000_adapter *adapter,
 	er32(RXSTMPH);
 	er32(TXSTMPH);
 
-	/* Get and set the System Time Register SYSTIM base frequency */
-	ret_val = e1000e_get_base_timinca(adapter, &regval);
-	if (ret_val)
-		return ret_val;
-	ew32(TIMINCA, regval);
-
-	/* reset the ns time counter */
-	timecounter_init(&adapter->tc, &adapter->cc,
-			 ktime_to_ns(ktime_get_real()));
-
 	return 0;
 }
 
@@ -6980,6 +6970,7 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	u16 eeprom_data = 0;
 	u16 eeprom_apme_mask = E1000_EEPROM_APME;
 	s32 rval = 0;
+	u32 regval;
 
 	if (ei->flags2 & FLAG2_DISABLE_ASPM_L0S)
 		aspm_disable_flag = PCIE_LINK_STATE_L0S;
@@ -7270,6 +7261,16 @@ static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/* carrier off reporting is important to ethtool even BEFORE open */
 	netif_carrier_off(netdev);
 
+	/* Get and set the System Time Register SYSTIM base frequency */
+	err = e1000e_get_base_timinca(adapter, &regval);
+	if (err)
+		goto err_register;
+	ew32(TIMINCA, regval);
+
+	/* reset the ns time counter */
+	timecounter_init(&adapter->tc, &adapter->cc,
+			 ktime_to_ns(ktime_get_real()));
+
 	/* init PTP hardware clock */
 	e1000e_ptp_init(adapter);

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Brian W. <br...@wa...> - 2015-12-12 21:00:50

On Sat, Dec 12, 2015 at 09:50:32PM +0100, Richard Cochran wrote:
> I wouldn't follow ixgbe, because putting the reset in the 'open'
> method means that the clock will be reset during ifup/ifdown.  For
> the ixgbe this is necessary, IIRC, but I wouldn't do it unless you are
> absolutely forced to by some HW quirk.

I was not sure if having it reset during ifup makes more sense. Does the
clock go away when the interface is down? I can't test that right now.
It is my primary interface so it is always up on my device.

> I would try something like the following untested patch...

Just finished an initial test after quickly applying the changes you
sent. Looks like it fixes the problem I was seeing.

Makes sense. Stop resetting the clock and it will not reset.

Brian

Re: [Linuxptp-users] ptp4l and network connectivity interruption

From: Richard C. <ric...@gm...> - 2015-12-12 21:09:57

On Sat, Dec 12, 2015 at 03:58:31PM -0500, Brian Walsh wrote:
> I was not sure if having it reset during ifup makes more sense. Does the
> clock go away when the interface is down? I can't test that right now.
> It is my primary interface so it is always up on my device.

The /dev/ptpX persists from ptp_clock_register() until
ptp_clock_unregister().  Ideally, the clock should appear when the
device is probed and stay running until either the HW is unplugged or
the driver gets unloaded.

There are some HW designs out there that cause the clock to go away or
become unusable when the link state changes, but I think the 82574
does not have those kinds of issues.

> Makes sense. Stop resetting the clock and it will not reset.

Yup.

Thanks,
Richard