#323 8255x/e100: AutoNeg reliably fails with some switches

closed
e100 (15)
in-kernel_driver
5
2014-03-06
2009-09-09
No

[e100: v3.5.17]

Hi,

Customers are reporting problems with some switches and various 8255x adapters when AutoNeg is enabled with ethtool (ethtool -s eth1 autoneg on).

Right after issuing this command the network LED goes off and the driver reports "no link". It does not matter whether AutoNeg was already set beforehand: the link always breaks after running the command above.

This issue only happens with some switches, notably:

Cisco 2610 XM
Cisco 1400 Series
ONE 60 router
Cisco router 2600
Load Balancer F5 BigIP

We tried to reproduce it with a couple of other switches, but without success.

All of this was tested with Linux kernel 2.6.16.62 + v3.5.17 (vendor driver) and 2.6.28 (unpatched), both showing the same results. The hardware showing this behaviour includes at least:

Intel Corporation 8255xER/82551IT Fast Ethernet Controller (Device ID 0x1209)
Intel Corporation 82557/8/9 [Ethernet Pro 100] (Device ID 0x1229)

I've attached 'dmesg_after', which I got right after enabling AutoNeg with debugging enabled. I also noticed that MDI status went from MDI-X to MDI after trying to enable AutoNeg. Other than that I got no interesting info from ethtool.

Please let me know if you need further information.

Discussion

  • Note that reloading the driver helps to get the link working again.

     
  • Jesse, I saw that you are at the Plumbers Conference this year. As I'm attending it too, I could bring one of those boxes (though it's happening on other hardware too). Please drop me a note if there is interest.

     
  • I've asked davegraham to take a look at it. I think the hardest part for us is that we only have one of the switches, the F5 BIG IP LTM 8900.

    I don't think we need one of the machines, but we haven't reproduced the issue yet, due to time constraints. I hope dave will be able to drive this.

     
  • david graham
    2009-09-18

    Hi,
    I have started looking into this today. Unfortunately I'm out all next week, but I'll be sure to post progress today before I leave.

     
  • david graham
    2009-09-18

    Debug log for successful ethtool -s ethX autoneg on

     
  • david graham
    2009-09-18

    Holger,
    I have tried with a few 10/100Mb link partners and have not seen the problem. Yes, I'll need to get one of the switches from the list. As Jesse said, we already have an F5 somewhere, and we have our problem repro guys on it. Here's what I tried:

        Trendnet TE100-S5
        Encore EN908H   
        Encore EN908H Uplink Port
        Netgear RP614V2         
        LinkSys BEFSR81
        NetGear MR314
    

    I attach a debug trace "dmesg_on_sucessful_config" where "ethtool -s ethX autoneg on" works. It will take me a while, and some help, to understand what the data says, but the problem may well already be in that data, and I plan to go through it.

    You mention 2.6.12 in the report. Do you think that this may always have been a problem interoperating with these switches, or do you know that it worked OK some time before 2.6.12?

    Could you please provide the following

    1) ethtool -e ethX
    2) Another debug dump, but with timestamps, as typically in /var/log/messages (1 sec granularity is fine), and that continues for about 5 seconds after the ethtool command was issued.
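
To make the requested capture window concrete, here is a minimal Python sketch that keeps only the syslog lines stamped within a few seconds after the ethtool command. It assumes the classic 15-character syslog timestamp prefix and an English month abbreviation; the function name `lines_after`, the `year` parameter, and the sample lines are hypothetical and not taken from the attached logs.

```python
from datetime import datetime, timedelta


def lines_after(log_lines, command_time, seconds=5, year=2009):
    """Return syslog lines stamped within `seconds` after `command_time`."""
    end = command_time + timedelta(seconds=seconds)
    kept = []
    for line in log_lines:
        # Classic syslog prefix is 15 characters, e.g. "Sep 18 14:02:31".
        # Syslog omits the year, so one is supplied by the caller.
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if command_time <= stamp <= end:
            kept.append(line)
    return kept


# Hypothetical placeholder lines, not real driver output.
sample = [
    "Sep 18 14:02:30 box kernel: placeholder message A",
    "Sep 18 14:02:31 box kernel: placeholder message B",
    "Sep 18 14:02:36 box kernel: placeholder message C",
    "Sep 18 14:02:37 box kernel: placeholder message D",
]
window = lines_after(sample, datetime(2009, 9, 18, 14, 2, 31))
```

With 1 second granularity the window boundaries are inclusive, so a line stamped exactly 5 seconds after the command is still kept.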

    Thanks !

     
  • My initial posting mentions kernels 2.6.16.y and 2.6.28 as having this problem. I have no information about kernel versions before 2.6.16.y that didn't have this issue, and I currently don't know whether it has always failed with these kinds of switches on earlier kernel versions.

    In addition to my initial posting, I can say that this issue only appears when the interface already has link, which wasn't clear when I wrote it. This also lowers the impact a bit, as we can rather easily avoid the issue most of the time by just checking the link status beforehand. Still, I think it should be fixed in the driver.
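
The link-status check mentioned above could look roughly like the following Python sketch. It parses the text printed by `ethtool ethX` (the "Auto-negotiation:" and "Link detected:" lines) and decides whether issuing "autoneg on" would hit the failing case. The function names, the policy itself, and the abbreviated sample output are assumptions for illustration, not the reporter's actual tooling.

```python
import re


def parse_ethtool(output):
    """Extract autoneg and link state from `ethtool ethX` text output."""
    autoneg = re.search(r"Auto-negotiation:\s*(on|off)", output)
    link = re.search(r"Link detected:\s*(yes|no)", output)
    return {
        "autoneg": (autoneg.group(1) == "on") if autoneg else None,
        "link": (link.group(1) == "yes") if link else None,
    }


def safe_to_enable_autoneg(output):
    """Hypothetical policy: only issue `autoneg on` while no link is up."""
    return parse_ethtool(output)["link"] is False


# Abbreviated, hypothetical sample of `ethtool eth1` output.
SAMPLE = """Settings for eth1:
\tSpeed: 100Mb/s
\tDuplex: Full
\tAuto-negotiation: on
\tLink detected: yes
"""

link_up_result = safe_to_enable_autoneg(SAMPLE)
link_down_result = safe_to_enable_autoneg(SAMPLE.replace("yes", "no"))
```

If the link state cannot be parsed, the sketch treats the command as unsafe, which errs on the side of not touching a working link.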

    As I'm traveling the coming week I'll arrange for getting the data you request. Many thanks!

     
  • david graham
    2009-09-29

    ethtool -e output as per note of 09_25

     
  • david graham
    2009-09-29

    kernel debug log as per note of 09_25

     
  • david graham
    2009-09-29

    Thanks Florian (fw_strlen). I've imported the files from your note as attachments, but have not yet looked at them. I'm just back from vacation. I know that our lab has still not repro'd the issue, though I suspect that they are not yet using one of the failing switches. I'm following up this afternoon.

     
  • david graham
    2009-09-30

    Still no repro here, but we're having trouble getting one of the switches. If we can't locate one today, I'd like to discuss having one shipped here. Please let me know if you'd be OK with that.

    From the traces so far, I see that your link partner is not advertising PAUSE capability, and on my test system it is, but that's not very likely to be relevant. It's a deep dig back through some of the older specs, so I may well turn up something else soon.

    There are a lot of 'similar' devices that operate under device IDs 1209 & 1229, and they may have subtle differences in autoneg behavior. Could you please provide outputs from the lspci dump commands below on one of the problem systems so I can be sure exactly what it has in it.

    lspci -tv
    lspci -xxx
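
As an illustration of what such a dump exposes, the first 16 bytes of PCI configuration space carry the vendor ID (offset 0x00), device ID (offset 0x02) and revision ID (offset 0x08), and the revision byte is one field that may help tell apart parts sharing a device ID. The Python sketch below decodes those fields from the first hex line of an `lspci -xxx` dump. The helper name and the sample line are hypothetical; only the bytes for vendor 0x8086 and device 0x1229 mirror values mentioned in this report.

```python
def parse_pci_header(dump_line):
    """Decode vendor, device and revision from the first config-space line."""
    # A dump line looks like "00: 86 80 29 12 ..."; drop the offset prefix.
    _, _, hex_bytes = dump_line.partition(":")
    data = bytes.fromhex(hex_bytes.replace(" ", ""))
    return {
        "vendor": int.from_bytes(data[0:2], "little"),  # offset 0x00
        "device": int.from_bytes(data[2:4], "little"),  # offset 0x02
        "revision": data[8],                            # offset 0x08
    }


# Hypothetical sample line; all bytes after the first four are made up.
line = "00: 86 80 29 12 07 00 90 02 08 00 00 02 08 20 00 00"
info = parse_pci_header(line)
```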

    I've discussed the issue with our Si support team, and they'd also like to know a few things about the failure, as follows:

    1) Does cable length matter? (100Mb aneg may have weak spots on some designs)
    2) Does the problem occur on one system design type only, or more?
    3) Does the problem occur at 10Mb? (10Mb uses different autoneg procedures). You can force a 10Mb link advertisement with:
    ethtool -s eth0 advertise 0x0461 autoneg on
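
For readers wondering what 0x0461 encodes, the Python sketch below decodes it in two ways: as an MII advertisement register value (bit names as in linux/mii.h) and against the first four bits of the mask documented for `ethtool -s ... advertise`. Which interpretation the driver applies to this command is not established in this thread, so treat the output as a reading aid only. The `decode` helper and the `ten_mbit_only` value are illustrative assumptions.

```python
# Bit names of the MII advertisement register (as in linux/mii.h).
MII_ADVERTISE = {
    0x0001: "CSMA (IEEE 802.3 selector)",
    0x0020: "10baseT half",
    0x0040: "10baseT full",
    0x0080: "100baseT half",
    0x0100: "100baseT full",
    0x0200: "100baseT4",
    0x0400: "pause",
    0x0800: "asymmetric pause",
}

# First four bits of the mask documented for `ethtool -s ... advertise`.
ETHTOOL_ADVERTISE = {
    0x001: "10baseT half",
    0x002: "10baseT full",
    0x004: "100baseT half",
    0x008: "100baseT full",
}


def decode(value, table):
    """List the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in sorted(table.items()) if value & bit]


mii_view = decode(0x0461, MII_ADVERTISE)
ethtool_view = decode(0x0461, ETHTOOL_ADVERTISE)

# Under the ethtool mask layout, 10 Mb half plus full duplex would be:
ten_mbit_only = 0x001 | 0x002
```

Read as an MII register value, 0x0461 advertises 10 Mb half and full duplex plus pause, which matches the stated intent of restricting the link to 10Mb.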

    Thanks !

     
  • david graham
    2009-10-06

    lspci -tv and lspci -xxxx logs from problem config

     
  • Hi, are you still having autoneg problems with the e100 adapter? I also suspect that the MDI-X code in e100 could be causing an issue for you; however, I think the 0x1229 devices don't support MDI-X. We can disable MDI-X: if you have proper cabling and/or your remote supports MDI-X, then the e100 doesn't need to be doing it.

    After having the issue, does an ethtool -r ethX fix it?

     
  • Todd Fujinaka
    2013-07-09

    • status: open --> closed
     
  • Todd Fujinaka
    2013-07-09

    Closing due to inactivity.