Thread: [Etherboot-developers] Re: [Etherboot-users] tulip driver problems? (clone FA310TX)

On 5/25/01 10:15 PM Michael Stein ma...@uc... wrote:
>tulip driver problems? (clone FA310TX)
>I'm trying etherboot for the first time, I've been using netboot
>before this.
>I've fetched the etherboot-5.0.1 source and built it as is.  I built a
>floppy with the tulip driver.  I tried to boot this in a machine with
>a clone chip Netgear FA310TX.  Upon booting I can see that etherboot
>finds the network card.  It says (retyped, approx):
>FA310TX: Chip Lite-On 82c168 Pnic rev 32 at e400
>        vendor = 11ad device 0002
>        mii transceiver 1 config 3000 status 7829 advertising 01e1
>Searching for server (DHCP)
><sleep>
>The <sleep> keeps repeating...
>I tried using tcpdump to see if packets were visible on the dhcp/tftp
>server and don't believe any are appearing.

This has been reported before, and I am at a loss to explain it.  I have 
been unable to reproduce this failure.  I'm going to look at the 2.4 
Kernel sources to see if there have been any changes that could possibly 
account for this behaviour.

>Switching to a netboot floppy results in a complete boot (same machine,
>same dhcp/tftp server).

Sounds like an initialization problem.  I'm looking into it, and since 
you have a clear test case, perhaps you can test it.

>Trying the etherboot floppy in a different machine which has an
>Netgear FA310TX with the original Dec tulip chip (21140) results in
>it booting. This machine uses a different dhcp/tftp server but I don't
>believe that that is related.

Yes, that would exercise a different code path than the PNIC.

>Reconnecting the failing machine with the PNIC from the existing
>Cisco-4000 100 Mbit port to an old Cisco switch with 10 Mbit ports
>results in the PNIC machine booting using etherboot too...
>Sounds like the tulip driver MII/speed code has a problem.
>All the above machines are on the same subnet.

This does point to the autonegotiation/MII code as a possible culprit.

>I've been using netboot for a few years but am thinking of switching
>to etherboot as it is built with "normal" tools and would allow the
>easy(?) addition of 32 bit C code.  I want to have a floppy based fast
>boot over the network which is secure and not restricted to the local
>subnet.  I figure that reading up to 10% of a floppy is ok, reading more
>is too slow.  I have some ideas on how this could work but am interested
>in hearing others. 

Alright, having looked at the latest kernel code for a few minutes, I 
think I have a theory about what is happening.  This will require 
testing, but I think we might be onto something.

Here is some MDIO read code from the kernel relating to the LC82C168 (the 
chip you have on your FA310TX card):

        if (tp->chip_id == LC82C168) {
                int i = 1000;
                outl(0x60020000 + (phy_id<<23) + (location<<18), ioaddr + 0xA0);
                inl(ioaddr + 0xA0);
                inl(ioaddr + 0xA0);
                while (--i > 0) {
                        barrier();
                        if ( ! ((retval = inl(ioaddr + 0xA0)) & 0x80000000))
                                break;
                }
                return retval & 0xffff;
        }

Here is the equivalent etherboot code:

    if (tp->chip_id == LC82C168) {
        int i = 1000;
        outl(0x60020000 + (phy_id<<23) + (location<<18), ioaddr + 0xA0);
        inl(ioaddr + 0xA0);
        inl(ioaddr + 0xA0);
        while (--i > 0)
            if ( ! ((retval = inl(ioaddr + 0xA0)) & 0x80000000))
                return retval & 0xffff;
        return 0xffff;
    }

Now, ignoring the stylistic differences, I notice a call to "barrier()".
I haven't ever heard of that function before.  What could it be for?

So I grepped through the 2.4 kernel sources trying to see where it was 
used and what it was for:

Here is an example, from dgrs.c:

        for (i = jiffies + 8 * HZ; time_after(i, jiffies); )
        {
                barrier();              /* Gcc 2.95 needs this */
                if (priv0->bcomm->bc_status >= BC_RUN)
                        break;
        }

Oh my god.  GCC needs something to protect volatile memory.  Time to find 
the definition of barrier(). 

In the file linux-2.4.5/include/linux/kernel.h we find:

/* Optimization barrier */
/* The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")
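
For anyone else who has not met this idiom: the empty asm statement emits no instructions, but the "memory" clobber tells GCC to assume memory may have changed, so values cached in registers are reloaded afterwards. It is a compiler barrier only, not a hardware one. Below is a minimal sketch of the effect, assuming a hypothetical flag (ready_flag) that something outside the compiler's view, such as an interrupt handler, might set. The names are made up for illustration and are not from the kernel or Etherboot.

```c
/* Same definition as the kernel's; GCC/Clang-specific inline asm. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* Hypothetical flag that may change behind the compiler's back.
 * Deliberately NOT declared volatile. */
int ready_flag;

/* Spin until ready_flag is nonzero or we run out of tries.
 * Returns 1 if the flag was seen set, 0 on timeout. */
int wait_for_flag(int max_tries)
{
    while (max_tries-- > 0) {
        barrier();      /* compiler must reload ready_flag from memory */
        if (ready_flag)
            return 1;
    }
    return 0;
}
```

Without the barrier() call, an optimizing compiler would be entitled to read ready_flag once before the loop and never look at it again.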

Oh this is ugly.  But it could explain a lot.  Let's see where else in 
the tulip code this call is used:

mdc@ll:~/linux-2.4.5/drivers/net/tulip$ grep barrier *
ChangeLog:      * media.c: Add barrier() to mdio_read/write's PNIC status check
media.c:                        barrier();
media.c:                        barrier();

Hmmm, in exactly 2 places.  Now obviously the one above is the first.  
Let's see what the second is:

        if (tp->chip_id == LC82C168) {
                int i = 1000;
                outl(cmd, ioaddr + 0xA0);
                do {
                        barrier();
                        if ( ! (inl(ioaddr + 0xA0) & 0x80000000))
                                break;
                } while (--i > 0);
                return;
        }

Now, I'm not a gambling man.  But, one might say "a subtle pattern begins 
to emerge...".  The only two places in the tulip code that use this call 
are in code specific to the LC82C168. Hmmmmmmm.

So the first thing I'd try is putting the definition of barrier() in the 
tulip.c file after the other #includes:

/* Optimization barrier */
/* The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")

Then add the barrier() calls to the function mdio_read:

    if (tp->chip_id == LC82C168) {
        int i = 1000;
        outl(0x60020000 + (phy_id<<23) + (location<<18), ioaddr + 0xA0);
        inl(ioaddr + 0xA0);
        inl(ioaddr + 0xA0);
        while (--i > 0) {
            barrier();
            if ( ! ((retval = inl(ioaddr + 0xA0)) & 0x80000000))
                return retval & 0xffff;
        }
        return 0xffff;
    }

and to its sibling mdio_write:

    if (tp->chip_id == LC82C168) {
        int i = 1000;
        outl(cmd, ioaddr + 0xA0);
        do {
            barrier();
            if ( ! (inl(ioaddr + 0xA0) & 0x80000000))
                break;
        } while (--i > 0);
        return;
    }

Don't forget to add the "{" and "}" around the bodies of the do and while 
loops.
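
For anyone who wants to check the shape of the patched loop without the hardware, here is a rough user-space sketch. It assumes a made-up stand-in for the PNIC register (fake_inl and fake_mii_setup are hypothetical helpers, not part of tulip.c) whose busy bit 0x80000000 clears after a chosen number of reads. It only exercises the loop logic; it cannot reproduce whatever the compiler does to the real driver.

```c
#include <stdint.h>

/* Same definition as in linux-2.4.5/include/linux/kernel.h (GCC-specific). */
#define barrier() __asm__ __volatile__("" : : : "memory")

#define MII_BUSY 0x80000000u    /* bit tested by the driver's polling loops */

/* Hypothetical stand-in for the PNIC MII register at ioaddr + 0xA0. */
static uint32_t fake_reg;
static int reads_left;          /* busy bit clears on this many-th read */

/* Load the fake register; reads_until_ready == 0 means "never ready". */
void fake_mii_setup(uint32_t value, int reads_until_ready)
{
    fake_reg = value | MII_BUSY;
    reads_left = reads_until_ready;
}

/* Plays the role of inl(ioaddr + 0xA0). */
static uint32_t fake_inl(void)
{
    if (reads_left > 0 && --reads_left == 0)
        fake_reg &= ~MII_BUSY;
    return fake_reg;
}

/* Same shape as the patched mdio_read loop: poll until the busy bit
 * clears, returning the low 16 bits, or 0xffff if we give up. */
int poll_mii(int tries)
{
    uint32_t retval;

    while (--tries > 0) {
        barrier();
        if (!((retval = fake_inl()) & MII_BUSY))
            return (int)(retval & 0xffff);
    }
    return 0xffff;
}
```

Calling fake_mii_setup(0x7829, 3) and then poll_mii(1000) should hand back 0x7829 on the third read, while a register that never becomes ready falls through to 0xffff.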

Now it is possible that this is not the problem at all, but the 
evidence does strongly point to some sort of subtle thing, and we have 
certainly seen compiler, um, "eccentricities" when dealing with this code 
before.

Could someone with a non-working FA310TX try these changes?  I note that 
the barrier() call is also used (at least) in these kernel drivers:

3c527.c
7990.c
8139too.c
a2065.c
acenic.c
de4x5.c
dgrs.c
epic100.c
ioc3-eth.c
sk_g16.c
sunhme.c
sunlance.c
sunqe.c
winbond-840.c

And I would not be at all surprised if the eepro100 problems we have been 
seeing might be related.

So, there are my initial quick thoughts on the long-running FA310TX saga. 

Michael, thank you very much for your debugging information and the time 
you took in doing it. Please let us know if the changes I suggest make a 
difference or not.  If you're not comfortable making the changes, I can 
send you an edited tulip.c file to try.

I hope this helps,

Marty

---
    Try: http://rom-o-matic.net/ to make Etherboot images instantly.

   Name: Marty Connor
US Mail: Entity Cyber, Inc.; P.O. Box 391827; Cambridge, MA 02139; USA
  Voice: (617) 491-6935, Fax: (617) 491-7046 
  Email: md...@th...
    Web: http://www.thinguin.org/
