I've been trying to get to the bottom of an annoying problem for some weeks now.
I have a main board that carries the Overo Waterstorm, with an SMSC USB2513 hub connected to USBH_DP / USBH_DM (the host port, not the OTG channel) and two SMSC LAN7500 gigabit Ethernet devices downstream of it.
The system (built from a custom Sakoman kernel) comes up fine with both gigabit Ethernet channels running. After some time (which can be short if I construct my test case right) I get some variation on this in dmesg:
ehci-omap.0: detected XactErr len 0/18944 retry <n>
hub 1-0:1.0: port 2 disabled by hub (EMI?), re-enabling...
usb_disable_device nuking all URBs
This all takes place in less than 100 ms (from the first errors until USB shutdown), after which USB is down, along with the network interfaces.
The time to reproduce varies between COM units, but I do eventually get the problem to occur. Currently I have no way to bring USB back up once this happens.
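To timestamp the failure during long test runs, the signature above can be picked out of a saved dmesg capture. This is only a minimal sketch: the three patterns are taken directly from the lines quoted above, while the function name and labels are made up for illustration.

```python
import re

# Patterns taken from the dmesg lines quoted above; the labels are arbitrary.
PATTERNS = {
    "xact_err": re.compile(r"detected XactErr"),
    "port_disabled": re.compile(r"port \d+ disabled by hub \(EMI\?\)"),
    "urbs_nuked": re.compile(r"nuking all URBs"),
}

def scan_log(text):
    """Return (line_number, label) for every line matching a failure pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

Feeding it the output of dmesg saved to a file gives the line numbers at which each stage of the failure first shows up.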
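One possible recovery experiment is to unbind and rebind the host controller driver through sysfs. The sketch below only writes the device name to the driver's unbind and bind files. The driver directory and device name in the usage note are assumptions based on the ehci-omap.0 prefix in the log, and there is no guarantee that a rebind actually brings the controller back.

```python
import os
import time

def rebind(driver_dir, device, pause=1.0):
    """Write `device` to the driver's unbind file, wait, then to its bind file."""
    with open(os.path.join(driver_dir, "unbind"), "w") as f:
        f.write(device)
    time.sleep(pause)  # give the kernel a moment to tear the device down
    with open(os.path.join(driver_dir, "bind"), "w") as f:
        f.write(device)
```

As root, this would be called as rebind("/sys/bus/platform/drivers/ehci-omap", "ehci-omap.0"), assuming that path exists on the kernel in use.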
Trace routing would normally be a suspect in a case like this, but I do not believe it is the issue here. I am also running the Ethernet devices self-powered rather than bus-powered, so overcurrent protection should not be triggering (I have also looked for that specifically in the kernel).
Using a Tobi board with a prebuilt Yocto kernel image and an SMSC LAN7500 USB eval kit plugged into the Tobi, I am able to reproduce the same problem (with a slightly different time to failure).
I basically run
nc -l 1337 < /dev/urandom
on one end and on the Tobi, I run a program that reads from the TCP socket and measures the rate (using one interface at a time).
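For reference, a minimal sketch of the kind of reader I mean. This is not my actual test program; the function name, timeout and chunk size are placeholders.

```python
import socket
import time

def measure_rate(host, port, duration=10.0, chunk=65536, timeout=5.0):
    """Read from a TCP socket for up to `duration` seconds.

    Returns (total_bytes, rate_in_MB_per_s). Stops early if the sender
    closes the connection or no data arrives within `timeout` seconds,
    which is what happens when the USB link goes down.
    """
    total = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        start = time.monotonic()
        deadline = start + duration
        while time.monotonic() < deadline:
            try:
                data = sock.recv(chunk)
            except socket.timeout:
                break  # link stalled
            if not data:
                break  # sender closed the connection
            total += len(data)
        elapsed = time.monotonic() - start
    rate = total / elapsed / 1e6 if elapsed > 0 else 0.0
    return total, rate
```

Calling measure_rate with the address of the nc host and port 1337 in a loop, and logging the result each time, makes the moment of failure obvious: the rate drops to zero when the interface disappears.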
Erratum 2.1 in the TI errata sheet for the DM3730 sounds very much like the problem I'm seeing. In short, it says that if a 26 MHz reference clock is used, the 120 MHz clock generated for high-speed USB will suffer from some drift. There is a kernel patch that is supposed to reduce the problem (I am trying it now), but according to the erratum the only way to eliminate it is to use a reference clock of 12 MHz, 19.2 MHz or 38.4 MHz.
I have two questions for the Gumstix team (or anyone who can help me with this):
1. Are you aware of this, and do you have a solution? We have a large number of these Waterstorm boards and need a workable solution for the LAN7500, even if it involves running the USB bus in full-speed mode (which I haven't yet succeeded in doing, and which I'm not sure the USB3326 supports). I'm not even sure that would eliminate this particular problem, although it should definitely reduce it.
2. Is it possible (at least as an experiment for now) to rework the Overo board to replace the visible 26 MHz crystal with one of the supported rates, or is that clock shared by other components? I would have to change some of the kernel platform init code to support the new rate, but first I want to find out whether it is feasible.
The errata sheet is sprz319e.pdf (search for that file name); erratum 2.1 is on page 112.