Short bus reset gets doubled, causing reset loop

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

Hi,

Be forewarned: this is a long message about short resets.

I found that in some cases, a reset loop can occur when adding a device 
to a FireWire bus. This happened with the following setup: a SY-PEX30016 
PCI Express OHCI card with the TI XIO2213B chip was installed in the 
computer. This was connected to a Data Transit bus analyzer. An Exsys 
EX-6682 6-port FireWire 400 hub was plugged into the other port of the 
bus analyzer. According to lsfirewirephy, this hub has a TI TSB41LV06A 
chip. A Sony DSR-25 was plugged into the hub.  Note that the reset loop 
only happens with certain topologies; it won't happen if the DSR-25 is 
plugged directly into the OHCI card.

There was a recent change to bus reset generation, 
7ed4380009e96d9e9c605e12822e987b35b05648. However, that change did not 
cause the problem. Also, that change did not solve the problem. The bus 
reset loop will happen with 7ed4380 and without 7ed4380.

I believe that two things must happen to trigger this reset loop: A node 
on the bus must turn the short bus reset into a "double" bus reset, and 
another node on the bus must interpret this double reset as two resets.

A comment in the open-source Mac OS X FireWire driver 
(IOFireWireController.cpp) mentions a situation that can cause an apparent 
double bus reset:

> // This works around a problem with certain devices when they
> // see an ISBR followed by an IBR due to ISBR promotion.

Various OHCI and PHY datasheets from Texas Instruments also mention this 
"promotion" in the documentation for ISBR. For example, from the XIO2213B 
datasheet:

> Note: Legacy IEEE Std 1394-1995-compliant PHYs are not capable of 
> performing short bus resets. Therefore, initiation of a short bus 
> reset in a network that contains such a legacy device results in a 
> long bus reset being performed.

However, this explanation must be an oversimplification. It's true that 
1394-1995 PHYs can't *initiate* short bus resets, but they can 
*recognize* them. Also, the OCHI's 1394b PHY can't be intentionally 
"promoting" the reset, because it has no way of knowing whether there 
are any 1394-1995 devices on the bus. And even if it did know, it would 
have no reason to generate a double reset; it should just generate a 
single long reset (and set the "i" bit in the self-ID packet, which it's 
not doing - see the logs, later).

Instead, I believe that some other node on the bus is misbehaving, and 
responding to the short bus reset by immediately sending a long bus 
reset.

This is a bus analyzer trace of a short bus reset, without the DSR-25 present:

> 014 BUS RESET                                                  78.96 US 
> 015 SELFID PKT0  80078465  01 ID=00 L=0 GAP=07 SP=400 C=0 I=0  3.010 US 

The rightmost column is the time delta from the previous event to the 
current one. The trace doesn't directly indicate whether a bus reset is 
short or long, but it can be inferred from the timestamps. As described 
in 1394a-2000, a short bus reset is ~1.3us and a long bus reset is ~167us. The 
self-ID packet occurs 3.01us after the bus reset, so it must have been a short 
bus reset.

This is a bus analyzer trace of a double bus reset:

> 050 BUS RESET                                                  7.508 MS 
> 051 BUS RESET                                                  2.950 US 
> 052 SELFID PKT0  80480880  01 ID=00 L=1 GAP=08 SP=100 C=1 I=0  168.6 US 

The second reset occurs 2.95us after the first one, so the first reset was 
short.  The self-ID packet occurs 168.6us after the second bus reset, so the 
second reset was long.

By itself, the double bus reset doesn't always cause problems. However, in some 
cases, some of the devices on the bus will interpret this as a single reset and 
others will interpret it as two resets. The one-reset devices will keep the gap 
count intact, while the two-reset devices will reset it, resulting in a gap 
count error.

But if we send another short bus reset to correct the gap count error, the same 
thing will happen again, leading to a reset loop.

This is a trimmed kernel log of the reset loop:

> [17589.010477] IRQ 00000010 AR_req
> [17589.012279] AR evt_bus_reset, generation 3
> [17589.013987] IRQ 00010000 selfID
> [17589.015621] AR evt_bus_reset, generation 3
> [17589.017275] 5 selfIDs, generation 4, local node ID ffc3
> [17589.018893] selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc
> [17589.020514] selfID 0: 813f8477, phy 1 [-c-] S400 gc=63 -3W i
> [17589.022151] selfID n: 81816000, phy 1 [--p.....]
> [17589.023740] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L
> [17589.025327] selfID 0: 837fcc74, phy 3 [-c-] beta gc=63 -3W Lc

The DSR-25 (node 0) was powered on. The hub (node 1) initiated a bus 
reset because a new node appeared on one of its ports. We are node 3, 
and the analyzer is node 2. We are the root node.

> [17589.542922] AT spd 0 tl 3a, ffc3 -> ffc0, ack_pending , QR req, fffff0000400
> [17589.546098] AR spd 0 tl 3a, ffc0 -> ffc3, ack_complete, QR resp = 04042592
> [17589.549254] AT spd 0 tl 3b, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400
> [17589.550799] AT spd 0 tl 3c, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400
> [17589.552385] rediscovered device fw0

We start reading the config ROMs. The missing ACKs from the hub and 
analyzer are normal because those devices don't have config ROMs.

> [17589.553916] phy config: new root=ffc3, gap_count=8
> [17589.557011] AR spd 0 tl 14, ffc0 -> ffc3, ack_complete, QR resp = 31333934
> [17589.558507] AT spd 0 tl 14, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404

We send a PHY configuration packet for gap count optimization. Meanwhile 
we continue reading the DSR-25's config ROM.

> [17589.561477] reset_bus(short_reset=1)
> [17589.563013] IRQ 00000010 AR_req
> [17589.564534] AR evt_bus_reset, generation 5
> [17589.566021] IRQ 00010000 selfID
> [17589.567436] AR evt_bus_reset, generation 5
> [17589.568860] AT spd 0 tl 15, ffc3 -> ffc0, evt_no_status, QR req, fffff0000408

We initiate a short bus reset to complete the gap count optimization.  
In this log, I added output to reset_bus() so you can see where we 
initiated the bus reset. evt_bus_reset then gets called TWICE. Meanwhile 
there is a dangling config ROM read attempt.

> [17589.570280] 5 selfIDs, generation 6, local node ID ffc3
> [17589.571663] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc
> [17589.573087] selfID 0: 81088475, phy 1 [-c-] S400 gc=8 -3W 
> [17589.574497] selfID n: 81816000, phy 1 [--p.....]
> [17589.575869] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L
> [17589.577228] selfID 0: 8348cc74, phy 3 [-c-] beta gc=8 -3W Lc

After the bus reset, most of the devices have set their gap count to 8, 
but the analyzer has set it to 63. We had initiated a bus reset, but in 
the self ID packets, we aren't taking credit for the bus reset and 
neither is anyone else.

> [17589.578599] giving up on node ffc0: reading config rom failed: bus reset
> [17590.086737] AR spd 0 tl 17, ffc0 -> ffc3, ack_complete, QR resp = 04042592
> [17590.088094] AT spd 0 tl 17, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000400
> [17590.090849] AT spd 0 tl 18, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400
> [17590.093712] AT spd 0 tl 19, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400
> [17590.095168] rediscovered device fw0

We start reading config ROMs again.

> [17590.096576] phy config: new root=ffc3, gap_count=8
> [17590.099530] AR spd 0 tl 31, ffc0 -> ffc3, ack_complete, QR resp = 31333934
> [17590.101014] AT spd 0 tl 31, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404
> [17590.103970] reset_bus(short_reset=1)

We send another short bus reset to correct the gap count error.

> [17590.105471] IRQ 00000010 AR_req
> [17590.106934] AR evt_bus_reset, generation 7
> [17590.108439] IRQ 00010000 selfID
> [17590.109901] AR evt_bus_reset, generation 7
> [17590.111381] AT spd 0 tl 32, ffc3 -> ffc0, evt_no_status, QR req, fffff0000408
> [17590.112862] 5 selfIDs, generation 8, local node ID ffc3
> [17590.114349] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc
> [17590.115864] selfID 0: 81088475, phy 1 [-c-] S400 gc=8 -3W 
> [17590.117373] selfID n: 81816000, phy 1 [--p.....]
> [17590.118844] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L
> [17590.120297] selfID 0: 8348cc74, phy 3 [-c-] beta gc=8 -3W Lc

The reset gets doubled again and the gap counts are inconsistent again.

This repeats forever until the DSR-25 is turned off.

IEEE 1394a section 8.4.6.2 mentions the possibility of nodes on a mixed 
1394/1394a bus responding differently to a reset, and gives the solution of 
using a long bus reset to recover:

> NOTE—Differences in the reset state machines between PHYs compliant with 
> IEEE Std 1394-1995 (but unmodified by IEEE Std 1394a-2000) and those 
> compliant with IEEE Std 1394a-2000 may result in different gap counts 
> (subsequent to a bus reset preceded by a PHY configuration packet) when 
> a mixture of devices is present. The probability of this inconsistency 
> may be reduced by asserting BUS_RESET for 166.6 us instead of the 
> arbitrated (short) bus reset specified by this standard, but this remedy 
> should not be applied unless inconsistent gap counts have been observed 
> subsequent to arbitrated (short) bus resets.  This heuristic does not 
> guarantee uniform values for gap_count at all nodes; nevertheless, 
> because of the probabilistic nature of the problem, success is likely 
> after a modest number of attempts.

Indeed, the Mac OS X driver uses 1394a's suggested approach:

> // If the gap counts were mismatched and we're optimizing the
> // gap count, use in IBR instead of an ISBR

Regarding the difference in 1394-1995 and 1394a devices, the comment in the Mac
driver goes on to say this:

> // When this occurs the device is supposed to throw out the IBR.
> // Devices that don't do this latch the gap count with the ISBR
> // and then reset it with the subsequent IBR.
> 
> // This gets us into a loop where we keep trying to optimize the
> // gap count only to find it reset on the next bus generation

The implication is that some of the PHYs are debouncing the incoming resets and 
some aren't. However, the node that was clearing the gap count seems to 
debounce resets in some cases. In the following kernel log and analyzer trace 
excerpts, the analyzer and DSR-25 are still present but the hub has been 
removed from the chain:

> [ 6927.395841] phy config: new root=ffc2, gap_count=7
> [ 6927.397589] IRQ 00000001 AT_req
> [ 6927.399340] AT ack_complete, PHY 02c70000 fd38ffff
> [ 6927.401206] IRQ 00000010 AR_req
> [ 6927.402928] AR evt_bus_reset, generation 8
> [ 6927.404586] IRQ 00010000 selfID
> [ 6927.406196] AR evt_bus_reset, generation 8
> [ 6927.407846] 3 selfIDs, generation 9, local node ID ffc2
> [ 6927.409452] selfID 0: 80070880, phy 0 [p..] S100 gc=7 +0W c
> [ 6927.411075] selfID 0: 814780b4, phy 1 [pc-] S400 gc=7 +0W L
> [ 6927.412677] selfID 0: 8247ccd4, phy 2 [c--] beta gc=7 -3W Lc

> 007 PHY_CONFIG   02C70000  01 ROOT ID=02  R=1  T=1  GAP=07     13.76 MS 
> 008 ASYNC        FD38FFFF  02                                  0.330 US 
> 009 SUBACTN GAP                                                1.690 US 
> 010 ARB GAP                                                    1.380 US 
> 011 BUS RESET                                                  3.611 MS 
> 012 SELFID PKT0  80070880  01 ID=00 L=0 GAP=07 SP=100 C=1 I=0  171.0 US 

With this topology change, the double reset still seems to be happening.  
There are two evt_bus_resets in syslog, the timing of the self-ID packet 
shows that there must have been a long bus reset, and none of the 
self-ID packets are claiming responsibility for the reset. But this 
time, the gap count is consistent, and the analyzer trace only shows one 
reset. The self-ID packet is 171.0us after the bus reset; in the "bad" 
topology, the self-ID packet is ~171.6us after the first bus reset. So 
it looks like the analyzer's PHY is considering the two resets to be a 
single reset in this case, but not in the other case. (The 600ns 
difference also suggests that the DSR-25 is the origin of the second 
reset in the double reset pair. The hub would introduce this delay 
between the OHCI's short reset as seen by the analyzer and the DSR-25's 
long reset as seen by the analyzer.)

So, I think the best explanation for the gap count inconsistency is some 
more subtle difference in bus reset detection between 1394-1995 and 
1394a devices, and not that the node is just failing to throw away a bus 
reset that it's supposed to.

I've prepared a patch that uses a long bus reset when responding to a gap count 
inconsistency. This eliminates the reset loop. I will submit the patch soon.  
Here is what happens with the "bad" topology and the patched kernel:

> [17074.390846] IRQ 00000010 AR_req
> [17074.392893] AR evt_bus_reset, generation 3
> [17074.394846] IRQ 00010000 selfID
> [17074.396775] AR evt_bus_reset, generation 3
> [17074.398655] 5 selfIDs, generation 4, local node ID ffc3
> [17074.400589] selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc
> [17074.402442] selfID 0: 813f8467, phy 1 [-p-] S400 gc=63 -3W i
> [17074.404261] selfID n: 81835000, phy 1 [c--.....]
> [17074.406075] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L
> [17074.407857] selfID 0: 837fccd4, phy 3 [c--] beta gc=63 -3W Lc

> 000 BUS RESET                                                  0.000 USX
> 001 BUS RESET                                                  2.600 US 
> 002 SELFID PKT0  807F0880  01 ID=00 L=1 GAP=3F SP=100 C=1 I=0  168.6 US 

The DSR-25 was powered on and the hub initiated a bus reset.

> [17074.918368] AT spd 0 tl 3a, ffc3 -> ffc0, ack_pending , QR req, fffff0000400
> [17074.921964] AR spd 0 tl 3a, ffc0 -> ffc3, ack_complete, QR resp = 04042592
> [17074.925563] AR spd 0 tl 3c, ffc0 -> ffc3, ack_complete, QR resp = 31333934
> [17074.927306] AT spd 0 tl 3c, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404
> [17074.929046] AT spd 0 tl 3b, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400
> [17074.932573] AT spd 0 tl 3d, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400
> [17074.934335] rediscovered device fw0

We start reading the config ROMs.

> [17074.936033] phy config: new root=ffc3, gap_count=8
> [17074.937810] IRQ 00000021 AR_resp AT_req
> [17074.939486] AR spd 0 tl 15, ffc0 -> ffc3, ack_complete, QR resp = e0644000
> [17074.941196] AT spd 0 tl 15, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000408
> [17074.942879] AT ack_complete, PHY 03c80000 fc37ffff
> [17074.944597] IRQ 00000011 AR_req AT_req
> [17074.946252] AR evt_bus_reset, generation 5
> [17074.947924] IRQ 00010010 selfID AR_req
> [17074.949544] AR evt_bus_reset, generation 5
> [17074.951130] AT spd 0 tl 16, ffc3 -> ffc0, ack_pending , QR req, fffff000040c
> [17074.952770] 5 selfIDs, generation 6, local node ID ffc3
> [17074.954341] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc
> [17074.955918] selfID 0: 81088465, phy 1 [-p-] S400 gc=8 -3W 
> [17074.957480] selfID n: 81835000, phy 1 [c--.....]
> [17074.959008] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L
> [17074.960537] selfID 0: 8348ccd4, phy 3 [c--] beta gc=8 -3W Lc

> 072 ASYNC        F000040C  03 DEST OFFSET=FFFF:F000:040C       0.330 US 
> 073 ASYNC        A2508E4C  04 HEADER CRC                       0.320 US 
> 074 ACK                2D     ACK_PENDING                      1.370 US 
> 075 BUS RESET                                                  5.730 US 
> 076 BUS RESET                                                  2.950 US 
> 077 SELFID PKT0  80480880  01 ID=00 L=1 GAP=08 SP=100 C=1 I=0  168.6 US 

We send a PHY configuration packet and short bus reset for gap count 
optimization. This turns into a double bus reset, and there is a gap 
count inconsistency.

> [17075.494059] AT spd 0 tl 18, ffc3 -> ffc0, ack_pending , QR req, fffff0000400
> [17075.497078] AR spd 0 tl 18, ffc0 -> ffc3, ack_complete, QR resp = 04042592
> [17075.500071] AT spd 0 tl 19, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400
> [17075.503125] AT spd 0 tl 1a, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400
> [17075.504702] rediscovered device fw0

We start reading config ROMs again.

> [17075.506199] phy config: new root=ffc3, gap_count=8
> [17075.507759] IRQ 00000021 AR_resp AT_req
> [17075.509326] AR spd 0 tl 32, ffc0 -> ffc3, ack_complete, QR resp = 31333934
> [17075.510842] AT spd 0 tl 32, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404
> [17075.512351] AT ack_complete, PHY 03c80000 fc37ffff
> [17075.513960] IRQ 00000010 AR_req
> [17075.515477] AR evt_bus_reset, generation 7
> [17075.517035] IRQ 00010000 selfID
> [17075.518559] AT spd 0 tl 33, ffc3 -> ffc0, evt_no_status, QR req, fffff0000408
> [17075.520106] 5 selfIDs, generation 7, local node ID ffc3
> [17075.521660] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc
> [17075.523228] selfID 0: 81088465, phy 1 [-p-] S400 gc=8 -3W 
> [17075.524793] selfID n: 81835000, phy 1 [c--.....]
> [17075.526321] selfID 0: 824880b4, phy 2 [pc-] S400 gc=8 +0W L
> [17075.527821] selfID 0: 8348ccd6, phy 3 [c--] beta gc=8 -3W Lci

> 133 ASYNC        31333934  04 QUADLET=31333934                 0.330 US 
> 134 ASYNC        86B1088D  05 HEADER CRC                       0.320 US 
> 135 ACK                1E     ACK_COMPLETE                     1.330 US 
> 136 SUBACTN GAP                                                1.830 US 
> 137 ARB GAP                                                    1.580 US 
> 138 BUS RESET                                                  6.164 MS 
> 139 SELFID PKT0  80480880  01 ID=00 L=1 GAP=08 SP=100 C=1 I=0  169.4 US 

We send a long bus reset to correct the gap count error. This is 
successful. The bus reset does not get doubled, and the gap counts are 
consistent.

> [17078.534125] AR spd 0 tl 3f, ffc0 -> ffc3, ack_complete, QR resp = 04042592
> [17078.535629] AT spd 0 tl 3f, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000400
...
> [17078.771716] AT spd 0 tl 21, ffc3 -> ffc0, ack_pending , QR req, fffff0000450
> [17078.775367] AR spd 0 tl 21, ffc0 -> ffc3, ack_complete, QR resp = 536f6e79
> [17078.777511] created device fw1: GUID 08004601025e582a, S100

There are no more bus resets. We read the config ROM and we're ready to 
rock and roll.

So, I believe that when there is a gap count inconsistency, sending a 
long bus reset is the right thing to do.

-- Adam