Short bus reset gets doubled, causing reset loop
Brought to you by:
aeb,
bencollins
From: Adam G. <ad...@po...> - 2024-02-26 07:57:58
|
Hi, Be forewarned: this is a long message about short resets. I found that in some cases, a reset loop can occur when adding a device to a FireWire bus. This happened with the following setup: a SY-PEX30016 PCI Express OHCI card with the TI XIO2213B chip was installed in the computer. This was connected to a Data Transit bus analyzer. An Exsys EX-6682 6-port FireWire 400 hub was plugged into the other port of the bus analyzer. According to lsfirewirephy, this hub has a TI TSB41LV06A chip. A Sony DSR-25 was plugged into the hub. Note that the reset loop only happens with certain topologies; it won't happen if the DSR-25 is plugged directly into the OHCI card. There was a recent change to bus reset generation, 7ed4380009e96d9e9c605e12822e987b35b05648. However, that change did not cause the problem. Also, that change did not solve the problem. The bus reset loop will happen with 7ed4380 and without 7ed4380. I believe that two things must happen to trigger this reset loop: A node on the bus must turn the short bus reset into a "double" bus reset, and another node on the bus must interpret this double reset as two resets. A comment in the open-source Mac OS X FireWire driver (IOFireWireController.cpp) mentions a situation that can cause an apparent double bus reset: > // This works around a problem with certain devices when they > // see an ISBR followed by an IBR due to ISBR promotion. Various OHCI and PHY datasheets from Texas Instruments also mention this "promotion" in the documentation for ISBR. For example, from the XIO2213B datasheet: > Note: Legacy IEEE Std 1394-1995-compliant PHYs are not capable of > performing short bus resets. Therefore, initiation of a short bus > reset in a network that contains such a legacy device results in a > long bus reset being performed. However, this explanation must be an oversimplification. It's true that 1394-1995 PHYs can't *initiate* short bus resets, but they can *recognize* them. Also, the OCHI's 1394b PHY can't be intentionally "promoting" the reset, because it has no way of knowing whether there are any 1394-1995 devices on the bus. And even if it did know, it would have no reason to generate a double reset; it should just generate a single long reset (and set the "i" bit in the self-ID packet, which it's not doing - see the logs, later). Instead, I believe that some other node on the bus is misbehaving, and responding to the short bus reset by immediately sending a long bus reset. This is a bus analyzer trace of a short bus reset, without the DSR-25 present: > 014 BUS RESET 78.96 US > 015 SELFID PKT0 80078465 01 ID=00 L=0 GAP=07 SP=400 C=0 I=0 3.010 US The rightmost column is the time delta from the previous event to the current one. The trace doesn't directly indicate whether a bus reset is short or long, but it can be inferred from the timestamps. As described in 1394a-2000, a short bus reset is ~1.3us and a long bus reset is ~167us. The self-ID packet occurs 3.01us after the bus reset, so it must have been a short bus reset. This is a bus analyzer trace of a double bus reset: > 050 BUS RESET 7.508 MS > 051 BUS RESET 2.950 US > 052 SELFID PKT0 80480880 01 ID=00 L=1 GAP=08 SP=100 C=1 I=0 168.6 US The second reset occurs 2.95us after the first one, so the first reset was short. The self-ID packet occurs 168.6us after the second bus reset, so the second reset was long. By itself, the double bus reset doesn't always cause problems. However, in some cases, some of the devices on the bus will interpret this as a single reset and others will interpret it as two resets. The one-reset devices will keep the gap count intact, while the two-reset devices will reset it, resulting in a gap count error. But if we send another short bus reset to correct the gap count error, the same thing will happen again, leading to a reset loop. This is a trimmed kernel log of the reset loop: > [17589.010477] IRQ 00000010 AR_req > [17589.012279] AR evt_bus_reset, generation 3 > [17589.013987] IRQ 00010000 selfID > [17589.015621] AR evt_bus_reset, generation 3 > [17589.017275] 5 selfIDs, generation 4, local node ID ffc3 > [17589.018893] selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc > [17589.020514] selfID 0: 813f8477, phy 1 [-c-] S400 gc=63 -3W i > [17589.022151] selfID n: 81816000, phy 1 [--p.....] > [17589.023740] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L > [17589.025327] selfID 0: 837fcc74, phy 3 [-c-] beta gc=63 -3W Lc The DSR-25 (node 0) was powered on. The hub (node 1) initiated a bus reset because a new node appeared on one of its ports. We are node 3, and the analyzer is node 2. We are the root node. > [17589.542922] AT spd 0 tl 3a, ffc3 -> ffc0, ack_pending , QR req, fffff0000400 > [17589.546098] AR spd 0 tl 3a, ffc0 -> ffc3, ack_complete, QR resp = 04042592 > [17589.549254] AT spd 0 tl 3b, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400 > [17589.550799] AT spd 0 tl 3c, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400 > [17589.552385] rediscovered device fw0 We start reading the config ROMs. The missing ACKs from the hub and analyzer are normal because those devices don't have config ROMs. > [17589.553916] phy config: new root=ffc3, gap_count=8 > [17589.557011] AR spd 0 tl 14, ffc0 -> ffc3, ack_complete, QR resp = 31333934 > [17589.558507] AT spd 0 tl 14, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404 We send a PHY configuration packet for gap count optimization. Meanwhile we continue reading the DSR-25's config ROM. > [17589.561477] reset_bus(short_reset=1) > [17589.563013] IRQ 00000010 AR_req > [17589.564534] AR evt_bus_reset, generation 5 > [17589.566021] IRQ 00010000 selfID > [17589.567436] AR evt_bus_reset, generation 5 > [17589.568860] AT spd 0 tl 15, ffc3 -> ffc0, evt_no_status, QR req, fffff0000408 We initiate a short bus reset to complete the gap count optimization. In this log, I added output to reset_bus() so you can see where we initiated the bus reset. evt_bus_reset then gets called TWICE. Meanwhile there is a dangling config ROM read attempt. > [17589.570280] 5 selfIDs, generation 6, local node ID ffc3 > [17589.571663] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc > [17589.573087] selfID 0: 81088475, phy 1 [-c-] S400 gc=8 -3W > [17589.574497] selfID n: 81816000, phy 1 [--p.....] > [17589.575869] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L > [17589.577228] selfID 0: 8348cc74, phy 3 [-c-] beta gc=8 -3W Lc After the bus reset, most of the devices have set their gap count to 8, but the analyzer has set it to 63. We had initiated a bus reset, but in the self ID packets, we aren't taking credit for the bus reset and neither is anyone else. > [17589.578599] giving up on node ffc0: reading config rom failed: bus reset > [17590.086737] AR spd 0 tl 17, ffc0 -> ffc3, ack_complete, QR resp = 04042592 > [17590.088094] AT spd 0 tl 17, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000400 > [17590.090849] AT spd 0 tl 18, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400 > [17590.093712] AT spd 0 tl 19, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400 > [17590.095168] rediscovered device fw0 We start reading config ROMs again. > [17590.096576] phy config: new root=ffc3, gap_count=8 > [17590.099530] AR spd 0 tl 31, ffc0 -> ffc3, ack_complete, QR resp = 31333934 > [17590.101014] AT spd 0 tl 31, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404 > [17590.103970] reset_bus(short_reset=1) We send another short bus reset to correct the gap count error. > [17590.105471] IRQ 00000010 AR_req > [17590.106934] AR evt_bus_reset, generation 7 > [17590.108439] IRQ 00010000 selfID > [17590.109901] AR evt_bus_reset, generation 7 > [17590.111381] AT spd 0 tl 32, ffc3 -> ffc0, evt_no_status, QR req, fffff0000408 > [17590.112862] 5 selfIDs, generation 8, local node ID ffc3 > [17590.114349] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc > [17590.115864] selfID 0: 81088475, phy 1 [-c-] S400 gc=8 -3W > [17590.117373] selfID n: 81816000, phy 1 [--p.....] > [17590.118844] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L > [17590.120297] selfID 0: 8348cc74, phy 3 [-c-] beta gc=8 -3W Lc The reset gets doubled again and the gap counts are inconsistent again. This repeats forever until the DSR-25 is turned off. IEEE 1394a section 8.4.6.2 mentions the possibility of nodes on a mixed 1394/1394a bus responding differently to a reset, and gives the solution of using a long bus reset to recover: > NOTE—Differences in the reset state machines between PHYs compliant with > IEEE Std 1394-1995 (but unmodified by IEEE Std 1394a-2000) and those > compliant with IEEE Std 1394a-2000 may result in different gap counts > (subsequent to a bus reset preceded by a PHY configuration packet) when > a mixture of devices is present. The probability of this inconsistency > may be reduced by asserting BUS_RESET for 166.6 us instead of the > arbitrated (short) bus reset specified by this standard, but this remedy > should not be applied unless inconsistent gap counts have been observed > subsequent to arbitrated (short) bus resets. This heuristic does not > guarantee uniform values for gap_count at all nodes; nevertheless, > because of the probabilistic nature of the problem, success is likely > after a modest number of attempts. Indeed, the Mac OS X driver uses 1394a's suggested approach: > // If the gap counts were mismatched and we're optimizing the > // gap count, use in IBR instead of an ISBR Regarding the difference in 1394-1995 and 1394a devices, the comment in the Mac driver goes on to say this: > // When this occurs the device is supposed to throw out the IBR. > // Devices that don't do this latch the gap count with the ISBR > // and then reset it with the subsequent IBR. > > // This gets us into a loop where we keep trying to optimize the > // gap count only to find it reset on the next bus generation The implication is that some of the PHYs are debouncing the incoming resets and some aren't. However, the node that was clearing the gap count seems to debounce resets in some cases. In the following kernel log and analyzer trace excerpts, the analyzer and DSR-25 are still present but the hub has been removed from the chain: > [ 6927.395841] phy config: new root=ffc2, gap_count=7 > [ 6927.397589] IRQ 00000001 AT_req > [ 6927.399340] AT ack_complete, PHY 02c70000 fd38ffff > [ 6927.401206] IRQ 00000010 AR_req > [ 6927.402928] AR evt_bus_reset, generation 8 > [ 6927.404586] IRQ 00010000 selfID > [ 6927.406196] AR evt_bus_reset, generation 8 > [ 6927.407846] 3 selfIDs, generation 9, local node ID ffc2 > [ 6927.409452] selfID 0: 80070880, phy 0 [p..] S100 gc=7 +0W c > [ 6927.411075] selfID 0: 814780b4, phy 1 [pc-] S400 gc=7 +0W L > [ 6927.412677] selfID 0: 8247ccd4, phy 2 [c--] beta gc=7 -3W Lc > 007 PHY_CONFIG 02C70000 01 ROOT ID=02 R=1 T=1 GAP=07 13.76 MS > 008 ASYNC FD38FFFF 02 0.330 US > 009 SUBACTN GAP 1.690 US > 010 ARB GAP 1.380 US > 011 BUS RESET 3.611 MS > 012 SELFID PKT0 80070880 01 ID=00 L=0 GAP=07 SP=100 C=1 I=0 171.0 US With this topology change, the double reset still seems to be happening. There are two evt_bus_resets in syslog, the timing of the self-ID packet shows that there must have been a long bus reset, and none of the self-ID packets are claiming responsibility for the reset. But this time, the gap count is consistent, and the analyzer trace only shows one reset. The self-ID packet is 171.0us after the bus reset; in the "bad" topology, the self-ID packet is ~171.6us after the first bus reset. So it looks like the analyzer's PHY is considering the two resets to be a single reset in this case, but not in the other case. (The 600ns difference also suggests that the DSR-25 is the origin of the second reset in the double reset pair. The hub would introduce this delay between the OHCI's short reset as seen by the analyzer and the DSR-25's long reset as seen by the analyzer.) So, I think the best explanation for the gap count inconsistency is some more subtle difference in bus reset detection between 1394-1995 and 1394a devices, and not that the node is just failing to throw away a bus reset that it's supposed to. I've prepared a patch that uses a long bus reset when responding to a gap count inconsistency. This eliminates the reset loop. I will submit the patch soon. Here is what happens with the "bad" topology and the patched kernel: > [17074.390846] IRQ 00000010 AR_req > [17074.392893] AR evt_bus_reset, generation 3 > [17074.394846] IRQ 00010000 selfID > [17074.396775] AR evt_bus_reset, generation 3 > [17074.398655] 5 selfIDs, generation 4, local node ID ffc3 > [17074.400589] selfID 0: 807f0880, phy 0 [p..] S100 gc=63 +0W Lc > [17074.402442] selfID 0: 813f8467, phy 1 [-p-] S400 gc=63 -3W i > [17074.404261] selfID n: 81835000, phy 1 [c--.....] > [17074.406075] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L > [17074.407857] selfID 0: 837fccd4, phy 3 [c--] beta gc=63 -3W Lc > 000 BUS RESET 0.000 USX > 001 BUS RESET 2.600 US > 002 SELFID PKT0 807F0880 01 ID=00 L=1 GAP=3F SP=100 C=1 I=0 168.6 US The DSR-25 was powered on and the hub initiated a bus reset. > [17074.918368] AT spd 0 tl 3a, ffc3 -> ffc0, ack_pending , QR req, fffff0000400 > [17074.921964] AR spd 0 tl 3a, ffc0 -> ffc3, ack_complete, QR resp = 04042592 > [17074.925563] AR spd 0 tl 3c, ffc0 -> ffc3, ack_complete, QR resp = 31333934 > [17074.927306] AT spd 0 tl 3c, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404 > [17074.929046] AT spd 0 tl 3b, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400 > [17074.932573] AT spd 0 tl 3d, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400 > [17074.934335] rediscovered device fw0 We start reading the config ROMs. > [17074.936033] phy config: new root=ffc3, gap_count=8 > [17074.937810] IRQ 00000021 AR_resp AT_req > [17074.939486] AR spd 0 tl 15, ffc0 -> ffc3, ack_complete, QR resp = e0644000 > [17074.941196] AT spd 0 tl 15, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000408 > [17074.942879] AT ack_complete, PHY 03c80000 fc37ffff > [17074.944597] IRQ 00000011 AR_req AT_req > [17074.946252] AR evt_bus_reset, generation 5 > [17074.947924] IRQ 00010010 selfID AR_req > [17074.949544] AR evt_bus_reset, generation 5 > [17074.951130] AT spd 0 tl 16, ffc3 -> ffc0, ack_pending , QR req, fffff000040c > [17074.952770] 5 selfIDs, generation 6, local node ID ffc3 > [17074.954341] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc > [17074.955918] selfID 0: 81088465, phy 1 [-p-] S400 gc=8 -3W > [17074.957480] selfID n: 81835000, phy 1 [c--.....] > [17074.959008] selfID 0: 827f80b4, phy 2 [pc-] S400 gc=63 +0W L > [17074.960537] selfID 0: 8348ccd4, phy 3 [c--] beta gc=8 -3W Lc > 072 ASYNC F000040C 03 DEST OFFSET=FFFF:F000:040C 0.330 US > 073 ASYNC A2508E4C 04 HEADER CRC 0.320 US > 074 ACK 2D ACK_PENDING 1.370 US > 075 BUS RESET 5.730 US > 076 BUS RESET 2.950 US > 077 SELFID PKT0 80480880 01 ID=00 L=1 GAP=08 SP=100 C=1 I=0 168.6 US We send a PHY configuration packet and short bus reset for gap count optimization. This turns into a double bus reset, and there is a gap count inconsistency. > [17075.494059] AT spd 0 tl 18, ffc3 -> ffc0, ack_pending , QR req, fffff0000400 > [17075.497078] AR spd 0 tl 18, ffc0 -> ffc3, ack_complete, QR resp = 04042592 > [17075.500071] AT spd 0 tl 19, ffc3 -> ffc1, evt_missing_ack, QR req, fffff0000400 > [17075.503125] AT spd 0 tl 1a, ffc3 -> ffc2, evt_missing_ack, QR req, fffff0000400 > [17075.504702] rediscovered device fw0 We start reading config ROMs again. > [17075.506199] phy config: new root=ffc3, gap_count=8 > [17075.507759] IRQ 00000021 AR_resp AT_req > [17075.509326] AR spd 0 tl 32, ffc0 -> ffc3, ack_complete, QR resp = 31333934 > [17075.510842] AT spd 0 tl 32, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000404 > [17075.512351] AT ack_complete, PHY 03c80000 fc37ffff > [17075.513960] IRQ 00000010 AR_req > [17075.515477] AR evt_bus_reset, generation 7 > [17075.517035] IRQ 00010000 selfID > [17075.518559] AT spd 0 tl 33, ffc3 -> ffc0, evt_no_status, QR req, fffff0000408 > [17075.520106] 5 selfIDs, generation 7, local node ID ffc3 > [17075.521660] selfID 0: 80480880, phy 0 [p..] S100 gc=8 +0W Lc > [17075.523228] selfID 0: 81088465, phy 1 [-p-] S400 gc=8 -3W > [17075.524793] selfID n: 81835000, phy 1 [c--.....] > [17075.526321] selfID 0: 824880b4, phy 2 [pc-] S400 gc=8 +0W L > [17075.527821] selfID 0: 8348ccd6, phy 3 [c--] beta gc=8 -3W Lci > 133 ASYNC 31333934 04 QUADLET=31333934 0.330 US > 134 ASYNC 86B1088D 05 HEADER CRC 0.320 US > 135 ACK 1E ACK_COMPLETE 1.330 US > 136 SUBACTN GAP 1.830 US > 137 ARB GAP 1.580 US > 138 BUS RESET 6.164 MS > 139 SELFID PKT0 80480880 01 ID=00 L=1 GAP=08 SP=100 C=1 I=0 169.4 US We send a long bus reset to correct the gap count error. This is successful. The bus reset does not get doubled, and the gap counts are consistent. > [17078.534125] AR spd 0 tl 3f, ffc0 -> ffc3, ack_complete, QR resp = 04042592 > [17078.535629] AT spd 0 tl 3f, ffc3 -> ffc0, pending/cancelled, QR req, fffff0000400 ... > [17078.771716] AT spd 0 tl 21, ffc3 -> ffc0, ack_pending , QR req, fffff0000450 > [17078.775367] AR spd 0 tl 21, ffc0 -> ffc3, ack_complete, QR resp = 536f6e79 > [17078.777511] created device fw1: GUID 08004601025e582a, S100 There are no more bus resets. We read the config ROM and we're ready to rock and roll. So, I believe that when there is a gap count inconsistency, sending a long bus reset is the right thing to do. -- Adam |