Thread: [PATCH] firewire: core: send bus reset promptly on gap count error
Brought to you by:
aeb,
bencollins
From: Adam G. <ad...@po...> - 2024-01-23 08:11:55
|
If we are bus manager and the bus has inconsistent gap counts, send a bus reset immediately instead of trying to read the root node's config ROM first. Otherwise, we could spend a lot of time trying to read the config ROM but never succeeding. This eliminates a 50+ second delay before the FireWire bus is usable after a newly connected device is powered on in certain circumstances. Signed-off-by: Adam Goldman <ad...@po...> Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/ --- --- linux-source-6.1.orig/drivers/firewire/core-card.c 2023-09-23 02:11:13.000000000 -0700 +++ linux-source-6.1/drivers/firewire/core-card.c 2024-01-22 04:23:06.000000000 -0800 @@ -435,6 +435,16 @@ * config rom. In either case, pick another root. */ new_root_id = local_id; + } else if (card->gap_count == 0) { + /* + * If self IDs have inconsistent gap counts, do a + * bus reset ASAP. The config rom read might never + * complete, so don't wait for it. However, still + * send a PHY configuration packet prior to the bus + * reset, as permitted by IEEE 1394-2008 8.4.5.2. + */ + new_root_id = local_id; + card->bm_retries = 0; } else if (!root_device_is_running) { /* * If we haven't probed this device yet, bail out now |
From: Takashi S. <o-t...@sa...> - 2024-01-27 08:37:52
|
Hi, Thanks for the patch. I would like to check some points about the change. On Tue, Jan 23, 2024 at 12:11:40AM -0800, Adam Goldman wrote: > If we are bus manager and the bus has inconsistent gap counts, send a > bus reset immediately instead of trying to read the root node's config > ROM first. Otherwise, we could spend a lot of time trying to read the > config ROM but never succeeding. > > This eliminates a 50+ second delay before the FireWire bus is usable > after a newly connected device is powered on in certain circumstances. At first, would I request you to explain about the certain circumstances in the patch comment? It is really helpful to understand the change itself. > Signed-off-by: Adam Goldman <ad...@po...> > Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/ > --- > > --- linux-source-6.1.orig/drivers/firewire/core-card.c 2023-09-23 02:11:13.000000000 -0700 > +++ linux-source-6.1/drivers/firewire/core-card.c 2024-01-22 04:23:06.000000000 -0800 > @@ -435,6 +435,16 @@ > * config rom. In either case, pick another root. > */ > new_root_id = local_id; > + } else if (card->gap_count == 0) { > + /* > + * If self IDs have inconsistent gap counts, do a > + * bus reset ASAP. The config rom read might never > + * complete, so don't wait for it. However, still > + * send a PHY configuration packet prior to the bus > + * reset, as permitted by IEEE 1394-2008 8.4.5.2. > + */ > + new_root_id = local_id; > + card->bm_retries = 0; > } else if (!root_device_is_running) { > /* > * If we haven't probed this device yet, bail out now Next, after the condition branches, we can see below lines: ``` /* * Finally, figure out if we should do a reset or not. If we have * done less than 5 resets with the same physical topology and we * have either a new root or a new gap count setting, let's do it. */ if (card->bm_retries++ < 5 && (card->gap_count != gap_count || new_root_id != root_id)) do_reset = true; ``` When the value of "card->gap_count" is zero, it would hit the condition of "card->gap_count != gap_count". I think the transmission of phy config packet and scheduling of short bus reset would be done, regardless of the change. Would I ask the main intention to the additional branch? Thanks Takashi Sakamoto |
From: Adam G. <ad...@po...> - 2024-01-27 11:14:07
|
Hi, On Sat, Jan 27, 2024 at 05:37:30PM +0900, Takashi Sakamoto wrote: > On Tue, Jan 23, 2024 at 12:11:40AM -0800, Adam Goldman wrote: > > If we are bus manager and the bus has inconsistent gap counts, send a > > bus reset immediately instead of trying to read the root node's config > > ROM first. Otherwise, we could spend a lot of time trying to read the > > config ROM but never succeeding. > > > > This eliminates a 50+ second delay before the FireWire bus is usable > > after a newly connected device is powered on in certain circumstances. > > At first, would I request you to explain about the certain > circumstances in the patch comment? It is really helpful to understand > the change itself. The delay occurs if a gap count inconsistency occurs, we are not the root node, and we become bus manager. One scenario that causes this is with a TI XIO2213B OHCI, the first time a Sony DSR-25 is powered on after being connected to the FireWire cable. > > Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/ This link has a longer description and kernel logs. > > --- linux-source-6.1.orig/drivers/firewire/core-card.c 2023-09-23 02:11:13.000000000 -0700 > > +++ linux-source-6.1/drivers/firewire/core-card.c 2024-01-22 04:23:06.000000000 -0800 > > @@ -435,6 +435,16 @@ > > * config rom. In either case, pick another root. > > */ > > new_root_id = local_id; > > + } else if (card->gap_count == 0) { > > + /* > > + * If self IDs have inconsistent gap counts, do a > > + * bus reset ASAP. The config rom read might never > > + * complete, so don't wait for it. However, still > > + * send a PHY configuration packet prior to the bus > > + * reset, as permitted by IEEE 1394-2008 8.4.5.2. > > + */ > > + new_root_id = local_id; > > + card->bm_retries = 0; > > } else if (!root_device_is_running) { > > /* > > * If we haven't probed this device yet, bail out now > > Next, after the condition branches, we can see below lines: > > ``` > /* > * Finally, figure out if we should do a reset or not. If we have > * done less than 5 resets with the same physical topology and we > * have either a new root or a new gap count setting, let's do it. > */ > > if (card->bm_retries++ < 5 && > (card->gap_count != gap_count || new_root_id != root_id)) > do_reset = true; > ``` > > When the value of "card->gap_count" is zero, it would hit the condition of > "card->gap_count != gap_count". I think the transmission of phy config > packet and scheduling of short bus reset would be done, regardless of the > change. Would I ask the main intention to the additional branch? Without the additional branch, the !root_device_is_running branch will be taken (because the root node's config ROM hasn't been read yet), and bm_work will go to sleep. Eventually we will give up trying to read the config ROM, the root_device==NULL branch will be taken, and the bus reset will be done. The additional branch eliminates waiting for the config ROM read when gap_count is zero. Here is the full sequence of events: 1. Bus reset occurs due to newly active device. 2. Self identification process completes. We are not root node. Gap counts are inconsistent. 3. build_tree() notices the gap count error and sets gap_count to 0: > /* > * If PHYs report different gap counts, set an invalid count > * which will force a gap count reconfiguration and a reset. > */ > if (SELF_ID_GAP_COUNT(q) != gap_count) > gap_count = 0; 4. bm_work() starts and makes us bus manager: > rcode = fw_run_transaction(card, TCODE_LOCK_COMPARE_SWAP, > irm_id, generation, SCODE_100, > CSR_REGISTER_BASE + CSR_BUS_MANAGER_ID, > transaction_data, 8); 5. read_config_rom() starts reading the root node's config ROM. 6. bm_work() wants to know if the root node is cycle master capable. If the root node is cycle master capable, we will leave it as the root node. Otherwise, we will make ourself the root node. To determine if the root node is cycle master capable, we must wait until its config ROM has been read: > } else if (!root_device_is_running) { > /* > * If we haven't probed this device yet, bail out now > * and let's try again once that's done. > */ > spin_unlock_irq(&card->lock); > goto out; 7a. Without the patch: read_config_rom() reads the first few quadlets from the config ROM. Due to the gap count inconsistency, eventually one of the reads times out. When read_config_rom() fails, fw_device_init() calls it again until MAX_RETRIES is reached. This takes 50+ seconds. 8a. bm_work() sees that we have given up trying to read the config ROM. It makes us the root node and does a bus reset: > if (root_device == NULL) { > /* > * Either link_on is false, or we failed to read the > * config rom. In either case, pick another root. > */ > new_root_id = local_id; > ... > if (card->bm_retries++ < 5 && > (card->gap_count != gap_count || new_root_id != root_id)) > do_reset = true; 7b. With the patch: Because of the gap count inconsistency, bm_work() does not wait for the config ROM to be read. It makes us the root node and does a bus reset immediately: > } else if (card->gap_count == 0) { > /* > * If self IDs have inconsistent gap counts, do a > * bus reset ASAP. The config rom read might never > * complete, so don't wait for it. However, still > * send a PHY configuration packet prior to the > * bus reset, as permitted by 1394-2008 8.4.5.2. > */ > new_root_id = local_id; > card->bm_retries = 0; > ... > if (card->bm_retries++ < 5 && > (card->gap_count != gap_count || new_root_id != root_id)) > do_reset = true; -- Adam |
From: Takashi S. <o-t...@sa...> - 2024-01-31 14:27:32
|
Hi, I'm sorry to be late for reply. On Sat, Jan 27, 2024 at 03:13:45AM -0800, Adam Goldman wrote: > Hi, > > On Sat, Jan 27, 2024 at 05:37:30PM +0900, Takashi Sakamoto wrote: > > On Tue, Jan 23, 2024 at 12:11:40AM -0800, Adam Goldman wrote: > > > If we are bus manager and the bus has inconsistent gap counts, send a > > > bus reset immediately instead of trying to read the root node's config > > > ROM first. Otherwise, we could spend a lot of time trying to read the > > > config ROM but never succeeding. > > > > > > This eliminates a 50+ second delay before the FireWire bus is usable > > > after a newly connected device is powered on in certain circumstances. > > > > At first, would I request you to explain about the certain > > circumstances in the patch comment? It is really helpful to understand > > the change itself. > > The delay occurs if a gap count inconsistency occurs, we are not the > root node, and we become bus manager. One scenario that causes this is > with a TI XIO2213B OHCI, the first time a Sony DSR-25 is powered on > after being connected to the FireWire cable. > > > > Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/ > > This link has a longer description and kernel logs. Of course, I've already read your post in the last week. Yes, you did enough investigation. However, in the review process, you need to post patches with sufficient description. > > > --- linux-source-6.1.orig/drivers/firewire/core-card.c 2023-09-23 02:11:13.000000000 -0700 > > > +++ linux-source-6.1/drivers/firewire/core-card.c 2024-01-22 04:23:06.000000000 -0800 > > > @@ -435,6 +435,16 @@ > > > * config rom. In either case, pick another root. > > > */ > > > new_root_id = local_id; > > > + } else if (card->gap_count == 0) { > > > + /* > > > + * If self IDs have inconsistent gap counts, do a > > > + * bus reset ASAP. The config rom read might never > > > + * complete, so don't wait for it. However, still > > > + * send a PHY configuration packet prior to the bus > > > + * reset, as permitted by IEEE 1394-2008 8.4.5.2. > > > + */ > > > + new_root_id = local_id; > > > + card->bm_retries = 0; > > > } else if (!root_device_is_running) { > > > /* > > > * If we haven't probed this device yet, bail out now > > > > Next, after the condition branches, we can see below lines: > > > > ``` > > /* > > * Finally, figure out if we should do a reset or not. If we have > > * done less than 5 resets with the same physical topology and we > > * have either a new root or a new gap count setting, let's do it. > > */ > > > > if (card->bm_retries++ < 5 && > > (card->gap_count != gap_count || new_root_id != root_id)) > > do_reset = true; > > ``` > > > > When the value of "card->gap_count" is zero, it would hit the condition of > > "card->gap_count != gap_count". I think the transmission of phy config > > packet and scheduling of short bus reset would be done, regardless of the > > change. Would I ask the main intention to the additional branch? > > Without the additional branch, the !root_device_is_running branch will > be taken (because the root node's config ROM hasn't been read yet), and > bm_work will go to sleep. Eventually we will give up trying to read the > config ROM, the root_device==NULL branch will be taken, and the bus > reset will be done. The additional branch eliminates waiting for the > config ROM read when gap_count is zero. > > Here is the full sequence of events: > > 1. Bus reset occurs due to newly active device. > > 2. Self identification process completes. We are not root node. Gap > counts are inconsistent. > > 3. build_tree() notices the gap count error and sets gap_count to 0: > > > /* > > * If PHYs report different gap counts, set an invalid count > > * which will force a gap count reconfiguration and a reset. > > */ > > if (SELF_ID_GAP_COUNT(q) != gap_count) > > gap_count = 0; > > 4. bm_work() starts and makes us bus manager: > > > rcode = fw_run_transaction(card, TCODE_LOCK_COMPARE_SWAP, > > irm_id, generation, SCODE_100, > > CSR_REGISTER_BASE + CSR_BUS_MANAGER_ID, > > transaction_data, 8); > > 5. read_config_rom() starts reading the root node's config ROM. > > 6. bm_work() wants to know if the root node is cycle master capable. If > the root node is cycle master capable, we will leave it as the root > node. Otherwise, we will make ourself the root node. To determine if the > root node is cycle master capable, we must wait until its config ROM has > been read: > > > } else if (!root_device_is_running) { > > /* > > * If we haven't probed this device yet, bail out now > > * and let's try again once that's done. > > */ > > spin_unlock_irq(&card->lock); > > goto out; > > 7a. Without the patch: read_config_rom() reads the first few quadlets > from the config ROM. Due to the gap count inconsistency, eventually one > of the reads times out. When read_config_rom() fails, fw_device_init() > calls it again until MAX_RETRIES is reached. This takes 50+ seconds. > > 8a. bm_work() sees that we have given up trying to read the config ROM. > It makes us the root node and does a bus reset: > > > if (root_device == NULL) { > > /* > > * Either link_on is false, or we failed to read the > > * config rom. In either case, pick another root. > > */ > > new_root_id = local_id; > > ... > > if (card->bm_retries++ < 5 && > > (card->gap_count != gap_count || new_root_id != root_id)) > > do_reset = true; > > 7b. With the patch: Because of the gap count inconsistency, bm_work() > does not wait for the config ROM to be read. It makes us the root node > and does a bus reset immediately: > > > } else if (card->gap_count == 0) { > > /* > > * If self IDs have inconsistent gap counts, do a > > * bus reset ASAP. The config rom read might never > > * complete, so don't wait for it. However, still > > * send a PHY configuration packet prior to the > > * bus reset, as permitted by 1394-2008 8.4.5.2. > > */ > > new_root_id = local_id; > > card->bm_retries = 0; > > ... > > if (card->bm_retries++ < 5 && > > (card->gap_count != gap_count || new_root_id != root_id)) > > do_reset = true; Thanks for the detail. The mismatch of gap count can be detected in the firstworkqueue context (bus_reset_work()) in 'driver/firewire/ohci.c', so I investigated more quick way to reset the bus, however your change looks better than it. I'd like to apply your patch for v6.9 kernel, while the patch description is not suffice, as I address. I'm sorry to ask you more work, but would I ask you to repost your patch with any detail description? The description should include enough information for developers inner/outer this subsystem to understand its background and intention. Thanks Takashi Sakamoto |
From: Adam G. <ad...@po...> - 2024-02-04 06:22:48
|
Hi, On Wed, Jan 31, 2024 at 11:27:14PM +0900, Takashi Sakamoto wrote: > I'd like to apply your patch for v6.9 kernel, while the patch > description is not suffice, as I address. I'm sorry to ask you more > work, but would I ask you to repost your patch with any detail > description? The description should include enough information for > developers inner/outer this subsystem to understand its background > and intention. Is the following description acceptable? If we are bus manager and the bus has inconsistent gap counts, send a bus reset immediately instead of trying to read the root node's config ROM first. Otherwise, we could spend a lot of time trying to read the config ROM but never succeeding. This eliminates a 50+ second delay before the FireWire bus is usable after a newly connected device is powered on in certain circumstances. The delay occurs if a gap count inconsistency occurs, we are not the root node, and we become bus manager. One scenario that causes this is with a TI XIO2213B OHCI, the first time a Sony DSR-25 is powered on after being connected to the FireWire cable. In this configuration, the Linux box will not receive the initial PHY configuration packet sent by the DSR-25 as IRM, resulting in the DSR-25 having a gap count of 44 while the Linux box has a gap count of 63. FireWire devices have a gap count parameter, which is set to 63 on power-up and can be changed with a PHY configuration packet. This determines the duration of the subaction and arbitration gaps. For reliable communication, all nodes on a FireWire bus must have the same gap count. A node may have one or more of the following roles: root node, bus manager (BM), isochronous resource manager (IRM), and cycle master. Unless a root node was forced with a PHY configuration packet, any node might become root node after a bus reset. Only the root node can become cycle master. If the root node is not cycle master capable, the BM or IRM should force a change of root node. After a bus reset, each node sends a self-ID packet, which contains its current gap count. A single bus reset does not change the gap count, but two bus resets in a row will set the gap count to 63. Because a consistent gap count is required for reliable communication, IEEE 1394a-2000 requires that the bus manager generate a bus reset if it detects that the gap count is inconsistent. When the gap count is inconsistent, build_tree() will notice this after the self identification process. It will set card->gap_count to the invalid value 0. If we become bus master, this will force bm_work() to send a bus reset when it performs gap count optimization. After a bus reset, there is no bus manager. We will almost always try to become bus manager. Once we become bus manager, we will first determine whether the root node is cycle master capable. Then, we will determine if the gap count should be changed. If either the root node or the gap count should be changed, we will generate a bus reset. To determine if the root node is cycle master capable, we read its configuration ROM. bm_work() will wait until we have finished trying to read the configuration ROM. However, an inconsistent gap count can make this take a long time. read_config_rom() will read the first few quadlets from the config ROM. Due to the gap count inconsistency, eventually one of the reads will time out. When read_config_rom() fails, fw_device_init() calls it again until MAX_RETRIES is reached. This takes 50+ seconds. Once we give up trying to read the configuration ROM, bm_work() will wake up, assume that the root node is not cycle master capable, and do a bus reset. Hopefully, this will resolve the gap count inconsistency. This change makes bm_work() check for an inconsistent gap count before waiting for the root node's configuration ROM. If the gap count is inconsistent, bm_work() will immediately do a bus reset. This eliminates the 50+ second delay and rapidly brings the bus to a working state. I considered that if the gap count is inconsistent, a PHY configuration packet might not be successful, so it could be desirable to skip the PHY configuration packet before the bus reset in this case. However, IEEE 1394a-2000 and IEEE 1394-2008 say that the bus manager may transmit a PHY configuration packet before a bus reset when correcting a gap count error. Since the standard endorses this, I decided it's safe to retain the PHY configuration packet transmission. Normally, after a topology change, we will reset the bus a maximum of 5 times to change the root node and perform gap count optimization. However, if there is a gap count inconsistency, we must always generate a bus reset. Otherwise the gap count inconsistency will persist and communication will be unreliable. For that reason, if there is a gap count inconstency, we generate a bus reset even if we already reached the 5 reset limit. -- Adam |
From: Takashi S. <o-t...@sa...> - 2024-02-04 09:20:47
|
Hi, On Sun, Feb 4, 2024, at 15:22, Adam Goldman wrote: > Is the following description acceptable? > > If we are bus manager and the bus has inconsistent gap counts, send a bus > reset immediately instead of trying to read the root node's config ROM > first. Otherwise, we could spend a lot of time trying to read the config > ROM but never succeeding. > > This eliminates a 50+ second delay before the FireWire bus is usable after > a newly connected device is powered on in certain circumstances. > > The delay occurs if a gap count inconsistency occurs, we are not the root > node, and we become bus manager. One scenario that causes this is with a TI > XIO2213B OHCI, the first time a Sony DSR-25 is powered on after being > connected to the FireWire cable. In this configuration, the Linux box will > not receive the initial PHY configuration packet sent by the DSR-25 as IRM, > resulting in the DSR-25 having a gap count of 44 while the Linux box has a > gap count of 63. > > FireWire devices have a gap count parameter, which is set to 63 on power-up > and can be changed with a PHY configuration packet. This determines the > duration of the subaction and arbitration gaps. For reliable communication, > all nodes on a FireWire bus must have the same gap count. > > A node may have one or more of the following roles: root node, bus manager > (BM), isochronous resource manager (IRM), and cycle master. Unless a root > node was forced with a PHY configuration packet, any node might become root > node after a bus reset. Only the root node can become cycle master. If the > root node is not cycle master capable, the BM or IRM should force a change > of root node. > > After a bus reset, each node sends a self-ID packet, which contains its > current gap count. A single bus reset does not change the gap count, but > two bus resets in a row will set the gap count to 63. Because a consistent > gap count is required for reliable communication, IEEE 1394a-2000 requires > that the bus manager generate a bus reset if it detects that the gap count > is inconsistent. > > When the gap count is inconsistent, build_tree() will notice this after the > self identification process. It will set card->gap_count to the invalid > value 0. If we become bus master, this will force bm_work() to send a bus > reset when it performs gap count optimization. > > After a bus reset, there is no bus manager. We will almost always try to > become bus manager. Once we become bus manager, we will first determine > whether the root node is cycle master capable. Then, we will determine if > the gap count should be changed. If either the root node or the gap count > should be changed, we will generate a bus reset. > > To determine if the root node is cycle master capable, we read its > configuration ROM. bm_work() will wait until we have finished trying to > read the configuration ROM. > > However, an inconsistent gap count can make this take a long time. > read_config_rom() will read the first few quadlets from the config ROM. Due > to the gap count inconsistency, eventually one of the reads will time out. > When read_config_rom() fails, fw_device_init() calls it again until > MAX_RETRIES is reached. This takes 50+ seconds. > > Once we give up trying to read the configuration ROM, bm_work() will wake > up, assume that the root node is not cycle master capable, and do a bus > reset. Hopefully, this will resolve the gap count inconsistency. > > This change makes bm_work() check for an inconsistent gap count before > waiting for the root node's configuration ROM. If the gap count is > inconsistent, bm_work() will immediately do a bus reset. This eliminates > the 50+ second delay and rapidly brings the bus to a working state. > > I considered that if the gap count is inconsistent, a PHY configuration > packet might not be successful, so it could be desirable to skip the PHY > configuration packet before the bus reset in this case. However, IEEE > 1394a-2000 and IEEE 1394-2008 say that the bus manager may transmit a PHY > configuration packet before a bus reset when correcting a gap count error. > Since the standard endorses this, I decided it's safe to retain the PHY > configuration packet transmission. > > Normally, after a topology change, we will reset the bus a maximum of 5 > times to change the root node and perform gap count optimization. However, > if there is a gap count inconsistency, we must always generate a bus reset. > Otherwise the gap count inconsistency will persist and communication will > be unreliable. For that reason, if there is a gap count inconstency, we > generate a bus reset even if we already reached the 5 reset limit. > > -- Adam It looks preferable in this case, I think. We are going to change the part of fundamentals in this software stack, thus the longer description is inevitable. I think it preferable as well to add code comment about your judge to retain sending PHY configuration packet in the case. Additionally, it is helpful to add code comment about "bm_retries=0". Thanks. Takashi Sakamoto |
From: Adam G. <ad...@po...> - 2024-02-05 05:21:31
|
If we are bus manager and the bus has inconsistent gap counts, send a bus reset immediately instead of trying to read the root node's config ROM first. Otherwise, we could spend a lot of time trying to read the config ROM but never succeeding. This eliminates a 50+ second delay before the FireWire bus is usable after a newly connected device is powered on in certain circumstances. The delay occurs if a gap count inconsistency occurs, we are not the root node, and we become bus manager. One scenario that causes this is with a TI XIO2213B OHCI, the first time a Sony DSR-25 is powered on after being connected to the FireWire cable. In this configuration, the Linux box will not receive the initial PHY configuration packet sent by the DSR-25 as IRM, resulting in the DSR-25 having a gap count of 44 while the Linux box has a gap count of 63. FireWire devices have a gap count parameter, which is set to 63 on power-up and can be changed with a PHY configuration packet. This determines the duration of the subaction and arbitration gaps. For reliable communication, all nodes on a FireWire bus must have the same gap count. A node may have zero or more of the following roles: root node, bus manager (BM), isochronous resource manager (IRM), and cycle master. Unless a root node was forced with a PHY configuration packet, any node might become root node after a bus reset. Only the root node can become cycle master. If the root node is not cycle master capable, the BM or IRM should force a change of root node. After a bus reset, each node sends a self-ID packet, which contains its current gap count. A single bus reset does not change the gap count, but two bus resets in a row will set the gap count to 63. Because a consistent gap count is required for reliable communication, IEEE 1394a-2000 requires that the bus manager generate a bus reset if it detects that the gap count is inconsistent. When the gap count is inconsistent, build_tree() will notice this after the self identification process. It will set card->gap_count to the invalid value 0. If we become bus master, this will force bm_work() to send a bus reset when it performs gap count optimization. After a bus reset, there is no bus manager. We will almost always try to become bus manager. Once we become bus manager, we will first determine whether the root node is cycle master capable. Then, we will determine if the gap count should be changed. If either the root node or the gap count should be changed, we will generate a bus reset. To determine if the root node is cycle master capable, we read its configuration ROM. bm_work() will wait until we have finished trying to read the configuration ROM. However, an inconsistent gap count can make this take a long time. read_config_rom() will read the first few quadlets from the config ROM. Due to the gap count inconsistency, eventually one of the reads will time out. When read_config_rom() fails, fw_device_init() calls it again until MAX_RETRIES is reached. This takes 50+ seconds. Once we give up trying to read the configuration ROM, bm_work() will wake up, assume that the root node is not cycle master capable, and do a bus reset. Hopefully, this will resolve the gap count inconsistency. This change makes bm_work() check for an inconsistent gap count before waiting for the root node's configuration ROM. If the gap count is inconsistent, bm_work() will immediately do a bus reset. This eliminates the 50+ second delay and rapidly brings the bus to a working state. I considered that if the gap count is inconsistent, a PHY configuration packet might not be successful, so it could be desirable to skip the PHY configuration packet before the bus reset in this case. However, IEEE 1394a-2000 and IEEE 1394-2008 say that the bus manager may transmit a PHY configuration packet before a bus reset when correcting a gap count error. Since the standard endorses this, I decided it's safe to retain the PHY configuration packet transmission. Normally, after a topology change, we will reset the bus a maximum of 5 times to change the root node and perform gap count optimization. However, if there is a gap count inconsistency, we must always generate a bus reset. Otherwise the gap count inconsistency will persist and communication will be unreliable. For that reason, if there is a gap count inconstency, we generate a bus reset even if we already reached the 5 reset limit. Signed-off-by: Adam Goldman <ad...@po...> Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/ --- --- linux-source-6.1.orig/drivers/firewire/core-card.c 2023-09-23 02:11:13.000000000 -0700 +++ linux-source-6.1/drivers/firewire/core-card.c 2024-02-04 20:24:53.000000000 -0800 @@ -429,7 +429,23 @@ */ card->bm_generation = generation; - if (root_device == NULL) { + if (card->gap_count == 0) { + /* + * If self IDs have inconsistent gap counts, do a + * bus reset ASAP. The config rom read might never + * complete, so don't wait for it. However, still + * send a PHY configuration packet prior to the + * bus reset. The PHY configuration packet might + * fail, but 1394-2008 8.4.5.2 explicitly permits + * it in this case, so it should be safe to try. + */ + new_root_id = local_id; + /* + * We must always send a bus reset if the gap count + * is inconsistent, so bypass the 5-reset limit. + */ + card->bm_retries = 0; + } else if (root_device == NULL) { /* * Either link_on is false, or we failed to read the * config rom. In either case, pick another root. |
From: Takashi S. <o-t...@sa...> - 2024-02-06 23:38:55
|
Hi, Thanks for the patch. I applied it to for-linus branch and will send it for v6.8-rc4 in this week. Thanks for your long patience in the review process. On Sun, Feb 04, 2024 at 09:21:14PM -0800, Adam Goldman wrote: > If we are bus manager and the bus has inconsistent gap counts, send a > bus reset immediately instead of trying to read the root node's config > ROM first. Otherwise, we could spend a lot of time trying to read the > config ROM but never succeeding. > > This eliminates a 50+ second delay before the FireWire bus is usable after > a newly connected device is powered on in certain circumstances. > > The delay occurs if a gap count inconsistency occurs, we are not the root > node, and we become bus manager. One scenario that causes this is with a TI > XIO2213B OHCI, the first time a Sony DSR-25 is powered on after being > connected to the FireWire cable. In this configuration, the Linux box will > not receive the initial PHY configuration packet sent by the DSR-25 as IRM, > resulting in the DSR-25 having a gap count of 44 while the Linux box has a > gap count of 63. > > FireWire devices have a gap count parameter, which is set to 63 on power-up > and can be changed with a PHY configuration packet. This determines the > duration of the subaction and arbitration gaps. For reliable communication, > all nodes on a FireWire bus must have the same gap count. > > A node may have zero or more of the following roles: root node, bus manager > (BM), isochronous resource manager (IRM), and cycle master. Unless a root > node was forced with a PHY configuration packet, any node might become root > node after a bus reset. Only the root node can become cycle master. If the > root node is not cycle master capable, the BM or IRM should force a change > of root node. > > After a bus reset, each node sends a self-ID packet, which contains its > current gap count. A single bus reset does not change the gap count, but > two bus resets in a row will set the gap count to 63. Because a consistent > gap count is required for reliable communication, IEEE 1394a-2000 requires > that the bus manager generate a bus reset if it detects that the gap count > is inconsistent. > > When the gap count is inconsistent, build_tree() will notice this after the > self identification process. It will set card->gap_count to the invalid > value 0. If we become bus master, this will force bm_work() to send a bus > reset when it performs gap count optimization. > > After a bus reset, there is no bus manager. We will almost always try to > become bus manager. Once we become bus manager, we will first determine > whether the root node is cycle master capable. Then, we will determine if > the gap count should be changed. If either the root node or the gap count > should be changed, we will generate a bus reset. > > To determine if the root node is cycle master capable, we read its > configuration ROM. bm_work() will wait until we have finished trying to > read the configuration ROM. > > However, an inconsistent gap count can make this take a long time. > read_config_rom() will read the first few quadlets from the config ROM. Due > to the gap count inconsistency, eventually one of the reads will time out. > When read_config_rom() fails, fw_device_init() calls it again until > MAX_RETRIES is reached. This takes 50+ seconds. > > Once we give up trying to read the configuration ROM, bm_work() will wake > up, assume that the root node is not cycle master capable, and do a bus > reset. Hopefully, this will resolve the gap count inconsistency. > > This change makes bm_work() check for an inconsistent gap count before > waiting for the root node's configuration ROM. If the gap count is > inconsistent, bm_work() will immediately do a bus reset. This eliminates > the 50+ second delay and rapidly brings the bus to a working state. > > I considered that if the gap count is inconsistent, a PHY configuration > packet might not be successful, so it could be desirable to skip the PHY > configuration packet before the bus reset in this case. However, IEEE > 1394a-2000 and IEEE 1394-2008 say that the bus manager may transmit a PHY > configuration packet before a bus reset when correcting a gap count error. > Since the standard endorses this, I decided it's safe to retain the PHY > configuration packet transmission. > > Normally, after a topology change, we will reset the bus a maximum of 5 > times to change the root node and perform gap count optimization. However, > if there is a gap count inconsistency, we must always generate a bus reset. > Otherwise the gap count inconsistency will persist and communication will > be unreliable. For that reason, if there is a gap count inconstency, we > generate a bus reset even if we already reached the 5 reset limit. > > Signed-off-by: Adam Goldman <ad...@po...> > Link: https://sourceforge.net/p/linux1394/mailman/message/58727806/ > --- Takashi Sakamoto |