Re: [Madwifi-devel] [Madwifi-cvs] revision 3502 committed (NETDEV WATCHDOG issue)
From: Benoit P. <ben...@fr...> - 2008-04-19 08:51:17
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Benoit PAPILLAULT wrote:
| Benoit PAPILLAULT wrote:
| | Michael Taylor wrote:
| |> Project : madwifi
| |> Revision : 3502
| |> Author : mtaylor (Michael Taylor)
| |> Date : 2008-04-10 01:41:07 +0200 (Thu, 10 Apr 2008)
| |>
| |> Log Message :
| |> The RX tasklet must process the CABQ when it is setup and non-empty,
| |> even if the HAL says it is not "active".
| |>
| |> HAL will say the CABQ is not active if the trigger has fired and the
| |> CABQ has been serviced. Therefore, the conditions under which we were
| |> skipping CABQ are EXACTLY when we need to check it.
| |>
| |> This was leading to large numbers of rx buffers consumed in the CABQ
| |> (multicast, etc) and thus the RX queue would deteriorate over time
| |> until RX overruns would eventually start to be a big problem.
| |>
| |> I also include better diagnostics.
| |
| | Hi Mike,
| |
| | I don't really understand what you are saying in this commit.
| | You are probably talking about TX queues instead of RX queues, since
| | the CABq is used for transmitting traffic.
| |
| | What I don't understand are the words "active" and "trigger". Moreover,
| | is this specific to the CABq? If not, why are the other queues not
| | checked in the same way?
| |
| | I'm currently trying to debug a case where I get "NETDEV WATCHDOG"
| | messages filling up my log (1 per minute with trunk@3545 and 1 per hour
| | with madwifi-dfs@3545). The setup is basically 2 or 3 nodes in IBSS
| | mode, and the culprit node is doing a broadcast ping + UDP stream.
| | Apparently, what happens is that the TX buffers are all eaten up at
| | some point, within a few seconds, and never freed. So, 5s later, the
| | kernel NETDEV WATCHDOG kicks in! During this time, no traffic is
| | sent :-(
| |
| | I was wondering if the culprit was that we no longer check TX queues...
| |
| | Help appreciated.
| |
| | Regards,
| | Benoit
|
| I have done more experiments. In fact, HAL_INT_TX correctly shows up.
| However, the first packet in the queue always returns HAL_EINPROGRESS
| when its status is checked, so the queue is no longer processed.
|
| How can HAL_INT_TX interrupts still occur if the first TX descriptor
| says "HW operation is still in progress"? For information, I'm currently
| testing on an SMP laptop with a 2.6.20 Ubuntu kernel. Could there be
| some race condition? HW bugs? workaround?
|
| Regards,
| Benoit

I hope I'm not talking to myself :-). I can now confirm that it's a
race condition between ath_tx_txqaddbuf(), which adds a TX buffer to the
SW and HW TX queues, and ath_tx_processq(), which removes TX buffers
from the SW queue.

In detail, here is the case I debugged:

1. A first TX buffer is submitted to HW in ath_tx_txqaddbuf(). Since
   this is the first time, axq_link is NULL and ath_hal_puttxbuf() is
   used to set the HW queue head. axq_link is then set to this TX
   buffer's address.

2. When this first TX buffer is successfully transmitted, HAL_INT_TX is
   triggered. At this point, ath_tx_processq() is called. The first TX
   buffer in the SW queue is checked. It is found to have been properly
   sent by HW and is removed from the head of the SW queue.

3. At this time, probably since the txq LOCK is released, a second TX
   buffer is submitted to HW in ath_tx_txqaddbuf(). Since axq_link is
   NOT NULL, the HW queue head is not updated. This second TX buffer is
   added to the SW queue and the HW queue is restarted (but the HW queue
   head still points to the first TX buffer!).

4. Another HAL_INT_TX interrupt is received and ath_tx_processq() is
   called. The first TX buffer in the SW queue is now the second TX
   buffer that was submitted. It has not been sent, so it is not removed
   from the SW queue. We are now stuck, since the HW queue points to a
   TX buffer which is no longer in the SW queue.

5. The SW queue quickly fills up and NETDEV WATCHDOG kicks in after 5s.
   At this time, it is clear that the HW queue head does not match any
   TX buffer in the SW queue!
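
To make the five steps above easier to follow, here is a small
stand-alone C model of the sequence. It is only a sketch and not driver
code: struct desc, struct txq, hw_restart(), toy_addbuf(),
toy_processq() and stuck_after() are names made up for this
illustration, and the hardware behaviour in hw_restart() (a restart
looks at the queue head and does nothing if that descriptor is already
completed) is an assumption drawn from the observation in step 3, not a
documented fact about the chip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAXBUF 8

/* Toy TX descriptor: 'link' stands in for ds_link, 'done' for the TX status. */
struct desc {
    struct desc *link;
    bool done;
};

/* Toy queue: hw_head models the HW queue head, axq_link the driver's
 * pointer to the link field of the last descriptor it queued. */
struct txq {
    struct desc *hw_head;
    struct desc **axq_link;
    struct desc *sw[MAXBUF];   /* SW queue, oldest first */
    int depth;
};

/* ASSUMPTION (not confirmed HW behaviour): a restart begins at hw_head
 * and does nothing if that descriptor has already been completed. */
static void hw_restart(struct txq *q)
{
    struct desc *d = q->hw_head;

    if (d == NULL || d->done)
        return;                /* stale head: nothing gets transmitted */
    for (; d != NULL; d = d->link)
        d->done = true;
}

/* Follows the logic described for ath_tx_txqaddbuf() in steps 1 and 3. */
static void toy_addbuf(struct txq *q, struct desc *d)
{
    d->link = NULL;
    d->done = false;
    if (q->axq_link == NULL)
        q->hw_head = d;        /* first buffer: set the HW queue head */
    else
        *q->axq_link = d;      /* later buffers: only chain via the link */
    q->axq_link = &d->link;
    q->sw[q->depth++] = d;
    hw_restart(q);
}

/* Follows ath_tx_processq() as described in steps 2 and 4: reap completed
 * buffers from the head of the SW queue. axq_link is left untouched. */
static int toy_processq(struct txq *q)
{
    int reaped = 0;

    while (q->depth > 0 && q->sw[0]->done) {
        for (int i = 1; i < q->depth; i++)
            q->sw[i - 1] = q->sw[i];
        q->depth--;
        reaped++;
    }
    return reaped;
}

/* Queue nbufs buffers one at a time, reaping after each one, and return
 * how many buffers are left stuck in the SW queue. */
int stuck_after(int nbufs)
{
    static struct desc pool[MAXBUF];
    struct txq q = { NULL, NULL, { NULL }, 0 };

    assert(nbufs >= 0 && nbufs <= MAXBUF);
    for (int i = 0; i < nbufs; i++) {
        toy_addbuf(&q, &pool[i]);
        toy_processq(&q);
    }
    return q.depth;
}
```

In this model only the first buffer is ever reaped: every later buffer
is chained behind a descriptor the hardware has already finished with,
so stuck_after(n) leaves n - 1 buffers in the SW queue for any n >= 1,
which is the "queue fills up" situation of step 5.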

To solve this issue, I tried to reset axq_link to NULL in
ath_tx_processq() whenever a TX buffer with ds_link = NULL is processed.
It has some effect, but as soon as I removed all the debug code I had
added, I still got NETDEV WATCHDOG messages. Moreover, I tried to set
the HW queue head on every call to ath_tx_txqaddbuf() and ... it works!
It's a bit too aggressive, however. I tried to set the HW queue head
only when axq_depth is 1, but it does not work.

What should be the way to handle this kind of race condition?

Regards,
Benoit
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFICbJ/OR6EySwP7oIRAmPRAKCriIwOECYeL1kQgfV88ckObmRUEgCg9P+W
5vNPvzJFxfoodfJcvm49s2k=
=t8G8
-----END PGP SIGNATURE-----
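
For comparison, two of the workarounds mentioned in the message can be
expressed as policies in a toy C model. This is a hedged sketch, not a
patch: enum policy, struct tdesc, struct tq, restart(), addbuf(),
processq() and stuck_with() are hypothetical names, and restart() again
assumes that a HW restart does nothing when the queue head points at an
already completed descriptor. The model is strictly sequential, so it
cannot reproduce the timing-dependent failures reported for the
axq_link reset once the debug code was removed, and the axq_depth == 1
variant is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUF 8

enum policy {
    HEAD_ON_FIRST_ONLY,   /* set the HW head only while axq_link is NULL */
    RESET_LINK_ON_REAP,   /* clear axq_link when a ds_link == NULL buffer is reaped */
    HEAD_ON_EVERY_ADD     /* rewrite the HW head on every add */
};

struct tdesc {
    struct tdesc *link;   /* stands in for ds_link */
    bool done;            /* stands in for the TX status */
};

struct tq {
    struct tdesc *hw_head;
    struct tdesc **axq_link;
    struct tdesc *sw[NBUF];   /* SW queue, oldest first */
    int depth;
};

/* ASSUMPTION: a restart is a no-op if the head descriptor is already done. */
static void restart(struct tq *q)
{
    struct tdesc *d = q->hw_head;

    if (d == NULL || d->done)
        return;
    for (; d != NULL; d = d->link)
        d->done = true;
}

static void addbuf(struct tq *q, struct tdesc *d, enum policy p)
{
    d->link = NULL;
    d->done = false;
    if (q->axq_link == NULL || p == HEAD_ON_EVERY_ADD)
        q->hw_head = d;       /* point the HW at the new buffer */
    else
        *q->axq_link = d;     /* only chain behind the previous buffer */
    q->axq_link = &d->link;
    q->sw[q->depth++] = d;
    restart(q);
}

static void processq(struct tq *q, enum policy p)
{
    while (q->depth > 0 && q->sw[0]->done) {
        /* End of chain reaped: force the next add to set the HW head. */
        if (p == RESET_LINK_ON_REAP && q->sw[0]->link == NULL)
            q->axq_link = NULL;
        for (int i = 1; i < q->depth; i++)
            q->sw[i - 1] = q->sw[i];
        q->depth--;
    }
}

/* Queue nbufs buffers one at a time under policy p, reaping after each,
 * and return how many buffers remain stuck in the SW queue. */
int stuck_with(enum policy p, int nbufs)
{
    static struct tdesc pool[NBUF];
    struct tq q = { NULL, NULL, { NULL }, 0 };

    assert(nbufs >= 0 && nbufs <= NBUF);
    for (int i = 0; i < nbufs; i++) {
        addbuf(&q, &pool[i], p);
        processq(&q, p);
    }
    return q.depth;
}
```

In this sequential model both workarounds drain the queue, while the
head-on-first-only policy leaves every buffer after the first one
stuck. Rewriting the head on every add is only safe here because nothing
is ever pending in the toy hardware at that moment; in a real driver it
could skip buffers that are still queued, which may be why it feels too
aggressive.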