From: Jonathan W. <jw...@ju...> - 2013-08-27 13:08:03
I have some more information. For those not interested in the detailed analysis, the last few paragraphs contain some interesting observations.

FFADO revisions 2375 through 2378 do NOT cause a system lockup on my development machine. Revision 2379, if left to its own devices, seems to lock the machine hard almost every time it's run, even when the FF800 is in master clock mode. This backs up Ilia's observation that r2379 appears to trigger the lockup more frequently, although he was also seeing the lockups on earlier revisions.

Looking at the changes in r2379, it's hard to see how they could cause such a dramatic problem:

* Chunk 1 is in getSamplingFrequency() and comprises comments only.
* Chunk 2 is in getSamplingFrequency(). It adds a get_hardware_state() call. A new return path is relevant only if the device is in autosync clock mode.
* Chunk 3 is in setSamplingFrequency(). Comments were added and a debug warning was changed to an error. The other functional changes (a new return path and a setting of software_freq) are only executed if the device is in autosync clock mode.
* Chunk 4 is in getSupportedSamplingFrequencies(). Comments were added. In autosync clock mode, a new push_back() to the frequency list has been added.

With my device currently in master clock mode, the only functional change introduced by this patch is the call to get_hardware_state() in getSamplingFrequency(). Commenting this out seems to fix the crash for me: I no longer see the "CTR discrepancy" messages and the kernel does not lock up a few seconds after starting jackd/ffado.

So what's going on? getSamplingFrequency() is called in only a few places within the RME driver:

* Device::prepare() - called once during program startup and device initialisation.
* Device::getFramesPerPacket() - various call sites.
* Device::addDirPorts() - called only during program startup.

getFramesPerPacket() is significant because it shows up in the streaming driver.
The call in RmeReceiveStreamProcessor is from getMaxPacketSize(), which is another function only called during startup. On the transmit side, there is a similar call from RmeTransmitStreamProcessor's getMaxPacketSize() method. However, there is one other call site: getNominalFramesPerPacket(). This function is called 1-2 times in RmeTransmitStreamProcessor::generatePacketHeader(), and that method is called once per iso cycle. It also shows up twice in generatePacketData(), once in generateSilentPacketHeader() and once in fillDataPacketHeader(). All up, this function is probably called 2-3 times per iso cycle on average.

Prior to r2379 this wasn't a big deal, since getSamplingFrequency() more or less just returned a fixed value. Revision 2379 introduced the get_hardware_state() call, which involves a two-quadlet block async read transaction on the bus - a detail which was overlooked at the time r2379 was being prepared. I can't remember what the bus specifications say about async transactions and how often they can be done. In any case, it seems that the kernel driver was choking on the async load generated by this oversight: 2-3 two-quadlet async block reads per iso cycle appear to be beyond the kernel driver's capability.

With this in mind I will rework the driver to avoid this problem. However, I am not in a position to judge whether this FFADO error has exposed a real problem in the firewire driver, or whether it can be written off on the basis that the driver simply can't cope with this traffic pattern. I would be interested to see what others think.

As an aside, if the kernel driver does have trouble with this traffic load, it seems that this might be a possible DoS vector against the kernel. If one has any firewire device connected, it seems the kernel can be crashed by sending a handful of async block read requests to that device every iso cycle (whether some iso traffic is also needed is not something I've tested).

Regards
  jonathan
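
To put a rough number on the load described above, and to show one way such a query could be kept off the bus in the streaming path, here is a small stand-alone model. It is only a sketch and not the actual FFADO rework: FakeDevice, UncachedQuery, CachedQuery and busReadsForOneSecond() are hypothetical names invented for illustration, and a counter stands in for the real async read done by get_hardware_state(). The one hard figure is the FireWire iso cycle rate of 8000 cycles per second (125 us per cycle).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the device. Nothing here talks to real hardware;
// bus_reads merely counts how many async block reads would have been issued.
struct FakeDevice {
    std::uint64_t bus_reads = 0;
    int hw_freq = 48000;

    // Plays the role of get_hardware_state(): one async read per call.
    int readHardwareFreq() { ++bus_reads; return hw_freq; }
};

// r2379-like behaviour: every frequency query goes out on the bus.
class UncachedQuery {
public:
    explicit UncachedQuery(FakeDevice &dev) : m_dev(dev) {}
    int getSamplingFrequency() { return m_dev.readHardwareFreq(); }
private:
    FakeDevice &m_dev;
};

// Assumed alternative: read the hardware once (e.g. at startup or when the
// clock setup changes) and answer per-cycle queries from the cached value.
class CachedQuery {
public:
    explicit CachedQuery(FakeDevice &dev) : m_dev(dev) {}
    void refresh() { m_cached_freq = m_dev.readHardwareFreq(); }
    int getSamplingFrequency() const {
        assert(m_cached_freq != 0 && "refresh() must be called first");
        return m_cached_freq;
    }
private:
    FakeDevice &m_dev;
    int m_cached_freq = 0;
};

constexpr unsigned kIsoCyclesPerSecond = 8000; // 125 us per iso cycle

// Simulates one second of streaming with calls_per_cycle frequency queries
// per iso cycle and returns how many simulated bus reads that caused.
template <typename Query>
std::uint64_t busReadsForOneSecond(const FakeDevice &dev, Query &query,
                                   unsigned calls_per_cycle) {
    const std::uint64_t before = dev.bus_reads;
    for (unsigned cycle = 0; cycle < kIsoCyclesPerSecond; ++cycle)
        for (unsigned call = 0; call < calls_per_cycle; ++call)
            (void)query.getSamplingFrequency();
    return dev.bus_reads - before;
}
```

With 2-3 queries per iso cycle, the uncached variant in this model issues 16000-24000 reads per second, while the cached variant issues none after the initial refresh(). Whether caching is acceptable in autosync mode, where the frequency is dictated externally and may change, is a separate question that this model does not address.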