Re: [FFADO-devel] jackd+ffado causes kernel lock ups

I have some more information.  For those not interested in the detailed
analysis, the last few paragraphs contain some interesting observations.

FFADO revisions 2375 through 2378 do NOT cause a system lockup on my
development machine.  Revision 2379, if left to its own devices, seems to
lock the machine hard almost every time it's run even when the FF800 is in
master clock mode.  This backs up observations made by Ilia about r2379
appearing to trigger the lockup more frequently, although he was also seeing
the lockups on earlier revisions.

Looking at the changes in r2379 it's hard to see how they could cause such a
dramatic problem:

 * Chunk 1 is in getSamplingFrequency() and comprises comments only.

 * Chunk 2 is in getSamplingFrequency().  It adds a get_hardware_state()
   call.  A new return path is relevant only if the device is in autosync
   clock mode.

 * Chunk 3 is in setSamplingFrequency().  Comments were added and a debug
   warning was changed to an error.  The other functional changes (a new
   return path and a setting of software_freq) are only executed if the
   device is in autosync clock mode.

 * Chunk 4 is in getSupportedSamplingFrequencies().  Comments were added.
   If in autosync clock mode a new push_back() to the frequency list has
   been added.

With my device currently in master clock mode, the only functional change   
introduced by this patch is the call to get_hardware_state() in
getSamplingFrequency().  Commenting this out seems to fix the crash for me:
I no longer see the "CTR discrepancy" messages and the kernel does not lock
up a few seconds after starting jackd/ffado.

So what's going on?  getSamplingFrequency() is called in only a few places
within the RME driver:

 * Device::prepare() - called once during program startup and device
   initialisation.

 * Device::getFramesPerPacket() - various call sites

 * Device::addDirPorts() - called only during program startup

getFramesPerPacket() is significant because it shows up in the streaming
driver.  The call in RmeReceiveStreamProcessor is from getMaxPacketSize()
and this is another function only called during startup.  On the transmit
side, there is a similar call from RmeTransmitStreamProcessor's
getMaxPacketSize() method.  However, there is one other call site: in
getNominalFramesPerPacket(). This function is called 1-2 times in
RmeTransmitStreamProcessor::generatePacketHeader(), and this method is
called once per iso cycle.  It also shows up twice in generatePacketData(),
once in generateSilentPacketHeader() and once in fillDataPacketHeader().
All up, on average this function is probably called 2-3 times per iso cycle.

Prior to r2379 this wasn't a big deal since getSamplingFrequency() more or
less just returned a fixed value.  Revision 2379 introduced the
get_hardware_state() call which involves a two-quadlet block async read
transaction on the bus - a detail which was overlooked at the time r2379 was
being prepared.
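
To illustrate the difference, here is a minimal sketch of the two
behaviours.  It is not FFADO code: MockBus, MockDevice, the register
encoding and the simulate*() helpers are all hypothetical names made up for
this example, and caching the rate is only one possible way of keeping the
async read out of the per-cycle path.

```cpp
#include <cstdint>

// Hypothetical stand-in for the firewire bus; it only counts async reads.
struct MockBus {
    unsigned long async_reads = 0;

    // Pretend two-quadlet block read of the device status registers.
    void read_block(std::uint32_t out[2]) {
        ++async_reads;
        out[0] = 48000;  // made-up encoding: sample rate in Hz
        out[1] = 0;
    }
};

// Hypothetical device wrapper, loosely modelled on the situation above.
class MockDevice {
public:
    explicit MockDevice(MockBus &bus) : m_bus(bus) {}

    // r2379-like behaviour: every call results in a bus transaction.
    int getSamplingFrequencyUncached() {
        std::uint32_t state[2];
        m_bus.read_block(state);
        return static_cast<int>(state[0]);
    }

    // Read the hardware once (e.g. at startup) and remember the result.
    void refreshHardwareState() {
        std::uint32_t state[2];
        m_bus.read_block(state);
        m_cached_rate = static_cast<int>(state[0]);
    }

    // Variant for the streaming path: no bus traffic at all.
    int getSamplingFrequencyCached() const { return m_cached_rate; }

private:
    MockBus &m_bus;
    int m_cached_rate = 0;
};

// Count the async reads issued over a number of simulated iso cycles.
unsigned long simulateUncached(unsigned cycles, unsigned calls_per_cycle) {
    MockBus bus;
    MockDevice dev(bus);
    for (unsigned c = 0; c < cycles; ++c)
        for (unsigned i = 0; i < calls_per_cycle; ++i)
            dev.getSamplingFrequencyUncached();
    return bus.async_reads;
}

unsigned long simulateCached(unsigned cycles, unsigned calls_per_cycle) {
    MockBus bus;
    MockDevice dev(bus);
    dev.refreshHardwareState();  // single read before streaming starts
    for (unsigned c = 0; c < cycles; ++c)
        for (unsigned i = 0; i < calls_per_cycle; ++i)
            dev.getSamplingFrequencyCached();
    return bus.async_reads;
}
```

In this toy model the uncached variant issues one read for every call made
from the streaming path, while the cached variant issues a single read no
matter how many cycles are simulated.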

I can't remember the bus specifications regarding async transactions and how
often they can be done.  In any case, it seems that the kernel driver was
choking on the async load generated by this oversight: 2-3 two-quadlet async
block reads per iso cycle appears to be beyond the kernel driver's
capability.
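
To put a rough number on that load: the nominal iso cycle on the bus is
125 us, i.e. 8000 cycles per second.  The sketch below just does the
multiplication; the constant and function names are made up for
illustration, and the 2-3 reads per cycle figure is the estimate from
above, not a measurement.

```cpp
// Nominal isochronous cycle rate on the bus (125 us per cycle).
constexpr unsigned kIsoCyclesPerSecond = 8000;

// Async read requests per second for a given number of reads per cycle.
constexpr unsigned long asyncReadsPerSecond(unsigned reads_per_cycle) {
    return static_cast<unsigned long>(reads_per_cycle) * kIsoCyclesPerSecond;
}

// Compile-time checks for the estimated 2-3 reads per cycle.
static_assert(asyncReadsPerSecond(2) == 16000, "2 reads per cycle");
static_assert(asyncReadsPerSecond(3) == 24000, "3 reads per cycle");
```

So the oversight amounts to something in the order of 16000 to 24000 async
block read requests per second on top of the iso streams.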

With this in mind I will rework the driver to avoid this problem.  However,
I am not in a position to judge whether this FFADO error has exposed a real
problem in the firewire driver, or if it can be written off on the basis 
that the driver simply can't cope with this traffic pattern.  I would be
interested to see what others think.

As an aside, if the kernel driver does have trouble with this traffic load,
it seems that this might be a possible DoS vector against the kernel.  If
one has any firewire device connected, it seems the kernel can be crashed by
sending a handful of async block read requests to that device every iso
cycle (whether some iso traffic is also needed is not something I've
tested).

Regards
  jonathan