From: Jonathan W. <jw...@ju...> - 2013-08-30 02:53:38
Hi Andreas

On Thu, Aug 29, 2013 at 11:28:24PM +0200, Andreas Hehn wrote:
> On 08/29/2013 09:40 PM, Stefan Richter wrote:
> > Ubuntu packages precompiled upstream kernels:
> > https://wiki.ubuntu.com/Kernel/MainlineBuilds
>
> Just a few seconds too late. I compiled kernel 3.8.13 myself. Was
> actually sort of fun. Last time I compiled a kernel was about 10 years
> ago. :)

:-) Awesome. Thanks for doing this.

> Anyways, so I ran a couple of tests using r2385 and I managed to get it
> running! ... I just don't like the statistics. It was one out of five runs.

r2385 is no different from anything earlier with regard to your problem, so I
wouldn't expect it to behave any better. In fact, due to the large amount of
output produced by this version I would expect it to be worse when "-v 6" is
specified. Your results bear this out.

> So, seriously, here are the results:
> :

For the moment I've just looked at the final test log (using kernel 3.8.13).
A more detailed analysis will follow, perhaps over the weekend.

The first observation is that the problems started in the same manner as in
your earlier tests. This is good: it means the additional debug output hasn't
changed the general system behaviour.

Tx packets are prepared in blocks, which is why the "now_cy" value in the
getPacket messages remains unchanged across a large number of calls. The "cy"
value is the cycle number delivered to the iso callback by libraw1394. The
other values are used within the function for various purposes, with pkt_ctr
being the number reported to the driver-specific streaming code.

Looking a little earlier in the log, the time between successive getPacket()
calls within a tx processing block is very short (of the order of 10-12 us).
Between processing blocks it is predictably larger: for example, between
now_cy=2366 and now_cy=2430 we see 7334 us elapse, and between now_cy=2302 and
now_cy=2366 the figure is 7528 us.
Other time gaps seen are 7523, 7420 and 6584 us (the last between now_cy=2110
and now_cy=2174).

The xrun/backwards-cycle problem is first noted at 1377809186908658. This is
the first packet to be processed in a new processing block (now_cy=2553),
fired at 1377809186908641. The previous block (now_cy=2494) ended normally at
1377809186901939. The elapsed time in this case is 6702 us, which is within
the range seen earlier in the log when no problems arose. This proves that the
system is not delaying the packet request long enough for the cycle number to
wrap.

The log also clearly shows that the cycle number reported via getPacket()'s
"cycle" parameter really has gone backward compared to the previous call. The
previous call at 1377809186901939 requested cycle 2623, while the one
immediately following it 6702 us later requested cycle 2619. Note that cycle
2619 had already been requested as part of the earlier tx processing block, at
1377809186901901, 5 cycles before the end of that block.

The call stack leading to IsoHandler::getPacket() is as follows.
raw1394_iso_xmit_init() sets up the tx iso callback in IsoHandler::enable().
The direct callback is IsoHandler::iso_transmit_handler(), which is a wrapper
around the getPacket() call. The cycle number received by the C callback
IsoHandler::iso_transmit_handler() is passed through to getPacket() without
modification. From this we can tell that the cycle numbers reported by
getPacket() are exactly those requested by libraw1394 through the iso tx
callback.

So now we have to work out why libraw1394 is doing this. According to Stefan's
earlier emails, the cycle number requested by libraw1394 is effectively passed
straight through from the kernel request that it receives. I will verify this
myself when I have more time, but assuming it to be the case, it means that
the kernel itself is requesting a cycle which it has previously requested.
Drawing on Stefan's emails again, this would ordinarily mean that the hardware
itself has requested this cycle, unless something else odd is going on behind
the scenes.

Based on the diff_cy value, it appears that libraw1394 and the kernel run
approximately 128 cycles ahead when it comes to tx packets, with the final
request before the errant one being 129 cycles ahead. There were many earlier
cases where the queue grew to this size, though, so there's nothing especially
wrong with that number of packets. However, if for some reason there was
memory pressure, perhaps the packets requested between 1377809186901901 and
1377809186901939 were never sent to the kernel, or never queued by it. That
would provide a mechanism for the kernel to request them again at some later
stage. Identifying a reason for this could be tricky - I will look at the
relevant code and see if anything jumps out at me. There doesn't seem to be an
obvious avenue for it to occur.

The freeze at the end is interesting too. The log indicates that jackd
determined that things could not start and proceeded to shut down. The logs
show the rx IsoHandler waiting for the disable lock, which is strange because
nothing else should hold that lock at this point, given that the call to
disable() happened through ~IsoHandler().

On closer inspection, though, there was a small logic error in the code I
introduced in r1985. When nothing else was holding the disable lock,
pthread_mutex_trylock() would of course acquire it. With no follow-up release,
disable() could then be called with the lock held - and thus disable() would
block, since the lock would never be released. Most people would never hit
this in normal use, because the timing at shutdown means that
pthread_mutex_trylock() usually fails with EBUSY, and the resulting code path
did release the lock before disable() was potentially called. I've fixed this
issue in r2386.
Please test this under a vanilla kernel (to keep things consistent). I can't
guarantee that it will resolve the freeze on shutdown, but it *might* have a
beneficial effect on that aspect of the problem. Note that nothing else
changes in this revision, so the fundamental cause of the trouble (the cycle
number oddities) remains. It also doesn't address the question of why ffado is
able to crash the kernel in the face of this misbehaviour.

> Detailed logs are here:
> http://www.noiseload.de/ffado/ffado_r2385_on_linux_3_8/
>
> As Jonathan mentioned they are rather large.

Indeed. This is mostly due to the per-packet getPacket() log entry that's
temporarily in place.

> When I read Jonathan's mail again just now, I realized he/you used
> mainstream kernel 3.8.0 while I used 3.8.13. I might also try a run with
> 3.8.0, but I guess it won't make much difference.

It might be worth trying for interest and for consistency, but I agree that
it's unlikely to make a difference to the behaviour.

Regards
jonathan