From: Jonathan W. <jw...@ju...> - 2013-08-30 02:53:38
Hi Andreas

On Thu, Aug 29, 2013 at 11:28:24PM +0200, Andreas Hehn wrote:
> On 08/29/2013 09:40 PM, Stefan Richter wrote:
> > Ubuntu packages precompiled upstream kernels:
> > https://wiki.ubuntu.com/Kernel/MainlineBuilds
>
> Just a few seconds too late. I compiled kernel 3.8.13 myself. Was
> actually sort of fun. Last time I compiled a kernel was about 10 years
> ago. :)

:-) Awesome. Thanks for doing this.

> Anyways, so I ran a couple of tests using r2385 and I managed to get it
> running! ... I just don't like the statistics. It was one out of five runs.

r2385 is no different from anything earlier with regard to your problem, so I
wouldn't expect it to behave any better. In fact, due to the large amount of
output produced by this version I would expect it to be worse when "-v 6" is
specified. Your results bear this out.

> So, seriously, here are the results:
> :

For the moment I've just looked at the final test log (using kernel 3.8.13).
A more detailed analysis will follow, perhaps over the weekend.

The first observation is that the problems started in the same manner as in
your earlier tests. This is good: it means the additional debug output hasn't
changed the general system behaviour.

Tx packets are prepared in blocks, which is why the "now_cy" value in the
getPacket messages remains unchanged across a large number of calls. The "cy"
value is the cycle number delivered to the iso callback by libraw1394. The
other values are used within the function for various purposes, with pkt_ctr
being the number reported to the driver-specific streaming code.

Looking a little earlier in the log, the time between successive getPacket()
calls within a tx processing block is very short (of the order of 10-12 us).
Between processing blocks it is predictably larger: for example, between
now_cy=2366 and now_cy=2430 we see 7334 us elapse, and between now_cy=2302 and
now_cy=2366 the figure is 7528 us.
Other time gaps seen are 7523, 7420 and 6584 us (the last between now_cy=2110
and now_cy=2174).

The xrun/backwards-cycle problem is first noted at 1377809186908658. This is
the first packet to be processed in a new processing block (now_cy=2553),
fired at 1377809186908641. The previous block (now_cy=2494) ended normally at
1377809186901939. The elapsed time in this case is 6702 us, which is within
the range seen earlier in the log when no problems arose. This proves that the
system is not delaying the packet request long enough for the cycle number to
wrap.

The log also clearly shows that the cycle number reported via getPacket()'s
"cycle" parameter really has gone backward compared to the previous call. The
previous call at 1377809186901939 requested cycle 2623, while the one
immediately following it 6702 us later requested cycle 2619. Note that cycle
2619 had already been requested as part of the earlier tx processing block, at
1377809186901901, 5 cycles before the end of that block.

The call stack leading to IsoHandler::getPacket() is as follows.
raw1394_iso_xmit_init() sets up the tx iso callback in IsoHandler::enable().
The direct callback is IsoHandler::iso_transmit_handler(), which is a wrapper
around the getPacket() call. The cycle number received by the C callback
IsoHandler::iso_transmit_handler() is passed through to getPacket() without
modification. From this we can tell that the cycle numbers reported by
getPacket() are exactly those requested by libraw1394 through the iso tx
callback.

So now we have to work out why libraw1394 is doing this. According to Stefan's
earlier emails, the cycle number requested by libraw1394 is effectively passed
straight through from the kernel request that it receives. I will verify this
myself when I have more time, but assuming it to be the case, it means that
the kernel itself is requesting a cycle which it has previously requested.
Drawing on Stefan's emails again, this would ordinarily mean that the hardware
itself has requested this cycle, unless something else odd is going on behind
the scenes.

Based on the diff_cy value, it appears that libraw1394 and the kernel run
approximately 128 cycles ahead when it comes to tx packets, with the final
request before the errant one being 129 cycles ahead. There were many earlier
cases where the queue grew to this size, though, so there's nothing especially
wrong with that number of packets. However, if for some reason there was
memory pressure, perhaps the packets requested between 1377809186901901 and
1377809186901939 were never sent to the kernel, or never queued by it. That
would provide a mechanism for the kernel to request them again at some later
stage. Identifying a reason for this could be tricky - I will look at the
relevant code and see if anything jumps out at me. There doesn't seem to be an
obvious avenue for it to occur.

The freeze at the end is interesting too. The log indicates that jackd
determined that things could not start and proceeded to shut down. The logs
show the rx IsoHandler waiting for the disable lock, which is strange because
nothing else should hold that lock at this point, given that the call to
disable() happened through ~IsoHandler().

On closer inspection, though, there was a small logic error in the code I
introduced in r1985. When nothing else was holding the disable lock,
pthread_mutex_trylock() would of course acquire it. With no follow-up release,
disable() could then be called with the lock held - and thus disable() would
block, since the lock would never be released. Most people would never hit
this in normal use, because the timing at shutdown means that
pthread_mutex_trylock() usually fails with EBUSY, and the resulting code path
did release the lock before disable() was potentially called. I've fixed this
issue in r2386.
Please test this under a vanilla kernel (to keep things consistent). I can't
guarantee that it will resolve the freeze on shutdown, but it *might* have a
beneficial effect on that aspect of the problem. Note that nothing else
changes in this revision, so the fundamental cause of the trouble (the cycle
number oddities) remains. It also doesn't address the question of why ffado is
able to crash the kernel in the face of this misbehaviour.

> Detailed logs are here:
> http://www.noiseload.de/ffado/ffado_r2385_on_linux_3_8/
>
> As Jonathan mentioned they are rather large.

Indeed. This is mostly due to the per-packet getPacket() log entry that's
temporarily in place.

> When I read Jonathan's mail again just now, I realized he/you used
> mainstream kernel 3.8.0 while I used 3.8.13. I might also try a run with
> 3.8.0, but I guess it won't make much difference.

It might be worth trying for interest and for consistency, but I agree that
it's unlikely to make a difference to the behaviour.

Regards
jonathan