From: Илья <il...@ma...> - 2013-08-21 21:03:10
|
FFADO diagnostic utility 2.1.9999-2380 ============================ (C) 2008 Pieter Palmers 2009-2010 Arnold Krille === CHECK === Base system... kernel version............ 3.8.0-27-lowlatency Preempt (low latency)... True RT patched.............. False old 1394 stack present.... False old 1394 stack loaded..... False old 1394 stack active..... False new 1394 stack present.... True new 1394 stack loaded..... True new 1394 stack active..... True /dev/raw1394 node present. False /dev/fw* permissions: crw------- 1 root root 251, 0 авг. 22 00:48 /dev/fw0 crw-rw----+ 1 root audio 251, 1 авг. 22 00:48 /dev/fw1 User IDs: uid=1000(teddy) gid=1000(teddy) groups=1000(teddy),4(adm),7(lp),24(cdrom),27(sudo),29(audio),30(dip),46(plugdev),107(lpadmin),123(sambashare),1001(fingerprint) Prerequisites (dynamic at run-time)... gcc ............... gcc (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3 g++ ............... g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3 PyQt4 (by pyuic4) . Python User Interface Compiler 4.10 for Qt version 4.8.4 jackd ............. no message buffer overruns path ............ /usr/bin/jackd flags ........... Package jack was not found in the pkg-config search path. Perhaps you should add the directory containing `jack.pc' to the PKG_CONFIG_PATH environment variable No package 'jack' found libraw1394 ........ 2.0.9 flags ........... -lraw1394 libavc1394 ........ 0.5.4 flags ........... -lavc1394 -lrom1394 -lraw1394 libiec61883 ....... 1.2.0 flags ........... -liec61883 -lraw1394 libxml++-2.6 ...... 2.34.2 flags ........... -I/usr/include/libxml2 -I/usr/include/glibmm-2.4 -I/usr/lib/x86_64-linux-gnu/glibmm-2.4/include -I/usr/include/sigc++-2.0 -I/usr/lib/x86_64-linux-gnu/sigc++-2.0/include -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/libxml++-2.6 -I/usr/lib/libxml++-2.6/include -lxml++-2.6 -lxml2 -lglibmm-2.4 -lgobject-2.0 -lsigc-2.0 -lglib-2.0 dbus-1 ............ 1.6.8 flags ........... -I/usr/include/dbus-1.0 -I/usr/lib/x86_64-linux-gnu/dbus-1.0/include -ldbus-1 Prerequisites (static at compile-time)... gcc ............... gcc (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3 g++ ............... g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3 PyQt4 (by pyuic4) . Python User Interface Compiler 4.10 for Qt version 4.8.4 jackd ............. no message buffer overruns path ............ /usr/bin/jackd flags ........... Package jack was not found in the pkg-config search path. libraw1394 ........ 2.0.9 flags ........... -lraw1394 libavc1394 ........ 0.5.4 flags ........... -lavc1394 -lrom1394 -lraw1394 libiec61883 ....... 1.2.0 flags ........... -liec61883 -lraw1394 libxml++-2.6 ...... 2.34.2 flags ........... -I/usr/include/libxml2 -I/usr/include/glibmm-2.4 -I/usr/lib/x86_64-linux-gnu/glibmm-2.4/include -I/usr/include/sigc++-2.0 -I/usr/lib/x86_64-linux-gnu/sigc++-2.0/include -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/libxml++-2.6 -I/usr/lib/libxml++-2.6/include -lxml++-2.6 -lxml2 -lglibmm-2.4 -lgobject-2.0 -lsigc-2.0 -lglib-2.0 dbus-1 ............ 1.6.8 flags ........... -I/usr/include/dbus-1.0 -I/usr/lib/x86_64-linux-gnu/dbus-1.0/include -ldbus-1 uname -a... Linux solar-host 3.8.0-27-lowlatency #19-Ubuntu SMP PREEMPT Wed Jul 17 10:40:35 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux Hardware... Host controllers: 05:00.0 FireWire (IEEE 1394) [0c00]: Texas Instruments XIO2200A IEEE-1394a-2000 Controller (PHY/Link) [104c:8235] (rev 01) (prog-if 10 [OHCI]) Subsystem: Device [5678:1234] Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 32 (500ns min, 1000ns max), Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 18 Region 0: Memory at f1404000 (32-bit, non-prefetchable) [size=2K] Region 1: Memory at f1400000 (32-bit, non-prefetchable) [size=16K] Capabilities: <access denied> Kernel driver in use: firewire_ohci CPU info: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 2 Core(s) per socket: 2 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 58 Stepping: 9 CPU MHz: 1200.000 BogoMIPS: 5787.39 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 4096K NUMA node0 CPU(s): 0-3 Configuration... IRQ information Hardware Interrupts: -------------------- IRQ 0: PID: None, count: [55, 0, 0, 0], Sched None (priority None), drivers: ['timer'] IRQ 1: PID: None, count: [83, 2735, 80, 229], Sched None (priority None), drivers: ['i8042'] IRQ 8: PID: None, count: [0, 0, 0, 1], Sched None (priority None), drivers: ['rtc0'] IRQ 9: PID: None, count: [42, 117, 25, 10], Sched None (priority None), drivers: ['acpi'] IRQ 12: PID: None, count: [1654, 765, 149, 292], Sched None (priority None), drivers: ['i8042'] IRQ 16: PID: None, count: [997, 5845, 25, 24], Sched None (priority None), drivers: ['ehci_hcd:usb1', 'mmc0'] IRQ 18: PID: None, count: [2, 5, 10, 3], Sched None (priority None), drivers: ['firewire_ohci'] IRQ 23: PID: None, count: [10, 17, 8, 2], Sched None (priority None), drivers: ['ehci_hcd:usb2'] IRQ 40: PID: None, count: [0, 0, 0, 0], Sched None (priority None), drivers: ['dmar0'] IRQ 41: PID: None, count: [0, 0, 0, 0], Sched None (priority None), drivers: ['dmar1'] IRQ 42: PID: None, count: [0, 0, 0, 0], Sched None (priority None), drivers: ['xhci_hcd'] IRQ 43: PID: None, count: [10, 303, 0, 2], Sched None (priority None), drivers: ['eth0'] IRQ 44: PID: None, count: [4817, 19051, 1313, 2854], Sched None (priority None), drivers: ['ahci'] IRQ 45: PID: None, count: [21, 0, 0, 1], Sched None (priority None), drivers: ['mei'] IRQ 46: PID: None, count: [163, 887, 27012, 62], Sched None (priority None), drivers: ['iwlwifi'] IRQ 47: PID: None, count: [42, 103, 25, 218], Sched None (priority None), drivers: ['i915'] IRQ 48: PID: None, count: [294, 0, 6, 13], Sched None (priority None), drivers: ['snd_hda_intel'] Software Interrupts: -------------------- === REPORT === FireWire kernel drivers: The new FireWire kernel stack is loaded. If running a kernel earlier than 2.6.37 and problems are experienced, either try with the old Firewire kernel stack or upgrade to a newer kernel (preferrably 2.6.37 or later). |
From: Jonathan W. <jw...@ju...> - 2013-08-24 06:22:12
|
Hi all On Thu, Aug 22, 2013 at 01:02:57AM +0400, ???????? wrote: > I found a steps to reproduce a kernel software locks up by ffado. This is now getting plain strange. After successfully running ffado/jackd many times over this last week, I've suddenly started to get machine lockups. There is no obvious reason for this since the kernel has remained unchanged for many months, and I was successfully running jackd/ffado this past week when debugging the sample rate issues on the FF800. I experienced trouble with ffado r2379, even though this version had run successfully earlier in the week. Upgrading to r2383 didn't change things. The kernel is a kernel.org 3.8.0 SMP PREEMPT kernel which I compiled on June 28 2013. These are the first lockups I have experienced since using this kernel. CPU is an i7 at 2.67 GHz. The kernel is compiled for 32-bit and userspace is 32-bit (there are reasons why I have yet to move this machine to a 64-bit environment). The kernel's .config can be provided on request. There is 6 GB of RAM fitted. Even after starting X (with a very lightweight window manager), free reports: total used free shared buffers cached Mem: 6227572 318824 5908748 0 1968 216168 -/+ buffers/cache: 100688 6126884 Swap: 12594924 0 12594924 The loaded modules are included at the end of this email. The firewire host controller is: 09:06.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link) /proc/interrupts report: 18: ... IO-APIC-fasteoi pata_jmicron, i801_smbus, firewire_ohci, ehci_hcd:usb1, uhci_hcd:usb5, uhci_hcd:usb8 Obviously it's not a dedicated IRQ but that's never been an issue for the past 3+ years that I've had the machine. The jackd command line was: jackd -P70 -R -dfirewire -p512 -n3 -v 6 If running under X the machine simply froze up. There was nothing shown on the screen to indicate what the problem might be. I *thought* I had a panic handler configured, but perhaps I'm imagining it. The machine in question boots to text mode so it was trivial to test this from a console without ever running X. The symptoms were the same: the machine simply froze up without displaying any reason for doing so. Hitting Ctrl-C a few times did display ^C on the console, but after half a dozen presses even this stopped. I found that if I hit the magic-sysreq button combination to elicit an emergency sync before there was a complete freeze, I was able to get some output from the kernel. Whether this is a result of the sysreq handler or is something related to the underlying problem remains to be seen. In this condition, the kernel displayed the following line at a rate of maybe 50 to 100 per second. firewire_core 0000:09:06.0: out of memory when allocating event Eventually the scrolling stopped with the following messages displayed: kmem_getpages: 4667629 callbacks suppressed SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 firewire_core 0000:09:06.0: out of memory when allocating event SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 firewire_core 0000:09:06.0: out of memory when allocating event SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 firewire_core 0000:09:06.0: out of memory when allocating event SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 The machine froze at this point. During the rapid scrolling of the firewire_core message I also noticed that every so often (maybe once every 5 to 10 seconds) there would be another block of text appearing, but it scrolled off the screen far too fast to allow comprehension. Scroll lock was ineffective. However, I was able to capture at least some it using a camera with a 60 Hz shutter speed and a reasonably fast reaction time. :-) Even so, due to the LCD's finite response time there's a reasonable amount of image untangling to do. Here is what I can gleen. Things marked with ? except the single "?" before some function names in the call trace are those I couldn't make out from the shots. [ Output which is too garbled to distinguish ] lowmem_reserve[]: 0 0 0 0 DMA: 8*4kB 1*8kB (R) 0*16kB 1*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3496kB Normal: 11*4kB (??) 4*8kB (??) 0*16kB 0*32kB 0*64kB 8*128kB 1*256KB (R) 0*512kB 1*1024kB (R) 8*2048kB 0*4096kB = 1356kB Highmem: 1127*4kB (UM) 665*8kB (UM) 416*16kB (UM) 234*32kB (UM) 133*64kB (UM) 74*128kB (UM) 42*256kB (UM) 27*512kB (UM) 18*1024kB (M) 7*2048kB (M) 1244*4096kB (M?) = 5194724kB 17294 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 12594924kB 1769456 pages RAM 1541122 pages HighMem 212563 pages reserved 270346 pages shared 251286 pages non-shared firewire_core 0000:09:06.0: out of memory when allocating event swapper/7: page allocation failure: order:0, mode: 0x200020 Pid: 0, comm: swapper/7 Not tainted 3.8.0 #2 Call Trace: [<c1088df4>] __alloc_pages_nodemask+0x3c4/0x5c0 [<c102ee80>] ? log_store+0x190/0x240 [<c16ae32c>] cache_alloc_refill+0x2dc/0x580 [<c10ae67a>] __kmalloc+0xaa/0xf0 [<f86ea533>] iso_callback+0x23/0xb0 [firewire_core] [<f8704662>] handle_it_packet+0x132/0x190 [firewire_ohci] [<f870511c>] context_tasklet+0xac/0x150 [firewire_ohci] [<c1034a80>] tasklet_action+0x50/0xa0 [<c1035080>] __do_softirq+0x90/0x140 [<c1034ff0>] ? remote_softirq_receive+0x20/0x20 <IRQ> [<c1035246>] ? irq_exit+0x76/0xa0 [<c1004506>] ? do_IRQ+0x46/0xb0 [<c13e9fec>] ? common_interrupt+0x2c/0x31 [<c105067b>] ? set_groups+0xbb/0x170 [<c132c396>] ? cpuidle_wrap_enter+0x36/0x80 [<c132bc52>] ? cpuidle_enter_tk+0x12/0x20 [<c132bc20>] ? read_measured_perf_ctrs+0x50/0x50 [<c132c13c>] ? cpuidle_enter_state+0xc/0x40 [<c132a1c6>] ? cpu_idle+0x56/0xb0 [<c13def56>] ? start_secondary+?????? Mem-Info: DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 used: 0 [End of output] Putting all this together, it seems like the kernel is exhausing physical memory rather rapidly and (I'm guessing) it can no longer alloocate memory to handle the IT packet. In other words, it seems that we have a rather difficult to reproduce kernel bug (as evidenced by the fact that the exact same software has worked perfectly for me on exactly the same hardware earlier in the week). The trouble with the above is that while it documents where we're running out of memory it really says nothing about why we're running out of memory. Clearly this is happening with a different firewire host controller compared to Ilia, so I think we can rule out a particular card as the cause of these troubles. I'll keep think of tests I can do to try to narrow down the problem but I'm definitely open to suggestions. Needless to say that there was no information about the freeze in the system logs on reboot. I did notice that ffado appeared to have some trouble starting the streaming interface of the FF800 in the early stages of the run. I don't know why this would suddenly be an issue (it was fine 2 days ago) and it's something else I need to look into. However, even if this were the case it should not be capable of crashing the kernel like this. One thing I will try to test is another firewire interface. This will tell whether the bug is triggered by FFADO generally or requires the Fireface for some yet to be identified reason. I will report back about this once I have a chance to test it. My suspicion at present is that there is a bug in ffado's Fireface driver relating to sample rates which is causing sync problems at start up. However, I wouldn't have thought that this should be sufficient to trip the kernel up and freeze the machine solid. Regards jonathan List of modules loaded at the time of the freeze Module Size Used by usb_storage 33435 0 snd_seq_dummy 1023 0 snd_seq_oss 23324 0 snd_seq_midi_event 4312 1 snd_seq_oss snd_seq 39481 5 snd_seq_midi_event,snd_seq_oss,snd_seq_dummy snd_pcm_oss 32710 0 snd_mixer_oss 12120 1 snd_pcm_oss usbhid 22815 0 radeon 826022 2 uhci_hcd 19608 0 snd_ens1371 15095 0 snd_ac97_codec 86564 1 snd_ens1371 ac97_bus 770 1 snd_ac97_codec snd_rawmidi 14438 1 snd_ens1371 cfbfillrect 3118 1 radeon cfbimgblt 1891 1 radeon cfbcopyarea 2978 1 radeon i2c_algo_bit 4415 1 radeon drm_kms_helper 24090 1 radeon ttm 45230 1 radeon drm 174235 4 ttm,drm_kms_helper,radeon ehci_pci 2884 0 ehci_hcd 33242 1 ehci_pci snd_seq_device 4020 4 snd_seq,snd_rawmidi,snd_seq_oss,snd_seq_dummy snd_pcm 62121 3 snd_pcm_oss,snd_ac97_codec,snd_ens1371 usbcore 118905 5 uhci_hcd,usb_storage,ehci_hcd,ehci_pci,usbhid snd_page_alloc 5934 1 snd_pcm snd_timer 14414 2 snd_pcm,snd_seq snd 41759 10 snd_pcm_oss,snd_ac97_codec,snd_timer,snd_pcm,snd_seq,snd_rawmidi,snd_seq_oss,snd_ens1371,snd_seq_device,snd_mixer_oss agpgart 21191 2 drm,ttm firewire_ohci 28422 0 firewire_core 44036 1 firewire_ohci nosy 5919 0 soundcore 4186 1 snd r8169 48283 0 usb_common 582 1 usbcore loop 14395 0 |
From: Stefan R. <st...@s5...> - 2013-08-24 09:07:53
|
Adding Cc: linux1394-devel. I am quoting the parent message in full and add some comments ad the end of this post. The long string of previous messages (which dealt with multiple other, FFADO-specific issues besides this kernel related one) can be found in ffado-user's archive under the subject "Problem with RME FF800. Can not start jackd" (started 2013-08-07) and "Problem with RME FF800. Can not start jackd," (multiple discontinuous threads) and a much shorter thread in ffado-devel's archive under "jackd+ffado causes kernel lock ups" (started 2013-08-21). http://sourceforge.net/mailarchive/forum.php?viewmonth=201308&forum_name=ffado-user&style=threaded&max_rows=100 http://sourceforge.net/mailarchive/forum.php?thread_name=20130824062151.GA20471%40marvin.atrad.com.au&forum_name=ffado-devel (Note to ffado list subscribers: There is no need to subscribe to linux1394-devel, reply-to-all just works.) On Aug 24 Jonathan Woithe wrote: > Hi all > > On Thu, Aug 22, 2013 at 01:02:57AM +0400, ???????? wrote: > > I found a steps to reproduce a kernel software locks up by ffado. > > This is now getting plain strange. After successfully running ffado/jackd > many times over this last week, I've suddenly started to get machine > lockups. There is no obvious reason for this since the kernel has remained > unchanged for many months, and I was successfully running jackd/ffado this > past week when debugging the sample rate issues on the FF800. I experienced > trouble with ffado r2379, even though this version had run successfully > earlier in the week. Upgrading to r2383 didn't change things. > > The kernel is a kernel.org 3.8.0 SMP PREEMPT kernel which I compiled on June > 28 2013. These are the first lockups I have experienced since using this > kernel. CPU is an i7 at 2.67 GHz. The kernel is compiled for 32-bit and > userspace is 32-bit (there are reasons why I have yet to move this machine > to a 64-bit environment). The kernel's .config can be provided on request. > There is 6 GB of RAM fitted. Even after starting X (with a very lightweight > window manager), free reports: > > total used free shared buffers cached > Mem: 6227572 318824 5908748 0 1968 216168 > -/+ buffers/cache: 100688 6126884 > Swap: 12594924 0 12594924 > > The loaded modules are included at the end of this email. > > The firewire host controller is: > 09:06.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 > Controller (PHY/Link) > > /proc/interrupts report: > 18: ... IO-APIC-fasteoi pata_jmicron, i801_smbus, firewire_ohci, > ehci_hcd:usb1, uhci_hcd:usb5, uhci_hcd:usb8 > > Obviously it's not a dedicated IRQ but that's never been an issue for the > past 3+ years that I've had the machine. > > The jackd command line was: > > jackd -P70 -R -dfirewire -p512 -n3 -v 6 > > If running under X the machine simply froze up. There was nothing shown on > the screen to indicate what the problem might be. I *thought* I had a panic > handler configured, but perhaps I'm imagining it. > > The machine in question boots to text mode so it was trivial to test this > from a console without ever running X. The symptoms were the same: the > machine simply froze up without displaying any reason for doing so. Hitting > Ctrl-C a few times did display ^C on the console, but after half a dozen > presses even this stopped. > > I found that if I hit the magic-sysreq button combination to elicit an > emergency sync before there was a complete freeze, I was able to get some > output from the kernel. Whether this is a result of the sysreq handler or > is something related to the underlying problem remains to be seen. In this > condition, the kernel displayed the following line at a rate of maybe 50 to > 100 per second. > > firewire_core 0000:09:06.0: out of memory when allocating event > > Eventually the scrolling stopped with the following messages displayed: > > kmem_getpages: 4667629 callbacks suppressed > SLAB: Unable to allocate memory on node 0 (gfp=0x20) > cache: size-64, object size: 64, order: 0 > node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 > firewire_core 0000:09:06.0: out of memory when allocating event > SLAB: Unable to allocate memory on node 0 (gfp=0x20) > cache: size-64, object size: 64, order: 0 > node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 > firewire_core 0000:09:06.0: out of memory when allocating event > SLAB: Unable to allocate memory on node 0 (gfp=0x20) > cache: size-64, object size: 64, order: 0 > node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 > firewire_core 0000:09:06.0: out of memory when allocating event > SLAB: Unable to allocate memory on node 0 (gfp=0x20) > cache: size-64, object size: 64, order: 0 > > The machine froze at this point. > > During the rapid scrolling of the firewire_core message I also noticed that > every so often (maybe once every 5 to 10 seconds) there would be another > block of text appearing, but it scrolled off the screen far too fast to > allow comprehension. Scroll lock was ineffective. However, I was able to > capture at least some it using a camera with a 60 Hz shutter speed and a > reasonably fast reaction time. :-) Even so, due to the LCD's finite > response time there's a reasonable amount of image untangling to do. Here > is what I can gleen. Things marked with ? except the single "?" before > some function names in the call trace are those I couldn't make out from the > shots. > > [ Output which is too garbled to distinguish ] > > lowmem_reserve[]: 0 0 0 0 > DMA: 8*4kB 1*8kB (R) 0*16kB 1*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) > 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3496kB > Normal: 11*4kB (??) 4*8kB (??) 0*16kB 0*32kB 0*64kB 8*128kB 1*256KB (R) > 0*512kB 1*1024kB (R) 8*2048kB 0*4096kB = 1356kB > Highmem: 1127*4kB (UM) 665*8kB (UM) 416*16kB (UM) 234*32kB (UM) > 133*64kB (UM) 74*128kB (UM) 42*256kB (UM) 27*512kB (UM) 18*1024kB (M) > 7*2048kB (M) 1244*4096kB (M?) = 5194724kB > 17294 total pagecache pages > 0 pages in swap cache > Swap cache stats: add 0, delete 0, find 0/0 > Free swap = 12594924kB > 1769456 pages RAM > 1541122 pages HighMem > 212563 pages reserved > 270346 pages shared > 251286 pages non-shared > firewire_core 0000:09:06.0: out of memory when allocating event > swapper/7: page allocation failure: order:0, mode: 0x200020 > Pid: 0, comm: swapper/7 Not tainted 3.8.0 #2 > Call Trace: > [<c1088df4>] __alloc_pages_nodemask+0x3c4/0x5c0 > [<c102ee80>] ? log_store+0x190/0x240 > [<c16ae32c>] cache_alloc_refill+0x2dc/0x580 > [<c10ae67a>] __kmalloc+0xaa/0xf0 > [<f86ea533>] iso_callback+0x23/0xb0 [firewire_core] > [<f8704662>] handle_it_packet+0x132/0x190 [firewire_ohci] > [<f870511c>] context_tasklet+0xac/0x150 [firewire_ohci] > [<c1034a80>] tasklet_action+0x50/0xa0 > [<c1035080>] __do_softirq+0x90/0x140 > [<c1034ff0>] ? remote_softirq_receive+0x20/0x20 > <IRQ> [<c1035246>] ? irq_exit+0x76/0xa0 > [<c1004506>] ? do_IRQ+0x46/0xb0 > [<c13e9fec>] ? common_interrupt+0x2c/0x31 > [<c105067b>] ? set_groups+0xbb/0x170 > [<c132c396>] ? cpuidle_wrap_enter+0x36/0x80 > [<c132bc52>] ? cpuidle_enter_tk+0x12/0x20 > [<c132bc20>] ? read_measured_perf_ctrs+0x50/0x50 > [<c132c13c>] ? cpuidle_enter_state+0xc/0x40 > [<c132a1c6>] ? cpu_idle+0x56/0xb0 > [<c13def56>] ? start_secondary+?????? > Mem-Info: > DMA per-cpu: > CPU 0: hi: 0, btch: 1 usd: 0 > CPU 1: hi: 0, btch: 1 used: 0 > > [End of output] > > Putting all this together, it seems like the kernel is exhausing physical > memory rather rapidly and (I'm guessing) it can no longer alloocate memory > to handle the IT packet. In other words, it seems that we have a rather > difficult to reproduce kernel bug (as evidenced by the fact that the exact > same software has worked perfectly for me on exactly the same hardware > earlier in the week). The trouble with the above is that while it documents > where we're running out of memory it really says nothing about why we're > running out of memory. > > Clearly this is happening with a different firewire host controller compared > to Ilia, so I think we can rule out a particular card as the cause of these > troubles. I'll keep think of tests I can do to try to narrow down the > problem but I'm definitely open to suggestions. > > Needless to say that there was no information about the freeze in the system > logs on reboot. > > I did notice that ffado appeared to have some trouble starting the streaming > interface of the FF800 in the early stages of the run. I don't know why > this would suddenly be an issue (it was fine 2 days ago) and it's something > else I need to look into. However, even if this were the case it should not > be capable of crashing the kernel like this. > > One thing I will try to test is another firewire interface. This will tell > whether the bug is triggered by FFADO generally or requires the Fireface for > some yet to be identified reason. I will report back about this once I have > a chance to test it. > > My suspicion at present is that there is a bug in ffado's Fireface driver > relating to sample rates which is causing sync problems at start up. > However, I wouldn't have thought that this should be sufficient to trip the > kernel up and freeze the machine solid. > > Regards > jonathan > > > List of modules loaded at the time of the freeze > > Module Size Used by > usb_storage 33435 0 > snd_seq_dummy 1023 0 > snd_seq_oss 23324 0 > snd_seq_midi_event 4312 1 snd_seq_oss > snd_seq 39481 5 snd_seq_midi_event,snd_seq_oss,snd_seq_dummy > snd_pcm_oss 32710 0 > snd_mixer_oss 12120 1 snd_pcm_oss > usbhid 22815 0 > radeon 826022 2 > uhci_hcd 19608 0 > snd_ens1371 15095 0 > snd_ac97_codec 86564 1 snd_ens1371 > ac97_bus 770 1 snd_ac97_codec > snd_rawmidi 14438 1 snd_ens1371 > cfbfillrect 3118 1 radeon > cfbimgblt 1891 1 radeon > cfbcopyarea 2978 1 radeon > i2c_algo_bit 4415 1 radeon > drm_kms_helper 24090 1 radeon > ttm 45230 1 radeon > drm 174235 4 ttm,drm_kms_helper,radeon > ehci_pci 2884 0 > ehci_hcd 33242 1 ehci_pci > snd_seq_device 4020 4 > snd_seq,snd_rawmidi,snd_seq_oss,snd_seq_dummy > snd_pcm 62121 3 snd_pcm_oss,snd_ac97_codec,snd_ens1371 > usbcore 118905 5 uhci_hcd,usb_storage,ehci_hcd,ehci_pci,usbhid > snd_page_alloc 5934 1 snd_pcm > snd_timer 14414 2 snd_pcm,snd_seq > snd 41759 10 snd_pcm_oss,snd_ac97_codec,snd_timer,snd_pcm,snd_seq,snd_rawmidi,snd_seq_oss,snd_ens1371,snd_seq_device,snd_mixer_oss > agpgart 21191 2 drm,ttm > firewire_ohci 28422 0 > firewire_core 44036 1 firewire_ohci > nosy 5919 0 > soundcore 4186 1 snd > r8169 48283 0 > usb_common 582 1 usbcore > loop 14395 0 Ilia, Andreas, and you all use kernel 3.8. Ilia and Andreas use Ubuntu's kernel, you use kernel.org's kernel. I just wanted to ask Ilia and Andreas to test a mainline kernel (Ubuntu has got a package repo for these), but now it is confirmed that mainline is affected (unless you implanted a third-party driver into it). The three of you are all using RME FF800 and recent FFADO. Ilia had some rather serious issues already before you supplied him with the newest FFADO updates, but I am not sure how far back this particular kernel problem has been happening on Ilia's setup. Ilia is using an x86-64 kernel, you an x86-32 kernel. All of you probably use generic but different PC hardware; e.g. Ilia with i915 graphics driver, Andreas with nouveau, you with radeon (and with X11 shut down). If I understood correctly, all of you get the lock-up very soon after jackd+ffado was started. Earlier I responded to Ilia that I primarily suspected a corruption of kernel memory outside of the firewire subsystem's control. However this has appeared less likely with subsequent reports from Ilia and is now pretty much ruled out in light of Andreas' and your reports. Instead, I now think we are seeing a denial-of-service hole of the firewire-core driver at work here: Whenever the kernel needs to report a - bus reset, - outbound asynchronous transaction completion (if the client has been sending those), - inbound asynchronous request (if reception of them was requested by the client i.e. userland program), - received physical packet (if reception of them was requested by the client), - isochronous reception (if the client has been queuing IR buffers), - isochronous transmission completion (if the client has been queuing IT buffers), it allocates kernel memory, generally in atomic context and from the general emergency memory pool. This memory is freed when (or let's say: if) the client is read()ing from the char dev file descriptor. As long as the client is not doing so, allocation follows on allocation until eventual exhaustion of the memory pool. Obviously, if FFADO was behaving in the intended way, it would consume IR and IT completion at the same rate as it would queue new IR and IT buffers, and with rather short distance between queuing and processing of the completion events. So, while Jonathan is surely going to fix up whatever disturbance is going on in FFADO, I need to look into getting the described class of DoS holes in the kernel driver fixed (which is an independent problem of whatever issue FFADO currently appears to face with RME FF800). -- Stefan Richter -=====-===-= =--- ==--- http://arcgraph.de/sr/ |
From: Jonathan W. <jw...@ju...> - 2013-08-24 08:27:27
|
Hi again On Sat, Aug 24, 2013 at 03:51:51PM +0930, Jonathan Woithe wrote: > I did notice that ffado appeared to have some trouble starting the streaming > interface of the FF800 in the early stages of the run. I don't know why > this would suddenly be an issue (it was fine 2 days ago) and it's something > else I need to look into. However, even if this were the case it should not > be capable of crashing the kernel like this. > : > My suspicion at present is that there is a bug in ffado's Fireface driver > relating to sample rates which is causing sync problems at start up. > However, I wouldn't have thought that this should be sufficient to trip the > kernel up and freeze the machine solid. I was able to capture enough of the verbose log to conclude that there is no sample rate mismatch occuring in my case. The log looks completely normal during the early stages of streaming startup. Eventually we see this: StreamProcessor 0x8bc3958, Transmit: Port, Channel : 0, 0 Packets, Dropped, Skipped : 2085, 0, 0 Now : 01628100932 (066s 1980c 2372t) Xrun? : False State : ePS_Running Buffer : 0x8bc4a20 Framerate : Nominal: 48000, Sync: 47987.729349, Buffer 48000.000000 TimestampedBuffer (0x8bc4a20): 1641 frames, 1641 events Timestamps : head: 1628482387.000, Tail: 1629322579.000, Next tail: 1629584723.000 Head - Tail : -840192.000 (-1641.000000 frames) DLL Rate : 262144.000000 (512.000000) DLL Bandwidth : 4.069010e-08 1/ticks (1.000000 Hz) 11415293453: Debug (RmeTransmitStreamProcessor.cpp)[ 328] generatePacketData: ticks per frame: 512 7 112 (len=784) 11415308740: Debug (StreamProcessor.cpp)[ 430] putPacket: Should update state to Running 11415308746: Debug (StreamProcessor.cpp)[1705] updateState: Do state transition: ePS_WaitingForStreamEnable => ePS_Running 11415308755: Debug (StreamProcessor.cpp)[1631] doRunning: Enter from state: ePS_WaitingForStreamEnable 11415308767: Debug (StreamProcessor.cpp)[1636] doRunning: StreamProcessor 0x8bc2990 started running 11415308770: Debug (TimestampedBuffer.cpp)[ 102] setBandwidth: bandwidth 2.034505e-07 => 1.017253e-08 11415308777: Debug (StreamProcessor.cpp)[1652] doRunning: State switch complete, dumping SP info... StreamProcessor 0x8bc2990, Receive: Port, Channel : 0, 1 Packets, Dropped, Skipped : 1796, 0, 0 Now : 01628520308 (066s 2117c 0884t) Xrun? : False State : ePS_Running Buffer : 0x8bc2b40 Framerate : Nominal: 48000, Sync: 48027.246674, Buffer 48027.246674 TimestampedBuffer (0x8bc2b40): 0000 frames, 0000 events Timestamps : head: 1628467200.000, Tail: 1628467200.000, Next tail: 1628470781.967 Head - Tail : 0.000 (0.000000 frames) DLL Rate : 3581.966736 (511.709534) DLL Bandwidth : 1.017253e-08 1/ticks (0.250000 Hz) All this is normal and indicates that we have acquired sync with the device. The sync rates are believable and there's no reason to believe anything is going wrong. Then I get this: 11415308938: Debug (StreamProcessorManager.cpp)[ 842] syncStartAll: initial time of transfer 1628729206, rate 511.734222... 11415308948: Debug (StreamProcessorManager.cpp)[ 861] syncStartAll: preset transmit tail TS 1629306953, rate 511.734222... 11415308953: Debug (StreamProcessorManager.cpp)[ 872] syncStartAll: => transmit head TS 1628646304, fc=1291... 11415308958: Debug (StreamProcessorManager.cpp)[ 888] alignReceivedStreams: Aligning received streams... 11415309025: Debug (StreamProcessorManager.cpp)[ 905] alignReceivedStreams: averaging over 37 periods... 11415319682: Warning (IsoHandlerManager.cpp)[1656] getPacket: reconstructed CTR counter discrepancy 11415319688: Warning (IsoHandlerManager.cpp)[1662] getPacket: ingredients: 89B, 8689B000, 8489BA89, 8489C5B9, 84880A89, 67, 66, 66, 1628786313 11415319693: Warning (IsoHandlerManager.cpp)[1663] getPacket: diffcy = 27 11415319922: Warning (IsoHandlerManager.cpp)[1656] getPacket: reconstructed CTR counter discrepancy 11415319926: Warning (IsoHandlerManager.cpp)[1662] getPacket: ingredients: 89C, 8689C000, 8489CA89, 8489E4AB, 84880A89, 67, 66, 66, 1628789385 11415319931: Warning (IsoHandlerManager.cpp)[1663] getPacket: diffcy = 28 : 11415324396: Warning (IsoHandlerManager.cpp)[1656] getPacket: reconstructed CTR counter discrepancy 11415324400: Warning (IsoHandlerManager.cpp)[1662] getPacket: ingredients: 8B2, 868B2000, 848B2A89, 848C2228, 84880A89, 67, 66, 66, 1628856969 11415324404: Warning (IsoHandlerManager.cpp)[1663] getPacket: diffcy = 50 11415333361: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... 11415333616: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... The last line repeats until the end of the log: 11415893961: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... 11415894210: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... Based on this I can't say that FFADO is misbehaving here in a way which would be expected to crash the kernel. I will look into the cause of these "reconstructed CTR counter discrepancy" messages and what exactly they mean since this could add to our understanding of the problem. However, as can be seen, we receive a over 1800 packets successfully before this problem is noted. Plus there's still this mystery as to why this problem appeared out of the blue today when everything was fine a couple of days ago. Regards jonathan |
From: Stefan R. <st...@s5...> - 2013-08-24 09:09:33
|
Quoting for linux1394-devel without further comment. On Aug 24 Jonathan Woithe wrote: > Hi again > > On Sat, Aug 24, 2013 at 03:51:51PM +0930, Jonathan Woithe wrote: > > I did notice that ffado appeared to have some trouble starting the streaming > > interface of the FF800 in the early stages of the run. I don't know why > > this would suddenly be an issue (it was fine 2 days ago) and it's something > > else I need to look into. However, even if this were the case it should not > > be capable of crashing the kernel like this. > > : > > My suspicion at present is that there is a bug in ffado's Fireface driver > > relating to sample rates which is causing sync problems at start up. > > However, I wouldn't have thought that this should be sufficient to trip the > > kernel up and freeze the machine solid. > > I was able to capture enough of the verbose log to conclude that there is no > sample rate mismatch occuring in my case. The log looks completely normal > during the early stages of streaming startup. Eventually we see this: > > StreamProcessor 0x8bc3958, Transmit: > Port, Channel : 0, 0 > Packets, Dropped, Skipped : 2085, 0, 0 > Now : 01628100932 (066s 1980c 2372t) > Xrun? : False > State : ePS_Running > Buffer : 0x8bc4a20 > Framerate : Nominal: 48000, Sync: 47987.729349, Buffer 48000.000000 > TimestampedBuffer (0x8bc4a20): 1641 frames, 1641 events > Timestamps : head: 1628482387.000, Tail: 1629322579.000, Next tail: 1629584723.000 > Head - Tail : -840192.000 (-1641.000000 frames) > DLL Rate : 262144.000000 (512.000000) > DLL Bandwidth : 4.069010e-08 1/ticks (1.000000 Hz) > 11415293453: Debug (RmeTransmitStreamProcessor.cpp)[ 328] generatePacketData: ticks per frame: 512 7 112 (len=784) > 11415308740: Debug (StreamProcessor.cpp)[ 430] putPacket: Should update state to Running > 11415308746: Debug (StreamProcessor.cpp)[1705] updateState: Do state transition: ePS_WaitingForStreamEnable => ePS_Running > 11415308755: Debug (StreamProcessor.cpp)[1631] doRunning: Enter from state: ePS_WaitingForStreamEnable > 11415308767: Debug (StreamProcessor.cpp)[1636] doRunning: StreamProcessor 0x8bc2990 started running > 11415308770: Debug (TimestampedBuffer.cpp)[ 102] setBandwidth: bandwidth 2.034505e-07 => 1.017253e-08 > 11415308777: Debug (StreamProcessor.cpp)[1652] doRunning: State switch complete, dumping SP info... > StreamProcessor 0x8bc2990, Receive: > Port, Channel : 0, 1 > Packets, Dropped, Skipped : 1796, 0, 0 > Now : 01628520308 (066s 2117c 0884t) > Xrun? : False > State : ePS_Running > Buffer : 0x8bc2b40 > Framerate : Nominal: 48000, Sync: 48027.246674, Buffer 48027.246674 > TimestampedBuffer (0x8bc2b40): 0000 frames, 0000 events > Timestamps : head: 1628467200.000, Tail: 1628467200.000, Next tail: 1628470781.967 > Head - Tail : 0.000 (0.000000 frames) > DLL Rate : 3581.966736 (511.709534) > DLL Bandwidth : 1.017253e-08 1/ticks (0.250000 Hz) > > All this is normal and indicates that we have acquired sync with the device. > The sync rates are believable and there's no reason to believe anything is > going wrong. Then I get this: > > 11415308938: Debug (StreamProcessorManager.cpp)[ 842] syncStartAll: initial time of transfer 1628729206, rate 511.734222... > 11415308948: Debug (StreamProcessorManager.cpp)[ 861] syncStartAll: preset transmit tail TS 1629306953, rate 511.734222... > 11415308953: Debug (StreamProcessorManager.cpp)[ 872] syncStartAll: => transmit head TS 1628646304, fc=1291... > 11415308958: Debug (StreamProcessorManager.cpp)[ 888] alignReceivedStreams: Aligning received streams... > 11415309025: Debug (StreamProcessorManager.cpp)[ 905] alignReceivedStreams: averaging over 37 periods... > 11415319682: Warning (IsoHandlerManager.cpp)[1656] getPacket: reconstructed CTR counter discrepancy > 11415319688: Warning (IsoHandlerManager.cpp)[1662] getPacket: ingredients: 89B, 8689B000, 8489BA89, 8489C5B9, 84880A89, 67, 66, 66, 1628786313 > 11415319693: Warning (IsoHandlerManager.cpp)[1663] getPacket: diffcy = 27 > 11415319922: Warning (IsoHandlerManager.cpp)[1656] getPacket: reconstructed CTR counter discrepancy > 11415319926: Warning (IsoHandlerManager.cpp)[1662] getPacket: ingredients: 89C, 8689C000, 8489CA89, 8489E4AB, 84880A89, 67, 66, 66, 1628789385 > 11415319931: Warning (IsoHandlerManager.cpp)[1663] getPacket: diffcy = 28 > : > 11415324396: Warning (IsoHandlerManager.cpp)[1656] getPacket: reconstructed CTR counter discrepancy > 11415324400: Warning (IsoHandlerManager.cpp)[1662] getPacket: ingredients: 8B2, 868B2000, 848B2A89, 848C2228, 84880A89, 67, 66, 66, 1628856969 > 11415324404: Warning (IsoHandlerManager.cpp)[1663] getPacket: diffcy = 50 > 11415333361: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... > 11415333616: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... > > The last line repeats until the end of the log: > > 11415893961: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... > 11415894210: Debug (StreamProcessorManager.cpp)[1294] waitForPeriod: wait extended since period not ready... > > Based on this I can't say that FFADO is misbehaving here in a way which > would be expected to crash the kernel. I will look into the cause of these > "reconstructed CTR counter discrepancy" messages and what exactly they mean > since this could add to our understanding of the problem. However, as can > be seen, we receive a over 1800 packets successfully before this problem is > noted. > > > Plus there's still this mystery as to why this problem appeared out of the > blue today when everything was fine a couple of days ago. > > Regards > jonathan -- Stefan Richter -=====-===-= =--- ==--- http://arcgraph.de/sr/ |
From: Jonathan W. <jw...@ju...> - 2013-08-24 09:03:26
|
I have some more information. I tested ffado as before but with my MOTU Traveler. It worked fine, which suggests to me that whatever the problem it's triggered with the Fireface driver. I was able to run jackd/ffado with the FF800 and, by hitting Ctrl-C early, exit before the kernel crashed. After doing so memory usage did seem a little high: total used free shared buffers cached Mem: 6227572 213764 6013808 0 1648 154780 -/+ buffers/cache: 57336 6170236 Swap: 12594924 0 12594924 Compared to the total of VSZ reported by "ps auxw" (less than 150000 kB) the value of 213000 kB is larger than I expected. But perhaps we're not comparing like with like. Nothing in /proc/slabinfo looked suspicious so it's hard to say that there's a clear memory leak at work. The resulting log was consistent with what I reported earlier. Audio streaming started up just fine, with the CTR discrepancies coming in some time afterwards. I have to leave it there for now (possibly until another day). I plan to do an audit on how much data the RME driver transfers into the packet buffer in the iso callback. If it overran the buffer passed to it perhaps that could cause troubles elsewhere (in libraw1394 or in other places). Such a bug may also explain the seemingly random nature of it. For reference, so far today (with the exception of the Ctrl-C run reported above) I've had the kernel freeze every time I've started the FF800. Two days ago it started fine every time. Regards jonathan |
From: Stefan R. <st...@s5...> - 2013-08-24 09:12:51
|
Quoting for linux1394-devel without further comment. On Aug 24 Jonathan Woithe wrote: > I have some more information. > > I tested ffado as before but with my MOTU Traveler. It worked fine, which > suggests to me that whatever the problem it's triggered with the Fireface > driver. > > I was able to run jackd/ffado with the FF800 and, by hitting Ctrl-C early, > exit before the kernel crashed. After doing so memory usage did seem a > little high: > > total used free shared buffers cached > Mem: 6227572 213764 6013808 0 1648 154780 > -/+ buffers/cache: 57336 6170236 > Swap: 12594924 0 12594924 > > Compared to the total of VSZ reported by "ps auxw" (less than 150000 kB) > the value of 213000 kB is larger than I expected. But perhaps we're not > comparing like with like. Nothing in /proc/slabinfo looked suspicious so > it's hard to say that there's a clear memory leak at work. > > The resulting log was consistent with what I reported earlier. Audio > streaming started up just fine, with the CTR discrepancies coming in some > time afterwards. > > I have to leave it there for now (possibly until another day). I plan to do > an audit on how much data the RME driver transfers into the packet buffer in > the iso callback. If it overran the buffer passed to it perhaps that could > cause troubles elsewhere (in libraw1394 or in other places). Such a bug may > also explain the seemingly random nature of it. > > For reference, so far today (with the exception of the Ctrl-C run reported > above) I've had the kernel freeze every time I've started the FF800. Two > days ago it started fine every time. > > Regards > jonathan -- Stefan Richter -=====-===-= =--- ==--- http://arcgraph.de/sr/ |
From: Илья <il...@ma...> - 2013-08-24 14:58:53
|
Суббота, 24 августа 2013, 15:51 +09:30 от Jonathan Woithe <jw...@ju...>: Hello, now I have a vocation in Europe and can't test smth. on real hardware. But I can confirm, that since r2379 a kernel become to locks up more frequently. >My suspicion at present is that there is a bug in ffado's Fireface driver >relating to sample rates which is causing sync problems at start up. >However, I wouldn't have thought that this should be sufficient to trip the >kernel up and freeze the machine solid. All changes in the latest releases linked with sample rate functions and of course I think that you are right. My PC have a 16Gb RAM and 32 GB swap for it, and i7-3520M CPU, I think this is a reason for waiting a lock up appx. for 30 sec and saves all messages to syslog. Also I can inform that you can escape kernel lock up if you to be in time for disconnect a firewire cable when a "waitForPeriod: wait extended since period not ready..." message will appear. Best regards, Ilia. |
From: Jonathan W. <jw...@ju...> - 2013-08-25 03:36:29
|
On Sat, Aug 24, 2013 at 06:58:41PM +0400, ???????? wrote: > Hello, now I have a vocation in Europe and can't test smth. on real hardware. Fair enough. Thanks for the information and please do keep making comments. The more information we have the greater the chance that we'll find the culprit. > But I can confirm, that since r2379 a kernel become to locks up more frequently. Noted. I'm not sure what this observation means because I was able to start r2379 many times without error 2-3 days ago. One big puzzle for me at present is why suddenly I'm seeing problems when no software changes have occured on the system over the time concerned. The change in r2379 is, for all intents and purposes, independent of the audio streaming code so it's hard to see what it's done which might make the problem more likely. This is why I'm thinking that there might be a subtle overflow bug somewhere. But it's early days. > >My suspicion at present is that there is a bug in ffado's Fireface driver > >relating to sample rates which is causing sync problems at start up. > >However, I wouldn't have thought that this should be sufficient to trip the > >kernel up and freeze the machine solid. > > All changes in the latest releases linked with sample rate functions and > of course I think that you are right. FYI, as per a follow up message I have pretty much ruled out a direct link between these changes and the bug. > My PC have a 16Gb RAM and 32 GB swap for it, and i7-3520M CPU, I think > this is a reason for waiting a lock up appx. for 30 sec and saves all > messages to syslog. Quite possibly. It would take more time for your system to exhaust its memory and therefore there's every likelihood that some messages will be flushed to the system log files prior to the freeze. > Also I can inform that you can escape kernel lock up if you to be in time > for disconnect a firewire cable when a "waitForPeriod: wait extended since > period not ready..." message will appear. Basically it's good to exit at any point before memory is completely exhausted. I'm not sure the "waitForPeriod" message is particularly significant - I see that at times during normal operation and it's a consequence of the course timing information we receive indirectly from the fireface. Regards jonathan |
From: Jonathan W. <jw...@ju...> - 2013-08-25 13:05:45
|
> On Aug 24 Jonathan Woithe wrote: > I have to leave it there for now (possibly until another day). I plan to do > an audit on how much data the RME driver transfers into the packet buffer in > the iso callback. If it overran the buffer passed to it perhaps that could > cause troubles elsewhere (in libraw1394 or in other places). Such a bug may > also explain the seemingly random nature of it. I have had a chance to look into this. As far as I can tell the transmit packets are not being over run. The expected packet size is requested when iso transmit is being set up, and data is not written beyond the expected end point when filling the buffer in the callback. I'm now wondering whether there is something about the data pattern when a Fireface is in use which is triggering this problem. Next up (possibly tomorrow) I'll try to understand what precisely triggers the CTR discrepancy message to see if that sheds any light on things. Regarding the issue of the kernel running out of memory (which is what ultimately causes the freeze), what sort of things would FFADO have to do in order to create this situation? If for example the iso tx callback never returned (due to a bug in userspace - FFADO in this case), could this cause the firewire kernel driver to continuously allocate memory to store incoming data, eventually causing memory exhaustion? Are any other scenarios known or suspected? jonathan |
From: Илья <il...@ma...> - 2013-08-25 13:28:25
|
Воскресенье, 25 августа 2013, 22:35 +09:30 от Jonathan Woithe <jw...@ju...>: >Regarding the issue of the kernel running out of memory (which is what >ultimately causes the freeze), what sort of things would FFADO have to do in >order to create this situation? If for example the iso tx callback never >returned (due to a bug in userspace - FFADO in this case), could this cause >the firewire kernel driver to continuously allocate memory to store incoming >data, eventually causing memory exhaustion? Are any other scenarios known >or suspected? > >jonathan If I remember correctly The kernel lock up will occur when I try to start jackd at the same samplerate (without busreset) at a second time after it fails to start at first time. Your assumption looks very truly. Best regards, Ilia. |
From: Jonathan W. <jw...@ju...> - 2013-08-26 12:43:44
|
Another minor data point. As far as I can tell, ffado r2375 does not trigger the kernel problem. This revision predates the sample rate control changes, so it seems that something in there is inexplicably causing the trouble. I will try to do a bisect tomorrow to narrow this down - looking at commits after r2375 there is nothing which has an obvious potential to mess the streaming system up so dramatically. Whatever the userspace problem, I still have no leads as to how this leads to the kernel meltdown. Regards jonathan |
From: Clemens L. <cl...@la...> - 2013-08-26 14:20:30
|
Jonathan Woithe wrote: > I have had a chance to look into this. As far as I can tell the transmit > packets are not being over run. The expected packet size is requested when > iso transmit is being set up, and data is not written beyond the expected > end point when filling the buffer in the callback. In any case, the kernel checks that packet buffers never go outside the iso buffer, and that buffer is independent of any other buffers, so if FFADO were to do anything weird with its packets, the only possible corruption would be in the packets' contents. > I'm now wondering whether there is something about the data pattern when a > Fireface is in use which is triggering this problem. The pattern in the kernel's memory allocations is quite simple: some memory (for an entry in the DMA descriptor queue) gets allocated when a packet is submitted. When that packet is actually transmitted on the bus, the DMA entry is freed, and a little bit of memory for the completion event is allocated. That event memory gets freed when userspace reads the event. To exhaust kernel memory, FFADO could either submit too many packets, or continue to submit packets even though the DMA context is stopped, or it could never read the completion events from the device. > Next up (possibly tomorrow) I'll try to understand what precisely triggers > the CTR discrepancy message to see if that sheds any light on things. This happened before the lockups, and I suspect it's an independent problem. Regards, Clemens |
From: Stefan R. <st...@s5...> - 2013-08-26 22:23:58
|
On Aug 26 Clemens Ladisch wrote: > Jonathan Woithe wrote: > > I have had a chance to look into this. As far as I can tell the transmit > > packets are not being over run. The expected packet size is requested when > > iso transmit is being set up, and data is not written beyond the expected > > end point when filling the buffer in the callback. > > In any case, the kernel checks that packet buffers never go outside the > iso buffer, and that buffer is independent of any other buffers, so if > FFADO were to do anything weird with its packets, the only possible > corruption would be in the packets' contents. > > > I'm now wondering whether there is something about the data pattern when a > > Fireface is in use which is triggering this problem. > > The pattern in the kernel's memory allocations is quite simple: some > memory (for an entry in the DMA descriptor queue) gets allocated when > a packet is submitted. When that packet is actually transmitted on the > bus, the DMA entry is freed, and a little bit of memory for the > completion event is allocated. That event memory gets freed when > userspace reads the event. > > To exhaust kernel memory, FFADO could either submit too many packets, or > continue to submit packets even though the DMA context is stopped, or it > could never read the completion events from the device. Indeed; I too expect such an issue to be caused by an imbalance between the application's queuing of new buffers to transmit (or to receive into) versus the application's reading of buffer completion events. But FFADO uses libraw1394, and libraw1394 transparently bundles a read() to consume FW_CDEV_EVENT_ISO_INTERRUPT and an ioctl() to produce FW_CDEV_IOC_QUEUE_ISO, in lockstep. Could the 3.4...3.10 kernel habit to split up large FW_CDEV_EVENT_ISO_INTERRUPT arrays (rather than to drop the overhang of large FW_CDEV_EVENT_ISO_INTERRUPT arrays) have an effect here? I somehow doubt that FFADO's IRQ intervals are that large /and/ that libraw1394's read--ioctl pair ever queues more than what was completed. However, to be sure, Jonathan could update his mainline kernel to 3.10.9, and Ilia and Andreas could watch for a Ubuntu kernel update which incorporates 3.8.13.7 and update to that one (3.8.0-30.44 doesn't contain 3.8.13.7 yet; I don't know which one will), or install Ubuntu's currently available 3.8.13.7-raring or 3.10.9-saucy package from the Ubuntu mainline kernel PPA. > > Next up (possibly tomorrow) I'll try to understand what precisely triggers > > the CTR discrepancy message to see if that sheds any light on things. > > This happened before the lockups, and I suspect it's an independent > problem. > > > Regards, > Clemens -- Stefan Richter -=====-===-= =--- ==-== http://arcgraph.de/sr/ |
From: Jonathan W. <jw...@ju...> - 2013-08-26 23:45:50
|
On Mon, Aug 26, 2013 at 04:20:16PM +0200, Clemens Ladisch wrote: > Jonathan Woithe wrote: > > I have had a chance to look into this. As far as I can tell the transmit > > packets are not being over run. The expected packet size is requested when > > iso transmit is being set up, and data is not written beyond the expected > > end point when filling the buffer in the callback. > > In any case, the kernel checks that packet buffers never go outside the > iso buffer, and that buffer is independent of any other buffers, so if > FFADO were to do anything weird with its packets, the only possible > corruption would be in the packets' contents. If FFADO were to overrun the end of the buffers though I expect it could stomp on other memory used either by itself or libraw1394. It shouldn't affect the kernel directly, but if libraw1394 goes off the rails because of this the kernel could be told incorrect things. Having said that, my latest checks indicate that we're not writing beyond the end of the buffer, so that's unlikely to be an issue. > > I'm now wondering whether there is something about the data pattern when a > > Fireface is in use which is triggering this problem. > > The pattern in the kernel's memory allocations is quite simple: some > memory (for an entry in the DMA descriptor queue) gets allocated when > a packet is submitted. When that packet is actually transmitted on the > bus, the DMA entry is freed, and a little bit of memory for the > completion event is allocated. That event memory gets freed when > userspace reads the event. > > To exhaust kernel memory, FFADO could either submit too many packets, or > continue to submit packets even though the DMA context is stopped, or it > could never read the completion events from the device. We only ever submit one packet per iso cycle so I don't think we could end up submitting too many (unless libraw1394 is calling us too often for some other reason). What sort of things would cause the DMA context to stop within the kernel driver without such a request from libraw1394/ffado? Is it possible that the DMA context could stop but libraw1394 continue to request tx packets via callback? When libraw1394 is involved, what is responsible for reading the completion events? I think it's libraw1394 itself. If ffado gets hung up and somehow fails to give libraw1394 a chance to read these I guess there's a potential mechanism there, but it seems unlikely. > > Next up (possibly tomorrow) I'll try to understand what precisely triggers > > the CTR discrepancy message to see if that sheds any light on things. > > This happened before the lockups, and I suspect it's an independent > problem. Yes, could be. However, I note that when the CTR discrepancy error is not seen then everything works just fine. For example, FFADO r2375 last night worked perfectly for me and there was no sign of the CTR error. Switching back to r2383 and the CTR error - and freeze - returned. There is clearly something in the last few FFADO commits which is not right and is causing the kernel some distress. Whether the CTR messages are illustrative of the fundamental problem or are just collateral damage remains to be seen. Regards jonathan |
From: Clemens L. <cl...@la...> - 2013-08-27 08:59:53
|
Jonathan Woithe wrote: > What sort of things would cause the DMA context to stop within the kernel > driver without such a request from libraw1394/ffado? Hardware errors, or bugs in the kernel driver. > Is it possible that the DMA context could stop but libraw1394 continue to > request tx packets via callback? A stopped context would not complete packets. Does libraw1394 request more packets when it doesn't get completion events? Regards, Clemens |
From: Jonathan W. <jw...@ju...> - 2013-08-27 13:08:03
|
I have some more information. For those not interested in the detailed analysis, the last few paragraphs contain some interesting observations. FFADO revisions 2375 through 2378 do NOT cause a system lockup on my development machine. Revision 2379, if left to its own devices, seems to lock the machine hard almost every time it's run even when the FF800 is in master clock mode. This backs up observations made by Ilia about r2379 appearing to trigger the lockup more frequently, although he was also seeing the lockups on earlier revisions. Looking at the changes in r2379 it's hard to see how they could cause such a dramatic problem: * Chunk 1 is in getSamplingFrequency() and comprises comments only. * Chunk 2 is in getSamplingFrequency(). It adds a get_hardware_state() call. A new return path is relevant only if the device is in autosync clock mode. * Chunk 3 is in setSamplingFrequency(). Comments were added and a debug warning was changed to an error. Other functional changes (a new return path and a setting of software_freq) is only executed if the device is in autosync clock mode. * Chunk 4 is in getSupportedSamplingFrequencies(). Comments were added. If in autosync clock mode a new push_back() to the frequency list has been added. With my device currently in master clock mode, the only functional change introduced by this patch is the call to get_hardware_state() in getSamplingFrequency(). Commenting this out seems to fix the crash for me: I no longer see the "CTR discrepancy" messages and the kernel does not lock up a few seconds after starting jackd/ffado. So what's going on? getSamplingFrequency() is called in only a few places within the RME driver: * Device::prepare() - called once during program startup and device initialisation. * Device::getFramesPerPacket() - various call sites * Device::addDirPorts() - called only during program startup getFramesPerPacket() is significant because it shows up in the streaming driver. The call in RmeReceiveStreamProcessor is from getMaxPacketSize() and this is another function only called during startup. On the transmit side, there is a similar call from RmeTransmitStreamProcessor's getMaxPacketSize() method. However, there is one other call site: in getNominalFramesPerPacket(). This function is called 1-2 times in RmeTransmitStreamProcessor::generatePacketHeader(), and this method is called once per iso cycle. It also shows up in twice in generatePacketData(), once in generateSilentPacketHeader() and once in fillDataPacketHeader(). All up, on average this function is probably called 2-3 times per iso cycle. Prior to r2379 this wasn't a big deal since getSamplingFrequency() more or less just returned a fixed value. Revision 2379 introduced the get_hardware_state() call which involves a two-quadlet block async read transaction on the bus - a detail which was overlooked at the time r2379 was being prepared. I can't remember the bus specifications regarding async transactions and how often they can be done. In any case, it seems that the kernel driver was choking on the async load generated by this oversight: 2-3 two-quadlet async block reads per iso cycle appears to be beyond the kernel driver's capability. With this in mind I will rework the driver to avoid this problem. However, I am not in a position to judge whether this FFADO error has exposed a real problem in the firewire driver, or if it can be written off on the basis that the driver simply can't cope with this traffic pattern. I would be interested to see what others think. As an aside, if the kernel driver does have trouble with this traffic load, it seems that this might be a possible DOS vector against the kernel. If one has any firewire device connected, it seems the kernel can be crashed by sending a handful of async block read requests to that device every iso cycle (whether some iso traffic is also needed is not something I've tested). Regards jonathan |
From: Stefan R. <st...@s5...> - 2013-08-27 21:44:04
|
On Aug 27 Jonathan Woithe wrote: [...] > Revision 2379 introduced the > get_hardware_state() call which involves a two-quadlet block async read > transaction on the bus - a detail which was overlooked at the time r2379 was > being prepared. > > I can't remember the bus specifications regarding async transactions and how > often they can be done. In any case, it seems that the kernel driver was > choking on the async load generated by this oversight: 2-3 two-quadlet async > block reads per iso cycle appears to be beyond the kernel driver's > capability. > > With this in mind I will rework the driver to avoid this problem. However, > I am not in a position to judge whether this FFADO error has exposed a real > problem in the firewire driver, or if it can be written off on the basis > that the driver simply can't cope with this traffic pattern. I would be > interested to see what others think. If I am not mistaken, this part of the RME driver uses FFADO's ieee1394service's read() which in turn uses libraw1394's blocking raw1394_read(). This libraw1394 function has one thing in common with the previously discussed IT/IR queuing in libraw1394: The ioctl() which produces the request is internally bundled with a read() which consumes the corresponding completion event. Therefore the lifetime of the kernel allocation for the event object is strictly ending before raw1394_read() returns. Hence, no memory exhaustion possible no matter how quickly you repeat raw1394_read(). What's more, IT/IR completion events and some other types of events are allocated from the kernel's rather limited "emergency memory pool". (This is because the kernel context which performs this allocation is unable to got to sleep and wait for memory be reclaimed into the free pool.) In contrast, outbound asynchronous transaction completion events only use memory which is preallocated during the ioctl() call from the general kernel memory pool. (It won't allocate from the emergency pool.) If memory is tight, the ioctl will block until the kernel as able to reclaim memory. The few kernel back traces that were posted previously in this thread clearly showed exhaustion of the emergency memory pool. In other words: I still am clueless how FFADO managed to bring the kernel down. > As an aside, if the kernel driver does have trouble with this traffic load, > it seems that this might be a possible DOS vector against the kernel. If > one has any firewire device connected, it seems the kernel can be crashed by > sending a handful of async block read requests to that device every iso > cycle (whether some iso traffic is also needed is not something I've > tested). I suspect that there is indeed a DoS vector by causing firewire-core to allocate bus reset events (needs to trigger many bus resets in order to do so, and probably needs to do it fast, which is not easily possible for userspace software), or inbound asynchronous request events (needs an external requester... or do loopback requests, I am not sure), or physical packet reception events (needs external transmitter of phy packets), or IR completion events (needs to queue IR buffers continuously an needs an external streamer), or IT completion events (needs to queue IT buffers continuously). IT seems easiest from these all. But again, libraw1394 does not appear to be usable for such an exploit, since libraw1394's isochronous facility bundles ioctl() (the indirect producer of these events) with read() (the consumer of these events). -- Stefan Richter -=====-===-= =--- ==-== http://arcgraph.de/sr/ |
From: Clemens L. <cl...@la...> - 2013-08-27 13:45:18
|
Jonathan Woithe wrote: > Revision 2379 introduced the get_hardware_state() call which involves a two- > quadlet block async read transaction on the bus - a detail which was > overlooked at the time r2379 was being prepared. > > I can't remember the bus specifications regarding async transactions and how > often they can be done. Isochronous packets have priority; asynchronous packets get all the remaining time in the frame. > In any case, it seems that the kernel driver was choking on the async > load generated by this oversight: 2-3 two-quadlet async block reads > per iso cycle appears to be beyond the kernel driver's capability. The kernel driver is perfectly capable of saturating the bus with lots of async packets. > I am not in a position to judge whether this FFADO error has exposed a real > problem in the firewire driver In the case of the XIO2200A, it's a hardware bug. (Is there any kernel message in the system log?) I'm wondering about that other controller. Regards, Clemens |
From: Jonathan W. <jw...@ju...> - 2013-08-27 23:52:55
|
On Tue, Aug 27, 2013 at 03:45:06PM +0200, Clemens Ladisch wrote: > Isochronous packets have priority; asynchronous packets get all the > remaining time in the frame. Ok, in that case we definitely would not be saturating the bus. In the modes tested no more than half of the bus bandwidth is consumed by the iso stream, so even half a dozen two-quadlet async reads per iso cycle should not have a problem getting through. I wonder if the barrage of quadlet reads was putting the Fireface itself into some strange mode whereby it did strange things on the bus. Even so, it still shouldn't have been able to bring the kernel down. > > In any case, it seems that the kernel driver was choking on the async > > load generated by this oversight: 2-3 two-quadlet async block reads > > per iso cycle appears to be beyond the kernel driver's capability. > > The kernel driver is perfectly capable of saturating the bus with lots > of async packets. Fair enough. In any case, the kernel was clearly freezing when FFADO presented it with this specific traffic pattern. > > I am not in a position to judge whether this FFADO error has exposed a real > > problem in the firewire driver > > In the case of the XIO2200A, it's a hardware bug. > (Is there any kernel message in the system log?) I haven't seen any, but the lockup happens so quickly on my box that any messages which might be sent to syslog never get flushed. The only kernel-related messages I've seen in connection with the lockup are those which made it to the console; they were detailed in my initial post to this thread. > I'm wondering about that other controller. The kernel freeze has been verified to occur on systems with the following controllers when using FFADO revisions r2379 to r2383 inclusive: * Ilia: Texas Instruments XIO2200A [104c:8235] (rev 01) * Mine: Texas Instruments TSB43AB23 * Andreas: Via VT6306/7/8 [Fire II(M)] - Many copies of the following kernel messages were reported by Andreas: usb 2-1.3:ath9k_htc RX memory allocation error firewire_core: out of memory when allocate event Ilia also has a VIA vt6315 controller, but I'm not sure that the kernel freezing FFADO revisions were tested with that. Knowing that the XIO2200A had hardware bugs I can understand how it might have choked on the traffic pattern presented by these FFADO versions. However, the XIO2200A was not the only host controller with which the kernel lockup was observed which to me suggests the lockup was caused by something other than a hardware bug specific to the XIO2200A. Is it likely that all three chipsets have bugs which are triggered with this traffic pattern? This seems improbable to me. Regards jonathan |