From: Jonathan W. <jw...@ju...> - 2013-08-24 06:22:12
|
Hi all On Thu, Aug 22, 2013 at 01:02:57AM +0400, ???????? wrote: > I found a steps to reproduce a kernel software locks up by ffado. This is now getting plain strange. After successfully running ffado/jackd many times over this last week, I've suddenly started to get machine lockups. There is no obvious reason for this since the kernel has remained unchanged for many months, and I was successfully running jackd/ffado this past week when debugging the sample rate issues on the FF800. I experienced trouble with ffado r2379, even though this version had run successfully earlier in the week. Upgrading to r2383 didn't change things. The kernel is a kernel.org 3.8.0 SMP PREEMPT kernel which I compiled on June 28 2013. These are the first lockups I have experienced since using this kernel. CPU is an i7 at 2.67 GHz. The kernel is compiled for 32-bit and userspace is 32-bit (there are reasons why I have yet to move this machine to a 64-bit environment). The kernel's .config can be provided on request. There is 6 GB of RAM fitted. Even after starting X (with a very lightweight window manager), free reports: total used free shared buffers cached Mem: 6227572 318824 5908748 0 1968 216168 -/+ buffers/cache: 100688 6126884 Swap: 12594924 0 12594924 The loaded modules are included at the end of this email. The firewire host controller is: 09:06.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link) /proc/interrupts report: 18: ... IO-APIC-fasteoi pata_jmicron, i801_smbus, firewire_ohci, ehci_hcd:usb1, uhci_hcd:usb5, uhci_hcd:usb8 Obviously it's not a dedicated IRQ but that's never been an issue for the past 3+ years that I've had the machine. The jackd command line was: jackd -P70 -R -dfirewire -p512 -n3 -v 6 If running under X the machine simply froze up. There was nothing shown on the screen to indicate what the problem might be. I *thought* I had a panic handler configured, but perhaps I'm imagining it. The machine in question boots to text mode so it was trivial to test this from a console without ever running X. The symptoms were the same: the machine simply froze up without displaying any reason for doing so. Hitting Ctrl-C a few times did display ^C on the console, but after half a dozen presses even this stopped. I found that if I hit the magic-sysreq button combination to elicit an emergency sync before there was a complete freeze, I was able to get some output from the kernel. Whether this is a result of the sysreq handler or is something related to the underlying problem remains to be seen. In this condition, the kernel displayed the following line at a rate of maybe 50 to 100 per second. firewire_core 0000:09:06.0: out of memory when allocating event Eventually the scrolling stopped with the following messages displayed: kmem_getpages: 4667629 callbacks suppressed SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 firewire_core 0000:09:06.0: out of memory when allocating event SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 firewire_core 0000:09:06.0: out of memory when allocating event SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 node 0: slabs: 204425/204425, objs: 12061075/12061075, free: 0 firewire_core 0000:09:06.0: out of memory when allocating event SLAB: Unable to allocate memory on node 0 (gfp=0x20) cache: size-64, object size: 64, order: 0 The machine froze at this point. During the rapid scrolling of the firewire_core message I also noticed that every so often (maybe once every 5 to 10 seconds) there would be another block of text appearing, but it scrolled off the screen far too fast to allow comprehension. Scroll lock was ineffective. However, I was able to capture at least some it using a camera with a 60 Hz shutter speed and a reasonably fast reaction time. :-) Even so, due to the LCD's finite response time there's a reasonable amount of image untangling to do. Here is what I can gleen. Things marked with ? except the single "?" before some function names in the call trace are those I couldn't make out from the shots. [ Output which is too garbled to distinguish ] lowmem_reserve[]: 0 0 0 0 DMA: 8*4kB 1*8kB (R) 0*16kB 1*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 1*1024kB (R) 1*2048kB (R) 0*4096kB = 3496kB Normal: 11*4kB (??) 4*8kB (??) 0*16kB 0*32kB 0*64kB 8*128kB 1*256KB (R) 0*512kB 1*1024kB (R) 8*2048kB 0*4096kB = 1356kB Highmem: 1127*4kB (UM) 665*8kB (UM) 416*16kB (UM) 234*32kB (UM) 133*64kB (UM) 74*128kB (UM) 42*256kB (UM) 27*512kB (UM) 18*1024kB (M) 7*2048kB (M) 1244*4096kB (M?) = 5194724kB 17294 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 12594924kB 1769456 pages RAM 1541122 pages HighMem 212563 pages reserved 270346 pages shared 251286 pages non-shared firewire_core 0000:09:06.0: out of memory when allocating event swapper/7: page allocation failure: order:0, mode: 0x200020 Pid: 0, comm: swapper/7 Not tainted 3.8.0 #2 Call Trace: [<c1088df4>] __alloc_pages_nodemask+0x3c4/0x5c0 [<c102ee80>] ? log_store+0x190/0x240 [<c16ae32c>] cache_alloc_refill+0x2dc/0x580 [<c10ae67a>] __kmalloc+0xaa/0xf0 [<f86ea533>] iso_callback+0x23/0xb0 [firewire_core] [<f8704662>] handle_it_packet+0x132/0x190 [firewire_ohci] [<f870511c>] context_tasklet+0xac/0x150 [firewire_ohci] [<c1034a80>] tasklet_action+0x50/0xa0 [<c1035080>] __do_softirq+0x90/0x140 [<c1034ff0>] ? remote_softirq_receive+0x20/0x20 <IRQ> [<c1035246>] ? irq_exit+0x76/0xa0 [<c1004506>] ? do_IRQ+0x46/0xb0 [<c13e9fec>] ? common_interrupt+0x2c/0x31 [<c105067b>] ? set_groups+0xbb/0x170 [<c132c396>] ? cpuidle_wrap_enter+0x36/0x80 [<c132bc52>] ? cpuidle_enter_tk+0x12/0x20 [<c132bc20>] ? read_measured_perf_ctrs+0x50/0x50 [<c132c13c>] ? cpuidle_enter_state+0xc/0x40 [<c132a1c6>] ? cpu_idle+0x56/0xb0 [<c13def56>] ? start_secondary+?????? Mem-Info: DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 used: 0 [End of output] Putting all this together, it seems like the kernel is exhausing physical memory rather rapidly and (I'm guessing) it can no longer alloocate memory to handle the IT packet. In other words, it seems that we have a rather difficult to reproduce kernel bug (as evidenced by the fact that the exact same software has worked perfectly for me on exactly the same hardware earlier in the week). The trouble with the above is that while it documents where we're running out of memory it really says nothing about why we're running out of memory. Clearly this is happening with a different firewire host controller compared to Ilia, so I think we can rule out a particular card as the cause of these troubles. I'll keep think of tests I can do to try to narrow down the problem but I'm definitely open to suggestions. Needless to say that there was no information about the freeze in the system logs on reboot. I did notice that ffado appeared to have some trouble starting the streaming interface of the FF800 in the early stages of the run. I don't know why this would suddenly be an issue (it was fine 2 days ago) and it's something else I need to look into. However, even if this were the case it should not be capable of crashing the kernel like this. One thing I will try to test is another firewire interface. This will tell whether the bug is triggered by FFADO generally or requires the Fireface for some yet to be identified reason. I will report back about this once I have a chance to test it. My suspicion at present is that there is a bug in ffado's Fireface driver relating to sample rates which is causing sync problems at start up. However, I wouldn't have thought that this should be sufficient to trip the kernel up and freeze the machine solid. Regards jonathan List of modules loaded at the time of the freeze Module Size Used by usb_storage 33435 0 snd_seq_dummy 1023 0 snd_seq_oss 23324 0 snd_seq_midi_event 4312 1 snd_seq_oss snd_seq 39481 5 snd_seq_midi_event,snd_seq_oss,snd_seq_dummy snd_pcm_oss 32710 0 snd_mixer_oss 12120 1 snd_pcm_oss usbhid 22815 0 radeon 826022 2 uhci_hcd 19608 0 snd_ens1371 15095 0 snd_ac97_codec 86564 1 snd_ens1371 ac97_bus 770 1 snd_ac97_codec snd_rawmidi 14438 1 snd_ens1371 cfbfillrect 3118 1 radeon cfbimgblt 1891 1 radeon cfbcopyarea 2978 1 radeon i2c_algo_bit 4415 1 radeon drm_kms_helper 24090 1 radeon ttm 45230 1 radeon drm 174235 4 ttm,drm_kms_helper,radeon ehci_pci 2884 0 ehci_hcd 33242 1 ehci_pci snd_seq_device 4020 4 snd_seq,snd_rawmidi,snd_seq_oss,snd_seq_dummy snd_pcm 62121 3 snd_pcm_oss,snd_ac97_codec,snd_ens1371 usbcore 118905 5 uhci_hcd,usb_storage,ehci_hcd,ehci_pci,usbhid snd_page_alloc 5934 1 snd_pcm snd_timer 14414 2 snd_pcm,snd_seq snd 41759 10 snd_pcm_oss,snd_ac97_codec,snd_timer,snd_pcm,snd_seq,snd_rawmidi,snd_seq_oss,snd_ens1371,snd_seq_device,snd_mixer_oss agpgart 21191 2 drm,ttm firewire_ohci 28422 0 firewire_core 44036 1 firewire_ohci nosy 5919 0 soundcore 4186 1 snd r8169 48283 0 usb_common 582 1 usbcore loop 14395 0 |