From: Thomas S H. <tha...@gm...> - 2010-12-30 20:46:55
We are getting these kernel errors:

Dec 30 20:17:34 localhost kernel: [1818169.391998] mfschunkserver: page allocation failure. order:2, mode:0x4020
Dec 30 20:17:34 localhost kernel: [1818169.392008] Pid: 8290, comm: mfschunkserver Not tainted 2.6.32-25-server #45-Ubuntu
Dec 30 20:17:34 localhost kernel: [1818169.392014] Call Trace:
Dec 30 20:17:34 localhost kernel: [1818169.392019] <IRQ> [<ffffffff810f9a2e>] __alloc_pages_slowpath+0x56e/0x580
Dec 30 20:17:34 localhost kernel: [1818169.392046] [<ffffffff810f9bb1>] __alloc_pages_nodemask+0x171/0x180
Dec 30 20:17:34 localhost kernel: [1818169.392055] [<ffffffff81131df2>] kmalloc_large_node+0x62/0xb0
Dec 30 20:17:34 localhost kernel: [1818169.392063] [<ffffffff81136409>] __kmalloc_node_track_caller+0x109/0x160
Dec 30 20:17:34 localhost kernel: [1818169.392073] [<ffffffff81468d8d>] ? dev_alloc_skb+0x1d/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392078] [<ffffffff81468aa0>] __alloc_skb+0x80/0x190
Dec 30 20:17:34 localhost kernel: [1818169.392084] [<ffffffff81468d8d>] dev_alloc_skb+0x1d/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392109] [<ffffffffa001af27>] nv_alloc_rx_optimized+0x197/0x270 [forcedeth]
Dec 30 20:17:34 localhost kernel: [1818169.392120] [<ffffffffa001a369>] ? T.936+0x269/0x2a0 [forcedeth]
Dec 30 20:17:34 localhost kernel: [1818169.392130] [<ffffffffa001c09c>] nv_nic_irq_optimized+0xdc/0x330 [forcedeth]
Dec 30 20:17:34 localhost kernel: [1818169.392138] [<ffffffff810c4030>] handle_IRQ_event+0x60/0x170
Dec 30 20:17:34 localhost kernel: [1818169.392145] [<ffffffff810c6472>] handle_edge_irq+0xd2/0x170
Dec 30 20:17:34 localhost kernel: [1818169.392152] [<ffffffff81014d12>] handle_irq+0x22/0x30
Dec 30 20:17:34 localhost kernel: [1818169.392161] [<ffffffff8155f29c>] do_IRQ+0x6c/0xf0
Dec 30 20:17:34 localhost kernel: [1818169.392166] [<ffffffff81012b13>] ret_from_intr+0x0/0x11
Dec 30 20:17:34 localhost kernel: [1818169.392174] [<ffffffff8106d494>] ? __do_softirq+0xd4/0x1e0
Dec 30 20:17:34 localhost kernel: [1818169.392180] [<ffffffff810c4030>] ? handle_IRQ_event+0x60/0x170
Dec 30 20:17:34 localhost kernel: [1818169.392187] [<ffffffff810132ec>] ? call_softirq+0x1c/0x30
Dec 30 20:17:34 localhost kernel: [1818169.392192] [<ffffffff81014cb5>] ? do_softirq+0x65/0xa0
Dec 30 20:17:34 localhost kernel: [1818169.392197] [<ffffffff8106d315>] ? irq_exit+0x85/0x90
Dec 30 20:17:34 localhost kernel: [1818169.392203] [<ffffffff8155f2a5>] ? do_IRQ+0x75/0xf0
Dec 30 20:17:34 localhost kernel: [1818169.392208] [<ffffffff81012b13>] ? ret_from_intr+0x0/0x11
Dec 30 20:17:34 localhost kernel: [1818169.392211] <EOI> [<ffffffff812bc2ad>] ? copy_user_generic_string+0x2d/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392227] [<ffffffff814af860>] ? tcp_sendmsg+0x860/0xa20
Dec 30 20:17:34 localhost kernel: [1818169.392236] [<ffffffff814630cc>] ? sock_aio_write+0x13c/0x150
Dec 30 20:17:34 localhost kernel: [1818169.392245] [<ffffffff8114378a>] ? do_sync_write+0xfa/0x140
Dec 30 20:17:34 localhost kernel: [1818169.392253] [<ffffffff8105a254>] ? try_to_wake_up+0x284/0x380
Dec 30 20:17:34 localhost kernel: [1818169.392261] [<ffffffff81084240>] ? autoremove_wake_function+0x0/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392269] [<ffffffff8101078c>] ? __switch_to+0x1ac/0x320
Dec 30 20:17:34 localhost kernel: [1818169.392276] [<ffffffff81057850>] ? finish_task_switch+0x50/0xe0
Dec 30 20:17:34 localhost kernel: [1818169.392285] [<ffffffff81251fd6>] ? security_file_permission+0x16/0x20
Dec 30 20:17:34 localhost kernel: [1818169.392292] [<ffffffff81143b54>] ? vfs_write+0x184/0x1a0
Dec 30 20:17:34 localhost kernel: [1818169.392298] [<ffffffff811442f1>] ? sys_write+0x51/0x80
Dec 30 20:17:34 localhost kernel: [1818169.392305] [<ffffffff810121b2>] ? system_call_fastpath+0x16/0x1b

We see them on the mfschunkservers, the master, and the metaloggers. When they show up on the mfsmaster, the mfsmaster starts to misbehave and all of the chunkservers disconnect. When I then stop the mfsmaster, the metadata file is not created, and I have to recover from the metalogger data or the metadata.back.tmp file.

We noticed that if we echo 3 > /proc/sys/vm/drop_caches, the kernel errors go away, but we can't afford to have these crashes; this is the second time this week.

This is a catastrophic problem, so please let me know what you think. Right now our guess is that it is related to a kernel bug. We are running 1.6.19 on Ubuntu 10.04, kernel 2.6.32.

Thanks
-Tom Hatch
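
P.S. In case it helps anyone looking at this: "order:2" in the first log line means the kernel wanted 2^2 = 4 physically contiguous pages. A rough way to see how many such blocks are free is to sum the per-order columns in /proc/buddyinfo from order 2 upward. The snippet below is only a sketch that assumes the usual "Node N, zone NAME count0 count1 ..." layout; the function name and the SAMPLE text are made up for illustration and are not output from our machines.

```python
def free_blocks_at_or_above(buddyinfo_text, min_order=2):
    """Return {(node, zone): number of free blocks of order >= min_order}.

    Assumes each line looks like:
        Node 0, zone   Normal  c0 c1 c2 ...
    where cK is the count of free blocks of 2**K contiguous pages.
    """
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        # Skip anything that does not match the assumed layout.
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        result[(node, zone)] = sum(counts[min_order:])
    return result


# Hypothetical sample text, NOT real data from the affected servers.
SAMPLE = (
    "Node 0, zone      DMA      1      1      1      0      2      1\n"
    "Node 0, zone   Normal    500    100      0      0      0      0\n"
)

for (node, zone), blocks in free_blocks_at_or_above(SAMPLE).items():
    print("node %s zone %s: %d free blocks of order >= 2" % (node, zone, blocks))
```

On a live box the same function can be fed the contents of /proc/buddyinfo, for example before and after the drop_caches workaround, to see whether the count of larger free blocks changes.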