[Moosefs-users] Crashing mfsmaster

We are getting these kernel errors:
Dec 30 20:17:34 localhost kernel: [1818169.391998] mfschunkserver: page allocation failure. order:2, mode:0x4020
Dec 30 20:17:34 localhost kernel: [1818169.392008] Pid: 8290, comm: mfschunkserver Not tainted 2.6.32-25-server #45-Ubuntu
Dec 30 20:17:34 localhost kernel: [1818169.392014] Call Trace:
Dec 30 20:17:34 localhost kernel: [1818169.392019]  <IRQ>  [<ffffffff810f9a2e>] __alloc_pages_slowpath+0x56e/0x580
Dec 30 20:17:34 localhost kernel: [1818169.392046]  [<ffffffff810f9bb1>] __alloc_pages_nodemask+0x171/0x180
Dec 30 20:17:34 localhost kernel: [1818169.392055]  [<ffffffff81131df2>] kmalloc_large_node+0x62/0xb0
Dec 30 20:17:34 localhost kernel: [1818169.392063]  [<ffffffff81136409>] __kmalloc_node_track_caller+0x109/0x160
Dec 30 20:17:34 localhost kernel: [1818169.392073]  [<ffffffff81468d8d>] ? dev_alloc_skb+0x1d/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392078]  [<ffffffff81468aa0>] __alloc_skb+0x80/0x190
Dec 30 20:17:34 localhost kernel: [1818169.392084]  [<ffffffff81468d8d>] dev_alloc_skb+0x1d/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392109]  [<ffffffffa001af27>] nv_alloc_rx_optimized+0x197/0x270 [forcedeth]
Dec 30 20:17:34 localhost kernel: [1818169.392120]  [<ffffffffa001a369>] ? T.936+0x269/0x2a0 [forcedeth]
Dec 30 20:17:34 localhost kernel: [1818169.392130]  [<ffffffffa001c09c>] nv_nic_irq_optimized+0xdc/0x330 [forcedeth]
Dec 30 20:17:34 localhost kernel: [1818169.392138]  [<ffffffff810c4030>] handle_IRQ_event+0x60/0x170
Dec 30 20:17:34 localhost kernel: [1818169.392145]  [<ffffffff810c6472>] handle_edge_irq+0xd2/0x170
Dec 30 20:17:34 localhost kernel: [1818169.392152]  [<ffffffff81014d12>] handle_irq+0x22/0x30
Dec 30 20:17:34 localhost kernel: [1818169.392161]  [<ffffffff8155f29c>] do_IRQ+0x6c/0xf0
Dec 30 20:17:34 localhost kernel: [1818169.392166]  [<ffffffff81012b13>] ret_from_intr+0x0/0x11
Dec 30 20:17:34 localhost kernel: [1818169.392174]  [<ffffffff8106d494>] ? __do_softirq+0xd4/0x1e0
Dec 30 20:17:34 localhost kernel: [1818169.392180]  [<ffffffff810c4030>] ? handle_IRQ_event+0x60/0x170
Dec 30 20:17:34 localhost kernel: [1818169.392187]  [<ffffffff810132ec>] ? call_softirq+0x1c/0x30
Dec 30 20:17:34 localhost kernel: [1818169.392192]  [<ffffffff81014cb5>] ? do_softirq+0x65/0xa0
Dec 30 20:17:34 localhost kernel: [1818169.392197]  [<ffffffff8106d315>] ? irq_exit+0x85/0x90
Dec 30 20:17:34 localhost kernel: [1818169.392203]  [<ffffffff8155f2a5>] ? do_IRQ+0x75/0xf0
Dec 30 20:17:34 localhost kernel: [1818169.392208]  [<ffffffff81012b13>] ? ret_from_intr+0x0/0x11
Dec 30 20:17:34 localhost kernel: [1818169.392211]  <EOI>  [<ffffffff812bc2ad>] ? copy_user_generic_string+0x2d/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392227]  [<ffffffff814af860>] ? tcp_sendmsg+0x860/0xa20
Dec 30 20:17:34 localhost kernel: [1818169.392236]  [<ffffffff814630cc>] ? sock_aio_write+0x13c/0x150
Dec 30 20:17:34 localhost kernel: [1818169.392245]  [<ffffffff8114378a>] ? do_sync_write+0xfa/0x140
Dec 30 20:17:34 localhost kernel: [1818169.392253]  [<ffffffff8105a254>] ? try_to_wake_up+0x284/0x380
Dec 30 20:17:34 localhost kernel: [1818169.392261]  [<ffffffff81084240>] ? autoremove_wake_function+0x0/0x40
Dec 30 20:17:34 localhost kernel: [1818169.392269]  [<ffffffff8101078c>] ? __switch_to+0x1ac/0x320
Dec 30 20:17:34 localhost kernel: [1818169.392276]  [<ffffffff81057850>] ? finish_task_switch+0x50/0xe0
Dec 30 20:17:34 localhost kernel: [1818169.392285]  [<ffffffff81251fd6>] ? security_file_permission+0x16/0x20
Dec 30 20:17:34 localhost kernel: [1818169.392292]  [<ffffffff81143b54>] ? vfs_write+0x184/0x1a0
Dec 30 20:17:34 localhost kernel: [1818169.392298]  [<ffffffff811442f1>] ? sys_write+0x51/0x80
Dec 30 20:17:34 localhost kernel: [1818169.392305]  [<ffffffff810121b2>] ? system_call_fastpath+0x16/0x1b
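
For reading the first line of the trace: order:2 means the kernel asked for
2^2 = 4 physically contiguous pages, which is 16 KiB if the page size is 4 KiB
(an assumption here, though typical for x86_64). Below is a minimal Python
sketch that pulls those numbers out of such a log line. The function name and
the regular expression are made up for illustration and are not part of
MooseFS or any kernel tooling.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on typical x86_64 systems

# Matches e.g. "mfschunkserver: page allocation failure. order:2, mode:0x4020"
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_alloc_failure(line):
    """Return process name, order, and request size for a failure line.

    Returns None when the line is not a page allocation failure message.
    """
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    return {
        "comm": m.group("comm"),
        "order": order,
        "pages": 1 << order,                 # 2**order contiguous pages
        "bytes": (1 << order) * PAGE_SIZE,   # size of the failed request
        "mode": int(m.group("mode"), 16),    # raw allocation flags
    }


line = ("Dec 30 20:17:34 localhost kernel: [1818169.391998] mfschunkserver: "
        "page allocation failure. order:2, mode:0x4020")
print(parse_alloc_failure(line))
```

For the line above this reports 4 pages (16384 bytes) requested on behalf of
mfschunkserver. Running it over the whole syslog would show which processes
hit the failure and how often.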

We see these errors for the chunkservers, the master, and the metaloggers.
When they appear for mfsmaster, the master becomes unstable and all of the
chunkservers disconnect. When I then stop mfsmaster, the metadata file is not
written, and I have to recover from the metalogger data or the
metadata.back.tmp file.

We noticed that if we run echo 3 > /proc/sys/vm/drop_caches, the kernel
errors go away. Even so, we can't afford these crashes; this is the second
one this week.
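
One way to check whether the drop_caches effect is tied to a shortage of
contiguous free memory (a guess, not something confirmed here) would be to
compare /proc/buddyinfo before and after the echo. That file lists, per
memory zone, how many free blocks exist at each order, starting at order 0.
Below is a rough Python sketch; the function names are hypothetical, and the
choice of order 2 simply mirrors the order:2 in the trace.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal 4 3 2 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_blocks_at_least(counts, order):
    """Count free blocks large enough to satisfy a request of this order."""
    return sum(counts[order:])


if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    for (node, zone), counts in parse_buddyinfo(text).items():
        print("node", node, "zone", zone,
              "free blocks of order >= 2:", free_blocks_at_least(counts, 2))
```

If the count for order 2 and above is near zero before the echo and clearly
higher afterwards, that would point toward memory fragmentation rather than
an outright lack of memory.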

This is a catastrophic problem for us, so please let me know what you think.
Right now our guess is that it is related to a kernel bug.

We are running MooseFS 1.6.19 on Ubuntu 10.04 with kernel 2.6.32.

Thanks

-Tom Hatch
