From: Reinis R. <r...@ro...> - 2011-07-11 14:51:23
|
Hello, while deploying new moosefs instance (with the latest 1.6.20 and running a mongodb on it) I have stumbled upon situation where mfsmount hangs and all file operations halt. Both the mfsmount debug (-o debug) and normal log contains endless loop of: Jul 11 16:43:21 proc78 mfsmount[2066]: file: 44, index: 0, chunk: 359, version: 1 - writeworker: connection with (D5AF4BB6:9422) was timed out (unfinished writes: 2; try counter: 1) Jul 11 16:43:23 proc78 mfsmount[2066]: file: 44, index: 0, chunk: 359, version: 1 - writeworker: connection with (D5AF4BB6:9422) was timed out (unfinished writes: 2; try counter: 1) Jul 11 16:43:25 proc78 mfsmount[2066]: file: 44, index: 0, chunk: 359, version: 1 - writeworker: connection with (D5AF4BB4:9422) was timed out (unfinished writes: 2; try counter: 1) Jul 11 16:43:27 proc78 mfsmount[2066]: file: 44, index: 0, chunk: 359, version: 1 - writeworker: connection with (D5AF4BB6:9422) was timed out (unfinished writes: 2; try counter: 1) And the kernel log: Jul 11 15:33:16 proc78 kernel: [428520.580157] INFO: task mongod:8847 blocked for more than 120 seconds. Jul 11 15:33:16 proc78 kernel: [428520.580159] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 11 15:33:16 proc78 kernel: [428520.580160] mongod D 0000000000000002 0 8847 1 0x00000000 Jul 11 15:33:16 proc78 kernel: [428520.580164] ffff8802223ffe38 0000000000000082 0000000000000007 ffffffff81074bd8 Jul 11 15:33:16 proc78 kernel: [428520.580168] ffff8802223fffd8 ffff8802223fffd8 ffff8802223fffd8 0000000000012b00 Jul 11 15:33:16 proc78 kernel: [428520.580172] ffff88011207a2c0 ffff880221fbe0c0 0000000000000246 ffff880221005c80 Jul 11 15:33:16 proc78 kernel: [428520.580176] Call Trace: Jul 11 15:33:16 proc78 kernel: [428520.580183] [<ffffffffa0239d95>] fuse_set_nowrite+0x95/0xd0 [fuse] Jul 11 15:33:16 proc78 kernel: [428520.580200] [<ffffffffa023d1fa>] fuse_fsync_common+0xca/0x1a0 [fuse] Jul 11 15:33:16 proc78 kernel: [428520.580217] [<ffffffff8116ee0a>] vfs_fsync_range+0x5a/0xa0 Jul 11 15:33:16 proc78 kernel: [428520.580222] [<ffffffff8111a913>] sys_msync+0x153/0x1e0 Jul 11 15:33:16 proc78 kernel: [428520.580227] [<ffffffff81512292>] system_call_fastpath+0x16/0x1b Jul 11 15:33:16 proc78 kernel: [428520.580231] [<00007f415f730a7d>] 0x7f415f730a7c The cgi interface/mfsmaster doesn't report any problems with any servers or chunks. Checking all of the chunkserver logs (total 5 with goal 2) doesn't reveal anything besides them happily testing the local chunks. I have seen some past threads also the FAQ entry that this is a harmless message but the problem in my case is that the mountpoint really hangs making the file system commands (like df / ls) to freeze up and only way is to forcibly kill the mfsmount process and remount. Is there anything else I can do to solve the hang-up or any extra steps to identify/debug the cause? wbr rr |