From: Sébastien M. <seb...@gm...> - 2012-04-01 01:15:55
Hi,

I lost my data this morning. I have been using MooseFS for over 10 months and never had such a problem. I have two servers (Debian stable): one is the mfsmaster (file03) and the other is an mfschunkserver. Both have 2 GB of memory, and each has a chunkserver of about 4 TB.

I got the following messages in syslog:

Mar 31 07:09:00 file03 mfsmaster[3182]: total: usedspace: 7111203336192 (6622.82 GiB), totalspace: 9094555459584 (8469.96 GiB), usage: 78.19%
Mar 31 07:10:00 file03 mfsmaster[3182]: chunkservers status:
Mar 31 07:10:00 file03 mfsmaster[3182]: server 1 (ip: 192.168.0.182, port: 9422): usedspace: 3545542049792 (3302.04 GiB), totalspace: 3616382492672 (3368.02 GiB), usage: 98.04%
Mar 31 07:10:00 file03 mfsmaster[3182]: server 2 (ip: 192.168.0.181, port: 9422): usedspace: 3565661286400 (3320.78 GiB), totalspace: 5478172966912 (5101.95 GiB), usage: 65.09%
Mar 31 07:10:00 file03 mfsmaster[3182]: total: usedspace: 7111203336192 (6622.82 GiB), totalspace: 9094555459584 (8469.96 GiB), usage: 78.19%
Mar 31 07:10:00 file03 mfsmaster[3182]: connection with CS(192.168.0.190) has been closed by peer
Mar 31 07:10:00 file03 mfsmaster[3182]: chunkserver disconnected - ip: 192.168.0.190, port: 0, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
Mar 31 07:10:58 file03 kernel: [849836.227866] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:10:58 file03 kernel: [849836.227872] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:10:59 file03 kernel: [849837.670014] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:10:59 file03 kernel: [849837.670021] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:00 file03 mfsmaster[3182]: chunkservers status:
Mar 31 07:11:00 file03 mfsmaster[3182]: server 1 (ip: 192.168.0.182, port: 9422): usedspace: 3545542049792 (3302.04 GiB), totalspace: 3616382492672 (3368.02 GiB), usage: 98.04%
Mar 31 07:11:00 file03 mfsmaster[3182]: server 2 (ip: 192.168.0.181, port: 9422): usedspace: 3565661286400 (3320.78 GiB), totalspace: 5478172966912 (5101.95 GiB), usage: 65.09%
Mar 31 07:11:00 file03 mfsmaster[3182]: total: usedspace: 7111203336192 (6622.82 GiB), totalspace: 9094555459584 (8469.96 GiB), usage: 78.19%
Mar 31 07:11:05 file03 kernel: [849843.214701] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:05 file03 kernel: [849843.214707] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:13 file03 kernel: [849851.464014] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:13 file03 kernel: [849851.464021] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:24 file03 kernel: [849862.732083] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:24 file03 kernel: [849862.732088] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:25 file03 kernel: [849863.626723] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:25 file03 kernel: [849863.626729] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:27 file03 kernel: [849865.858301] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:27 file03 kernel: [849865.858307] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1
Mar 31 07:11:31 file03 kernel: [849869.633258] mfsmaster: page allocation failure. order:5, mode:0x4020
Mar 31 07:11:31 file03 kernel: [849869.633264] Pid: 3182, comm: mfsmaster Not tainted 2.6.32-5-686 #1

I'm using kernel 2.6.32-5-686.
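In case it helps anyone reading the excerpt above, here is a minimal sketch in Python that recomputes each chunkserver's usage from the usedspace/totalspace byte counts and counts the kernel "page allocation failure" messages. It assumes the syslog line formats are exactly as quoted above; the names parse_usage and count_alloc_failures are my own and not part of MooseFS.

```python
import re

# Matches the mfsmaster "server N (...)" status lines as quoted in the post.
SERVER_RE = re.compile(
    r"server \d+ \(ip: ([\d.]+), port: \d+\): "
    r"usedspace: (\d+) .*?totalspace: (\d+) "
)


def parse_usage(lines):
    """Return {chunkserver ip: usage in percent}, computed from the byte counts."""
    usage = {}
    for line in lines:
        m = SERVER_RE.search(line)
        if m:
            ip, used, total = m.group(1), int(m.group(2)), int(m.group(3))
            usage[ip] = 100.0 * used / total
    return usage


def count_alloc_failures(lines):
    """Count kernel 'page allocation failure' messages in the given lines."""
    return sum("page allocation failure" in line for line in lines)


# Three lines copied from the syslog excerpt in the post.
sample = [
    "Mar 31 07:10:00 file03 mfsmaster[3182]: server 1 (ip: 192.168.0.182, port: 9422): "
    "usedspace: 3545542049792 (3302.04 GiB), totalspace: 3616382492672 (3368.02 GiB), usage: 98.04%",
    "Mar 31 07:10:00 file03 mfsmaster[3182]: server 2 (ip: 192.168.0.181, port: 9422): "
    "usedspace: 3565661286400 (3320.78 GiB), totalspace: 5478172966912 (5101.95 GiB), usage: 65.09%",
    "Mar 31 07:10:58 file03 kernel: [849836.227866] mfsmaster: page allocation failure. "
    "order:5, mode:0x4020",
]

if __name__ == "__main__":
    for ip, pct in parse_usage(sample).items():
        print(f"{ip}: {pct:.2f}% used")
    print("allocation failures:", count_alloc_failures(sample))
```

To run it against a real log, replace sample with the lines read from the syslog file.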
The metadata.mfs.back file was totally corrupted (its size was about 100 MB, whereas it was 400 MB a few days ago, and I didn't remove that many files), and both mfsmetarestore -a and mfsmetarestore -m metadata.mfs.back changelog.*.mfs end with a segmentation fault. The mfsmetalogger has the same problem, since it copied the corrupted metadata.mfs.back file.

How can this happen?

I found an old copy of the metadata (from 10 days ago) and restarted from that, losing 10 days of production.

Now I'm worried: the server is up, but metadata.mfs.back hasn't been written since this morning. It has exactly the same MD5 signature as this morning:

root@file03:/var/lib/mfs# ls -l metadata.mfs.back.tmp metadata.mfs.back
-rw-r----- 1 root root 412946488 mars 31 10:55 metadata.mfs.back
-rw-r--r-- 1 root root 412946488 mars 21 10:27 metadata.mfs.back.tmp
root@file03:/var/lib/mfs# md5sum metadata.mfs.back.tmp metadata.mfs.back
d1e7c51a5f8a752dc18aa645552165e7 metadata.mfs.back.tmp
d1e7c51a5f8a752dc18aa645552165e7 metadata.mfs.back

So it looks like my metadata are not being dumped ...

Please help.

Regards,
Sébastien
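For anyone who wants to repeat the ls/md5sum check above from a script, here is a minimal sketch in Python. It only compares the MD5 digests of two files and reports how long ago a file was last modified; the function names are my own, and the paths in the usage note are the ones from the listing above.

```python
import hashlib
import os
import time


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have the same size and the same MD5 digest."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return md5_of(path_a) == md5_of(path_b)


def hours_since_modified(path):
    """Hours elapsed since the file's last modification time."""
    return (time.time() - os.path.getmtime(path)) / 3600.0
```

Usage would be something like same_content("/var/lib/mfs/metadata.mfs.back", "/var/lib/mfs/metadata.mfs.back.tmp") together with hours_since_modified("/var/lib/mfs/metadata.mfs.back"). Identical content plus a modification time that stops advancing would match what I am seeing.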