From: Michal B. <mic...@ge...> - 2011-04-05 08:11:53
Hi Pedro!

It may happen that after a power failure the last changelog is broken. You need to find the last line of the changelog (usually changelog.0.mfs) and delete it. Generally speaking, after a power failure it is better to use the metadata files from the metalogger rather than running mfsmetarestore on the master server.

We have made some improvements to metarestore in the next development version; it is now more fail-proof against these kinds of errors. We added a '-b' option which forces writing the resulting metadata file up to the first encountered error (as "better such than none"). If you would like to test it, we can send you the metarestore sources before we officially publish 1.6.21.

Kind regards
Michał Borychowski
MooseFS Support Manager

Gemius S.A.
ul. Wołoska 7, 02-672 Warszawa
Budynek MARS, klatka D
Tel.: +4822 874-41-00
Fax : +4822 874-41-01

-----Original Message-----
From: Pedro Naranjo [mailto:pe...@st...]
Sent: Monday, March 21, 2011 11:20 PM
To: moo...@li...
Subject: [Moosefs-users] Power failure... unable to restore metadata.mfs

Hi there,

We ran another test today after losing about 3TB+ of data when we could not restore the metadata.mfs file. We were in the process of copying files over when the power was cut off. This led to an error when running the mfsmetarestore -a command, as follows:

3605: '|' expected

We really feel that MooseFS is the best solution we could find, but after failing to recover from something as real as a power failure, I really worry.

Please advise how to fix this problem.

Sincerely,

Pedro Naranjo / STL Technologies / Solutions Architect / 888.556.0774
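A minimal sketch of the last-line repair Michal describes, assuming the default metadata directory /var/lib/mfs (your DATA_PATH may differ); keep a backup before editing:

    # cd /var/lib/mfs
    # cp changelog.0.mfs changelog.0.mfs.bak   # keep a safety copy first
    # tail -n 1 changelog.0.mfs                # inspect the (possibly truncated) last entry
    # sed -i '$d' changelog.0.mfs              # delete the last line in place
    # mfsmetarestore -a                        # retry the automatic restore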
From: Michal B. <mic...@ge...> - 2011-04-05 07:49:10
Hi!

We tried to recreate this error but we couldn't. Could you run your metalogger under valgrind? Possibly it would write some interesting information. Or could you send us a core dump?

Kind regards
Michał Borychowski
MooseFS Support Manager
Gemius S.A.

-----Original Message-----
From: Boyko Yordanov [mailto:b.y...@ex...]
Sent: Sunday, March 20, 2011 1:19 PM
To: moo...@li...
Subject: [Moosefs-users] mfsmetalogger segfaults

Hello!

I've been using moosefs for a while. I have 3 metadata backup loggers running. I noticed that if I kill the mfsmaster process on the master node (simulating a power failure), mfsmetalogger crashes (segfault) on the metadata logger node. Here are the log entries:

Mar 20 11:45:35 server110 mfsmetalogger[6546]: metadata downloaded 72105B/0.009982s (7.224 MB/s)
Mar 20 11:45:35 server110 mfsmetalogger[6546]: changelog_0 downloaded 0B/0.000001s (0.000 MB/s)
Mar 20 11:45:35 server110 mfsmetalogger[6546]: changelog_1 downloaded 164193B/0.015491s (10.599 MB/s)
Mar 20 11:45:35 server110 mfsmetalogger[6546]: sessions downloaded 3050B/0.001501s (2.032 MB/s)
Mar 20 11:46:03 server110 mfsmetalogger[6546]: sessions downloaded 3050B/0.001497s (2.037 MB/s)
Mar 20 11:47:00 server110 mfsmetalogger[6546]: sessions downloaded 3050B/0.001246s (2.448 MB/s)
Mar 20 11:48:48 server110 mfsmetalogger[6546]: sessions downloaded 3050B/0.001009s (3.023 MB/s)
Mar 20 11:48:48 server110 mfsmetalogger[6546]: connection was reset by Master
Mar 20 11:49:00 server110 mfsmetalogger[6546]: connecting ...
Mar 20 11:49:00 server110 mfsmetalogger[6546]: connection failed, error: ECONNREFUSED (Connection refused)
Mar 20 11:49:05 server110 mfsmetalogger[6546]: connecting ...
Mar 20 11:49:05 server110 mfsmetalogger[6546]: connection failed, error: ECONNREFUSED (Connection refused)
Mar 20 11:49:06 server110 kernel: mfsmetalogger[6546]: segfault at 0000000000000060 rip 000000318c26119d rsp 00007fff2f368170 error 4

From another metadata logger:

Mar 20 13:33:00 server102 mfsmetalogger[5088]: sessions downloaded 3388B/0.000993s (3.412 MB/s)
Mar 20 13:34:00 server102 mfsmetalogger[5088]: sessions downloaded 3388B/0.001000s (3.388 MB/s)
Mar 20 13:35:00 server102 mfsmetalogger[5088]: sessions downloaded 3388B/0.001000s (3.388 MB/s)
Mar 20 13:35:48 server102 mfsmetalogger[5088]: connection was reset by Master
Mar 20 13:35:50 server102 mfsmetalogger[5088]: connecting ...
Mar 20 13:35:50 server102 mfsmetalogger[5088]: connection failed, error: ECONNREFUSED (Connection refused)
Mar 20 13:35:55 server102 mfsmetalogger[5088]: connecting ...
Mar 20 13:35:55 server102 mfsmetalogger[5088]: connection failed, error: ECONNREFUSED (Connection refused)
Mar 20 13:35:56 server102 kernel: mfsmetalogger[5088]: segfault at 0000000000000060 rip 0000003c6386119d rsp 00007fff7d13a7d0 error 4
Mar 20 13:37:23 server102 mfsmetalogger[12676]: set gid to 502
Mar 20 13:37:23 server102 mfsmetalogger[12676]: set uid to 502
Mar 20 13:37:23 server102 mfsmetalogger[12676]: connecting ...
Mar 20 13:37:23 server102 mfsmetalogger[12676]: open files limit: 5000
Mar 20 13:37:23 server102 mfsmetalogger[12676]: connected to Master
Mar 20 13:37:23 server102 mfsmetalogger[12676]: metadata downloaded 72113B/0.013963s (5.165 MB/s)
Mar 20 13:37:23 server102 mfsmetalogger[12676]: changelog_0 downloaded 981876B/0.086934s (11.294 MB/s)
Mar 20 13:37:23 server102 mfsmetalogger[12676]: changelog_1 downloaded 164193B/0.015978s (10.276 MB/s)
Mar 20 13:37:23 server102 mfsmetalogger[12676]: sessions downloaded 3388B/0.001993s (1.700 MB/s)
Mar 20 13:39:00 server102 mfsmetalogger[12676]: sessions downloaded 3388B/0.002965s (1.143 MB/s)
Mar 20 13:40:00 server102 mfsmetalogger[12676]: sessions downloaded 3388B/0.001986s (1.706 MB/s)
Mar 20 13:41:00 server102 mfsmetalogger[12676]: sessions downloaded 3388B/0.000991s (3.419 MB/s)
Mar 20 13:41:23 server102 mfsmetalogger[12676]: connection was reset by Master
Mar 20 13:41:25 server102 mfsmetalogger[12676]: connecting ...
Mar 20 13:41:25 server102 mfsmetalogger[12676]: connection failed, error: ECONNREFUSED (Connection refused)
Mar 20 13:41:30 server102 mfsmetalogger[12676]: connecting ...
Mar 20 13:41:30 server102 mfsmetalogger[12676]: connection failed, error: ECONNREFUSED (Connection refused)
Mar 20 13:41:31 server102 kernel: mfsmetalogger[12676]: segfault at 0000000000000060 rip 0000003c6386119d rsp 00007fff5207ee20 error 4

Both machines are running CentOS 5.5, x86_64, mfs-1.6.20-2; same for the master.

Also, not sure if related, but while running tests - killing the mfsmaster process and trying to restore from a metadata logger - sometimes I am unable to create the metadata.mfs data file, getting the following message:

[root@server102 mfs]# mfsmetarestore -a -d /var/lib/mfs
file 'metadata.mfs.back' not found - will try 'metadata_ml.mfs.back' instead
loading objects (files,directories,etc.) ... ok
loading names ... ok
loading deletion timestamps ... ok
checking filesystem consistency ... ok
loading chunks data ... ok
connecting files and chunks ... ok
hole in change files (entries from 791301 to 791305 are missing) - add more files

I wonder why these entries are missing. As the mfsmetalogger process crashes after the mfsmaster process is killed, can this be related? (BTW, I'm building the metadata.mfs file as suggested by Michal Borychowski in another email, regarding a bug in moosefs when using snapshots.)

I can't tell for sure, but I think that if I clear the /var/lib/mfs folder (delete all the logs/files) and then start mfsmetalogger clean, there are no issues when restoring metadata.mfs - all goes fine (at least for the 10 times I've tried so far). So the 'add more files' errors may be related to having old changelogs in /var/lib/mfs - can anyone confirm this?

Is anyone having similar issues?

Thanks a lot!
Boyko
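A sketch of the diagnostics Michal asks for. It assumes mfsmetalogger can be kept in the foreground with -d and is installed at /usr/sbin/mfsmetalogger (both assumptions - check your build), and that core dumps are enabled on the box:

    # ulimit -c unlimited                              # allow core dumps in this shell
    # valgrind --log-file=metalogger.vg /usr/sbin/mfsmetalogger -d
    # gdb /usr/sbin/mfsmetalogger core.<pid>           # inspect the core after a segfault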
From: Michal B. <mic...@ge...> - 2011-04-05 07:30:20
Hi!

Ad 1. Officially we do not support HP-UX, but the only reason for this is that we do not have machines with this platform for running tests. We'd be very happy to announce that MooseFS is HP-UX compatible.

Ad 2. For the moment, yes, the clients need to use FUSE. In the future some kind of conversion to NFS (probably NFSv4) should be created.

Ad 3. Some time ago we created a plugin for NFSv3, but it would need some improvements. I guess we could create something like this specially for you - please drop me an email if you are interested.

Kind regards
Michał Borychowski
MooseFS Support Manager
Gemius S.A.

> Hi, I want to ask you these questions, please answer me:
> 1. Does MooseFS support HP-UX IA64?
> 2. Must the client use FUSE?
> 3. Our clients must use HP-UX IA64 - what should I do?
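For reference, mounting through FUSE on a supported client is a one-liner (a sketch; "mfsmaster" is an assumed hostname resolving to your master server):

    # mfsmount /mnt/mfs -H mfsmaster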
From: Fyodor U. <uf...@uf...> - 2011-04-04 20:11:18
On 04/04/2011 10:51 PM, Fyodor Ustinov wrote:
> Hi.
>
> ceph osd pool set data size 1
> dd if=/dev/zero of=aaa bs=1024000 count=4000
> 4096000000 bytes (4.1 GB) copied, 31.3153 s, 131 MB/s
>
> ceph osd pool set data size 2
> 4096000000 bytes (4.1 GB) copied, 72.7146 s, 56.3 MB/s
>
> ceph osd pool set data size 3
> 4096000000 bytes (4.1 GB) copied, 136.263 s, 30.1 MB/s
>
> Why? I thought an increase in the number of copies should increase performance (or in the worst case not affect it).
>
> WBR,
> Fyodor.

Oops - not about moose. :) I'm testing ceph and moosefs simultaneously; this one is about ceph. :)

About moosefs - bonnie++ shows a dramatic slowdown on the rewrite test... :(
From: Fyodor U. <uf...@uf...> - 2011-04-04 20:11:18
Hi.

ceph osd pool set data size 1
dd if=/dev/zero of=aaa bs=1024000 count=4000
4096000000 bytes (4.1 GB) copied, 31.3153 s, 131 MB/s

ceph osd pool set data size 2
4096000000 bytes (4.1 GB) copied, 72.7146 s, 56.3 MB/s

ceph osd pool set data size 3
4096000000 bytes (4.1 GB) copied, 136.263 s, 30.1 MB/s

Why? I thought an increase in the number of copies should increase performance (or in the worst case not affect it).

WBR,
Fyodor.
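One plausible explanation (my reasoning, not from the thread): with synchronous replication the client must push every byte to all N replicas, so write throughput scales at best as 1/N of the single-copy rate, which roughly fits the figures above:

    # echo "scale=1; 131/2" | bc    # 65.5 MB/s predicted vs 56.3 measured (size 2)
    # echo "scale=1; 131/3" | bc    # 43.6 MB/s predicted vs 30.1 measured (size 3)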
From: g. <guj...@ge...> - 2011-04-02 02:27:32
Dears:

Thank you very much for your help. I solved the problem. Thanks, jose maria!!

2011-04-02

Best Regards!!

Juby Gu (古举标), Storage Engineer, Systems Support Group, Information Production Platform
Mobile: 13723406010  QQ: 190247054  Email: guj...@ge...
BGI (Beijing Genomics Institute)
Addr: No.1001, Floor 10, Main Building, Beishan Industrial Zone, Yantian District, Shenzhen, China
Post Code: 518083

From: jose maria
Sent: 2011-04-02 01:29:29
To: moosefs-users
Subject: Re: [Moosefs-users] help: mfsfileinfo: operation not permitted.

On Fri, 01-04-2011 at 18:22 +0800, 古举标 wrote:
> [...]

* Execute the commands on files, and use mfsdirinfo on a directory.
* /mnt/mfs is a local directory.
From: jose m. <let...@us...> - 2011-04-01 17:28:52
On Fri, 01-04-2011 at 18:22 +0800, 古举标 wrote:
> Dears:
>
> When I use the "mfsfileinfo" and "mfscheckfile" commands, an error occurs. Below is the error message; please help, thanks a lot.
> #/usr/local/mfs/bin/mfsfileinfo /mnt/mfs
> /mnt/mfs [0]: Operation not permitted.
> #/usr/local/mfs/bin/mfscheckfile /mnt/mfs
> /mnt/mfs [0]: Operation not permitted.
>
> #df -Th | grep mfs
> mfs#mfsmaster:9421 fuse 13G 2.4M 13G 1% /mnt/mfs
>
> #ll /mnt/
> drwxrwxrwx 2 root root 0 Mar 30 12:51 mfs
>
> OS: CentOS release 5.4, kernel 2.6.18-164.el5, 32-bit
> MooseFS version: mfs-1.6.13
> FUSE version: fuse-2.8.3

* Execute the commands on files, and use mfsdirinfo on a directory.
* /mnt/mfs is a local directory.
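To spell out jose maria's advice: mfsfileinfo and mfscheckfile take files inside the mount, while mfsdirinfo takes directories. For example (the file name below is hypothetical):

    # /usr/local/mfs/bin/mfsfileinfo /mnt/mfs/some/file.dat
    # /usr/local/mfs/bin/mfscheckfile /mnt/mfs/some/file.dat
    # /usr/local/mfs/bin/mfsdirinfo /mnt/mfs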
From: 古举标 <guj...@ge...> - 2011-04-01 10:23:11
Dears:

When I use the "mfsfileinfo" and "mfscheckfile" commands, an error occurs. Below is the error message; please help, thanks a lot.

#/usr/local/mfs/bin/mfsfileinfo /mnt/mfs
/mnt/mfs [0]: Operation not permitted.
#/usr/local/mfs/bin/mfscheckfile /mnt/mfs
/mnt/mfs [0]: Operation not permitted.

#df -Th | grep mfs
mfs#mfsmaster:9421 fuse 13G 2.4M 13G 1% /mnt/mfs

#ll /mnt/
drwxrwxrwx 2 root root 0 Mar 30 12:51 mfs

OS: CentOS release 5.4, kernel 2.6.18-164.el5, 32-bit
MooseFS version: mfs-1.6.13
FUSE version: fuse-2.8.3

Thanks again.

2011-03-31

Best Regards!!

Juby Gu (古举标), Storage Engineer, Systems Support Group, Information Production Platform
Mobile: 13723406010  QQ: 190247054  Email: guj...@ge...
BGI (Beijing Genomics Institute)
Addr: No.1001, Floor 10, Main Building, Beishan Industrial Zone, Yantian District, Shenzhen, China
Post Code: 518083
From: <da...@sq...> - 2011-03-31 11:13:14
It appears that both the master and the metalogger servers have gotten corrupted. When I try to run mfsmetarestore -a on the master I get the following:

hole in change files (entries from 10772493 to 624987257 are missing) - add more files

So I go to the metalogger and I get the same message, but with a different set of numbers. Is there a way to run a partial restore, so I can at least get to some of the data? Or is my data just gone entirely? (I still have the chunkservers, which are intact.)

# mfsmetarestore -v
version: 1.6.20

I've been searching for several days but haven't really been able to find much in relation to this. I know that it isn't related to the snapshot bug - I haven't used snapshots yet.

Thanks,
Dallin Jones
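A sketch of a manual restore, in case the automatic mode keeps failing: mfsmetarestore also accepts an explicit base file and changelog list, so changelogs collected from the master and every metalogger can be merged to close the hole (paths here are assumptions):

    # mfsmetarestore -m /var/lib/mfs/metadata.mfs.back \
                     -o /var/lib/mfs/metadata.mfs \
                     /var/lib/mfs/changelog*.mfs

The '-b' option Michal mentions elsewhere in this archive (added in 1.6.21 development) writes out whatever restores cleanly up to the first error, which is the closest thing to a partial restore.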
From: <ha...@si...> - 2011-03-29 02:03:32
Hello! Will you help me solve some problems?

We intend to use MooseFS in our production environment as the storage for our product service. I want to ask some questions, as follows:

Problem one: For MooseFS, if the storage is 500T with 3 million files, performing about 500G operations a day, how much memory does the metadata need?

Problem two: If we modify the management of the metadata namespace, for example from a hash to a B-tree, would it need a lot of work, and do you think it is feasible and reasonable?

Problem three: Does MFS support InfiniBand? If we modified it to add support, do you think it would need much work, and would it be reasonable?

And the last one: Does MooseFS support IBM AIX and HP-UX? Please tell me which OSes it supports and which it does not.

That's all, thanks a lot! I sincerely look forward to your reply!

Best regards!
Hanyw
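On problem one, a back-of-the-envelope estimate rather than an official figure: MooseFS documentation of this era suggested roughly 300 bytes of master RAM per filesystem object, so 3 million files is on the order of 1 GB:

    # echo $((3000000 * 300))
    900000000    # ~0.9 GB of master RAM for 3 million objects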
From: TianYuchuan(田玉川) <ti...@fo...> - 2011-03-25 06:10:01
Hi!

Thanks! The master server now has 15K SAS disks installed, and the problem is solved! I have updated MooseFS; the version is now mfs-1.6.20-2. After the update, CPU usage is 16%.

Another question, about MooseFS upgrades: I installed the mfs-1.6.20-2 version on a new server and started the master, chunkserver and client. The old master server was not stopped. The old master server was not connected to any chunkserver or client, yet its master process occupied 80% CPU; after I restarted the master service, CPU utilization dropped to 5%. Can the master not release the CPU?

-----Original Message-----
From: Michal Borychowski [mailto:mic...@ge...]
Sent: 2011-03-24 16:34
To: 'TianYuchuan(田玉川)'; 'Shen Guowen'
Cc: moo...@li...
Subject: RE: [Moosefs-users] To access data was very slowly, nearly 2 minutes. Oh my god!

Hi!

You have almost all RAM consumed. As you have 100 million files in the system, we suggest putting some extra RAM in the master server. It would also be advisable to put an SSD disk in the master server so that the hourly metadata dump takes less time.

Kind regards
Michał Borychowski
MooseFS Support Manager
Gemius S.A.

-----Original Message-----
From: TianYuchuan(田玉川) [mailto:ti...@fo...]
Sent: Thursday, March 17, 2011 10:03 AM
To: Shen Guowen
Cc: moo...@li...
Subject: [Moosefs-users] To access data was very slowly, nearly 2 minutes. Oh my god!

Hello

My MooseFS system is accessed very slowly; I have no idea why, please help me! Thanks!!!

Files number: 104964618; chunks number: 104963962. Master load is not high, but on the hour, every hour, data cannot be accessed for several minutes. In general, concurrency is small and data access is delayed by a few seconds.

My MooseFS system has nine chunkservers:

 #  host       ip             port  version  chunks    used     total    used%  (marked for removal: chunks / used / total / %)
 1  localhost  192.168.0.118  9422  1.6.19   23387618  3.6 TiB  4.5 TiB  79.72  0 / 0 B / 0 B / -
 2  localhost  192.168.0.119  9422  1.6.19   23246974  3.6 TiB  4.5 TiB  79.72  0 / 0 B / 0 B / -
 3  localhost  192.168.0.120  9422  1.6.19   23360333  3.6 TiB  4.5 TiB  79.72  0 / 0 B / 0 B / -
 4  localhost  192.168.0.121  9422  1.6.19   23192013  3.6 TiB  4.5 TiB  79.69  0 / 0 B / 0 B / -
 5  localhost  192.168.0.122  9422  1.6.19   23483418  3.6 TiB  4.5 TiB  79.70  0 / 0 B / 0 B / -
 6  localhost  192.168.0.123  9422  1.6.19   23308366  3.6 TiB  4.5 TiB  79.70  0 / 0 B / 0 B / -
 7  localhost  192.168.0.124  9422  1.6.19   23361992  3.6 TiB  4.5 TiB  79.69  0 / 0 B / 0 B / -
 8  localhost  192.168.0.125  9422  1.6.19   23300478  3.6 TiB  4.5 TiB  79.70  0 / 0 B / 0 B / -
 9  localhost  192.168.0.127  9422  1.6.19   23284897  3.5 TiB  4.5 TiB  78.72  0 / 0 B / 0 B / -

[root@localhost mfs]# free -m
             total       used       free     shared    buffers     cached
Mem:         48295      46127       2168          0         38       8204
-/+ buffers/cache:      37884      10411
Swap:            0          0          0

The CPU usage is 95%, peaking at 150%.

-----Original Message-----
From: Shen Guowen [mailto:sh...@ui...]
Sent: 2010-08-09 10:42
To: TianYuchuan(田玉川)
Cc: moo...@li...
Subject: Re: [Moosefs-users] mfs-master[10546]: CS(192.168.0.125) packet too long (115289537/50000000)

Don't worry! This is because some of your chunk servers are currently unreachable; the master server notices it and modifies the metadata of the files on those chunk servers, setting "allvalidcopies" to 0 in "struct chunk". When the master rescans the files (fs_test_files() in filesystem.c), it finds that the valid copy count is 0 and prints information to the syslog, as listed below.

However, the printing process is quite time-consuming, especially when the number of files is large. During this period the master ignores the chunk servers' connections (because it is inside a big loop testing files, and it is single-threaded to do this - maybe this is a pitfall). So although your chunk server is working correctly, it is useless for now (you can see the reconnection attempts in the chunk server's syslog). You can let the master finish printing; it will then reconnect with the chunk servers, notice the files are there, set "allvalidcopies" back to a correct value, and work normally. Or you can recompile the program with lines 5512 and 5482 in filesystem.c (mfs-1.6.15) commented out. That ignores the print messages and, of course, reduces the fs test time.

Below is from Michal:
-----------------------------------------------------------------------
We give you here some quick patches you can apply to the master server to improve its performance for that amount of files.

In matocsserv.c in mfsmaster you need to change this line:
    #define MaxPacketSize 50000000
into this:
    #define MaxPacketSize 500000000

We also suggest a change in filesystem.c in mfsmaster, in the "fs_test_files" function. Change this line:
    if ((uint32_t)(main_time())<=starttime+150) {
into:
    if ((uint32_t)(main_time())<=starttime+900) {
And also change this line:
    for (k=0 ; k<(NODEHASHSIZE/3600) && i<NODEHASHSIZE ; k++,i++) {
into this:
    for (k=0 ; k<(NODEHASHSIZE/14400) && i<NODEHASHSIZE ; k++,i++) {

You need to recompile the master server and start it again. The above changes should make the master server work more stably with a large number of files.

Another suggestion would be to create two MooseFS instances (e.g. 2 x 200 million files). One master server could also be the metalogger for the other system and vice versa.

Kind regards
Michał
-----------------------------------------------------------------------

Guowen Shen

On Sun, 2010-08-08 at 22:51 +0800, TianYuchuan(田玉川) wrote:
> hello, everyone!
> I have a big question, please help me, thank you very much.
> We intend to use moosefs in our production environment as the storage for our online photo service.
> We'll store about 200 million photo files.
> I've built one master server (48G mem), one metalogger server, and eight chunk servers (8*1T SATA). When I copy photo files to the moosefs system, at the start everything is good. But after I had copied 57 million files, the master machine's CPU was at 100%.
> I stopped the master using "/usr/local/mfs/sbin/mfsmaster -s", then started it again. But there was a big problem: the master had not read my files. These files are important to me; I am very anxious, please help me recover them, thanks.
>
> I got many error syslog entries from the master server:
>
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 41991323: 2668/2526212449954462668/176s.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 00000000043CD358 (inode: 50379931 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 50379931: 2926/4294909215566102926/163b.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 00000000002966C3 (inode: 48284 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 48284: bookdata/178/8533354296639220178/180b.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 0000000000594726 (inode: 4242588 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 4242588: bookdata/6631/4300989258725036631/85s.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 0000000000993541 (inode: 8436892 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 8436892: bookdata/7534/3147352338521267534/122b.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 0000000000D906E6 (inode: 12631196 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 12631196: bookdata/8691/11879047433161548691/164s.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 000000000118DC1E (inode: 16825500 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 16825500: bookdata/1232/17850056326363351232/166b.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 0000000001681BC7 (inode: 21019804 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 21019804: bookdata/26/12779298489336140026/246s.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 0000000001A804E1 (inode: 25214108 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 25214108: bookdata/3886/8729781571075193886/30s.jpg
> Aug 6 00:57:01 localhost mfsmaster[10546]: currently unavailable chunk 0000000001E7E826 (inode: 29408412 ; index: 0)
> Aug 6 00:57:01 localhost mfsmaster[10546]: * currently unavailable file 29408412: bookdata/4757/142868991575144757/316b.jpg
>
> Aug 7 23:56:36 localhost mfsmaster[10546]: CS(192.168.0.124) packet too long (115289537/50000000)
> Aug 7 23:56:36 localhost mfsmaster[10546]: chunkserver disconnected - ip: 192.168.0.124, port: 0, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
> Aug 8 00:08:14 localhost mfsmaster[10546]: CS(192.168.0.127) packet too long (104113889/50000000)
> Aug 8 00:08:14 localhost mfsmaster[10546]: chunkserver disconnected - ip: 192.168.0.127, port: 0, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
> Aug 8 00:21:03 localhost mfsmaster[10546]: CS(192.168.0.120) packet too long (117046565/50000000)
> Aug 8 00:21:03 localhost mfsmaster[10546]: chunkserver disconnected - ip: 192.168.0.120, port: 0, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
>
> When I visited the mfscgi, the error was "Can't connect to MFS master (IP:127.0.0.1 ; PORT:9421)".
>
> Thanks all!
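A sketch of applying the quoted constant change before recompiling, assuming an unpacked source tree (the sed pattern is illustrative; editing matocsserv.c by hand works just as well):

    # cd mfs-1.6.x
    # sed -i 's/#define MaxPacketSize 50000000$/#define MaxPacketSize 500000000/' mfsmaster/matocsserv.c
    # ./configure && make && make install    # then restart the master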
From: Michal B. <mic...@ge...> - 2011-03-24 08:37:56
Hi Robert!

Do you use the 'mfsmakesnapshot' operation (i.e. do you have 'SNAPSHOT' entries in your changelogs)? If yes, you may have encountered an error I wrote about several days ago. If there is nothing secret in them, you may send metadata_ml.mfs.back together with changelog_ml.0.mfs and changelog_ml.1.mfs to my email address so that we can have a closer look at them.

Kind regards
Michał Borychowski
MooseFS Support Manager
Gemius S.A.

From: Robert Dye [mailto:ro...@in...]
Sent: Saturday, March 19, 2011 12:20 AM
To: moo...@li...
Subject: [Moosefs-users] FreeBSD mfsmetarestore - Operation Not Permitted

# mfsmetarestore -x -a /var/mfs/metadata_ml.mfs.back -d /var/mfs/
file 'metadata.mfs.back' not found - will try 'metadata_ml.mfs.back' instead
loading objects (files,directories,etc.) ... ok
loading names ... ok
loading deletion timestamps ... ok
checking filesystem consistency ... ok
loading chunks data ... ok
connecting files and chunks ... ok
found changelog file 1: /var/mfs/changelog_ml.0.mfs
found changelog file 2: /var/mfs/changelog_ml.1.mfs
change: 1300482060|FREEINODES():2036
154746233: error: 1 (Operation not permitted)

I am the root user - is this a bug?
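A quick way to answer Michal's question about SNAPSHOT entries, using the paths from Robert's output:

    # grep -l 'SNAPSHOT' /var/mfs/changelog_ml.*.mfs   # which changelogs contain snapshot operations
    # grep -c 'SNAPSHOT' /var/mfs/changelog_ml.0.mfs   # how many entries in one file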
From: Michal B. <mic...@ge...> - 2011-03-24 08:34:34
Hi!

You have almost all RAM consumed. As you have 100 million files in the system, we suggest putting some extra RAM in the master server. It would also be advisable to put an SSD disk in the master server so that the hourly metadata dump takes less time.

Kind regards
Michał Borychowski
MooseFS Support Manager
Gemius S.A.

-----Original Message-----
From: TianYuchuan(田玉川) [mailto:ti...@fo...]
Sent: Thursday, March 17, 2011 10:03 AM
To: Shen Guowen
Cc: moo...@li...
Subject: [Moosefs-users] To access data was very slowly, nearly 2 minutes. Oh my god!

Hello

My MooseFS system is accessed very slowly; I have no idea why, please help me! Thanks!!!

Files number: 104964618; chunks number: 104963962. Master load is not high, but on the hour, every hour, data cannot be accessed for several minutes. In general, concurrency is small and data access is delayed by a few seconds.

My MooseFS system has nine chunkservers. [...]

[root@localhost mfs]# free -m
             total       used       free     shared    buffers     cached
Mem:         48295      46127       2168          0         38       8204
-/+ buffers/cache:      37884      10411
Swap:            0          0          0

The CPU usage is 95%, peaking at 150%.

[...]
From: Thomas S H. <tha...@gm...> - 2011-03-23 16:07:11
Yep, that worked! Thanks!

2011/3/23 Michal Borychowski <mic...@ge...>
> So if you know something happened on the hardware side, just delete the broken chunks.
> [...]
From: Michal B. <mic...@ge...> - 2011-03-23 16:04:16
So if you know something happened on the hardware side, just delete the broken chunks.

Regards
Michal

From: Thomas S Hatch [mailto:tha...@gm...]
Sent: Wednesday, March 23, 2011 3:48 PM
To: Michal Borychowski
Cc: moosefs-users
Subject: Re: [Moosefs-users] Failing Chunkserver

Thanks Michal!

We were having some hardware issues on the node, and I suspect that this is a residual problem. I will give your suggestion a try!

> [...]
From: Thomas S H. <tha...@gm...> - 2011-03-23 14:48:32
Thanks Michal!

We were having some hardware issues on the node, and I suspect that this is a residual problem. I will give your suggestion a try!

2011/3/23 Michal Borychowski <mic...@ge...>
> Hi Thomas!
>
> You have bad chunk headers (but we don't know why). You can just erase the wrong chunks, or change (just for some time) these constants:
> [...]
From: Michal B. <mic...@ge...> - 2011-03-23 12:13:13
Hi Thomas!

You have bad chunk headers (but we don't know why). You can just erase the wrong chunks, or change (just for some time) these constants:

    #define LASTERRSIZE 3
    #define LASTERRTIME 60

to:

    #define LASTERRSIZE 10
    #define LASTERRTIME 1

in the mfschunkserver/hddspacemgr.c file, recompile the chunkserver and run it again. The chunkserver will then stop "unlinking" the disks and will remove the wrong chunks by itself.

Regards
-Michal

From: Thomas S Hatch [mailto:tha...@gm...]
Sent: Wednesday, March 23, 2011 6:10 AM
To: moosefs-users
Subject: [Moosefs-users] Failing Chunkserver

> [...]
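If the hardware cause is confirmed, deleting the chunks named in the log is straightforward. A sketch using the three files from Thomas's log (stop the chunkserver first; replicas elsewhere will be used to re-create the lost copies):

    # mfschunkserver stop
    # rm /mnt/moose1/11/chunk_00000000019A9511_00000001.mfs \
         /mnt/moose1/4A/chunk_0000000001EFFE4A_00000001.mfs \
         /mnt/moose1/83/chunk_0000000001776783_00000001.mfs
    # mfschunkserver start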
From: Thomas S H. <tha...@gm...> - 2011-03-23 05:10:35
|
I am having some trouble with a chunkserver: it errors out, then the chunkserver stops working and reports 0% on the mfs cgi page.

Here is the error in the logs:

2011-03-22T16:18:38+00:00 node10 mfschunkserver[25905]: set gid to 70003
2011-03-22T16:18:38+00:00 node10 mfschunkserver[25905]: set uid to 70003
2011-03-22T16:18:38+00:00 node10 mfschunkserver[25783]: closing 172.11.1.110:9422
2011-03-22T16:18:47+00:00 node10 mfschunkserver[25905]: main server module: listen on 172.11.1.110:9422
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: connecting ...
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: stats file has been loaded
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: open files limit: 10000
2011-03-22T16:18:48+00:00 node10 mfschunkserver[25905]: connected to Master
2011-03-22T16:18:52+00:00 node10 mfschunkserver[25905]: testing chunk: /mnt/moose1/11/chunk_00000000019A9511_00000001.mfs
2011-03-22T16:18:52+00:00 node10 mfschunkserver[25905]: chunk_readcrc: file:/mnt/moose1/11/chunk_00000000019A9511_00000001.mfs - wrong id/version in header (00000000019A9511_00000000)
2011-03-22T16:18:52+00:00 node10 mfschunkserver[25905]: hdd_io_begin: file:/mnt/moose1/11/chunk_00000000019A9511_00000001.mfs - read error: Unknown error
2011-03-22T16:19:02+00:00 node10 mfschunkserver[25905]: testing chunk: /mnt/moose1/4A/chunk_0000000001EFFE4A_00000001.mfs
2011-03-22T16:19:03+00:00 node10 mfschunkserver[25905]: chunk_readcrc: file:/mnt/moose1/4A/chunk_0000000001EFFE4A_00000001.mfs - wrong id/version in header (0000000001EFFE4A_00000000)
2011-03-22T16:19:03+00:00 node10 mfschunkserver[25905]: hdd_io_begin: file:/mnt/moose1/4A/chunk_0000000001EFFE4A_00000001.mfs - read error: Unknown error
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: testing chunk: /mnt/moose1/83/chunk_0000000001776783_00000001.mfs
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: chunk_readcrc: file:/mnt/moose1/83/chunk_0000000001776783_00000001.mfs - wrong id/version in header (0000000001776783_00000000)
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: hdd_io_begin: file:/mnt/moose1/83/chunk_0000000001776783_00000001.mfs - read error: Unknown error
2011-03-22T16:19:13+00:00 node10 mfschunkserver[25905]: 3 errors occurred in 60 seconds on folder: /mnt/moose1/
2011-03-22T16:19:15+00:00 node10 mfschunkserver[25905]: replicator: hdd_create status: 21

What do these errors mean? And what is the best way to recover?

If worse comes to worst we of course have replicated chunks, so we can format the chunkserver and start it back up, but I am very curious how best to approach the situation.

-Thomas S Hatch
|
From: Boyko Y. <b.y...@ex...> - 2011-03-21 22:41:41
|
Hi,

Not sure if it is related at all, but there is a known bug with mfsmetarestore; take a look here:

http://sourceforge.net/mailarchive/forum.php?thread_name=045c01cbe3e6%249a511760%24cef34620%24%40borychowski%40gemius.pl&forum_name=moosefs-users

You may want to try the suggested workaround.

Regards,
Boyko
|
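For what it's worth, the commonly suggested workaround for a changelog truncated by a power failure amounts to dropping the incomplete final line of the newest changelog (the line mfsmetarestore chokes on, as in the "3605: '|' expected" error above) and retrying. Here is a hedged Python sketch of that single step; the script name and changelog path are assumptions, and a .bak copy is kept first since discarding changelog data is inherently risky.

#!/usr/bin/env python3
# trim_changelog.py -- sketch only, not an official MooseFS tool.
# Removes the last (possibly truncated) line of a changelog so that
# mfsmetarestore can be retried. Always verify the .bak copy exists.
import shutil
import sys

def drop_last_line(path):
    shutil.copy2(path, path + ".bak")   # keep a backup before touching anything
    with open(path, "rb") as f:
        data = f.read()
    # Find the newline that ends the second-to-last record and cut there,
    # discarding the final (broken) line whether or not it ends in '\n'.
    cut = data.rstrip(b"\n").rfind(b"\n")
    with open(path, "wb") as f:
        f.write(data[:cut + 1] if cut != -1 else b"")

if __name__ == "__main__":
    # Usage: python3 trim_changelog.py /var/lib/mfs/changelog.0.mfs
    drop_last_line(sys.argv[1])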
From: Pedro N. <pe...@st...> - 2011-03-21 22:28:00
|
Hi there,

We ran another test today after losing about 3TB+ of data when we could not restore the metadata.mfs file. We were in the process of copying files over when the power was cut off. This led to an error when running the mfsmetarestore -a command, as follows:

3605: '|' expected

We really feel that MooseFS is the best solution we could find, but after not being able to recover from something as real as a power failure, I am genuinely worried.

Please advise as to how to fix this problem.

Sincerely,

Pedro Naranjo / STL Technologies / Solutions Architect / 888.556.0774
|
From: Thomas S H. <tha...@gm...> - 2011-03-21 15:27:53
|
Hi Pedro!

The problem I am running into is time and resources. I am in the middle of a number of other projects, and my test environment is currently in a state of "flux". I agree that this would be a great thing for MooseFS to come packaged with, but it should be a complete package, with ucarp failover scripts wrapped up into a simple cluster management daemon. I hope to have more time for this in a few weeks, but it keeps getting pushed back; I might not be able to get to it for over a month. I know that many people are interested in my MooseFS failover work, and it is a high priority. Any contributions would be appreciated; all the code is in place, I mostly just need to package it up.

P.S. Since this is a list of system admins, some of you might be interested in the project that has been requiring most of my time lately. It is called salt:

https://github.com/thatch45/salt

Salt is a remote execution platform. I am using it to replace func; it allows for very fast communication with servers and beats the heck out of using ssh for loops. I think it would also be very useful for people deploying MooseFS, since you often want to get information from, and execute commands on, many of your systems at once. I also have a blog post about it here:

http://red45.wordpress.com/2011/03/19/salt-0-6-0-released/

On Mon, Mar 21, 2011 at 9:08 AM, Pedro Naranjo <pe...@st...> wrote:
> Dear Thomas,
>
> Your contribution is very valuable. May I suggest that the MooseFS developers include it in the general download of the system? I have also become very concerned about losing data. We spent 3 days moving 3TB+ of data only to lose it all after simulating a power failure. Granted, we had not deployed the metaloggers yet, but nevertheless whatever we can use to keep the system as stable as possible is very important.
>
> Sincerely,
>
> Pedro Naranjo / STL Technologies / Solutions Architect / 888.556.0774
|
From: Thomas S H. <tha...@gm...> - 2011-03-21 14:51:31
|
I have been hammering away at mfs failover for quite some time and I am familiar with your problem.

What happens is that the mfsmetaloggers continue to stream updates from the mfsmaster even after a failover, but the mfsmetarestore command executed on the metadata on the new mfsmaster ends up creating a different "last change point" than what the other metaloggers see. This means that the mfsmetaloggers that did not become the new master have a bad set of metadata after your initial failover.

Since I wanted a completely clean and automated failover in my MooseFS deployment, I created a wrapper daemon that manages the mfsmetalogger. This daemon should be run on all metaloggers and the mfsmaster; it detects when a failover occurs and ensures that the mfsmetalogger is running on the right nodes and that the metadata being used is the correct metadata. If you want to use my mfsmetalogger manager, it is available here:

https://github.com/thatch45/mfs-failover/blob/master/daemon/metaman.py

It is written in python3 (my deployments default to python3), but let me know if you are interested in running it on python2 and I will make a python2 version.

I also have some ucarp scripts in that github project that can be used for managing failover automatically in conjunction with metaman, but I have not had the time and resources to finish packaging them up.

Let me know if you have any questions!

-Thomas S Hatch

On Mon, Mar 21, 2011 at 5:10 AM, Boyko Yordanov <b.y...@ex...> wrote:
> Hi list,
>
> I'm wondering how you guys are handling mfs master failover?
>
> In my tests mfsmetalogger seems quite unreliable - 2 days of testing showed a few cases where mfsmetarestore was unable to restore the metadata.mfs datafile, giving different errors like data mismatch, version mismatch, hole in change files (add more files), etc.
>
> Running 3 different metadata backup loggers; master and chunk servers are all running mfs-1.6.20-2 on centos 5.5 x86_64, filesystem type is ext3.
>
> I'm aware that some of you are running huge clusters with terabytes of data - I'm wondering how you trust your mfsmaster. Am I the only one concerned with eventual data loss on mfsmaster failover, when mfsmetarestore does not properly restore the metadata.mfs file from changelogs?
>
> Boyko
|
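To make the failure mode concrete, here is a deliberately simplified Python sketch of the recovery step such a wrapper daemon has to perform after a failover: archive the metalogger's now-divergent state and restart it so it resyncs from the new master. This is not metaman.py; the data directory, file-name prefixes, and the mfsmetalogger start/stop invocation are all assumptions against MooseFS 1.6.x.

#!/usr/bin/env python3
# Illustration of the post-failover resync step -- NOT metaman.py.
# After a failover, a metalogger's changelog_ml files predate the new
# master's restored "last change point", so the safe move is to set
# them aside and let a restarted metalogger resync from scratch.
import os
import shutil
import subprocess
import time

DATA_DIR = "/var/lib/mfs"   # assumed metalogger data directory

def resync_metalogger(data_dir=DATA_DIR):
    subprocess.call(["mfsmetalogger", "stop"])
    # Quarantine stale state instead of deleting it outright.
    backup = os.path.join(data_dir, "pre-failover-%d" % int(time.time()))
    os.makedirs(backup)
    for name in os.listdir(data_dir):
        if name.startswith(("changelog_ml", "metadata_ml", "sessions_ml")):
            shutil.move(os.path.join(data_dir, name), backup)
    subprocess.call(["mfsmetalogger", "start"])  # resyncs from the new master

if __name__ == "__main__":
    # In a real deployment this would be triggered by the ucarp
    # down/up scripts on each metalogger node once a failover is detected.
    resync_metalogger()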
From: Jofly <xh...@16...> - 2011-03-21 13:35:23
|
Dear Sir,

Thank you for sparing some time to read my letter. I am a college student from China, and my English is not good. I am writing to ask you some questions about the background of MooseFS.

First, did MooseFS originate in the United States, and when was it first made public?

Secondly, which company or person developed the software?

Thirdly, are there any famous IT companies using MooseFS, and can you tell me about some successful cases?

Thanks for your time again, and I am looking forward to your reply.
|