From: Mike <isp...@gm...> - 2011-07-12 14:13:50
|
I have a fairly small MFS installation: 14T of storage across 2 servers, a master node and a metalogger. I'm seeing mfsmaster jump to 100% CPU and just sit there, rendering the filesystem dead. strace shows it's not doing any I/O.

Any thoughts or ideas where to look next? |
From: Thomas S H. <tha...@gm...> - 2011-07-12 14:23:13
|
What is the version of MooseFS you are using, and what do your configs look like?

Also, what OS are you using? |
From: Mike <isp...@gm...> - 2011-07-13 12:43:34
|
Configs are all defaults with the hostname for the master set to the correct IP :-)

The master server is Slackware 13; the chunk servers are Debian, IIRC.

From the info page:

  version                1.6.20
  total space            15 TiB
  avail space            13 TiB
  trash space            0 B
  trash files            0
  reserved space         325 KiB
  reserved files         8
  all fs objects         1030374
  directories            13040
  files                  1017333
  chunks                 1005532
  all chunk copies       2012227
  regular chunk copies   2012227 |
From: Robert S. <rsa...@ne...> - 2011-07-13 13:26:17
|
Do you see the message "mfsmaster[pid]: chunkserver disconnected - ip: xxx.xxx.xxx.xxx, port: 9422" around the time the CPU jumps to 100%?

Robert |
From: Robert S. <rsa...@ne...> - 2011-07-18 18:37:54
|
I have had it running without a crash for more than 12 hours, which is a new record here.

I changed one setting:

  MASTER_TIMEOUT = 120

in mfschunkserver.cfg.

My guess at the moment is that on the hour the master blocks connections and dumps the metadata to disk and to the mfsmetalogger servers. Due to existing load and the number of files/objects/chunks in our system, this takes longer than the chunkserver timeout. This then leads to a process where the chunkserver goes into a disconnect/reconnect loop until the master gets confused.

What also seems to contribute is that once mfsmaster starts blocking connections, mfsmount and mfschunkserver may start using more CPU, which tends to aggravate the situation.

It may help the situation to move mfsmaster to an unloaded and dedicated machine, but I can't help but think that this behavior limits scalability. Given enough files/folders/chunks, any timeout will be exceeded even if the master machine is completely unloaded.

Robert |
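Editor's note: a minimal sketch for confirming which MASTER_TIMEOUT value a chunkserver config actually sets, since the change above only takes effect if the line is not commented out. It assumes the `NAME = VALUE` layout with `#` comments; the function name `read_mfs_option`, the sample text, and the "last assignment wins" rule are illustrative assumptions, not MooseFS behavior confirmed in this thread.

```python
def read_mfs_option(cfg_text, name):
    """Return the value assigned to `name` in config text, or None if unset.

    Assumption: options are written as 'NAME = VALUE' and '#' starts a comment.
    """
    value = None
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == name:
            value = val.strip()  # in this sketch the last assignment wins
    return value


# Hypothetical mfschunkserver.cfg contents, made up for illustration.
sample = """\
# MASTER_TIMEOUT = 999   (commented out, so ignored)
MASTER_TIMEOUT = 120
"""

timeout = read_mfs_option(sample, "MASTER_TIMEOUT")
print("MASTER_TIMEOUT:", timeout)
```

To check a real file, pass `open("/etc/mfs/mfschunkserver.cfg").read()` instead of `sample`; the path is a guess based on the /etc/mfs directory mentioned elsewhere in the thread.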
From: Michal B. <mic...@ge...> - 2011-08-02 06:46:47
|
Hi Robert!

If increasing the timeout really helped in your case, the chunkserver registration process was probably slowing down the master. Again, with this number of files/chunks that should not happen. Please also check your RAM usage and whether the master is swapping constantly.

Kind regards
-Michal |
From: Robert S. <rsa...@ne...> - 2011-08-02 11:31:20
|
Hi Michal,

Increasing the timeout seems to have resolved the issue for me. I still see times around the hour when mfsmaster is unresponsive, but it does recover.

There is no swapping on the master. The master has 64 GB of RAM and the mfsmaster process is using 33.5 GB of that.

Robert |
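Editor's note: the RAM and swap questions in this exchange can be answered on Linux from /proc/<pid>/status. The following is a rough sketch, not a MooseFS tool: `memory_fields` is a hypothetical helper, the sample text is invented (its VmRSS is roughly 33.5 GiB to mirror the figure quoted above), and the VmSwap field is only present on kernels that report it.

```python
def memory_fields(status_text, fields=("VmRSS", "VmSwap")):
    """Extract selected memory fields (in kB) from /proc/<pid>/status text."""
    result = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in fields:
            parts = rest.split()
            # Lines look like "VmRSS:    123456 kB"
            if parts and parts[0].isdigit():
                result[key] = int(parts[0])
    return result


# Invented sample, shaped like /proc/<pid>/status output.
sample = "Name:\tmfsmaster\nVmRSS:\t35127296 kB\nVmSwap:\t       0 kB\n"

usage = memory_fields(sample)
print(usage)
```

For a live process, read the text with `open(f"/proc/{pid}/status").read()`, where `pid` is the mfsmaster process ID. A non-zero VmSwap would suggest that part of the master's metadata has been swapped out.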
From: Robert S. <rsa...@ne...> - 2011-07-17 22:34:10
|
This is starting to annoy me to no end. I now have this happening every few hours and I am very close to abandoning MooseFS. The only reasons I don't are:

1. I have spent a month moving my data to MooseFS and would have to redo this.
2. I don't really see any alternatives which fill me with much confidence.

Every time it gets into this state one or two chunks get damaged and I have to manually repair them, sometimes losing a file.

At this stage I can't even get to repairing the chunks, as mfsmaster does not stay up long enough to show me which files to repair.

What is also strange is how predictable it is. It always happens on the hour. Not 2 minutes past the hour, but precisely on the hour. It is as if there is some job/process/thread that does something every hour that causes it to go into this state.

It always seems to be the same chunkserver that is disconnected, and restarting the chunkserver has no effect. That chunkserver and mfsmaster are running on the same machine. The other chunkserver does not seem to ever drop out.

I would have been able to add a 3rd chunkserver on Monday, but I will probably not do that until I can get the existing setup stable.

On Monday I will try to move mfsmaster to a different machine and see if I can get it to stay up for longer than 8 hours. At this stage 6 hours is about the longest it stays up without going into this state. If this fails and I have no other feedback, then I am back to square one and will probably have to abandon MooseFS.

I have eliminated everything else that could be causing problems. At this stage it can only be mfsmaster.

The following Swatch script is helping me keep my system online as much as possible:

  watchfor /mfsmaster mfsmaster.*: chunkserver disconnected - ip: xxx.xxx.xxx.xxx, port: 9422, usedspace: 0 \(0.00 GiB\), totalspace: 0 \(0.00 GiB\)/
      threshold track_by=xxx.xxx.xxx.xxx,type=both,count=6,seconds=1200
      mail=robert,subject="MFSMaster crashed yet again"
      exec /usr/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg restart

  watchfor /mfsmaster mfsmaster.*: about 60 seconds passed and lockfile is still locked - giving up/
      mail=robert,subject="MFSMaster crashed yet again and restart timed out yet again"
      exec /usr/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg restart

Robert |
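Editor's note: the "always on the hour" observation above can be checked against the logs by counting "chunkserver disconnected" messages per minute of the hour. This is a sketch under stated assumptions: it expects the classic syslog prefix ("Jul 17 14:00:03 host mfsmaster[pid]: ..."), the sample lines and host name are invented, and `disconnects_by_minute` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed line shape: "Jul 17 14:00:03 host mfsmaster[123]: chunkserver disconnected - ..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s.*mfsmaster.*chunkserver disconnected"
)


def disconnects_by_minute(lines):
    """Count disconnect messages keyed by the minute of the hour (0-59)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


# Invented sample lines for illustration only.
sample_lines = [
    "Jul 17 14:00:03 fs1 mfsmaster[2211]: chunkserver disconnected - ip: 10.0.0.2, port: 9422",
    "Jul 17 15:00:01 fs1 mfsmaster[2211]: chunkserver disconnected - ip: 10.0.0.2, port: 9422",
    "Jul 17 15:17:40 fs1 sshd[99]: unrelated line",
]

counts = disconnects_by_minute(sample_lines)
for minute, n in sorted(counts.items()):
    print(f"minute {minute:02d}: {n} disconnect(s)")
```

If nearly all counts land in minute 00, that supports the theory that an hourly job in the master is involved; a flat spread would point elsewhere.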
From: Mike <isp...@gm...> - 2011-07-18 14:55:55
|
> Every time it gets into this state one or two chunks gets damaged and I have to manually repair them. Sometimes losing a file.
> At this stage I can't even get to repairing the chunks as mfsmaster does not stay up for long enough to show me which files to repair.
> What is also strange is how predictable it is. It always happens on the hour. Not 2 minutes past the hour, but precisely on the hour. It is as if there is some job/process/thread that does something every hour that causes it to go into this state.

I can reproduce this on our install fairly easily (well, I could last time I looked!). Given that I'm running a completely stock config with 2 chunkservers, it shouldn't be TOO hard to figure out what's going on. I can recompile/reinstall/change values as needed; someone just needs to point me in the right direction. |
From: Michal B. <mic...@ge...> - 2011-08-02 06:42:42
|
Hi Mike!

The situation you describe is quite strange. If possible, we would like to connect to your master with gdb to check this. Or maybe your master is constantly swapping? How much RAM do you have, and what is the RAM usage?

Kind regards
Michał Borychowski
MooseFS Support Manager
Gemius S.A.
ul. Wołoska 7, 02-672 Warszawa
Budynek MARS, klatka D
Tel.: +4822 874-41-00
Fax : +4822 874-41-01 |