From: Robert S. <rsa...@ne...> - 2011-08-02 11:31:20
Hi Michal,

Increasing the timeout seemed to have resolved the issue for me. I still
get some times around the hour where mfsmaster is unresponsive but it
does recover.

There is no swapping on the master. The master has 64 GB of RAM and the
mfsmaster process is using 33.5 GB of that.

Robert

On 8/2/11 2:46 AM, Michal Borychowski wrote:
> Hi Robert!
>
> If really increasing the timeout helped in your case, probably chunkserver
> registration process was slowing down the master. Again - with this number
> of files/chunks it should not have place. Please also check your RAM and if
> master is not swapping constantly.
>
>
> Kind regards
> -Michal
>
>
> -----Original Message-----
> From: Robert Sandilands [mailto:rsa...@ne...]
> Sent: Monday, July 18, 2011 8:37 PM
> To: Mike
> Cc: moo...@li...
> Subject: Re: [Moosefs-users] mfsmaster hanging at 100% cpu?
>
> I have had it running without a crash for more than 12 hours which is a
> new record here.
>
> I changed one setting:
>
> MASTER_TIMEOUT = 120
>
> in mfschunkserver.cfg.
>
> My guess at the moment is that on the hour the Master blocks connections
> and dumps the metadata to disk and to the mfsmetalogger servers. Due to
> existing load and the number of files/objects/chunks in our system this
> takes longer than the chunk server timeout. This then leads to a process
> where the chunkserver goes into a disconnect, reconnect loop until the
> master gets confused.
>
> What also seems to contribute is that once mfsmaster starts blocking
> connections mfsmount and mfschunkserver may start using more CPU which
> tends to aggravate the situation.
>
> It may help the situation to move mfsmaster to an unloaded and dedicated
> machine, but I can't help but think that this behavior limits
> scalability. Given enough files/folders/chunks any timeout will be
> exceeded even if the master machine is completely unloaded.
>
> Robert
>
> On 7/18/11 10:55 AM, Mike wrote:
>>> Every time it gets into this state one or two chunks gets damaged
>>> and I have to manually repair them. Sometimes losing a file.
>>> At this stage I can't even get to repairing the chunks as mfsmaster
>>> does not stay up for long enough to show me which files to repair.
>>> What is also strange is how predictable it is. It always happens on
>>> the hour. Not 2 minutes past the hour, but precisely on the hour. It is
>>> as if there is some job/process/thread that does something every
>>> hour that causes it to go into this state.
>>
>> I can reproduce this on our install fairly easily (well, I could last
>> time I looked!) Given that I'm running a completely stock config with
>> 2 chunkservers, it shouldn't be TOO hard to figure out what's going
>> on. I can recompile/reinstall/change values as needed, someone just
>> needs to point me in the right direction.
>>
>>
>
> _______________________________________________
> moosefs-users mailing list
> moo...@li...
> https://lists.sourceforge.net/lists/listinfo/moosefs-users
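
Robert's guess in the quoted 18 July message (the hourly metadata dump blocks the master for longer than the chunkserver's MASTER_TIMEOUT) can be written down as a toy model. This is only a sketch and not MooseFS code: the function names, the assumption that blocking time grows linearly with the number of objects, and the per-object cost of 1 microsecond are all hypothetical, chosen to illustrate the scalability argument rather than measured on any system. The 60 second value in the demo is likewise just an illustrative lower timeout.

```python
# Toy model of the hypothesis in the thread; NOT MooseFS code.
# Assumption (hypothetical): the time the master blocks during its hourly
# metadata dump grows linearly with the number of files/objects/chunks.

MICROSECONDS_PER_SECOND = 1_000_000


def block_seconds(num_objects: int, microseconds_per_object: int) -> float:
    """Estimated time the master is blocked, in seconds."""
    return num_objects * microseconds_per_object / MICROSECONDS_PER_SECOND


def chunkserver_disconnects(block_s: float, master_timeout_s: int) -> bool:
    """True if the master is blocked for longer than the chunkserver waits."""
    return block_s > master_timeout_s


def max_objects_for_timeout(master_timeout_s: int,
                            microseconds_per_object: int) -> int:
    """Largest object count whose dump still fits within the timeout."""
    return master_timeout_s * MICROSECONDS_PER_SECOND // microseconds_per_object


if __name__ == "__main__":
    cost_us = 1  # hypothetical per-object cost, not a measured value
    blocked = block_seconds(100_000_000, cost_us)
    for timeout in (60, 120):  # 120 is the value from the thread
        print(timeout, chunkserver_disconnects(blocked, timeout),
              max_objects_for_timeout(timeout, cost_us))
```

Under the linear assumption, raising the timeout only moves the object count at which the disconnect/reconnect loop starts. It does not remove the limit, which is the scalability concern raised in the quoted message.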