Re: [Moosefs-users] mfsmaster hanging at 100% cpu?

This is starting to annoy me to no end.

I now have this happening every few hours and I am very close to 
abandoning MooseFS. The only reasons I haven't are:

1. I have spent a month moving my data to MooseFS and would have to redo 
this.
2. I don't really see any alternatives that fill me with much confidence.

Every time it gets into this state one or two chunks get damaged and I 
have to manually repair them, sometimes losing a file. At this stage I 
can't even get to repairing the chunks as mfsmaster does not stay up for 
long enough to show me which files to repair. What is also strange is 
how predictable it is. It always happens on the hour. Not 2 minutes past 
the hour, but precisely on the hour. It is as if there is some 
job/process/thread that does something every hour that causes it to go 
into this state.
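
One way to check the on-the-hour pattern is to tally the minute-of-hour of every "chunkserver disconnected" message in syslog. The following is a minimal sketch, assuming a traditional syslog timestamp prefix ("Jul 16 14:00:01 host mfsmaster[pid]: ..."); the function name and the log format are assumptions, not something taken from MooseFS itself.

```python
import re
from collections import Counter

# Assumed line shape: "Mon DD HH:MM:SS host mfsmaster[pid]: chunkserver disconnected ..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+"
    r"mfsmaster\[\d+\]: chunkserver disconnected"
)

def minute_histogram(lines):
    """Count disconnect messages per minute-of-hour (0-59)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[int(m.group(2))] += 1
    return counts
```

Feeding it an open syslog file (the path depends on the distribution) and finding nearly all counts at minute 0 would support the idea of an hourly job being the trigger.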

It always seems to be the same chunkserver that is disconnected and 
restarting the chunkserver has no effect. The chunkserver and mfsmaster 
are running on the same machine. The other chunkserver does not seem to 
ever drop out. I would have been able to add a 3rd chunkserver on Monday 
but I will probably not do that until I can get the existing setup stable.

On Monday I will try to move mfsmaster to a different machine and see if 
I can get it to stay up for longer than 8 hours. At this stage 6 hours 
is about the longest it stays up without going into this state. If this 
fails and I have no other feedback then I am back to square one and 
probably will have to abandon MooseFS.

I have eliminated everything else that could be causing problems. At 
this stage it can only be mfsmaster.

The following Swatch script is helping me keep my system online as much 
as is possible:

watchfor /mfsmaster mfsmaster.*: chunkserver disconnected - ip: xxx.xxx.xxx.xxx, port: 9422, usedspace: 0 \(0.00 GiB\), totalspace: 0 \(0.00 GiB\)/
     threshold track_by=xxx.xxx.xxx.xxx,type=both,count=6,seconds=1200
     mail=robert,subject="MFSMaster crashed yet again"
     exec /usr/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg restart

watchfor /mfsmaster mfsmaster.*: about 60 seconds passed and lockfile is still locked - giving up/
     mail=robert,subject="MFSMaster crashed yet again and restart timed out yet again"
     exec /usr/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg restart
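
For readers unfamiliar with the threshold line, the intent is that the restart fires once after 6 matching messages arrive within 1200 seconds, and then stays quiet until that window has passed. The sketch below models that behaviour in Python under my own reading of type=both; the class name is hypothetical and Swatch's real implementation may differ in detail (for example in how it slides the window).

```python
class BothThreshold:
    """Fire once when `count` events arrive within `seconds`,
    then stay quiet until the window expires (assumed semantics)."""

    def __init__(self, count=6, seconds=1200):
        self.count = count
        self.seconds = seconds
        self.window_start = None  # time of the first event in the current window
        self.seen = 0             # events seen in the current window

    def event(self, now):
        """Record an event at time `now` (in seconds).
        Return True if the action (mail + restart) should run."""
        if self.window_start is None or now - self.window_start >= self.seconds:
            # Start a fresh window on the first event or after expiry.
            self.window_start = now
            self.seen = 0
        self.seen += 1
        # Fire exactly once per window, on the count-th event.
        return self.seen == self.count
```

With the defaults above, the sixth disconnect message inside a 20-minute window triggers the restart and further messages in the same window are ignored.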

Robert

On 7/13/11 9:26 AM, Robert Sandilands wrote:
> Do you see the message "mfsmaster[pid]: chunkserver disconnected - ip: 
> xxx.xxx.xxx.xxx, port: 9422" around the time the CPU jumps to 100%?
>
> Robert
>
> On 7/12/11 10:13 AM, Mike wrote:
>> I have a fairly small MFS installation - 14T of storage across 2 
>> servers, a master node and a metalogger. I'm seeing the mfsmaster 
>> jump to 100% CPU and just sit there... rendering the filesystem dead. 
>> strace shows it's not doing any I/O.
>>
>> Any thoughts or ideas where to look next?