From: Robert S. <rsa...@ne...> - 2011-07-17 22:34:10
This is starting to annoy me to no end. I now have this happening every few hours and I am very close to abandoning MooseFS. The only reasons I haven't are:

1. I have spent a month moving my data to MooseFS and would have to redo that work.
2. I don't really see any alternatives that fill me with much confidence.

Every time it gets into this state, one or two chunks get damaged and I have to repair them manually, sometimes losing a file. At this stage I can't even get to repairing the chunks, as mfsmaster does not stay up long enough to show me which files to repair.

What is also strange is how predictable it is. It always happens on the hour. Not 2 minutes past the hour, but precisely on the hour. It is as if some job/process/thread does something every hour that causes it to go into this state.

It always seems to be the same chunkserver that is disconnected, and restarting the chunkserver has no effect. The chunkserver and mfsmaster are running on the same machine. The other chunkserver never seems to drop out. I would have been able to add a 3rd chunkserver on Monday, but I will probably not do that until I can get the existing setup stable.

On Monday I will try to move mfsmaster to a different machine and see if I can get it to stay up for longer than 8 hours. At this stage 6 hours is about the longest it stays up without going into this state. If this fails and I get no other feedback, then I am back to square one and will probably have to abandon MooseFS. I have eliminated everything else that could be causing problems. At this stage it can only be mfsmaster.
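One way to check the "precisely on the hour" pattern against the logs is to measure how far each "chunkserver disconnected" message falls from the nearest hour boundary. The following is only a rough sketch: it assumes the classic syslog prefix ("Mon DD HH:MM:SS host mfsmaster[pid]: ..."), and the host name, pid, IP address and sample lines are made up for illustration.

```python
import re

# Assumption: classic syslog prefix, e.g.
#   "Jul 17 22:00:01 host mfsmaster[1234]: chunkserver disconnected - ip: ..."
# Adjust the pattern if the local syslog format differs.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d{2}:(\d{2}):(\d{2})\s+\S+\s+mfsmaster\[\d+\]: "
    r"chunkserver disconnected"
)

def seconds_past_hour(lines):
    """Yield the offset (in seconds past the hour) of each disconnect message."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            minute, second = int(m.group(1)), int(m.group(2))
            yield minute * 60 + second

def summarize(lines, tolerance=60):
    """Return (total disconnects, disconnects within `tolerance` s of an hour boundary)."""
    offsets = list(seconds_past_hour(lines))
    near = sum(1 for s in offsets if min(s, 3600 - s) <= tolerance)
    return len(offsets), near

# Hypothetical sample lines, not taken from a real log.
SAMPLE = [
    "Jul 17 14:00:02 mfs1 mfsmaster[4321]: chunkserver disconnected - ip: 192.0.2.10, port: 9422",
    "Jul 17 20:00:01 mfs1 mfsmaster[4321]: chunkserver disconnected - ip: 192.0.2.10, port: 9422",
    "Jul 17 20:17:45 mfs1 mfsmaster[4321]: some unrelated message",
    "Jul 17 21:59:58 mfs1 mfsmaster[4321]: chunkserver disconnected - ip: 192.0.2.10, port: 9422",
    "Jul 17 22:31:10 mfs1 mfsmaster[4321]: chunkserver disconnected - ip: 192.0.2.10, port: 9422",
]

if __name__ == "__main__":
    total, near = summarize(SAMPLE)
    print(f"{near} of {total} disconnects fall within 60 s of an hour boundary")
```

If nearly all disconnects cluster at the hour boundary, that would support the idea of an hourly job being the trigger; feeding the function the lines of the real syslog file instead of SAMPLE is the obvious next step.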
The following Swatch script is helping me keep my system online as much as possible:

watchfor /mfsmaster mfsmaster.*: chunkserver disconnected - ip: xxx.xxx.xxx.xxx, port: 9422, usedspace: 0 \(0.00 GiB\), totalspace: 0 \(0.00 GiB\)/
    threshold track_by=xxx.xxx.xxx.xxx,type=both,count=6,seconds=1200
    mail=robert,subject="MFSMaster crashed yet again"
    exec /usr/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg restart

watchfor /mfsmaster mfsmaster.*: about 60 seconds passed and lockfile is still locked - giving up/
    mail=robert,subject="MFSMaster crashed yet again and restart timed out yet again"
    exec /usr/sbin/mfsmaster -c /etc/mfs/mfsmaster.cfg restart

Robert

On 7/13/11 9:26 AM, Robert Sandilands wrote:
> Do you see the message "mfsmaster[pid]: chunkserver disconnected - ip:
> xxx.xxx.xxx.xxx, port: 9422" around the time the CPU jumps to 100%?
>
> Robert
>
> On 7/12/11 10:13 AM, Mike wrote:
>> I have a fairly small MFS installation - 14T of storage across 2
>> servers, a master node and a metalogger. I'm seeing the mfsmaster
>> jump to 100% cpu and just sit there... rendering the filesystem dead.
>> strace shows its not doing any IO.
>>
>> Any thoughts or ideas where to look next?
>>
>> _______________________________________________
>> moosefs-users mailing list
>> moo...@li...
>> https://lists.sourceforge.net/lists/listinfo/moosefs-users
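
A note on the first Swatch rule earlier in this message: as I read the threshold line (type=both, count=6, seconds=1200), the mail and restart actions should run at most once per 20-minute window, and only once 6 matching lines have been seen in that window. The sketch below models that reading with a simple fixed window. It is an approximation for illustration only, not Swatch's actual implementation, and the class and method names are hypothetical.

```python
class WindowThreshold:
    """Fire at most once per window, and only after `count` events in it.

    Assumption: a fixed window that starts at the first event seen after the
    previous window expired. Swatch's real bookkeeping may differ.
    """

    def __init__(self, count, seconds):
        self.count = count
        self.seconds = seconds
        self.window_start = None
        self.seen = 0
        self.fired = False

    def observe(self, timestamp):
        """Record one matching log line; return True if the action should run."""
        expired = (
            self.window_start is None
            or timestamp - self.window_start >= self.seconds
        )
        if expired:
            # Start a fresh window beginning with this event.
            self.window_start = timestamp
            self.seen = 0
            self.fired = False
        self.seen += 1
        if self.seen >= self.count and not self.fired:
            self.fired = True
            return True
        return False


if __name__ == "__main__":
    rule = WindowThreshold(count=6, seconds=1200)
    # Eight disconnect messages, 10 seconds apart (timestamps in seconds).
    for i in range(8):
        t = i * 10
        if rule.observe(t):
            print(f"t={t}s: would mail and restart mfsmaster")
```

Under this model a burst of disconnect messages triggers a single restart rather than one restart per log line, which is presumably why the threshold is there.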