From: Ricardo J. B. <ric...@da...> - 2011-12-21 21:33:05
On Tuesday 20 December 2011, Chris Picton wrote:
> Hi List

Hi Chris,

> I have a moosefs installation in my test environment consisting of 4
> pcs, each with 2x 80 GB and 2x 1 TB drives. They are running a
> corosync/pacemaker cluster with any one of the 4 machines acting as the
> master, and all 4 running metaloggers (I am happy to share the ocf
> script if anyone would like to look at it)
>
> They are running chunk servers (and I have 4 extra chunk servers as well)
>
> We are hosting kvm images on the cluster (goal=3)
>
> However, we had a problem where the temperature in the lab went too
> high, and some of the HDDs shut down.
>
> No new files were being created at the time, but there was read/write
> access to most of the existing files.
>
> The current master kernel panicked, and a backup metalogger was promoted
> to master (using mfsmetarestore). Two of the other chunk servers each
> had a 1 TB drive fail.
>
> So all-in-all, a fairly bad problem, where we lost 3 copies of some of
> the data (and goal was 3). This was overnight, and I only saw the
> problem this morning.
>
>
> However, the missing data should still have been on the drives (kernel
> panicked master, and failed hdds on other machines)
>
> So I have now done:
> Power off all machines
> Reseat all drives
> fsck all drives (no filesystem errors found)
> restart the master, metalogger and chunkservers
>
> The CGI is showing 44 chunks which have zero copies.
> (It also shows some chunks with 4 copies, and some chunks with 5 copies
> - which implies that the undergoal chunks were being replicated after
> the problem happened.)
>
> My question is - why would there be any chunks with zero copies? No new
> files were being added or deleted - the metalogger/masters would all
> have had the same data. The failed drives have started again, with no
> filesystem errors. Where are my missing chunks??
>
> Any help would be appreciated
>
> Chris

I had a similar problem last week in a small MFS cluster (1 master, 1
metalogger, 2 chunkservers) and had several chunks with zero copies.

I grepped /var/log/messages for "currently unavailable file" and ran
mfsfileinfo on all the files with missing chunks (i.e., chunks marked as
"no valid copies").

Then I ran mfsfilerepair on those files, expecting the missing chunks to be
zeroed out, but their version got downgraded instead.

I think this is because the master might have received an update operation
but the chunkservers didn't get a chance to complete the update, so the
master "restored" the last good version of the chunk.

Since (AFAIK) there is no way to know beforehand what mfsfilerepair is going
to do with the missing chunks, I'd recommend being careful.

Regards,
-- 
Ricardo J. Barberis
Senior SysAdmin / ITI
Dattatec.com :: Soluciones de Web Hosting
Tu Hosting hecho Simple!
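
For reference, the read-only triage described above (grep the master's log, then inspect each affected file with mfsfileinfo) could be sketched roughly as below. This is only an illustration, not a tested tool: the log line layout in the regular expression is an assumption (the thread only gives the phrase "currently unavailable file"), and the helper names and the mount point argument are hypothetical. It deliberately never calls mfsfilerepair, since what that does to a missing chunk cannot be known in advance.

```python
import os
import re
import subprocess

# ASSUMPTION: master log lines look roughly like
#   "... mfsmaster[123]: * currently unavailable file 42: vm/disk.img"
# The exact layout is not shown in this thread; adjust the pattern to your logs.
UNAVAILABLE_RE = re.compile(r"currently unavailable file\s+(\d+):\s*(.+)$")


def unavailable_files(log_lines):
    """Return the sorted, de-duplicated paths reported as unavailable."""
    paths = set()
    for line in log_lines:
        match = UNAVAILABLE_RE.search(line.rstrip("\n"))
        if match:
            paths.add(match.group(2).strip())
    return sorted(paths)


def count_lost_chunks(fileinfo_output):
    """Count lines of mfsfileinfo output that mention "no valid copies"."""
    return sum(
        1 for line in fileinfo_output.splitlines() if "no valid copies" in line
    )


def report(log_path, mount_point):
    """Print, per affected file, how many chunks have no valid copies.

    Read-only: runs mfsfileinfo only, never mfsfilerepair.
    mount_point is a hypothetical parameter: where MooseFS is mounted locally.
    """
    with open(log_path, errors="replace") as handle:
        paths = unavailable_files(handle)
    for rel in paths:
        full = os.path.join(mount_point, rel.lstrip("/"))
        result = subprocess.run(
            ["mfsfileinfo", full], capture_output=True, text=True
        )
        lost = count_lost_chunks(result.stdout)
        print(f"{full}: {lost} chunk(s) with no valid copies")
```

Something like report("/var/log/messages", "/mnt/mfs") would then give a list of affected files to review (the mount point here is just an example), so that copies can be taken or the chunkservers checked before deciding whether to run mfsfilerepair on any of them.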