From: Steve W. <st...@pu...> - 2012-04-18 15:13:29
On 04/10/2012 11:09 AM, Steve Wilson wrote:
> Thanks, Peter!
>
> I'll plan to take this chunk server offline one evening and run
> memtest on it.  Unfortunately, I don't have a spare system that can
> take the load from this one so we won't have any redundancy while
> running memtest.  But I do make a nightly backup of this 24TB MooseFS
> volume just in case something happens while we're running only one
> chunk server.
>
> Steve
>

Just a follow up on this... I finally was able to take the chunk server
out of service last night and run memtest on it.  Within a few minutes,
memtest detected a memory error.  So the CRC errors reported by MFS
turned out to be caused by faulty memory in this case.

Steve

> On 04/08/2012 09:10 AM, Peter aNeutrino wrote:
>> Hi Steve :)
>> It looks like a memory or mainboard/controller issue.
>>
>> However, there is some probability that all hard drives in this
>> machine are broken
>> (e.g. by temperature or by some shaking/vibration).
>>
>> If I were you I would mark this machine for maintenance and run full
>> tests on it:
>>  - first we need to make sure that all data keep the desired level of
>>    safety by marking all disks in the /etc/mfshdd.cfg config file with
>>    an asterisk like this:
>>      */mfs/01
>>      */mfs/02
>>      ...
>>  - restart the chunk server service (e.g. /etc/init.d/mfs-chunkserver
>>    restart)
>>  - wait for all chunks from this machine to be replicated
>>  - stop the chunk server service
>>
>> ....and then run tests, e.g.:
>>  - "memtest" for memory
>>  -- if an error occurs, replace the RAM and test it again
>>  -- if an error occurs again, it looks like a mainboard issue.
>>
>>  - "badblocks" for hard drives; you can test all disks together in
>>    parallel, but I would run them after moving the disks into a
>>    different machine.
>>    (just move them before you run memtest so you can run memtest and
>>    badblocks at the same time)
>>
>> If all tests PASS (no errors) then I would try to replace the
>> controller and mainboard,
>> and put the tested memory and disks into this new
>> mainboard/controller (or even CPU).
>>
>> That is for the one-server case.  With big installations like 100+,
>> such hardware errors can occur every week/month and it is worth
>> having a better procedure, which our Technical Support would create
>> for you :)
>>
>> Good luck with testing and please share with us when you fix it :)
>> aNeutrino :)
>> --
>> Peter aNeutrino
>> http://pl.linkedin.com/in/aneutrino
>> +48 602 302 132
>> Evangelist and Product Manager of http://MooseFS.org
>> at Core Technology sp. z o.o.
>>
>>
>> On Thu, Apr 5, 2012 at 22:29, Steve Wilson <st...@pu...
>> <mailto:st...@pu...>> wrote:
>>
>>     Hi,
>>
>>     One of my chunk servers will log a CRC error from time to time
>>     like the following:
>>
>>     Apr 4 17:29:10 massachusetts mfschunkserver[2224]:
>>     write_block_to_chunk:
>>     file:/mfs/08/27/chunk_00000000066B5D27_00000001.mfs - crc error
>>
>>     Is the most likely cause faulty system memory?  Or disk
>>     controller?  We get an error about every two days or so and
>>     spread across most of the drives:
>>
>>     #   IP path (switch to name)       chunks   last error
>>     9   128.210.48.62:9422:/mfs/01/    934123   2012-03-28 17:41
>>     10  128.210.48.62:9422:/mfs/02/    931903   2012-03-23 21:28
>>     11  128.210.48.62:9422:/mfs/03/    888712   2012-03-30 19:13
>>     12  128.210.48.62:9422:/mfs/04/    931661   2012-04-01 03:01
>>     13  128.210.48.62:9422:/mfs/05/    935681   no errors
>>     14  128.210.48.62:9422:/mfs/06/    929248   2012-04-04 13:41
>>     15  128.210.48.62:9422:/mfs/07/    929592   2012-03-30 19:02
>>     16  128.210.48.62:9422:/mfs/08/    829446   2012-04-04 17:29
>>
>>     Thanks,
>>     Steve
>>
>>     ------------------------------------------------------------------------------
>>     Better than sec? Nothing is better than sec when it comes to
>>     monitoring Big Data applications. Try Boundary one-second
>>     resolution app monitoring today. Free.
>>     http://p.sf.net/sfu/Boundary-dev2dev
>>     _______________________________________________
>>     moosefs-users mailing list
>>     moo...@li...
>>     <mailto:moo...@li...>
>>     https://lists.sourceforge.net/lists/listinfo/moosefs-users
>>
>
> --
> Steven M. Wilson, Systems and Network Manager
> Markey Center for Structural Biology
> Purdue University
> (765) 496-1946

--
Steven M. Wilson, Systems and Network Manager
Markey Center for Structural Biology
Purdue University
(765) 496-1946
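
For anyone repeating this procedure: the first step Peter describes
(marking every disk in /etc/mfshdd.cfg with an asterisk) could be
scripted roughly as below.  This is only a sketch, not an official
MooseFS tool.  The function name mark_disks_for_removal is made up for
illustration, and it assumes that comment lines in the file start with
"#" and that each remaining non-empty line is a single disk path.

```python
def mark_disks_for_removal(cfg_text: str) -> str:
    """Return mfshdd.cfg content with every active disk path prefixed by '*'.

    Assumptions (not confirmed in this thread): comment lines start with '#',
    and every other non-empty line holds exactly one disk path.
    """
    marked = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        # Leave blank lines, comments and already-marked disks untouched.
        if not stripped or stripped.startswith("#") or stripped.startswith("*"):
            marked.append(line)
        else:
            marked.append("*" + stripped)
    return "\n".join(marked) + "\n"


# Hypothetical sample content, modelled on the paths quoted in the thread.
sample_cfg = "# chunk server disks\n/mfs/01\n*/mfs/02\n\n/mfs/03\n"
new_cfg = mark_disks_for_removal(sample_cfg)
```

The function only transforms text, so it can be checked on a copy of
the file first.  Writing the result back to /etc/mfshdd.cfg, restarting
the chunk server and waiting for replication are still the manual steps
from Peter's list.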