From: Peter a. <pio...@co...> - 2012-04-18 18:09:46
Well done Steve :)

Several days ago I detected a broken Ethernet port on a switch (thanks to
MooseFS's CRC checking on the client side) - looks like one could use
MooseFS for testing the whole hardware infrastructure in a datacenter :p

cheers
aNeutrino :)

--
Peter aNeutrino
http://pl.linkedin.com/in/aneutrino
+48 602 302 132
Evangelist and Product Manager of http://MooseFS.org
at Core Technology sp. z o.o.

On Wed, Apr 18, 2012 at 17:13, Steve Wilson <st...@pu...> wrote:

> On 04/10/2012 11:09 AM, Steve Wilson wrote:
>
> Thanks, Peter!
>
> I'll plan to take this chunk server offline one evening and run memtest
> on it. Unfortunately, I don't have a spare system that can take the load
> from this one so we won't have any redundancy while running memtest. But
> I do make a nightly backup of this 24TB MooseFS volume just in case
> something happens while we're running only one chunk server.
>
> Steve
>
> Just a follow up on this... I finally was able to take the chunk server
> out of service last night and run memtest on it. Within a few minutes,
> memtest detected a memory error. So the CRC errors reported by MFS turned
> out to be caused by faulty memory in this case.
>
> Steve
>
> On 04/08/2012 09:10 AM, Peter aNeutrino wrote:
>
> Hi Steve :)
> It looks like a memory or mainboard/controller issue.
>
> However, there is some probability that this machine has all hard drives
> broken (eg. by temperature or by some shaking/vibration).
>
> If I were you I would mark this machine for maintenance and run full
> tests on it:
> - first we need to make sure that all data is at the desired level of
>   safety by marking all disks in the /etc/mfshdd.cfg config file with an
>   asterisk, like this:
>     */mfs/01
>     */mfs/02
>     ...
> - restart the chunk server service
>   (eg. /etc/init.d/mfs-chunkserver restart)
> - wait for all chunks from this machine to be replicated
> - stop the chunk server service
>
> ....and then run tests, eg.:
> - "memtest" for memory
>   -- if an error occurs, replace the RAM and test it again
>   -- if the error occurs again, it looks like a mainboard issue.
>
> - "badblocks" for the hard drives: you can test all disks in parallel,
>   but I would run it after moving the disks into a different machine.
>   (just move them before you run memtest so you can run memtest and
>   badblocks at the same time)
>
> If all tests PASS (no errors) then I would try to replace the controller
> and mainboard, and put the tested memory and disks into this new
> mainboard/controller (or even CPU).
>
> That is for the one-server case. With big installations like 100+
> servers, such hardware errors can occur every week/month and it is worth
> having a better procedure, which our Technical Support would create for
> you :)
>
> Good luck with testing and please share with us when you fix it :)
> aNeutrino :)
>
> On Thu, Apr 5, 2012 at 22:29, Steve Wilson <st...@pu...> wrote:
>
>> Hi,
>>
>> One of my chunk servers will log a CRC error from time to time like the
>> following:
>>
>> Apr 4 17:29:10 massachusetts mfschunkserver[2224]: write_block_to_chunk:
>> file:/mfs/08/27/chunk_00000000066B5D27_00000001.mfs - crc error
>>
>> Is the most likely cause faulty system memory? Or disk controller? We
>> get an error about every two days or so, spread across most of the
>> drives:
>>
>> #   IP path (switch to name)       chunks   last error
>> 9   128.210.48.62:9422:/mfs/01/    934123   2012-03-28 17:41
>> 10  128.210.48.62:9422:/mfs/02/    931903   2012-03-23 21:28
>> 11  128.210.48.62:9422:/mfs/03/    888712   2012-03-30 19:13
>> 12  128.210.48.62:9422:/mfs/04/    931661   2012-04-01 03:01
>> 13  128.210.48.62:9422:/mfs/05/    935681   no errors
>> 14  128.210.48.62:9422:/mfs/06/    929248   2012-04-04 13:41
>> 15  128.210.48.62:9422:/mfs/07/    929592   2012-03-30 19:02
>> 16  128.210.48.62:9422:/mfs/08/    829446   2012-04-04 17:29
>>
>> Thanks,
>> Steve
>>
>> ------------------------------------------------------------------------------
>> Better than sec? Nothing is better than sec when it comes to
>> monitoring Big Data applications. Try Boundary one-second
>> resolution app monitoring today. Free.
>> http://p.sf.net/sfu/Boundary-dev2dev
>> _______________________________________________
>> moosefs-users mailing list
>> moo...@li...
>> https://lists.sourceforge.net/lists/listinfo/moosefs-users
>
> --
> Steven M. Wilson, Systems and Network Manager
> Markey Center for Structural Biology
> Purdue University
> (765) 496-1946
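
For anyone repeating the first step of Peter's procedure quoted above (prefixing every disk in /etc/mfshdd.cfg with an asterisk so its chunks are replicated elsewhere before the server is taken down), here is a minimal Python sketch. It is not part of MooseFS: the function name mark_disks_for_removal is hypothetical, it only transforms text in memory, and it assumes the config has one disk path per line with "#" starting a comment. Writing the result back to the file and restarting the chunk server are left to the admin.

```python
def mark_disks_for_removal(cfg_text: str) -> str:
    """Return config text with every disk path line prefixed by '*'.

    Assumptions (not taken from the thread): one path per line,
    '#' starts a comment line. Comments, blank lines and lines that
    are already marked with '*' are passed through unchanged.
    """
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or stripped.startswith("*"):
            out.append(line)            # nothing to mark on this line
        else:
            out.append("*" + stripped)  # mark this disk for removal
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample content only; a real run would read /etc/mfshdd.cfg instead.
    sample = "# chunk server disks\n/mfs/01\n/mfs/02\n*/mfs/03\n"
    print(mark_disks_for_removal(sample), end="")
```

Already-marked lines are left alone so the function can be run twice without producing "**" prefixes.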