From: Peter a. <pio...@co...> - 2012-04-08 13:11:26
Hi Steve :)

It looks like a memory or mainboard/controller issue. However, there is some chance that all the hard drives in this machine are damaged (e.g. by heat or by shaking/vibration).

If I were you, I would mark this machine for maintenance and run full tests on it:

- first, make sure all data stays at the desired level of safety by marking every disk in the /etc/mfshdd.cfg config file with an asterisk, like this:
    */mfs/01
    */mfs/02
    ...
- restart the chunk server service (e.g. /etc/init.d/mfs-chunkserver restart)
- wait for all chunks from this machine to be replicated
- stop the chunk server service

...and then run tests, e.g.:

- "memtest" for memory
  -- if an error occurs, replace the RAM and test again
  -- if an error occurs again, it looks like a mainboard issue
- "badblocks" for the hard drives
  You can test all disks in parallel, but I would run the tests after moving the disks into a different machine (just move them before you run memtest, so you can run memtest and badblocks at the same time).

If all tests PASS (no errors), then I would try replacing the controller and mainboard (or even the CPU), and put the tested memory and disks into the new machine.

That is for the single-server case. In big installations with 100+ servers, such hardware errors can occur every week or month, and it is worth having a better procedure, which our Technical Support would create for you :)

Good luck with testing, and please let us know when you fix it :)

aNeutrino :)
--
Peter aNeutrino
http://pl.linkedin.com/in/aneutrino
+48 602 302 132
Evangelist and Product Manager of http://MooseFS.org at Core Technology sp. z o.o.

On Thu, Apr 5, 2012 at 22:29, Steve Wilson <st...@pu...> wrote:
> Hi,
>
> One of my chunk servers will log a CRC error from time to time like the
> following:
>
> Apr 4 17:29:10 massachusetts mfschunkserver[2224]:
> write_block_to_chunk:
> file:/mfs/08/27/chunk_00000000066B5D27_00000001.mfs - crc error
>
> Is the most likely cause faulty system memory? Or disk controller?
> We get an error about every two days or so and spread across most of the
> drives:
>
> #   IP path (switch to name)       chunks   last error
> 9   128.210.48.62:9422:/mfs/01/    934123   2012-03-28 17:41
> 10  128.210.48.62:9422:/mfs/02/    931903   2012-03-23 21:28
> 11  128.210.48.62:9422:/mfs/03/    888712   2012-03-30 19:13
> 12  128.210.48.62:9422:/mfs/04/    931661   2012-04-01 03:01
> 13  128.210.48.62:9422:/mfs/05/    935681   no errors
> 14  128.210.48.62:9422:/mfs/06/    929248   2012-04-04 13:41
> 15  128.210.48.62:9422:/mfs/07/    929592   2012-03-30 19:02
> 16  128.210.48.62:9422:/mfs/08/    829446   2012-04-04 17:29
>
> Thanks,
> Steve
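
The disk-marking step from the reply (prefixing every entry in /etc/mfshdd.cfg with an asterisk) could be scripted roughly as below. This is only a sketch under assumptions: the helper name mark_disks_for_removal is made up for illustration, and it assumes the file holds one disk path per line, with blank lines and lines starting with "#" treated as non-entries. It only transforms text; writing the result back to the config file and restarting the chunk server are left to the admin.

```python
def mark_disks_for_removal(cfg_text: str) -> str:
    """Prefix every active disk path with '*'.

    Hypothetical helper; assumes one path per line and '#' comments.
    """
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        # Leave blank lines, comments, and already-marked entries untouched.
        if not stripped or stripped.startswith("#") or stripped.startswith("*"):
            out.append(line)
        else:
            out.append("*" + stripped)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample content only; not read from a real /etc/mfshdd.cfg.
    sample = "# chunk server disks\n/mfs/01\n/mfs/02\n*/mfs/03\n"
    print(mark_disks_for_removal(sample), end="")
```

Running it on a copy of the config first and comparing the output with the original is probably the safest way to use something like this, since an already-marked line is deliberately left as it is.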