Re: [Moosefs-users] Cause of CRC errors

Hi Steve :)
It looks like a memory or mainboard/controller issue.

However, there is some probability that all hard drives in this machine
are damaged (e.g. by temperature or by shaking/vibration).
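
To see how the errors are spread across the disks, the chunkserver log lines can be counted per disk. Below is a rough Python sketch; the regular expression assumes the /mfs/NN mount layout and the message format of the log line quoted further down, and the function name is only illustrative. Errors on most disks point to a shared component (memory, mainboard, controller), while errors on a single disk point to that drive.

```python
import re
from collections import Counter

# Assumption: disks are mounted as /mfs/NN and the log message ends with
# "- crc error", as in the line quoted in the original report.
CRC_LINE = re.compile(r"file:(/mfs/\d+)/\S+ - crc error")

def crc_errors_per_disk(lines):
    """Count CRC error log lines per disk mount point."""
    counts = Counter()
    for line in lines:
        match = CRC_LINE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Apr  4 17:29:10 massachusetts mfschunkserver[2224]: "
        "write_block_to_chunk: "
        "file:/mfs/08/27/chunk_00000000066B5D27_00000001.mfs - crc error",
    ]
    for disk, count in sorted(crc_errors_per_disk(sample).items()):
        print(disk, count)
```

In practice the lines would come from the syslog file rather than a hard-coded list.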

If I were you, I would mark this machine for maintenance and run full tests
on it:
- first, make sure all data keeps the desired level of safety by marking
all disks in the /etc/mfshdd.cfg config file with an asterisk, like this:
*/mfs/01
*/mfs/02
...
- restart the chunk server service (e.g. /etc/init.d/mfs-chunkserver restart)
- wait for all chunks from this machine to be replicated
- stop the chunk server service
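
The marking step can be scripted. Here is a minimal Python sketch, assuming mfshdd.cfg holds one disk path per line and that lines starting with # are comments (both are assumptions about the file layout, and the function name is made up for illustration). It only transforms text; writing the result back to /etc/mfshdd.cfg and restarting the service is left to you.

```python
def mark_for_removal(cfg_text: str) -> str:
    """Prefix every disk path with '*' so its chunks get replicated elsewhere.

    Blank lines, comment lines (assumed to start with '#') and lines that
    are already marked are left unchanged.
    """
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or stripped.startswith("*"):
            out.append(line)
        else:
            out.append("*" + stripped)
    return "\n".join(out) + "\n" if out else ""

if __name__ == "__main__":
    example = "/mfs/01\n/mfs/02\n"
    print(mark_for_removal(example), end="")
```

Keep a backup copy of the original file so the asterisks can be removed once the machine has passed its tests.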

...and then run tests, e.g.:
- "memtest" for memory
-- if an error occurs, replace the RAM and test again
-- if an error occurs again, it looks like a mainboard issue.

- "badblocks" for hard drives; you can test all disks in parallel, but I
would run it after moving the disks into a different machine
(just move them before you run memtest, so you can run memtest and
badblocks at the same time)

If all tests PASS (no errors), then I would try replacing the controller
and mainboard (or even the CPU) and put the tested memory and disks into
the new hardware.

That is for the case of one server. In big installations (100+ servers),
such hardware errors can occur every week/month, and it is worth having a
better procedure, which our Technical Support would create for you :)

Good luck with testing and please share with us when you fix it :)
aNeutrino :)

-- 
Peter aNeutrino http://pl.linkedin.com/in/aneutrino +48 602 302 132

Evangelist and Product Manager of http://MooseFS.org
at Core Technology sp. z o.o.

On Thu, Apr 5, 2012 at 22:29, Steve Wilson <st...@pu...> wrote:

> Hi,
>
> One of my chunk servers will log a CRC error from time to time like the
> following:
>
>     Apr  4 17:29:10 massachusetts mfschunkserver[2224]:
> write_block_to_chunk:
> file:/mfs/08/27/chunk_00000000066B5D27_00000001.mfs - crc error
>
> Is the most likely cause faulty system memory?  Or disk controller?  We
> get an error about every two days or so and spread across most of the
> drives:
>
>  #       IP path (switch to name)                chunks       last error
>  9    128.210.48.62:9422:/mfs/01/    934123    2012-03-28 17:41
> 10    128.210.48.62:9422:/mfs/02/    931903    2012-03-23 21:28
> 11    128.210.48.62:9422:/mfs/03/    888712    2012-03-30 19:13
> 12    128.210.48.62:9422:/mfs/04/    931661    2012-04-01 03:01
> 13    128.210.48.62:9422:/mfs/05/    935681    no errors
> 14    128.210.48.62:9422:/mfs/06/    929248    2012-04-04 13:41
> 15    128.210.48.62:9422:/mfs/07/    929592    2012-03-30 19:02
> 16    128.210.48.62:9422:/mfs/08/    829446    2012-04-04 17:29
>
> Thanks,
> Steve
