From: Roberto N. <rob...@su...> - 2010-11-15 15:23:27
|
Hello.

Last week I noticed some messages from smartd in the log about Offline uncorrectable sectors.

So I took some time to read the docs and ran some more tests, but now I'm at a dead end and I'm not sure whether I have to worry or not.

This happens on two identical file servers.

So, some bits about my environment:

# uname -rms
Linux 2.6.9-89.0.18.ELsmp i686

# cat /etc/redhat-release
CentOS release 4.8 (Final)

The smartmontools are those from the official CentOS packages:

# smartctl -V
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
...

smartd is enabled and started at boot time.

Each server has two 1TB SATA disks in software RAID1.

# df
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/md0        40313848   2750104  35515868   8% /
/dev/sda1         248895     37159    198886  16% /boot
/dev/sdb3         248895     12483    223560   6% /boot2
none             1037332         0   1037332   0% /dev/shm
/dev/md1       918801540 168135056 703994052  20% /home

I first ran:
# smartctl -t long /dev/sda
and
# smartctl -t long /dev/sdb

then, the day after:
# smartctl -s on -o on -S on /dev/sda
and
# smartctl -s on -o on -S on /dev/sdb

and after a while I also ran
# smartctl -A /dev/sda
and
# smartctl -A /dev/sdb

I'll attach the output of the above commands, but I believe there's nothing to worry about there. What now worries me is:

# smartctl -l selftest /dev/sda
smartctl version 5.33 [i686-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure         90%            20675  61853873

and the command behaves in almost the same way for both drives in both hosts, except for the LBA_of_first_error value.

I'm still reading docs, but in the meantime, can any kind soul please shed some light on this? Do I have to worry, or is all well?

Thank you.
Roberto Nunnari |
From: Bokhan A. <ap...@ng...> - 2010-11-15 17:45:09
|
If your kernel supports it, try

echo check > /sys/block/md0/md/sync_action
echo repair > /sys/block/md0/md/sync_action

This should fix the offline uncorrectable sectors.

If not, you will have to fix them manually. Here is a howto: http://smartmontools.sourceforge.net/badblockhowto.html |
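Whether the running kernel exposes this interface can be checked before writing to it. The following is only a minimal sketch and does not come from the thread: the function name `md_scrub` and its optional sysfs-root argument (present only so the logic can be tried outside of /sys) are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: start an md check/repair only if the kernel exposes sync_action.
# md_scrub and its third argument are hypothetical, not part of mdadm.
md_scrub() {
    md_name=$1                 # array name, e.g. md0
    action=${2:-check}         # "check" or "repair"
    sysfs_root=${3:-/sys}      # overridable so the function can be tested
    ctl="$sysfs_root/block/$md_name/md/sync_action"

    # On kernels without this feature the md/ directory is simply absent.
    if [ ! -w "$ctl" ]; then
        echo "no writable $ctl: md scrubbing is not available here" >&2
        return 1
    fi
    echo "$action" > "$ctl"
}
```

Run as root, `md_scrub md0 check` would start a read pass over the whole array; its progress should then be visible in /proc/mdstat.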
From: Roberto N. <rob...@su...> - 2010-11-18 08:53:58
|
Bokhan Artem wrote:
> If your kernel supports it, try

Hello Bokhan.

I'm sorry to get back to you so late. I had to change the backup, swap disks and restore about one TB of data on one of the two file servers. Not because of a failure, but because the disks were full and I could not add more.

> echo check > /sys/block/md0/md/sync_action
> echo repair > /sys/block/md0/md/sync_action

I'm sorry, I don't have that directory or those files:

# ls -lAF /sys/block/md0/
total 0
-r--r--r-- 1 root root 4096 Nov 18 09:25 dev
-r--r--r-- 1 root root 4096 Nov 18 09:25 range
-r--r--r-- 1 root root 4096 Nov 18 09:25 removable
-r--r--r-- 1 root root 4096 Nov 18 09:25 size
-r--r--r-- 1 root root 4096 Nov 18 09:25 stat

Does that mean my kernel doesn't support it?

> This must fix offline unc.
>
> If not, then you will have to fix them manually. Here is howto
> http://smartmontools.sourceforge.net/badblockhowto.html

I'm now following the badblockhowto, but I use software RAID. I hope I can map it to my problem.

Thank you.
Roberto Nunnari |
From: Artem B. <ap...@ng...> - 2010-11-18 09:16:17
|
18.11.2010 14:53, Roberto Nunnari wrote:
> Does that mean my kernel doesn't support it?

Yes.

> I'm now following the badblockhowto.. but I use software raid..
> I hope I can map it to my problem..

It will help. All you need to do is copy the sectors containing the bad blocks by hand from the disk where they are still readable to the other one. |
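A copy like this is usually done with dd. The sketch below only prints the command instead of running it, since a wrong device or offset destroys data. It is an illustration, not a procedure from the thread: `print_sector_copy` is a hypothetical helper, and it assumes both RAID1 members store their data at the same offset from the start of their partition.

```shell
#!/bin/sh
# Sketch: print (do NOT run) a dd command copying sectors between RAID1 members.
# print_sector_copy is a hypothetical helper; all arguments are examples.
print_sector_copy() {
    good_dev=$1     # member that can still read the sectors
    bad_dev=$2      # member reporting the unreadable sectors
    sector=$3       # first 512-byte sector, relative to the start of the member
    count=${4:-8}   # number of sectors; 8 sectors = one 4096-byte block
    # skip= positions the read on the source, seek= the write on the target.
    echo "dd if=$good_dev of=$bad_dev bs=512 skip=$sector seek=$sector count=$count"
}
```

The printed command should be reviewed by hand, and the sector number double-checked against the kernel log, before anything is executed.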
From: Roberto N. <rob...@su...> - 2010-11-18 10:23:46
|
Artem Bokhan wrote:
> 18.11.2010 14:53, Roberto Nunnari:
>> Does that mean my kernel doesn't support it?
> Yes.
>
>> I'm now following the badblockhowto.. but I use software raid..
>> I hope I can map it to my problem..
> It will help. All you need is to copy by hand sectors with badblock from one
> disk to other.

debugfs doesn't behave as described in the badblockhowto:

# debugfs
debugfs 1.35 (28-Feb-2004)
debugfs: open /dev/md0
debugfs: testb 1736947
Block 1736947 marked in use
debugfs: icheck 1736947
Block   Inode number
1736947 <block not found>
debugfs: icheck 1736947 10
Block   Inode number
1736947 <block not found>
10      7

and if I try testb again it still reports 'marked in use':

debugfs: testb 1736947
Block 1736947 marked in use

Also, accessing the block with debugfs probably raised an error, and the RAID software put the failing drive offline. From /var/log/messages:

Nov 18 10:05:23 homeb kernel: ata1: EH complete
Nov 18 10:05:23 homeb kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Nov 18 10:05:23 homeb kernel: ata1.00: (BMDMA stat 0x24)
Nov 18 10:05:23 homeb kernel: ata1.00: cmd c8/00:48:b8:df:db/00:00:00:00:00/e0 tag 0 cdb 0x0 data 36864 in
Nov 18 10:05:23 homeb kernel: res 51/40:00:b8:df:db/00:00:00:00:00/e0 Emask 0x9 (media error)
Nov 18 10:05:23 homeb kernel: ata1.00: configured for UDMA/133
Nov 18 10:05:23 homeb kernel: SCSI error : <0 0 0 0> return code = 0x8000002
Nov 18 10:05:23 homeb kernel: Info fld=0x4000000 (nonstd), Invalid sda: sense = 72 11
Nov 18 10:05:23 homeb kernel: end_request: I/O error, dev sda, sector 14409656
Nov 18 10:05:23 homeb kernel: raid1: Disk failure on sda2, disabling device.
Nov 18 10:05:23 homeb kernel: Operation continuing on 1 devices
Nov 18 10:05:23 homeb kernel: raid1: sda2: rescheduling sector 13895576
Nov 18 10:05:23 homeb kernel: ata1: EH complete
Nov 18 10:05:23 homeb kernel: raid1: sdb1: redirecting sector 13895576 to another mirror
Nov 18 10:05:23 homeb kernel: SCSI device sda: drive cache: write back
Nov 18 10:05:23 homeb kernel: RAID1 conf printout:
Nov 18 10:05:23 homeb kernel: --- wd:1 rd:2
Nov 18 10:05:23 homeb kernel: disk 0, wo:1, o:0, dev:sda2
Nov 18 10:05:23 homeb kernel: disk 1, wo:0, o:1, dev:sdb1
Nov 18 10:05:23 homeb kernel: RAID1 conf printout:
Nov 18 10:05:23 homeb kernel: --- wd:1 rd:2
Nov 18 10:05:23 homeb kernel: disk 1, wo:0, o:1, dev:sdb1
Nov 18 10:05:23 homeb kernel: SCSI device sda: 1953525168 512-byte hdwr sectors (1000205 MB)
Nov 18 10:05:23 homeb kernel: SCSI device sda: drive cache: write back
Nov 18 10:08:45 homeb smartd[21413]: Device: /dev/sda, 4 Currently unreadable (pending) sectors
Nov 18 10:08:45 homeb smartd[21413]: Device: /dev/sda, 1 Offline uncorrectable sectors
Nov 18 10:08:45 homeb smartd[21413]: Device: /dev/sdb, 2 Currently unreadable (pending) sectors
Nov 18 10:08:45 homeb smartd[21413]: Device: /dev/sdb, 1 Offline uncorrectable sectors

Now I'm really scared. What should I do now?

Robi |
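The numbers in the log above appear consistent with the badblockhowto formula b = (int)((L - S) * 512 / B). The sketch below works through that arithmetic. Two of its inputs are assumptions rather than reported values: the partition start 514080 is inferred as the difference between the disk sector in the I/O error (14409656) and the md sector being rescheduled (13895576), and the 4096-byte filesystem block size is not shown anywhere in the thread. `lba_to_fs_block` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: map a whole-disk LBA to a filesystem block number using
#   b = (int)((L - S) * 512 / B)
# lba_to_fs_block is a hypothetical helper name.
lba_to_fs_block() {
    lba=$1          # L: failing 512-byte sector on the whole disk
    part_start=$2   # S: first sector of the partition holding the filesystem
    block_size=$3   # B: filesystem block size in bytes
    echo $(( (lba - part_start) * 512 / block_size ))
}

# 14409656 comes from the kernel log; 514080 and 4096 are inferred, not reported.
lba_to_fs_block 14409656 514080 4096   # prints 1736947
```

The result matches the block number passed to testb and icheck, which suggests the array data begins at the start of the member partition on this system.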
From: Roberto N. <rob...@su...> - 2010-11-18 10:41:45
|
I forgot to attach the RAID status. PLEASE HELP!

# mdadm -Q -D /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Fri Jul 4 11:44:51 2008
     Raid Level : raid1
     Array Size : 40957568 (39.06 GiB 41.94 GB)
    Device Size : 40957568 (39.06 GiB 41.94 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Nov 18 11:39:40 2010
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 7c8c833d:28e6e715:06ed1245:b3fda0bf
         Events : 0.7212396

    Number   Major   Minor   RaidDevice State
       0       0        0        -      removed
       1       8       17        1      active sync   /dev/sdb1
       2       8        2        -      faulty   /dev/sda2 |
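The "State : clean, degraded" line is the part of this output that a monitoring script would look at. As a small sketch under that assumption, the hypothetical helper `md_is_degraded` below reads `mdadm -D` output on standard input and succeeds only when the array state mentions "degraded".

```shell
#!/bin/sh
# Sketch: exit 0 if `mdadm -D` output on stdin reports a degraded array.
# md_is_degraded is a hypothetical helper name.
md_is_degraded() {
    # Match only the "State :" summary line, not the "State" column header
    # of the per-device table, which starts with "Number".
    awk '/^[ \t]*State[ \t]*:/ && /degraded/ { found = 1 }
         END { exit !found }'
}
```

Typical use would be `mdadm -D /dev/md0 | md_is_degraded && echo "md0 is degraded"`.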
From: Tim S. <ti...@se...> - 2010-11-18 11:10:51
|
Really, the best place to ask for help with Linux RAID issues is the kernel linux-raid list - lin...@vg...

smartmontools is cross-platform, so this is off topic for those running smartctl on Mac, Windows, Solaris, FreeBSD etc., not to mention people running non-RAID Linux boxes...

Also, USING CAPITALS in the subject line is not generally going to get you more responses - if anything it will get you fewer responses...

Cheers,

Tim.

--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53
http://seoss.co.uk/
+44-(0)1273-808309 |
From: Roberto N. <rob...@su...> - 2010-11-18 16:02:15
|
Tim Small wrote:
> Really, the best place to ask for help with Linux RAID issues is on the
> kernel linux-raid list - lin...@vg...

Thanks, I just subscribed to that list.

> smartmontools is cross-platform, so this is off topic for those running
> smartctl on Mac Windows, Solaris, FreeBSD etc. not to mention people
> running non-RAID Linux boxes...

OK, you're right.

> Also, USING CAPITALS in the subject line is not generally going to get
> you more responses - if anything it will get you fewer responses...

Hmm, you might be right, but I was really at a complete loss. People on other lists are usually less touchy about this. I've removed the capitals, OK?

Anyway, could anybody please help me understand which are the most important attributes (in the output of smartctl -A) and what they mean?

Do you have to worry when you see Raw_Read_Error_Rate > 0?
Do you have to worry when you see Current_Pending_Sector > 0?
Do you have to worry when you see Offline_Uncorrectable > 0?

Are there other values that could tell that the drive is about to fail?

By the way, I'm probably going to take a dump of /home, reinstall the file server via kickstart and restore /home in place.

Thank you very much to the developers of smartmontools: it let me discover in advance that the drives are failing, so I can replace them without losing any data.

Best regards.
Robi |
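Two of the attributes asked about correspond to the sector counts smartd wrote to the log earlier in the thread. As a rough sketch, the hypothetical helper `smart_raw` below pulls their raw values out of `smartctl -A` output read from standard input. It assumes the usual attribute table layout, with the attribute name in column 2 and the raw value in column 10.

```shell
#!/bin/sh
# Sketch: print name and raw value of the pending/uncorrectable sector counters
# from `smartctl -A` output on stdin. smart_raw is a hypothetical helper and
# assumes the attribute name is field 2 and the raw value is field 10.
smart_raw() {
    awk '$2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" {
             print $2, $10
         }'
}
```

It would be used as `smartctl -A /dev/sda | smart_raw`. Raw_Read_Error_Rate is left out on purpose: its raw value is vendor-specific and on some drives is not a plain error count, so a nonzero number there is harder to interpret.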
From: Alex S. <ml...@os...> - 2010-11-18 16:33:55
|
On 11/18/2010 05:01 PM, Roberto Nunnari wrote:
>
> do you have to worry when you see Raw_Read_Error_Rate > 0 ?
> do you have to worry when you see Current_Pending_Sector > 0 ?
> do you have to worry when you see Offline_Uncorrectable > 0 ?
>

I recommend reading the Wikipedia article on SMART attributes and experimenting with gsmartcontrol to get a better understanding of what the attributes mean. |
From: Roberto N. <rob...@su...> - 2010-11-18 18:18:43
|
Alex Samorukov wrote:
> On 11/18/2010 05:01 PM, Roberto Nunnari wrote:
>> do you have to worry when you see Raw_Read_Error_Rate > 0 ?
>> do you have to worry when you see Current_Pending_Sector > 0 ?
>> do you have to worry when you see Offline_Uncorrectable > 0 ?
>
> I am recommending to read wikipedia about smart attributes and to play
> with gsmartcontrol for better understanding of attributes meaning.

Thanks. It was well worth reading.

Best regards.
Robi |
From: Artem B. <ap...@ng...> - 2010-11-18 09:32:28
|
18.11.2010 14:53, Roberto Nunnari wrote:
> Does that mean my kernel doesn't support it?

You can also boot from a LiveCD with a recent kernel and run the resync from there.