From: Ivan L. Jr. <iva...@gm...> - 2012-01-12 11:18:02
Hi!

One of my disks apparently has bad sectors on it which I'd love to
take care of on filesystem level and try and save the files that are
affected and then zero-out the bad sectors and remap them.

Here's a bit of diagnostic information:

% smartctl -i /dev/hdd
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST380011A
Serial Number:    5JVA4SQL
Firmware Version: 3.06
User Capacity:    80,026,361,856 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Thu Jan 12 13:11:46 2012 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

% smartctl -l selftest /dev/hdd
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

---- START OF READ SMART DATA SECTION -----
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        50%       17604       80735076
# 2  Extended offline    Completed without error        00%       17590       -
# 3  Short offline       Completed: read failure        90%       17589       80744740

% smartctl -l error /dev/hdd |grep "at LBA"
  40 51 08 cc 5a d2 f0  Error: UNC 8 sectors at LBA = 0x00d25acc = 13785804
  40 51 00 5d eb cf f0  Error: UNC at LBA = 0x00cfeb5d = 13626205
  40 51 00 5d eb cf f0  Error: UNC at LBA = 0x00cfeb5d = 13626205
  40 51 08 1c 11 d0 f0  Error: UNC 8 sectors at LBA = 0x00d0111c = 13635868
  40 51 08 2c d2 97 f0  Error: UNC 8 sectors at LBA = 0x0097d22c = 9949740

Full version:

% smartctl -A /dev/hdd
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   054   051   006    Pre-fail  Always       -       230062867
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       319
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       46
  7 Seek_Error_Rate         0x000f   084   060   030    Pre-fail  Always       -       281367075
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       18192
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       903
194 Temperature_Celsius     0x0022   039   051   000    Old_age   Always       -       39
195 Hardware_ECC_Recovered  0x001a   054   051   000    Old_age   Always       -       230062867
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       5
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       5
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   086   239   000    Old_age   Always       -       14

root@c-h-p-a /home/ilj % smartctl -l error /dev/hdd
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 599 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 599 occurred at disk power-on lifetime: 17665 hours (736 days + 1 hours)
  When the command that caused the error occurred, the device was
  active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 cc 5a d2 f0  Error: UNC 8 sectors at LBA = 0x00d25acc = 13785804

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 cc 5a d2 f0 00      05:14:50.934  READ DMA EXT
  25 00 08 b4 53 d2 f0 00      05:14:50.918  READ DMA EXT
  25 00 08 1c d0 cb f0 00      05:14:50.896  READ DMA EXT
  25 00 08 2c 58 94 f0 00      05:14:50.868  READ DMA EXT
  25 00 08 d4 57 94 f0 00      05:14:50.843  READ DMA EXT

Error 598 occurred at disk power-on lifetime: 17579 hours (732 days + 11 hours)
  When the command that caused the error occurred, the device was
  active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 5d eb cf f0  Error: UNC at LBA = 0x00cfeb5d = 13626205

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 5c eb cf f0 00      05:53:45.727  READ DMA EXT
  25 00 08 cc 44 64 f0 00      05:53:45.726  READ DMA EXT
  e7 00 00 00 00 00 f0 00      05:53:45.726  FLUSH CACHE
  35 00 08 f4 e3 50 f0 00      05:53:45.726  WRITE DMA EXT
  e7 00 00 00 00 00 f0 00      05:53:45.726  FLUSH CACHE

Error 597 occurred at disk power-on lifetime: 17579 hours (732 days + 11 hours)
  When the command that caused the error occurred, the device was
  active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 5d eb cf f0  Error: UNC at LBA = 0x00cfeb5d = 13626205

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 5c eb cf f0 00      05:53:45.727  READ DMA EXT
  e7 00 00 00 00 00 f0 00      05:53:45.726  FLUSH CACHE
  35 00 08 f4 e3 50 f0 00      05:53:45.726  WRITE DMA EXT
  e7 00 00 00 00 00 f0 00      05:53:45.726  FLUSH CACHE
  35 00 18 04 d0 3b f0 00      05:53:45.726  WRITE DMA EXT

Error 596 occurred at disk power-on lifetime: 17574 hours (732 days + 6 hours)
  When the command that caused the error occurred, the device was
  active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 1c 11 d0 f0  Error: UNC 8 sectors at LBA = 0x00d0111c = 13635868

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 1c 11 d0 f0 00      00:35:27.205  READ DMA EXT
  25 00 80 ec 87 0d f0 00      00:35:27.205  READ DMA EXT
  25 00 08 1c 6a 61 f0 00      00:35:27.198  READ DMA EXT
  25 00 08 2c 63 61 f0 00      00:35:27.164  READ DMA EXT
  25 00 08 54 52 60 f0 00      00:35:27.159  READ DMA EXT

Error 595 occurred at disk power-on lifetime: 8166 hours (340 days + 6 hours)
  When the command that caused the error occurred, the device was
  active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 2c d2 97 f0  Error: UNC 8 sectors at LBA = 0x0097d22c = 9949740

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 2c d2 97 f0 00      00:00:43.594  READ DMA EXT
  e7 00 00 00 00 00 f0 00      00:00:43.585  FLUSH CACHE
  35 00 08 f4 e3 50 f0 00      00:00:43.573  WRITE DMA EXT
  e7 00 00 00 00 00 f0 00      00:00:47.550  FLUSH CACHE
  25 00 08 2c d2 97 f0 00      00:00:47.517  READ DMA EXT

So, I used the following guide
http://smartmontools.sourceforge.net/badblockhowto.html#e2_example1 to
diagnose the disk and the filesystem.

I've come up with the following data, using the formula from that
guide (disregarding (int), though... simply because I'm not sure how
to use that in BASH and what it really does. Correct me if I'm wrong,
but it's not THAT important, is it?)

So, the disk in question looks like this:

% fdisk -lu /dev/hdd

Disk /dev/hdd: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0xdf3cdf3c

   Device Boot      Start         End      Blocks   Id  System
/dev/hdd1              63     3919859     1959898+  fd  Linux raid autodetect
/dev/hdd2   *     3919860   156296384    76188262+  fd  Linux raid autodetect

Yes, it's in fact a software RAID1 with ext3 filesystem:

% cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 hda2[0] hdd2[1]
      76188160 blocks [2/2] [UU]

md0 : active raid1 hda1[0] hdd1[1]
      1959808 blocks [2/2] [UU]

unused devices: <none>

And here's the table of data after all the calculations have been done
(hope it doesn't get botched up in plain-text):

SMART Value   LBA on partition   Problem LBA   Inode Number
80735076      76815216           9601902       4784266
80744740      76824880           9603110       4801095
13785804      9865944            1233243       491537
13626205      9706345            1213293       491536
13635868      9716008            1214501       491536
9949740       6029880            9949740       1638623

So, I went on to check whether these inodes were actually not readable
(pardon if I sound like an incompetent shmuck; I am actually one when
it comes to low level disk stuff!), and figured that they looked
totally OK:

% dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=4784266
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0336201 s, 15.2 kB/s
% dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=4801095
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0177189 s, 28.9 kB/s
% dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=491537
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0273371 s, 18.7 kB/s
% dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=491536
1+0 records in
1+0 records out
512 bytes (512 B) copied, 6.9713e-05 s, 7.3 MB/s
% dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=491536
1+0 records in
1+0 records out
512 bytes (512 B) copied, 6.7209e-05 s, 7.6 MB/s
% dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=1638623
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.00991578 s, 51.6 kB/s

So, I'm kinda at a loss. Either my calculations were incorrect
(missing (int) is the problem?), or the formula from the guide doesn't
apply anymore (?), or it's useless when working with software RAID-1,
or something else.

I was hoping that someone who feels comfortable playing with this
stuff would step up and give me a hint as to what I should do next.
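On the (int) question: as far as the formula in the Bad Block HOWTO goes, b = (int)((L - S) * 512 / B), the cast only means "drop the fractional part". Shell arithmetic in $(( )) is integer-only, so it truncates in the same way without any extra syntax. A minimal sketch, assuming a 4096-byte ext3 block size (not stated in the message, but it matches the ratio between the "LBA on partition" and "Problem LBA" columns) and using the partition start 3919860 from the fdisk output; fs_block is a made-up helper name:

```shell
#!/bin/bash
# Sketch of the Bad Block HOWTO formula: b = (int)((L - S) * 512 / B)
#   L = LBA reported by smartctl (sector number counted from the start of the disk)
#   S = start sector of the partition (from fdisk -lu)
#   B = filesystem block size in bytes (ASSUMED to be 4096 here; verify it
#       on the real filesystem, e.g. with tune2fs -l)
# fs_block is a hypothetical helper name, not part of any tool.
fs_block() {
    lba=$1
    start=$2
    bsize=${3:-4096}
    # $(( )) only does integer arithmetic, so the division truncates.
    # That truncation is what the (int) cast in the formula stands for.
    echo $(( (lba - start) * 512 / bsize ))
}

# The six LBAs from the self-test log and the error log in this thread.
for lba in 80735076 80744740 13785804 13626205 13635868 9949740; do
    printf '%s -> filesystem block %s\n' "$lba" "$(fs_block "$lba" 3919860)"
done
```

Under those assumptions the first five results agree with the "Problem LBA" column, and 9949740 maps to block 753735; the table lists 9949740 there, which looks like the SMART value copied over. Note that these are filesystem block numbers, so a dd read test of one of them would need bs=4096 and skip=<block>. An inode number is not a disk offset at all, so using it as skip= reads an unrelated sector.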
From: Bokhan A. <ap...@ng...> - 2012-01-12 19:02:20
Running consistency check on md should help. Suggest to set scterc if possible to avoid raid degradation.
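For readers following along, the two suggestions could look roughly like this. It is a sketch, not a tested recipe for this machine: the sysfs files sync_action and mismatch_cnt are the standard Linux md interface, while md_check, md_mismatches and the SYSFS override are hypothetical names introduced only so the snippet can be tried against a scratch directory. The scterc line assumes a smartctl build that accepts -l scterc and a drive that implements SCT Error Recovery Control.

```shell
#!/bin/bash
# Sketch only. md_check, md_mismatches and SYSFS are illustrative names;
# the files under /sys/block/<md>/md/ are the regular md sysfs interface.
SYSFS=${SYSFS:-/sys}

md_check() {
    # $1 = md array name, e.g. md1
    # Ask md to read all mirrors and compare them (needs root on a real system).
    echo check > "$SYSFS/block/$1/md/sync_action"
}

md_mismatches() {
    # $1 = md array name; prints the mismatch counter from the last check.
    cat "$SYSFS/block/$1/md/mismatch_cnt"
}

# Possible use on the array from this thread (as root):
#   md_check md1
#   cat /proc/mdstat        # shows the progress of the check
#   md_mismatches md1
#
# SCT Error Recovery Control, limiting read and write recovery to 7.0 s
# (values are in units of 100 ms):
#   smartctl -l scterc,70,70 /dev/hdd
```

The reasoning behind the check, as far as I understand md, is that a read error on one mirror during the scan gets rewritten from the good copy, which gives the drive a chance to reallocate the pending sector.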
From: Ivan L. Jr. <iva...@gm...> - 2012-01-17 13:41:08
It doesn't.

I took this time to do what you recommended. The consistency check
identified a number of mismatches, which have been fixed. It doesn't
matter much, though, because as far as I understand this is pretty
common and harmless behavior on raid1 entities.

Setting scterc is not an option. smartmontools doesn't seem to support it:

% smartctl -l scterc /dev/hdd
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=======> INVALID ARGUMENT TO -l: scterc
=======> VALID ARGUMENTS ARE: error, selftest, selective, directory, background, scttemp[sts|hist] <=======

Use smartctl -h to get a usage summary

Ultimately, the pending sector count is the same - 5 sectors.

On Thu, Jan 12, 2012 at 9:01 PM, Bokhan Artem <ap...@ng...> wrote:
> Running consistency check on md should help. Suggest to set scterc if
> possible to avoid raid degradation.
From: Ivan L. Jr. <iva...@gm...> - 2012-01-17 13:45:31
|
This is an 80 GB disk drive. I tried the LBA numbers reported by smartctl in a dd command and it works just fine. Everything is readable. I then went on to test every LBA from my table, be it the number reported by smartctl or one derived from the calculations using the formula from the Bad Block How-To. Well, all of them are perfectly readable.

As I've just mentioned in a preceding message, there were a number of mismatches on this raid1 entity; those have been fixed and don't seem to pose any serious threat, according to what I found on Google.

So, I'm back to where I started with my question. What else can I do to troubleshoot these 5 pending sectors in the -A table?

On Thu, Jan 12, 2012 at 10:58 PM, Christian Franke <Chr...@t-...> wrote:
> Ivan Lezhnjov Jr. wrote:
>>
>> Hi!
>>
>> One of my disks apparently has bad sectors on it which I'd love to
>> take care of on filesystem level and try and save the files that are
>> affected and then zero-out the bad sectors and remap them.
>>
>> Here's a bit of diagnostic information:
>> [...]
>>
>> And here's the table of data after all the calculations have been done
>> (hope it doesn't get botched up in plain-text):
>>
>> SMART Value   LBA on partition   Problem LBA   Inode Number
>> 80735076      76815216           9601902       4784266
>> 80744740      76824880           9603110       4801095
>> 13785804      9865944            1233243       491537
>> 13626205      9706345            1213293       491536
>> 13635868      9716008            1214501       491536
>> 9949740       6029880            9949740       1638623
>>
>> So, I went on to check whether these inodes were actually not readable
>> (pardon if I sound like an incompetent shmuck; I am actually one when
>> it comes to low level disk stuff!), and figured that they looked
>> totally OK:
>>
>> % dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=4784266
>> 1+0 records in
>> 1+0 records out
>> 512 bytes (512 B) copied, 0.0336201 s, 15.2 kB/s
>
>
> Using the filesystem "Inode Number" as an offset for dd does not make any
> sense. You could use "Problem LBA" (which is apparently a filesystem block
> number) if its size is correctly specified (bs=4096).
>
> A first check can be done without any calculation. Simply use the disk LBA
> reported by SMART and the disk device name:
>
> # dd if=/dev/hdd of=/dev/null bs=512 count=1 skip=80735076 iflag=direct
>
> If this works, the error might be transient and no longer reproducible. This
> occasionally happens for LBAs reported in the SMART error log.
>
> Note that for disks > 128GiB the old SMART error log is no longer
> sufficient; use smartctl >= 5.39 and '-l xerror' then.
>
> Christian
>
|
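For reference, the block arithmetic behind the "Problem LBA" column can be sketched in shell. This is only a sketch under stated assumptions: the formula is the Bad Block HOWTO's b = (int)((L-S)*512/B), the partition start 3919860 is taken from the fdisk output for /dev/hdd2, the 4096-byte block size is the one implied by the bs=4096 suggestion (tune2fs -l on the md device should report the real value), and the md metadata is assumed to sit at the end of the partition so that filesystem block 0 coincides with the partition start. The names fs_block, PART_START, SECTOR_SIZE and FS_BLOCK_SIZE are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map a disk LBA from the SMART logs to a filesystem block.
# Assumptions: 512-byte sectors, /dev/hdd2 starting at sector 3919860,
# 4096-byte filesystem blocks, filesystem starting at the partition start.
PART_START=3919860
SECTOR_SIZE=512
FS_BLOCK_SIZE=4096

fs_block() {
    # Shell arithmetic is integer-only, so the division truncates,
    # which is what the (int) in the HOWTO formula does.
    echo $(( ($1 - PART_START) * SECTOR_SIZE / FS_BLOCK_SIZE ))
}

# The six LBAs reported by the self-test log and the error log.
for lba in 80735076 80744740 13785804 13626205 13635868 9949740; do
    echo "$lba -> $(fs_block "$lba")"
done
```

Under these assumptions the first five results agree with the table, and the last LBA (9949740) comes out as block 753735 rather than the 9949740 shown in the table. A block computed this way would then be read from the partition with a matching block size, for example dd if=/dev/hdd2 of=/dev/null bs=4096 count=1 skip=9601902 iflag=direct.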
From: Weedy <wee...@gm...> - 2012-01-17 14:55:40
|
On 17/01/12 08:45 AM, Ivan Lezhnjov Jr. wrote:
> So, I'm back to where I started with my question. What else can I do
> to troubleshoot these 5 pending sectors in -A table?

wget http://downloads.sourceforge.net/project/hdrecover/hdrecover/hdrecover-0.4/hdrecover-0.4.tar.gz
tar xvf hdrecover-0.4.tar.gz
cd hdrecover-0.4
./configure
make
./hdrecover /dev/hdd
|
From: Artem B. <ap...@ng...> - 2012-01-17 15:35:16
|
On 17.01.2012 20:40, Ivan Lezhnjov Jr. wrote:
> It doesn't. I took this time to do what you recommended. Consistency
> check identified some number of mismatches, this is fixed. It doesn't
> matter much, though, because as far as I understood this is a pretty
> common and harmless behavior on raid1 entities. Setting scterc is not
> an option. smartmontools doesn't seem to support it:

You have an old version. Compile the latest binary.

> So, I'm back to where I started with my question. What else can I do
> to troubleshoot these 5 pending sectors in -A table?

Pending sectors are not bad; Offline_Uncorrectable sectors are bad. If your disk is readable now ("dd if=/dev/sdX of=/dev/null bs=1M" ends without errors) and passes a long self-test (re-run it), then you may forget about the pending sectors.

Do you still have Offline_Uncorrectable?

>
> % smartctl -l scterc /dev/hdd
> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> =======> INVALID ARGUMENT TO -l: scterc
> =======> VALID ARGUMENTS ARE: error, selftest, selective, directory,
> background, scttemp[sts|hist]<=======
>
> Use smartctl -h to get a usage summary
>
> Ultimately, the pending sector count is the same - 5 sectors.
>
> On Thu, Jan 12, 2012 at 9:01 PM, Bokhan Artem<ap...@ng...> wrote:
>> Running consistency check on md should help. Suggest to set scterc if
>> possible to avoid raid degradation.
>>
>> 12.01.2012 18:17, Ivan Lezhnjov Jr. wrote:
>>> Hi!
>>>
>>> One of my disks apparently has bad sectors on it which I'd love to
>>> take care of on filesystem level and try and save the files that are
>>> affected and then zero-out the bad sectors and remap them.
>>> [...]
|
From: Ivan L. Jr. <iva...@gm...> - 2012-01-18 13:48:29
|
On Tue, Jan 17, 2012 at 5:35 PM, Artem Bokhan <ap...@ng...> wrote:
> On 17.01.2012 20:40, Ivan Lezhnjov Jr. wrote:
>>
>> It doesn't. I took this time to do what you recommended. Consistency
>> check identified some number of mismatches, this is fixed. It doesn't
>> matter much, though, because as far as I understood this is a pretty
>> common and harmless behavior on raid1 entities. Setting scterc is not
>> an option. smartmontools doesn't seem to support it:
>
> You have an old version. Compile the latest binary.

I will see what I can do. It's stable Debian and I don't feel like introducing a package that APT cannot update automatically.

>> So, I'm back to where I started with my question. What else can I do
>> to troubleshoot these 5 pending sectors in -A table?
>
> Pending sectors are not bad; Offline_Uncorrectable sectors are bad. If your
> disk is readable now ("dd if=/dev/sdX of=/dev/null bs=1M" ends without
> errors) and passes a long self-test (re-run it), then you may forget about
> the pending sectors.
>
> Do you still have Offline_Uncorrectable?

Yes, I do. dd says the drive partitions are readable. smartctl says Completed without error (#6)

% smartctl -l selftest /dev/hdd
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         18334       -
# 2  Extended offline    Completed without error       00%         18312       -
# 3  Short offline       Completed without error       00%         18312       -
# 4  Short offline       Completed without error       00%         18312       -
# 5  Extended offline    Completed: read failure       50%         17604       80735076
# 6  Extended offline    Completed without error       00%         17590       -
# 7  Short offline       Completed: read failure       90%         17589       80744740

If there's no problem, why isn't the Offline_Uncorrectable counter zero?
>> >> % smartctl -l scterc /dev/hdd >> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen >> Home page is http://smartmontools.sourceforge.net/ >> >> =======> INVALID ARGUMENT TO -l: scterc >> =======> VALID ARGUMENTS ARE: error, selftest, selective, directory, >> background, scttemp[sts|hist]<======= >> >> Use smartctl -h to get a usage summary >> >> Ultimately, the pending sector count is the same - 5 sectors. >> >> On Thu, Jan 12, 2012 at 9:01 PM, Bokhan Artem<ap...@ng...> wrote: >>> >>> Running consistency check on md should help. Suggest to set scterc if >>> possible to avoid raid degradation. >>> >>> 12.01.2012 18:17, Ivan Lezhnjov Jr. пишет: >>>> >>>> Hi! >>>> >>>> One of my disks apparently has bad sectors on it which I'd love to >>>> take care of on filesystem level and try and save the files that are >>>> affected and then zero-out the bad sectors and remap them. >>>> >>>> Here's a bit of diagnostic information: >>>> >>>> % smartctl -i /dev/hdd >>>> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce >>>> Allen >>>> Home page is http://smartmontools.sourceforge.net/ >>>> >>>> === START OF INFORMATION SECTION === >>>> Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family >>>> Device Model: ST380011A >>>> Serial Number: 5JVA4SQL >>>> Firmware Version: 3.06 >>>> User Capacity: 80,026,361,856 bytes >>>> Device is: In smartctl database [for details use: -P show] >>>> ATA Version is: 6 >>>> ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2 >>>> Local Time is: Thu Jan 12 13:11:46 2012 EET >>>> SMART support is: Available - device has SMART capability. 
>>>> SMART support is: Enabled >>>> >>>> % smartctl -l selftest /dev/hdd >>>> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce >>>> Allen >>>> Home page is http://smartmontools.sourceforge.net/ >>>> >>>> ---- START OF READ SMART DATA SECTION ----- >>>> SMART Self-test log structure revision number 1 >>>> Num Test_Description Status Remaining >>>> LifeTime(hours) LBA_of_first_error >>>> # 1 Extended offline Completed: read failure 50% 17604 >>>> 80735076 >>>> # 2 Extended offline Completed without error 00% 17590 >>>> - >>>> # 3 Short offline Completed: read failure 90% 17589 >>>> 80744740 >>>> >>>> % smartctl -l error /dev/hdd |grep "at LBA" >>>> 40 51 08 cc 5a d2 f0 Error: UNC 8 sectors at LBA = 0x00d25acc = >>>> 13785804 >>>> 40 51 00 5d eb cf f0 Error: UNC at LBA = 0x00cfeb5d = 13626205 >>>> 40 51 00 5d eb cf f0 Error: UNC at LBA = 0x00cfeb5d = 13626205 >>>> 40 51 08 1c 11 d0 f0 Error: UNC 8 sectors at LBA = 0x00d0111c = >>>> 13635868 >>>> 40 51 08 2c d2 97 f0 Error: UNC 8 sectors at LBA = 0x0097d22c = >>>> 9949740 >>>> >>>> Full version: >>>> >>>> % smartctl -A /dev/hdd >>>> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce >>>> Allen >>>> Home page is http://smartmontools.sourceforge.net/ >>>> >>>> === START OF READ SMART DATA SECTION === >>>> SMART Attributes Data Structure revision number: 10 >>>> Vendor Specific SMART Attributes with Thresholds: >>>> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE >>>> UPDATED WHEN_FAILED RAW_VALUE >>>> 1 Raw_Read_Error_Rate 0x000f 054 051 006 Pre-fail >>>> Always - 230062867 >>>> 3 Spin_Up_Time 0x0003 098 098 000 Pre-fail >>>> Always - 0 >>>> 4 Start_Stop_Count 0x0032 100 100 020 Old_age >>>> Always - 319 >>>> 5 Reallocated_Sector_Ct 0x0033 099 099 036 Pre-fail >>>> Always - 46 >>>> 7 Seek_Error_Rate 0x000f 084 060 030 Pre-fail >>>> Always - 281367075 >>>> 9 Power_On_Hours 0x0032 080 080 000 Old_age >>>> Always - 18192 >>>> 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail >>>> Always - 
0 >>>> 12 Power_Cycle_Count 0x0032 100 100 020 Old_age >>>> Always - 903 >>>> 194 Temperature_Celsius 0x0022 039 051 000 Old_age >>>> Always - 39 >>>> 195 Hardware_ECC_Recovered 0x001a 054 051 000 Old_age >>>> Always - 230062867 >>>> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age >>>> Always - 5 >>>> 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age >>>> Offline - 5 >>>> 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age >>>> Always - 0 >>>> 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age >>>> Offline - 0 >>>> 202 TA_Increase_Count 0x0032 086 239 000 Old_age >>>> Always - 14 >>>> >>>> root@c-h-p-a /home/ilj % smartctl -l error /dev/hdd >>>> smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce >>>> Allen >>>> Home page is http://smartmontools.sourceforge.net/ >>>> >>>> === START OF READ SMART DATA SECTION === >>>> SMART Error Log Version: 1 >>>> ATA Error Count: 599 (device log contains only the most recent five >>>> errors) >>>> CR = Command Register [HEX] >>>> FR = Features Register [HEX] >>>> SC = Sector Count Register [HEX] >>>> SN = Sector Number Register [HEX] >>>> CL = Cylinder Low Register [HEX] >>>> CH = Cylinder High Register [HEX] >>>> DH = Device/Head Register [HEX] >>>> DC = Device Command Register [HEX] >>>> ER = Error register [HEX] >>>> ST = Status register [HEX] >>>> Powered_Up_Time is measured from power on, and printed as >>>> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, >>>> SS=sec, and sss=millisec. It "wraps" after 49.710 days. >>>> >>>> Error 599 occurred at disk power-on lifetime: 17665 hours (736 days + 1 >>>> hours) >>>> When the command that caused the error occurred, the device was >>>> active or idle. 
>>>> >>>> After command completion occurred, registers were: >>>> ER ST SC SN CL CH DH >>>> -- -- -- -- -- -- -- >>>> 40 51 08 cc 5a d2 f0 Error: UNC 8 sectors at LBA = 0x00d25acc = >>>> 13785804 >>>> >>>> Commands leading to the command that caused the error were: >>>> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name >>>> -- -- -- -- -- -- -- -- ---------------- -------------------- >>>> 25 00 08 cc 5a d2 f0 00 05:14:50.934 READ DMA EXT >>>> 25 00 08 b4 53 d2 f0 00 05:14:50.918 READ DMA EXT >>>> 25 00 08 1c d0 cb f0 00 05:14:50.896 READ DMA EXT >>>> 25 00 08 2c 58 94 f0 00 05:14:50.868 READ DMA EXT >>>> 25 00 08 d4 57 94 f0 00 05:14:50.843 READ DMA EXT >>>> >>>> Error 598 occurred at disk power-on lifetime: 17579 hours (732 days + 11 >>>> hours) >>>> When the command that caused the error occurred, the device was >>>> active or idle. >>>> >>>> After command completion occurred, registers were: >>>> ER ST SC SN CL CH DH >>>> -- -- -- -- -- -- -- >>>> 40 51 00 5d eb cf f0 Error: UNC at LBA = 0x00cfeb5d = 13626205 >>>> >>>> Commands leading to the command that caused the error were: >>>> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name >>>> -- -- -- -- -- -- -- -- ---------------- -------------------- >>>> 25 00 08 5c eb cf f0 00 05:53:45.727 READ DMA EXT >>>> 25 00 08 cc 44 64 f0 00 05:53:45.726 READ DMA EXT >>>> e7 00 00 00 00 00 f0 00 05:53:45.726 FLUSH CACHE >>>> 35 00 08 f4 e3 50 f0 00 05:53:45.726 WRITE DMA EXT >>>> e7 00 00 00 00 00 f0 00 05:53:45.726 FLUSH CACHE >>>> >>>> Error 597 occurred at disk power-on lifetime: 17579 hours (732 days + 11 >>>> hours) >>>> When the command that caused the error occurred, the device was >>>> active or idle. 
>>>> >>>> After command completion occurred, registers were: >>>> ER ST SC SN CL CH DH >>>> -- -- -- -- -- -- -- >>>> 40 51 00 5d eb cf f0 Error: UNC at LBA = 0x00cfeb5d = 13626205 >>>> >>>> Commands leading to the command that caused the error were: >>>> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name >>>> -- -- -- -- -- -- -- -- ---------------- -------------------- >>>> 25 00 08 5c eb cf f0 00 05:53:45.727 READ DMA EXT >>>> e7 00 00 00 00 00 f0 00 05:53:45.726 FLUSH CACHE >>>> 35 00 08 f4 e3 50 f0 00 05:53:45.726 WRITE DMA EXT >>>> e7 00 00 00 00 00 f0 00 05:53:45.726 FLUSH CACHE >>>> 35 00 18 04 d0 3b f0 00 05:53:45.726 WRITE DMA EXT >>>> >>>> Error 596 occurred at disk power-on lifetime: 17574 hours (732 days + 6 >>>> hours) >>>> When the command that caused the error occurred, the device was >>>> active or idle. >>>> >>>> After command completion occurred, registers were: >>>> ER ST SC SN CL CH DH >>>> -- -- -- -- -- -- -- >>>> 40 51 08 1c 11 d0 f0 Error: UNC 8 sectors at LBA = 0x00d0111c = >>>> 13635868 >>>> >>>> Commands leading to the command that caused the error were: >>>> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name >>>> -- -- -- -- -- -- -- -- ---------------- -------------------- >>>> 25 00 08 1c 11 d0 f0 00 00:35:27.205 READ DMA EXT >>>> 25 00 80 ec 87 0d f0 00 00:35:27.205 READ DMA EXT >>>> 25 00 08 1c 6a 61 f0 00 00:35:27.198 READ DMA EXT >>>> 25 00 08 2c 63 61 f0 00 00:35:27.164 READ DMA EXT >>>> 25 00 08 54 52 60 f0 00 00:35:27.159 READ DMA EXT >>>> >>>> Error 595 occurred at disk power-on lifetime: 8166 hours (340 days + 6 >>>> hours) >>>> When the command that caused the error occurred, the device was >>>> active or idle. 
>>>> >>>> After command completion occurred, registers were: >>>> ER ST SC SN CL CH DH >>>> -- -- -- -- -- -- -- >>>> 40 51 08 2c d2 97 f0 Error: UNC 8 sectors at LBA = 0x0097d22c = >>>> 9949740 >>>> >>>> Commands leading to the command that caused the error were: >>>> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name >>>> -- -- -- -- -- -- -- -- ---------------- -------------------- >>>> 25 00 08 2c d2 97 f0 00 00:00:43.594 READ DMA EXT >>>> e7 00 00 00 00 00 f0 00 00:00:43.585 FLUSH CACHE >>>> 35 00 08 f4 e3 50 f0 00 00:00:43.573 WRITE DMA EXT >>>> e7 00 00 00 00 00 f0 00 00:00:47.550 FLUSH CACHE >>>> 25 00 08 2c d2 97 f0 00 00:00:47.517 READ DMA EXT >>>> >>>> So, I used the following guide >>>> http://smartmontools.sourceforge.net/badblockhowto.html#e2_example1 to >>>> diagnose the disk and the filesystem. >>>> >>>> I've come up with the following data, using the formula from that >>>> guide (disregarding (int), though... simply because I'm not sure how >>>> to use that in BASH and what it really does. Correct me if I'm wrong, >>>> but it's not THAT important, is it?) 
>>>>
>>>> So, the disk in question looks like this:
>>>>
>>>> % fdisk -lu /dev/hdd
>>>>
>>>> Disk /dev/hdd: 80.0 GB, 80026361856 bytes
>>>> 255 heads, 63 sectors/track, 9729 cylinders, total 156301488 sectors
>>>> Units = sectors of 1 * 512 = 512 bytes
>>>> Disk identifier: 0xdf3cdf3c
>>>>
>>>> Device     Boot     Start        End     Blocks  Id  System
>>>> /dev/hdd1              63    3919859    1959898+ fd  Linux raid autodetect
>>>> /dev/hdd2  *      3919860  156296384   76188262+ fd  Linux raid autodetect
>>>>
>>>> Yes, it's in fact a software RAID1 with ext3 filesystem:
>>>>
>>>> % cat /proc/mdstat
>>>> Personalities : [raid1]
>>>> md1 : active raid1 hda2[0] hdd2[1]
>>>>       76188160 blocks [2/2] [UU]
>>>>
>>>> md0 : active raid1 hda1[0] hdd1[1]
>>>>       1959808 blocks [2/2] [UU]
>>>>
>>>> unused devices:<none>
>>>>
>>>> And here's the table of data after all the calculations have been done
>>>> (hope it doesn't get botched up in plain-text):
>>>>
>>>> SMART Value   LBA on partition   Problem LBA   Inode Number
>>>> 80735076      76815216           9601902       4784266
>>>> 80744740      76824880           9603110       4801095
>>>> 13785804      9865944            1233243       491537
>>>> 13626205      9706345            1213293       491536
>>>> 13635868      9716008            1214501       491536
>>>> 9949740       6029880            9949740       1638623
>>>>
>>>> So, I went on to check whether these inodes were actually not readable
>>>> (pardon if I sound like an incompetent shmuck; I am actually one when
>>>> it comes to low level disk stuff!), and figured that they looked
>>>> totally OK:
>>>>
>>>> % dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=4784266
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 512 bytes (512 B) copied, 0.0336201 s, 15.2 kB/s
>>>> % dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=4801095
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 512 bytes (512 B) copied, 0.0177189 s, 28.9 kB/s
>>>> % dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=491537
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 512 bytes (512 B) copied, 0.0273371 s, 18.7 kB/s
>>>> % dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=491536
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 512 bytes (512 B) copied, 6.9713e-05 s, 7.3 MB/s
>>>> % dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=491536
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 512 bytes (512 B) copied, 6.7209e-05 s, 7.6 MB/s
>>>> % dd if=/dev/hdd2 of=/dev/null bs=512 count=1 skip=1638623
>>>> 1+0 records in
>>>> 1+0 records out
>>>> 512 bytes (512 B) copied, 0.00991578 s, 51.6 kB/s
>>>>
>>>> So, I'm kinda at a loss. Either my calculations were incorrect
>>>> (missing (int) is the problem?), or the formula from the guide doesn't
>>>> apply anymore (?), or it's useless when working with software RAID-1,
>>>> or something else.
>>>>
>>>> I was hoping that someone who feels comfortable playing with this
>>>> stuff would step up and give me a hint as to what I should do next.
>>>>
>>>> _______________________________________________
>>>> Smartmontools-support mailing list
>>>> Sma...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/smartmontools-support
|
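For reference, the arithmetic behind the table in the quoted message can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the function name `lba_to_fs_block` is made up, the partition start (3919860) is taken from the fdisk output for /dev/hdd2, and the 4096-byte filesystem block size is an assumption inferred from the factor of 8 between the table's second and third columns. It also assumes the md RAID1 member keeps its data at the start of the partition, which should be checked for the actual array.

```python
SECTOR_SIZE = 512  # bytes per sector, as reported by fdisk


def lba_to_fs_block(lba, part_start, fs_block_size=4096):
    """Map a whole-disk LBA to (sector offset in partition, filesystem block).

    fs_block_size is an assumption here; check the real value for the
    filesystem in question before relying on the result.
    """
    offset = lba - part_start
    if offset < 0:
        raise ValueError("LBA lies before the start of the partition")
    # Integer division plays the role of the (int) cast in the formula.
    block = (offset * SECTOR_SIZE) // fs_block_size
    return offset, block


PART_START = 3919860  # start sector of /dev/hdd2 in the fdisk listing

for lba in (80735076, 80744740, 13785804, 13626205, 13635868, 9949740):
    offset, block = lba_to_fs_block(lba, PART_START)
    print(f"LBA {lba}: partition offset {offset}, filesystem block {block}")
```

For the last SMART LBA (9949740) this arithmetic gives partition offset 6029880 and block 753735, whereas the table's third column repeats the raw LBA, so that row may be worth rechecking. Note also that dd's skip= with bs=512 counts 512-byte sectors, so the value to pass there is the partition offset rather than an inode number.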
From: Christian F. <Chr...@t-...> - 2012-01-17 18:40:59
|
Ivan Lezhnjov Jr. wrote:
> This is an 80GB disk drive. I tried the LBA numbers reported by
> smartctl in the dd command and it works just fine. Everything is readable.
> I then went on to test every LBA from my table, be it the number reported
> by smartctl or derived from the calculations using the formula from the
> Bad Block How-To. Well, all of them are perfectly readable.
>
> As I've just mentioned in a preceding message, there were a number of
> mismatches on this raid1 entity; those have been fixed and don't seem
> to pose any serious threat at all according to what I found on
> Google.
>
> So, I'm back to where I started with my question. What else can I do
> to troubleshoot these 5 pending sectors in the -A table?
>

Some (older) disks might not reset these attributes when the bad sectors are no longer present, and might not increment Reallocated_Sector_Ct either. I've seen this occasionally in the past.

Use some tool to do a full read check: badblocks, ddrescue, dd_rescue, hdrecover or even dd (I prefer ddrescue). Run a long SMART self-test. If all blocks are readable, there is nothing left to do. If there are any (persistent or transient) read errors, I would replace this disk.

Christian
|
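The full read check suggested above can be sketched as follows. This is a rough illustration in Python, not a replacement for badblocks or ddrescue: the function name `scan_readable` and the 4096-byte chunk size are assumptions, and reads may be served from the page cache rather than from the disk itself, so a clean result here is weaker evidence than a clean run of the dedicated tools.

```python
import os


def scan_readable(path, chunk_size=4096):
    """Read a file or block device front to back.

    Returns (total size in bytes, list of byte offsets where a read failed).
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # total size of file/device
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                os.lseek(fd, offset, os.SEEK_SET)
                data = os.read(fd, want)
            except OSError:
                # Unreadable chunk: remember where it was and move past it.
                bad_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of data
            offset += len(data)
    finally:
        os.close(fd)
    return size, bad_offsets
```

Called as `scan_readable("/dev/hdd2")` (which needs read permission on the device), an empty list would correspond to the "all blocks are readable" case; dividing any reported offset by 512 gives the sector offset within the partition.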
From: Ivan L. Jr. <iva...@gm...> - 2012-01-18 14:05:09
|
Thanks for the pointers. I will play around with the tools you mentioned, but from the results of dd and the long self-test it seems like there's nothing left to do :)

Is it more the rule or the exception that these counters are not reset/updated? Specifically, does anyone happen to know for sure how Fujitsu, Samsung, Seagate and Western Digital drives behave in this respect? Is there any way to figure that out on your own other than trial and error (maybe some document, wiki page or something)?

On Tue, Jan 17, 2012 at 8:40 PM, Christian Franke <Chr...@t-...> wrote:
> Ivan Lezhnjov Jr. wrote:
>>
>> This is an 80GB disk drive. I tried the LBA numbers reported by
>> smartctl in the dd command and it works just fine. Everything is readable.
>> I then went on to test every LBA from my table, be it the number reported
>> by smartctl or derived from the calculations using the formula from the
>> Bad Block How-To. Well, all of them are perfectly readable.
>>
>> As I've just mentioned in a preceding message, there were a number of
>> mismatches on this raid1 entity; those have been fixed and don't seem
>> to pose any serious threat at all according to what I found on
>> Google.
>>
>> So, I'm back to where I started with my question. What else can I do
>> to troubleshoot these 5 pending sectors in the -A table?
>>
>
> Some (older) disks might not reset these attributes when the bad sectors are
> no longer present, and might not increment Reallocated_Sector_Ct either. I've
> seen this occasionally in the past.
>
> Use some tool to do a full read check: badblocks, ddrescue, dd_rescue,
> hdrecover or even dd (I prefer ddrescue). Run a long SMART self-test. If all
> blocks are readable, there is nothing left to do. If there are any
> (persistent or transient) read errors, I would replace this disk.
>
> Christian
>
|