From: Zoltan P. <zol...@gm...> - 2005-01-11 23:47:56
Perhaps I have misunderstood something, but I am unable to reallocate
Offline_Uncorrectable sectors with the method described in the
badblockhowto.

The disk is currently a member of a raid1 array. I found the "bad"
blocks around the reported sector with the following command:

# read 100 consecutive 512-byte sectors one at a time
export i=139735500
while [ $i -lt 139735600 ]; do
    echo $i
    dd if=/dev/hdb of=/dev/null bs=512 count=1 skip=$i
    let i+=1
done

I removed the partition (not the whole disk) from the raid1 array
(raidsetfaulty && raidhotremove), zeroed the sectors and then the
whole partition (with sync!), but nothing happened. I then ran
smartctl -t offline /dev/hdb and, when that had finished,
smartctl -t long /dev/hdb, but I still get errors when I run the
script mentioned above:

139735510
1+0 records in
1+0 records out
139735511
dd: reading `/dev/hdb': Input/output error
0+0 records in
0+0 records out
139735512
dd: reading `/dev/hdb': Input/output error
0+0 records in
0+0 records out
[..]

Regards,
Zoltan

smartctl version 5.1-11 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SP1614N
Serial Number:    XXXXXXXXXXXXXXXX
Firmware Version: TM100-24
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Jan 9 21:48:40 2005 CET

==> WARNING: Contact developers; may need -F enabled.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Off-line data collection status:  (0x04) Offline data collection activity was
                                         suspended by an interrupting command
                                         from host.
                                         Auto Off-line Data Collection: Disabled.
Self-test execution status:      ( 112)  The previous self-test completed having
                                         the read element of the test failed.
Total time to complete off-line
data collection:                 (5760)  seconds.
Offline data collection
capabilities:                    (0x1b)  SMART execute Offline immediate.
                                         Automatic timer ON/OFF support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         No Conveyance Self-test supported.
                                         No Selective Self-test supported.
SMART capabilities:            (0x0003)  Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
Error logging capability:        (0x01)  Error logging supported.
                                         No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1)  minutes.
Extended self-test routine
recommended polling time:        (  96)  minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 100   100   051    Pre-fail -           13
  3 Spin_Up_Time            0x0007 071   055   000    Pre-fail -           5120
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  -           60
  5 Reallocated_Sector_Ct   0x0033 253   253   010    Pre-fail -           0
  7 Seek_Error_Rate         0x000b 253   253   051    Pre-fail -           0
  8 Seek_Time_Performance   0x0024 091   091   000    Old_age  -           9155
  9 Power_On_Hours          0x0032 100   100   000    Old_age  -           594852
 10 Spin_Retry_Count        0x0013 253   253   049    Pre-fail -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  -           58
194 Temperature_Celsius     0x0022 133   109   000    Old_age  -           35
195 Hardware_ECC_Recovered  0x000a 100   100   000    Old_age  -           307575713
196 Reallocated_Event_Count 0x0012 099   099   000    Old_age  -           4
197 Current_Pending_Sector  0x0033 253   253   010    Pre-fail -           0
198 Offline_Uncorrectable   0x0031 099   099   010    Pre-fail -           4
199 UDMA_CRC_Error_Count    0x000b 100   100   051    Pre-fail -           0
200 Multi_Zone_Error_Rate   0x000b 100   100   051    Pre-fail -           0
201 Unknown_Attribute       0x000b 100   100   051    Pre-fail -           3

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log, version number 1
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended off-line  Completed: read failure  00%        4926             0x0058db1f
# 2  Short off-line     Completed: read failure  00%        4903             0x085431e3
# 3  Extended off-line  Completed: read failure  00%        4903             0x0058db1f
# 4  Short off-line     Completed: read failure  00%        4903             0x085431e3
# 5  Short off-line     Completed: read failure  00%        4896             0x085431e3
# 6  Extended off-line  Completed: read failure  00%        4893             0x0058db1f
# 7  Short off-line     Completed: read failure  00%        4893             0x085431e3
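The self-test log above reports LBA_of_first_error in hexadecimal, while the dd loop uses decimal sector numbers. As a minimal sketch of the conversion (the helper name lba_hex_to_dec is made up for illustration and is not part of smartmontools), the shell's own printf can do it:

```shell
# Hypothetical helper: convert a hex LBA from the smartctl self-test log
# into the decimal sector number that dd's skip=/seek= expect (bs=512).
lba_hex_to_dec() {
    printf '%d\n' "$1"
}

lba_hex_to_dec 0x085431e3   # short test:    139735523
lba_hex_to_dec 0x0058db1f   # extended test: 5823263
```

The short-test LBA falls inside the range scanned by the loop (139735500 to 139735599). The extended test reports a different first-error LBA, which that loop does not cover.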
From: Bruno W. I. <br...@wo...> - 2005-01-12 05:27:48
On Wed, Jan 12, 2005 at 00:47:48 +0100,
  Zoltan Podlovics <zol...@gm...> wrote:
> Perhaps I misunderstood something, but I'm unable to reallocate
> offline_uncorrectable sectors with the method described
> in badblockhowto.
>
> The disk currently is a member of a raid1 array. I have found the
> "bad" blocks around the reported sector with the following command:
>
> export i=139735500 && while [ $i -lt 139735600 ]; do echo $i; dd
> if=/dev/hdb of=/dev/null bs=512 count=1 skip=$i; let i+=1; done
>
> I have removed the partition (not the whole disk) from raid1 array
> (raidsetfaulty && raidhotremove) and I have zeroed the sectors and
> soon the whole partition (with sync!) but nothing had happened. I have
> executed smartctl -t offline /dev/hdb when finished smartctl -t long
> /dev/hdb but I still have errors when I executing the script that I
> mentioned earlier:
>
> 139735510
> 1+0 records in
> 1+0 records out
> 139735511
> dd: reading `/dev/hdb': Input/output error
> 0+0 records in
> 0+0 records out
> 139735512
> dd: reading `/dev/hdb': Input/output error
> 0+0 records in
> 0+0 records out
> [..]

The above suggests that the OS is handling blocks using a size larger than
512 bytes and that to write a partial block it is doing a read first, which
fails because of the bad sector. Try using a blocksize of 4096. You will
need to divide the sector numbers by 8 when you do this.
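The division by 8 suggested above (4096 / 512) can be sketched with shell integer arithmetic. The helper name sector_to_block is an assumption for illustration; it prints the 4096-byte block number followed by the sector's position inside that block:

```shell
# Hypothetical helper: map a 512-byte sector number to the larger block
# that contains it. $1 = sector number, $2 = block size (default 4096).
sector_to_block() {
    sector=$1
    per_block=$(( ${2:-4096} / 512 ))   # 8 sectors per 4096-byte block
    echo "$(( sector / per_block )) $(( sector % per_block ))"
}

sector_to_block 139735511   # prints: 17466938 7
```

Integer division rounds down, so sector 139735511 is the last of the eight sectors in block 17466938; that block number is what dd takes as skip= or seek= when bs=4096.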
From: Zoltan P. <zol...@gm...> - 2005-01-12 09:40:41
I tried it with a 4096-byte blocksize, but I still get almost the same
errors:

export i=17466938 && while [ $i -lt 17466950 ]; do echo $i; dd
if=/dev/hdb of=/dev/null bs=4096 count=1 skip=$i; let i+=1; done

17466938
dd: reading `/dev/hdb': Input/output error
0+0 records in
0+0 records out
17466939
dd: reading `/dev/hdb': Input/output error
0+0 records in
0+0 records out
17466940
dd: reading `/dev/hdb': Input/output error
0+0 records in
0+0 records out
17466941
1+0 records in
1+0 records out
17466942
1+0 records in
1+0 records out
17466943
1+0 records in
1+0 records out
17466944
1+0 records in
1+0 records out
17466945
1+0 records in
1+0 records out
17466946
1+0 records in
1+0 records out
17466947
1+0 records in
1+0 records out
17466948
1+0 records in
1+0 records out
17466949
1+0 records in
1+0 records out

Regards,
Zoltan

> The above suggests that the OS is handling blocks using a size larger than
> 512 bytes and that to write a partial block it is doing a read first which
> fails because of the bad sector. Try using a blocksize of 4096. You will
> need to divide the sector numbers by 8 when you do this.
>
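To relate the failing 4096-byte blocks back to the 512-byte sector numbers used in the first script, the mapping can be reversed. This is only a sketch; block_to_sectors is a hypothetical helper name:

```shell
# Hypothetical helper: print the first and last 512-byte sector covered
# by a larger block. $1 = block number, $2 = block size (default 4096).
block_to_sectors() {
    block=$1
    per_block=$(( ${2:-4096} / 512 ))
    first=$(( block * per_block ))
    echo "$first $(( first + per_block - 1 ))"
}

block_to_sectors 17466938   # prints: 139735504 139735511
block_to_sectors 17466940   # prints: 139735520 139735527
```

The three failing blocks therefore span sectors 139735504 to 139735527. A 4096-byte read fails as soon as any one of its eight sectors is unreadable, so this range is an upper bound on the damaged area rather than a count of bad sectors.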
From: Volker K. <lis...@pa...> - 2005-01-12 10:13:56
> I tried it with 4096 blocksize but still I have almost the same errors:

Didn't you say this disk was in a raid1 (mirror)? The easiest way to make
sure all the dubious blocks are overwritten is to remove it from the raid,
then add it in again, thus making sure the whole partition gets written
again. Didn't you already do this? If so, it seems to me that either your
raid setup is dodgy or your disk is distinctly stuffed.

Does the area of bad blocks get bigger? If so, throw the disk out now.

Volker

--
Volker Kuhlmann is possibly list0570 with the domain in header
http://volker.dnsalias.net/ Please do not CC list postings to me.
From: Bruno W. I. <br...@wo...> - 2005-01-12 15:32:25
On Wed, Jan 12, 2005 at 10:40:34 +0100,
  Zoltan Podlovics <zol...@gm...> wrote:
> I tried it with 4096 blocksize but still I have almost the same errors:
>
> export i=17466938 && while [ $i -lt 17466950 ]; do echo $i; dd
> if=/dev/hdb of=/dev/null bs=4096 count=1 skip=$i; let i+=1; done

Now that I look closer, you are just reading the bad blocks in this
script. You need to write over the bad blocks to allow the bad sectors
to be reallocated.

You should be copying them from the good drive to the bad drive.

In this case you might fail the bad drive out of the mirror, write all of
the blocks by copying /dev/zero to /dev/hdb, run badblocks on the
problem drive with some write tests to see if reallocating the bad
sectors has stabilized the drive, and if so add it back into the mirror.
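The difference between the read loop and the write being asked for here comes down to dd's operands: skip= positions the input, seek= positions the output. The sketch below only prints the command it would run and never touches a disk; print_zero_cmd is a hypothetical helper name, and the device and sector are the ones from this thread:

```shell
# Hypothetical dry-run helper: print (do not execute) a dd command that
# would overwrite one unit on a device with zeros.
# $1 = device, $2 = unit number, $3 = block size (default 512).
print_zero_cmd() {
    dev=$1
    unit=$2
    bs=${3:-512}
    # seek= addresses the output device; skip= would address the input
    echo "dd if=/dev/zero of=$dev bs=$bs count=1 seek=$unit"
}

print_zero_cmd /dev/hdb 139735511
print_zero_cmd /dev/hdb 17466938 4096
```

Running the printed command destroys the data in that block, so it is only appropriate once the disk has been failed out of the mirror as described above.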
From: Zoltan P. <zol...@gm...> - 2005-01-12 19:55:38
That is my "test" script. Once reallocation has happened, the test should
complete without errors.

As I wrote in my last email, I zeroed the bad sectors and then the whole
partition. After a completed offline test (smartctl -t offline /dev/hdb)
and a long self-test (smartctl -t long /dev/hdb), my disk still shows a
Reallocated_Sector_Ct of zero.

Any ideas?

Regards,
Zoltan

> Now that I look closer you are just reading the bad blocks in this
> script. You need to write over the bad blocks to allow the bad sectors
> to be reallocated.
>
> You should be copying them from the good drive to the bad drive.
>
> In this case you might fail the bad drive out of the mirror, write all of
> the blocks by copying /dev/zero to the /dev/hdb, run badblocks on the
> problem drive with some write tests to see if reallocating the bad
> sectors has stabilized the drive, and if so add it back into the mirror.
>
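To compare the reallocation-related counters before and after a write pass, the raw values can be pulled out of the attribute table. This is a sketch only: smart_raw is a hypothetical helper, and it assumes the nine-column table layout shown earlier in this thread, where the raw value is the last field on the line (other smartctl versions may lay the table out differently).

```shell
# Hypothetical helper: read a smartctl attribute table on stdin and print
# the raw value (last field) of the attribute whose ID is given as $1.
smart_raw() {
    awk -v id="$1" '$1 == id { print $NF }'
}

# Sample rows copied from the report earlier in this thread.
sample='  5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail - 0
196 Reallocated_Event_Count 0x0012 099 099 000 Old_age - 4
197 Current_Pending_Sector 0x0033 253 253 010 Pre-fail - 0
198 Offline_Uncorrectable 0x0031 099 099 010 Pre-fail - 4'

for id in 5 196 197 198; do
    echo "$id: $(printf '%s\n' "$sample" | smart_raw "$id")"
done
```

On a live system the same filter could be fed from smartctl -A /dev/hdb instead of the sample text.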
From: Mario H. <Mario.Holbe@TU-Ilmenau.DE> - 2005-01-13 07:20:15
Zoltan Podlovics <zol...@gm...> wrote:
> Perhaps I misunderstood something, but I'm unable to reallocate
> offline_uncorrectable sectors with the method described
> in badblockhowto.
...
> (raidsetfaulty && raidhotremove) and I have zeroed the sectors and
> soon the whole partition (with sync!) but nothing had happened. I have
...
> Device Model: SAMSUNG SP1614N
> 196 Reallocated_Event_Count 0x0012 099 099 000 Old_age - 4
> 197 Current_Pending_Sector 0x0033 253 253 010 Pre-fail - 0
> 198 Offline_Uncorrectable 0x0031 099 099 010 Pre-fail - 4

Samsung doesn't clear the Offline_Uncorrectable counter when a
reallocation happens.
You could run Samsung's hutil, which clears *all* S.M.A.R.T. data as an
undocumented side effect. Not that this would be nice or sane. And of
course it tells you the disk is absolutely okay.

But be careful with the disk in any case: my SP1614N started the same
way. First some uncorrectable sectors, growing over some weeks, and then
finally a whole head went dead, reporting something about a servo
failure. At that point even hutil couldn't ignore that the disk had a
problem.
However, since you already have it in a RAID, this shouldn't be a big
issue.

regards,
Mario
--
"Why are we hiding from the police, daddy?" | J. E. Guenther
"Because we use SuSE son, they use SYSVR4." | de.alt.sysadmin.recovery