#28 badblocks output list overreports bad sectors

open
nobody
None
5
2014-08-22
2009-06-09
Haudy Kazemi
No

Background:
I have an old ATA/EIDE IBM Deskstar 60GXP IC35L020AVER07-0 hard drive (/dev/sdb) that has bad sectors, and I'm using it to thoroughly test some data recovery/drive repair tools. (I have copies of anything that was important on the drive.) The drive has over 36000 hours on it (i.e. more than 4 years of continuous operation), and the internal SMART error log counter has reached its maximum of 65535 during various sector re-read/retry testing. (One time I did see the error log counter reset to about 50k while running HDAT2 via Ultimate Boot CD 5 beta 12.)

While using various tools, I have found some areas where badblocks' ability to detect and/or remedy bad sectors could be improved.

Methods:
I've run multiple tests and compared the output of these commands on Ubuntu 8.10:
badblocks -sv -b 512 -c 1 -o sdberrorlist3.txt /dev/sdb
smartctl --all /dev/sdb
sudo hdparm --read-sector 17938735 /dev/sdb
dmesg

The first command uses badblocks in its non-destructive read-only mode and is supposed to write a list of bad blocks to the output file sdberrorlist3.txt. The -b parameter sets the block size to 512 bytes (i.e. one sector), and the -c parameter tells badblocks to test 1 block at a time. The block device is /dev/sdb, which I want to read sector by sector to find all bad sectors.
The second command lists all the SMART info from the drive including the device error log which lists the 5 most recent errors.
The third command uses hdparm to read the specified sector. I'm using it to attempt to read sectors that badblocks identified and listed in the output file sdberrorlist3.txt.
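
To cross-check individual entries from the badblocks output file, a loop like the following can re-read every listed block with hdparm. This is a rough sketch; it assumes that with -b 512 run against the whole device the block numbers in sdberrorlist3.txt equal LBA sector numbers, and that hdparm returns a non-zero exit status on a failed read:

while read sector; do
  sudo hdparm --read-sector "$sector" /dev/sdb > /dev/null 2>&1 && echo "$sector readable" || echo "$sector UNREADABLE"
done < sdberrorlist3.txt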

Findings:
I let badblocks run to completion on /dev/sdb. It found over 2200 bad blocks on this 20 GB drive, taking more than 12 hours to complete. I examined the smartctl output and saw errors like this whenever badblocks hit a bad sector area:

Error 65535 occurred at disk power-on lifetime: 36702 hours (1529 days + 6 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 08 b0 e7 73 e1 Error: UNC 8 sectors at LBA = 0x0173e7b0 = 24373168

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 b0 e7 73 e1 00 1d+13:46:56.100 READ DMA
f8 00 00 00 00 00 e0 00 1d+13:46:56.100 READ NATIVE MAX ADDRESS
ec 00 00 00 00 00 a0 02 1d+13:46:56.100 IDENTIFY DEVICE
ef 07 00 00 00 00 a0 00 1d+13:46:56.100 SET FEATURES [Set device spin-up]
ec 00 00 00 00 00 a0 02 1d+13:46:56.100 IDENTIFY DEVICE

From this, it looks like badblocks (or the block layer beneath it) is issuing READ DMA commands for 8 sectors at a time, even though I specified a block size of 512 bytes, which is one sector.
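
One way to check whether the 8-sector READ DMA commands come from kernel read-ahead/merging rather than from badblocks itself would be to read a single suspect sector with direct I/O, bypassing the page cache. A sketch, using the LBA from the SMART log above:

sudo dd if=/dev/sdb of=/dev/null bs=512 count=1 skip=24373168 iflag=direct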

I tested sector readability with the hdparm --read-sector command, which successfully read some of the sectors badblocks had reported as bad. Two sectors that were listed in the badblocks output file were 17938734 and 17938735. The hdparm sector read results are:

user@user-desktop:~$ sudo hdparm --read-sector 17938734 /dev/sdb
/dev/sdb:
reading sector 17938734: FAILED: Input/output error

user@user-desktop:~$ sudo hdparm --read-sector 17938735 /dev/sdb
/dev/sdb:
reading sector 17938735: succeeded
5a4d 0090 0003 0000 0004 0000 ffff 0000
(and 20+ similar lines)

Here is what the SMART error log showed after running the hdparm command that had the error shown above:

Error 65535 occurred at disk power-on lifetime: 36758 hours (1529 days + 7 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 59 01 2e b9 11 e1 Error: UNC at LBA = 0x0111b92e = 17938734

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
20 00 01 2e b9 11 e1 00 3d+21:25:01.300 READ SECTOR(S)
20 00 01 2f b9 11 e1 00 3d+20:44:56.200 READ SECTOR(S)
e5 00 00 00 00 00 a0 00 1d+20:44:47.300 CHECK POWER MODE
e0 00 00 00 00 00 e0 00 1d+20:44:31.200 STANDBY IMMEDIATE
b0 d0 01 00 4f c2 a0 00 1d+20:44:17.200 SMART READ DATA

The log shows a READ SECTOR(S) command with a sector count of 1, so hdparm really is reading a single sector at a time and not using DMA.

It appears that badblocks is overreporting which blocks are bad (i.e. some of the blocks reported as bad are actually still readable). Further confirmation of the overreporting comes from the SMART attribute 197 Current_Pending_Sector, which stands at 1176. This Current_Pending_Sector count decreases when I use hdparm --repair-sector, HDAT2, or the IBM Drive Fitness Test's Corrupt Sector repair option. Since Reallocated_Sector_Ct has not been increasing along with these repairs, I believe most of the pending sectors are due to sectors that were written improperly (i.e. the sector data doesn't match the sector ECC written with it). Drive firmware bugs or intermittent component failures might be the root cause, but that's just speculation.
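
The two attributes can be watched before and after each repair pass with, for example:

sudo smartctl -A /dev/sdb | egrep 'Current_Pending_Sector|Reallocated_Sector_Ct'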

Suggestions:
Change how badblocks reads the disk so that the badblocks output file does not include sectors that are actually readable individually. As it stands, the badblocks output lists a mix of sectors that in fact read fine alongside genuinely bad sectors.
GNU ddrescue 1.9 (not to be confused with dd_rescue) can read on a sector-by-sector basis, and it also has a direct drive access option (-d). It can also skip quickly over suspect areas to get the whole drive scanned before returning for a more thorough examination of the skipped areas, which is a useful characteristic in a scrub.
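
For comparison, a sector-granular ddrescue pass over this drive might look like the following (a sketch; the image and log file names are placeholders):

sudo ddrescue -d -b 512 /dev/sdb sdb-image.img sdb-rescue.log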

Use case 1:
Take the output of badblocks and use it in other repair tools, e.g. hdparm --repair-sector
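
For example, the badblocks output file could be fed straight to hdparm, something like this (a sketch, and destructive: --repair-sector overwrites each sector with zeros, and I believe current hdparm versions also require the --yes-i-know-what-i-am-doing flag for it):

while read sector; do
  sudo hdparm --yes-i-know-what-i-am-doing --repair-sector "$sector" /dev/sdb
done < sdberrorlist3.txt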

Use case 2:
Another tool (I don't believe this one exists yet for ext2/3/4, but it would be a good addition to this toolset) would be a physical/logical sector to file mapping tool, so one can quickly identify which files have been affected by the bad sectors and then decide whether a file is best deleted and copied back from another location, or whether a partial recovery (possibly losing the whole 512-byte sector) using hdparm/HDAT2/SpinRite is preferable.
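
For ext2/3/4, a rough approximation may already be possible with debugfs from e2fsprogs, assuming the bad LBA is first converted to a filesystem block number using the partition's start sector and the filesystem block size. A hypothetical example for bad sector 17938734, a partition /dev/sdb1 starting at LBA 63 and a 4096-byte block size, so the filesystem block is (17938734 - 63) * 512 / 4096 = 2242333 (rounded down):

sudo debugfs -R "icheck 2242333" /dev/sdb1
sudo debugfs -R "ncheck 123456" /dev/sdb1

where 123456 stands for the inode number that icheck reports; ncheck then prints the path(s) of the affected file.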

P.S. There is an NTFS tool that can do logical sector to file mapping. It is called NFI (NTFS File Sector Information Utility) and is part of the download available here: http://support.microsoft.com/kb/253066 . (Obviously, to use NFI.exe you first need to map the physical sector to a logical sector based on where the partition is on the drive.) (The original 'dd_rescue', different from 'ddrescue', called NFI.exe in the ddr2nfi.pl script: http://www.mail-archive.com/bug-ddrescue@gnu.org/msg00088.html .)

Discussion