## [Smartmontools-support]Re: Explanation of "errors"

From: Leonid A Broukhis - 2003-08-05 21:42:06

On Tue, Aug 05, 2003 at 03:21:03PM -0500, Bruce Allen wrote:

> I think I understand this. A sector goes bad when it's damaged -- for
> example by contact with a dust particle. How does the disk know that the
> sector is bad? It's because the information is stored redundantly on the
> sector using ECC codes. The consistency of the data with the ECC codes,
> or lack of consistency, tells the drive if the data is right or wrong,
> typically with a probability of 1-10^-14. So that's how the drive can
> recognize if a sector has gone bad.

My understanding is that with GMR and other "near-quantum" effects, the read operation is always probabilistic, and a low rate of "soft" (correctable) read errors is expected by design.

That said, you're mixing the objectives of ECC and checksumming. An ECC algorithm is usually designed to correct N bit errors in a block M bits long, and detect N + 1 errors. If the number of errors is actually higher, they will be "corrected" in a wrong way. Here's an example:

Consider ECCs for 1 bit. Duplicating that bit will give you:

    00 -> 0
    11 -> 1
    01 -> error
    10 -> error

Here we're not able to correct anything (N = 0), but we're able to detect all 1-bit errors in a 2-bit block.

Triplicating:

    000 -> 0
    001, 010, 100 -> 0, corrected
    111 -> 1
    110, 101, 011 -> 1, corrected

Here we're able to correct all 1-bit errors, but we're not able to detect any 2-bit errors. Such ECCs do not provide "graceful degradation" and are not recommended when used alone, but in conjunction with a CRC check they are OK, although by themselves they do not offer any help in finding whether the data are reliable.

Quadruplicating:

    0000 -> 0
    0001, 0010, 0100, 1000 -> 0, corrected
    1111 -> 1
    1110, 1101, 1011, 0111 -> 1, corrected
    0011, 0101, 0110, 1001, 1010, 1100 -> error

Here we're able to correct all 1-bit errors and to detect all 2-bit errors.
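The three repetition codes above can be checked mechanically. The following is only a sketch, assuming a plain majority-vote decoder (which matches the tables for 2, 3 and 4 copies); the names `decode` and `outcomes` are made up for this illustration and do not describe what any real drive does.

```python
from itertools import product

def decode(block: str):
    """Decode an n-fold repetition code by majority vote.

    Returns (bit, status), where status is "clean", "corrected" or "error".
    A tie (only possible for an even number of copies) is reported as an error.
    """
    n = len(block)
    ones = block.count("1")
    zeros = n - ones
    if ones == zeros:
        return None, "error"
    bit = "1" if ones > zeros else "0"
    status = "clean" if ones in (0, n) else "corrected"
    return bit, status

def outcomes(n: int, sent: str = "0"):
    """Map each error count k to the set of decoder verdicts seen for it."""
    result = {}
    for bits in product("01", repeat=n):
        block = "".join(bits)
        k = sum(b != sent for b in block)  # number of flipped bits
        bit, status = decode(block)
        if status == "error":
            verdict = "detected"
        elif bit == sent:
            verdict = "ok"
        else:
            verdict = "miscorrected"  # decoder is confident but wrong
        result.setdefault(k, set()).add(verdict)
    return result

for n in (2, 3, 4):
    print(n, outcomes(n))
```

With three copies, every 2-bit error comes out as "miscorrected", which is the "corrected in a wrong way" case; with four copies, every 2-bit error is "detected" instead.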
The integrity check is done by computing a checksum (CRC) after the ECC does its job. If the CRC is wrong, it is too late to relocate. But it is possible, if the ECC can correct up to 32 single-bit errors in a 512-byte block (it's just an example; I do not know what ECCs are used in real devices), to relocate a sector if it was read with more than 24 errors. It is my understanding that this is exactly what the off-line test was supposed to do. And, I believe, that's exactly what it did, exhausting the pool of replacement sectors "prematurely" and causing disks to fail early.

> By the way, there is a nice document here:
> http://bazaclub.starnet.ru/hwdiver/smartdoc.html
> explaining the meanings of different Attributes. But I can't read
> Russian. Can you or a friend translate it?

Russian is my first language. That document is clearly a compilation, and not a very fresh one (2001): the conveyance test is not mentioned at all.

Leo
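The relocation argument in the message (relocate a sector while the ECC can still recover it, because once the CRC fails it is too late) can be sketched roughly as below. Everything here is hypothetical: `classify_read`, its inputs and the use of CRC-32 are assumptions for illustration, and the figures 32 and 24 are the message's example numbers, not parameters of any real drive.

```python
import zlib

ECC_LIMIT = 32       # example: most bit errors the ECC can correct per 512-byte block
RELOCATE_ABOVE = 24  # example: relocate once a read needed more corrections than this

def classify_read(corrected_bits: int, data: bytes, stored_crc: int) -> str:
    """Hypothetical policy applied after the ECC has already run on a sector.

    corrected_bits: how many bit errors the ECC reported fixing (assumed input).
    data:           sector contents after ECC correction.
    stored_crc:     checksum recorded when the sector was written (CRC-32 here,
                    chosen only because it is in the standard library).
    """
    if zlib.crc32(data) != stored_crc:
        return "unrecoverable"  # ECC failed or miscorrected: too late to relocate
    if corrected_bits > RELOCATE_ABOVE:
        return "relocate"       # data is still good, but the sector is marginal
    return "ok"

sector = b"\x00" * 512
crc = zlib.crc32(sector)
print(classify_read(3, sector, crc))
print(classify_read(25, sector, crc))
```

The gap between `RELOCATE_ABOVE` and `ECC_LIMIT` is the safety margin; the lower the threshold, the sooner the pool of replacement sectors runs out.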