From: Brad <bra...@gm...> - 2008-06-13 06:21:58
Hi.

I don't understand what an 'offline' test does and would appreciate some help in trying to understand it.

I have a machine with a Western Digital 500GB SATA-2 disk drive (model WDC WD5000AAKS-00YGA0). The drive has a single bad block, which prompted me to install and read up on Smartmontools. I'm running version 5.36 on Linux 2.6.23.14 & Slackware 12.

Reading the documentation I can understand - from a user/admin point of view - how the 'self-tests' work. When I run an explicit self-test using 'smartctl -t short/long/conveyance' I can 'see' that things are happening:

 - the 'Self-test execution status' field tells me that a self-test is running, and how much of the test remains;
 - the results of the test are posted in the 'Self-test log'.

The self-tests are tangible; I start them, I can see that they're running, I can see when they finished, and I can see their results.

But when I run a 'smartctl -t offline' command I can discern no such feedback at all. Running 'smartctl -a' before, during and (presumably) after an 'offline' test, a 'diff' of the smartctl output shows me that only one thing changed. Before my first 'offline' test the 'Offline data collection status' was as follows:

    Offline data collection status: (0x85) Offline data collection activity
    was aborted by an interrupting command from host.

and during the test it was:

    Offline data collection status: (0x84) Offline data collection activity
    was suspended by an interrupting command from host.

But the status stays at that latter value - 0x84 - hours after the offline test has supposedly completed, so I can't see that it's any use in trying to work out whether an 'offline' test is taking place. Also, no new errors were logged for the bad block, nor did any attribute values increase. I would have thought that the 'offline' test would have prompted some new errors in the SMART error log at least.

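As a rough aid to reading the two status values quoted above, here is a minimal Python sketch of how the status byte might be decoded. The function and table names are hypothetical and not part of smartmontools. The messages for codes 0x04 and 0x05 match the smartctl output quoted in this post; the remaining entries, and the reading of the top bit as "automatic offline data collection enabled", are my understanding of the ATA SMART status byte and should be treated as assumptions.

```python
# Hypothetical helper for illustration only; not a smartmontools API.
AUTO_OFFLINE_BIT = 0x80  # assumed: top bit flags automatic offline collection

# Low 7 bits of the status byte. Codes 0x04 and 0x05 are confirmed by the
# smartctl output quoted in the post; the other entries are assumptions.
STATUS_TEXT = {
    0x00: "was never started",
    0x02: "was completed without error",
    0x03: "is in progress",
    0x04: "was suspended by an interrupting command from host",
    0x05: "was aborted by an interrupting command from host",
    0x06: "was aborted by the device with a fatal error",
}


def decode_offline_status(status_byte):
    """Split the status byte into (auto_enabled, code, description)."""
    auto_enabled = bool(status_byte & AUTO_OFFLINE_BIT)
    code = status_byte & 0x7F
    description = STATUS_TEXT.get(code, "is reserved or vendor specific")
    return auto_enabled, code, description


if __name__ == "__main__":
    for value in (0x85, 0x84):  # the two values seen before and during the test
        auto_enabled, code, description = decode_offline_status(value)
        print(f"0x{value:02x}: auto offline enabled={auto_enabled}, "
              f"code=0x{code:02x}, activity {description}")
```

Under these assumptions both observed values have the top bit set and differ only in the low bits (0x05 aborted versus 0x04 suspended), so neither of them corresponds to "in progress" or "completed".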
(Funnily enough a 'short' self-test almost immediately terminated with a 'read failure' on the known bad block, but the 2.5 hour 'long' self-test completed without error, which has me scratching my head a little too. But since the raw value for ID 198, 'Offline_Uncorrectable', is zero - although the 'WORST' value is one less than its current value - I'm assuming that the ECC checksum is allowing the drive to repair the contents of the block? I'll do a write to the block after all these various tests to try and repair it.)

Furthermore the 'offline' test on my 500 GB drive apparently takes 3.5 hours to run:

    Total time to complete Offline data collection: (12600) seconds.

and the man page for smartd says that, if I turn on SMART automatic offline testing with '-o on', the tests will run 'every four hours'. I don't much like the idea of the drive spending 87.5% of its time - 3.5 hours out of every 4 - running tests, especially when I can't 'see' what it's doing.

Could someone help me out with the following questions about 'offline' tests?

o What exactly does an 'offline' test do? I gather a 'long' self-test does a media scan, reading every sector on the disk, and so forth. Does an offline test do that as well? Does it do anything that a 'long' self-test doesn't do? Should I even bother with 'offline' tests, since they irk me so much by being invisible? :-)

o How can I tell when an offline test is running? The 'Offline data collection status' field doesn't seem reliable; it changed from 0x85 to 0x84 once I started the test, but stayed on that value long after the 3.5 hour duration of the test had well and truly passed.

o If an offline test takes 3.5 hours to run, does it make sense to run smartctl with '-o on' or put '-o on' in my smartd.conf?
  Even if the disk suspends testing when queries come in from the Operating System I would think the test would still slow down performance to a degree, if only because it's moving the heads away from the 'hot spots' being used by my application(s). And I don't much like the idea of my disk drive spending practically all of its time running flat-out doing tests that I can't really see :-) Should I maybe just turn off automatic offline tests and just schedule short and long self-tests through smartd?

One other quick question, if I may:

o How was it that the 'long' self-test finished without error - after taking 2.5 hours (an hour shorter than the 'offline' test is supposed to take) - but the 'short' self-test finished after just a few seconds with a 'read error' on the known bad block? Does a 'short' test go and deliberately re-read known bad blocks (that haven't yet been remapped)? Is a list of such block addresses, their LBA numbers, accessible?

Many thanks for any advice - I'd like to know more about these 'offline' tests and whether I should stress my poor drive with them :-)

Regards,

Brad
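
For reference, the 87.5% figure mentioned above can be reproduced with a short Python sketch. It uses only the two numbers quoted in this post (12600 seconds from the smartctl output, and the 'every four hours' interval from the smartd man page). The function name is hypothetical, and the calculation assumes the drive really runs the whole routine uninterrupted in every interval, which is exactly the point in question, so treat the result as a worst case rather than measured behaviour.

```python
OFFLINE_TEST_SECONDS = 12600      # 'Total time to complete Offline data collection'
AUTO_OFFLINE_INTERVAL_HOURS = 4   # 'every four hours', as quoted from the smartd man page


def offline_duty_cycle(test_seconds, interval_hours):
    """Fraction of each interval spent testing, assuming an uninterrupted run."""
    return test_seconds / (interval_hours * 3600)


if __name__ == "__main__":
    test_hours = OFFLINE_TEST_SECONDS / 3600
    fraction = offline_duty_cycle(OFFLINE_TEST_SECONDS, AUTO_OFFLINE_INTERVAL_HOURS)
    print(f"{test_hours:.1f} h of every {AUTO_OFFLINE_INTERVAL_HOURS} h = {fraction:.1%}")
```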