From: Fredrik P. <fr...@br...> - 2004-05-02 14:37:39
Hello!

I'm new to this list, but I've browsed the archive for my particular
problem before posting. I've got a Samsung SV1604N (160GB, 5400rpm)
that I ran the long test on. (Like so: 'smartctl -t long', perhaps I
should've included '-F samsung'?)

It completely KILLED the HD!

After about an hour, this started to turn up when doing 'dmesg':

May 2 13:23:30 rostig kernel: hdh: irq timeout: status=0xd0 { Busy }
May 2 13:23:31 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
May 2 13:23:31 rostig kernel: hdh: drive not ready for command
May 2 13:23:32 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
May 2 13:23:32 rostig kernel: hdh: drive not ready for command
May 2 13:23:33 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
May 2 13:23:33 rostig kernel: hdh: drive not ready for command

Not good. I've also configured SMART to send me emails. I received four of
those, within a four-second period starting at 13:23:30.

First:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, not capable of SMART self-check

Second:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, failed to read SMART Attribute Data

Third:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, Read SMART Error Log Failed

Fourth:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, Read SMART Self Test Log Failed

After that, 'smartctl -a /dev/hdh/' claimed that /dev/hdh wasn't able to do
SMART-communication. I then rebooted the machine. Now, the drive won't even
show up. 'dmesg' shows this:

hda: Conner Peripherals 850MB - CFS850A, ATA DISK drive
hdc: SAMSUNG SV1204H, ATA DISK drive
hde: WDC WD1200AB-00CBA1, ATA DISK drive
hdf: WDC WD1200AB-00CBA1, ATA DISK drive
hdg: Maxtor 6Y120L0, ATA DISK drive

No hdh anywhere.

Disaster. What can possibly have happened here? The HD was fairly new
(just a few months old) and has NOT been running 24/7 or anything like
that, although it's been running for 5-8 hours every day.

Any help or hints about this problem would be greatly appreciated, thanks!

/Fredrik Persson
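
The kernel lines above report status=0xd0 but only print { Busy }. As a rough illustration of what else is in that byte, here is a minimal Python sketch that decodes an ATA status value into bit names. It assumes the classic ATA status register layout and borrows the bit names the old Linux IDE driver prints; the function name decode_ata_status is made up for this example and is not part of any tool mentioned in the thread.

```python
# Bit masks of the 8-bit ATA status register, highest bit first.
# Assumption: classic ATA layout; names follow the old Linux IDE driver's wording.
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_ata_status(status):
    """Return the names of all bits set in an 8-bit ATA status value."""
    if not 0 <= status <= 0xFF:
        raise ValueError("status must fit in one byte")
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


if __name__ == "__main__":
    # Value taken from the dmesg lines in the post.
    print(decode_ata_status(0xD0))
```

While the Busy bit is set the other bits are not meaningful, which is presumably why the kernel message shows only { Busy }: the drive never finished the command it was given.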
From: Bruce A. <ba...@gr...> - 2004-05-03 15:26:48
Hi Fredrik,

On Sun, 2 May 2004, Fredrik Persson wrote:
> I'm new to this list, but I've browsed the archive for my particular
> problem before posting. I've got a Samsung SV1604N (160GB, 5400rpm)
> that I ran the long test on. (Like so: 'smartctl -t long', perhaps I
> should've included '-F samsung'?)
>
> It completely KILLED the HD!

I'm sorry to hear this. If it's any consolation, the disk would have died
anyway -- the long self-test was simply the little bit of extra load that
pushed the disk past its failure point.

Was there any prior sign that the disk was 'in trouble'?

The long self-test read scans the entire disk surface. If the disk has an
electronic or mechanical problem, then this extended read scan can provoke
failure. (This type of failure is also commonly seen when people back up
disks. Because the load of reading all the data from the disk is a heavy
one, it often leads to catastrophic failure in the middle of the backup.
This is why you should always have a PAIR of backups, and overwrite the
older of the two, but preserve the newer of the two.)

Before you give up on the disk, double check the power and signal cabling
to be sure that nothing has worked loose. Additional comments below.

> After about an hour, this started to turn up when doing 'dmesg':
>
> May 2 13:23:30 rostig kernel: hdh: irq timeout: status=0xd0 { Busy }
> May 2 13:23:31 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> May 2 13:23:31 rostig kernel: hdh: drive not ready for command
> May 2 13:23:32 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> May 2 13:23:32 rostig kernel: hdh: drive not ready for command
> May 2 13:23:33 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> May 2 13:23:33 rostig kernel: hdh: drive not ready for command

The drive simply stopped responding to commands.

> Not good. I've also configured SMART to send me emails. I received four of
> those, within a four-second period starting at 13:23:30.
>
> First:
>
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, not capable of SMART self-check
>
> Second:
>
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, failed to read SMART Attribute Data
>
> Third:
>
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, Read SMART Error Log Failed
>
> Fourth:
>
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, Read SMART Self Test Log Failed

These four messages appeared because the disk wasn't reachable any more.

> After that, 'smartctl -a /dev/hdh/' claimed that /dev/hdh wasn't able to do
> SMART-communication. I then rebooted the machine. Now, the drive won't even
> show up. 'dmesg' shows this:
>
> hda: Conner Peripherals 850MB - CFS850A, ATA DISK drive
> hdc: SAMSUNG SV1204H, ATA DISK drive
> hde: WDC WD1200AB-00CBA1, ATA DISK drive
> hdf: WDC WD1200AB-00CBA1, ATA DISK drive
> hdg: Maxtor 6Y120L0, ATA DISK drive
>
> No hdh anywhere.

As I said, double check the power and signal cabling. But they are
probably OK -- this looks like a straightforward electronic (not
mechanical) drive failure.

> Disaster. What can possibly have happened here? The HD was fairly new
> (just a few months old) has NOT been running 24/7 or anything like
> that although it's been running for 5-8 hours every day.

Really there are just three possibilities: (1) the additional load of a
self-test provoked catastrophic failure (which would have happened anyway
the next time the disk was under load), (2) sudden electrical failure
unrelated to the self-test (e.g., a voltage spike killed a chip in the
disk), or (3) cabling problems (do double check to eliminate this
possibility).

> Any help or hints about this problem would be greatly appreciated,

If the disk has failed (and it's just a few months old) it should still be
under warranty. Hopefully you can re-create the data that was on it.

Cheers,
Bruce
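
The advice above about keeping a PAIR of backups and always overwriting the older one can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two backups are single files, "older" is judged by file modification time, and the function name pick_backup_to_overwrite is hypothetical.

```python
import os


def pick_backup_to_overwrite(path_a, path_b):
    """Return the backup path that should receive the next backup.

    A slot that does not exist yet is used first. Otherwise the file with
    the older modification time is chosen, so the newer backup survives
    even if the source disk dies in the middle of the run.
    """
    if not os.path.exists(path_a):
        return path_a
    if not os.path.exists(path_b):
        return path_b
    if os.path.getmtime(path_a) <= os.path.getmtime(path_b):
        return path_a
    return path_b
```

The point of the design is that a backup which fails halfway only ever damages the older copy; the most recent complete backup is never the one being written to.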
From: Fredrik P. <fr...@br...> - 2004-05-03 20:52:12
Hello, and thanks for your quick reply.

Short: it came back to life! How? I shut it down in the evening and started it
again about 12 hours later, and there the disk was, alive and kicking. So the
case went like this: booted the machine, ran the long self test, got the
errors I described below, rebooted the machine to see if that got the drive
working. It didn't, it got worse, the drive didn't exist at all
(no /dev/hdh). Turned it off, waited 12 hours, turned it on and everything
was back to normal.

Before you dismiss me as a nutcase, please read the comments below. However,
what I'd *really* like to know is this: would '-F samsung' have made any
difference when I ran the long selftest?

On Monday 03 May 2004 17.26, Bruce Allen wrote:
> Hi Fredrik,
>
> On Sun, 2 May 2004, Fredrik Persson wrote:
> > I'm new to this list, but I've browsed the archive for my particular
> > problem before posting. I've got a Samsung SV1604N (160GB, 5400rpm)
> > that I ran the long test on. (Like so: 'smartctl -t long', perhaps I
> > should've included '-F samsung'?)
> >
> > It completely KILLED the HD!
>
> I'm sorry to hear this. If it's any consolation, the disk would have died
> anyway -- the long self-test was simply the little bit of extra load that
> pushed the disk past its failure point.
>
> Was there any prior sign that the disk was 'in trouble'?

Maybe. This is what I get from 'smartctl -a -F samsung /dev/hdh':
(sorry about the linebreaks, I hope it's still readable.)

----------------------------------------------
smartctl version 5.1-18 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SV1604N
Serial Number:    S01FJ10X102037
Firmware Version: TR100-24
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Mon May 3 22:32:06 2004 CEST

==> WARNING: Contact developers; may need -F samsung enabled.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Off-line data collection status:  (0x00) Offline data collection activity was
                                         never started.
                                         Auto Off-line Data Collection: Disabled.
Self-test execution status:       ( 39)  The self-test routine was interrupted
                                         by the host with a hard or soft reset.
Total time to complete off-line
data collection:                  (7200) seconds.
Offline data collection
capabilities:                     (0x1b) SMART execute Offline immediate.
                                         Automatic timer ON/OFF support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         No Conveyance Self-test supported.
                                         No Selective Self-test supported.
SMART capabilities:             (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
Error logging capability:         (0x01) Error logging supported.
                                         No General Purpose Logging support.
Short self-test routine
recommended polling time:         ( 1)   minutes.
Extended self-test routine
recommended polling time:         ( 120) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   073   070   000    Pre-fail  Always       -       4864
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       171
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       123448
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       101
194 Temperature_Celsius     0x0022   169   115   000    Old_age   Always       -       23
195 Hardware_ECC_Recovered  0x000a   100   100   000    Old_age   Always       -       11375294
196 Reallocated_Event_Count 0x0012   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0033   253   253   010    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   253   253   010    Pre-fail  Offline      -       0
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       1
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       0

SMART Error Log Version: 1
Warning: ATA error count 1 inconsistent with error log pointer 5

ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Timestamp = decimal seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.

Error 1 occurred at disk power-on lifetime: 0 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 01 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  b1 c0 00 01 00 00 a0 00 1663959.040  DEVICE CONFIGURATION RESTORE
  ec 00 03 01 00 00 a0 00 1663959.040  IDENTIFY DEVICE
  91 00 3f 01 00 00 af 00 1663959.040  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 01 00 00 a0 00 1663959.040  RECALIBRATE [OBS-4]
  ec 00 01 01 00 00 a0 00  623771.648  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
No self-tests have been logged
----------------------------------------------

I think there are a few interesting things to note here:

1. The self-test execution status. It says it was interrupted by the host
with a hard or soft reset after 39 minutes, which sounds correct according to
what I saw when it happened. So the disk acknowledges that something went
wrong, the question is what?

2. There's a SMART attribute called "Hardware_ECC_Recovered", with the value
11375294. I'm not sure what this means, but ECC should be some kind of error
correction, and the value is high.

3. The "UDMA_CRC_Error_Count" is 1. Could this have happened during the failed
self-test, or even be the cause of it? If so, what could have triggered this
error?

4. There is one error in the log, which seems to have occurred the first time
the disk was powered up.

Apart from this, I cannot see anything that could've caused this error.

> The long self-test read scans the entire disk surface. If the disk has an
> electronic or mechanical problem, then this extended read scan can provoke
> failure. (This type of failure is also commonly seen when people back up
> disks. Because the load of reading all the data from the disk is a heavy
> one, it often leads to catastrophic failure in the middle of the backup.
> This is why you should always have a PAIR of backups, and overwrite the
> older of the two, but preserve the newer of the two.)
>
> Before you give up on the disk, double check the power and signal cabling
> to be sure that nothing has worked loose. Additional comments below.

Power and signal cabling are untouched, and the disk is working again. I
didn't even open the machine.

> > After about an hour, this started to turn up when doing 'dmesg':
> >
> > May 2 13:23:30 rostig kernel: hdh: irq timeout: status=0xd0 { Busy }
> > May 2 13:23:31 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May 2 13:23:31 rostig kernel: hdh: drive not ready for command
> > May 2 13:23:32 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May 2 13:23:32 rostig kernel: hdh: drive not ready for command
> > May 2 13:23:33 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May 2 13:23:33 rostig kernel: hdh: drive not ready for command
>
> The drive simply stopped responding to commands.
>
> > Not good. I've also configured SMART to send me emails. I received four
> > of those, within a four-second period starting at 13:23:30.
> >
> > First:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, not capable of SMART self-check
> >
> > Second:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, failed to read SMART Attribute Data
> >
> > Third:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, Read SMART Error Log Failed
> >
> > Fourth:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, Read SMART Self Test Log Failed
>
> These four messages are because the disk wasn't reachable any more.
>
> > After that, 'smartctl -a /dev/hdh/' claimed that /dev/hdh wasn't able to
> > do SMART-communication. I then rebooted the machine. Now, the drive won't
> > even show up. 'dmesg' shows this:
> >
> > hda: Conner Peripherals 850MB - CFS850A, ATA DISK drive
> > hdc: SAMSUNG SV1204H, ATA DISK drive
> > hde: WDC WD1200AB-00CBA1, ATA DISK drive
> > hdf: WDC WD1200AB-00CBA1, ATA DISK drive
> > hdg: Maxtor 6Y120L0, ATA DISK drive
> >
> > No hdh anywhere.
>
> As I said, double check the power and signal cabling. But they are
> probably OK -- this looks like a straightforward electronic (not
> mechanical) drive failure.

Cabling untouched, and the disk works again as it has for months.

I'm curious; does this happen often? I mean, where the disk gets an error like
this and then works again after 12 hours switched off?

> > Disaster. What can possibly have happened here? The HD was fairly new
> > (just a few months old) has NOT been running 24/7 or anything like
> > that although it's been running for 5-8 hours every day.
>
> Really there are just three possibilities. (1) The additional load of a
> self-test provoked catastrophic failure (would have happened anyway, when
> the disk was under load in the future) (2) sudden electrical failure
> unrelated to self-test (eg, voltage spike killed a chip in the disk) or
> (3) cabling problems (do double check to eliminate this possibility).

I did run self-tests on three other disks simultaneously, and they finished
fine. A cabling problem is not very probable, and voltage spikes are extremely
rare here. (Sweden)

> > Any help or hints about this problem would be greatly appreciated,
>
> If the disk has failed (and it's just a few months old) it should still be
> under warranty. Hopefully you can re-create the data that was on it.

The disk is alive so I can take a backup now. However, won't I have a
difficult time claiming warranty since it is fully functional now? Would you
have tried to get a new disk if you were in my shoes?

>
> Cheers,
> Bruce
>

Bruce, thank you very much for this very extensive reply!

Best Regards
Fredrik Persson
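
A note on the "Self-test execution status: ( 39)" line in the smartctl output above: as far as I understand the ATA SMART data layout, the number in parentheses is the raw status byte, not a duration. The upper four bits hold a status code and the lower four bits hold the percentage of the test still remaining, in steps of ten. The Python sketch below decodes it under that assumption; the function name and the shortened wording of the status texts are mine, not smartctl's.

```python
# Meaning of the upper nibble of the self-test execution status byte.
# Assumption: codes as described in the ATA/ATAPI SMART documentation;
# the wording is abbreviated for this example.
SELF_TEST_STATUS = {
    0: "completed without error, or no self-test has been run",
    1: "aborted by the host",
    2: "interrupted by the host with a hard or soft reset",
    3: "fatal or unknown error, test could not complete",
    4: "completed with a failed element (unknown element)",
    5: "completed with a failed electrical element",
    6: "completed with a failed servo/seek element",
    7: "completed with a failed read element",
    15: "in progress",
}


def decode_self_test_status(byte_value):
    """Split the status byte into (status text, percent of test remaining)."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("status must fit in one byte")
    code = byte_value >> 4                  # upper nibble: status code
    remaining = (byte_value & 0x0F) * 10    # lower nibble: tens of percent
    text = SELF_TEST_STATUS.get(code, "reserved or vendor specific")
    return text, remaining


if __name__ == "__main__":
    # 39 is the value reported for the SV1604N in this thread.
    print(decode_self_test_status(39))
```

Read this way, 39 (0x27) would mean the test was interrupted by a reset with roughly 70% of it still to go, rather than "after 39 minutes".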
From: Bruce A. <ba...@gr...> - 2004-05-03 21:19:05
Hi Fredrik,

> Hello, and thanks for your quick reply.
>
> Short: it came back to life! How? I shut it down in the evening and started it
> again about 12 hours later, and there the disk was, alive and kicking. So the
> case went like this: booted the machine, ran the long self test, got the
> errors I described below, rebooted the machine to see if that got the drive
> working. It didn't, it got worse, the drive didn't exist at all
> (no /dev/hdh). Turned it off, waited 12 hours, turned it on and everything
> was back to normal.

I'd try another long self-test to see what happens.

> Before you dismiss me as a nutcase, please read the comments below. However,
> what I'd *really* like to know is this: would '-F samsung' have made any
> difference when I ran the long selftest?

None. -F samsung only affects the interpretation of the results from the
error and self-test logs. It doesn't affect how a self-test is done.

> 199 UDMA_CRC_Error_Count 0x000b 100 100 051 Pre-fail Always
> - 1

This is a sign of a cabling problem. Check your cables.

> SMART Error Log Version: 1
> Warning: ATA error count 1 inconsistent with error log pointer 5

You probably need -F samsung2 (use release 5.30 of smartmontools).

> SMART Self-test log structure revision number 1
> No self-tests have been logged

You should now show a self-test logged. If not, try -F samsung and -F
samsung2.

> 1. The self-test execution status. It says it was interrupted by the host
> with a hard or soft reset after 39 minutes, which sounds correct according to
> what I saw when it happened. So the disk acknowledges that something went
> wrong, the question is what?

Could be a cabling problem.

> 2. There's a SMART attribute called "Hardware_ECC_Recovered", with the value
> 11375294. I'm not sure what this means, but ECC should be some kind of error
> correction, and the value is high.

Ignore it.

> 3. The "UDMA_CRC_Error_Count" is 1. Could this have happened during the failed
> self-test, or even be the cause of it? If so, what could have triggered this
> error?

Cabling problem.

> Power and signal cabling are untouched, and the disk is working again. I
> didn't even open the machine.

Consistent with an intermittent cable or power connection. Check the
cabling.

> Cabling untouched, and the disk works again as it has for months.

I suggest you check the cabling.

> I'm curious; does this happen often? I mean, where the disk gets an error like
> this and then works again after 12 hours switched off?

It sounds like an intermittent electrical or signal connection. Check the
power and signal cables.

> I did run self-tests on three other disks simultaneously, and they finished
> fine. A cabling problem is not very probable, and voltage spikes are extremely
> rare here. (Sweden)

The UDMA CRC count is an indication of a cabling problem.

> The disk is alive so I can take a backup now. However, won't I have a
> difficult time claiming warranty since it is fully functional now? Would you
> have tried to get a new disk if you were in my shoes?

No. I'd check the cables (unplug and replug, or change signal cables), then
run a long self-test. It should appear in the logs with -F samsung or
-F samsung2.

Cheers,
Bruce
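
Since the reply above keeps coming back to attribute 199, here is a small Python sketch for pulling the UDMA_CRC_Error_Count raw value out of an attribute table like the one posted earlier in the thread. It assumes the ten-column layout shown there (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value) with each row on a single unwrapped line; the function names are hypothetical and the sample rows are copied from the thread.

```python
# Two attribute rows copied from the smartctl output in this thread.
SAMPLE = """\
198 Offline_Uncorrectable 0x0031 253 253 010 Pre-fail Offline - 0
199 UDMA_CRC_Error_Count 0x000b 100 100 051 Pre-fail Always - 1
"""


def parse_attribute_rows(text):
    """Map attribute ID -> (name, raw value string) for each table row."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            rows[int(fields[0])] = (fields[1], fields[9])
    return rows


def crc_error_count(text):
    """Return the raw value of attribute 199, or None if it is not listed."""
    row = parse_attribute_rows(text).get(199)
    return None if row is None else int(row[1])


if __name__ == "__main__":
    count = crc_error_count(SAMPLE)
    if count:
        print("UDMA CRC errors:", count, "- check the signal cable")
```

A nonzero count only says that at least one transfer between drive and controller failed its CRC check at some point in the drive's life, so it is a hint to reseat or replace the cable and not proof of when the error happened.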