Thread: [smartmontools-support]'smartctl -t long /dev/hdh' killed my Samsung SV1604N

[smartmontools-support]'smartctl -t long /dev/hdh' killed my Samsung SV1604N

From: Fredrik P. <fr...@br...> - 2004-05-02 14:37:39

Hello!

I'm new to this list, but I've browsed the archive for my particular problem 
before posting. I've got a Samsung SV1604N (160GB, 5400rpm) that I ran the 
long test on. (Like so: 'smartctl -t long', perhaps I should've included '-F 
samsung'?)

It completely KILLED the HD!

After about an hour, this started to turn up when doing 'dmesg':

May  2 13:23:30 rostig kernel: hdh: irq timeout: status=0xd0 { Busy }
May  2 13:23:31 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
May  2 13:23:31 rostig kernel: hdh: drive not ready for command
May  2 13:23:32 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
May  2 13:23:32 rostig kernel: hdh: drive not ready for command
May  2 13:23:33 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
May  2 13:23:33 rostig kernel: hdh: drive not ready for command

Not good. I've also configured SMART to send me emails. I received four of 
those, within a four-second period starting at 13:23:30.

First:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, not capable of SMART self-check

Second:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, failed to read SMART Attribute Data

Third:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, Read SMART Error Log Failed

Fourth:

The following warning/error was logged by the smartd daemon:
Device: /dev/hdh, Read SMART Self Test Log Failed

After that, 'smartctl -a /dev/hdh/' claimed that /dev/hdh wasn't able to do 
SMART-communication. I then rebooted the machine. Now, the drive won't even 
show up. 'dmesg' shows this:

hda: Conner Peripherals 850MB - CFS850A, ATA DISK drive
hdc: SAMSUNG SV1204H, ATA DISK drive
hde: WDC WD1200AB-00CBA1, ATA DISK drive
hdf: WDC WD1200AB-00CBA1, ATA DISK drive
hdg: Maxtor 6Y120L0, ATA DISK drive

No hdh anywhere.

Disaster. What can possibly have happened here? The HD was fairly new (just a 
few months old) and has NOT been running 24/7 or anything like that, although 
it's been running for 5-8 hours every day.

Any help or hints about this problem would be greatly appreciated, thanks!

/Fredrik Persson

Re: [smartmontools-support]'smartctl -t long /dev/hdh' killed my Samsung SV1604N

From: Bruce A. <ba...@gr...> - 2004-05-03 15:26:48

Hi Fredrik,

On Sun, 2 May 2004, Fredrik Persson wrote:

> I'm new to this list, but I've browsed the archive for my particular
> problem before posting. I've got a Samsung SV1604N (160GB, 5400rpm)
> that I ran the long test on. (Like so: 'smartctl -t long', perhaps I
> should've included '-F samsung'?)
> 
> It completely KILLED the HD!

I'm sorry to hear this.  If it's any consolation, the disk would have died
anyway -- the long self-test was simply the little bit of extra load that
pushed the disk past its failure point.

Was there any prior sign that the disk was 'in trouble'?

The long self-test read scans the entire disk surface.  If the disk has an
electronic or mechanical problem, then this extended read scan can provoke
failure.  (This type of failure is also commonly seen when people back up
disks.  Because the load of reading all the data from the disk is a heavy
one, it often leads to catastrophic failure in the middle of the backup.
This is why you should always have a PAIR of backups, and overwrite the
older of the two, but preserve the newer of the two.)
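
The two-backup rotation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two backups are single files, "older" is judged by file modification time, and the helper name pick_backup_to_overwrite is made up for this example.

```python
import os

def pick_backup_to_overwrite(path_a, path_b):
    """Return which of two backup slots the next backup should overwrite.

    An empty (missing) slot is used first; otherwise the file with the
    older modification time is chosen, so the newer backup always
    survives a disk failure in the middle of the backup run.
    """
    for path in (path_a, path_b):
        if not os.path.exists(path):
            return path
    return min((path_a, path_b), key=os.path.getmtime)
```

A backup script would call this once before each run and write only to the returned path, so the most recent good copy is never the one being overwritten.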

Before you give up on the disk, double check the power and signal cabling
to be sure that nothing has worked loose.  Additional comments below.

> After about an hour, this started to turn up when doing 'dmesg':
> 
> May  2 13:23:30 rostig kernel: hdh: irq timeout: status=0xd0 { Busy }
> May  2 13:23:31 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> May  2 13:23:31 rostig kernel: hdh: drive not ready for command
> May  2 13:23:32 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> May  2 13:23:32 rostig kernel: hdh: drive not ready for command
> May  2 13:23:33 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> May  2 13:23:33 rostig kernel: hdh: drive not ready for command

The drive simply stopped responding to commands.
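
For reference, the status byte 0xd0 in those kernel lines can be unpacked bit by bit. The sketch below is a minimal Python illustration, assuming the classic ATA status register layout and the bit names used in Linux IDE driver messages; the function name decode_ata_status is made up for this example. While the Busy bit is set the other bits are not meaningful, which is presumably why the kernel printed only { Busy }.

```python
# Bit masks of the ATA status register, highest bit first.
# Names follow the Linux IDE driver's log messages (an assumption
# made for readability, not an authoritative list).
ATA_STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

def decode_ata_status(status):
    """Return the names of the flags set in an ATA status byte.

    If Busy is set, only "Busy" is reported, because the remaining
    bits are not valid while the drive is busy.
    """
    if status & 0x80:
        return ["Busy"]
    return [name for mask, name in ATA_STATUS_BITS.items() if status & mask]

print(decode_ata_status(0xD0))  # the value from the dmesg lines
print(decode_ata_status(0x51))  # a typical "command failed" status
```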

> Not good. I've also configured SMART to send me emails. I received four of 
> those, within a four-second period starting at 13:23:30.
> 
> First:
> 
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, not capable of SMART self-check
> 
> Second:
> 
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, failed to read SMART Attribute Data
> 
> Third:
> 
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, Read SMART Error Log Failed
> 
> Fourth:
> 
> The following warning/error was logged by the smartd daemon:
> Device: /dev/hdh, Read SMART Self Test Log Failed

These four messages are because the disk wasn't reachable any more.

> After that, 'smartctl -a /dev/hdh/' claimed that /dev/hdh wasn't able to do 
> SMART-communication. I then rebooted the machine. Now, the drive won't even 
> show up. 'dmesg' shows this:
> 
> hda: Conner Peripherals 850MB - CFS850A, ATA DISK drive
> hdc: SAMSUNG SV1204H, ATA DISK drive
> hde: WDC WD1200AB-00CBA1, ATA DISK drive
> hdf: WDC WD1200AB-00CBA1, ATA DISK drive
> hdg: Maxtor 6Y120L0, ATA DISK drive
> 
> No hdh anywhere.

As I said, double check the power and signal cabling. But they are
probably OK -- this looks like a straightforward electronic (not
mechanical) drive failure.

> Disaster. What can possibly have happened here? The HD was fairly new
> (just a few months old) and has NOT been running 24/7 or anything like
> that, although it's been running for 5-8 hours every day.

Really there are just three possibilities.  (1) The additional load of a
self-test provoked catastrophic failure (would have happened anyway, when
the disk was under load in the future), (2) sudden electrical failure
unrelated to self-test (e.g., voltage spike killed a chip in the disk), or
(3) cabling problems (do double-check to eliminate this possibility).

> Any help or hints about this problem would be greatly appreciated,

If the disk has failed (and it's just a few months old) it should still be
under warranty.  Hopefully you can re-create the data that was on it.

Cheers,	
    Bruce

Re: [smartmontools-support]'smartctl -t long /dev/hdh' killed my Samsung SV1604N

From: Fredrik P. <fr...@br...> - 2004-05-03 20:52:12

Hello, and thanks for your quick reply.

Short: it came back to life! How? I shut it down in the evening and started it 
again about 12 hours later, and there the disk was, alive and kicking. So the 
case went like this: booted the machine, ran the long self test, got the 
errors I described below, rebooted the machine to see if that got the drive 
working. It didn't, it got worse, the drive didn't exist at all 
(no /dev/hdh). Turned it off, waited 12 hours, turned it on and everything 
was back to normal.

Before you dismiss me as a nutcase, please read the comments below. However, 
what I'd *really* like to know is this: would '-F samsung' have made any 
difference when I ran the long selftest?

On Monday 03 May 2004 17.26, Bruce Allen wrote:
> Hi Fredrik,
>
> On Sun, 2 May 2004, Fredrik Persson wrote:
> > I'm new to this list, but I've browsed the archive for my particular
> > problem before posting. I've got a Samsung SV1604N (160GB, 5400rpm)
> > that I ran the long test on. (Like so: 'smartctl -t long', perhaps I
> > should've included '-F samsung'?)
> >
> > It completely KILLED the HD!
>
> I'm sorry to hear this.  If it's any consolation, the disk would have died
> anyway -- the long self-test was simply the little bit of extra load that
> pushed the disk past its failure point.
>
> Was there any prior sign that the disk was 'in trouble'?

Maybe. This is what I get from 'smartctl -a -F samsung /dev/hdh': (sorry about 
the linebreaks, I hope it's still readable.)

----------------------------------------------

smartctl version 5.1-18 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SV1604N
Serial Number:    S01FJ10X102037
Firmware Version: TR100-24
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Mon May  3 22:32:06 2004 CEST

==> WARNING: Contact developers; may need -F samsung enabled.


SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Off-line data collection status: (0x00) Offline data collection activity was
                                        never started.
                                        Auto Off-line Data Collection: Disabled.
Self-test execution status:      (  39) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete off-line
data collection:                 (7200) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Automatic timer ON/OFF support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 120) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   073   070   000    Pre-fail  Always       -       4864
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       171
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       123448
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       101
194 Temperature_Celsius     0x0022   169   115   000    Old_age   Always       -       23
195 Hardware_ECC_Recovered  0x000a   100   100   000    Old_age   Always       -       11375294
196 Reallocated_Event_Count 0x0012   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0033   253   253   010    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0031   253   253   010    Pre-fail  Offline      -       0
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       1
200 Multi_Zone_Error_Rate   0x000b   100   100   051    Pre-fail  Always       -       0
201 Soft_Read_Error_Rate    0x000b   100   100   051    Pre-fail  Always       -       0

SMART Error Log Version: 1
Warning: ATA error count 1 inconsistent with error log pointer 5

ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Timestamp = decimal seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.

Error 1 occurred at disk power-on lifetime: 0 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 01 00 00 a0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  b1 c0 00 01 00 00 a0 00 1663959.040  DEVICE CONFIGURATION RESTORE
  ec 00 03 01 00 00 a0 00 1663959.040  IDENTIFY DEVICE
  91 00 3f 01 00 00 af 00 1663959.040  INITIALIZE DEVICE PARAMETERS [OBS-6]
  10 00 00 01 00 00 a0 00 1663959.040  RECALIBRATE [OBS-4]
  ec 00 01 01 00 00 a0 00  623771.648  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
No self-tests have been logged

----------------------------------------------

I think there are a few interesting things to note here:

1. The self-test execution status. It says it was interrupted by the host with a 
hard or soft reset after 39 minutes, which sounds correct according to what I 
saw when it happened. So the disk acknowledges that something went wrong, the 
question is what?

2. There's a SMART attribute called "Hardware_ECC_Recovered", with the value 
11375294. I'm not sure what this means, but ECC should be some kind of error 
correction, and the value is high.

3. The "UDMA_CRC_Error_Count" is 1. Could this have happened during the failed 
self-test, or even be the cause of it? If so, what could have triggered this 
error?

4. There is one error in the log, which seems to have occurred the first time 
the disk was powered up.

Apart from this, I cannot see anything that could've caused this error.

> The long self-test read scans the entire disk surface.  If the disk has an
> electronic or mechanical problem, then this extended read scan can provoke
> failure.  (This type of failure is also commonly seen when people back up
> disks.  Because the load of reading all the data from the disk is a heavy
> one, it often leads to catastrophic failure in the middle of the backup.
> This is why you should always have a PAIR of backups, and overwrite the
> older of the two, but preserve the newer of the two.)
>
> Before you give up on the disk, double check the power and signal cabling
> to be sure that nothing has worked loose.  Additional comments below.

Power and signal cabling are untouched, and the disk is working again. I 
didn't even open the machine.

> > After about an hour, this started to turn up when doing 'dmesg':
> >
> > May  2 13:23:30 rostig kernel: hdh: irq timeout: status=0xd0 { Busy }
> > May  2 13:23:31 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May  2 13:23:31 rostig kernel: hdh: drive not ready for command
> > May  2 13:23:32 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May  2 13:23:32 rostig kernel: hdh: drive not ready for command
> > May  2 13:23:33 rostig kernel: hdh: status timeout: status=0xd0 { Busy }
> > May  2 13:23:33 rostig kernel: hdh: drive not ready for command
>
> The drive simply stopped responding to commands.
>
> > Not good. I've also configured SMART to send me emails. I received four
> > of those, within a four-second period starting at 13:23:30.
> >
> > First:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, not capable of SMART self-check
> >
> > Second:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, failed to read SMART Attribute Data
> >
> > Third:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, Read SMART Error Log Failed
> >
> > Fourth:
> >
> > The following warning/error was logged by the smartd daemon:
> > Device: /dev/hdh, Read SMART Self Test Log Failed
>
> These four messages are because the disk wasn't reachable any more.
>
> > After that, 'smartctl -a /dev/hdh/' claimed that /dev/hdh wasn't able to
> > do SMART-communication. I then rebooted the machine. Now, the drive won't
> > even show up. 'dmesg' shows this:
> >
> > hda: Conner Peripherals 850MB - CFS850A, ATA DISK drive
> > hdc: SAMSUNG SV1204H, ATA DISK drive
> > hde: WDC WD1200AB-00CBA1, ATA DISK drive
> > hdf: WDC WD1200AB-00CBA1, ATA DISK drive
> > hdg: Maxtor 6Y120L0, ATA DISK drive
> >
> > No hdh anywhere.
>
> As I said, double check the power and signal cabling. But they are
> probably OK -- this looks like a straightforward electronic (not
> mechanical) drive failure.

Cabling untouched, and the disk works again as it has for months. 

I'm curious; does this happen often? I mean, where the disk gets an error like 
this and then works again after 12 hours switched off?

> > Disaster. What can possibly have happened here? The HD was fairly new
> > (just a few months old) and has NOT been running 24/7 or anything like
> > that, although it's been running for 5-8 hours every day.
>
> Really there are just three possibilities.  (1) The additional load of a
> self-test provoked catastrophic failure (would have happened anyway, when
> the disk was under load in the future), (2) sudden electrical failure
> unrelated to self-test (e.g., voltage spike killed a chip in the disk), or
> (3) cabling problems (do double-check to eliminate this possibility).

I did run selftests on three other disks simultaneously, and they finished 
fine. Cabling problem is not very probable, and voltage spikes are extremely 
rare here. (Sweden)

> > Any help or hints about this problem would be greatly appreciated,
>
> If the disk has failed (and it's just a few months old) it should still be
> under warranty.  Hopefully you can re-create the data that was on it.

The disk is alive so I can take a backup now. However, won't I have a 
difficult time claiming warranty since it is fully functional now? Would you 
have tried to get a new disk if you were in my shoes?

>
> Cheers,
>     Bruce
>

Bruce, thank you very much for this very extensive reply! 

Best Regards

Fredrik Persson

Re: [smartmontools-support]'smartctl -t long /dev/hdh' killed my Samsung SV1604N

From: Bruce A. <ba...@gr...> - 2004-05-03 21:19:05

Hi Fredrik,

> Hello, and thanks for your quick reply.
> 
> Short: it came back to life! How? I shut it down in the evening and started it 
> again about 12 hours later, and there the disk was, alive and kicking. So the 
> case went like this: booted the machine, ran the long self test, got the 
> errors I described below, rebooted the machine to see if that got the drive 
> working. It didn't, it got worse, the drive didn't exist at all 
> (no /dev/hdh). Turned it off, waited 12 hours, turned it on and everything 
> was back to normal.

I'd try another long self-test to see what happens.

> Before you dismiss me as a nutcase, please read the comments below. However, 
> what I'd *really* like to know is this: would '-F samsung' have made any 
> difference when I ran the long selftest?

None.  -F samsung only affects the interpretation of the results from the
error and self-test logs.  It doesn't affect how a self-test is done.

> 199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       1

This is a sign of a cabling problem.  Check your cables.
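
To keep an eye on this counter over time, the attribute table can be parsed from saved smartctl output. The following is a minimal Python sketch, assuming each attribute sits on one unwrapped line with the ten columns shown in the table header (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); the function name parse_attributes and the sample text are illustrative only.

```python
def parse_attributes(text):
    """Parse smartctl attribute rows into a dict keyed by attribute name.

    Rows are recognised by a numeric ID in the first column and at
    least ten whitespace-separated fields; everything else (headers,
    blank lines) is skipped. RAW_VALUE is kept as a string.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "id": int(fields[0]),
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9],
            }
    return attrs

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x000b   100   100   051    Pre-fail  Always       -       1
"""

attrs = parse_attributes(SAMPLE)
crc_errors = int(attrs["UDMA_CRC_Error_Count"]["raw"])
if crc_errors > 0:
    print("UDMA CRC errors logged:", crc_errors, "- check the signal cable")
```

Comparing the raw count before and after reseating the cable and re-running the self-test shows whether new CRC errors are still being logged.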

> SMART Error Log Version: 1
> Warning: ATA error count 1 inconsistent with error log pointer 5

You probably need -F samsung2 (use release 5.30 of smartmontools).

> SMART Self-test log structure revision number 1
> No self-tests have been logged

You should now show a self-test logged.  If not, try -F samsung and -F
samsung2.

> 1. The self-test execution status. It says it was interrupted by the host with a 
> hard or soft reset after 39 minutes, which sounds correct according to what I 
> saw when it happened. So the disk acknowledges that something went wrong, the 
> question is what?

Could be a cabling problem.
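
A note on the "(  39)" printed next to the self-test execution status: that figure is the raw status byte in decimal, so it is probably not a minute count. Below is a minimal Python sketch of the decoding, assuming the ATA SMART layout in which the high nibble is the outcome code and the low nibble is the amount of the test remaining in tenths. The table lists only the commonly documented codes, the function name is hypothetical, and whether a drive keeps the remaining-percentage field meaningful after an interruption is an assumption.

```python
# Outcome codes from the high nibble of the self-test execution
# status byte (commonly documented values only).
SELF_TEST_STATUS = {
    0: "completed without error, or no self-test has been run",
    1: "aborted by the host",
    2: "interrupted by the host with a hard or soft reset",
    3: "fatal or unknown error, test could not complete",
    4: "completed, failed: unknown test element",
    5: "completed, failed: electrical element",
    6: "completed, failed: servo/seek element",
    7: "completed, failed: read element",
    15: "self-test in progress",
}

def decode_self_test_status(status_byte):
    """Split the status byte into (code, description, percent remaining)."""
    code = (status_byte >> 4) & 0x0F
    percent_remaining = (status_byte & 0x0F) * 10
    description = SELF_TEST_STATUS.get(code, "reserved or vendor specific")
    return code, description, percent_remaining

code, description, remaining = decode_self_test_status(39)
print(code, description, remaining)
```

Under these assumptions, 39 (0x27) decodes to code 2, "interrupted by the host with a hard or soft reset", with about 70% of the test remaining.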

> 2. There's a SMART attribute called "Hardware_ECC_Recovered", with the value 
> 11375294. I'm not sure what this means, but ECC should be some kind of error 
> correction, and the value is high.

Ignore it.

> 3. The "UDMA_CRC_Error_Count" is 1. Could this have happened during the failed 
> self-test, or even be the cause of it? If so, what could have triggered this 
> error?

Cabling problem.

> Power and signal cabling are untouched, and the disk is working again. I 
> didn't even open the machine.

Consistent with an intermittent cable or power connection.  Check the
cabling.

> Cabling untouched, and the disk works again as it has for months. 

I suggest you check the cabling.

> I'm curious; does this happen often? I mean, where the disk gets an error like 
> this and then works again after 12 hours switched off?

It sounds like an intermittent electrical or signal connection.  Check the
power and signal cables.

> I did run selftests on three other disks simultaneously, and they finished 
> fine. Cabling problem is not very probable, and voltage spikes are extremely 
> rare here. (Sweden)

The UDMA CRC count is an indication of a cabling problem.

> The disk is alive so I can take a backup now. However, won't I have a 
> difficult time claiming warranty since it is fully functional now? Would you 
> have tried to get a new disk if you were in my shoes?

No.  I'd check the cables (unplug and replug, or change signal
cables) then run a long self-test.  It should appear in the logs with -F
samsung or -F samsung2.
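
The re-test sequence could be scripted roughly as follows. This is only a sketch: it builds the smartctl command lines mentioned in this thread (-t long, -l selftest, -F samsung or -F samsung2), plus -c to show the self-test execution status, and prints them rather than running them. The helper name selftest_commands and the default device are assumptions, and the cables still have to be reseated by hand first.

```python
def selftest_commands(device="/dev/hdh", firmware_fix="samsung2"):
    """Return the smartctl invocations for a long self-test and its log.

    Each command is an argument list suitable for subprocess.run();
    nothing is executed here. firmware_fix is passed to -F when
    reading the self-test log ("samsung" or "samsung2").
    """
    return [
        ["smartctl", "-t", "long", device],   # start the extended self-test
        ["smartctl", "-c", device],           # capabilities, incl. execution status
        ["smartctl", "-l", "selftest", "-F", firmware_fix, device],
    ]

for command in selftest_commands():
    print(" ".join(command))
```

The drive reports a recommended polling time of 120 minutes for the extended test, so the log would be read only after that time has passed.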

Cheers,
	Bruce