[smartmontools-support]Data Corruption

Hi there guys,

Before I start, I would like to say that there are four kinds of people:
- the ones who care about data corruption and protect themselves
- the ones who care and think they protect themselves with RAID
- the ones who care and do nothing
- the ones who know nothing

I'm a guy who cares and doesn't know how to protect the systems under his
responsibility.

The thing is, I know that RAID can protect against disk failure, but
according to some folks it doesn't prevent a disk from feeding back
corrupted data when the failure is only occasional. I have read that the
cable or the kernel can cause data corruption; I thought that, at least
for the cable, there was some kind of protection in place.

So what I really need to know is this: how can I read data from a disk
safely, meaning guaranteed free of corruption?
Do I have to create a virtual layer (a virtual filesystem) that computes
some kind of checksum of the data and checks it every time it is read?
Or is that done already?
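
To make the question concrete, here is a minimal sketch in Python of the
kind of checksum layer meant above. It is only an illustration under
stated assumptions: the names write_with_checksum and read_verified and
the ".sha256" sidecar file are invented for this example, not part of any
existing tool.

```python
import hashlib
from pathlib import Path


class CorruptionError(Exception):
    """Raised when stored data no longer matches its recorded checksum."""


def _sidecar(path: Path) -> Path:
    # Hypothetical convention: keep the digest in a file next to the data.
    return path.with_name(path.name + ".sha256")


def write_with_checksum(path: Path, data: bytes) -> None:
    """Write the data and record its SHA-256 digest in a sidecar file."""
    path.write_bytes(data)
    _sidecar(path).write_text(hashlib.sha256(data).hexdigest())


def read_verified(path: Path) -> bytes:
    """Read the data back and refuse to return it if the digest differs."""
    data = path.read_bytes()
    expected = _sidecar(path).read_text().strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise CorruptionError(f"checksum mismatch for {path}")
    return data
```

A check like this only detects a mismatch; getting the data back still
needs a second copy, for example from another machine in the cluster.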

On servers working in clusters, what kind of installation do you
recommend?

I have this idea, but I'm not sure it's okay: I would have the operating
system under RAID and the remaining data without RAID. The reason is
simple: as I'm talking about a cluster, it doesn't make much sense to
replicate data four times (well, in some situations it does).
The idea is that the OS always stays up, and after the failing HDD is
replaced, the machine pulls the lost data back from the rest of the
cluster.

/boot   raid 1 - /dev/hda1  /dev/hdc1
swap    raid 1 - /dev/hda2  /dev/hdc2
/       raid 1 - /dev/hda3  /dev/hdc3
/data   /dev/hda4, /dev/hdb1, /dev/hdc4, /dev/hdd

Can someone give me a hint if this is okay?

Another problem: how do I read the values from smartmontools?
I have two Seagate drives (I'm not a fan of Seagate; they were
recommended. I bought WD, but they didn't fit in the rack) that have been
working for about 6 months now, and a week ago I lost the filesystem on
one of them. Meaning server down.
I started to install Linux from scratch, and I got some problems in the
kernel, so I ran badblocks; at first it reported 1 bad block.
Then I re-ran it several times and it never reported one again.

So I went to cry for help to the Gentoo gurus, and they told me about
smartmontools, but it seemed they weren't too sure how to read the data;
some said things that didn't make any sense, etc.
So I'm turning to the smartmontools gurus to help me out in reading the
data.

But before that, I must say that after having read the data of another
disk, I'm getting really confused.
I believe that we should create some kind of brand-specific
documentation; there aren't that many brands, and as people come to
understand the readings, they could help others in need later. We could
even get a hand from the manufacturers, who knows. I know that the
manufacturers have tools to check the disks, but hey, we are talking
about remote servers, and it's not nice for me to go there every now and
then to check whether they are okay. I believe that all of you have this
problem.

To make things worse, I have read a paper from your site, and one of the
values it claims is important (if I haven't misunderstood) is the first
one. On my workstation's WD this value is 0; on the Seagates the value is
always changing, and once it got very small; I think it was an overflow.

I'm dumping info on two HDDs. They are both Seagates and have been in use
for about the same time; the second one is the one that caused some
problems.
Another issue: apparently the warranty on these two babies runs out on
28/5/2004, as you can check on Seagate's website. So I really need to
know whether I should exchange them now or not...

Thanks, a lot, (dump follows)
José Faria
Portugal

smartctl version 5.26 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3120023A
Serial Number:    3KA1S22A
Firmware Version: 3.33
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Fri May  7 03:41:50 2004 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 426) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  84) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   053   049   006    Pre-fail  Always       -       4481948
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21966414
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4388
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   027   053   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x001a   053   049   000    Old_age   Always       -       4481948
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Extended offline    Completed without error       00%         9         -

smartctl version 5.26 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3120023A
Serial Number:    3KA1RESZ
Firmware Version: 3.33
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Fri May  7 03:47:42 2004 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 426) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  84) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   052   048   006    Pre-fail  Always       -       24901055
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       16
  7 Seek_Error_Rate         0x000f   074   060   030    Pre-fail  Always       -       27435119
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       576
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   025   050   000    Old_age   Always       -       25
195 Hardware_ECC_Recovered  0x001a   052   047   000    Old_age   Always       -       24901055
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 72 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Timestamp = decimal seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.

Error 72 occurred at disk power-on lifetime: 517 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7b 0f a6 e0  Error: UNC

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 00 18 6b 0f a6 e0 00   85534.201  READ DMA
  c8 00 20 63 0f a6 e0 00   85529.462  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.462  WRITE DMA
  c8 00 20 43 0f a6 e0 00   85529.461  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.460  WRITE DMA

Error 71 occurred at disk power-on lifetime: 517 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7b 0f a6 e0  Error: UNC

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 00 20 63 0f a6 e0 00   85529.462  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.462  WRITE DMA
  c8 00 20 43 0f a6 e0 00   85529.461  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.460  WRITE DMA
  c8 00 20 43 0f a6 e0 00   85529.445  READ DMA

Error 70 occurred at disk power-on lifetime: 546 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 ec 7d 9b e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c4 00 08 eb 7d 9b e0 00   28863.279  READ MULTIPLE
  c4 00 08 bb 8d 9b e0 00   28863.278  READ MULTIPLE
  c4 00 08 ab 8d 9b e0 00   28863.276  READ MULTIPLE
  c4 00 08 9b 8d 9b e0 00   28863.263  READ MULTIPLE
  c4 00 08 c3 f0 d3 e0 00   28863.261  READ MULTIPLE

Error 69 occurred at disk power-on lifetime: 546 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 ec 7d 9b e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c4 00 08 eb 7d 9b e0 00   28146.689  READ MULTIPLE
  c4 00 08 fb 99 88 e0 00   28146.688  READ MULTIPLE
  c4 00 08 f3 99 88 e0 00   28146.687  READ MULTIPLE
  c4 00 08 e3 99 88 e0 00   28146.669  READ MULTIPLE
  c4 00 08 93 62 68 e1 00   28146.668  READ MULTIPLE

Error 68 occurred at disk power-on lifetime: 546 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 ec 7d 9b e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c4 00 08 eb 7d 9b e0 00   28141.582  READ MULTIPLE
  c4 00 08 1b f6 fc e0 00   28141.581  READ MULTIPLE
  c4 00 08 1b a3 fb e0 00   28141.579  READ MULTIPLE
  c4 00 08 fb f5 fc e0 00   28141.578  READ MULTIPLE
  c4 00 08 eb f5 fc e0 00   28141.568  READ MULTIPLE
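
The registers in the error entries above can be combined into a sector
address. This is a minimal sketch, assuming the drive was addressed in
28-bit LBA mode (bit 6 of DH set, as the e0 values suggest) and that the
registers after the error hold the address of the failing sector; the
function name is invented for this example.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27.
    """
    if not dh & 0x40:
        raise ValueError("DH bit 6 is clear: registers are in CHS mode, not LBA")
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers reported for errors 72 and 71 (SN CL CH DH = 7b 0f a6 e0).
print(lba28_from_registers(0x7B, 0x0F, 0xA6, 0xE0))  # 10882939
# Registers reported for errors 70, 69 and 68 (SN CL CH DH = ec 7d 9b e0).
print(lba28_from_registers(0xEC, 0x7D, 0x9B, 0xE0))  # 10190316
```

Both addresses fall inside the range requested by the last command listed
before each error, which is consistent with two distinct bad sectors.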

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         5         -
# 2  Extended offline    Completed without error       00%         4         -
# 3  Extended offline    Aborted by host               90%         3         -
# 4  Extended offline    Completed without error       00%         2         -
# 5  Extended offline    Completed without error       00%         6         -
# 6  Extended offline    Completed without error       00%         0         -
# 7  Extended offline    Completed without error       00%         6         -
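
One possible way to read the attribute tables in the dump, sketched in
Python. It assumes the rule described in the smartctl documentation, that
an attribute counts as failed when its normalized VALUE is less than or
equal to THRESH, and that RAW_VALUE is vendor-specific. It also assumes
each row has been rejoined onto a single line; parse_attribute and
failed_now are names made up for this example.

```python
from typing import NamedTuple


class Attribute(NamedTuple):
    id: int
    name: str
    value: int
    worst: int
    thresh: int
    kind: str
    raw: str


def parse_attribute(row: str) -> Attribute:
    """Split one rejoined attribute row into its whitespace-separated columns."""
    f = row.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return Attribute(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[6], f[9])


def failed_now(attr: Attribute) -> bool:
    """True when the normalized value has reached the vendor threshold."""
    return attr.value <= attr.thresh


# Two rows taken from the second drive's table in the dump.
rows = [
    "  1 Raw_Read_Error_Rate 0x000f 052 048 006 Pre-fail Always - 24901055",
    "  5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 16",
]
for attr in map(parse_attribute, rows):
    # Print the name, the margin above the threshold, and the failure flag.
    print(attr.name, attr.value - attr.thresh, failed_now(attr))
```

On these two rows neither attribute has reached its threshold. The raw
count of 16 reallocated sectors and the 72 logged ATA errors on that drive
are not captured by this comparison and need to be looked at separately.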
