Hello all smartctl experts! :)
First of all:
My smartctl version is 5.37 [i686-redhat-linux-gnu] and I ran it with the parameters "-a /dev/sda" to check my Samsung drive (described below, CIT I).
The Linux version is Fedora 7, kernel 2.6.21.
Technical information for my disk (/dev/sda):
Device Model: SAMSUNG SP0802N
Serial Number: S00JJ40XC00821
Firmware Version: TK200-04
User Capacity: 80,060,424,192 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Fri Feb 1 15:42:55 2008 CET
==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
And the problem is:
1) Many days ago, while Linux was booting (precisely at "NFS statd"), this appeared:
"ATA: abnormal status 0x80 on port 0x000101f7
ata1: failed to recover some devices
INIT: cannot execute '/sbin/mingetty' ...
and so on, and so on..."
Suspecting a problem with the HDD (the system disk, so I suspected /dev/sda), I decided to:
-fix the physical connections of /dev/sda
-make a backup of the critical data on /dev/sda
Here is what happened next:
2) Starting sendmail... (and... it HUNG :( )
What else I did:
Right after 1), having read about what smartctl can do, I decided to check /dev/sda with "smartctl -a".
It returned some not quite fine messages,
although it also returned this (quite fine):
"=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED"
"SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0007 100 100 025 Pre-fail Always - 7424
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 234
5 Reallocated_Sector_Ct 0x0033 100 100 011 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 9187
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 137282
10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0
11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 172
194 Temperature_Celsius 0x0022 151 082 000 Old_age Always - 29
195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 59105
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x000a 100 100 051 Old_age Always - 0
201 Soft_Read_Error_Rate 0x000a 100 100 051 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 1265 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
but here is the problem:
"Error 1265 occurred at disk power-on lifetime: 802 hours (33 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 00 01 4f c2 a0 Error: ABRT
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
b0 d9 00 01 4f c2 a0 00 00:00:08.688 SMART DISABLE OPERATIONS
ec 00 ff 01 00 00 a0 00 00:00:08.688 IDENTIFY DEVICE
10 00 ff 01 00 00 a0 00 00:00:08.688 RECALIBRATE [OBS-4]
91 00 ff 01 00 00 af 00 00:00:08.688 INITIALIZE DEVICE PARAMETERS [OBS-6]
ec 00 ff 01 00 00 a0 00 00:00:08.563 IDENTIFY DEVICE "
This is repeated exactly 4 more times (besides the one above, with different error numbers and power-on lifetimes, of course), followed by:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 938 -
# 2 Extended offline Completed without error 00% 912 -
# 3 Extended offline Completed without error 00% 889 -
# 4 Short offline Completed without error 00% 886 -
# 5 Short offline Completed without error 00% 886 -
# 6 Short offline Completed without error 00% 884 -
# 7 Extended offline Completed without error 00% 876 -
# 8 Abort offline test Aborted by host 90% 875 -
(I've put the whole smartctl output into the attached file "SmartLogLocal")
... so it seems there is no concrete LBA (sector) causing the error that I could rewrite.
Instead, I started digging into the possible causes of the "SMART DISABLE OPERATIONS" error (because exactly that command seems to be the source of the error). I found some descriptions in the ATA-7 specification, section 6.54.1 ( http://www.t13.org/Documents/UploadedDocuments/docs2007/D1532v1r4b-AT_Attachment_with_Packet_Interface_-_7_Volume_1.pdf (DOC I)) and
the SMART specification, section 2.1 ( ftp://ftp3.ds.pg.gda.pl/people/macro/S.M.A.R.T./8035R2_0.PDF (DOC II)). Both of these specifications seem to be official documents, so I hoped the conditions leading to errors would be fully described there.
So, DOC II (ERROR OUTPUTS) says:
"If the device does not support the S.M.A.R.T. feature set, if S.M.A.R.T. is not
enabled or if the values in the Features, Cylinder Low or Cylinder High registers are invalid, an Aborted
command error is posted."
while in DOC I (220.127.116.11 Error outputs):
"If the device does not support this command, if SMART is not enabled, or if the values in the Features, LBA
Mid, or LBA High registers are invalid, the device shall return command aborted."
and in the same document (just below the previous passage):
Error register -
ABRT shall be set to one if this command is not supported, if SMART is not enabled, or if input
register values are invalid. ABRT may be set to one if the device is not able to complete the
action requested by the command.
Device register -
DEV shall indicate the selected device.
Status register -
BSY shall be cleared to zero indicating command completion.
DRDY shall be set to one.
DF (Device Fault) shall be set to one if a device fault has occurred.
DRQ shall be cleared to zero.
ERR shall be set to one if an Error register bit is set to one.
...so let's check our values of the Device register and the Status register (CIT IV). The Device register (is it Device Command (DC) or Device/Head (DH)?) holds 00h (or a0h). Whichever we choose, the "DEV" flag is cleared (it is bit 4 of the Device register (DOC I)), so the selected device is 0 (why not?). But anyway, what matters more is that ABRT is set to 1, ERR is set to 1, BSY is cleared to 0, DRDY is set to 1 and DF is cleared to 0. The last fact seems to imply that a "device fault" HASN'T occurred. So... what HAS? DOC I (3.1.81 Definitions and abbreviations) says:
"UNRECOVERABLE ERROR: When the device sets either the ERR bit or the DF bit to one in the Status register at command completion."
...so the device setting ERR to 1 means that some error occurred (probably with the device) and, additionally, that it is "unrecoverable".
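To make the register reading above concrete, here is a minimal Python sketch that decodes ST = 51h, ER = 04h and DH = a0h from the error log. The bit positions are my reading of the ATA-7 layout (Status: BSY bit 7, DRDY bit 6, DF bit 5, DRQ bit 3, ERR bit 0; Error: ABRT bit 2; Device: DEV bit 4). The function names are only illustrative and are not part of smartctl.

```python
# Bit positions in the ATA Status register (assumed from the ATA-7 layout).
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}

ABRT_MASK = 0x04  # bit 2 of the Error register
DEV_MASK = 0x10   # bit 4 of the Device register


def decode_status(st):
    """Return a dict mapping each Status flag name to 0 or 1."""
    return {name: (st >> bit) & 1 for bit, name in STATUS_BITS.items()}


def abrt_set(er):
    """True if the ABRT bit is set in the Error register value."""
    return bool(er & ABRT_MASK)


def selected_device(dh):
    """Return the DEV bit (0 or 1) of the Device register value."""
    return 1 if dh & DEV_MASK else 0


if __name__ == "__main__":
    # Values taken from the error log entry: ER=04, ST=51, DH=a0.
    print(decode_status(0x51))
    print("ABRT set:", abrt_set(0x04))
    print("Selected device:", selected_device(0xA0))
```

With the values from the log this reproduces the reading above: DRDY and ERR set, BSY, DF and DRQ cleared, ABRT set, device 0 selected.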
So I'm asking you: what is the conclusion? In the end, HAS an error occurred or HASN'T it?
Let's investigate the log further:
(first part of CIT VIII)-
as CIT IX:
"ABRT shall be set to one if this command is not supported, if SMART is not enabled, or if input
register values are invalid. ABRT may be set to one if the device is not able to complete the
action requested by the command"
The first sentence of the citation above has essentially the same meaning as CIT VII, except that CIT VIII speaks of "input register values" and CIT VII of "values in the Features, LBA Mid, or LBA High registers", so the former is slightly less specific, but in practice (look at DOC I, 18.104.22.168) it means the same. In "Features" we have "d9h", which is exactly right for SMART DISABLE OPERATIONS (22.214.171.124), while "LBA Mid" and "LBA High" (correct me if I'm wrong :)) are probably (comparing with CIT IV and CIT VI) "Cylinder Low (CL)" and "Cylinder High (CH)", which have the values 4fh and c2h respectively. Both of these values are also exactly right for the SMART DISABLE OPERATIONS command (DOC I, INPUTS). DEV is of course also correct (0 at input equals the corresponding value in the Device register at output).
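The input check described above can be sketched in a few lines of Python. This is only an illustration under my assumptions: the column order CR FR SC SN CL CH DH DC is the one smartctl prints in the error log, CL/CH are taken to be LBA Mid/LBA High, and the expected values (command B0h, Features D9h, LBA Mid 4Fh, LBA High C2h) are my reading of DOC I. The helper names are hypothetical.

```python
# Expected input values for SMART DISABLE OPERATIONS (my reading of DOC I).
SMART_COMMAND = 0xB0
FEATURE_DISABLE_OPERATIONS = 0xD9
SMART_LBA_MID = 0x4F   # shown by smartctl as CL (Cylinder Low)
SMART_LBA_HIGH = 0xC2  # shown by smartctl as CH (Cylinder High)

REGISTER_NAMES = ["CR", "FR", "SC", "SN", "CL", "CH", "DH", "DC"]


def parse_log_row(row):
    """Parse the first eight hex fields of a smartctl command-log row."""
    fields = row.split()[:len(REGISTER_NAMES)]
    return {name: int(value, 16) for name, value in zip(REGISTER_NAMES, fields)}


def is_valid_smart_disable(regs):
    """True if the registers look like a well-formed SMART DISABLE OPERATIONS."""
    return (
        regs["CR"] == SMART_COMMAND
        and regs["FR"] == FEATURE_DISABLE_OPERATIONS
        and regs["CL"] == SMART_LBA_MID
        and regs["CH"] == SMART_LBA_HIGH
    )


if __name__ == "__main__":
    row = "b0 d9 00 01 4f c2 a0 00 00:00:08.688 SMART DISABLE OPERATIONS"
    print(is_valid_smart_disable(parse_log_row(row)))
```

For the logged row the check passes, which matches the conclusion that the "wrong input values" cause can be excluded.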
Having excluded the "wrong input values" cause, we can also reject the "SMART is not enabled" cause (last line of CIT I: "SMART support is: Enabled") and probably "the device does not support the S.M.A.R.T. feature set" (the line before the last in CIT I). Hence, because my disk supports the S.M.A.R.T. feature set, the SMART DISABLE OPERATIONS command MUST also be supported by the disk (DOC II, 4.8.5). So S.M.A.R.T. is supported, SMART DISABLE OPERATIONS is supported, S.M.A.R.T. is enabled, so what else could be the cause of our error?
We haven't yet examined the second sentence of the first part of CIT VIII (i.e. CIT IX), where the spec's author admits that our "positive" ABRT value can (probably) be caused by the "device is not able to complete the action requested by the command" event. Could this be true in our case? Correct me, but I'm not so sure. After all:
from CIT VIII, Status register, first sentence:
"BSY shall be cleared to zero indicating command completion"
and (as I mentioned a few lines above)
"BSY is cleared to 0"
so I'd say that it CAN'T be that the "device is not able to complete the action requested by the command", because (as BSY indicates) the command COMPLETED. This also leads to a pretty strange conclusion: our command (SMART DISABLE OPERATIONS) ABORTED and (at the same time) COMPLETED (please help me and explain that to me, pleaseeee!!! :)).
But anyway, what about the error? The explanations in the official documentation (DOC I and DOC II) don't seem informative enough to pin down the conditions of the problem and tackle it properly. Looking from a slightly different angle, I also searched the smartmontools mailing list archive for cases similar to mine and found no relevant posts, except perhaps: http://sourceforge.net/mailarchive/message.php?msg_id=c96oer%24p3t%241%40sea.gmane.org , where a smartctl user (Carsten Schurig) reports his device model, which is identical to mine, and his SMART error log (which is identical to mine w.r.t. CIT IV (a single SMART DISABLE OPERATIONS error trace)). As you see, almost everything was relevant to my problem, but the discussion of that topic (with Bruce Allen and Mario Holbe) was restricted to disk temperatures (pity :(( ), not to the SMART DISABLE OPERATIONS error. So unfortunately I found no suitable solution even in the archive :( , Google turned up little, and the documentation proved insufficient. So, to sum up my situation, don't be surprised when I tell you that you are simply my last chance.
So this is what I'm asking. Tell me (whoever may know): is my problem (the one with SMART DISABLE OPERATIONS, or maybe another one, for example the one shown in CIT III) really a (serious) disk error that I must repair with VERY radical tools (like zeroing the disk with "dd if=/dev/zero of=/dev/sda")? Or must I return the disk to the seller and wait until they repair it or refund my money (I think my warranty is still valid), and how should I describe the problem to the seller? Or (this would also be a nice solution) is there a softer method of repairing this problem that doesn't require the "horrible" zeroing? Maybe you'll even say the best solution would be to ignore the problem and keep using the disk without worry until the next alarm (remembering the recent crashes, I couldn't really be that calm though). What I'd like most is to return the disk to the seller, but please help me justify that necessity to him, explain all the important circumstances to him, and convince him that the technical state of the drive necessarily requires replacing it. I believe you can help me.
many thanks in advance and regards