From: <jos...@vi...> - 2004-05-07 03:55:12
Hi there guys,

Before I start, I would like to say that there are four kinds of people:
- the ones who care about data corruption and protect themselves
- the ones who care and think they protect themselves with RAID
- the ones who care and do nothing
- the ones who know nothing

I'm a guy who cares and doesn't know how to protect the systems under his responsibility.

The thing is, I know that RAID can protect against disk failure, but according to some folks it doesn't prevent a disk from feeding back corrupted data when the failure is only occasional. I have read that the cable or the kernel can cause data corruption; I thought that for the cable, at least, there was some kind of protection in place.

So what I really need to know is this: how can I read data from a disk safely, meaning with a guarantee that it is not corrupted? Do I have to create a virtual layer (a virtual filesystem) that stores some kind of checksum of the data and checks it every time the data is read? Or is that done already?

On servers working in clusters, what kind of installation do you recommend? I have this idea but I'm not sure it's okay: I would have the operating system under RAID and the remaining data without RAID. The reason is simple: as I'm talking about a cluster, it doesn't make much sense to replicate data four times (well, in some situations it does). The idea is that the OS always stays up, and after the failing HDD is replaced, it pulls the lost data from the other clusters.

/boot  raid 1 - /dev/hda1 /dev/hdc1
swap   raid 1 - /dev/hda2 /dev/hdc2
/      raid 1 - /dev/hda3 /dev/hdc3
/data  /dev/hda4, /dev/hdb1, /dev/hdc4, /dev/hdd

Can someone give me a hint if this is okay?

Another problem: how do I read the values from smartmontools? I have two Seagate drives (I'm not a fan of Seagate; they were recommended. I bought WD, but they didn't fit in the rack) that have been working for about 6 months now, and a week ago I lost the filesystem on one of them. Meaning server down.
I started reinstalling Linux from scratch and ran into some problems with the kernel, so I ran badblocks. At first it reported 1 bad block. Then I re-ran it several times and it never reported one again.

So I went to cry for help to the Gentoo gurus, and they told me about smartmontools, but it seemed they weren't too sure how to read the data; some said things that didn't make any sense, etc. So I'm turning to the smartmontools gurus to help me read the data.

But before that, I must say that after having read the data from another disk, I'm getting really confused. I believe we should create some kind of brand-specific documentation. There aren't that many brands, and as people come to understand the readings, they could help others in need later. We could even get a hand from the manufacturers, who knows. I know that the manufacturers have tools to check the disks, but hey, we are talking about remote servers, and it's not nice for me to go there every now and then to check if they are okay. I believe all of you have this problem.

To make things worse, I have read a paper from your site, and one of the values it claims is important (if I haven't misunderstood) is the first one. On my workstation's WD drive this value is 0; on the Seagates the value is always changing, and once it got very small. I think it was an overflow.

I'm dumping info on two HDDs. They are both Seagates and have been in use for about the same time; the second one is the one that caused some problems.

Another issue: apparently the warranty on these two babies runs out on 28/5/2004, as you can check on Seagate's website. So I really need to know if I should exchange them now or not...
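A minimal sketch of one way the attribute rows in the dump could be read, assuming (as the smartmontools documentation describes it) that the health assessment depends on the normalized VALUE and WORST columns compared against THRESH, while RAW_VALUE is vendor-specific and not comparable between a WD and a Seagate. The function names `parse_attribute` and `check` are hypothetical, and the two sample rows are copied from the second disk's table.

```python
# Sketch only: compare the normalized VALUE/WORST columns against THRESH.
# Column order is assumed to match the smartctl 5.26 attribute table.

def parse_attribute(line):
    """Return a dict for one attribute row, or None if the line is not a row."""
    fields = line.split()
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),   # normalized, higher is better
        "worst": int(fields[4]),   # lowest normalized value seen so far
        "thresh": int(fields[5]),  # vendor-defined failure threshold
        "type": fields[6],         # Pre-fail or Old_age
        "raw": fields[9],          # vendor-specific, not comparable across brands
    }

def check(attr):
    """An attribute counts as failed once its normalized value is <= threshold."""
    if attr["value"] <= attr["thresh"]:
        return "FAILING_NOW"
    if attr["worst"] <= attr["thresh"]:
        return "FAILED_IN_PAST"
    return "ok"

# Rows copied from the second disk (serial 3KA1RESZ).
ROWS = [
    "  1 Raw_Read_Error_Rate     0x000f   052   048   006    Pre-fail  Always       -       24901055",
    "  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       16",
]

for row in ROWS:
    attr = parse_attribute(row)
    margin = attr["value"] - attr["thresh"]
    print(attr["name"], check(attr), "margin:", margin, "raw:", attr["raw"])
```

By this reading a large, constantly changing raw Raw_Read_Error_Rate would not by itself indicate a failure, since the normalized value of 52 is still well above the threshold of 6. The raw Reallocated_Sector_Ct is probably more directly meaningful, though that is an interpretation rather than something the dump states.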
Thanks a lot,
(dump follows)

José Faria Portugal

smartctl version 5.26 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3120023A
Serial Number:    3KA1S22A
Firmware Version: 3.33
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Fri May 7 03:41:50 2004 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 426) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  84) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   053   049   006    Pre-fail  Always       -       4481948
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       1
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21966414
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4388
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       40
194 Temperature_Celsius     0x0022   027   053   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x001a   053   049   000    Old_age   Always       -       4481948
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         1         -
# 2  Extended offline    Completed without error       00%         9         -

smartctl version 5.26 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3120023A
Serial Number:    3KA1RESZ
Firmware Version: 3.33
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Fri May 7 03:47:42 2004 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity was
                                        completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 426) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  84) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   052   048   006    Pre-fail  Always       -       24901055
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       2
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       16
  7 Seek_Error_Rate         0x000f   074   060   030    Pre-fail  Always       -       27435119
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       576
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       38
194 Temperature_Celsius     0x0022   025   050   000    Old_age   Always       -       25
195 Hardware_ECC_Recovered  0x001a   052   047   000    Old_age   Always       -       24901055
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 72 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Timestamp = decimal seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.

Error 72 occurred at disk power-on lifetime: 517 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7b 0f a6 e0  Error: UNC

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 00 18 6b 0f a6 e0 00   85534.201  READ DMA
  c8 00 20 63 0f a6 e0 00   85529.462  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.462  WRITE DMA
  c8 00 20 43 0f a6 e0 00   85529.461  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.460  WRITE DMA

Error 71 occurred at disk power-on lifetime: 517 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7b 0f a6 e0  Error: UNC

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c8 00 20 63 0f a6 e0 00   85529.462  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.462  WRITE DMA
  c8 00 20 43 0f a6 e0 00   85529.461  READ DMA
  ca 00 20 43 0f a6 e0 00   85529.460  WRITE DMA
  c8 00 20 43 0f a6 e0 00   85529.445  READ DMA

Error 70 occurred at disk power-on lifetime: 546 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 ec 7d 9b e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c4 00 08 eb 7d 9b e0 00   28863.279  READ MULTIPLE
  c4 00 08 bb 8d 9b e0 00   28863.278  READ MULTIPLE
  c4 00 08 ab 8d 9b e0 00   28863.276  READ MULTIPLE
  c4 00 08 9b 8d 9b e0 00   28863.263  READ MULTIPLE
  c4 00 08 c3 f0 d3 e0 00   28863.261  READ MULTIPLE

Error 69 occurred at disk power-on lifetime: 546 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 ec 7d 9b e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c4 00 08 eb 7d 9b e0 00   28146.689  READ MULTIPLE
  c4 00 08 fb 99 88 e0 00   28146.688  READ MULTIPLE
  c4 00 08 f3 99 88 e0 00   28146.687  READ MULTIPLE
  c4 00 08 e3 99 88 e0 00   28146.669  READ MULTIPLE
  c4 00 08 93 62 68 e1 00   28146.668  READ MULTIPLE

Error 68 occurred at disk power-on lifetime: 546 hours
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 07 ec 7d 9b e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Timestamp  Command/Feature_Name
  -- -- -- -- -- -- -- --   ---------  --------------------
  c4 00 08 eb 7d 9b e0 00   28141.582  READ MULTIPLE
  c4 00 08 1b f6 fc e0 00   28141.581  READ MULTIPLE
  c4 00 08 1b a3 fb e0 00   28141.579  READ MULTIPLE
  c4 00 08 fb f5 fc e0 00   28141.578  READ MULTIPLE
  c4 00 08 eb f5 fc e0 00   28141.568  READ MULTIPLE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         5         -
# 2  Extended offline    Completed without error       00%         4         -
# 3  Extended offline    Aborted by host               90%         3         -
# 4  Extended offline    Completed without error       00%         2         -
# 5  Extended offline    Completed without error       00%         6         -
# 6  Extended offline    Completed without error       00%         0         -
# 7  Extended offline    Completed without error       00%         6         -
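
The register rows in the error log above can be turned into sector addresses. The sketch below assumes 28-bit LBA addressing, where SN, CL and CH carry LBA bits 0-23 and the low nibble of DH carries bits 24-27 (the DH value e0 has the LBA-mode bit set). The function names `lba28` and `lba_from_error_row` are hypothetical helpers, not part of smartmontools.

```python
# Sketch only: rebuild the 28-bit LBA from the registers shown in the error log.
# Assumes LBA28 addressing; register order as printed: ER ST SC SN CL CH DH.

def lba28(sn, cl, ch, dh):
    """LBA bits 0-7 in SN, 8-15 in CL, 16-23 in CH, 24-27 in the low nibble of DH."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn

def lba_from_error_row(row):
    """row is the hex string printed under 'ER ST SC SN CL CH DH'."""
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split()[:7])
    return lba28(sn, cl, ch, dh)

print(lba_from_error_row("40 51 00 7b 0f a6 e0"))  # errors 72 and 71 -> 10882939
print(lba_from_error_row("40 51 07 ec 7d 9b e0"))  # errors 70, 69, 68 -> 10190316
```

Read this way, the five logged errors point at only two distinct sectors, each hit repeatedly, which would be consistent with a couple of bad spots rather than errors scattered across the disk.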