From: David N. <do...@ma...> - 2017-04-06 00:26:26
|
Hello,
First of all, I *have* been backing up my data.
I'm going to post LOTS of details here, feel free to skim.

My problem is that once upon a time my drive failed < 6 months after I
bought my laptop. Sending it to a professional did not help, nor did
replacing the PCB; it was dead.
The symptom leading up to the event was a sudden freeze of the OS. I was
not too bright about Linux at the time, so I thought that perhaps X froze.
Now I'm getting the identical thing, a sudden freeze. I can ping the
kernel, but I cannot restore the frame buffer, sync, or umount the
file systems. My syslog daemon, metalog, records no messages during this
period, even though it is set to sync the dmesg messages. I cannot ssh,
but I can use SysRq to reboot. I'm using OpenRC.

This has happened two or three times.
I just ran a self-test and it says PASSED; I'm not seeing anything that
stands out.

smartmontools-6.4
Gentoo Linux 4.9.x

Below is my S.M.A.R.T. data. BTW: it is unwrapped.
What do you think?

Thanks,
David

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile
Device Model:     WDC WD7500BPVX-22JC3T0
Serial Number:    WD-WXC1A14E1823
LU WWN Device Id: 5 0014ee 209f3d675
Firmware Version: 01.01A01
User Capacity:    750,156,374,016 bytes [750 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Apr 4 14:47:54 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (13920) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 157) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   196   179   021    Pre-fail  Always       -       1166
  4 Start_Stop_Count        0x0032   058   058   000    Old_age   Always       -       42367
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   082   082   000    Old_age   Always       -       13532
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1695
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       124
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       158
193 Load_Cycle_Count        0x0032   183   183   000    Old_age   Always       -       51774
194 Temperature_Celsius     0x0022   107   091   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error        00%            13529  -
# 2  Extended offline    Completed without error        00%            12288  -
# 3  Extended offline    Completed without error        00%             9247  -
# 4  Extended offline    Completed without error        00%             7609  -
# 5  Extended offline    Completed without error        00%             5469  -
# 6  Short offline       Completed without error        00%                0  -
# 7  Short offline       Completed without error        00%                0  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
|
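For anyone who wants to double-check an attribute table like the one in this post mechanically, here is a minimal Python sketch. It assumes the column order shown in the post (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE), and it assumes the common convention that an attribute only counts as failed when its threshold is non-zero and VALUE <= THRESH. The function names and the set of "watch" attribute IDs are illustrative choices of mine, not anything defined by smartmontools.

```python
import re

# A few rows copied from the attribute table in the post above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       124
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# One attribute row: ten whitespace-separated columns, raw value starts with digits.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)

# Assumption: counters where any non-zero raw value deserves a closer look
# (reallocated, pending, uncorrectable sectors and interface CRC errors).
WATCH_RAW = {5, 196, 197, 198, 199}


def parse_attributes(text):
    """Return one dict per attribute row found in text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "worst": int(m["worst"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def suspicious(rows):
    """List human-readable findings; an empty list means nothing was flagged."""
    findings = []
    for r in rows:
        # Assumed convention: a threshold of 0 never trips.
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            findings.append(f'{r["name"]}: VALUE {r["value"]} <= THRESH {r["thresh"]}')
        if r["id"] in WATCH_RAW and r["raw"] > 0:
            findings.append(f'{r["name"]}: raw count {r["raw"]}')
    return findings


if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)) or "nothing flagged")
```

Run against the sample rows, nothing is flagged. Keep in mind that a clean result here only says the counters look normal, not that the drive cannot fail.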
From: Carlos E. R. <rob...@te...> - 2017-04-06 10:09:47
|
On 2017-04-06 02:26, David Niklas wrote:
> Hello,
> First of all, I *have* been backing up my data.
> I'm going to post LOTS of details here, feel free to skim.
>
> My problem is that once upon a time my drive failed < 6 months after I
> bought my laptop. Sending it to a professional did not help, nor did
> replacing the PCB, it was dead.
> The symptom leading up to the event was a sudden freeze of the OS. I was
> not too bright about Linux at the time, so I thought that perhaps X froze.
> Now I'm getting the identical thing, a sudden freeze. I can ping the
> kernel, I cannot restore the frame buffer, sync, or umount the
> file systems. My syslog metalog records no messages during this period,
> it is set to sync the dmesg messages. I cannot ssh, but I can use SysRq
> to reboot. I'm using OpenRC.
>
> This has happened twice or three times.
> I just ran a self test and it says PASSED, I'm not seeing anything that
> stands out.
>
> smartmontools-6.4
> Gentoo Linux 4.9.x
>
> Below is my S.M.A.R.T. data. BTW: it is unwrapped.
> What do you think?

No evidence of a problem here that I can see.

If it were the disk, you would typically see kernel messages complaining
about it in "dmesg".

-- 
Cheers / Saludos,

Carlos E. R.
(from 42.2 x86_64 "Malachite" (Minas Tirith))
|
From: <ro...@sp...> - 2017-04-06 14:30:15
|
I'll second "Carlos E. R."'s verdict. I see nothing wrong either.

However, that does not guarantee there isn't something wrong. Somewhere I
read a study that said SMART only predicts about 60% of hard drive
failures. The other 40% give no warning.

Backups are always a good idea. They protect not only against hard drive
failures, but also against accidental or malicious data loss.

Now, have you tested those backups? I remember, when I was freelancing,
going to a brand-new client (first visit). They needed me to do a restore.
OK, got their backup media (it was back in the Zip disk days). Every disk
was write-protected, and blank. Needless to say, that day didn't go well.

> Hello,
> First of all, I *have* been backing up my data.
> I'm going to post LOTS of details here, feel free to skim.
>
> My problem is that once upon a time my drive failed < 6 months after I
> bought my laptop. Sending it to a professional did not help, nor did
> replacing the PCB, it was dead.
> The symptom leading up to the event was a sudden freeze of the OS. I was
> not too bright about Linux at the time, so I thought that perhaps X froze.
> Now I'm getting the identical thing, a sudden freeze. I can ping the
> kernel, I cannot restore the frame buffer, sync, or umount the
> file systems. My syslog metalog records no messages during this period,
> it is set to sync the dmesg messages. I cannot ssh, but I can use SysRq
> to reboot. I'm using OpenRC.
>
> This has happened twice or three times.
> I just ran a self test and it says PASSED, I'm not seeing anything that
> stands out.
>
> smartmontools-6.4
> Gentoo Linux 4.9.x
>
> Below is my S.M.A.R.T. data. BTW: it is unwrapped.
> What do you think?
|
From: Robin H. J. <ro...@ge...> - 2017-04-06 20:52:03
|
On Wed, Apr 05, 2017 at 08:26:12PM -0400, David Niklas wrote:
> The symptom leading up to the event was a sudden freeze of the OS. I was
> not too bright about Linux at the time, so I thought that perhaps X froze.
> Now I'm getting the identical thing, a sudden freeze. I can ping the
> kernel, I cannot restore the frame buffer, sync, or umount the
> file systems. My syslog metalog records no messages during this period,
> it is set to sync the dmesg messages. I cannot ssh, but I can use SysRq
> to reboot. I'm using OpenRC.
Metalog would only be useful if writes to disk were succeeding. It's
certainly possible for the kernel to hang in such a state that there is a
kernel panic and writes to disk are not happening (this includes
sending the sysrq-sync command).

That you can ping the kernel simply says that there's enough left
running for the kernel to handle ICMP without going to userspace.

That you can't SSH says something in userspace failed (which could be for a
myriad of reasons).

Just because the system seems to freeze does not mean that the drive is
faulty. It's also entirely possible that there is a logged drive event in
dmesg that you can't see.

If you can repeat it, consider some of the following to get better
insight into what's going on.
- set up serial kernel console or network kernel console logging.
- set up kdump or similar.

That's not to say that the drive isn't the source of the problem, just
that it's not likely based on the output you've shown.

You say this is a laptop, and the drive, by power-on hours, has racked up
~1.5 years of usage, so it possibly hasn't been opened in at least that
long. How much dust has built up inside it? Overheating of the graphics
CAN cause the symptoms you've described.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : ro...@ge...
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
|
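As a rough illustration of the receiving end of network kernel console logging: the kernel's netconsole sends log lines as UDP datagrams, so a second machine only has to listen on a UDP port and write down whatever arrives. The Python sketch below does just that. The port number 6666, the log file name, and all function names are assumptions chosen for illustration, and the sending side still has to be configured on the laptop as described in the kernel's netconsole documentation.

```python
import socket
import time


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket that kernel log datagrams can be sent to.

    The port is an assumption; it must match whatever the sender targets.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def format_line(data, sender_ip, stamp):
    """Turn one received datagram into a single log line."""
    text = data.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{stamp} {sender_ip} {text}"


def iter_log_lines(sock, max_messages=None):
    """Yield a formatted line per datagram; loops forever unless max_messages is set."""
    count = 0
    while max_messages is None or count < max_messages:
        data, (sender_ip, _sender_port) = sock.recvfrom(65535)
        yield format_line(data, sender_ip, time.strftime("%Y-%m-%d %H:%M:%S"))
        count += 1


# Example use on the receiving machine (blocks until interrupted):
#
#   with open("netconsole.log", "a", encoding="utf-8") as fh:
#       for line in iter_log_lines(open_listener()):
#           fh.write(line + "\n")
#           fh.flush()  # keep the file current while waiting for the next freeze
```

The point of logging on a different machine is that the last kernel messages before a freeze survive even if the laptop can no longer write to its own disk.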
From: David N. <do...@ma...> - 2017-04-07 20:28:56
|
On Thu, 6 Apr 2017 20:51:51 +0000
"Robin H. Johnson" <ro...@ge...> wrote:

> On Wed, Apr 05, 2017 at 08:26:12PM -0400, David Niklas wrote:
> > The symptom leading up to the event was a sudden freeze of the OS. I
> > was not too bright about Linux at the time, so I thought that perhaps
> > X froze. Now I'm getting the identical thing, a sudden freeze. I can
> > ping the kernel, I cannot restore the frame buffer, sync, or umount
> > the file systems. My syslog metalog records no messages during this
> > period, it is set to sync the dmesg messages. I cannot ssh, but I can
> > use SysRq to reboot. I'm using OpenRC.
> Metalog would only be useful if writes to disk were succeeding. It's
> certainly possible for the kernel to hang in such a state that there is a
> kernel panic, and writes to disk are not happening (this includes
> sending the sysrq-sync command).
>
> That you can ping the kernel simply says that there's enough left
> running for the kernel to handle ICMP without going to userspace.
>
> That you can't SSH says something in userspace failed (which could be a
> myriad of reasons).
>
> Just because the system seems to freeze does not mean that the drive is
> faulty. Also entirely possible there is a logged drive event in dmesg
> that you can't see.
>
> If you can repeat it, consider some of the following to get a better
> insight as to what's going on.
> - set up serial kernel console or network kernel console logging.
> - set up kdump or similar.
No, I can't repeat it; it's random so far.

> That's not to say that the drive isn't the source of the problem, just
> that it's not likely based on the output you've shown.
Why not?
What else causes all writes to the drive to stop except a problem with
the drive or MB (my laptop has no cabling)?

> You say this is a laptop, and the drive by power hours has racked up
> ~1.5 years of usage, so it possibly hasn't been opened in at least that
> long. How much dust has built up inside it? Overheating of the graphics
> CAN cause the symptoms you've described.
The laptop is my primary way to get online; it has not been left off for
more than 2 days unless its HW failed (the original drive died).

So, I'm not misreading the S.M.A.R.T. data? No values that ought to be
interpreted in HEX, OCTAL or something?

Thanks,
David
|
From: Robin H. J. <ro...@ge...> - 2017-04-07 20:55:12
|
On Fri, Apr 07, 2017 at 04:28:43PM -0400, David Niklas wrote:
...
> > If you can repeat it, consider some of the following to get a better
> > insight as to what's going on.
> > - set up serial kernel console or network kernel console logging.
> > - set up kdump or similar.
> No, It's random so far.
Ok, get yourself network console logging: since networking was still
working, you can just let the kernel send a copy of all klog entries
over the network.

In the kernel sources, see Documentation/networking/netconsole.txt, or the
examples in the Ubuntu & Arch wikis.

> > That's not to say that the drive isn't the source of the problem, just
> > that it's not likely based on the output you've shown.
> Why not?
> What else causes all writes to the drive to stop except a problem with
> the drive or MB (my laptop has no cabling)?
Most failure modes of a spinning drive would cause various error
counters to be incremented. The few that I can think of that wouldn't
all involve specific component failures on the drive PCB.

Drive PCB-originating failures should NOT cause your video to lock up,
but may stop the logging to disk of any errors.

I can start up a Linux system running off a SATA drive, open a
terminal, suddenly disconnect the drive, and still be able to run dmesg
and/or see live kernel log entries (provided that dmesg itself is already
cached, so that running it doesn't need anything to be read off disk).

So what we're looking for as the root cause is some manner of error that
causes both video & drive to become unresponsive, but leaves the kernel
still responding to ICMP ping (ergo the network stack is operational).

That root cause COULD have other effects (like a power spike that then
damages the drive PCB), but it's the root cause we care about.

For example, overheating could cause a component fault (like causing a
capacitor to go out of tolerance or fail) on one of the PCI/PCIe buses,
thereby affecting the drive & graphics. The networking might be on a
different bus, and so continues to function.

> > You say this is a laptop, and the drive by power hours has racked up
> > ~1.5 years of usage, so it possibly hasn't been opened in at least that
> > long. How much dust has built up inside it? Overheating of the graphics
> > CAN cause the symptoms you've described.
> The laptop is my primary way to get online, it's not been left off for more
> than 2 days unless its HW failed (the original drive died).
>
> So, I'm not misreading the S.M.A.R.T. data? No values that ought to be
> interpreted in HEX, OCTAL or something?
No, the drive data seems good, and representative of a healthy &
well-used drive. No reallocated sectors, no other issues, and not that many
power cycles, even for a laptop drive w/ aggressive power saving.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Trustee & Treasurer
E-Mail   : ro...@ge...
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
|
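To put rough numbers on "well-used" and "not that many power cycles", here is a small Python sketch that turns three raw counters from the table in the original post into rates. The inputs are the posted raw values for attributes 9, 12 and 193; the helper name and the figure of 8766 hours per year (365.25 days) are my own choices, and what counts as "many" cycles is left to the reader.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours

# Raw values copied from the SMART attribute table in the original post.
power_on_hours = 13532  # attribute 9, Power_On_Hours
power_cycles = 1695     # attribute 12, Power_Cycle_Count
load_cycles = 51774     # attribute 193, Load_Cycle_Count


def usage_summary(hours, cycles, loads):
    """Convert raw lifetime counters into easier-to-judge rates."""
    return {
        "years_powered_on": hours / HOURS_PER_YEAR,
        "hours_per_power_cycle": hours / cycles,
        "load_cycles_per_hour": loads / hours,
    }


summary = usage_summary(power_on_hours, power_cycles, load_cycles)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

This works out to about 1.5 years of powered-on time (matching the estimate given earlier in the thread), roughly 8 hours of run time per power cycle, and a little under 4 head load cycles per hour.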