From: mathog <ma...@ca...> - 2017-01-10 01:13:00
|
Hi,

One system has a WDC WD1600SB-01KBA0 which shows

197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1

Sadly, it does not list the block number in any of the test results (or the system log files).

Tried these steps to find the pending sector...

  reboot (to clear cache)
  # log in once it came back up
  dd if=/dev/sda of=/dev/null bs=512

which completed without error. Then tried

  smartctl -t long /dev/sda

and that also completed without error.

However "smartctl -a" still shows a pending sector.

Is there some other trick to find the thing?

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech |
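For anyone who wants to watch this counter from a script, here is a minimal sketch. It assumes the usual ten-column layout of the "smartctl -A" attribute table (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value), as in the lines quoted above. The function name pending_sectors is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the raw value of attribute 197 (Current_Pending_Sector)
# from "smartctl -A" style output supplied on stdin.
# Assumes the raw value is the 10th whitespace-separated column.
pending_sectors() {
    awk '$1 == 197 { print $10; found = 1 }
         END { if (!found) exit 1 }'
}
```

Typical use would be something like "smartctl -A /dev/sda | pending_sectors". The function exits non-zero when the drive does not report attribute 197 at all.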
From: Carlos E. R. <rob...@te...> - 2017-01-10 10:08:05
|
On 2017-01-10 02:12, mathog wrote:
> [...]
> Tried these steps to find the pending sector...
>
> reboot (to clear cache)

There is a way to clear it without reboot. Let me see...

[...]
> To free pagecache: echo 1 > /proc/sys/vm/drop_caches
> To free dentries and inodes: echo 2 > /proc/sys/vm/drop_caches
> To free pagecache, dentries and inodes: echo 3 > /proc/sys/vm/drop_caches

Or issue "sync" at the end.

> /sbin/sysctl -q -w vm.drop_caches=3
> using /sbin/sysctl is equivalent to the "echo >/proc/sys/..." line above

> # log in once it came back up
> dd if=/dev/sda of=/dev/null bs=512
>
> which completed without error. Then tried
>
> smartctl -t long /dev/sda
>
> and that also completed without error.
>
> However "smartctl -a " still shows a pending sector.

The same thing happened to me recently.

> Is there some other trick to find the thing?

I ran "badblocks" with the intention of locating them, and they disappeared...

--
Cheers / Saludos,
Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar) |
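The cache-clearing steps quoted above can be sketched as one small script. Two points here are assumptions and not something stated in the thread: the kernel documentation for drop_caches recommends running sync before dropping caches (dirty pages cannot be freed), and GNU dd's iflag=direct is used so that the read bypasses the page cache entirely. The helper names run and uncached_read are hypothetical. The script only prints the commands unless DRY_RUN=0 is set, since the real commands need root.

```shell
#!/bin/sh
# Sketch only: read a whole device without the page cache getting in the way.
# By default the commands are printed, not executed (set DRY_RUN=0 to run them).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

uncached_read() {
    dev="$1"
    run sync                                          # flush dirty pages first
    run sysctl -q -w vm.drop_caches=3                 # drop pagecache, dentries, inodes
    run dd if="$dev" of=/dev/null bs=1M iflag=direct  # O_DIRECT read, bypasses the cache
}
```

Calling "uncached_read /dev/sda" in dry-run mode just lists the three commands. Whether a clean read like this tells you anything about a pending sector is a separate question, as the rest of the thread shows.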
From: mathog <ma...@ca...> - 2017-01-11 01:15:28
|
On 10-Jan-2017 02:07, Carlos E. R. wrote:
> On 2017-01-10 02:12, mathog wrote:
>> Is there some other trick to find the thing?
>
> I run "badblocks" with the intention of locating them, and they
> disappeared...

Rebooted that node into PLD Rescue CD over the network, ssh'd into it and ran

  badblocks -nvs /dev/sda >/tmp/bb.log 2>&1 &

Somewhere along the line the pending sector cleared, but there was no message giving the block number, and it said there were no errors. The UDMA_CRC_ERROR_COUNT is still 1.

So that worked.

Now, for the next time, is there a command one can use while the OS is running and the disk mounted that can do something similar? badblocks -n isn't happy running on mounted disks, and that badblocks command took ~4.5 hours.

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech |
From: Carlos E. R. <rob...@te...> - 2017-01-11 02:15:05
|
On 2017-01-11 02:15, mathog wrote:
> On 10-Jan-2017 02:07, Carlos E. R. wrote:
>> I run "badblocks" with the intention of locating them, and they
>> disappeared...
>
> Rebooted that node into PLD Rescue CD over the network, ssh'd into it
> and ran
>
> badblocks -nvs /dev/sda >/tmp/bb.log 2>&1 &
>
> somewhere along the line the pending sector cleared, but there was no
> message giving the block number, and it said there were no errors. The
> UDMA_CRC_ERROR_COUNT is still 1.

Yes, same thing here. I don't remember what value that parameter had.

> So that worked.
>
> Now, for the next time, is there a command one can use
> while the OS is running and the disk mounted that can do something
> similar?

Previously I figured it out from the point at which the long test stopped.

> badblocks -n isn't happy running on mounted disks, and that badblocks
> command took ~4.5 hours.

Yes, it runs for a very long time. I'm unsure if my disk was mounted or not.

--
Cheers / Saludos,
Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar) |
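When a long self-test does stop on a read failure, the self-test log ("smartctl -l selftest") normally records the failing address in its last column, LBA_of_first_error. A minimal sketch for pulling that number out follows. It assumes the usual ATA log layout, where entries start with "# N", the newest entry comes first, and tests without an error show "-" in the last column. The function name first_error_lba is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print the LBA_of_first_error of the most recent failed
# self-test, reading "smartctl -l selftest" style output from stdin.
# Entries that completed without error carry "-" in the last column and are skipped.
first_error_lba() {
    awk '/^# *[0-9]+/ && $NF ~ /^[0-9]+$/ { print $NF; exit }'
}
```

Typical use would be "smartctl -l selftest /dev/sda | first_error_lba". In the case discussed in this thread it would print nothing, because the long test completed without error.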
From: Bruce A. <ba...@uw...> - 2017-01-11 09:41:23
|
David, I think the UDMA_CRC_ERROR_COUNT is referring to a crc error on the data bus. If that is right then it can be safely ignored; if it is recurring I would try and clean and replug the data connections to the drive.

Cheers,
Bruce

--------------------------------------------------------------------
Bruce Allen, Adjunct Professor of Physics
Leonard E. Parker Center for Gravitation, Cosmology and Astrophysics
Physics Department
University of Wisconsin - Milwaukee
3135 N Maryland Ave
Milwaukee, 53211 USA
Tel: +1 414-229-4474
Fax: +1 414-229-5589
ba...@uw... |
From: <ro...@sp...> - 2017-01-11 15:25:57
|
I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable. With only 1, I wouldn't sweat it. The case I remember, the count was in the hundreds. It stopped climbing after I switched out the cable. The existing one was noticeably frayed. |
From: mathog <ma...@ca...> - 2017-01-11 17:42:24
|
On 11-Jan-2017 07:25, ro...@sp... wrote:
> I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable.

OK, I will make a note to look at the cable if this unit ever has problems like this again. (It could have been a gamma ray or something hitting a gate, right?)

Now I'm trying to understand what happened here. My best guess is that it went something like this (leaving out a few steps):

1. Some issue with cable, connectors, radiation etc. arose.
2. A write to a specific block, presumably with new data, ran into (1) and failed.
3. ??? The disk shuffled that data off to a temporary location (spare physical block, flash, or ?) and set the pending sector count and UDMA_CRC_ERROR_COUNT.
4. Read of entire disk found no errors because the disk retrieved either the [??? old or new] contents without problems.
5. System was powered down for several minutes and started back up. The pending block and UDMA_CRC_ERROR_COUNT were still set. Presumably this means the pending data was stored in a nonvolatile location.
6. badblocks -nvs read the bad block [??? old or new] data and then wrote it back to disk. It saw no errors while doing so because (1) was no longer a problem. This time the write succeeded and the pending block was reset to 0. The reallocated block count stayed 0. Either it didn't reallocate the block or it did and it didn't increment the counter.

So the question is - _which_ data is in that iffy block now? Is it the data which caused the failed write in the first place, or whatever was there before the write? Hopefully it is the former!

Thanks,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech |
From: L.A. W. <sma...@tl...> - 2017-01-11 18:42:19
|
mathog wrote:
> Hi,
>
> One system has a WDC WD1600SB-01KBA0 which shows
>
> 197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always
> 198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always
> 199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always
>
> Sadly, it does not list the block number in any of the test results
> (or the system log files).
>
> Is there some other trick to find the thing?
----

Most of the time, you can't find the exact sector, but it's an indication that the disk may have had to move the data to a backup sector.

Modern hard disks usually have 'tracks' of spare sectors that they can reallocate (up to and including reallocating entire tracks) when they start to detect weak and/or unreliable signals on _READ_.

They are a sign that the disk is nearing the end of its useful life. The smart diagnostics are not intended to be exact diagnostics but an _Early_Warning_ system -- meaning that you had better move that data off to a safer location.

Before the advent of the SMART diags, you could often hear a disk going bad, as what were supposed to be sequential, linear reads, weren't. You could hear the excess seeking as the disk had to seek over to the replacement sectors and back again.

Really -- you should be ready to replace this disk "soon" (as soon as you can) and use any remaining life in it to make sure everything on it is backed up. |
From: Carlos E. R. <rob...@te...> - 2017-01-11 20:59:15
|
On 2017-01-11 18:42, mathog wrote:
> On 11-Jan-2017 07:25, robert@ wrote:
>> I've also seen the UDMA_CRC_ERROR_COUNT caused by a bad SATA cable.
>
> OK, I will make a note to look at the cable if this unit ever has
> problems like this again. (It could have been a gamma ray or something
> hitting a gate, right?)
>
> Now I'm trying to understand what happened here. My best guess is that
> it went something like this (leaving out a few steps):
>
> 1. some issue with cable, connectors, radiation etc. arose.
> 2. a write to a specific block, presumably with new data, ran
> into (1) and failed.

If this happens during a write, the sector is reallocated. If it happens during a read, reallocation is postponed and the sector noted. I don't know if it creates a list and how to read that list.

Reallocation happens during an attempted write, to another permanent location.

I don't know what happened during the badblocks run, because it is a read operation.

--
Cheers / Saludos,
Carlos E. R.
(from 42.2 x86_64 "Malachite" at Telcontar) |
From: Dan L. <da...@ob...> - 2017-01-12 00:37:45
|
On 11.1.2017 21:58, Carlos E. R. wrote:
> If this happens during a write, the sector is reallocated.
> Reallocation happens during an attempted write, to another permanent
> location.

Note that the relocation may not occur if the write request doesn't cover the entire physical sector (this can happen on an "advanced format" disk). In that case an error may simply be returned instead. This behavior has been observed on a WDC disk (but I don't remember the exact model and firmware version).

So physical sector size and location need to be taken into consideration.

Dan |
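The arithmetic behind that last point can be sketched as follows: given a logical block address, work out the first and last logical sectors of the physical sector that contains it, so that a rewrite can cover the whole physical sector. This is only a sketch under stated assumptions: 512-byte logical and 4096-byte physical sectors by default, and physical sectors aligned to logical sector 0 (some drives report a non-zero alignment offset, which this ignores). The function name physical_sector_range is hypothetical.

```shell
#!/bin/sh
# Sketch only: print "first last" logical sector numbers of the physical
# sector containing the given LBA.
# Usage: physical_sector_range LBA [LOGICAL_BYTES] [PHYSICAL_BYTES]
# Assumes physical sectors start at logical sector 0 (no alignment offset).
physical_sector_range() {
    lba="$1"
    logical="${2:-512}"
    physical="${3:-4096}"
    per=$((physical / logical))      # logical sectors per physical sector
    first=$((lba - lba % per))       # round down to a physical boundary
    last=$((first + per - 1))
    printf '%s %s\n' "$first" "$last"
}
```

For example, "physical_sector_range 123456789" prints "123456784 123456791". On Linux the two sizes can usually be read from /sys/block/sda/queue/logical_block_size and /sys/block/sda/queue/physical_block_size.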
From: mathog <ma...@ca...> - 2017-01-11 21:12:49
|
On 11-Jan-2017 12:58, Carlos E. R. wrote:
> If this happens during a write, the sector is reallocated. If it happens
> during a read, reallocation is postponed and the sector noted. I don't
> know if it creates a list and how to read that list.

Subsequent reads of the whole disk did not log any errors, nor did they clear the "current pending sector" count. That's the odd part - the disk had somewhere stored "there is some problem with block N" and incremented the pending sector count, but it seems that several reads from block N (wherever that was) which completed without error were not enough to change its mind.

It seems like a major shortcoming in the SMART protocol that there is no "list the pending sectors" command. The disk must have this information, otherwise we cannot explain the way it behaved in this case.

> Reallocation happens during an attempted write, to another permanent
> location.

Agreed.

> I don't know what happened during the badblock run, because it is a read
> operation.

With -nvs there is also a write after the read. It presumably read the iffy block successfully (for about the 6th time) and when it wrote it back the flag finally cleared. It may or may not have been reallocated, but if it was, the counter did not increment. It seems about as likely that the disk just cleared the flag. Perhaps the firmware at that point did a couple of read/write tests on its own and decided all was now OK. We can't really know what the disk does "underneath" the level we interact with.

Regards,

David Mathog
ma...@ca...
Manager, Sequence Analysis Facility, Biology Division, Caltech |
From: Bruce A. <bru...@ae...> - 2017-01-12 01:57:23
|
Hi David,

> It seems like a major shortcoming in the SMART protocol that there is no "list the pending sectors" command. The disk must have this
> information, otherwise we cannot explain the way it behaved in this case.

I agree. The truth is that the entire SMART protocol is something of a hack. It was first implemented by a couple of vendors, then turned into an SFF "specification" which was subsequently actively withdrawn (meaning: the industry did its best to destroy every copy of the document in existence). Then VERY limited parts of that were included in the ATA specification, which were then gradually morphed into something with a different intent (on and off-line testing, rather than monitoring and failure prediction).

All in all, SMART is useful, but it's also very flawed. My personal hope is that over the coming ten years, the SSD will replace the HDD, and the devices and algorithms that underlie the SSD will become reliable enough that almost all of the SMART protocol and features become irrelevant and fade away. Time will tell.

Cheers,
Bruce

--------------------------------------------------------------------------
Prof. Dr. Bruce Allen, Director
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Callinstrasse 38
D-30167 Hannover, Germany
Tel +49-511-762-17145
Fax +49-511-762-17182
Email: bru...@ae... |