From: Douglas G. <do...@to...> - 2004-09-11 12:58:36
Bruce Allen wrote:
> Doug,
>
> This one's for you. It might also merit a few sentences in the smartctl
> man page.
>
> ---------- Forwarded message ----------
> Date: Fri, 10 Sep 2004 09:51:49 -0400
> From: Don Shesnicky <dsh...@en...>
> To: sma...@li...
> Subject: [smartmontools-support]minor/major errors
>
> I have a number of Seagate scsi disks and am wondering about the output.
> What do the "Errors Corrected, minor/major" fields mean?
>
> Don
>
> Output:
>
>           Errors Corrected    Total      Total   Correction     Gigabytes    Total
>               delay:       [rereads/    errors   algorithm      processed    uncorrected
>             minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
> read:     161234        0         0    161234     163110        130.596           0
> write:         0        0        70        70       1483        359.225           0

After reading a recent Hitachi disk product manual I have changed
the wording of those two columns in version 5.33. The terms shown
above come from the SCSI SPC-2 (and draft SPC-3) standard
(see http://www.t10.org). The standard says the meaning is
vendor specific. The Hitachi description is:
  - error sectors corrected on the fly by ECC [was "minor"]
  - error sectors corrected by ECC with possible delays [was "major"]

Various worried smartmontools users have reported largish numbers
in the "on the fly" column. My guess is that one or more sectors
on the disk have correctable ECC blemishes which the drive does
not think warrant reporting a SCSI error (not even RECOVERED
ERROR). So the sector(s) stay in place and are read often by the
OS. The returned data is correct but that counter is bumped every
time it happens.

As for doing something about this, the "read-write error recovery"
mode page is relevant; specifically the ARRE (automatic read
reallocation enabled) bit. Manipulating that is beyond the scope
of smartmontools. [Ducks for cover.] If that bit were set then the
next time the recoverable sectors were read the second column
("with possible delays") would be incremented IMO.

Doug Gilbert
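For anyone who wants to track these counters from a script, here is a minimal Python sketch that turns the "read:" and "write:" rows of the error counter log into named fields. It assumes the seven-column layout shown in the output above; the dictionary keys are illustrative names made up for this sketch, not identifiers used by smartctl, and the sample text is copied from Don's report rather than captured from a live disk.

```python
import re

# Keys are hypothetical names for this sketch; the order matches the
# seven columns of the error counter log shown in the thread.
COLUMNS = (
    "corrected_on_the_fly",    # the column formerly labelled "minor"
    "corrected_with_delay",    # the column formerly labelled "major"
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)

ROW = re.compile(r"^\s*(read|write):\s+(.*)$")


def parse_error_counters(text):
    """Return {"read": {...}, "write": {...}} for rows that match the layout."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        fields = match.group(2).split()
        if len(fields) != len(COLUMNS):
            continue  # unexpected layout: skip rather than guess
        values = [float(f) if "." in f else int(f) for f in fields]
        rows[match.group(1)] = dict(zip(COLUMNS, values))
    return rows


# Sample rows taken from the output quoted in the thread.
SAMPLE = """\
read:     161234        0         0    161234     163110        130.596           0
write:         0        0        70        70       1483        359.225           0
"""

counters = parse_error_counters(SAMPLE)
print(counters["read"]["corrected_on_the_fly"])
print(counters["write"]["rereads_rewrites"])
```

Rows whose field count differs from seven are skipped on purpose, since the column wording changed between releases and a silent misparse would be worse than a missing row.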
From: Don S. <dsh...@en...> - 2004-09-13 22:55:08
Anyone,
The reason I got interested in this utility is that I had a 36 gig SCSI
Seagate drive go bad in a login server. Just before it went down it was
showing one major error. Today we had the rebuilt server just about do
the exact same scenario. The /var partition is trashed and I had to
create a temporary one on the root partition to get us up and running
again. Yesterday the output of a smartctl script run started showing
1 major error... just like before.

Let me ask everyone/anyone this - is there ANYTHING in the smartmon
tools itself that could cause this?

Don

-----Original Message-----
From: Douglas Gilbert [mailto:do...@to...]
Sent: Monday, September 13, 2004 5:52 PM
To: Don Shesnicky
Subject: Re: [smartmontools-support]minor/major errors (fwd)

Don Shesnicky wrote:
> Doug,
> Can you tell me what bit/flag/variable the tool is looking at so I can
> ask Seagate for the meaning and whether it is a problem? Here are the
> numbers for another disk which is heavily used but I have none that
> don't show something in these columns.
>
> Don
>
> Error counter log:
>            Errors Corrected     Total       Total    Correction     Gigabytes    Total
>                delay:        [rereads/     errors    algorithm      processed    uncorrected
>              minor | major   rewrites]   corrected   invocations   [10^9 bytes]  errors
> read:   182805994        0          0   182805994    182805994      22277.952           0
> write:          0        0          0           0            0       2733.390           0

Don,
It is the "Read Error Counter" log page (i.e. log page code 3) and the
parameter is called "Errors corrected without substantial delay" (i.e.
parameter code 0). That log page is defined in SPC (a.k.a. SCSI-3),
SPC-2 and SPC-3 (i.e. it has been there since at least 1997).

Doug Gilbert
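To make the "log page code 3, parameter code 0" pointer concrete, the sketch below walks a SCSI log page buffer using the generic layout from SPC: a 4-byte page header followed by parameters, each with a 2-byte code, a control byte, a length byte and a big-endian value. The buffer here is synthetic, built by the hypothetical helper make_param() from the numbers Don posted; a real one would come from a LOG SENSE command, which this sketch does not issue. The parameter names follow the SPC error counter page wording.

```python
import struct

# Parameter codes of the SPC error counter log pages.
PARAMETER_NAMES = {
    0: "errors corrected without substantial delay",
    1: "errors corrected with possible delays",
    2: "total rewrites or rereads",
    3: "total errors corrected",
    4: "total times correction algorithm processed",
    5: "total bytes processed",
    6: "total uncorrected errors",
}


def parse_log_page(buf):
    """Return (page_code, {parameter_code: value}) for one log page."""
    page_code = buf[0] & 0x3F                     # low six bits of byte 0
    (page_len,) = struct.unpack_from(">H", buf, 2)
    end = min(4 + page_len, len(buf))
    params = {}
    offset = 4
    while offset + 4 <= end:
        code, _control, length = struct.unpack_from(">HBB", buf, offset)
        raw = buf[offset + 4:offset + 4 + length]
        params[code] = int.from_bytes(raw, "big")
        offset += 4 + length
    return page_code, params


def make_param(code, value, width=4):
    """Hypothetical helper: encode one parameter for the synthetic page."""
    return struct.pack(">HBB", code, 0, width) + value.to_bytes(width, "big")


# Synthetic Read Error Counter page (page code 3) using Don's numbers.
body = make_param(0, 182805994) + make_param(1, 0) + make_param(6, 0)
page = struct.pack(">BBH", 0x03, 0, len(body)) + body

page_code, params = parse_log_page(page)
for code, value in sorted(params.items()):
    print(page_code, PARAMETER_NAMES.get(code, "unknown"), value)
```

The counter width is taken from each parameter's own length byte, so the walk does not depend on a particular drive's choice of 4-byte or 8-byte counters.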
From: Bruce A. <ba...@gr...> - 2004-09-14 01:10:45
Don,

This doesn't answer your question directly -- but have you tried running
a self-test on the disk (smartctl -t long)? If the disk supports
self-testing, this is the first thing to try. And note: if the disk is
'on its last legs' then the added load of a self-test *could* be the
final straw. (But if this is the case then it's doomed anyway -- it's
just a matter of time.)

Cheers,
Bruce

> _______________________________________________
> Smartmontools-support mailing list
> Sma...@li...
> https://lists.sourceforge.net/lists/listinfo/smartmontools-support
From: Don S. <dsh...@en...> - 2004-09-14 03:10:22
Bruce,
Does this test take the drive offline in any way?

Don
From: Bruce A. <ba...@gr...> - 2004-09-14 03:28:35
> Does this test take the drive offline in any way?

Don,

As long as you don't use the captive flag (-C) the drive will still
respond and be usable. However it will respond more slowly than when not
doing a self-test. So you CAN do a self-test on a live system with
mounted disks in multi-user mode.

Cheers,
Bruce
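As an illustration of the advice above, here is a small Python sketch that builds the smartctl command line for a self-test and only adds the captive flag when explicitly asked. The device path /dev/sda and the function name selftest_command are assumptions made for this example; the sketch prints the commands instead of running them, so nothing touches a disk.

```python
def selftest_command(device, kind="long", captive=False):
    """Build the argv for a smartctl self-test.

    'device' is whatever node your system uses for the disk; '/dev/sda'
    below is only a placeholder. Captive mode (-C) takes the drive away
    from normal use for the duration of the test, so it is off by default.
    """
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    argv = ["smartctl", "-t", kind]
    if captive:
        argv.append("-C")
    argv.append(device)
    return argv


def selftest_log_command(device):
    """Build the argv that asks smartctl for the self-test log."""
    return ["smartctl", "-l", "selftest", device]


start = selftest_command("/dev/sda")
show = selftest_log_command("/dev/sda")
print(" ".join(start))
print(" ".join(show))
```

To actually run the test, the list could be passed to subprocess.run() as root; the long test proceeds in the background on the drive and its outcome shows up later in the self-test log.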
From: Bruce A. <ba...@gr...> - 2004-09-14 03:30:16
Don,

What version of smartmontools and OS are you using? I'll let Doug answer
your question.

Cheers,
Bruce

On Mon, 13 Sep 2004, Don Shesnicky wrote:

> Bruce,
> Just to clarify the original question - is it at all possible that
> smartmon could poke a wrong value into a drive that might cause it to
> fail or otherwise behave strangely? Don't mind me for asking, but this
> server going down twice while we head for a deadline has me in a
> very twitchy state.
>
> Don
From: Don S. <dsh...@en...> - 2004-09-14 04:13:48
Bruce/Doug,
It's Smartmon 5.32 with Redhat 7.2 stock kernel.

If there is any doubt as to the answer I need to know it. The fact that
I just started using smartmon and have had two drive problems in one or
two weeks makes it a definite question.

Don
From: Bruce A. <ba...@gr...> - 2004-09-14 09:06:49
> Its Smartmon 5.32 with Redhat 7.2 stock kernel.

At this point there are no known bugs in 5.32. It's based on an
experimental release from May, so the main code base has been out there
for about four months. The kernel is ancient but if there were relevant
SCSI bugs, Doug would have mentioned them.

> If there is any doubt as to the answer I need to know it. The fact
> that I just started using smartmon and have had two drive problems in
> one or two weeks makes it a definite question.

Speaking as a scientist I would ask about the causal relation here. Is it
really the case that smartmontools is the cause of your drive problems?
Or is it the drive problems that have piqued your interest in
smartmontools?

I would still advise you to run a long self-test. Might this cause
problems? Yes, if something is wrong with the disk. Would you see the
problems anyway? Yes, if something is wrong with the disk.

[In any case, and at the risk of sounding like your mother, if this is a
mission critical server you should have a spare disk (or spare system) on
hand and a reliable and tested backup system.]

Cheers,
Bruce
From: Douglas G. <do...@to...> - 2004-09-14 08:26:18
Don Shesnicky wrote:
> Bruce/Doug,
> Its Smartmon 5.32 with Redhat 7.2 stock kernel.
>
> If there is any doubt as to the answer I need to know it. The fact that
> I just started using smartmon
> and have had two drive problems in one or two weeks makes it a definite
> question.

Don,
I have no reports of the SCSI code in smartmontools doing
anything destructive. Neither format nor write commands
are invoked. Self-tests are done via the SEND DIAGNOSTIC
command. As Bruce said, both '-t short' and '-t long' can
be done on a disk with mounted file systems as long as the
captive flag (i.e. -C) is _not_ given. There should be
only minor impact on performance (this applies both to
SCSI and ATA disks).

My primary source for information (design and coding) is
the SCSI (draft) standards at http://www.t10.org , in
particular SPC-3, SBC-2 and SSC-2. There is no compliance
regime with "SCSI" standards, but IMO Seagate SCSI disks
comply as well as, if not better than, most. **

Disks fail, there are bad batches; IBM got out of the
disk business (forming a joint venture with Hitachi) due to
ongoing reliability concerns some years back. I have had
2 (separate) laptop disk failures this year; one machine
was only 6 weeks old and that failure occurred during an
XP defrag. There is not much else I can say.

Doug Gilbert

** I have no commercial interest in any SCSI equipment
manufacturer. Seagate have their own diagnostic package
called "seatools" which can do self tests and it runs
on Linux.
From: Don S. <dsh...@en...> - 2004-09-14 12:00:10
|
Bruce/Doug, I appreciate your detailed replies. I do not think that smartmon was necessarily responsible for the drive failures, but since there was a relationship between the failures and when I first started running and testing the utility, I wanted to ask. I'm sure you appreciate the sweat that problems like this can generate and the desire to nail down what is happening. I'm not totally certain this second incident is a disk failure; as you state, we need to test more. However, it is showing some symptoms of some sort of disk problem, and the possibility of two failures on high end drives with MTBFs of 1.2 million hours is a cause for worry since those are the disks we have in all of our servers. Don ________________________________ From: Bruce Allen [mailto:ba...@gr...] Sent: Tue 14/09/2004 5:06 AM To: Don Shesnicky Cc: Smartmontools Mailing List; Douglas Gilbert Subject: RE: [smartmontools-support]minor/major errors (fwd) > It's Smartmon 5.32 with Red Hat 7.2 stock kernel. At this point there are no known bugs in 5.32. It's based on an experimental release from May, so the main code base has been out there for about four months. The kernel is ancient but if there were relevant SCSI bugs, Doug would have mentioned them. > If there is any doubt as to the answer I need to know it. The fact > that I just started using smartmon and have had two drive problems in > one or two weeks makes it a definite question. Speaking as a scientist I would ask about the causal relation here. Is it really the case that smartmontools is the cause of your drive problems? Or is it the drive problems that have piqued your interest in smartmontools? I would still advise you to run a long self-test. Might this cause problems? Yes, if something is wrong with the disk. Would you see the problems anyway? Yes, if something is wrong with the disk. 
[In any case, and at the risk of sounding like your mother, if this is a mission critical server you should have a spare disk (or spare system) on hand and a reliable and tested backup system.] Cheers, Bruce > Don, > > What version of smartmontools and OS are you using? I'll let Doug answer > your question. > > Cheers, > Bruce > > On Mon, 13 Sep 2004, Don Shesnicky wrote: > > > > > Bruce, > > Just to clarify the original question - is it at all possible that > > smartmon could poke a wrong value into a drive that might cause it to > > fail or otherwise behave strangely? Don't mind me for asking, but > > this server going down twice while we head for a deadline has me in a > > very twitchy state. > > > > Don > > > > -----Original Message----- > > From: Bruce Allen [mailto:ba...@gr...] > > Sent: Monday, September 13, 2004 9:11 PM > > To: Don Shesnicky > > Cc: Douglas Gilbert; Smartmontools Mailing List > > Subject: RE: [smartmontools-support]minor/major errors (fwd) > > > > Don, > > > > This doesn't answer your question directly -- but have you tried running > > a self-test on the disk (smartctl -t long)? If the disk supports > > self-testing, this is the first thing to try. And note: if the disk is > > 'on its last legs' then the added load of a self-test *could* be the > > final straw. (But if this is the case then it's doomed anyway -- it's > > just a matter of time.) > > > > Cheers, > > Bruce > > > > On Mon, 13 Sep 2004, Don Shesnicky wrote: > > > > > > > > Anyone, > > > The reason I got interested in this utility is that I had a 36 gig > > > SCSI Seagate drive go bad in a login server. Just before it went down > > > it was showing one major error. Today we had the rebuilt server just > > > about do the exact same scenario. The /var partition is trashed and I > > > had to create a temporary one on the root partition to get us up > > > running again. 
> > > Yesterday the output of a smartctl script run started showing 1 major > > > error... just like before. > > > > > > Let me ask everyone/anyone this - is there ANYTHING in the smartmon > > > tools itself that could cause this? > > > > > > Don > > > > > > -----Original Message----- > > > From: Douglas Gilbert [mailto:do...@to...] > > > Sent: Monday, September 13, 2004 5:52 PM > > > To: Don Shesnicky > > > Subject: Re: [smartmontools-support]minor/major errors (fwd) > > > > > > Don Shesnicky wrote: > > > > Doug, > > > > Can you tell me what bit/flag/variable the tool is looking at so I > > > > can ask Seagate for the meaning and whether it is a problem? Here are > > > > the numbers for another disk which is heavily used but I have none > > > > that don't show something in these columns. > > > > > > > > Don > > > >
> > > > Error counter log:
> > > >          Errors Corrected    Total      Total  Correction     Gigabytes        Total
> > > >               delay:     [rereads/     errors   algorithm     processed  uncorrected
> > > >            minor | major rewrites]  corrected invocations  [10^9 bytes]       errors
> > > > read:  182805994       0         0  182805994   182805994     22277.952            0
> > > > write:         0       0         0          0           0      2733.390            0
> > > >
> > > Don, > > > It is the "Read Error Counter" log page (i.e. log page code 3) and the > > > parameter is called "Errors corrected without substantial delay" (i.e. > > > parameter code 0). That log page is defined in SPC (a.k.a. SCSI-3), > > > SPC-2 and SPC-3 (i.e. it has been there since at least 1997). > > > > > > Doug Gilbert |
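For anyone watching these counters from a script, as Don describes doing, the rows of the error counter table can be pulled out of saved smartctl text output. The following is a minimal Python sketch, assuming the seven-column row layout quoted in this thread; the field names are labels invented here (they are not smartctl identifiers), and other smartmontools versions may print the table differently.

```python
import re

# Column order as it appears in the error counter table quoted in this thread.
# The names are labels chosen for this sketch only.
FIELDS = (
    "corrected_minor",         # "minor": corrected without substantial delay
    "corrected_major",         # "major": corrected with possible delays
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)
ROW = re.compile(r"^\s*(read|write|verify):\s+(.+)$")


def parse_error_counter_rows(text):
    """Return {"read": {...}, "write": {...}} from smartctl text output."""
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue                      # unexpected layout: skip the row
        try:
            rows[match.group(1)] = {
                name: float(raw) if name == "gigabytes_processed" else int(raw)
                for name, raw in zip(FIELDS, values)
            }
        except ValueError:
            continue                      # non-numeric column: skip the row
    return rows


SAMPLE = """\
Error counter log:
read:  182805994       0         0  182805994   182805994     22277.952            0
write:         0       0         0          0           0      2733.390            0
"""

rows = parse_error_counter_rows(SAMPLE)
print(rows["read"]["corrected_minor"], rows["read"]["corrected_major"])
```

A script could log these values over time and compare successive runs, which is more informative than a single reading given that the standard leaves the counters' meaning to the vendor.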