From: Drew S. <dr...@ea...> - 2002-11-14 23:54:26
|
Hello, In my ongoing quest to monitor my RH8 Dell machines on a hardware level, I have been led to an interesting utility: http://smartmontools.sourceforge.net/ From the webpage: "The smartmontools package contains two utility programs (smartctl and smartd) to control and monitor storage systems using the Self-Monitoring, Analysis and Reporting Technology System (S.M.A.R.T.) built into most modern ATA and SCSI hard disks. It is derived from the smartsuite package, and includes support for ATA/ATAPI-5 disks. It should run on any modern Linux system." So I downloaded the utilities and tested them on a laptop and on a local machine, both with IDE drives. Nice! Works as advertised, and gives a pile of very useful information! So, I figured I'd deploy it on all my machines... but first, a final test. Screw the long story. DO NOT TRY THIS PACKAGE on Dell servers using AACRAID controllers! I've got a guy on the way over to the colocation facility now, to reboot the Dell 2550 machine which appears to be hung solid. Hopefully it comes back up, and I haven't destroyed anything - that box isn't in production yet, but it's slotted for production in the next couple of weeks. :( I don't blame the author of the software at all; this is not a slam by any stretch! The software works as advertised on the IDE systems that I tested. The trouble is on the AACRAID platform, with four 72G SCSI drives in a pair of mirrors, showing up as sda and sdb. Using 'smartctl -i /dev/sda' (iirc) reported that sda was SMART-capable, but that the temperature sensors were disabled... issuing 'smartctl -c /dev/sda' locked the system *solid*. Cheers, - Drew. PS: Any word on when the next OMSA, with RH8 support, will be available? -- Drew Smith (mux) <dr...@ri...> Encrypted e-mail preferred - finger for public key. 5801 7134 B54C 3D71 EBE1 CE24 F4DB 2528 5A46 A31B |
From: Folkert v. H. <fo...@va...> - 2002-11-20 17:01:46
|
Hi, I wrote a patch for smartmontools which simplifies the process of becoming a daemon-process. Here it is:

diff -uNr smartmontools-5.0-26.org/smartd.c smartmontools-5.0-26/smartd.c
--- smartmontools-5.0-26.org/smartd.c Thu Oct 31 17:38:30 2002
+++ smartmontools-5.0-26/smartd.c Wed Nov 20 17:59:46 2002
@@ -116,51 +116,6 @@
   return;
 }
 
-// Forks new process, closes all file descriptors, redirects stdin,
-// stdout, stderr
-int daemon_init(void){
-  pid_t pid;
-  int i;
-
-  if ((pid=fork()) < 0) {
-    // unable to fork!
-    printout(LOG_CRIT,"smartd unable to fork daemon process!\n");
-    exit(1);
-  }
-  else if (pid)
-    // we are the parent process -- exit cleanly
-    exit(0);
-
-  // from here on, we are the child process.
-  setsid();
-
-  // Fork one more time to avoid any possibility of having terminals
-  if ((pid=fork()) < 0) {
-    // unable to fork!
-    printout(LOG_CRIT,"smartd unable to fork daemon process!\n");
-    exit(1);
-  }
-  else if (pid)
-    // we are the parent process -- exit cleanly
-    exit(0);
-
-  // Now we are the child's child...
-
-  // close any open file descriptors
-  for (i=getdtablesize();i>=0;--i)
-    close(i);
-
-  // redirect any IO attempts to /dev/null for stdin
-  i=open("/dev/null",O_RDWR);
-  // stdout
-  dup(i);
-  // stderr
-  dup(i);
-  umask(0);
-  chdir("/");
-  return 0;
-}
-
 // Prints header identifying version of code and home
 void printhead(){
   printout(LOG_INFO,"smartd version %d.%d-%d - S.M.A.R.T. Daemon.\n",
@@ -1158,7 +1113,11 @@
 
   // If in background as a daemon, fork and close file descriptors
   if (!debugmode){
-    daemon_init();
+    if (daemon(0, 0) == -1)
+    {
+      syslog(LOG_CRIT, "Could not become daemon-process!");
+      exit(1);
+    }
   }
 
   // setup signal handler for shutdown
|
From: Bruce A. <ba...@gr...> - 2002-11-21 09:42:39
|
Hi Folkert, > I wrote a patch for smartmontools which simplifies the process of becoming a > daemon-process. > + if (daemon(0, 0) == -1) > + { > + syslog(LOG_CRIT, "Could not become daemon-process!"); > + exit(1); > + } Thanks very much for your patch! I'd like to incorporate it. I do have one question, though. (This question has been in the TODO file for some time now!) The existing daemon_init() closes ALL open file descriptors, not just those associated with stdin, stdout, and stderr. This is important because a logic bug in smartd might leave open fds from disks, and syslog might also leave open fds. Do I still need to do this closing of "other" fds by hand? Or will daemon close all open fds, not just 0,1,2? Cheers, Bruce |
From: <kn...@mo...> - 2002-11-21 09:55:25
|
On Thu, 21 Nov 2002, Bruce Allen wrote: >> I wrote a patch for smartmontools which simplifies the process of becoming a >> daemon-process. >> + if (daemon(0, 0) == -1) >> + { >> + syslog(LOG_CRIT, "Could not become daemon-process!"); >> + exit(1); >> + } > >Thanks very much for your patch! I'd like to incorporate it. > >I do have one question, though. (This question has been in the TODO file >for some time now!) > >The existing daemon_init() closes ALL open file descriptors, not just >those associated with stdin, stdout, and stderr. This is important >because a logic bug in smartd might leave open fds from disks, and syslog >might also leave open fds. > >Do I still need to do this closing of "other" fds by hand? Or will daemon >close all open fds, not just 0,1,2? According to the FreeBSD manpage for daemon, it will close and reopen stdin, stdout and stderr as /dev/null. http://www.gsp.com/cgi-bin/man.cgi?section=3&topic=daemon So it will certainly not close any dangling fds we might be leaving around. -- Erik I. Bolsø | email: <knan at mo.himolde.no> The UNIX philosophy basically involves giving you enough rope to hang yourself. And then a couple of feet more, just to be sure. |
From: Bruce A. <ba...@gr...> - 2002-11-21 10:12:44
|
> >I do have one question, though. (This question has been in the TODO file > >for some time now!) > > > >The existing daemon_init() closes ALL open file descriptors, not just > >those associated with stdin, stdout, and stderr. This is important > >because a logic bug in smartd might leave open fds from disks, and syslog > >might also leave open fds. > > > >Do I still need to do this closing of "other" fds by hand? Or will daemon > >close all open fds, not just 0,1,2? > > According to the FreeBSD manpage for daemon, it will close and reopen > stdin, stdout and stderr as /dev/null. > > http://www.gsp.com/cgi-bin/man.cgi?section=3&topic=daemon > > So it will certainly not close any dangling fds we might be leaving > around. OK, so the patch as it stands needs a bit of modification. But it's still cleaner... Bruce |
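For illustration, the "bit of modification" mentioned above might look roughly like the sketch below. Since daemon(0, 0) only redirects stdin, stdout and stderr to /dev/null, any other descriptors would still have to be closed by hand afterwards. The helper name close_fd_range() and its half-open range convention are assumptions made for this example; they are not taken from smartd.c.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Close every descriptor in the half-open range [lowest, limit).
 * Returns the number of descriptors that were actually open.
 *
 * Hypothetical helper: after a successful daemon(0, 0), a caller could
 * run close_fd_range(3, getdtablesize()) to drop descriptors left open
 * by disk probing or syslog, much as the loop in the existing
 * daemon_init() does for all descriptors.
 */
int close_fd_range(int lowest, int limit)
{
    int fd;
    int closed = 0;

    assert(lowest >= 0 && lowest <= limit);
    for (fd = lowest; fd < limit; ++fd) {
        /* close() fails with EBADF on unused slots; that is harmless here */
        if (close(fd) == 0)
            ++closed;
    }
    return closed;
}
```

Taking the upper bound as a parameter keeps the helper independent of getdtablesize(), which, like daemon(), may not be available in every C library.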
From: <kn...@mo...> - 2002-11-21 11:06:24
|
On Thu, 21 Nov 2002, Bruce Allen wrote: >> >I do have one question, though. (This question has been in the TODO file >> >for some time now!) >> > >> >The existing daemon_init() closes ALL open file descriptors, not just >> >those associated with stdin, stdout, and stderr. This is important >> >because a logic bug in smartd might leave open fds from disks, and syslog >> >might also leave open fds. >> > >> >Do I still need to do this closing of "other" fds by hand? Or will daemon >> >close all open fds, not just 0,1,2? >> >> According to the FreeBSD manpage for daemon, it will close and reopen >> stdin, stdout and stderr as /dev/null. >> >> http://www.gsp.com/cgi-bin/man.cgi?section=3&topic=daemon >> >> So it will certainly not close any dangling fds we might be leaving >> around. > >OK, so the patch as it stands needs a bit of modification. But it's still >cleaner... Note: This will be glibc 2.0+ only... libc5 doesn't have the daemon() call, according to a quick strings libc.so | grep daemon And as smartmontools could be interesting for the embedded people w/libc4, libc5, uclibc and such, perhaps best not to use too obscure functions? I'd advise keeping our current approach. -- Erik I. Bolsø | email: <knan at mo.himolde.no> The UNIX philosophy basically involves giving you enough rope to hang yourself. And then a couple of feet more, just to be sure. |
From: <kn...@mo...> - 2002-11-21 11:16:44
|
On Thu, 21 Nov 2002, Erik Inge Bolsø wrote: >On Thu, 21 Nov 2002, Bruce Allen wrote: >>OK, so the patch as it stands needs a bit of modification. But it's still >>cleaner... > >Note: This will be glibc 2.0+ only... libc5 doesn't have the daemon() >call, according to a quick strings libc.so | grep daemon > >And as smartmontools could be interesting for the embedded people w/libc4, >libc5, uclibc and such, perhaps best not to use too obscure functions? Example: I can guess that smartmontools would be interesting for Axis' linux people for their DVR. That box runs some homebrew embedded linux system from flash memory, and stores months worth of surveillance pictures on 2-4 ide disks. Monitoring failures of those disks with smartmon sounds worthwhile. I've worked with a few other Axis boxes, which use Linux on the Etrax processor, kernel 2.0 or 2.4, and certainly no glibc (in perhaps 8 MB flash? no way) So I'd advise keeping our current approach. -- Erik I. Bolsø | email: <knan at mo.himolde.no> The UNIX philosophy basically involves giving you enough rope to hang yourself. And then a couple of feet more, just to be sure. |
From: Bruce A. <ba...@gr...> - 2002-11-21 11:22:36
|
Folkert, > >OK, so the patch as it stands needs a bit of modification. But it's still > >cleaner... > > Note: This will be glibc 2.0+ only... libc5 doesn't have the daemon() > call, according to a quick strings libc.so | grep daemon > > And as smartmontools could be interesting for the embedded people w/libc4, > libc5, uclibc and such, perhaps best not to use too obscure functions? > > I'd advise keeping our current approach. Any comments about this? I was (eventually) planning to do what your patch does (modulo closing ALL fds). But since daemon() isn't universal, perhaps Erik's point is a good one. Cheers, Bruce |
From: Folkert v. H. <fo...@va...> - 2002-11-21 17:41:45
|
> >OK, so the patch as it stands needs a bit of modification. But it's still > >cleaner... > > Note: This will be glibc 2.0+ only... libc5 doesn't have the daemon() > call, according to a quick strings libc.so | grep daemon > > And as smartmontools could be interesting for the embedded people w/libc4, > libc5, uclibc and such, perhaps best not to use too obscure functions? > > I'd advise keeping our current approach. BA> Any comments about this? I was (eventually) planning to do what your BA> patch does (modulo closing ALL fds). But since daemon() isn't universal BA> perhaps Erik's point is a good one. Yes, I totally agree; keep the original approach. Folkert Although sadly I'm now one step further away from world-wide recognition. Oh well, such is life. |
From: Bruce A. <ba...@gr...> - 2002-11-21 20:45:04
|
> > I'd advise keeping our current approach. > BA> Any comments about this? I was (eventually) planning to do what your > BA> patch does (modulo closing ALL fds). But since daemon() isn't universal > BA> perhaps Erik's point is a good one. > > Yes, I totally agree; keep the original approach. OK. > Although sadly I'm now one step further away from world-wide > recognition. Oh well, such is life. Well there are lots of other things to fix and improve. Ask me and I'll provide a list. (By the way, do you know if we need to fork twice?? I found contradictory opinions when I searched for an answer.) Cheers, Bruce |
From: Folkert v. H. <fo...@va...> - 2002-11-21 20:57:25
|
> > Although sadly I'm now one step further away from world-wide > > recognition. Oh well, such is life. > Well there are lots of other things to fix and improve. Ask me and I'll > provide a list. Gimme! :o) > (By the way, do you know if we need to fork twice?? I found contradictory > opinions when I searched for an answer.) Yes, thy shall fork twice. Cannot remember why, though. You should do: fork, setsid, fork, and then things like chdir("/"), umask(0), close(0...2). Especially the chdir: it's a real pain if you want to unmount the partition from which you accidentally started the daemon. |
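The fork, setsid, fork sequence described above can be sketched as follows. The usual reason given for the second fork is that setsid() makes the first child a session leader, and on some systems a session leader that later opens a terminal device can acquire it as a controlling terminal; the grandchild is in the new session without leading it, so that cannot happen. This is a minimal sketch, not smartd's daemon_init(): unlike a real daemon start-up, the original process keeps running instead of exiting, so the effect can be observed. The names run_detached(), report_leadership() and report_fd are hypothetical.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical reporting channel so the result can be seen from outside. */
static int report_fd = -1;

/* Example job: write 'Y' if this process leads its process group, else 'N'.
 * A session leader always leads a process group, so 'N' also means
 * "not a session leader". */
static void report_leadership(void)
{
    char answer = (getpgrp() == getpid()) ? 'Y' : 'N';
    if (write(report_fd, &answer, 1) != 1)
        _exit(1);
}

/* Run job() in a detached grandchild. Returns 0 in the original process
 * once the intermediate child has exited cleanly, or -1 on failure. */
int run_detached(void (*job)(void))
{
    int status, devnull;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid > 0) {
        /* original process: reap the short-lived first child */
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
    }

    /* first child: start a new session, dropping the controlling terminal */
    if (setsid() < 0)
        _exit(1);

    pid = fork();
    if (pid < 0)
        _exit(1);
    if (pid > 0)
        _exit(0);               /* the session leader exits immediately */

    /* grandchild: member of the new session, but not its leader */
    if (chdir("/") < 0)         /* do not pin the start-up partition */
        _exit(1);
    umask(0);
    devnull = open("/dev/null", O_RDWR);
    if (devnull < 0)
        _exit(1);
    dup2(devnull, 0);           /* stdin  */
    dup2(devnull, 1);           /* stdout */
    dup2(devnull, 2);           /* stderr */
    if (devnull > 2)
        close(devnull);

    job();
    _exit(0);
}
```

To try it, create a pipe, store the write end in report_fd, call run_detached(report_leadership) and read one byte from the read end.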
From: Bruce A. <ba...@gr...> - 2002-11-21 21:23:37
|
> > > Although sadly I'm now one step further away from world-wide > > > recognition. Oh well, such is life. > > Well there are lots of other things to fix and improve. Ask me and I'll > > provide a list. > > Gimme! :o) OK.. you asked! Look at how I do "sleep()" and then interrupt it when I catch SIGUSR1. See sleephandler() and search for sleeptime. [Only in the latest CVS version of smartd.c.] There must be a better and simpler way of doing this where I just set an alarm for (say) 30 min from now, and sleep until it goes off, *unless* a user SIGUSR1 arrives. I don't like calling sleep every second. It works, but strikes me as ugly. Can you provide a better solution? > > > (By the way, do you know if we need to fork twice?? I found contradictory > > opinions when I searched for an answer.) > > Yes, thy shall fork twice. Cannot remember why, though. > You should do: > fork > setsid > fork > and then things like chdir("/") umask(0), close(0...2) > especially the chdir: it's a real pain if you want to unmount the partition > from which you accidentally started the daemon OK, then the code is ok -- it does fork twice. Bruce |
From: Bruce A. <ba...@gr...> - 2002-11-28 10:15:38
|
Hi Folkert, > > > > Although sadly I'm now one step further away from world-wide > > > > recognition. Oh well, such is life. > > > Well there are lots of other things to fix and improve. Ask me and I'll > > > provide a list. > > > > Gimme! :o) > > OK.. you asked! > > Look at how I do "sleep()" and then interrupt it when I catch SIGUSR1. > See sleephandler() and search for sleeptime. [Only in the latest CVS > version of smartd.c.] > > There must be a better and simpler way of doing this where I just set an > alarm for (say) 30 min from now, and sleep until it goes off, *unless* a > user SIGUSR1 arrives. I don't like calling sleep every second. It works, > but strikes me as ugly. > > Can you provide a better solution? I was just wondering if you had given this any thought... Bruce |
From: Jens B. <sma...@te...> - 2003-06-15 21:11:04
|
Hello, this is my result. Is the RAW_VALUE of Load_Cycle_Count "normal"?
----------------------------------------------------------------------
smartctl version 5.1-11 Copyright (C) 2002-3 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MK3017GAP
Serial Number:    61E30501T
Firmware Version: A0.02 H
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   5
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jun 15 22:23:54 2003 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Off-line data collection status:  (0x00) Offline data collection activity was
                                         never started.
                                         Auto Off-line Data Collection: Disabled.
Self-test execution status:       (   0) The previous self-test routine completed
                                         without error or no self-test has ever
                                         been run.
Total time to complete off-line
data collection:                  ( 347) seconds.
Offline data collection
capabilities:                     (0x1b) SMART execute Offline immediate.
                                         Automatic timer ON/OFF support.
                                         Suspend Offline collection upon new
                                         command.
                                         Offline surface scan supported.
                                         Self-test supported.
                                         No Conveyance Self-test supported.
                                         No Selective Self-test supported.
SMART capabilities:             (0x0003) Saves SMART data before entering
                                         power-saving mode.
                                         Supports SMART auto save timer.
Error logging capability:         (0x01) Error logging supported.
                                         No General Purpose Logging support.
Short self-test routine
recommended polling time:         (   2) minutes.
Extended self-test routine
recommended polling time:         (  39) minutes.

SMART Attributes Data Structure revision number: 6
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   253   100   050    Pre-fail  -           0
  2 Throughput_Performance  0x0004   101   100   050    Old_age   -           0
  3 Spin_Up_Time            0x0026   100   100   001    Old_age   -           1725
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   -           1846
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  -           5
  8 Seek_Time_Performance   0x0004   100   100   050    Old_age   -           0
  9 Power_On_Hours          0x0032   085   085   050    Old_age   -           6311
 10 Spin_Retry_Count        0x0033   135   100   030    Pre-fail  -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   -           1461
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   -           0
220 Disk_Shift              0x0002   100   100   001    Old_age   -           4370
222 Loaded_Hours            0x0032   090   090   050    Old_age   -           4214
223 Load_Retry_Count        0x0032   100   100   050    Old_age   -           0
224 Load_Friction           0x0022   100   100   050    Old_age   -           0
225 Load_Cycle_Count        0x0032   066   066   070    Old_age   FAILING_NOW 342111
226 Load-in_Time            0x0026   100   100   001    Old_age   -           152
228 Power-off_Retract_Count 0x0032   100   100   060    Old_age   -           40
240 Head flying hours       0x0001   100   100   001    Pre-fail  -           15

SMART Error Log Version: 1
ATA Error Count: 36 (device log contains only the most recent five errors)
DCR = Device Control Register
FR = Features Register
SC = Sector Count Register
SN = Sector Number Register
CL = Cylinder Low Register
CH = Cylinder High Register
D/H = Device/Head Register
CR = Content written to Command Register
ER = Error register
STA = Status register
Timestamp is seconds since the previous disk power-on.
Note: timestamp "wraps" after 2^32 msec = 49.710 days.

Error 36 occurred at disk power-on lifetime: 713 hours
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51
Sequence of commands leading to the command that caused the error were:
DCR FR SC SN CL CH D/H CR Timestamp
 00 00 6c a3 49 bb e2  40 1663.643
 00 00 08 bf 24 81 e0  ca 1663.643
 00 00 70 9f 49 bb e2  40 1654.900
 00 00 08 bf 24 81 e0  ca 1654.900
 00 00 74 9b 49 bb e2  40 1646.157

Error 35 occurred at disk power-on lifetime: 713 hours
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51
Sequence of commands leading to the command that caused the error were:
DCR FR SC SN CL CH D/H CR Timestamp
 00 00 70 9f 49 bb e2  40 1654.900
 00 00 08 bf 24 81 e0  ca 1654.900
 00 00 74 9b 49 bb e2  40 1646.157
 00 00 08 bf 24 81 e0  ca 1646.157
 00 00 78 97 49 bb e2  40 1637.429

Error 34 occurred at disk power-on lifetime: 713 hours
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51
Sequence of commands leading to the command that caused the error were:
DCR FR SC SN CL CH D/H CR Timestamp
 00 00 74 9b 49 bb e2  40 1646.157
 00 00 08 bf 24 81 e0  ca 1646.157
 00 00 78 97 49 bb e2  40 1637.429
 00 00 08 bf 24 81 e0  ca 1637.428
 00 00 50 bf 49 bb e2  40 1628.657

Error 33 occurred at disk power-on lifetime: 713 hours
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51
Sequence of commands leading to the command that caused the error were:
DCR FR SC SN CL CH D/H CR Timestamp
 00 00 78 97 49 bb e2  40 1637.429
 00 00 08 bf 24 81 e0  ca 1637.428
 00 00 50 bf 49 bb e2  40 1628.657
 00 00 08 bf 24 81 e0  ca 1628.657
 00 00 58 b7 49 bb e2  40 1619.929

Error 32 occurred at disk power-on lifetime: 713 hours
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51
Sequence of commands leading to the command that caused the error were:
DCR FR SC SN CL CH D/H CR Timestamp
 00 00 50 bf 49 bb e2  40 1628.657
 00 00 08 bf 24 81 e0  ca 1628.657
 00 00 58 b7 49 bb e2  40 1619.929
 00 00 08 bf 24 81 e0  ca 1619.928
 00 00 60 af 49 bb e2  40 1611.186

SMART Self-test log, version number 1
Num  Test_Description  Status     Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short off-line    Completed  00%        3                -
----------------------------------------------------------------------
-Jens |
From: Bruce A. <ba...@gr...> - 2003-06-16 02:44:36
|
Hi Jens, On Sun, 15 Jun 2003, Jens Balk wrote: > Hello, > this is my result. > Is the RAW_VALUE of Load_Cycle_Count "normal"? Apparently not, since the disk firmware is reporting that value as failing. But note that since this Attribute is an old-age Attribute rather than a prefail Attribute, the disk is not predicting its own demise. To quote from SFF-8035i revision 2 (see REFERENCES on smartmontools home page): "Attribute value less than or equal to its corresponding attribute threshold indicates an advisory condition where the usage or age of the device has exceeded its intended design life period." I think that these load cycles are caused when the disk is idle for some time and "unloads" the heads, then reloads them when a read or write is needed. Here are some further comments: > smartctl version 5.1-11 Copyright (C) 2002-3 Bruce Allen Home page is > http://smartmontools.sourceforge.net/ > > === START OF INFORMATION SECTION === > Device Model: TOSHIBA MK3017GAP > Serial Number: 61E30501T > Firmware Version: A0.02 H > Device is: Not in smartctl database [for details use: -P showall] > ATA Version is: 5 > ATA Standard is: Exact ATA specification draft version not indicated > Local Time is: Sun Jun 15 22:23:54 2003 CEST > SMART support is: Available - device has SMART capability. > SMART support is: Enabled > > === START OF READ SMART DATA SECTION === > SMART overall-health self-assessment test result: PASSED > See vendor-specific Attribute list for marginal Attributes. > > General SMART Values: > Off-line data collection status: (0x00) Offline data collection activity was > never started. > Auto Off-line Data Collection: > Disabled. You might try enabling the automatic offline test timer with -o on. > ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE WHEN_FAILED > RAW_VALUE > 4 Start_Stop_Count 0x0032 100 100 000 Old_age - > 1846 > 5 Reallocated_Sector_Ct 0x0033 100 100 001 Pre-fail - > 5 Hmm, some sectors have been reallocated. Not a good sign... 
> 225 Load_Cycle_Count 0x0032 066 066 070 Old_age FAILING_NOW > 342111 OK, this is definitely a failing attribute: its value (66) is well below the threshold (70). > SMART Error Log Version: 1 > ATA Error Count: 36 (device log contains only the most recent five errors) > DCR = Device Control Register > FR = Features Register > SC = Sector Count Register > SN = Sector Number Register > CL = Cylinder Low Register > CH = Cylinder High Register > D/H = Device/Head Register > CR = Content written to Command Register > ER = Error register > STA = Status register > Timestamp is seconds since the previous disk power-on. > Note: timestamp "wraps" after 2^32 msec = 49.710 days. > > Error 36 occurred at disk power-on lifetime: 713 hours > When the command that caused the error occurred, the device was active or > idle. > After command completion occurred, registers were: > ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51 > Sequence of commands leading to the command that caused the error were: > DCR FR SC SN CL CH D/H CR Timestamp > 00 00 6c a3 49 bb e2 40 1663.643 > 00 00 08 bf 24 81 e0 ca 1663.643 > 00 00 70 9f 49 bb e2 40 1654.900 > 00 00 08 bf 24 81 e0 ca 1654.900 > 00 00 74 9b 49 bb e2 40 1646.157 > > Error 35 occurred at disk power-on lifetime: 713 hours > When the command that caused the error occurred, the device was active or > idle. > After command completion occurred, registers were: > ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51 > Sequence of commands leading to the command that caused the error were: > DCR FR SC SN CL CH D/H CR Timestamp > 00 00 70 9f 49 bb e2 40 1654.900 > 00 00 08 bf 24 81 e0 ca 1654.900 > 00 00 74 9b 49 bb e2 40 1646.157 > 00 00 08 bf 24 81 e0 ca 1646.157 > 00 00 78 97 49 bb e2 40 1637.429 > > Error 34 occurred at disk power-on lifetime: 713 hours > When the command that caused the error occurred, the device was active or > idle. 
> After command completion occurred, registers were: > ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51 > Sequence of commands leading to the command that caused the error were: > DCR FR SC SN CL CH D/H CR Timestamp > 00 00 74 9b 49 bb e2 40 1646.157 > 00 00 08 bf 24 81 e0 ca 1646.157 > 00 00 78 97 49 bb e2 40 1637.429 > 00 00 08 bf 24 81 e0 ca 1637.428 > 00 00 50 bf 49 bb e2 40 1628.657 > > Error 33 occurred at disk power-on lifetime: 713 hours > When the command that caused the error occurred, the device was active or > idle. > After command completion occurred, registers were: > ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51 > Sequence of commands leading to the command that caused the error were: > DCR FR SC SN CL CH D/H CR Timestamp > 00 00 78 97 49 bb e2 40 1637.429 > 00 00 08 bf 24 81 e0 ca 1637.428 > 00 00 50 bf 49 bb e2 40 1628.657 > 00 00 08 bf 24 81 e0 ca 1628.657 > 00 00 58 b7 49 bb e2 40 1619.929 > > Error 32 occurred at disk power-on lifetime: 713 hours > When the command that caused the error occurred, the device was active or > idle. > After command completion occurred, registers were: > ER:40 SC:00 SN:c6 CL:49 CH:bb D/H:e2 ST:51 > Sequence of commands leading to the command that caused the error were: > DCR FR SC SN CL CH D/H CR Timestamp > 00 00 50 bf 49 bb e2 40 1628.657 > 00 00 08 bf 24 81 e0 ca 1628.657 > 00 00 58 b7 49 bb e2 40 1619.929 > 00 00 08 bf 24 81 e0 ca 1619.928 > 00 00 60 af 49 bb e2 40 1611.186 Interesting. All of these errors are from when the disk was 713 hours old. It's now 6000+ hours old. So these are old errors -- you can probably ignore them since none are recent. > SMART Self-test log, version number 1 > Num Test_Description Status Remaining LifeTime(hours) > LBA_of_first_error > # 1 Short off-line Completed 00% > 3 - Your last self-test was when the disk was 3 hours old -- I would do some extended self-tests now: smartctl -t long /dev/hda and check the error log to see if they were OK. 
The reallocated sector count value makes me suspect that your disk may be having trouble -- the self-tests should reveal this. Cheers, Bruce |
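As a rough illustration of the rule quoted from SFF-8035i above, the sketch below classifies an attribute from its normalized value, its threshold and its type. It assumes that bit 0 of the FLAG column marks a pre-fail attribute, which is consistent with the table in this thread (0x0033 and 0x000b are listed as Pre-fail, 0x0032 as Old_age). The struct, enum and function names are made up for the example and are not smartmontools code.

```c
#include <assert.h>
#include <stddef.h>

enum attr_status {
    ATTR_OK,                /* value is above the threshold */
    ATTR_ADVISORY,          /* old-age attribute at or below threshold */
    ATTR_FAILURE_PREDICTED  /* pre-fail attribute at or below threshold */
};

/* One row of the attribute table (hypothetical layout for this sketch). */
struct smart_attr {
    const char *name;
    unsigned int flags;     /* FLAG column; bit 0 assumed to mean pre-fail */
    unsigned int value;     /* normalized VALUE column */
    unsigned int thresh;    /* THRESH column */
};

/* Apply the "less than or equal to threshold" rule to one attribute. */
enum attr_status classify_attr(const struct smart_attr *a)
{
    assert(a != NULL);
    if (a->value > a->thresh)
        return ATTR_OK;
    return (a->flags & 0x0001) ? ATTR_FAILURE_PREDICTED : ATTR_ADVISORY;
}
```

With the values from Jens's disk, Load_Cycle_Count (flags 0x0032, value 66, threshold 70) comes out as advisory rather than as a predicted failure, and Reallocated_Sector_Ct (flags 0x0033, value 100, threshold 1) is still above its threshold despite the non-zero raw count.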
From: Peter V. <pve...@ar...> - 2010-11-27 17:41:33
|
Hello, I have downloaded and installed smartmontools, and now, when I want to uninstall it, there are problems: I am told I should close the program first. How do I uninstall the program? Thanks, Peter |
From: Bruce A. <ba...@gr...> - 2002-11-15 10:51:02
|
Hi Drew, Thanks very much for your note. I'll add a warning about the AACRAID controllers in a couple of prominent places. I don't know much about how the SCSI code will interact with SCSI raid controllers. By the way, do you know if the smartsuite utilities cause the same problems on the SCSI raid box? Cheers, Bruce On 14 Nov 2002, Drew Smith wrote: > > Holas, > > In my ongoing quest to monitor my RH8 Dell machines on a hardware > level, I have been led to an interesting utility: > > http://smartmontools.sourceforge.net/ > > From the webpage: > > "The smartmontools package contains two utility programs (smartctl and > smartd) to control and monitor storage systems using the > Self-Monitoring, Analysis and Reporting Technology System (S.M.A.R.T.) > built into most modern ATA and SCSI hard disks. It is derived from the > smartsuite package, and includes support for ATA/ATAPI-5 disks. It > should run on any modern Linux system." > > So I downloaded the utilities and tested them on a laptop and on a > local machine, both with IDE drives. Nice! Works as advertised, and > gives a pile of very useful information! So, I figured I'd deploy it on > all my machines... but first, a final test. > > Screw the long story. DO NOT TRY THIS PACKAGE on Dell servers using > AACRAID controllers! I've got a guy on the way over to the colocation > facility now, to reboot the Dell 2550 machine which appears to be hung > solid. Hopefully it comes back up, and I haven't destroyed anything - > that box isn't in production yet, but it's slotted for production in the > next couple of weeks. :( > > I don't blame the author of the software at all; this is not a slam by > any stretch! The software works as advertised, on the IDE systems that > I tested - however, on the AACRAID platform, with four 72G SCSI drives > in a pair of mirrors, showing up as sda and sdb. Using 'smartctl -i > /dev/sda' (iirc) reported that sda was SMART-capable, but that the > temperature sensors were disabled... 
issuing 'smartctl -c /dev/sda' > locked the system *solid*. > > Cheers, > - Drew. > > PS: Any word on when the next OMSA, with RH8 support, will be > available? > > -- > Drew Smith (mux) <dr...@ri...> > Encrypted e-mail preferred - finger for public key. > 5801 7134 B54C 3D71 EBE1 CE24 F4DB 2528 5A46 A31B |
From: Drew S. <dr...@ea...> - 2002-11-15 16:24:25
|
On Fri, 2002-11-15 at 04:50, Bruce Allen wrote: > Hi Drew, > > Thanks very much for your note. I'll add a warning about the AACRAID > controllers in a couple of prominent places. I don't know much about how > the SCSI code will interact with SCSI raid controllers. Hi Bruce; thanks for your attention! I didn't get a chance to see the error messages for myself, but the colocation attendant reported that the screen was full of SCSI Sense Errors. You probably knew that already, but just to confirm. > By the way, do you know if the smartsuite utilities cause the same > problems on the SCSI raid box? Sorry, I only tested with smartmontools. I'll be testing on a Dell 6400 with non-RAID later today; which, IIRC, is an Adaptec aic7899 Ultra160 adapter and four 72G SCSI drives. Most of my machines do not use hardware RAID, so one crash is not a show-stopper. Cheers, - Drew. -- Drew Smith (mux) <dr...@ri...> Encrypted e-mail preferred - finger for public key. 5801 7134 B54C 3D71 EBE1 CE24 F4DB 2528 5A46 A31B |
From: Bruce A. <ba...@gr...> - 2002-11-15 16:44:52
|
Hi Drew, > > Thanks very much for your note. I'll add a warning about the AACRAID > > controllers in a couple of prominent places. I don't know much about how > > the SCSI code will interact with SCSI raid controllers. > > Hi Bruce; thanks for your attention! > > I didn't get a chance to see the error messages for myself, but the > colocation attendant reported that the screen was full of SCSI Sense > Errors. You probably knew that already, but just to confirm. > > > By the way, do you know if the smartsuite utilities cause the same > > problems on the SCSI raid box? > > Sorry, I only tested with smartmontools. I've not been able to work on the SCSI part of smartmontools (no SCSI systems to test on). So this means that the SCSI part of smartmontools is (apart from cosmetic and user-interface items) the same as smartsuite. So I'd be surprised if the same problem did not occur there. > I'll be testing on a Dell 6400 with non-RAID later today; which, > IIRC, is an Adaptec aic7899 Ultra160 adapter and four 72G SCSI drives. > Most of my machines do not use hardware RAID, so one crash is not a > show-stopper. Please let me know the results of the non-RAID testing. Note that a new developer has joined smartmontools who is going to be making some minor changes to the SCSI part of the code. You may also want to try the latest version from CVS after he has made his changes. Cheers, Bruce |
From: Bruce A. <ba...@gr...> - 2002-11-15 15:07:01
|
Hi Drew, I just wanted to let you know that I have added a file called "WARNINGS" to the smartmontools distribution, and also put a prominent link to it on the package home page. The AACRAID controller is the first entry on that list. Cheers, Bruce On 14 Nov 2002, Drew Smith wrote: > > Holas, > > In my ongoing quest to monitor my RH8 Dell machines on a hardware > level, I have been led to an interesting utility: > > http://smartmontools.sourceforge.net/ > > From the webpage: > > "The smartmontools package contains two utility programs (smartctl and > smartd) to control and monitor storage systems using the > Self-Monitoring, Analysis and Reporting Technology System (S.M.A.R.T.) > built into most modern ATA and SCSI hard disks. It is derived from the > smartsuite package, and includes support for ATA/ATAPI-5 disks. It > should run on any modern Linux system." > > So I downloaded the utilities and tested them on a laptop and on a > local machine, both with IDE drives. Nice! Works as advertised, and > gives a pile of very useful information! So, I figured I'd deploy it on > all my machines... but first, a final test. > > Screw the long story. DO NOT TRY THIS PACKAGE on Dell servers using > AACRAID controllers! I've got a guy on the way over to the colocation > facility now, to reboot the Dell 2550 machine which appears to be hung > solid. Hopefully it comes back up, and I haven't destroyed anything - > that box isn't in production yet, but it's slotted for production in the > next couple of weeks. :( > > I don't blame the author of the software at all; this is not a slam by > any stretch! The software works as advertised, on the IDE systems that > I tested - however, on the AACRAID platform, with four 72G SCSI drives > in a pair of mirrors, showing up as sda and sdb. Using 'smartctl -i > /dev/sda' (iirc) reported that sda was SMART-capable, but that the > temperature sensors were disabled... issuing 'smartctl -c /dev/sda' > locked the system *solid*. 
> > Cheers, > - Drew. > > PS: Any word on when the next OMSA, with RH8 support, will be > available? > > -- > Drew Smith (mux) <dr...@ri...> > Encrypted e-mail preferred - finger for public key. > 5801 7134 B54C 3D71 EBE1 CE24 F4DB 2528 5A46 A31B |