Bruce/Doug,
I appreciate your detailed replies. I do not think that smartmon was necessarily responsible for
the drive failures, but since the failures coincided with when I first started running and
testing the utility, I wanted to ask. I'm sure you appreciate the sweat that problems
like this can generate and the desire to nail down what is happening.
 
I'm not totally certain this second incident is a disk failure; as you state, we need to test more.
However, it is showing symptoms of some sort of disk problem, and the possibility of two
failures on high-end drives with MTBFs of 1.2 million hours is a cause for worry, since those are
the disks we have in all of our servers.
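
For a rough sense of scale, the MTBF figure can be turned into a failure probability. This is only a back-of-the-envelope sketch: it assumes a constant failure rate (exponential lifetimes) and independent drives, which real disks may not follow. The two-week window and the function name are illustrative assumptions, not measurements.

```python
import math

def failure_probability(hours, mtbf_hours):
    """P(one drive fails within `hours`), assuming a constant failure rate."""
    # 1 - exp(-t / MTBF), written with expm1 to stay accurate for tiny values.
    return -math.expm1(-hours / mtbf_hours)

MTBF_HOURS = 1.2e6       # figure quoted in the message
WINDOW_HOURS = 14 * 24   # assumed two-week window

p_one = failure_probability(WINDOW_HOURS, MTBF_HOURS)
p_two = p_one ** 2       # two specific drives, assumed independent

print(f"one drive fails in the window:   {p_one:.2e}")
print(f"both drives fail in the window:  {p_two:.2e}")
```

Under these assumptions a single failure in two weeks is already unlikely (a few in ten thousand), so two in a row would point to a shared cause such as power, cooling, cabling, or the controller rather than to chance.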
 
Don


From: Bruce Allen [mailto:ballen@gravity.phys.uwm.edu]
Sent: Tue 14/09/2004 5:06 AM
To: Don Shesnicky
Cc: Smartmontools Mailing List; Douglas Gilbert
Subject: RE: [smartmontools-support]minor/major errors (fwd)

> It's Smartmon 5.32 with the stock Red Hat 7.2 kernel.

At this point there are no known bugs in 5.32. It's based on an
experimental release from May, so the main code base has been out there
for about four months.  The kernel is ancient but if there were relevant
SCSI bugs, Doug would have mentioned them.

> If there is any doubt as to the answer I need to know it. The fact
> that I just started using smartmon and have had two drive problems in
> one or two weeks makes it a definite question.

Speaking as a scientist I would ask about the causal relation here. Is it
really the case that smartmontools is the cause of your drive problems? 
Or is it the drive problems that have piqued your interest in
smartmontools?

I would still advise you to run a long self-test.  Might this cause
problems?  Yes, if something is wrong with the disk.  Would you see the
problems anyway?  Yes, if something is wrong with the disk.
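
The self-test advice above can be sketched as two smartctl invocations: one to start the long test and one to read the self-test log once it has finished. The helper name and the device path /dev/sda are placeholders; the snippet only builds and prints the commands and does not run them.

```python
import shlex

def selftest_commands(device):
    """Return the smartctl command lines for a long self-test and its log."""
    return [
        ["smartctl", "-t", "long", device],      # start the long (extended) self-test
        ["smartctl", "-l", "selftest", device],  # read the self-test log afterwards
    ]

# /dev/sda is a placeholder; substitute the disk you want to test.
for cmd in selftest_commands("/dev/sda"):
    print(shlex.join(cmd))
```

The test runs on the drive itself in the background, so the log is only meaningful after the estimated completion time that smartctl reports when the test starts.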

[In any case, and at the risk of sounding like your mother, if this is a
mission critical server you should have a spare disk (or spare system) on
hand and a reliable and tested backup system.]

Cheers,
        Bruce

> Don,
>
> What version of smartmontools and OS are you using?  I'll let Doug answer
> your question.
>
> Cheers,
>         Bruce
>
> On Mon, 13 Sep 2004, Don Shesnicky wrote:
>
> >
> > Bruce,
> > Just to clarify the original question - is it at all possible that
> > smartmon could poke a wrong value into a drive that might cause it to
> > fail or otherwise behave strangely? Don't mind me asking, but having
> > this server go down twice while we head for a deadline has me in a
> > very twitchy state.
> >
> > Don
> >
> > -----Original Message-----
> > From: Bruce Allen [mailto:ballen@gravity.phys.uwm.edu]
> > Sent: Monday, September 13, 2004 9:11 PM
> > To: Don Shesnicky
> > Cc: Douglas Gilbert; Smartmontools Mailing List
> > Subject: RE: [smartmontools-support]minor/major errors (fwd)
> >
> > Don,
> >
> > This doesn't answer your question directly -- but have you tried running
> > a self-test on the disk (smartctl -t long)?  If the disk supports
> > self-testing, this is the first thing to try.  And note: if the disk is
> > 'on its last legs' then the added load of a self-test *could* be the
> > final straw.  (But if this is the case then it's doomed anyway -- it's
> > just a matter of time.)
> >
> > Cheers,
> >       Bruce
> >
> > On Mon, 13 Sep 2004, Don Shesnicky wrote:
> >
> > >
> > > Anyone,
> > > The reason I got interested in this utility is that I had a 36 gig
> > > SCSI Seagate drive go bad in a login server. Just before it went down
> > > it was showing one major error. Today the rebuilt server just about
> > > repeated the exact same scenario. The /var partition is trashed and I
> > > had to create a temporary one on the root partition to get us up
> > > running again.
> > > Yesterday the output of a smartctl script run started showing 1 major
> > > error... just like before.
> > >
> > > Let me ask everyone/anyone this - is there ANYTHING in the smartmon
> > > tools itself that could cause this?
> > >
> > > Don
> > >
> > > -----Original Message-----
> > > From: Douglas Gilbert [mailto:dougg@torque.net]
> > > Sent: Monday, September 13, 2004 5:52 PM
> > > To: Don Shesnicky
> > > Subject: Re: [smartmontools-support]minor/major errors (fwd)
> > >
> > > Don Shesnicky wrote:
> > > > Doug,
> > > > > Can you tell me what bit/flag/variable the tool is looking at so I
> > > > > can ask Seagate for the meaning and whether it is a problem? Here are
> > > > > the numbers for another disk which is heavily used but I have none
> > > > > that don't show something in these columns.
> > > >
> > > > Don
> > > >
> > > > Error counter log:
> > > >           Errors Corrected    Total      Total   Correction     Gigabytes    Total
> > > >               delay:       [rereads/    errors   algorithm      processed    uncorrected
> > > >             minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
> > > > read:   182805994        0         0  182805994   182805994      22277.952           0
> > > > write:          0        0         0          0           0       2733.390           0
> > >
> > > Don,
> > > It is the "Read Error Counter" log page (i.e. log page code 3) and the
> > > parameter is called "Errors corrected without substantial delay" (i.e.
> > > parameter code 0). That log page is defined in SPC (a.k.a. SCSI-3),
> > > SPC-2 and SPC-3 (i.e. it has been there since at least 1997).
> > >
> > > Doug Gilbert
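
To watch these counters over time, the rows of the error counter log can be pulled out of saved smartctl output. The following is a minimal sketch, assuming the seven-column layout shown in this thread; the column names are made up for illustration, and treating the "minor" column as the parameter Doug describes (code 0) is my reading of his reply rather than something smartctl states.

```python
import re

# Hypothetical names for the seven columns shown in the thread, left to right.
COLUMNS = (
    "minor_delay", "major_delay", "rereads_rewrites", "total_corrected",
    "algorithm_invocations", "gigabytes_processed", "total_uncorrected",
)

def parse_error_counter_log(text):
    """Return {"read": {...}, "write": {...}} from 'Error counter log' rows."""
    rows = {}
    for line in text.splitlines():
        # Drop leading mail-quote markers ("> > ") before matching.
        line = re.sub(r"^(?:>\s*)+", "", line).strip()
        match = re.match(r"^(read|write|verify):\s+(.*)$", line)
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(COLUMNS):
            continue  # unexpected layout; skip rather than guess
        rows[match.group(1)] = {
            name: float(v) if "." in v else int(v)
            for name, v in zip(COLUMNS, values)
        }
    return rows

SAMPLE = """\
read:   182805994        0         0  182805994   182805994      22277.952           0
write:          0        0         0          0           0       2733.390           0
"""

log = parse_error_counter_log(SAMPLE)
for op, row in log.items():
    # Flag only delayed ("major") corrections and uncorrected errors.
    flag = "CHECK" if row["major_delay"] or row["total_uncorrected"] else "ok"
    print(op, row["minor_delay"], row["major_delay"], row["total_uncorrected"], flag)
```

Flagging only the "major" and uncorrected columns is a design choice for this sketch: the disk quoted above shows a large "minor" count with zero uncorrected errors, so a nonzero "minor" count alone is not used as an alarm here.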
> > >
> > >
> > > _______________________________________________
> > > Smartmontools-support mailing list
> > > Smartmontools-support@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/smartmontools-support