Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Linda W. <sma...@tl...> - 2008-09-30 18:28:16

Tejun Heo wrote:
>> Both drives that are doing this are "backup" drives.  I.e. all they
>> store are daily system backups -- so they should only turn on in
>> early "AM" to receive the backups, but then should time-out.  They
>> both *did* timeout (goto sleep) regularly, when both drives were
>> PATA Seagates.  But now, I can't keep them asleep.
>>
>> It's only my SATA drives that 'should' be going to sleep now.  I
>> have the two SATA drives on a Promise SATA-300 TX4 (4 internal Sata
>> ports).  My "active" (only 2 other) drives are on different
>> controllers a SCSI and a PATA port.  They don't sleep or spin-down.
>>
>> I noticed the following when I tried to use the sleep command on the
>> system console:
>>
>> 17:23:26 Ish kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0
>> action 0x2
>> 17:23:26 Ish kernel: ata2.00: waking up from sleep
>>
>> I don't see the console messages when trying to wake it up from standby.
>> But should I be getting kernel error messages on a wakeup from drive sleep?
>> (kern=2.6.25.12, vanilla)
> 
> Any command issued to a sleeping drive triggers a wake-up action, as
> otherwise it would just time out, so that's libata telling you
> that it's waking up the drive to process whatever pending command.
> Hmmm... it seems there needs to be a way to export that the drive is
> sleeping to userland.
========

I'm using the "-n standby" option to smartctl.  Shouldn't that prevent the drive
from waking if it is in standby or asleep?  (That's what the man page claims.)

But a __related__ but *OPPOSITE* problem is the drive *NOT* waking up
in time, before the kernel times it out as an I/O device.

By stubbornly pushing it into standby (or sleep), I got it to sleep -- but
when it was supposed to perform a daily short test, smartd couldn't wake up the
drive -- and when the kernel tried to access the drives, they couldn't be brought
back online.  I got I/O errors in the kernel, the disks' file systems were closed,
and the devices were unmapped.

Only way to recover when that happens is to reboot.

I don't know of a way to reset the drives other than power-cycle.

Especially since the kernel removes the disks from "/dev/".

When I had PATA drives in place of the SATA drives, the whole process worked
seamlessly.  They'd spin down after 30 minutes and stay in standby until needed;
any access would be delayed by a few seconds until they spun back up.  But now
either they don't go into standby or they won't come back online.

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Linda W. <sma...@tl...> - 2008-10-01 10:07:45

Tejun Heo wrote:
> Linda Walsh wrote:
>>> Any command issued to a sleeping drive triggers a wake-up action, as
>>> otherwise it would just time out, so that's libata telling you
>>> that it's waking up the drive to process whatever pending command.
>>> Hmmm... it seems there needs to be a way to export that the drive is
>>> sleeping to userland.
>> ========
>> I'm using the "-n standby" option to smartctl.  Shouldn't that
>> prevent the drive from waking if it is in standby or asleep (that's
>> what the man page claims).
> 
> AFAIK, -n standby uses the CHECK POWER MODE command to check the power
> state, and unfortunately an ATA drive isn't required to process any command
> other than DEVICE RESET while it's sleeping.  That's why libata keeps
> track of sleep state and tries to wake it up when a command needs to
> be delivered to it.
---
	Wouldn't the behavior be the same with PATA?  Both sets of drives were
Seagate Barracudas (though the older PATA drives were a generation older,
size-wise).  I switched over to SATA as the PATA drives failed.  The SATA
controller is a Promise TX4-300.



>> But a __related__ but *OPPOSITE* problem -- is the drive *NOT*
>> waking up in time before being timed out as an I/O device in the
>> kernel.
> 
> A sleeping drive is not supposed to wake up when receiving a command.
> A drive in standby mode should.
---
	I wondered about that.  I tried sleep after I couldn't get standby to
stay put.  But at least twice since I decided to temporarily ignore the
problem, the system would stay up for 5-10 days, then one of the drives would
hang with symptoms similar to when I manually told it to sleep -- 'cept that
the drives had only been programmed to go into standby after about 30 minutes.
Both times it failed, it had stayed up for days -- about 5 or 7 the 1st time,
then 7-10 the second -- i.e. things mostly worked "normally" except that the
drives didn't spin down when not in use (normally they are only in use in the
wee hours, during nightly network backups).


>> By stubbornly pushing it into standby (or sleep), I got it to sleep
>> -- but when it was suppose to perform a daily short-test, smartd
>> couldn't wake up the drive -- and when the kernel tried to access
>> the drives, they couldn't be brought back online -- got I/O errors
>> in the kernel and the disks file systems were closed and the devices
>> were unmapped.
> 
> This means the state machine in either the drive or machine went
> astray and couldn't respond to commands anymore.  Does unplugging
> power from the hard drive and replugging it revive the drive?  And
> which controller do you have (lspci -nn)?
---
	Unplugging power...not so easy, they are internal drives.
> 
>> Only way to recover when that happens is to reboot.
>> I don't know of a way to reset the drives other than power-cycle.
>> Especially since the kernel removes the disks from "/dev/".
>>
>> When I had PATA drives in place of the SATA drives, the whole
>> process worked seamlessly.  They'd spin down after 30 minutes, stay
>> in standby until needed -- any access would be delayed by a few
>> seconds until they spun back up -- but now it's either they don't go
>> into standby OR they won't come online.
> 
> First of all, we need to find out why '-n standby' doesn't work when
> the drive actually is in standby mode.
---
	Wellllll...it does...just not on the 2nd or 3rd time.  While trying to set up
the drives, I had a script that looped through and displayed the temps of all drives
every "X" seconds (every 10-60 seconds when I was testing) on the console.

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Tejun H. <ht...@gm...> - 2008-09-30 05:30:21

Hello,

Sorry about the late response.  I've been traveling for more than a month.

Bruce Allen wrote:
> Hi Tejun,
> 
> FYI.  Feel free to ignore this, or you can respond directly to user/list
> if desired.

cc'ing smartmontools-support and the original reporter.

> ---------- Forwarded message ----------
> Date: Sat, 13 Sep 2008 17:42:29 -0700
> From: Linda Walsh <sma...@tl...>
> To: sma...@li...
> Subject: [smartmontools-support] inactive SATA drives won't stay in
> standby or
>     sleep, PATA models did.
> 
> I'm having problems with disks staying "asleep" or "suspended" (spun-down).
> 
> I've been trying to monitor the temperatures on the disks to help
> note cooling problems.  I use the "smartctl -n standby -A <device>"
> command to spew out the attributes and look for "Current Drive
> Temperature" or Attribute#194 (or I look for and print STANDBY|SLEEP
> if that is found).
>
> I can force the drive to standby or sleep using the -y or -Y command
> (those work).  I've also made it go to sleep by setting the drive timeout
> to 5 seconds (-S 1).  But if I run my "drive_temp" command a few
> times the drive will go from 'STANDBY' back to running in "fairly
> short" order: usually about 30 seconds.
> 
> When I'm running the monitoring script to poll every 10 seconds, I
> can see when the script is going to return the temp -- because if it
> is in STANDBY, I get maybe 2 reads, then on the 3rd, it pauses when
> I issue the smartctl command and waits for it to spin-up and then
> gives me the temperature.

Bruce, is there a smartctl option to tell us what's going on?

> It's on two different drives that I have observed this -- both
> Seagate's, one a 750G, the other a 1000G.
> 
> I tried running the 'short' tests on each (as someone else had a
> similar problem that seemed to be fixed after running the short
> drive health tests).  The drives do claim to be in "standby", but
> keep spinning back-up.

Hmm... sounds like coincidence to me.

> Both drives that are doing this are "backup" drives.  I.e. all they
> store are daily system backups -- so they should only turn on in
> early "AM" to receive the backups, but then should time-out.  They
> both *did* timeout (goto sleep) regularly, when both drives were
> PATA Seagates.  But now, I can't keep them asleep.
> 
> It's only my SATA drives that 'should' be going to sleep now.  I
> have the two SATA drives on a Promise SATA-300 TX4 (4 internal Sata
> ports).  My "active" (only 2 other) drives are on different
> controllers a SCSI and a PATA port.  They don't sleep or spin-down.
> 
> I noticed the following when I tried to use the sleep command on the
> system console:
> 
> 17:23:26 Ish kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0
> action 0x2
> 17:23:26 Ish kernel: ata2.00: waking up from sleep
> 
> I don't see the console messages when trying to wake it up from standby.
> But should I be getting kernel error messages on a wakeup from drive sleep?
> (kern=2.6.25.12, vanilla)

Any command issued to a sleeping drive triggers a wake-up action, as
otherwise it would just time out, so that's libata telling you
that it's waking up the drive to process whatever pending command.
Hmmm... it seems there needs to be a way to export that the drive is
sleeping to userland.

Thanks.

-- 
tejun

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Tejun H. <ht...@gm...> - 2008-10-04 03:09:51

Hello,

Linda Walsh wrote:
> Tejun Heo wrote:
>> Linda Walsh wrote:
>>>> Any command issued to a sleeping drive triggers a wake-up action, as
>>>> otherwise it would just time out, so that's libata telling you
>>>> that it's waking up the drive to process whatever pending command.
>>>> Hmmm... it seems there needs to be a way to export that the drive is
>>>> sleeping to userland.
>>> ========
>>> I'm using the "-n standby" option to smartctl.  Shouldn't that
>>> prevent the drive from waking if it is in standby or asleep (that's
>>> what the man page claims).
>>
>> AFAIK, -n standby uses the CHECK POWER MODE command to check the power
>> state, and unfortunately an ATA drive isn't required to process any command
>> other than DEVICE RESET while it's sleeping.  That's why libata keeps
>> track of sleep state and tries to wake it up when a command needs to
>> be delivered to it.
> ---
>     Wouldn't the behavior be the same with PATA?  Both sets of drives were
> Seagate Barracuda (though the older PATA drives were a generation
> (size-wise)
> older)  I switched over to SATA as the PATA failed.  Sata controller is a
> Promise TX4-300

They use a completely different command transport and new firmware.  It
would be strange if their behaviors didn't differ in corner cases.  :-P

>>> But a __related__ but *OPPOSITE* problem -- is the drive *NOT*
>>> waking up in time before being timed out as an I/O device in the
>>> kernel.
>>
>> A sleeping drive is not supposed to wake up when receiving a command.
>> A drive in standby mode should.
> ---
> I wondered about that.  I tried sleep after I couldn't get standby
> to stay put.  But at least twice since I just decided to
> temporarily ignore the problem, the system would stay up for 5-10
> days, then one of the drives would hang with similar symptoms as
> when I manually told it to sleep -- 'cept that the drives had only
> been programmed to go into standby after about 30 minutes.  But both
> times it failed, it stayed up for days -- about 5 or 7 the 1st time,
> then 7-10 the second -- i.e. things mostly worked "normally" except
> that the drives didn't spin down when not in use (normally only in
> wee-hours during nightly network backups).

I'm having a bit of a problem understanding what actually happened.  Can
you explain it in a simpler way?

>> First of all, we need to find out why '-n standby' doesn't work when
>> the drive actually is in standby mode.
> ---
> Wellllll...it does...just not on the 2nd or 3rd time.  While trying
> to setup the drives, I had a script that looped through and
> displayed the Temps of all drives every "X" seconds (every 10-60
> seconds when I was testing) on the console.

Bruce, is there any way to debug this?  (are you on vacation?)

Thanks.

-- 
tejun

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Bruce A. <ba...@gr...> - 2008-10-27 12:21:06

>> Wellllll...it does...just not on the 2nd or 3rd time.  While trying
>> to setup the drives, I had a script that looped through and
>> displayed the Temps of all drives every "X" seconds (every 10-60
>> seconds when I was testing) on the console.
>
> Bruce, is there any way to debug this?  (are you on vacation?)

Yes, I was on vacation, two wonderful weeks with no network connection!

Try adding '-r ioctl,3' to the command line that starts smartd to get
some debugging info.

Cheers,
     Bruce

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Linda W. <sma...@tl...> - 2008-10-07 02:16:05

Controller is a Promise TX4/300
Is this what you were looking for?:

Oct  6 16:59:14 ish kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
Oct  6 16:59:14 ish kernel: ata2.00: cmd 
b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Oct  6 16:59:14 ish kernel:          res 
40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  6 16:59:14 ish kernel: ata2.00: status: { DRDY }
Oct  6 16:59:20 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 16:59:24 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 16:59:30 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 16:59:34 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 16:59:40 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:00:09 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:00:09 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct  6 17:00:14 ish dhcpd: Forward map from ns1.sc.tlinx.org to 
192.168.3.242 FAILED: Has an A record but no DHCID, not mine.
Oct  6 17:00:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:00:15 ish kernel: ata2: reset failed, giving up
Oct  6 17:00:15 ish kernel: ata2.00: disabled
Oct  6 17:00:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 
action 0xe frozen t4
Oct  6 17:00:15 ish kernel: ata2: hotplug_status 0x22
Oct  6 17:00:20 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:00:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:00:30 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:00:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:00:40 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:01:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:01:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct  6 17:01:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:01:15 ish kernel: ata2: reset failed, giving up
Oct  6 17:01:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 
action 0xe frozen t3
Oct  6 17:01:15 ish kernel: ata2: hotplug_status 0x22
Oct  6 17:01:20 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:01:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:01:30 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:01:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:01:40 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:02:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:02:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct  6 17:02:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:02:15 ish kernel: ata2: reset failed, giving up
Oct  6 17:02:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 
action 0xe frozen t2
Oct  6 17:02:15 ish kernel: ata2: hotplug_status 0x22
Oct  6 17:02:21 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:02:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:02:31 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:02:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:02:41 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:03:01 ish sshd[4020]: error: channel 0: chan_read_failed for 
istate 3
Oct  6 17:03:10 ish syslog-ng[13177]: last message repeated 2 times
Oct  6 17:03:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:03:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct  6 17:03:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:03:15 ish kernel: ata2: reset failed, giving up
Oct  6 17:03:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 
action 0xe frozen t1
Oct  6 17:03:15 ish kernel: ata2: hotplug_status 0x22
Oct  6 17:03:21 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:03:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:03:31 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:03:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:03:41 ish kernel: ata2: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:04:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:04:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct  6 17:04:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct  6 17:04:15 ish kernel: ata2: reset failed, giving up
Oct  6 17:04:15 ish kernel: ata2: EH pending after 5 tries, giving up
Oct  6 17:04:15 ish kernel: sd 2:0:0:0: rejecting I/O to offline device
Oct  6 17:04:15 ish kernel: program smartctl is using a deprecated SCSI 
ioctl, please convert it to SG_IO
Oct  6 17:04:15 ish kernel: sd 2:0:0:0: [sdc] START_STOP FAILED
Oct  6 17:04:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:04:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:05:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:05:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:06:31 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:07:30 ish syslog-ng[13177]: last message repeated 2 times
Oct  6 17:07:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:07:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:08:32 ish kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 
0x0 action 0x6 frozen
Oct  6 17:08:32 ish kernel: ata1.00: cmd 
b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Oct  6 17:08:32 ish kernel:          res 
40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct  6 17:08:32 ish kernel: ata1.00: status: { DRDY }
Oct  6 17:08:38 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:08:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:08:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:08:48 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:08:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:08:58 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:09:21 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:09:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:09:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct  6 17:09:32 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:09:32 ish kernel: ata1: reset failed, giving up
Oct  6 17:09:32 ish kernel: ata1.00: disabled
Oct  6 17:09:32 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 
action 0xe frozen t4
Oct  6 17:09:32 ish kernel: ata1: hotplug_status 0x88
Oct  6 17:09:38 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:09:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:09:48 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:09:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:09:57 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:09:58 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:10:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:10:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct  6 17:10:32 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:10:32 ish kernel: ata1: reset failed, giving up
Oct  6 17:10:32 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 
action 0xe frozen t3
Oct  6 17:10:32 ish kernel: ata1: hotplug_status 0x88
Oct  6 17:10:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:10:38 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:10:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:10:48 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:10:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:10:58 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:11:09 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:11:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:11:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct  6 17:11:33 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:11:33 ish kernel: ata1: reset failed, giving up
Oct  6 17:11:33 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 
action 0xe frozen t2
Oct  6 17:11:33 ish kernel: ata1: hotplug_status 0x88
Oct  6 17:11:38 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:11:43 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:11:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Oct  6 17:11:48 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:11:53 ish kernel: ata1: COMRESET failed (errno=-16)
Oct  6 17:11:58 ish kernel: ata1: link is slow to respond, please be 
patient (ready=-19)
Oct  6 17:12:17 ish xinetd[2021]: Exiting...
Oct  6 17:12:17 ish kernel: nfsd: last server has exited
Oct  6 17:12:17 ish kernel: nfsd: unexporting all filesystems
Oct  6 17:12:17 ish apcupsd[1989]: apcupsd exiting, signal 15
Oct  6 17:12:17 ish apcupsd[1989]: apcupsd shutdown succeeded
Oct  6 17:12:17 ish rpc.statd[2074]: Caught signal 15, un-registering 
and exiting.
Oct  6 17:12:17 ish mountd[2075]: Caught signal 15, un-registering and 
exiting.
Oct  6 17:12:21 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 
returned.
Tejun Heo wrote:
> Linda Walsh wrote:
>   
>> So the real problem is why issuing a smart command isn't re-starting
>> the drive -- or bringing it back from standby.  Whereas a "normal" disk
>> read seems to bring it back to normal functioning just fine (and can
>> then do the smart-test).
>>
>> Does this give anyone ideas about where the problem might be?  Also
>> sorta explains why my hangs have been infrequent, because I've been
>> periodically polling the temps of all the drives -- and only when I stop
>> the polling would the drive timeout, then die the next morning when
>> smartd tried to run a short test between 1 and 2 am.
>>     
>
> Sounds like a firmware problem to me.  Issuing ATA_CMD_VERIFY on block
> 0 before issuing test commands should work around the problem.  Also,
> which controller are you using?  Can you post the failing kernel log?
>
>
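For anyone reading the kernel log earlier in this message: the "cmd" field that libata prints is the ATA taskfile, laid out as command/feature:nsect:lbal:lbam:lbah/... in hex. In "b0/da:00:00:4f:c2", b0 is the SMART opcode and feature da is SMART RETURN STATUS, with the usual 4f/c2 signature in the LBA mid/high bytes. If I read that right, the command that timed out was a status query rather than the self-test start itself, but treat that as an observation, not a diagnosis. A minimal Python sketch of a decoder follows; the function name and the small lookup tables are made up for illustration and cover only a handful of opcodes.

```python
import re

# Deliberately incomplete lookup tables (opcode values from the ATA command set).
ATA_COMMANDS = {
    0x40: "READ VERIFY SECTOR(S)",
    0xB0: "SMART",
    0xE0: "STANDBY IMMEDIATE",
    0xE5: "CHECK POWER MODE",
    0xE6: "SLEEP",
    0xEC: "IDENTIFY DEVICE",
}
SMART_FEATURES = {
    0xD0: "READ DATA",
    0xD4: "EXECUTE OFF-LINE IMMEDIATE",
    0xD5: "READ LOG",
    0xD8: "ENABLE OPERATIONS",
    0xD9: "DISABLE OPERATIONS",
    0xDA: "RETURN STATUS",
}

# First six bytes of the taskfile: command/feature:nsect:lbal:lbam:lbah/
TASKFILE_RE = re.compile(
    r"([0-9a-f]{2})/([0-9a-f]{2}):([0-9a-f]{2}):([0-9a-f]{2}):"
    r"([0-9a-f]{2}):([0-9a-f]{2})/"
)


def decode_taskfile(text):
    """Decode the taskfile from a libata 'cmd' log line (pass it unwrapped)."""
    match = TASKFILE_RE.search(text.lower())
    if match is None:
        raise ValueError("no taskfile found in: %r" % text)
    command, feature, nsect, lbal, lbam, lbah = (
        int(byte, 16) for byte in match.groups()
    )
    result = {"command": ATA_COMMANDS.get(command, "unknown (0x%02x)" % command)}
    if command == 0xB0:
        # For SMART commands the feature register selects the subcommand,
        # and LBA mid/high must carry the 0x4f/0xc2 signature.
        result["feature"] = SMART_FEATURES.get(feature, "unknown (0x%02x)" % feature)
        result["smart_signature_ok"] = (lbam, lbah) == (0x4F, 0xC2)
    return result


if __name__ == "__main__":
    print(decode_taskfile("ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0"))
```

Note that the log lines in the mail are wrapped, so the "cmd" line has to be joined with its continuation before it is passed in.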

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Linda W. <lk...@tl...> - 2008-10-07 00:39:01

Ok, this is my "latest" theory about why my SATA disks have been acting
strange.

Normally I have the drives set to go into standby after 30 minutes of
inactivity. This "can" work -- unless (and this may be obvious to some
people, but it's not entirely intuitive) ...unless you query the drive's
temperature with smartctl periodically.
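
A side note on the standby timer itself, since the encoding is easy to get wrong. Assuming the timer is set with "hdparm -S" (the thread never names the tool, though the -y, -Y and "-S 1" flags mentioned earlier match it), the value is not a plain number of seconds. Here is a small Python sketch of the mapping as the hdparm man page describes it; the function name is invented for illustration.

```python
def standby_value_to_seconds(value):
    """Translate an `hdparm -S` value into a timeout in seconds.

    Returns None where there is no fixed timeout: 0 disables the timer,
    253 is vendor-defined and 254 is reserved.
    """
    if not 0 <= value <= 255:
        raise ValueError("value must be in the range 0..255")
    if value == 0:
        return None                       # timer disabled
    if value <= 240:
        return value * 5                  # 1..240: multiples of 5 seconds
    if value <= 251:
        return (value - 240) * 30 * 60    # 241..251: multiples of 30 minutes
    if value == 252:
        return 21 * 60                    # 21 minutes
    if value == 255:
        return 21 * 60 + 15               # 21 minutes 15 seconds
    return None                           # 253 vendor-defined, 254 reserved


if __name__ == "__main__":
    for v in (1, 240, 241):
        print(v, standby_value_to_seconds(v))
```

Under that assumption, "-S 1" is the 5-second timeout and a 30-minute timeout would be "-S 241".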

So... _using_ "-n standby" with smartctl only has an effect if the drive is
already in standby.  If it is *not* in standby, the query counts as drive
activity and resets the "go to sleep" timer.  This isn't the worst problem --
more of an annoyance.  I didn't try to keep track of all the drives'
temperatures until I started having the 2nd problem, which is decidedly
"nastier"...

Second problem -- if a drive is in standby and smartctl or smartd tries to
run the short or long self-tests, the kernel starts issuing time-out errors,
and the drive is eventually, _logically_, removed from the system.  It never
comes back from standby.

If I *access* the drive (do an 'ls' of a directory on the drive that
isn't in the cache buffers), then after a ~20 second pause, the drive
has spun up and all is good.  But, for some reason, the "smart" test
functionality isn't causing the drive to wake up.  Instead the kernel
views the drive as OTL (OutToLunch) and removes it from the device
table.  This is, IMO, the more serious problem and is a regression
compared to PATA disk functionality.

The bit of periodically checking temps resetting the activity timer --
that isn't something I normally was trying to do -- I only started that
to try to debug why the drives were going offline (didn't know if temps
were related, among other reasons).  But in the process of checking the
temps, I was also (I am guessing about the functionality based on
observation) resetting the inactivity timer.
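
For reference, the kind of polling described here can be sketched in Python as below. This is a rough sketch, not the actual drive_temp script (which was never posted): the function names are invented, it assumes smartctl is on the PATH, and it assumes the usual ATA attribute table layout in which the raw value is the tenth whitespace-separated column and starts with a plain integer.

```python
import re
import subprocess


def parse_smartctl_output(text):
    """Interpret the output of `smartctl -n standby -A <device>`.

    Returns "STANDBY" or "SLEEP" if smartctl reported a low-power mode,
    the raw value of attribute 194 as an int if the table was printed,
    or None if neither is found.
    """
    match = re.search(r"\b(STANDBY|SLEEP)\b", text)
    if match:
        return match.group(1)
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0] == "194":
            return int(fields[9])
    return None


def poll_temperature(device):
    """Run smartctl once; -n standby asks it not to spin up a sleeping drive."""
    proc = subprocess.run(
        ["smartctl", "-n", "standby", "-A", device],
        capture_output=True,
        text=True,
    )
    return parse_smartctl_output(proc.stdout)


if __name__ == "__main__":
    sample = ("194 Temperature_Celsius     0x0022   033   047   000"
              "    Old_age   Always       -       33 (0 14 0 0)")
    print(parse_smartctl_output(sample))
```

If the guess in this message is right, every call to poll_temperature() on a spinning drive would also count as activity, so a short polling interval could keep the drive from ever reaching its standby timeout.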

So the real problem is why issuing a smart command isn't re-starting
the drive -- or bringing it back from standby.  Whereas a "normal" disk
read seems to bring it back to normal functioning just fine (and can
then do the smart-test).

Does this give anyone ideas about where the problem might be?  Also
sorta explains why my hangs have been infrequent, because I've been
periodically polling the temps of all the drives -- and only when I stop
the polling would the drive timeout, then die the next morning when
smartd tried to run a short test between 1 and 2 am.

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Tejun H. <ht...@gm...> - 2008-10-07 01:10:37

Linda Walsh wrote:
> So the real problem is why issuing a smart command isn't re-starting
> the drive -- or bringing it back from standby.  Whereas a "normal" disk
> read seems to bring it back to normal functioning just fine (and can
> then do the smart-test).
> 
> Does this give anyone ideas about where the problem might be?  Also
> sorta explains why my hangs have been infrequent, because I've been
> periodically polling the temps of all the drives -- and only when I stop
> the polling would the drive timeout, then die the next morning when
> smartd tried to run a short test between 1 and 2 am.

Sounds like a firmware problem to me.  Issuing ATA_CMD_VERIFY on block
0 before issuing test commands should work around the problem.  Also,
which controller are you using?  Can you post the failing kernel log?

-- 
tejun
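
A user-space approximation of this workaround might look like the Python sketch below. To be clear about what it is not: it does not issue ATA_CMD_VERIFY. It substitutes an ordinary uncached read of the first block, on the strength of the earlier observation that a normal disk read brings the drive back. The function names, the 4096-byte read size and the use of O_DIRECT (Linux-only) are all assumptions of this sketch, and it is untested against the drives in this thread.

```python
import mmap
import os
import subprocess


def read_first_block(path, nbytes=4096, direct=True):
    """Read nbytes from the start of path and return the byte count.

    With direct=True the read uses O_DIRECT so it cannot be satisfied
    from the page cache and has to reach the device.
    """
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # Linux-only; needs an aligned buffer and size
    fd = os.open(path, flags)
    try:
        buf = mmap.mmap(-1, nbytes)  # anonymous mappings are page-aligned
        try:
            return os.readv(fd, [buf])
        finally:
            buf.close()
    finally:
        os.close(fd)


def wake_then_short_test(device):
    """Force a real read first, then ask smartctl to start a short self-test."""
    read_first_block(device)
    return subprocess.run(["smartctl", "-t", "short", device]).returncode
```

Something like wake_then_short_test("/dev/sdc") would need root. Whether the read reliably wakes the drive before the SMART command arrives is exactly the open question in this thread.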

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Tejun H. <ht...@gm...> - 2008-10-07 02:14:38

Linda Walsh wrote:
> Controller is a Promise TX4/300

Yeap.  After the drive goes offline, does unplugging and replugging
the power cable to the hard drive make it come back?

-- 
tejun

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Linda W. <sma...@tl...> - 2008-10-07 10:16:18

Tejun Heo wrote:
> Linda Walsh wrote:
>   
>> Controller is a Promise TX4/300
>>     
>
> Yeap.  After the drive goes offline, does unplugging and replugging
> the power cable to the hard drive make it come back?
>
>   
That's not easy to do.  It's an internal drive ...  will have to find some
time to take the system down and apart for that type of testing..

If I powercycle the whole machine it comes back up ...but that's probably
not what you mean...:-/

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Linda W. <sma...@tl...> - 2008-10-07 22:28:25

Tejun Heo wrote:
> Linda Walsh wrote:
>   
>> Controller is a Promise TX4/300
>> Yeap.  After the drive goes offline, does unplugging and replugging
>> the power cable to the harddrive makes it come back?
>>     
----
    No.  It hangs the computer about 2-3 seconds after plugging the
drives back in.  Did it twice to verify it wasn't a fluke.  Verified the
drives were removed from /dev, then plugged them back in -- was able to do
about 1-2 ls commands on /dev, then the keyboard goes dead.

    First time I tried unplugging the power cables and replugging --
that hung...  2nd time tried unplugging a SATA cable and replugging --
that hung too.

    Hopefully you won't need any more tests of this exact nature...? :-)

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Tejun H. <ht...@gm...> - 2008-10-08 00:01:49

Linda Walsh wrote:
> Tejun Heo wrote:
>> Linda Walsh wrote:
>>  
>>> Controller is a Promise TX4/300
>>> Yeap.  After the drive goes offline, does unplugging and replugging
>>> the power cable to the harddrive makes it come back?
>>>     
> ----
>    No.  It hangs the computer. about 2-3 seconds after plugging the
> drives back in.  Did it twice to verify it wasn't a fluke.  Verified
> drives removed from /dev, then
> plugged them back in -- was able to do about 1-2 ls commands on /dev, then
> keyboard goes dead.
> 
>    First time I tried unplugging the power cables and replugging --
> that hung...
> 2nd time tried unplugging a sata cable and replugging -- that hung too.

Ah.. okay, so the controller went bonkers then.  Any chance you can
shell out ~15 bucks and try a sil SATA controller?

>    Hopefully you won't need any more tests of this exact nature...? :-)

Wasn't it fun and empowering?  :-P

-- 
tejun

Re: [smartmontools-support] SATA drives in standby can reliably hang kernel on wakeup

From: Linda W. <sma...@tl...> - 2008-10-15 22:38:23

Grrr...grumble...

I went ahead and tried to order a Sil SATA card, but not knowing the
Sil landscape, I ended up with an eSATA-II card (it just came today) instead
of the SATA-II that the card is called -- I thought SATA meant internal
connectors and eSATA meant external connectors, but there must be some
overlap.  I ended up with an ADD 4 port SATA II Raid controller that could
handle single as well as multiple drives for about $98 (including shipping).
Of course I intelligently[sic] ordered from a 2nd tier vendor that says it
doesn't accept returns of non-defective merchandise (vendor CWOL.COM), so I
have to start over again to try ordering a 66MHz PCI-compatible card (it's
going on an Intel 440-BX MB).  I figure/am hoping that if I can stabilize my
server, I can still use this card, since I hope to start using external hard
disks for further expansion in the future.  But for now, I'd like to get an
internal card -- sure didn't find any cards that were close to ~15 bucks,
but maybe that's because I opted for SATA-II and the optional RAID features
I thought I might use.

The chip is an
Silicon Image
Steelvine
Sil3124ACBHU
QS105.1-9
0808
ADO3AX2

Card is labeled SATA2-PCIX01 and said it was (and appears to be)
PCI compatible.

But dang-it! it's eSATA...sooooo....

Do you have any particular card+vendor that has the NCQ,
(what is PMP?), SATA-II and maybe the RAID support (not really needed, but
always thinking about futures...so if RAID adds more than $30 to price,
it's not worth it) -- that might be "known"-stone-cold reliable?

Slight tangent:
My SCSI disk supports DPO and FUA, but my SATA drives log messages
that they don't support DPO or FUA -- is that a property of the disks, or
would a different controller affect that as well?  What ARE DPO/FUA?  They
seem to have something to do with the write cache -- which I usually turn
on, given I have the system on a UPS.  But the "reliable uptime" on the
machine has fallen -- measured in ~5-7ish days now between hangs, whereas
before (with the Promise PATA controller and disks) it was ~infinite
(only planned downtimes).  Also, with PATA, I was able to use full ACPI
support, but now I need to set ACPI=noirq (or off) to get more than
a day of stable uptime.

Anyway -- I am trying to move to a different controller to see not only
if the SMART-kernel-timeout probs are controller related, but also if the
hangs go away.

Suggestions?  Sources?
Thanks, and sorry for the delay in getting HW, but such is internet & mail
delivery of products...

Thanks!
-linda

Tejun Heo wrote:
> Linda Walsh wrote:
>> A Sil Sata controller?
>>
>> silicon...? something?  any particular model?
>>
>> Will any of them give me working NCQ or such?
>
> If you want NCQ and PMP support, get something w/ Silicon Image 3124
> (pci) or 3132 (pci-e).  Otherwise, you can get one of sil3112/3512/3114.
>  They all are pretty cheap these days.
>
>>> Ah.. okay, so the controller went bonkers then. Any chance you can
>>> shell out ~15 bucks and try a sil SATA controller?
>>>
>>>>    Hopefully you won't need any more tests of this exact nature...? :-)
>>>>
>>> Wasn't it fun and empowering?  :-P
>>>
>> ---
>>     Very thrilling... though not quite as pretty as Win's blue-screen
>> blue....  ;^/
>>
>
> :-)
>

Re: [smartmontools-support] SATA drives in standby can reliably hang kernel on wakup

From: Tejun H. <ht...@gm...> - 2008-10-16 02:41:42

Linda Walsh wrote:
> Do you have any particular card+vendor that has the NCQ,
> (what is PMP?), SATA-II and maybe the RAID support (not really needed, but
> always thinking about futures...so if RAID adds more than $30 to price,
> its not worth it) -- that might be "known"-stone-cold reliable?

I don't know.  I have a couple of them but they are all from local
manufacturers (South Korea) so I don't think they'll be available over
there.  As long as the chip is 3124, it should be okay.

-- 
tejun

Re: [smartmontools-support] Promise SATA-standby +selftest=hungdrive; Sil works...

From: Linda W. <sma...@tl...> - 2008-10-22 03:41:29

This is with 2.6.26.5  (there are multiple other problems with 2.6.27[.0]).

The problem with the drive going "offline" doesn't happen with a
sil_sata(3124) controller -- so no need to unplug and replug...

I.e. when the drives are in standby, if smartd or a smartctl command
attempts to run a drive self-test (short), I get timeout errors from the
Promise controller (which hangs the sys if I try unplugging/replugging
the cable to the hung drive). 

The drives correctly spin up to speed and perform the short-test with
the sil controller.
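
For anyone wanting to check how long the kernel waits before it declares a command to a spun-down drive timed out, here is a minimal sketch that reads the per-device SCSI command timeout from sysfs. It assumes a Linux system that exposes `/sys/block/<disk>/device/timeout`; the function name and the `sysfs_root` parameter are illustrative only. It merely inspects the value and does nothing about the controller hang described above.

```python
from pathlib import Path


def command_timeout_seconds(disk, sysfs_root="/sys"):
    """Return the SCSI command timeout for `disk` (e.g. "sdb") in seconds.

    Returns None if the attribute is missing or unreadable, for example
    after the kernel has already dropped the device.
    `sysfs_root` is a hypothetical hook so the lookup can be pointed
    at a different directory.
    """
    path = Path(sysfs_root) / "block" / disk / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # "sdb" is only an example device name.
    print(command_timeout_seconds("sdb"))
```

A drive that needs longer than this value to spin up can be failed by the kernel even when nothing is wrong with the controller, so it is worth ruling out before blaming the hardware.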

It would seem there is a problem with the Promise controller or driver?

Tejun Heo wrote:
> Linda Walsh wrote:
>> Tejun Heo wrote:
>>> Linda Walsh wrote:
>>>> Controller is a Promise TX4/300
>>>> Yeap.  After the drive goes offline, does unplugging and replugging
>>>> the power cable to the harddrive make it come back?
>>>>
>> ----
>>    No.  It hangs the computer about 2-3 seconds after plugging the
>> drives back in.  Did it twice to verify it wasn't a fluke.  Verified
>> the drives were removed from /dev, then plugged them back in -- was
>> able to do about 1-2 ls commands on /dev, then the keyboard goes dead.
>>
>>    First time I tried unplugging the power cables and replugging --
>> that hung...  2nd time I tried unplugging a SATA cable and replugging --
>> that hung too.
>>
>
> Ah.. okay, so the controller went bonkers then.  Any chance you can
> shell out ~15 bucks and try a sil SATA controller?
>
>>    Hopefully you won't need any more tests of this exact nature...? :-)
>>
>
> Wasn't it fun and empowering?  :-P
>

Re: [smartmontools-support] Promise SATA-standby +selftest=hungdrive; Sil works...

From: Tejun H. <ht...@gm...> - 2008-10-22 04:13:46

Linda Walsh wrote:
> This is with 2.6.26.5  (there are multiple other problems with 2.6.27[.0]).
> 
> The problem with the drive going "offline" doesn't happen with a
> sil_sata(3124) controller -- so no need to unplug and replug...
> 
> 
> I.e. when the drives are in standby, if smartd or a smartctl command
> attempts to run a drive self-test (short), I get timeout errors from the
> Promise controller (which hangs the sys if I try unplugging/replugging
> the cable to the hung drive).
> The drives correctly spin up to speed and perform the short-test with
> the sil controller.
> 
> It would seem there is a problem with the Promise controller or driver?

Yeah, Mikael found out that hardreset requires controller reset before
it.  Hopefully, it will be fixed soon.

Thanks.

-- 
tejun

Re: [smartmontools-support] inactive SATA drives won't stay in standby or sleep, PATA models did. (fwd)

From: Tejun H. <ht...@gm...> - 2008-09-30 18:23:55

Linda Walsh wrote:
>> Any command issued to a sleeping drive triggers a wake-up action, as
>> otherwise it would just time out, so that's libata telling you
>> that it's waking up the drive to process whatever command is pending.
>> Hmmm... it seems there needs to be a way to export that the drive is
>> sleeping to userland.
> ========
> 
> I'm using the "-n standby" option to smartctl.  Shouldn't that
> prevent the drive from waking if it is in standby or asleep (that's
> what the man page claims).

AFAIK, -n standby uses the CHECK POWER MODE command to check the power
state, and unfortunately an ATA drive isn't required to process any
command other than DEVICE RESET while it's sleeping.  That's why libata
keeps track of the sleep state and tries to wake the drive up when a
command needs to be delivered to it.
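
To make the above concrete, here is a rough sketch of how a script might call smartctl with `-n standby` and tell a skipped check from a real result. It relies on two things I am fairly sure of: CHECK POWER MODE reports 0x00 for standby, 0x80 for idle and 0xFF for active/idle in the sector count register, and smartctl sets bit 1 (value 2) of its exit status when the device is in low-power mode. The function names and the injectable `run` parameter are my own and purely illustrative.

```python
import subprocess

# ATA CHECK POWER MODE (command E5h) returns the power state in the
# sector count register. Only the three classic values are listed here.
POWER_MODES = {0x00: "standby", 0x80: "idle", 0xFF: "active or idle"}


def decode_power_mode(sector_count):
    """Translate a CHECK POWER MODE result byte into a readable name."""
    return POWER_MODES.get(sector_count, "unknown (0x%02x)" % sector_count)


def smartctl_health_command(device, powermode="standby"):
    """Build a smartctl call that skips the check in low-power mode."""
    return ["smartctl", "-n", powermode, "-H", device]


def possibly_skipped(exit_status):
    """True if bit 1 of smartctl's exit status is set.

    Bit 1 covers "device is in low-power mode" but also "device open
    failed", so this is a hint, not proof that the drive was spun down.
    """
    return bool(exit_status & 0x02)


def check_health(device, run=subprocess.run):
    """Run the health check; `run` is a hypothetical hook for testing."""
    result = run(smartctl_health_command(device),
                 capture_output=True, text=True)
    return {"skipped": possibly_skipped(result.returncode),
            "exit_status": result.returncode}
```

Note that this only helps for standby.  A drive in SLEEP may not answer CHECK POWER MODE at all, which is the case described above where libata has to wake it first.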

> But a __related__ but *OPPOSITE* problem -- is the drive *NOT*
> waking up in time before being timed out as an I/O device in the
> kernel.

A sleeping drive is not supposed to wake up when receiving a command.
A drive in standby mode should.

> By stubbornly pushing it into standby (or sleep), I got it to sleep
> -- but when it was supposed to perform a daily short-test, smartd
> couldn't wake up the drive -- and when the kernel tried to access
> the drives, they couldn't be brought back online -- got I/O errors
> in the kernel and the disks' file systems were closed and the devices
> were unmapped.

This means the state machine in either the drive or machine went
astray and couldn't respond to commands anymore.  Does unplugging
power from the harddrive and replugging it revive the drive?  And
which controller do you have (lspci -nn)?
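
In case it helps when collecting that information, here is a small sketch that picks the storage controllers out of `lspci -nn` output. The two sample lines are made up to show the format and are not taken from the machine in this thread; the regular expression and function name are likewise only an illustration. It treats any PCI class code starting with 01 as mass storage.

```python
import re

# One line of `lspci -nn`: slot, class name [class id]: device name [vendor:device]
LSPCI_NN = re.compile(
    r"^(?P<slot>\S+)\s+(?P<cls>.+?) \[(?P<class_id>[0-9a-f]{4})\]: "
    r"(?P<name>.+) \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def storage_controllers(lspci_output):
    """Return a dict per mass storage controller (PCI class 01xx)."""
    found = []
    for line in lspci_output.splitlines():
        match = LSPCI_NN.match(line)
        if match and match.group("class_id").startswith("01"):
            found.append(match.groupdict())
    return found


# Hypothetical sample output, for illustration only.
SAMPLE = """\
00:00.0 Host bridge [0600]: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge [8086:7190] (rev 03)
02:05.0 Mass storage controller [0180]: Promise Technology, Inc. PDC40718 (SATA 300 TX4) [105a:3d17] (rev 02)
"""

if __name__ == "__main__":
    for ctrl in storage_controllers(SAMPLE):
        print(ctrl["slot"], ctrl["name"], ctrl["vendor"] + ":" + ctrl["device"])
```

The vendor:device pair is the useful part, since it identifies the exact chip regardless of how the card is branded.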

> Only way to recover when that happens is to reboot.
> 
> I don't know of a way to reset the drives other than power-cycle.
> 
> Especially since the kernel removes the disks from "/dev/".
> 
> When I had PATA drives in place of the SATA drives, the whole
> process worked seamlessly.  They'd spin down after 30 minutes, stay
> in standby until needed -- any access would be delayed by a few
> seconds until they spun back up -- but now it's either they don't go
> into standby OR they won't come online.

First of all, we need to find out why '-n standby' doesn't work when
the drive actually is in standby mode.

-- 
tejun