From: Linda W. <sma...@tl...> - 2008-09-30 18:28:16
|
Tejun Heo wrote:
>> Both drives that are doing this are "backup" drives. I.e. all they
>> store are daily system backups -- so they should only turn on in
>> early "AM" to receive the backups, but then should time-out. They
>> both *did* timeout (goto sleep) regularly, when both drives were
>> PATA Seagates. But now, I can't keep them asleep.
>>
>> It's only my SATA drives that 'should' be going to sleep now. I
>> have the two SATA drives on a Promise SATA-300 TX4 (4 internal Sata
>> ports). My "active" (only 2 other) drives are on different
>> controllers a SCSI and a PATA port. They don't sleep or spin-down.
>>
>> I noticed the following when I tried to use the sleep command on the
>> system console:
>>
>> 17:23:26 Ish kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0
>> action 0x2
>> 17:23:26 Ish kernel: ata2.00: waking up from sleep
>>
>> I don't see the console messages when trying to wake it up from standby.
>> But should I be getting kernel error messages on a wakeup from drive sleep?
>> (kern=2.6.25.12, vanilla)
>
> Any command issued to a sleeping drive triggers wake up action as
> otherwise it will just gonna timeout, so that's libata telling you
> that it's waking up the drive to process whatever pending command.
> Hmmm... it seems there needs to be a way to export that the drive is
> sleeping to userland.

========
I'm using the "-n standby" option to smartctl. Shouldn't that prevent the drive from waking if it is in standby or asleep? (That's what the man page claims.)

But a __related__ but *OPPOSITE* problem is the drive *NOT* waking up in time before being timed out as an I/O device in the kernel.

By stubbornly pushing it into standby (or sleep), I got it to sleep -- but when it was supposed to perform a daily short-test, smartd couldn't wake up the drive -- and when the kernel tried to access the drives, they couldn't be brought back online -- got I/O errors in the kernel, the disks' file systems were closed, and the devices were unmapped.

The only way to recover when that happens is to reboot. I don't know of a way to reset the drives other than a power-cycle, especially since the kernel removes the disks from "/dev/".

When I had PATA drives in place of the SATA drives, the whole process worked seamlessly. They'd spin down after 30 minutes, stay in standby until needed -- any access would be delayed by a few seconds until they spun back up -- but now it's either they don't go into standby OR they won't come online. |
From: Linda W. <sma...@tl...> - 2008-10-01 10:07:45
|
Tejun Heo wrote:
> Linda Walsh wrote:
>>> Any command issued to a sleeping drive triggers wake up action as
>>> otherwise it will just gonna timeout, so that's libata telling you
>>> that it's waking up the drive to process whatever pending command.
>>> Hmmm... it seems there needs to be a way to export that the drive is
>>> sleeping to userland.
>> ========
>> I'm using the "-n standby" option to smartctl. Shouldn't that
>> prevent the drive from waking if it is in standby or asleep (that's
>> what the man page claims).
>
> AFAIK, -n standby uses CHECK POWER MODE command to check power state
> and unfortunately ATA drive isn't required to process any command
> other than DEVICE RESET while it's sleeping. That's why libata keeps
> track of sleep state and tries to wake it up when a command needs to
> be delivered to it.
---
Wouldn't the behavior be the same with PATA? Both sets of drives were Seagate Barracudas (though the older PATA drives were a generation older, size-wise). I switched over to SATA as the PATA drives failed. The SATA controller is a Promise TX4-300.

>> But a __related__ but *OPPOSITE* problem -- is the drive *NOT*
>> waking up in time before being timed out as an I/O device in the
>> kernel.
>
> A sleeping drive is not supposed to wake up when receiving a command.
> A drive in standby mode should.
---
I wondered about that. I tried sleep after I couldn't get standby to stay put. But at least twice since I decided to temporarily ignore the problem, the system would stay up for 5-10 days, then one of the drives would hang with symptoms similar to when I manually told it to sleep -- 'cept that the drives had only been programmed to go into standby after about 30 minutes. Both times it failed, it had stayed up for days -- about 5 or 7 the 1st time, then 7-10 the second -- i.e. things mostly worked "normally" except that the drives didn't spin down when not in use (they are normally used only in the wee hours, during nightly network backups).

>> By stubbornly pushing it into standby (or sleep), I got it to sleep
>> -- but when it was suppose to perform a daily short-test, smartd
>> couldn't wake up the drive -- and when the kernel tried to access
>> the drives, they couldn't be brought back online -- got I/O errors
>> in the kernel and the disks file systems were closed and the devices
>> were unmapped.
>
> This means the state machine in either the drive or machine went
> astray and couldn't respond to commands anymore. Does unplugging
> power from the harddrive and replugging it in revives the drive? And
> which controller do you have (lspci -nn)?
---
Unplugging power...not so easy, they are internal drives.

>
>> Only way to recover when that happens is to reboot.
>> I don't know of a way to reset the drives other than power-cycle.
>> Especially since the kernel removes the disks from "/dev/".
>>
>> When I had PATA drives in place of the SATA drives, they whole
>> process worked seamlessly. They'd spin down after 30 minutes, stay
>> in standby until needed -- any access would be delayed by a few
>> seconds until they spun backup -- but now its either they don't go
>> into standby OR, they won't come online.
>
> First of all, we need to find out why '-n standby' doesn't work when
> the drive actually is in standby mode.
---
Wellllll...it does...just not on the 2nd or 3rd time. While trying to set up the drives, I had a script that looped through and displayed the temps of all drives every "X" seconds (every 10-60 seconds when I was testing) on the console. |
From: Tejun H. <ht...@gm...> - 2008-09-30 05:30:21
|
Hello, Sorry about late response. I've been traveling for more than a month. Bruce Allen wrote: > Hi Tejun, > > FYI. Feel free to ignore this, or you can respond directly to user/list > if desired. cc'ing smartmontools-support and the original reporter. > ---------- Forwarded message ---------- > Date: Sat, 13 Sep 2008 17:42:29 -0700 > From: Linda Walsh <sma...@tl...> > To: sma...@li... > Subject: [smartmontools-support] inactive SATA drives won't stay in > standby or > sleep, PATA models did. > > I'm having problems with disks staying "asleep" or "suspended" (spun-down). > > I've been trying to monitor the temperatures on the disks to help > note cooling problems. I use the "smartctl -n standby -A <device>" > command to spew out the attributes and look for "Current Drive > Temperature" or Attribute#194 (or I look for and print STANDBY|SLEEP > if that is found). > > I can force the drive to standby or sleep using the -y or -Y command > work). I've also made it go to sleep by setting the drive timeout > to 5 seconds (-S 1). But if I run my "drive_temp" command a few > times the drive will go from 'STANDBY' back to running in "fairly > short" order: usually about 30 seconds. > > When I'm running the monitoring script to poll every 10 seconds, I > can see when the script is going to return the temp -- because if it > is in STANDBY, I get maybe 2 reads, then on the 3rd, it pauses when > I issue the smartctl command and waits for it to spin-up and then > gives me the temperature. Bruce, is there a smartctl option to tell us what's going on? > It's on two different drives that I have observed this -- both > Seagate's, one a 750G, the other a 1000G. > > I tried running the 'short' tests on each (as someone else had a > similar problem that seemed to be fixed after running the short > drive health tests). The drives do claim to be in "standby", but > keep spinning back-up. Hmm... sounds like coincidence to me. > Both drives that are doing this are "backup" drives. I.e. 
all they > store are daily system backups -- so they should only turn on in > early "AM" to receive the backups, but then should time-out. They > both *did* timeout (goto sleep) regularly, when both drives were > PATA Seagates. But now, I can't keep them asleep. > > It's only my SATA drives that 'should' be going to sleep now. I > have the two SATA drives on a Promise SATA-300 TX4 (4 internal Sata > ports). My "active" (only 2 other) drives are on different > controllers a SCSI and a PATA port. They don't sleep or spin-down. > > I noticed the following when I tried to use the sleep command on the > system console: > > 17:23:26 Ish kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 > action 0x2 > 17:23:26 Ish kernel: ata2.00: waking up from sleep > > I don't see the console messages when trying to wake it up from standby. > But should I be getting kernel error messages on a wakeup from drive sleep? > (kern=2.6.25.12, vanilla) Any command issued to a sleeping drive triggers wake up action as otherwise it will just gonna timeout, so that's libata telling you that it's waking up the drive to process whatever pending command. Hmmm... it seems there needs to be a way to export that the drive is sleeping to userland. Thanks. -- tejun |
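The polling Linda describes (run `smartctl -n standby -A <device>`, then look for attribute 194 or the words STANDBY/SLEEP) could be sketched roughly as below. This is not her actual script: the function names `parse_smartctl_output` and `drive_temp` are made up for illustration, and the column layout of the attribute table (raw value in the tenth whitespace-separated field) is an assumption about typical smartctl output.

```python
import re
import subprocess


def parse_smartctl_output(text):
    """Classify smartctl output as ('standby', None), ('temp', int) or ('unknown', None)."""
    # With -n standby, smartctl reports the power mode instead of the
    # attributes when the drive is spun down (assumed wording: STANDBY or SLEEP).
    if re.search(r"\b(STANDBY|SLEEP)\b", text):
        return ("standby", None)
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0] == "194":
            return ("temp", int(fields[9]))
    return ("unknown", None)


def drive_temp(device):
    """Run smartctl once for `device` and classify the result."""
    # -n standby: do not query the drive if it reports a low-power mode
    # -A: print the vendor-specific attribute table
    proc = subprocess.run(
        ["smartctl", "-n", "standby", "-A", device],
        capture_output=True,
        text=True,
    )
    return parse_smartctl_output(proc.stdout)
```

Keeping the parsing separate from the `smartctl` call means the parser can be checked against saved output without touching (and possibly waking) a real drive.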
From: Tejun H. <ht...@gm...> - 2008-10-04 03:09:51
|
Hello, Linda Walsh wrote: > Tejun Heo wrote: >> Linda Walsh wrote: >>>> Any command issued to a sleeping drive triggers wake up action as >>>> otherwise it will just gonna timeout, so that's libata telling you >>>> that it's waking up the drive to process whatever pending command. >>>> Hmmm... it seems there needs to be a way to export that the drive is >>>> sleeping to userland. >>> ======== >>> I'm using the "-n standby" option to smartctl. Shouldn't that >>> prevent the drive from waking if it is in standby or asleep (that's >>> what the man page claims). >> >> AFAIK, -n standby uses CHECK POWER MODE command to check power state >> and unfortunately ATA drive isn't required to process any command >> other than DEVICE RESET while it's sleeping. That's why libata keeps >> track of sleep state and tries to wake it up when a command needs to >> be delivered to it. > --- > Wouldn't the behavior be the same with PATA? Both sets of drives were > Seagate Barracuda (though the older PATA drives were a generation > (size-wise) > older) I switched over to SATA as the PATA failed. Sata controller is a > Promise TX4-300 They use completely different command transport and new firmware. It would be strange if their behaviors don't differ on corner cases. :-P >>> But a __related__ but *OPPOSITE* problem -- is the drive *NOT* >>> waking up in time before being timed out as an I/O device in the >>> kernel. >> >> A sleeping drive is not supposed to wake up when receiving a command. >> A drive in standby mode should. > --- > I wondered about that. I tried sleep after I couldn't get standby > to stay put. But at least twice since I just decided too > temporarily ignore the problem, the system would stay up for 5-10 > days, then one of the drives would hang with similar symptoms as > when I manually told it to sleep -- 'cept that they drives had only > been programmed to go into standby after about 30 minutes. 
But both > times it failed, it stayed up for days -- about 5 or 7 the 1st time, > then 7-10 the second -- i.e. things mostly worked "normally" except > that the drives didn't spin down when not in use (normally only in > wee-hours during nightly network backups). I'm having a bit of problem understanding what actually happened. Can you explain it in easier way? >> First of all, we need to find out why '-n standby' doesn't work when >> the drive actually is in standby mode. > --- > Wellllll...it does...just not on the 2nd or 3rd time. While trying > to setup the drives, I had a script that looped through and > displayed the Temps of all drives every "X" seconds (every 10-60 > seconds when I was testing) on the console. Bruce, is there any way to debug this? (are you on vacation?) Thanks. -- tejun |
From: Bruce A. <ba...@gr...> - 2008-10-27 12:21:06
|
>> Wellllll...it does...just not on the 2nd or 3rd time. While trying
>> to setup the drives, I had a script that looped through and
>> displayed the Temps of all drives every "X" seconds (every 10-60
>> seconds when I was testing) on the console.
>
> Bruce, is there any way to debug this? (are you on vacation?)

Yes, I was on vacation, two wonderful weeks with no network connection!

Try using '-r ioctl,3' on the command line that starts smartd, to get some debugging info.

Cheers,
Bruce |
From: Linda W. <sma...@tl...> - 2008-10-07 02:16:05
|
Controller is a Promise TX4/300

Is this what you were looking for?:

Oct 6 16:59:14 ish kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 6 16:59:14 ish kernel: ata2.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Oct 6 16:59:14 ish kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 6 16:59:14 ish kernel: ata2.00: status: { DRDY }
Oct 6 16:59:20 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 16:59:24 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 16:59:30 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 16:59:34 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 16:59:40 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:00:09 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:09 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:00:14 ish dhcpd: Forward map from ns1.sc.tlinx.org to 192.168.3.242 FAILED: Has an A record but no DHCID, not mine.
Oct 6 17:00:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:00:15 ish kernel: ata2.00: disabled
Oct 6 17:00:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t4
Oct 6 17:00:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:00:20 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:00:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:30 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:00:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:00:40 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:01:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:01:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:01:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t3
Oct 6 17:01:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:01:20 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:01:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:30 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:01:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:01:40 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:02:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:02:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:02:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t2
Oct 6 17:02:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:02:21 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:02:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:31 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:02:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:02:41 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:03:01 ish sshd[4020]: error: channel 0: chan_read_failed for istate 3
Oct 6 17:03:10 ish syslog-ng[13177]: last message repeated 2 times
Oct 6 17:03:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:03:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:03:15 ish kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t1
Oct 6 17:03:15 ish kernel: ata2: hotplug_status 0x22
Oct 6 17:03:21 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:03:25 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:31 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:03:35 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:03:41 ish kernel: ata2: link is slow to respond, please be patient (ready=-19)
Oct 6 17:04:10 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:04:10 ish kernel: ata2: limiting SATA link speed to 1.5 Gbps
Oct 6 17:04:15 ish kernel: ata2: COMRESET failed (errno=-16)
Oct 6 17:04:15 ish kernel: ata2: reset failed, giving up
Oct 6 17:04:15 ish kernel: ata2: EH pending after 5 tries, giving up
Oct 6 17:04:15 ish kernel: sd 2:0:0:0: rejecting I/O to offline device
Oct 6 17:04:15 ish kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 6 17:04:15 ish kernel: sd 2:0:0:0: [sdc] START_STOP FAILED
Oct 6 17:04:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:04:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:05:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:05:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:06:31 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:07:30 ish syslog-ng[13177]: last message repeated 2 times
Oct 6 17:07:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:07:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:08:32 ish kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 6 17:08:32 ish kernel: ata1.00: cmd b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 0
Oct 6 17:08:32 ish kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 6 17:08:32 ish kernel: ata1.00: status: { DRDY }
Oct 6 17:08:38 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:08:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:08:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:08:48 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:08:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:08:58 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:09:21 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:09:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct 6 17:09:32 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:32 ish kernel: ata1: reset failed, giving up
Oct 6 17:09:32 ish kernel: ata1.00: disabled
Oct 6 17:09:32 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t4
Oct 6 17:09:32 ish kernel: ata1: hotplug_status 0x88
Oct 6 17:09:38 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:09:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:48 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:09:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:09:57 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:09:58 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:10:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct 6 17:10:32 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:32 ish kernel: ata1: reset failed, giving up
Oct 6 17:10:32 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t3
Oct 6 17:10:32 ish kernel: ata1: hotplug_status 0x88
Oct 6 17:10:33 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:10:38 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:10:42 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:48 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:10:52 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:10:58 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:11:09 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:11:27 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:27 ish kernel: ata1: limiting SATA link speed to 1.5 Gbps
Oct 6 17:11:33 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:33 ish kernel: ata1: reset failed, giving up
Oct 6 17:11:33 ish kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0xe frozen t2
Oct 6 17:11:33 ish kernel: ata1: hotplug_status 0x88
Oct 6 17:11:38 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:11:43 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:45 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.
Oct 6 17:11:48 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:11:53 ish kernel: ata1: COMRESET failed (errno=-16)
Oct 6 17:11:58 ish kernel: ata1: link is slow to respond, please be patient (ready=-19)
Oct 6 17:12:17 ish xinetd[2021]: Exiting...
Oct 6 17:12:17 ish kernel: nfsd: last server has exited
Oct 6 17:12:17 ish kernel: nfsd: unexporting all filesystems
Oct 6 17:12:17 ish apcupsd[1989]: apcupsd exiting, signal 15
Oct 6 17:12:17 ish apcupsd[1989]: apcupsd shutdown succeeded
Oct 6 17:12:17 ish rpc.statd[2074]: Caught signal 15, un-registering and exiting.
Oct 6 17:12:17 ish mountd[2075]: Caught signal 15, un-registering and exiting.
Oct 6 17:12:21 ish kernel: Filesystem "sdc1": xfs_log_force: error 5 returned.

Tejun Heo wrote:
> Linda Walsh wrote:
>
>> So the real problem is why issuing a smart command isn't re-starting
>> the drive -- or bringing it back from standby. Whereas a "normal" disk
>> read seems to bring it back to normal functioning just fine (and can
>> then do the smart-test).
>>
>> Does this give anyone ideas about where the problem might be? Also
>> sorta explains why my hangs have been infrequent, because I've been
>> periodically polling the temps of all the drives -- and only when I stop
>> the polling would the drive timeout, then die the next morning when
>> smartd tried to run a short test between 1 and 2 am.
>>
>
> Sounds like a firmware problem to me. Issuing ATA_CMD_VERIFY on block
> 0 before issuing test commands should work around the problem. Also,
> which controller are you using? Can you post the failing kernel log?
>
> |
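For anyone skimming a log like the one in this message, a small tally per ATA port makes the pattern easier to see: repeated COMRESET failures, the reset code giving up, and finally the device being disabled. The sketch below is only an illustration; the function name `summarize_ata_log` is made up, and the regular expression assumes the syslog line shape seen in the message above.

```python
import re
from collections import defaultdict

# Matches e.g. "Oct 6 17:00:15 ish kernel: ata2.00: disabled"
# and captures the port ("ata2") plus the rest of the message.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_log(lines):
    """Count reset failures per ATA port and note which ports were disabled."""
    summary = defaultdict(
        lambda: {"comreset_failures": 0, "gave_up": 0, "disabled": False}
    )
    for line in lines:
        match = ATA_LINE.search(line.strip())
        if not match:
            continue  # non-libata lines (dhcpd, sshd, xfs, ...) are ignored
        port, message = match.groups()
        entry = summary[port]
        if message.startswith("COMRESET failed"):
            entry["comreset_failures"] += 1
        elif message.startswith("reset failed, giving up"):
            entry["gave_up"] += 1
        elif message == "disabled":
            entry["disabled"] = True
    return dict(summary)
```

It could be fed the lines of a saved log file, for example `summarize_ata_log(open("messages.txt"))`, where `messages.txt` is a hypothetical file name.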
From: Linda W. <lk...@tl...> - 2008-10-07 00:39:01
|
Ok, this is my "latest" theory about why my SATA disks have been acting strange.

Normally I have the drives set to go into standby after 30 minutes of inactivity. This "can" work -- unless (and this may be obvious to some people, but it's not entirely intuitive) ...unless you query the drive's temperature with smartctl periodically.

So..._using_ the "-n standby" on smartctl doesn't have an effect unless the drive is already on standby -- but if it is *not* on standby, then it counts as drive activity and resets the "goto sleep timer".

This isn't the worst problem -- more of an annoyance. I didn't try to keep track of all the drives' temperatures until I started having the 2nd problem which is decidedly "nastier"...

Second problem -- if a drive is in standby, then if smartctl or smartd try to run the short or long self-tests, the kernel starts issuing time-out errors, and the drive is eventually, _logically_ removed from the system. It never comes back from standby.

If I *access* the drive (do an 'ls' of a directory on the drive that isn't in the cache buffers), then after a ~20 second pause, the drive has spun up and all is good. But, for some reason, the "smart" test functionality isn't causing the drive to wake up. Instead the kernel views the drive as OTL (OutToLunch) and removes it from the device table.

This is, IMO, the more serious problem and is a regression compared to PATA disk functionality.

The bit of periodically checking temps resetting the activity timer -- that isn't something I normally was trying to do -- I only started that to try to debug why the drives were going offline (didn't know if temps were related, among other reasons). But in the process of checking the temps, I was also (I am guessing about the functionality based on observation) resetting the inactivity timer.

So the real problem is why issuing a smart command isn't re-starting the drive -- or bringing it back from standby. Whereas a "normal" disk read seems to bring it back to normal functioning just fine (and can then do the smart-test).

Does this give anyone ideas about where the problem might be? Also sorta explains why my hangs have been infrequent, because I've been periodically polling the temps of all the drives -- and only when I stop the polling would the drive timeout, then die the next morning when smartd tried to run a short test between 1 and 2 am. |
From: Tejun H. <ht...@gm...> - 2008-10-07 01:10:37
|
Linda Walsh wrote:
> So the real problem is why issuing a smart command isn't re-starting
> the drive -- or bringing it back from standby. Whereas a "normal" disk
> read seems to bring it back to normal functioning just fine (and can
> then do the smart-test).
>
> Does this give anyone ideas about where the problem might be? Also
> sorta explains why my hangs have been infrequent, because I've been
> periodically polling the temps of all the drives -- and only when I stop
> the polling would the drive timeout, then die the next morning when
> smartd tried to run a short test between 1 and 2 am.

Sounds like a firmware problem to me. Issuing ATA_CMD_VERIFY on block 0 before issuing test commands should work around the problem. Also, which controller are you using? Can you post the failing kernel log?

--
tejun |
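From userland, the same idea (touch the disk first, then start the self-test) might look like the sketch below. Note the substitution: it does not issue ATA_CMD_VERIFY, which is a kernel-side command; it swaps in an ordinary read, since Linda reports that a normal read does spin the drive up. The names `wake_by_read` and `short_test_after_wake` are hypothetical, and whether this avoids the timeout on the Promise controller is untested.

```python
import os
import subprocess


def wake_by_read(device, nbytes=4096):
    """Issue an ordinary read so the drive spins up as it does for normal I/O.

    Caveat: if the requested blocks are already in the page cache, the read
    may never reach the disk, so the drive could stay in standby.
    """
    fd = os.open(device, os.O_RDONLY)
    try:
        return len(os.read(fd, nbytes))
    finally:
        os.close(fd)


def short_test_after_wake(device, wake=wake_by_read, run=subprocess.run):
    """Wake the drive first, then ask smartctl to start a short self-test."""
    wake(device)
    # -t short: start the short self-test on the device
    return run(["smartctl", "-t", "short", device])
```

The `wake` and `run` parameters exist only so the ordering can be exercised without a real drive or root access.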
From: Tejun H. <ht...@gm...> - 2008-10-07 02:14:38
|
Linda Walsh wrote:
> Controller is a Promise TX4/300

Yeap. After the drive goes offline, does unplugging and replugging the power cable to the hard drive make it come back?

--
tejun |
From: Linda W. <sma...@tl...> - 2008-10-07 10:16:18
|
Tejun Heo wrote:
> Linda Walsh wrote:
>
>> Controller is a Promise TX4/300
>>
>
> Yeap. After the drive goes offline, does unplugging and replugging
> the power cable to the harddrive makes it come back?
>
>
That's not easy to do. It's an internal drive ... I will have to find some time to take the system down and apart for that type of testing. If I power-cycle the whole machine it comes back up ...but that's probably not what you mean... :-/ |
From: Linda W. <sma...@tl...> - 2008-10-07 22:28:25
|
Tejun Heo wrote:
> Linda Walsh wrote:
>
>> Controller is a Promise TX4/300
>> Yeap. After the drive goes offline, does unplugging and replugging
>> the power cable to the harddrive makes it come back?
>>
----
No. It hangs the computer about 2-3 seconds after plugging the drives back in. Did it twice to verify it wasn't a fluke. Verified the drives were removed from /dev, then plugged them back in -- was able to do about 1-2 ls commands on /dev, then the keyboard goes dead.

The first time I tried unplugging the power cables and replugging -- that hung... The 2nd time I tried unplugging a SATA cable and replugging -- that hung too.

Hopefully you won't need any more tests of this exact nature...? :-) |
From: Tejun H. <ht...@gm...> - 2008-10-08 00:01:49
|
Linda Walsh wrote: > Tejun Heo wrote: >> Linda Walsh wrote: >> >>> Controller is a Promise TX4/300 >>> Yeap. After the drive goes offline, does unplugging and replugging >>> the power cable to the harddrive makes it come back? >>> > ---- > No. It hangs the computer. about 2-3 seconds after plugging the > drives back in. Did it twice to verify it wasn't a fluke. Verified > drives removed from /dev, then > plugged them back in -- was able to do about 1-2 ls commands on /dev, then > keyboard goes dead. > > First time I tried unplugging the power cables and replugging -- > that hung... > 2nd time tried unplugging a sata cable and replugging -- that hung too. Ah.. okay, so the controller went bonkers then. Any chance you can shell out ~15 bucks and try a sil SATA controller? > Hopefully you won't need any more tests of this exact nature...? :-) Wasn't it fun and empowering? :-P -- tejun |
From: Linda W. <sma...@tl...> - 2008-10-15 22:38:23
|
Grrr...grumble...

I went ahead and tried to order a Sil SATA card, but not knowing the sili-landscape, I ended up with an eSATA-II card (it just came today) instead of the SATA-II that the card is called -- I thought SATA meant internal connectors and eSATA meant external connectors, but there must be some overlap -- I ended up with an ADD 4 port SATA II Raid controller that could handle single as well as multiple drives for about $98 (including shipping).

Of course I intelligently[sic] ordered from a 2nd tier vendor that says it doesn't accept returns of non-defective merchandise (vendor CWOL.COM), so I have to start over again to try ordering a 66MHz PCI-compatible card (it's going on an Intel 440-BX MB).

I figure/am hoping that if I can stabilize my server, I can still use this card, since I hope to start using external hard disks for further expansion in the future. But for now, I'd like to get an internal card -- I sure didn't find any cards that were close to ~$15, but maybe that's because I opted for SATA-II and the optional RAID features I thought I might use.

The chip is a Silicon Image Steelvine Sil3124ACBHU QS105.1-9 0808 ADO3AX2. The card is labeled SATA2-PCIX01 and said it was (and appears to be) PCI compatible. But dang-it! it's eSATA...sooooo....

Do you have any particular card+vendor that has the NCQ, (what is PMP?), SATA-II and maybe the RAID support (not really needed, but always thinking about futures...so if RAID adds more than $30 to price, it's not worth it) -- that might be "known"-stone-cold reliable?

Slight tangent: my SCSI disk supports DPO and FUA, but my SATA's have messages that they don't support DPO or FUA -- is that a property of the disks or would a different controller affect that as well? What ARE DPO/FUA -- they seem to have something to do with WriteCache -- which I usually turn on given I have the system on a UPS, but the "reliable uptime" on the machine has fallen -- measured in ~5-7ish days now between hangs, whereas before (with the Promise PATA controller and disks), it was ~infinite (only planned downtimes).

Also, with PATA, I was able to use full ACPI support, but now I need to boot with acpi=noirq (or off) to get more than a day of stable uptime.

Anyway -- am trying to move to a different controller to see not only if the SMART-kernel-timeout probs are controller related, but also if the hangs go away.

Suggestions? Sources?

Thanks, sorry for the delay in getting HW, but such is internet & mail delivery of products...

Thanks!
-linda

Tejun Heo wrote:
> Linda Walsh wrote:
>
>> A Sil Sata controller?
>>
>> silicon...? something? any particular model?
>>
>> Will any of them give me working NCQ or such?
>>
>
> If you want NCQ and PMP support, get something w/ Silicon Image 3124
> (pci) or 3132 (pci-e). Otherwise, you can get one of sil3112/3512/3114.
> They all are pretty cheap these days.
>
>
>>> Ah.. okay, so the controller went bonkers then. Any chance you can
>>> shell out ~15 bucks and try a sil SATA controller?
>>>
>>>
>>>
>>>> Hopefully you won't need any more tests of this exact nature...? :-)
>>>>
>>>>
>>> Wasn't it fun and empowering? :-P
>>>
>>>
>> ---
>> Very thrilling... though not quite as pretty as Win's blue-screen
>> blue.... ;^/
>>
>
> :-)
>
> |
From: Tejun H. <ht...@gm...> - 2008-10-16 02:41:42
|
Linda Walsh wrote:
> Do you have any particular card+vendor that has the NCQ,
> (what is PMP?), SATA-II and maybe the RAID support (not really needed, but
> always thinking about futures...so if RAID adds more than $30 to price,
> its not worth it) -- that might be "known"-stone-cold reliable?

I don't know. I have a couple of them but they are all from local
manufacturers (South Korea) so I don't think they'll be available over
there. As long as the chip is 3124, it should be okay.

--
tejun
|
From: Linda W. <sma...@tl...> - 2008-10-22 03:41:29
|
This is with 2.6.26.5 (there are multiple other problems with 2.6.27[.0]).

The problem with the drive going "offline" doesn't happen with a
sil_sata(3124) controller -- so no need to unplug and replug...

I.e. when the drives are in standby, if smartd or a smartctl command
attempts to run a drive self-test (short), I get timeout errors from the
Promise controller (which hangs the sys if I try unplugging/replugging
the cable to the hung drive).
The drives correctly spin up to speed and perform the short-test with
the sil controller.

It would seem there is a problem with the Promise controller or driver?

Tejun Heo wrote:
> Linda Walsh wrote:
>
>> Tejun Heo wrote:
>>
>>> Linda Walsh wrote:
>>>
>>>> Controller is a Promise TX4/300
>>>> Yeap. After the drive goes offline, does unplugging and replugging
>>>> the power cable to the harddrive makes it come back?
>>>>
>> ----
>> No. It hangs the computer. about 2-3 seconds after plugging the
>> drives back in. Did it twice to verify it wasn't a fluke. Verified
>> drives removed from /dev, then
>> plugged them back in -- was able to do about 1-2 ls commands on /dev, then
>> keyboard goes dead.
>>
>> First time I tried unplugging the power cables and replugging --
>> that hung...
>> 2nd time tried unplugging a sata cable and replugging -- that hung too.
>>
>
> Ah.. okay, so the controller went bonkers then. Any chance you can
> shell out ~15 bucks and try a sil SATA controller?
>
>
>> Hopefully you won't need any more tests of this exact nature...? :-)
>>
>
> Wasn't it fun and empowering? :-P
>
>
|
From: Tejun H. <ht...@gm...> - 2008-10-22 04:13:46
|
Linda Walsh wrote:
> This is with 2.6.26.5 (there are multiple other problems with 2.6.27[.0]).
>
> The problem with the drive going "offline" doesn't happen with a
> sil_sata(3124) controller -- so no need to unplug and replug...
>
>
> I.e. when the drives are in standby, if smartd or a smartctl command
> attempts to run a drive self-test (short), I get timeout errors from the
> Promise controller (which hangs the sys if I try unplugging/replugging
> the cable to the hung drive).
> The drives correctly spin up to speed and perform the short-test with
> the sil controller.
>
> It would seem there is a problem with the Promise controller or driver?

Yeah, Mikael found out that hardreset requires controller reset before
it. Hopefully, it will be fixed soon.

Thanks.

--
tejun
|
From: Tejun H. <ht...@gm...> - 2008-09-30 18:23:55
|
Linda Walsh wrote:
>> Any command issued to a sleeping drive triggers wake up action as
>> otherwise it will just gonna timeout, so that's libata telling you
>> that it's waking up the drive to process whatever pending command.
>> Hmmm... it seems there needs to be a way to export that the drive is
>> sleeping to userland.
> ========
>
> I'm using the "-n standby" option to smartctl. Shouldn't that
> prevent the drive from waking if it is in standby or asleep (that's
> what the man page claims).

AFAIK, -n standby uses the CHECK POWER MODE command to check the power
state and, unfortunately, an ATA drive isn't required to process any
command other than DEVICE RESET while it's sleeping. That's why libata
keeps track of the sleep state and tries to wake the drive up when a
command needs to be delivered to it.

> But a __related__ but *OPPOSITE* problem -- is the drive *NOT*
> waking up in time before being timed out as an I/O device in the
> kernel.

A sleeping drive is not supposed to wake up when receiving a command.
A drive in standby mode should.

> By stubbornly pushing it into standby (or sleep), I got it to sleep
> -- but when it was suppose to perform a daily short-test, smartd
> coudln't wake up the drive -- and when the kernel tried to access
> the drives, they coudln't be brought back online -- got I/O errors
> in the kernel and the disks file systems were closed and the devices
> were unmapped.

This means the state machine in either the drive or machine went
astray and couldn't respond to commands anymore. Does unplugging
power from the harddrive and replugging it revive the drive? And
which controller do you have (lspci -nn)?

> Only way to recover when that happens is to reboot.
>
> I don't know of a way to reset the drives other than power-cycle.
>
> Especially since the kernel removes the disks from "/dev/".
>
> When I had PATA drives in place of the SATA drives, they whole
> process worked seemlessly. They'd spin down after 30 minutes, stay
> in standby until needed -- any access would be delayed by a few
> seconds until they spun backup -- but now its either they don't go
> into standby OR, they won't come online.

First of all, we need to find out why '-n standby' doesn't work when
the drive actually is in standby mode.

--
tejun
|
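For readers following the '-n standby' discussion above, here is a rough
Python sketch of the decision being described: read the drive's power
mode via CHECK POWER MODE, then skip the check if the drive is at or
below the chosen threshold. This is an illustration only, not
smartmontools or libata code. The function names are hypothetical. The
result codes (0x00 standby, 0x80 idle, 0xFF active or idle) are the ATA
CHECK POWER MODE values, and treating "no answer" as SLEEP is a modelling
assumption based on the point that a sleeping drive need not process the
command at all.

```python
# Hypothetical sketch of the "-n POWERMODE" skip logic; not smartmontools code.

# ATA CHECK POWER MODE result codes (returned in the count field).
POWER_MODES = {
    0x00: "STANDBY",
    0x80: "IDLE",
    0xFF: "ACTIVE or IDLE",
}

# Thresholds accepted by "-n", from least to most conservative.
THRESHOLDS = ("never", "sleep", "standby", "idle")


def describe_power_mode(count):
    """Return a readable name for a CHECK POWER MODE result.

    count is None when the drive did not answer, which this sketch
    treats as SLEEP (an assumption, see the note above).
    """
    if count is None:
        return "SLEEP (no response)"
    return POWER_MODES.get(count, "UNKNOWN (0x%02X)" % count)


def should_skip_check(threshold, count):
    """Decide whether a SMART check should be skipped.

    threshold: one of THRESHOLDS, mirroring "-n never|sleep|standby|idle".
    count:     CHECK POWER MODE result, or None for a silent drive.
    """
    if threshold not in THRESHOLDS:
        raise ValueError("unknown threshold: %r" % (threshold,))
    if threshold == "never":
        return False  # always check, even if that spins the drive up

    # Rank the drive's state on the same scale as THRESHOLDS.
    if count is None:
        rank = 1  # sleep
    elif count == 0x00:
        rank = 2  # standby
    elif count == 0x80:
        rank = 3  # idle
    else:
        return False  # active (or unknown): go ahead with the check

    # Skip when the drive is in the threshold state or a deeper one.
    return rank <= THRESHOLDS.index(threshold)


if __name__ == "__main__":
    for code in (None, 0x00, 0x80, 0xFF):
        print(describe_power_mode(code),
              "-> skip with -n standby:", should_skip_check("standby", code))
```

With this model, "-n standby" skips a drive that reports standby (or does
not answer) and proceeds for idle or active drives, which is the
behaviour the thread expects and is trying to debug.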