I'm fairly new to this list. However, I'm an EE with 20+ years of
experience in embedded design. One thing I've learned over time is
that where there is smoke, there is often fire.
My purpose in this post is to detail some bad experiences in running
disk diagnostics on mounted systems, with one disk in particular, to
both share this information and hopefully get a dialog going with
people more knowledgeable than me. I am currently of the opinion that
there is a causal relationship between running "short" diagnostics
and the health of Maxtor MAXLINE II 250G drives [possibly just under
** Long story to follow **
I had posted a bit about this before, but here is more information.
My daughter just started at NYU, and I got her a new iMac (Macintosh)
in September. In November, there was a power outage, and the system
would not boot one morning. I went in, picked up the computer, and
found that a few sectors had uncorrectable (UNC) errors. The partition
map was corrupted (unreadable) but the rest of the disk was fine, and
I recovered every file with a data recovery program called "Data
Rescue X" (Mac and Windows version).
After this happened, I found out about smartctl, and used it to test
the new drive (and every other drive I had!).
FIRST ODD ISSUE
In February, the computer would again not boot. By this time, I had
also added a UPS (and tested it - it shuts the computer down after the
battery drains - and it worked fine). I went to the dorm room with a Firewire
drive, and after I booted the computer, I was able to see the drive
in the Mac disk check utility (ie partition was viewable). I did a
file system check and it was corrupt, and the vanilla Mac program
(Disk Utility) could not fix it. But it mounted, and I could see
files on it.
OK, I think, let me start up smartctl and look at the disk (-a). No
errors reported - SMART is OK. So, I run a "-t short". Now, I try to
use the recovery program again (Data Rescue). It starts complaining
like mad. So, shut the system down and take it home (I was hoping for
a quick fix).
When I get the system home, in a nice cool basement, and start
looking at it, I find THOUSANDS of corrupt sectors. I end up spending
the next few days using "dd" to copy chunks of good blocks to a
second 200G drive (remember, this is a 250G drive). The partition map
is gone, tons of sectors in the first part of the disk are UNC.
However, the user files are mostly good - I end up recovering most of
her personal files. But OS files and OS directories are all UNC.
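For anyone facing a similar rescue, the chunked dd approach above can be sketched roughly as below. The paths, block size, and chunk counts are placeholders (the demo copies an ordinary file so it can run anywhere); on a real rescue, SRC would be the failing raw device, and conv=noerror,sync is what keeps dd moving past UNC sectors, zero-padding unreadable blocks so the offsets on the copy still line up.

```shell
#!/bin/sh
# Sketch of chunked block-level rescue with dd. For the demo, SRC and DST
# are plain files; on a real drive they would be raw devices.
SRC=${SRC:-/tmp/demo-src.img}
DST=${DST:-/tmp/demo-dst.img}
BS=512         # block size in bytes
CHUNK=64       # blocks copied per dd pass
TOTAL=256      # total blocks to attempt

# Demo only: fabricate a source image if none exists.
[ -e "$SRC" ] || dd if=/dev/zero of="$SRC" bs="$BS" count="$TOTAL" 2>/dev/null

i=0
while [ "$i" -lt "$TOTAL" ]; do
  # skip/seek keep source and destination offsets aligned, so a chunk
  # that hits bad sectors can fail without shifting everything after it.
  dd if="$SRC" of="$DST" bs="$BS" skip="$i" seek="$i" count="$CHUNK" \
     conv=noerror,sync 2>/dev/null
  i=$((i + CHUNK))
done
echo "copied $TOTAL blocks of $BS bytes"
```

Doing the copy in chunks also means a region that hangs the drive can simply be skipped and retried later, rather than aborting the whole pass.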
I wonder at the time if running the "-t short" diagnostic worsened
the situation - I just don't know why I could mount the drive for a
while, then later on it was much sicker. And surprisingly, after I
start the recovery, it does not get worse. That is, the bad sectors
stay bad, and the good ones stay good.
SECOND ODD ISSUE
So, Apple is great and sends me a replacement. I reformat it with
"Zero all sectors" after the disk gets to operating temperature - I'm
thinking at the time that the original drive maybe got overheated
(Apple runs the ambient at 53.0 - 53.5C with a drive spec of 55C max).
I do a bunch of short tests, and all is well. I do an extended test,
all is well.
I get the system back together, take it into NYC, and daughter is happy.
Apple uses variable speed fans to cool the iMac, and people have
figured out how to tell the RPM setting and the ambient temps from
various zones in the system.
I start watching this like a hawk, every day or so checking it, and
all appears well.
One other point - since this happened I put a backup program on her
machine that automatically backs up files at 11:30 every night to an
online service (.Mac).
March 3rd, at about 7:45PM, I connect and check the fans, and do a
smartctl "-a" to see how the disk looks. It looks fine. So, [stupid
me], I think "Hey, let me try another short diagnostic". So I run "-t
short", wait the recommended two minutes, and then type "-a" [I have
this record]. It looks fine.
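For reference, the check sequence I was running looks like the sketch below. The device path is hypothetical (adjust for your system), and this version only prints each command instead of executing it, so it is safe to run with no drive attached. One thing I would add in hindsight: smartctl -l selftest reads the drive's self-test log itself, which can show results that the summary from -a glosses over.

```shell
#!/bin/sh
# Dry-run sketch of the short-test routine. DEV is a hypothetical device
# path; on Linux it might be /dev/sda, on OS X /dev/disk0.
DEV=${1:-/dev/disk0}
run() { echo "+ $*"; }   # dry-run helper: print instead of executing

run smartctl -a "$DEV"           # baseline: attributes and error log
run smartctl -t short "$DEV"     # queue the ~2 minute short self-test
run sleep 120                    # wait the recommended two minutes
run smartctl -l selftest "$DEV"  # then read the self-test log itself
```

To actually execute the commands, the run() helper would be removed; the dry-run form just documents the order of operations.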
My daughter says she was working on homework that night (in a folder
that gets backed up), but she is not sure how late. However, the
online backup folder has no files from the 11:30 PM run on March 3rd
(or any from that day). Therefore, I do not believe the computer was running at this
time - which is 4 hours after I ran the short test.
So, March 4th, I get the second call of "My system won't boot". I go
in with the firewire drive, but now I experience something totally
foreign - the system chirps and clicks during the boot, and when it
finally comes up, the system reporting tools say there is a MAXTOR
CALYPSO device on the SATA bus, and it has 0 bytes of storage.
I take the system home, and sure enough, no matter what you do, the
disk drive makes chirps and clicks during boot. I tried it hot, I
cooled the disk off outside (like 30F in the garage) - no change -
the drive always does the same thing. It is totally absolutely gone -
with only 270 hours of life on the drive [I have the smartctl -a
report of this session].
I have pondered this for a while, and built a theory. Suppose that
the concurrent diagnostic procedure has a bug - I've been in the
embedded business for decades - and this wouldn't be the first.
Suppose that this bug is something that leaves an uninitialized
register or pointer. And suppose that the end result is that
sometimes the chip that does the ECC for WRITES gets reprogrammed.
In this case, all writes would create sectors that later on would
appear corrupt. This would fit the model of my first crash perfectly.
Sectors that the OS was using (for programs and OS writeable files)
were hosed (partition map, sectors 1-8, etc). But files that were
probably not in the cache were fine.
Now, on this second crash - does the disk update its private sectors
after an offline test? Could it have corrupted the sectors where it
keeps the information on the drive type and number of sectors etc?
It has to be something like this. The drive could not just wear out
in 270 hours. It was working PERFECTLY until I did the short
diagnostic test - and within 4 hours it was totally catatonic. This
just cannot be a coincidence.
I guess I just did my last short/extended test on a desktop machine.
I'm sure the manufacturers do some testing. But, does Darwin/BSD do a
different sequence of disk accesses than Windows? Than Linux?
Is there timing involved?
I'm sure that many of you have good track records with brand X in
your data centers, with OS Y. Maybe most of you have RAID so you are
not too concerned with a single disk failure.
For me, I think I got too close to the bleeding edge. I tried to
become too proactive in looking for a problem that probably would
never happen (ie my daughter's iMac will probably be tossed away
before the drive **really** fails.)
Who can I press this issue with? No one - really - no one is going to
take it seriously. Unless I can build a test harness and show that an
error happens with a certain procedure, I could never get anyone's
attention. Even with that it would be an uphill battle.
Just ask yourself what the probability is of a new drive (that just
passed an offline test) experiencing a catastrophic failure within 4
hours of a short test.
PS: the drive is a Maxtor Maxline II 250G drive (and one that is
targeted to the consumer market and not the IT market).