From: Delian K. <sma...@kr...> - 2005-02-03 14:23:36

Please do not read the whole message if you don't have the time. Only the
sections between the stars are important.

Hello,

I have a failing 60 GB Maxtor hard drive. When I tried to diagnose what was
wrong with it, I found it had 266 already reallocated blocks and 260 more
pending reallocation. I found the HOWTO on the smartmontools site and tried
to write some data to the failing sectors, without success. I had
previously tried zeroing the complete drive, with the same result.

At this point I decided that the drive had physical problems and that its
"spare sectors pool" had been exhausted. This brings me to my first major
question:

******
* Is there a way to determine how many "spare sectors" a drive offers?
* If not at run time, shouldn't it be specified by the manufacturer
* somewhere?
* If not, are you aware of such data for some of the modern IDE drives?
******

I googled around to see how to force the reallocation, since I had the
feeling this drive was not gone yet. I wasn't very successful, and found
nothing but the method your HOWTO offers: writing some data to the failing
sector.

As far as I knew, the "HDD low level format" does nothing but overwrite the
complete drive with some pattern, usually zeroes. I read about it on the
Maxtor site, and their PowerMax tool was indeed documented to work this
way. I didn't have much hope in it, but I decided to give it a try. I also
hoped there might be some feature, not yet known to me, which this software
might offer.

Maxtor provides nothing but a Windows executable to create the boot
diskette with PowerMax. I tried "apt-get install wine", and was quite
disappointed when "./wine ./powermax.exe" didn't find the floppy drive. A
quick look at wine's config options showed me that everything "should" be
OK. Anyway, it obviously wasn't.

So I left this tool alone until several days later, when I was in the
office and another HDD with the same symptoms was given to me. Since I have
a Windows machine there, I was able to create this stupid diskette. I ran
PowerMax, and a quick look at its features showed me it had nothing to
offer that I hadn't tried yet. It just runs the SMART tests and fills the
drive with zeroes. Anyway, that feeling of mine made me run the low level
format. I was quite amazed when it finished successfully, more amazed when
the SMART tests succeeded afterwards, and even more amazed when I saw that
this low level format had been able to force the reallocation event: the
number of reallocated sectors had dropped to just 6, and the pending ones
to 0.

But hey, I had already tried filling the drive with zeroes. Why did I fail
previously? I remembered reading something about buffering, raw devices,
etc. in the smartmontools mailing list archives, and concluded this must be
the reason why my previous attempts weren't successful. So:

*******
* Is there a way to force the bad block reallocation event under Linux?
* Does the success of writing to an HDD depend on the driver's
* buffering settings?
*******

I just have the feeling this should work transparently, rather than having
to be forced by an "unbuffered low level format". The reason I'm asking is
that I have another drive, which did not show any such symptoms until I ran
the long SMART test on it. The test failed, and I noticed there is one
sector pending reallocation. The only way I currently see to fix it is to:

- migrate all the readable data to another drive
- low level format the complete drive
- put the data back.

Also note that this drive has just one pending sector, and its count of
already reallocated sectors is 0.

Additionally, it's quite strange that PowerMax says the drive should be
returned if these problems are correctable by a simple zeroing ..

Thanks for your attention.

Cheers,
Delian

From: Delian K. <sma...@kr...> - 2005-02-03 15:54:28

I hope you do not mind my bringing your message to the public list. Please
reply to the list in the future. I find open discussion more convenient.

---------- Forwarded Message ----------
Subject: Re: [smartmontools-support]Bad block reallocation not triggered automatically ?
Date: Thu, 3 Feb 2005 10:22:09 -0500 (EST)
From: Eric Praetzel <XXX@XXX>
To: sma...@kr... (Delian Krustev)

> I have a failing 60Gigs Maxtor hard drive. When I've tried to diagnose

Send it back for warranty repair or throw it out! I've had to deal with a
good 20 failing drives in the past few months. Maxtors and WDs suffer from
early failures. IBMs are reliable for a good 2.5 years and then they earn
the name DeathStar!

> what was wrong with it, I've found it had 266 already reallocated blocks,
> and 260 more pending for reallocation. I've found the howto on the

Heave that thing in the garbage or get a warranty replacement! Whenever
I've seen this, the drive has been well on its way to failure. Running
PowerMax should confirm that.

> I've decided the drive has physical problems at this points, and that it's
> "spare sectors pool" has been exhausted. At this point comes my first major

There is no "spare sectors pool". All sectors are available for use. As
they fail they get marked bad and avoided. You start with the maximum
number of sectors and go down from there.

Any drive which is losing sectors over time is on a slope to failure. I
rarely see a drive with 1 "pending" bad sector that continues to work and
doesn't fail the manufacturer's tests. Most fail quickly - an exception
being some older 15G Maxtors/Quantums I have in service.

> *******
> * Is there a way to force the bad block reallocation event under linux ?

    shutdown -F -r now

will force a "fsck", or file system check, at boot time on RedHat/Fedora.

> * Does the success of writing to a hdd dependend on the driver's
> * buffering settings ?

Yes and no. Generally no - but if you've misconfigured the settings then
lots can go wrong - usually hanging the machine quickly.

> is because I have another drive, which did not showed any such symptoms,
> until I've run the long smart test on it. The test failed, and I've

I don't put any drive into a server unless it's had a 3 day burn-in of
read-writes via the PowerMax software - I've just had too many WD / Maxtor
drives keel over on me within a month of install.

> - migrate all the readable data to another drive
> - low level format the complete drive
> - put the data back.

Don't trust the drive - don't trust the drive.

Modern drives are dirt cheap for their performance. What's been sacrificed
is reliability and a "soft" failure. Gone are the days when I'd watch a
drive slowly head towards failure over the space of a year. Now they drop
dead really, really fast. I've seen 2 drop half way through the MaxBlast
test - totally, electrically gone.

> Also note that this drive has just one pending sector, and it's
> allready reallocad sectors are 0.

I have 2 drives like that (out of 20 bad ones) which continue to work. I
get no messages about data corruption; fsck or Windoze never marks the
sector as bad, and so I use the drives as backups - non-critical use with
very low usage.

Your time is worth more than dealing with a suspect drive that is likely
on its way to failure.

> Additionally, it's quite strange powermax says that the drive should
> be returned if these problems are correctable by a simple zeroing ..

If PowerMax says that the drive has failed - it's failed. It may not be in
the grave yet - but it's on its way.

- Eric

From: Delian K. <sma...@kr...> - 2005-02-03 16:02:23

On Thursday 03 February 2005 17:22, you wrote:
> Send it back for warranty repair or throw it out!

Actually the technicians will do what I did, and return the drive to me as
repaired. After the steps I've described the drive is working perfectly
fine. I've run several thorough tests on it.

> Heave that thing in the garbage or get a warranty replacement!
> Whenever I've seen this the drive is well on it's way to failure.
> Running PowerMax should confirm that.

I plan to test this out. I've added the drive to an LVM LV containing a
filesystem with unimportant data - mostly large media files which I can
afford to lose.

> There is no "spare sectors pool". All sectors are available for use. As
> they fail they get marked bad and avoided. You start with the maximum
> number of sectors and go down from there.
>
> Any drive which is loosing sectors over time is on a slope to failure.

I think you're wrong here. There is such a pool. That is where the spare
sectors come from.

> shutdown -F -r now
> will force a "fsck" or file system check at boot time on RedHat/Fedora.

This error is undetectable by fsck without a badblocks test. The
problematic sectors lie somewhere within the data. By default, fsck does a
filesystem integrity check, not a data integrity check.

> I have 2 drives like that (out of 20 bad ones) which continue to work.
> I get no messages about data corruption; fsck or Windoze never marks
> the sector as bad and so I use the drives as backups - non critical use
> with a very low usage.

I guess you've never tried accessing the data on the problematic sectors
after the problem appeared. Try:

    dd if=/dev/hdX of=/dev/null

This should either trigger an error or update the number of sectors
pending reallocation.

> Your time is worth more than dealing with a suspect drive that is likely
> on it's way to failure.

Unfortunately no. Salaries vary around the globe.

> If PowerMax says that the drive has failed - it's failed. It may not
> be in the grave yet - but it's on it's way.

PowerMax is quite dubious in its statements. That's why I do not
ultimately trust it. I prefer to understand what's going on beneath ..

What I did not mention in my previous post is that I did not try the raw
device, since at first glance it seems unsupported under my current kernel
(2.6.10 with Debian patches).

Cheers,
Delian

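The dd read scan suggested in this message stops at the first unreadable block unless it is told otherwise. As a rough sketch of the same idea that keeps going and counts the failures, something like the following could be used. The function name scan_blocks() is made up for this illustration and is not part of smartmontools, and the sketch assumes an unreadable sector is reported to the caller as EIO:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read `path` sequentially in `block_size`-byte chunks, much like
 * "dd if=... of=/dev/null", but continue after a failed read.
 * Returns the number of unreadable blocks, or -1 if the file cannot be
 * opened, memory runs out, or a read fails with something other than EIO.
 * scan_blocks() is a hypothetical helper for illustration only.
 */
long scan_blocks(const char *path, size_t block_size)
{
    assert(block_size > 0);

    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    char *buf = malloc(block_size);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long bad = 0;
    off_t offset = 0;
    for (;;) {
        ssize_t n = pread(fd, buf, block_size, offset);
        if (n == 0)
            break;                      /* end of device or file */
        if (n == -1) {
            if (errno == EINTR)
                continue;               /* interrupted, retry same block */
            if (errno != EIO) {
                bad = -1;               /* not a media error, give up */
                break;
            }
            /* Assumed: EIO marks an unreadable block. Log it and skip. */
            fprintf(stderr, "read error at byte offset %lld: %s\n",
                    (long long)offset, strerror(errno));
            bad++;
            offset += (off_t)block_size;
            continue;
        }
        offset += n;                    /* advance by what was read */
    }

    free(buf);
    close(fd);
    return bad;
}
```

Called as scan_blocks("/dev/hdX", 4096), it would report how many 4 KiB blocks could not be read. Blocks already held in the kernel's cache may be served from memory rather than from the platter, so a scan right after boot is the more meaningful one.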
From: Volker K. <lis...@pa...> - 2005-02-03 18:14:31

> Actually the technicians will do what I did, and return the drive to me
> as repaired.

You can't be sure of that, but it's possible. I've had 2 80GB Samsung
drives fail within <1 year. The second one had one bad block used by a
file, with smartmontools finding another 1 or 2 unused bad blocks. It was
replaced under warranty overnight, and I did get a new drive (new sticker
with different numbers, manufacturing date, and firmware date).

> Powermax is quite dubious about's it's statements. That's why I do not
> ultimately trust it. I prefer to understand what's going on beneath ..

PowerMax is a tool supplied by the manufacturer. If this tool says the
drive has had it, you can safely trust this statement, and the
manufacturer will be forced to replace it under warranty - after all,
their own tool says so. If PowerMax says the drive's OK, or worse, isn't
totally sure about it, you have a bigger problem.

> What I did not mention in my previous post was that I did not try the
> raw device, since at first glance it seems unsupported under
> my current kernel(2.6.10 with debian patches).

This wouldn't really give you any more useful information anyway. dd is
good enough. Even better is smartctl -t short/long, in my limited
experience.

Volker

--
Volker Kuhlmann is possibly list0570 with the domain in header
http://volker.dnsalias.net/ Please do not CC list postings to me.

From: Delian K. <sma...@kr...> - 2005-02-03 18:46:17

On Thursday 03 February 2005 20:14, Volker Kuhlmann wrote:
> You can't be sure of that, but it's possible. I've had 2 80GB Samsung
> drives fail within <1 year, the second one had one bad block used by a
> file, with smartmontools finding another 1 or 2 unused bad blacks. It
> was replaced under warranty overnight, and I did get a new drive (new
> sticker with different numbers, manufacturing date, and firmware date).

Indeed, but the warranty period of my drive has expired.

> This wouldn't really give you any more useful information anyway. dd is
> good enough.

It could. If it works, I might be able to force the reallocation without
the data migration workaround.

Cheers,
Delian

From: Volker K. <lis...@pa...> - 2005-02-03 23:15:03

> Indeed, the warranty period of my drive has expired.

I have one of those too: 15GB with 318 reallocated sectors, not increasing
for the past 4 months regardless of what I do to it. It's in a raid 1 and
the box will be replaced this year, so I'm not too worried. So far I
couldn't prove any bad block I/O error or data loss.

Interestingly, both -t short and -t long never terminate on this disk.
Once I waited 6 hours before cancelling the test. The disk gets quite slow
and noisy too, which doesn't happen on an OK disk.

> > This wouldn't really give you any more useful information anyway. dd is
> > good enough.
>
> It could. I might be able to force the reallocation w/o the data migration
> workaround if it works.

The raw device bypasses the kernel's block buffer, i.e. you'd make sure
that dd actually reads the block from disk rather than from the block
buffer.

As I understand it, you can only force reallocation if the disk is aware
the block is bad and would like to move it, but can't because it doesn't
have the data. Writing to the block supplies the data and thus makes it
possible for the disk to reallocate.

Sometimes it seems a block may have trouble but the disk still doesn't
want to reallocate even if you write it. There's nothing you can do about
it, and writing the block makes it far less likely that the disk will
reallocate, as the surface is then newly magnetised at that location...

Volker

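To make the "write to the block so the disk can reallocate it" idea concrete, here is a minimal sketch that overwrites a single block with zeroes and waits for the kernel to hand it to the device. zero_block() is a hypothetical helper, not something from smartmontools. The sketch assumes that a write of a whole, aligned 4096-byte block needs no prior read of the old contents, whereas a smaller write through the block buffer may first read the surrounding data and so hit the unreadable sector. That is an educated guess about the buffering behaviour discussed in this thread, not something confirmed here. It destroys whatever was stored in that block:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite block number `block` (counted in units of `block_size` bytes)
 * of `path` with zeroes. Returns 0 on success, -1 on any failure.
 * zero_block() is a hypothetical helper for illustration only.
 */
int zero_block(const char *path, off_t block, size_t block_size)
{
    assert(block >= 0 && block_size > 0);

    char *buf = calloc(1, block_size);      /* zero-filled buffer */
    if (buf == NULL)
        return -1;

    /* O_SYNC: write calls return only once the data has been
     * transferred to the device, not merely queued in memory. */
    int fd = open(path, O_WRONLY | O_SYNC);
    if (fd == -1) {
        free(buf);
        return -1;
    }

    /* pwrite() writes at an explicit offset without a separate seek. */
    ssize_t n = pwrite(fd, buf, block_size, block * (off_t)block_size);
    int rc = (n == (ssize_t)block_size) ? 0 : -1;

    if (close(fd) == -1)
        rc = -1;
    free(buf);
    return rc;
}
```

For a sector number reported by a failed SMART self-test, the block index would be that sector's byte position divided by block_size. Whether the drive then reallocates the sector or simply rewrites it in place is up to its firmware.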
From: Delian K. <sma...@kr...> - 2005-02-04 12:56:27

On Friday 04 February 2005 01:14, Volker Kuhlmann wrote:
> I have one of those too, 15GB with 318 reallocated sectors, not
> increasing for the past 4 months regardless what I do to it. It's in a
> raid 1 and the box will be replaced this year, so I'm not too worried.
> So far I couldn't prove any bad block I/O error or data loss.

I think you should be worried. You're risking failure of the entire array
because of this drive.

> Interestingly, both -t short and -t long never terminate onthis disk,
> once I waited 6 hours before cancelling the test. The disk gets quite
> slow and noisy too, which doesn't happen on an ok disk.

Strange ..

> The raw device bypasses the kernel's block buffer, i.e. you'd make sure
> that dd actually reads the block from disk rather than from the block
> buffer.

I'm actually trying to make it write, not read. It has proven it's unable
to read this sector. Have you tested the raw device?

> As I understand, you can only force reallocation if the disk is
> aware the block is bad and would like to move it

It is aware. The sector is marked for reallocation.

> but can't because it
> doesn't have the data, and writing to the block, thus making it possible
> for the disk to reallocate.

That's why the write is necessary. I'm providing the data for the write,
so the disk has it. It checks where it has to write the data, sees that
this sector is marked as suspicious, and performs the reallocation.

> Sometimes it seems a block may have trouble
> but the disk still doesn't want to reallocate even if you write it.

This should happen only if there are no spare sectors.

> There's nothing you can do about it, and writing the block makes it far
> less likely that the disk will reallocate as the surface is then newly
> magnetised at that location...

Incorrect. The manufacturers claim otherwise.

Cheers,
Delian

From: Volker K. <lis...@pa...> - 2005-02-04 21:30:36

> I think You should be worried. You're risking the entire array to fail
> because of this drive.

Why? If one disk of a raid 1 fails, the other one will keep things
running. That's the idea. It's an old machine due to be replaced, and I
don't feel like sinking money into a new disk unless I really have to. So
far I have only had to disable the scheduled explicit -t short|long, as
they hang. For the last few months, the only thing which hasn't been
working is the built-in self-tests. I/O has been fine.

I used to be worried. And I won't be surprised if this disk self-destructs
tomorrow.

> I'm actually trying to make it write, not read. It has proven it's unable
> to read this sector.

Then you should overwrite the sector so it can be reallocated.

> Have You tested the raw device ?

Which device is that exactly? What for?

> > Sometimes it seems a block may have trouble
> > but the disk still doesn't want to reallocate even if you write it.
>
> This should happen only if there are no spare sectors.

You're assuming firmwares in disks are bug-free, complete with respect to
the functionality of the SMART specs, and actually implement SMART
according to the specs. You must be dreaming.

Volker

From: Delian K. <sma...@kr...> - 2005-02-07 14:20:55

On Friday 04 February 2005 23:29, Volker Kuhlmann wrote:
> Why? If one disk of a raid 1 fails, the other one will keep things
> running. That's the idea.

Because one of the drives is already failing.

> Then you should overwrite the sector so it can be reallocated.

This is what I've tried. And this is what my questions were about.

> Which device is that exactly? What for?

/dev/raw .. Anyway, you obviously haven't tested it ..

> You're assuming firmwares in disks are bug-free, complete with respect to
> the functionality of the SMART specs, and actually implements SMART
> according to the specs. You must be dreaming.

Yes, I'm assuming this. What makes you think otherwise?

Cheers,
Delian

From: Volker K. <lis...@pa...> - 2005-02-07 21:04:50

> > Why? If one disk of a raid 1 fails, the other one will keep things
> > running. That's the idea.
>
> Because one of the drives is already failing.

The disk says it has a few problems, but no matter what I tried, for the
past six months I've been unable to find any sector which gives trouble.
The fact that -t short/long no longer work could have something to do
with lousy firmware and a large number of reallocated blocks (318, on a
15GB disk with <2y of service). Sure, with an unlimited budget I'd of
course replace it...

> > You're assuming firmwares in disks are bug-free, complete with respect to
> > the functionality of the SMART specs, and actually implements SMART
> > according to the specs. You must be dreaming.
>
> Yes, I'm assuming this. What makes You think otherwise ?

* It's an invisible (to Joe Bloggs) feature.
* It's a consumer grade product.
* My own experience:
  o The Maxtor disk I have exhibits the Maxtor powerup time problem
    mentioned on the smartmontools page.
  o Having had 3 different firmwares of the same model disk, I can say
    that at least 2 behave differently.
* The large number of firmware bug workarounds implemented in
  smartmontools, which can be selected with -F. These include byte
  swapping, using different SMART registers/variables for the same
  thing, etc. Peruse the source ;)

Volker

From: Per J. <pe...@co...> - 2005-02-08 11:19:58

Volker Kuhlmann wrote:
> You're assuming firmwares in disks are bug-free, complete with respect to
> the functionality of the SMART specs, and actually implements SMART
> according to the specs. You must be dreaming.

Maxtor accepts a SMART log as sufficient documentation when claiming
warranty. The drive firmware is probably not bug-free, and often not
complete, but I would have thought what *is* implemented would be
according to the specs. Otherwise how can they rely on just SMART when
doing warranty returns?

(Their PowerMax (or whatever it is called) tool didn't run on Linux last
I checked, which means we can't use it to document anything.)

/Per Jessen, Zürich

From: Delian K. <sma...@kr...> - 2005-02-08 13:42:23

On Tuesday 08 February 2005 13:19, Per Jessen wrote:
> (their Powermax (or whatever it is called) tool didn't run on Linux last I
> checked, which means we can't use it to document anything. )

It does not run on Windows or any other OS either. It's a standalone boot
floppy disk with a DOS flavour. PowerMax displays an error code which you
should provide when requesting a warranty return. However, you're right if
you mean that the floppy disk creation program supplied by Maxtor is for
Windows only.

P.S. I've made some progress on the subject of this thread. However, I was
not able to post without breaking the threading: I was at home and the
previous messages were not in my mailbox. So: what is the mailman
equivalent of ezmlm's "smartmontools-support-get.123@..."? In other words,
how can I fetch a particular message or message range? If this is not
possible, how could one post in my situation without breaking the
threading?

Cheers,
Delian

From: Volker K. <lis...@pa...> - 2005-02-09 00:00:45

> Maxtor accepts a SMART-log as sufficient documentation when claiming
> warranty. The drive firmware is probably not bug-free, often not
> complete, but I would have thought what *is* implemented would be
> according to the specs. Otherwise how can they rely on just SMART when
> doing warranty returns?

They have a proprietary tool to query what the drive thinks of itself.
This tool and the drive's firmware are developed together and tested to
work together. There is no need to follow any specs to achieve this.

Volker

From: Delian K. <gm...@kr...> - 2005-02-08 22:42:42

Fortunately or not, the pending sector of my hard drive has been
reallocated. I don't know what triggered the reallocation.

I found this thread extremely interesting:

http://sourceforge.net/mailarchive/message.php?msg_id=7305120

With some further investigation I found that when doing:

    dd if=/dev/zero of=/dev/hdb

some reading was also going on on the disk (probably causing the write to
fail), as shown by:

    sar -d 1 0

I'm not sure what dd tries to read, and an strace does not show read
attempts. Anyway, this simple program:

-------------
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;
    int fd;

    memset(buf, 0, sizeof buf);         /* one 4096-byte block of zeroes */

    fd = open("/dev/hdb", O_WRONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Write full blocks until the first short or failed write. */
    do {
        n = write(fd, buf, sizeof buf);
    } while (n == (ssize_t)sizeof buf);

    if (n == -1) {                      /* ENOSPC at the end of the device */
        perror("write");
        close(fd);
        return 2;
    }
    printf("n=%zd\n", n);
    close(fd);
    return 0;
}
---------------

writes to the disk without any reading (at least sar, i.e.
/proc/diskstats, says so).

Although O_DIRECT is documented in open(2), it doesn't appear to be
present in the required headers on my system (I've found it in
asm/fcntl.h).

I just regret I didn't do these tests earlier, while I still had a pending
sector. Anyway, I hope this helps someone in the future. Any
success/failure reports are welcome.

Cheers,
Delian

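On the O_DIRECT remark: with glibc the flag is only exposed by <fcntl.h> when _GNU_SOURCE is defined before the first include, which would explain why it seemed to be missing. A rough, untested-on-real-hardware sketch of the zero-fill program using O_DIRECT follows. The function name zero_fill_direct(), the BLOCK constant and the max_blocks limit are made up for this example, and 4096 bytes is only an assumed alignment that happens to suit most setups:

```c
#define _GNU_SOURCE             /* makes <fcntl.h> define O_DIRECT on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK 4096              /* assumed alignment and transfer size */

/*
 * Write zero-filled blocks to `path` with O_DIRECT, bypassing the kernel's
 * cache. Writes at most `max_blocks` blocks; 0 means "until a write fails
 * or comes up short", which on a block device is the end of the device.
 * Returns the number of blocks written, or -1 if setup fails.
 * zero_fill_direct() is a hypothetical helper for illustration only.
 */
long zero_fill_direct(const char *path, long max_blocks)
{
    assert(max_blocks >= 0);

    /* O_DIRECT requires an aligned buffer, length and file offset. */
    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        return -1;
    memset(buf, 0, BLOCK);

    int fd = open(path, O_WRONLY | O_DIRECT);
    if (fd == -1) {             /* e.g. O_DIRECT unsupported here */
        free(buf);
        return -1;
    }

    long written = 0;
    while (max_blocks == 0 || written < max_blocks) {
        ssize_t n = write(fd, buf, BLOCK);
        if (n != BLOCK)
            break;              /* error, short write, or end of device */
        written++;
    }

    close(fd);
    free(buf);
    return written;
}
```

A call such as zero_fill_direct("/dev/hdb", 0) would wipe the whole disk, so the device name deserves a second look before running anything like this. Passing 0 for a regular file would keep writing until the filesystem is full. Some filesystems refuse O_DIRECT altogether, in which case the sketch returns -1 or writes nothing.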