From: Bruce A. <ba...@gr...> - 2008-03-21 16:44:46
Hi Clem,

>> Out of curiosity, what are the 40 devices and how are they
>> structured/accessed?
>
> They are (mostly) 750GB SATA disks in external 12 slot SAS chassis
> connected via LSI Logic SAS host adaptors. Each set of 12 is arranged as
> a software raid5 array using md driver. Right now we have 36, soon to
> increase to 48 and in the future possibly 100's...
>
> With so many disks and only raid5 protection I really want to get warned
> that a disk is getting ill before it fails.

I would be very cautious about running RAID-5. See for example this thread

http://www.beowulf.org/archive/2007-August/019062.html

and the references therein. You will need to scrub the disks regularly to
find uncorrectable blocks and to reconstruct the correct data for them.

I suggest you check with LSI Logic, and make sure that their controller
has a 'scrub and verify' function that reads the entire disk surface. You
should run this at least twice per week.

You can also test this using a new feature of hdparm which creates
uncorrectable sectors:

http://lwn.net/Articles/269721/

Here's the procedure that we use to test our RAID-6 controllers:

a) Power down storage server
b) Remove disk from storage server, put into 2nd SATA slot on compute node
c) Corrupt sector 1234567 of the disk with hdparm to make it uncorrectable
d) Put disk back into storage server and power up storage server
e) Run the verify/scrub command. See if it correctly rewrites the UNC
   sector on the disk drive with the correct data.
f) Repeat steps a-d, then simply read the entire device. See if the
   controller correctly identifies and corrects the UNC sector on the
   disk drive.

I suggest that you run some similar tests, if you have not already done so.

Cheers,
Bruce
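
As a rough illustration of steps (c) and (e) on a Linux md software array
(not the RAID-6 hardware controllers tested above), a minimal Python sketch
might look like the following. It assumes an hdparm build that supports
--make-bad-sector and --read-sector, and it uses the md driver's sysfs files
sync_action and mismatch_cnt to start a scrub and read the result. The
function names, the device path /dev/sdX and the array name md0 are
hypothetical placeholders. Nothing is executed unless dry_run is turned off,
because step (c) deliberately destroys the contents of one sector.

```python
import subprocess
from pathlib import Path

SECTOR = 1234567  # sector number used in step (c) of the procedure


def make_bad_sector_cmd(device, sector=SECTOR):
    """Build the hdparm command that marks one sector uncorrectable (step c)."""
    return ["hdparm", "--yes-i-know-what-i-am-doing",
            "--make-bad-sector", str(sector), device]


def read_sector_cmd(device, sector=SECTOR):
    """Build the hdparm command that reads the sector back, to confirm its state."""
    return ["hdparm", "--read-sector", str(sector), device]


def start_md_check(md_name, sysfs_root="/sys"):
    """Start an md scrub (step e) by writing 'check' to the array's sync_action."""
    action = Path(sysfs_root) / "block" / md_name / "md" / "sync_action"
    action.write_text("check\n")
    return action


def md_mismatch_count(md_name, sysfs_root="/sys"):
    """Return the mismatch counter that md reports for the array."""
    counter = Path(sysfs_root) / "block" / md_name / "md" / "mismatch_cnt"
    return int(counter.read_text().strip())


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False (needs root)."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical device name; replace it only on a disk you can afford to damage.
    run(make_bad_sector_cmd("/dev/sdX"))
    run(read_sector_cmd("/dev/sdX"))
```

After the disk is back in the array, start_md_check("md0") would request the
scrub, and running read_sector_cmd again on the same disk should then show
whether the sector was rewritten with good data.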