From: R.C. <mil...@gm...> - 2017-06-17 06:35:19
Hi all,

I want to share a recent experience with you.

About two days ago we had an unexpected power loss. The master and 1 chunk-server shut down properly, but the remaining 2 chunk-servers had a degraded UPS battery that gave out too early...

After power was restored the metadata was healthy (as expected), but 1 data chunk was lost. 1 file was affected and, even with mfsfilerepair, I had no chance of restoring it.

At first I thought it was due to the write cache on the chunk disks: if that specific chunk was being written to the servers with the faulty UPS, the write cache was most likely still holding the data, which was then lost for good. Unfortunately, a quick check of the disk configuration ruled that out, since write cache is disabled on those two servers. Moreover, no battery-backed RAID controllers are used for the chunk disks. I'm quite sure write barriers are enabled by default on CentOS (the only distro we use here).

How can I reduce the chance of running into this problem again? (apart from changing the UPS batteries... :-)

The goal is currently set to 2. Should I increase it to 3?

Our system:
Server1: master+chunk (2 dedicated HDs - XFS filesystem - write cache enabled)
Server2: metalog+chunk (2 dedicated HDs - XFS filesystem - write cache disabled)
Server3: metalog+chunk (2 dedicated HDs - XFS filesystem - write cache disabled)
Server4: metalog+chunk (2 dedicated HDs - XFS filesystem - write cache disabled)

Thanks for reading.

Bye
Raffaello