From: Marin B. <li...@ol...> - 2018-05-22 19:56:34
> On 05/22/2018 02:36 PM, Gandalf Corvotempesta wrote:
> > On Tue, 22 May 2018 at 19:28, Marin Bernard <lists@olivarim.com>
> > wrote:
> > > So does Proxmox VE.
> >
> > Not all servers are using Proxmox.
> > Proxmox repackages ZFS on every release, because they support it.
> > If you have to maintain multiple different systems, using DKMS is
> > more error-prone than going without it. A small kernel upgrade
> > could break everything.

Yes. You may use Proxmox, Ubuntu, FreeBSD or even build your own
kernel.

> > > That's a myth. ZFS never required ECC RAM, and I run it on boxes
> > > with as little as 1 GB RAM. Every bit of it can be tuned,
> > > including the size of the ARC.
> >
> > It is not a myth; it is the truth. ECC RAM is not required to run
> > ZFS, but without it you cannot be sure that what you are writing to
> > disk (and checksumming) is exactly what you received.
> >
> > In other words, without ECC RAM you could experience in-memory data
> > corruption and then write corrupted data (with a valid checksum),
> > so ZFS will reply with corrupted data.
> >
> > ECC is not mandatory, but it is highly recommended.
> > Without ECC you will fix the bit-rot, but you are still subject to
> > in-memory corruption, so the original issue (data corruption)
> > remains unfixed: ZFS can do nothing if the data is corrupted before
> > it reaches ZFS.

Yes, I know that. However, you seemed to imply that ECC was a
requirement. I'm sorry if I misunderstood. Of course, ECC memory is a
must-have; I see no reason for not using it.

> > > Checksumming and duplication (ditto blocks) of pool metadata are
> > > NOT provided by the master. This is a much appreciated feature
> > > when you come from an XFS background, where a single
> > > unrecoverable read can crash an entire filesystem. I've been
> > > there before; never again!
> >
> > Which pool metadata are you referring to?

All of it. ZFS stores two or three copies of each metadata block
(depending on the type of metadata). Corrupted metadata blocks *will*
be corrected, even in single-disk setups.

> > Anyway, I hate XFS :-) I had multiple failures...
> >
> > > MooseFS background verification may take months to check the
> > > whole dataset.
> >
> > True.
> >
> > > ZFS does scrub a whole chunkserver within a few hours, with
> > > adaptive, tunable throughput to minimize the impact on the
> > > cluster.
> >
> > It is not the same.
> > When ZFS detects a corruption without RAID, it does nothing: it
> > simply discards the data during a read. But if you are reading a
> > file, MooseFS will check the checksum automatically and do the
> > same.

Actually, ZFS keeps a list of damaged files. So in case of damaged
blocks, you may:

* Stop the chunkserver
* List and remove the damaged chunk files
* Restart the chunkserver

The mfschunkserver daemon will rescan its chunk files, and the master
will soon notice that a chunk is missing and trigger a replication.
This is easy to automate with a simple script, as sketched below.
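For illustration, a minimal sketch of what such a script could look
like (untested; the pool name, data directory, service name and
chunk-file naming pattern below are assumptions to adapt to your own
setup):

    #!/usr/bin/env python3
    # Sketch only: drop chunk files that ZFS reports as permanently
    # damaged, so the MooseFS master re-replicates them from the
    # remaining copies. Pool name, data directory, service name and
    # chunk-file pattern are assumptions -- adapt them to your setup.
    import os
    import re
    import subprocess

    POOL = "tank"                     # assumed ZFS pool backing the chunkserver
    DATA_DIR = "/var/lib/mfs"         # assumed chunkserver data directory
    SERVICE = "moosefs-chunkserver"   # service name may differ on your distro

    def damaged_files(pool):
        # 'zpool status -v' lists files with permanent errors at the
        # end of its output; keep only the absolute paths.
        out = subprocess.run(["zpool", "status", "-v", pool],
                             capture_output=True, text=True,
                             check=True).stdout
        in_list = False
        paths = []
        for line in out.splitlines():
            if "Permanent errors have been detected" in line:
                in_list = True
                continue
            if in_list and line.strip().startswith("/"):
                paths.append(line.strip())
        return paths

    def main():
        bad = [p for p in damaged_files(POOL)
               if p.startswith(DATA_DIR) and re.search(r"chunk_.*\.mfs$", p)]
        if not bad:
            return
        subprocess.run(["systemctl", "stop", SERVICE], check=True)
        for path in bad:
            os.remove(path)   # or move the files aside for later analysis
        subprocess.run(["systemctl", "start", SERVICE], check=True)
        # On restart, mfschunkserver rescans its chunks; the master
        # notices the missing ones and schedules replication.

    if __name__ == "__main__":
        main()

On a real cluster I would of course dry-run this first and quarantine
the files instead of deleting them outright.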
> Assuming that you have a minimum of 2 copies in MooseFS, it will
> read, detect, and read from the second copy, and will heal the first
> copy. So I don't know what you mean exactly by "does the same", but
> it is not the *same*.
>
> > Anyway, even if you scrub the whole ZFS pool, you won't get any
> > advantage: ZFS is unable to recover by itself (without RAID), and
> > MooseFS is still unaware of the corruption.
>
> MooseFS will be *aware* of the corruption during the read and will
> self-heal, as I explained above. (Or during the checksum checking
> (native scrub) loop, whichever comes first.)
>
> > Ok, chunk1 is corrupted, ZFS detected it during a scrub. And now?
> > ZFS doesn't have any replica to rebuild from.
> > MooseFS is unaware of this because its native scrub takes months
> > and no one is reading that file from a client (forcing the
> > checksum verification).
>
> You seem to be making these constant claims about "native scrub
> taking months", but I believe it was explained in earlier emails
> that this will depend on your hardware configuration.

AFAIK, you can't scrub faster than 1 chunk/sec per chunkserver. If
you own 12 servers, they'll do 12 chunks/sec = 720 chunks/min =
43,200 chunks/hour = 1,036,800 chunks/day. If you have 50,000,000
chunks, it would take roughly 50 days to have them all checked at
this rate, which would probably bring the cluster to its knees. If
you scan at a more reasonable rate of one chunk every 3 seconds per
chunkserver, it rises to roughly 150 days (see the short calculation
sketch at the end of this message). So that's not a claim; that's a
fact.

> I believe there was another email which basically said this "native
> scrub speed" was much improved in version 4.
> So I think it is fair to say that you should stop repeating this
> "native scrub takes months" claim, or if you are not going to stop
> repeating it, at least put some qualifiers around it.
> Or download v4, and see if the speed improved...

I do know that v4 improves on this point, but it is not yet
production-ready. I won't be mentioning it until it is released.
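For reference, the back-of-the-envelope arithmetic behind the
scrub-duration figures above, as a small sketch (the chunk count,
server count and per-server rates are the assumptions stated earlier):

    # Rough scrub-duration estimate from the figures quoted above.
    CHUNKS = 50_000_000    # assumed total number of chunks in the cluster
    SERVERS = 12           # assumed number of chunkservers

    def scrub_days(chunks_per_sec_per_server):
        # Days for the whole cluster to test every chunk once.
        cluster_rate = chunks_per_sec_per_server * SERVERS  # chunks/second
        return CHUNKS / (cluster_rate * 86_400)             # 86,400 s per day

    print(round(scrub_days(1.0)))    # ~48 days at the 1 chunk/s per server ceiling
    print(round(scrub_days(1 / 3)))  # ~145 days at one chunk every 3 s per server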