From: Marin B. <li...@ol...> - 2018-05-21 07:54:44
> > We have experienced all sorts of disasters, crashes, bad drives, etc
> > and were always able to recover using a metalogger or other backups
> > with no data loss (except on-the-fly data).
>
> I'm really interested in this.
> How does MFS react to disk failures (or a disk still working but with
> some UREs)?
> Is it safe to use MFS without any RAID, as suggested on the official
> site?

Disks with I/O errors are automatically marked as damaged. You may
configure the threshold in mfschunkserver.cfg. It is perfectly safe (and
recommended) to use MFS on JBOD. A chunkserver will *never* send bad
data, because it always checks the checksum of a chunk before returning
it. Bad chunks are simply dropped.

> > Note: The IMAP email storage is a funny use case. It works really
> > well, but it really balloons storage space because of the small
> > files. Plan for as much as 5x-7x the needed capacity.
>
> Why? 64MB chunks should be useless in email hosting.
> If a file is smaller than 64MB, the chunk will get the real file size.
> Why should we plan for 5x-7x the needed capacity?

Yes, but the minimal sector size in MooseFS is 64k, and a chunk cannot
be smaller than that.
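As a rough illustration of why small mail files balloon usage, here is a
back-of-the-envelope sketch. It assumes the 64k minimum allocation per
chunk copy described above and ignores per-chunk header overhead, so
treat the numbers as an approximation only:

# 1,000,000 mail files averaging 8 KiB each, stored with goal=2
echo "$(( 1000000 * 64 * 2 / 1024 / 1024 )) GiB allocated"  # ~122 GiB on disk
echo "$(( 1000000 * 8 * 2 / 1024 / 1024 )) GiB of payload"  # ~15 GiB of real data

That is roughly an 8x blow-up, in the same ballpark as the 5x-7x figure
quoted above.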
From: Gandalf C. <gan...@gm...> - 2018-05-21 07:46:06
On Sun, 20 May 2018 at 23:58, WK <wk...@bn...> wrote:

> Early on in our MFS history, we *did* have issues with VMs when they
> were under heavy i/o load AND the MFS cluster was busy doing
> rebalancing. They would go read-only and/or lose a chunk, requiring an
> fsck to recover.

Did you have time to figure this out? Why did this happen? Was it due to
a bug in the older version of MooseFS, or something else?

> At the time we stopped using MFS for VM images and purposed MFS solely
> for Email, NAS-type File Server loads and Archive backups.

Email will be one of our primary use cases.

> Since 3.x we have begun to resume using MFS for "some" VM images with
> no failures, but we are still a little skittish and reserve that for
> 'Cloud-Native' installs where there are other VM copies on other
> hosts/storage, just in case something bad happens.

Why? Did you have any more issues with 3.x?

> We have experienced all sorts of disasters, crashes, bad drives, etc
> and were always able to recover using a metalogger or other backups
> with no data loss (except on-the-fly data).

I'm really interested in this.
How does MFS react to disk failures (or a disk still working but with
some UREs)?
Is it safe to use MFS without any RAID, as suggested on the official
site?

> a) a Tech brought up a chunkserver with the same IP as another
> chunkserver. Not a good result, as it swiss-cheesed the chunkserver
> data on any file that was active during the period.

What happens in this case? Will MFS start to send bad data coming from
the chunkserver with the bad IP?

> Note: The IMAP email storage is a funny use case. It works really
> well, but it really balloons storage space because of the small files.
> Plan for as much as 5x-7x the needed capacity.

Why? 64MB chunks should be useless in email hosting.
If a file is smaller than 64MB, the chunk will get the real file size.
Why should we plan for 5x-7x the needed capacity?
For example, a 2MB file is saved as a 2MB file (plus a little overhead
due to the MFS header).
From: WK <wk...@bn...> - 2018-05-20 21:58:27
We have used MooseFS since the 1.6.x days, so 5+ years. For archive and
email workloads MFS worked great and was reliable. We tend to build
multiple small MFS clusters of 4-6 chunkservers rather than one huge
"everything in one basket" solution.

Early on in our MFS history, we *did* have issues with VMs when they
were under heavy i/o load AND the MFS cluster was busy doing
rebalancing. They would go read-only and/or lose a chunk, requiring an
fsck to recover. That was especially true with sparse VM images that
were being expanded due to a large amount of inbound data.

At the time we stopped using MFS for VM images and purposed MFS solely
for Email, NAS-type File Server loads and Archive backups.

Since 3.x we have begun to resume using MFS for "some" VM images with no
failures, but we are still a little skittish and reserve that for
'Cloud-Native' installs where there are other VM copies on other
hosts/storage, just in case something bad happens.

We have experienced all sorts of disasters, crashes, bad drives, etc and
were always able to recover using a metalogger or other backups with no
data loss (except on-the-fly data). The sole exceptions were two
incidents which were entirely our fault and on the older 1.6.x code.

a) A tech brought up a chunkserver with the same IP as another
chunkserver. Not a good result, as it swiss-cheesed the chunkserver data
on any file that was active during the period.

b) An mfsmaster was using a small SSD which filled up, and the 'warning
checks' weren't fully in place. It remained that way for weeks, because
MFS is never anything you ever worry about. Because the master had been
unable to write out the changelogs, the metadata built up in memory and
eventually crashed the box. The metalogger data on the other boxes was
corrupt. So we pretty much lost all the data that had been written after
the drive filled up.

So we are very, very happy with MFS, but we don't regard it as the only
storage solution in our setup.

Note: The IMAP email storage is a funny use case. It works really well,
but it really balloons storage space because of the small files. Plan
for as much as 5x-7x the needed capacity.

On 5/20/2018 9:42 AM, Gandalf Corvotempesta wrote:
> Hi to all,
> I hope that responses to this post will be honest even if we are on an
> official ML.
>
> Anyone using MooseFS in production (for many years) and experienced
> data loss or data corruption of any kind, even after hard crashes,
> disk failures, node failures and so on?
>
> How does Moose react to failures?
> Is it stable enough to be used in production with critical data and VM
> images?
From: Marin B. <li...@ol...> - 2018-05-20 21:11:15
> On Sun, 20 May 2018 at 22:09, Marin Bernard <li...@ol...> wrote:
> > The chunkserver acknowledges the write while the data are still
> > pending commit to disk. If the server dies meanwhile, the data are
> > lost.
>
> Even if the client asked for O_DIRECT or fsync explicitly?
> If yes, this would break POSIX compatibility, I think.

No, not necessarily. It depends on what you mean by 'the client'. The
POSIX interface is implemented by mfsmount, not the chunkserver. The
client is just another process on the same machine. mfsmount may comply
with POSIX and let chunkservers deal with the data in their own way in
the background. The client process would know nothing about it, as it
only speaks to mfsmount.

From a POSIX perspective, the chunkserver is like a physical disk. POSIX
does not specify how physical disks should work internally. If you
O_DIRECT a file stored on a consumer hard drive and fill it with data,
chances are that you'll experience some kind of data loss if you unplug
the box in the middle of a write: the content of the write cache, which
was acknowledged but not committed, would have vanished. POSIX can't do
anything about it.

Since MooseFS extends over both the kernel and device layers, it has the
opportunity to do better than POSIX, and break the tiering by leaking
useful data from the mount processes to the chunkservers. I suppose this
is why fsync operations are cascaded from mfsmount to the chunkservers.
I do not know if this is the same with O_DIRECT, though.

> > However, if goal is >= 2 (as it should always be), at least one more
> > copy of the data must already be present on another chunkserver
> > before the acknowledgment is sent.
>
> Not really. If goal >= 2, the ack is sent when another chunkserver has
> committed to the cache, so you are acknowledging when all goal copies
> are written to the cache, not to the disk.

Absolutely.

> This would be ok in normal conditions (like writes made by any
> software), but if the client is asking for O_DIRECT, the
> acknowledgment *must* be sent *after* the data is stored on disk.

This is what you would get with the ``mfscachemode=DIRECT`` mount
option, which bypasses the cache completely, at least on the client
side. Yet, I don't know whether mfsmount is able to enforce O_DIRECT on
a per-file basis or if the setting must apply to the whole mountpoint.
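For reference, a minimal sketch of how the cache mode mentioned above
can be applied on a client; the mountpoint and master hostname are
placeholders, and the same option can also go into mfsmount.cfg
(typically under /etc/mfs/):

# bypass the client-side cache for this mountpoint
mfsmount /mnt/mfs -H mfsmaster -o mfscachemode=DIRECT

Whether this gives true per-file O_DIRECT semantics is exactly the open
question above; as written, the option applies to the whole mountpoint.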
From: Marin B. <li...@ol...> - 2018-05-20 20:23:40
> Based on the official site, the suggested configuration is:
>
> - JBOD (no RAID) for each chunkserver.
> - XFS on each chunkserver.
>
> Is ZFS not suggested? Any drawbacks?
> Anyone using ZFS in production?

I do use ZFS in production, and I know from reading this mailing list
that I am not the only one doing so. My chunkservers had been running
FreeBSD + ZFS for several years. I migrated to Debian + ZoL a few months
ago.
From: Gandalf C. <gan...@gm...> - 2018-05-20 20:22:05
On Sun, 20 May 2018 at 22:09, Marin Bernard <li...@ol...> wrote:

> The chunkserver acknowledges the write while the data are still
> pending commit to disk. If the server dies meanwhile, the data are
> lost.

Even if the client asked for O_DIRECT or fsync explicitly?
If yes, this would break POSIX compatibility, I think.

> However, if goal is >= 2 (as it should always be), at least one more
> copy of the data must already be present on another chunkserver before
> the acknowledgment is sent.

Not really. If goal >= 2, the ack is sent when another chunkserver has
committed to the cache, so you are acknowledging when all goal copies
are written to the cache, not to the disk.

This would be ok in normal conditions (like writes made by any
software), but if the client is asking for O_DIRECT, the acknowledgment
*must* be sent *after* the data is stored on disk.
From: Marin B. <li...@ol...> - 2018-05-20 20:09:34
> > What is certain, however, is that the performance cost of fsync with
> > ZFS is far higher than with another FS. In standard async mode, ZFS
> > delays and batches disk writes to minimize IO latency. fsync defeats
> > this optimization without disabling it, and as a result performance
> > drops quickly.
>
> This is obvious, but the question is not whether to enable fsync or
> not. The real question is: what happens when HDD_FSYNC_BEFORE_CLOSE is
> set to 0?

The chunkserver acknowledges the write while the data are still pending
commit to disk. If the server dies meanwhile, the data are lost.
However, if goal is >= 2 (as it should always be), at least one more
copy of the data must already be present on another chunkserver before
the acknowledgment is sent.

> What if the client opens a file with FSYNC set? Will Moose honor this
> and then send that to the underlying storage regardless of the value
> of HDD_FSYNC_BEFORE_CLOSE?
> In other words: is HDD_FSYNC_BEFORE_CLOSE an additional measure to
> force FSYNC (on file close) even if the client didn't ask for it?

In my understanding yes, it is a way to enforce fsync at the chunkserver
level, regardless of what the client asked. This would apply to all
writes on a specific chunkserver. However, it does not enforce fsync on
the client side.

> Or is HDD_FSYNC_BEFORE_CLOSE the only way to issue FSYNC in MooseFS?

I don't think so. It seems that mfsmount has an undocumented option
``mfsfsyncbeforeclose``, which controls fsync before close for a given
mountpoint. See:
https://github.com/moosefs/moosefs/blob/138e149431b47b363de0d32e39629e4036d1cb00/mfsmount/main.c#L295

This option appeared in MooseFS 3.0.5-1 to allow a user to disable fsync
before close. According to the commit message, fsync was enabled and
mandatory before that change. I suppose the commit also disabled fsync
by default in mfsmount, because a few months later, MooseFS 3.0.51-1
reverted it to enabled by default.

The ``mfsfsyncmintime`` option, which is documented, seems to be the
official way to deal with this setting. The man page states that
``mfsfsyncmintime=0`` has the same effect as ``mfsfsyncbeforeclose=1``.
So setting ``mfsfsyncmintime`` to 0 would force fsync before close
regardless of what the client asked, and of the age of the file
descriptor.

MooseFS being distributed, a single fsync call on a client-side fd would
ideally result in cache flushes both on the client and on the
chunkservers. These operations must be orchestrated in a way that makes
them both reliable and lightweight, so I'm sure the client was designed
to be adaptive enough to find the best compromise in various scenarios.
For instance, I would not be surprised to learn that the way fsync is
managed depends on write cache or file locking settings.

This is a highly technical discussion and I'm mostly making blind
suppositions here. The final word may only come from a member of the dev
team.

Marin.
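Putting the settings discussed above together, a sketch of a fully
synchronous setup might look like the following. The option names come
from this thread; defaults and file locations may differ between
versions, so check the man pages for your release:

# /etc/mfs/mfschunkserver.cfg -- fsync chunk files before close on the
# chunkserver side
HDD_FSYNC_BEFORE_CLOSE = 1

# client side -- force fsync-before-close for every file, regardless of
# how long the file descriptor was open
mfsmount /mnt/mfs -H mfsmaster -o mfsfsyncmintime=0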
From: Gandalf C. <gan...@gm...> - 2018-05-20 19:59:47
Based on the official site, the suggested configuration is:

- JBOD (no RAID) for each chunkserver.
- XFS on each chunkserver.

Is ZFS not suggested? Any drawbacks?
Anyone using ZFS in production?
From: Gandalf C. <gan...@gm...> - 2018-05-20 17:59:38
On Sun, 20 May 2018 at 11:33, Jakub Kruszona-Zawadzki <ac...@mo...>
wrote:

> HA in MFS 4.x works fully automatically. You just need to define a
> group of IP numbers in your DNS for your "mfsmaster" name. Then
> install master servers and run them on machines with those IP numbers
> and that's it. One of them will become the "LEADER" of the group and
> the others "FOLLOWER"s of the group. If for any reason the "LEADER"
> goes down then one of the FOLLOWERs is chosen as an "ELECT" and when
> more than half of the chunkservers connect to it then it automatically
> switches to the "LEADER" state and starts working as the main master
> server. When your previous leader becomes available again then it
> usually joins the group as a "FOLLOWER". This is rather a "fire and
> forget" solution. They will synchronise metadata between them
> automatically, chunkservers and clients will reconnect to the current
> leader also automatically, etc.

Just thinking about this. So, in MooseFS 4.0, if RAM is sufficient on
all servers, placing an mfsmaster on each chunkserver should increase
reliability (every server can become a leader in case of failure) and
the metalogger would be useless.

Which protocol is used for this? Raft?
From: Gandalf C. <gan...@gm...> - 2018-05-20 16:48:37
On Sun, 20 May 2018 at 17:13, Marin Bernard <li...@ol...> wrote:

> What is certain, however, is that the performance cost of fsync with
> ZFS is far higher than with another FS. In standard async mode, ZFS
> delays and batches disk writes to minimize IO latency. fsync defeats
> this optimization without disabling it, and as a result performance
> drops quickly.

This is obvious, but the question is not whether to enable fsync or not.
The real question is: what happens when HDD_FSYNC_BEFORE_CLOSE is set to
0?

What if the client opens a file with FSYNC set? Will Moose honor this
and then send that to the underlying storage regardless of the value of
HDD_FSYNC_BEFORE_CLOSE?

In other words: is HDD_FSYNC_BEFORE_CLOSE an additional measure to force
FSYNC (on file close) even if the client didn't ask for it? Or is
HDD_FSYNC_BEFORE_CLOSE the only way to issue FSYNC in MooseFS?
From: Gandalf C. <gan...@gm...> - 2018-05-20 16:42:54
Hi to all,

I hope that responses to this post will be honest even if we are on an
official ML.

Anyone using MooseFS in production (for many years) and experienced data
loss or data corruption of any kind, even after hard crashes, disk
failures, node failures and so on?

How does Moose react to failures?
Is it stable enough to be used in production with critical data and VM
images?
From: Marin B. <li...@ol...> - 2018-05-20 15:13:20
> On 20.05.2018 15:38, Marin Bernard wrote:
> > I may have some answers.
> >
> > > There were some posts saying that MooseFS doesn't support fsync at
> > > all; is this fixed? Can we force the use of fsync to be sure that
> > > an ACK is sent if and only if all data is really written to disk?
> >
> > MooseFS chunkservers have the HDD_FSYNC_BEFORE_CLOSE config setting
> > (default off) which does just that. The equivalent setting on
> > LizardFS is PERFORM_FSYNC (default on). The performance penalty with
> > PERFORM_FSYNC=1 is very high, though.
>
> Funny thing, just this morning I did some testing with this flag on
> and off.
>
> Basically, I was running with fsync on ever since, because it just
> felt like the right thing to do. And it's the only change I ever made
> to mfschunkserver.cfg, all other settings being at their default
> values. But just yesterday I suddenly wondered if it would make any
> difference for my case, where the chunkservers are rather far away
> from the mount point (about 10-15 ms away).
>
> So, I did some basic tests, writing a large file, and untarring the
> Linux kernel source (lots of small files) with the fsync setting
> on/off.
>
> Much as expected, there was no difference at all for large files
> (network interface saturated anyway). In the case of very small files
> (Linux kernel source), performance per chunkserver improved from about
> 350 writes/sec to about 420 writes/sec. IOW, depending on your
> workload, it seems that turning HDD_FSYNC_BEFORE_CLOSE on could result
> in 0-15% performance degradation. Of course, depending on your
> infrastructure (network, disks...), the results may vary.
>
> I eventually decided it's not a big price to pay, and I'm dealing
> mostly with bigger files anyway, so I turned the setting back on and
> don't intend to do any more testing.

I saw a much more important difference with LizardFS on a ZFS backend,
even with large files: roughly 50-60% faster without fsync. I never ran
the test on MooseFS, so it's hard to tell the difference. You gave me
the idea to try it, though!

It is possible that LizardFS performs slower than MooseFS when fsync is
enabled. After all, their implementations have had enough time to drift.
However, such a perf gap between both solutions seems unlikely to me.

What is certain, however, is that the performance cost of fsync with ZFS
is far higher than with another FS. In standard async mode, ZFS delays
and batches disk writes to minimize IO latency. fsync defeats this
optimization without disabling it, and as a result performance drops
quickly.

I could try to set sync=always on the dataset, leave fsync off at the
chunkserver level, and see what happens. This would offer the same
guarantee as fsync=on while letting the filesystem handle synchronous
writes by itself (and trigger its own optimization strategies, maybe not
using batched transactions at all with synchronous writes).

Thanks for sharing your figures with us!

Marin.
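A minimal sketch of the experiment proposed above, assuming the chunk
directory sits on its own ZFS dataset (the pool/dataset name is a
placeholder):

# force synchronous semantics at the ZFS layer...
zfs set sync=always tank/mfs-chunks

# ...and leave the chunkserver's own fsync off in mfschunkserver.cfg:
# HDD_FSYNC_BEFORE_CLOSE = 0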
From: Marin B. <li...@ol...> - 2018-05-20 14:36:28
> Could you please give a real example?
> Let's assume a replica 4, with 2 SSD chunkservers and 2 HDD
> chunkservers. I would like the ACK to be returned to the client after
> the first 2 SSD servers have written the data, while the other 2 (up
> to a goal of 4) are still writing.

The MooseFS team detailed this extensively in the storage class manual
(see https://moosefs.com/Content/Downloads/moosefs-storage-classes-manual.pdf,
especially chapter 4 with common use scenarios).

Let:
- SSD chunkservers be labeled 'A'
- HDD chunkservers be labeled 'B'

You may create the following storage class:

mfsscadmin create -C2A -K2A,2B tiered4

The class is named 'tiered4'. It stores 2 copies on SSD at chunk
creation (-C2A), and adds 2 more on HDD asynchronously (-K2A,2B).

Then, you need to assign this storage class to a directory with:

mfssetsclass tiered4 <directory>

Furthermore, you may also configure your clients to prefer SSD
chunkservers for R/W operations by adding the following line to
mfsmount.cfg or passing it as a mount option:

mfspreflabels=A

This would make your clients prefer SSD copies over HDD ones to retrieve
or modify chunks.

Hope it helps you,

Marin.
From: Zlatko Č. <zca...@bi...> - 2018-05-20 14:10:44
On 20.05.2018 15:38, Marin Bernard wrote:
> I may have some answers.
>
> > There were some posts saying that MooseFS doesn't support fsync at
> > all; is this fixed? Can we force the use of fsync to be sure that an
> > ACK is sent if and only if all data is really written to disk?
>
> MooseFS chunkservers have the HDD_FSYNC_BEFORE_CLOSE config setting
> (default off) which does just that. The equivalent setting on LizardFS
> is PERFORM_FSYNC (default on). The performance penalty with
> PERFORM_FSYNC=1 is very high, though.

Funny thing, just this morning I did some testing with this flag on and
off.

Basically, I was running with fsync *on* ever since, because it just
felt like the right thing to do. And it's the only change I ever made to
mfschunkserver.cfg, all other settings being at their default values.
But just yesterday I suddenly wondered if it would make any difference
for my case, where the chunkservers are rather far away from the mount
point (about 10-15 ms away).

So, I did some basic tests, writing a large file, and untarring the
Linux kernel source (lots of small files) with the fsync setting on/off.

Much as expected, there was no difference at all for large files
(network interface saturated anyway). In the case of very small files
(Linux kernel source), performance per chunkserver improved from about
350 writes/sec to about 420 writes/sec. IOW, depending on your workload,
it seems that turning HDD_FSYNC_BEFORE_CLOSE on could result in 0-15%
performance degradation. Of course, depending on your infrastructure
(network, disks...), the results may vary.

I eventually decided it's not a big price to pay, and I'm dealing mostly
with bigger files anyway, so I turned the setting back on and don't
intend to do any more testing.

--
Zlatko
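For anyone wanting to reproduce a similar comparison, a minimal sketch
of the small-file test described above; paths and kernel version are
placeholders, and the chunkservers need a reload or restart after
toggling the setting in mfschunkserver.cfg:

# untar a Linux kernel source tree onto the MooseFS mount and time it,
# once with HDD_FSYNC_BEFORE_CLOSE = 0 and once with 1
time tar xJf linux-4.16.tar.xz -C /mnt/mfs/fsync-test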
From: Gandalf C. <gan...@gm...> - 2018-05-20 14:00:50
On Sun, 20 May 2018 at 15:56, Zlatko Čalušić <zca...@bi...> wrote:

> Even if LizardFS currently has a feature or two that are missing in
> current MooseFS 3.0, it seems that MooseFS 4.0 will be a clear winner,
> and definitely worth the wait!

Totally agree. Probably the real missing thing, IMHO, is the qemu
driver.
From: Zlatko Č. <zca...@bi...> - 2018-05-20 13:56:17
On 20.05.2018 14:53, Marin Bernard wrote:
> > This is very interesting.
> > Any official and detailed docs about the HA feature?
> >
> > Other than this, without starting a flame war, which are the
> > differences between MFS4 and LizardFS?
>
> Hi again,
>
> I've been testing both MooseFS 3.0.x and LizardFS 3.1x in parallel for
> a few weeks now. Here are the main differences I found while using
> them. I think most of them will still be relevant with MooseFS 4.0.
>
> [snip]

Hey Marin!

Thank you for sharing your experience with the competing product, much
appreciated. Being very happy with MooseFS so far, I never had enough
motivation to test its fork. So your detailed explanation of the
differences perfectly satisfies my curiosity. ;)

Even if LizardFS currently has a feature or two that are missing in
current MooseFS 3.0, it seems that MooseFS 4.0 will be a clear winner,
and definitely worth the wait!

Regards,

--
Zlatko
From: Zlatko Č. <zca...@bi...> - 2018-05-20 13:56:17
On 20.05.2018 11:33, Jakub Kruszona-Zawadzki wrote:
> > On 19 May 2018, at 18:42, Gandalf Corvotempesta
> > <gan...@gm...> wrote:
> >
> > On Sat, 19 May 2018 at 17:48, Marco Milano <mar...@gm...> wrote:
> >
> > > Tea Leaves :-)
> >
> > Seriously, where did you get this info?
> > How does HA work in MooseFS? Is failover automatic?
>
> MooseFS 4.x is now in "closed beta" (or even "closed
> release-candidate") stage. Currently we have started to use it
> ourselves. When we see that there are no obvious bugs, we will release
> it as an open-source product under the GPL-v2 (or even GPL-v2+)
> licence. In the meantime, if you want to participate in tests of MFS
> 4.x, please let us know and we will send you packages of MFS 4.x for
> your OS.
>
> Marco Milano is one of our "testers" and this is why he knows more
> about MFS 4.x.
>
> HA in MFS 4.x works fully automatically. You just need to define a
> group of IP numbers in your DNS for your "mfsmaster" name. Then
> install master servers and run them on machines with those IP numbers
> and that's it. One of them will become the "LEADER" of the group and
> the others "FOLLOWER"s of the group. If for any reason the "LEADER"
> goes down then one of the FOLLOWERs is chosen as an "ELECT" and when
> more than half of the chunkservers connect to it then it automatically
> switches to the "LEADER" state and starts working as the main master
> server. When your previous leader becomes available again then it
> usually joins the group as a "FOLLOWER". This is rather a "fire and
> forget" solution. They will synchronise metadata between them
> automatically, chunkservers and clients will reconnect to the current
> leader also automatically, etc.
>
> Regards,
> Jakub Kruszona-Zawadzki

Hey Jakub!

Such excellent work on MooseFS, as always! Thank you for letting us know
all this great news upfront! With the new features announced, I really
don't see how any other SDS competitor will be able to compete. Of
course, I may be biased, having run MooseFS for several years and been
very happy with it all that time. Occasionally I tested other SDS
solutions, but none were a match for MooseFS.

For the last half a year, I've been successfully running MooseFS 3.0 as
a Kubernetes deployment, and honestly I can't wait to see how the new
MooseFS 4.0 HA will fit that scheme. I expect lots of fun, in any case!

Keep up the good work!

--
Zlatko
From: Gandalf C. <gan...@gm...> - 2018-05-20 13:43:32
On Sun, 20 May 2018 at 15:38, Marin Bernard <li...@ol...> wrote:

> MooseFS chunkservers have the HDD_FSYNC_BEFORE_CLOSE config setting
> (default off) which does just that. The equivalent setting on LizardFS
> is PERFORM_FSYNC (default on). The performance penalty with
> PERFORM_FSYNC=1 is very high, though.
>
> If you use ZFS as a backend (which I do), fsync may be enforced at the
> file system level, which is probably more efficient as it bypasses the
> kernel buffer cache (ZFS uses its own). The performance penalty is
> higher than on other file systems because in async mode, ZFS batches
> disk transactions to minimize latency -- a performance boost which is
> lost when fsync is enabled. If performance is critical, you may
> improve it with zlogs.

No, data reliability and consistency are more important for us.

> I think you can do the same in MooseFS with storage classes: just
> specify a different label expression for the chunk Creation and Keep
> steps. This way, you may even decide to assign newly created chunks to
> specific chunkservers. With LizardFS, there is no way to limit chunk
> creation to a subset of chunkservers: they are always distributed
> randomly. That's a problem when half of your servers are part of
> another site.

Could you please give a real example?
Let's assume a replica 4, with 2 SSD chunkservers and 2 HDD
chunkservers. I would like the ACK to be returned to the client after
the first 2 SSD servers have written the data, while the other 2 (up to
a goal of 4) are still writing.
From: Marin B. <li...@ol...> - 2018-05-20 13:38:43
I may have some answers.

> There were some posts saying that MooseFS doesn't support fsync at
> all; is this fixed? Can we force the use of fsync to be sure that an
> ACK is sent if and only if all data is really written to disk?

MooseFS chunkservers have the HDD_FSYNC_BEFORE_CLOSE config setting
(default off) which does just that. The equivalent setting on LizardFS
is PERFORM_FSYNC (default on). The performance penalty with
PERFORM_FSYNC=1 is very high, though.

If you use ZFS as a backend (which I do), fsync may be enforced at the
file system level, which is probably more efficient as it bypasses the
kernel buffer cache (ZFS uses its own). The performance penalty is
higher than on other file systems because in async mode, ZFS batches
disk transactions to minimize latency -- a performance boost which is
lost when fsync is enabled. If performance is critical, you may improve
it with zlogs.

> LizardFS has something interesting: an ACK can be sent after a defined
> number of copies are done, even if it is less than the goal level.
> For example, with goal 4 and redundancy level set to 2: after 2
> successfully written chunkservers, the ACK is returned, and the
> missing 2 copies are made in writeback.

I think you can do the same in MooseFS with storage classes: just
specify a different label expression for the chunk Creation and Keep
steps. This way, you may even decide to assign newly created chunks to
specific chunkservers. With LizardFS, there is no way to limit chunk
creation to a subset of chunkservers: they are always distributed
randomly. That's a problem when half of your servers are part of another
site.

Marin.
From: Gandalf C. <gan...@gm...> - 2018-05-20 13:19:50
On Sun, 20 May 2018 at 11:33, Jakub Kruszona-Zawadzki <ac...@mo...>
wrote:

> In the meantime, if you want to participate in tests of MFS 4.x,
> please let us know and we will send you packages of MFS 4.x for your
> OS.

I have a test cluster to revive. I had planned to use Lizard, but if I
understood properly, MFS will have HA even in the FOSS version, thus I
would be glad to test, if possible.
From: Gandalf C. <gan...@gm...> - 2018-05-20 13:15:49
This is exactly what I was looking for: a well-done comparison.

If MFS 4 is released with native HA for everyone, and not only for
paying users, MooseFS could easily be considered one of the best SDS out
there.

There are some missing things that Lizard won't implement due to the
lack of manpower in their company; I'll open the same requests on the
MFS GitHub repository, let's see if MooseFS is more open...

There were some posts saying that MooseFS doesn't support fsync at all;
is this fixed? Can we force the use of fsync to be sure that an ACK is
sent if and only if all data is really written to disk?

LizardFS has something interesting: an ACK can be sent after a defined
number of copies are done, even if it is less than the goal level.
For example, with goal 4 and redundancy level set to 2: after 2
successfully written chunkservers, the ACK is returned, and the missing
2 copies are made in writeback.

Any plan to add support for a qemu driver, totally skipping the FUSE
stack? This is a much awaited feature from Lizard and would bring
MooseFS to the cloud world, where many hypervisors are qemu/kvm.

On Sun, 20 May 2018 at 14:53, Marin Bernard <li...@ol...> wrote:

> > This is very interesting.
> > Any official and detailed docs about the HA feature?
> >
> > Other than this, without starting a flame war, which are the
> > differences between MFS4 and LizardFS?
>
> Hi again,
>
> I've been testing both MooseFS 3.0.x and LizardFS 3.1x in parallel for
> a few weeks now. Here are the main differences I found while using
> them. I think most of them will still be relevant with MooseFS 4.0.
>
> [snip]
From: Marin B. <li...@ol...> - 2018-05-20 13:08:31
> [snip]
>
> * Windows client
> The paid version of LizardFS includes a native Windows client. I think
> it is built upon some kind of fsal à la Dokan. The client allows you
> to map a LizardFS export to a drive letter. The client supports
> Windows ACLs (probably stored as NFSv4 ACLs).
>
> * Removed features
> LizardFS removed chunkserver maintenance mode and the authentication
> code (AUTH_CODE). Several tabs from the Web UI are also gone,
> including the one showing quotas. The original CLI tools were replaced
> by their own versions, which I find harder to use (no more tables, and
> very verbose output).
>
> [snip]

A few corrections:

1. MooseFS Pro also includes a Windows client.
2. LizardFS did not "remove" tabs from the web UI: these tabs were added
   by MooseFS after LizardFS had forked the code base.
From: Marin B. <li...@ol...> - 2018-05-20 12:53:50
> This is very interesting.
> Any official and detailed docs about the HA feature?
>
> Other than this, without starting a flame war, which are the
> differences between MFS4 and LizardFS?

Hi again,

I've been testing both MooseFS 3.0.x and LizardFS 3.1x in parallel for a
few weeks now. Here are the main differences I found while using them. I
think most of them will still be relevant with MooseFS 4.0.

* High availability
In theory, LizardFS provides master high availability with _shadow_
instances. The reality is less glorious, as the piece of software
actually implementing master autopromotion (based on uraft) is still
proprietary. It is expected to be GPL'd, yet nobody knows when. So as of
now, if you need HA with LizardFS, you have to write your own set of
scripts and use a 3rd-party cluster manager such as corosync.

* POSIX ACLs
Using POSIX ACLs with LizardFS requires a recent Linux kernel (4.9+),
because a version of FUSE with ACL support is needed. This means ACLs
are unusable with most LTS distros, whose kernels are too old.

With MooseFS, ACLs do work even with older kernels; maybe because they
are implemented at the master level and the client does not even try to
enforce them?

* FreeBSD support
According to the LizardFS team, all components do compile on FreeBSD.
They do not provide a package repository, though, nor did they succeed
in submitting LizardFS to the FreeBSD ports tree (bug #225489 is still
open on phabricator).

* Storage classes
Erasure coding is supported in LizardFS, and I had no special issue with
it. So far, it works as expected.

The equivalent of MooseFS storage classes in LizardFS are _custom
goals_. While MooseFS storage classes may be dealt with interactively,
LizardFS goals are statically defined in a dedicated config file.
MooseFS storage classes allow the use of different label expressions at
each step of a chunk lifecycle (different labels for new, kept and
archived chunks). LizardFS has no equivalent.

One application of MooseFS storage classes is to transparently delay the
geo-replication of a chunk for a given amount of time, to lower the
latency of client I/O operations. As far as I know, it is not possible
to do the same with LizardFS.

* NFS support
LizardFS supports NFSv4 ACLs. It may also be used with the NFS Ganesha
server to export directories directly through user-space NFS. I did not
test this feature myself. According to several people, the feature,
which is rather young, does work but performs poorly. Ganesha on top of
LizardFS is a multi-tier setup with a lot of moving parts. I think it
will take some time for it to reach production quality, if ever.

In theory, Ganesha is compatible with kerberized NFS, which would be far
more secure a solution than the current mfsmount client, enabling its
use in public/hostile environments. I don't know if MooseFS 4.0 has
improved on this matter.

* Tape server
LizardFS includes a tape server daemon for tape archiving. That's
another way to implement some kind of chunk lifecycle without storage
classes.

* IO limits
LizardFS includes a new config file dedicated to IO limits. It allows
you to assign IO limits to cgroups. The LFS client negotiates its
bandwidth limit with the master and is leased a reserved bandwidth for a
given amount of time. The big limitation of this feature is that the
reserved bandwidth may not be shared with another client while the
original one is not using it. In that case, the reserved bandwidth is
simply lost.

* Windows client
The paid version of LizardFS includes a native Windows client. I think
it is built upon some kind of fsal à la Dokan. The client allows you to
map a LizardFS export to a drive letter. The client supports Windows
ACLs (probably stored as NFSv4 ACLs).

* Removed features
LizardFS removed chunkserver maintenance mode and the authentication
code (AUTH_CODE). Several tabs from the Web UI are also gone, including
the one showing quotas. The original CLI tools were replaced by their
own versions, which I find harder to use (no more tables, and very
verbose output).

I've been using MooseFS for several years and never had any problem with
it, even in very awkward situations. My feeling is that it is really a
rock-solid, battle-tested product.

I gave LizardFS a try, mainly for erasure coding and high availability.
While the former worked as expected, the latter turned out to be a myth:
the free version of LizardFS does not provide more HA than MooseFS CE.
In both cases, building a HA solution requires writing custom scripts
and relying on a cluster manager such as corosync. I see no added value
in using LizardFS for HA.

On all other aspects, LizardFS does the same or worse than MooseFS. I
found performance to be roughly equivalent between the two (provided you
disable fsync on LizardFS chunkservers, where it is enabled by default).
Both solutions are still similar in many aspects, yet LizardFS is
clouded by a few negative points: ACLs are hardly usable, custom goals
are less powerful than storage classes and less convenient for
geo-replication, FreeBSD support is inexistent, the CLI tools are less
efficient, and native NFS support is too young to be really usable.

After a few months, I came to the conclusion that migrating to LizardFS
was not worth the single erasure coding feature, especially now that
MooseFS 4.0 CE with EC is officially announced. I'd rather buy a few
more drives and cope with standard copies for a while than ditch MooseFS
reliability for LizardFS.

Hope it helps,

Marin
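To make the geo-replication point above concrete, here is a hedged
sketch that follows the same -C/-K pattern as the tiered4 example quoted
elsewhere in this thread; the labels L (local) and R (remote) are
placeholders:

# 2 local copies at creation; the keep step adds the remote copy
# asynchronously, off the client's write path
mfsscadmin create -C2L -K2L,1R geo3
mfssetsclass geo3 /mnt/mfs/some-directory

The client's write is acknowledged once the local copies exist, which is
exactly the latency benefit described above.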
From: Gandalf C. <gan...@gm...> - 2018-05-20 10:07:13
This is very interesting.
Any official and detailed docs about the HA feature?

Other than this, without starting a flame war, which are the differences
between MFS4 and LizardFS?

On Sun, 20 May 2018 at 11:33, Jakub Kruszona-Zawadzki <ac...@mo...>
wrote:

> > On 19 May 2018, at 18:42, Gandalf Corvotempesta
> > <gan...@gm...> wrote:
> >
> > On Sat, 19 May 2018 at 17:48, Marco Milano <mar...@gm...> wrote:
> >
> > > Tea Leaves :-)
> >
> > Seriously, where did you get this info?
> > How does HA work in MooseFS? Is failover automatic?
>
> MooseFS 4.x is now in "closed beta" (or even "closed
> release-candidate") stage. Currently we have started to use it
> ourselves. When we see that there are no obvious bugs, we will release
> it as an open-source product under the GPL-v2 (or even GPL-v2+)
> licence. In the meantime, if you want to participate in tests of MFS
> 4.x, please let us know and we will send you packages of MFS 4.x for
> your OS.
>
> Marco Milano is one of our "testers" and this is why he knows more
> about MFS 4.x.
>
> HA in MFS 4.x works fully automatically. You just need to define a
> group of IP numbers in your DNS for your "mfsmaster" name. Then
> install master servers and run them on machines with those IP numbers
> and that's it. One of them will become the "LEADER" of the group and
> the others "FOLLOWER"s of the group. If for any reason the "LEADER"
> goes down then one of the FOLLOWERs is chosen as an "ELECT" and when
> more than half of the chunkservers connect to it then it automatically
> switches to the "LEADER" state and starts working as the main master
> server. When your previous leader becomes available again then it
> usually joins the group as a "FOLLOWER". This is rather a "fire and
> forget" solution. They will synchronise metadata between them
> automatically, chunkservers and clients will reconnect to the current
> leader also automatically, etc.
>
> Regards,
> Jakub Kruszona-Zawadzki
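For reference, a minimal sketch of the DNS side of the setup Jakub
describes, assuming three master candidates; the zone and addresses are
placeholders:

; one A record per master server, all under the single "mfsmaster" name
mfsmaster    IN  A  10.0.0.11
mfsmaster    IN  A  10.0.0.12
mfsmaster    IN  A  10.0.0.13

Chunkservers and clients resolve "mfsmaster" to the whole group and, as
described above, end up reconnected to whichever node is currently the
LEADER.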
From: Marin B. <li...@ol...> - 2018-05-20 10:01:40
> MooseFS 4.x is now in "closed beta" (or even "closed
> release-candidate") stage. Currently we have started to use it
> ourselves. When we see that there are no obvious bugs, we will release
> it as an open-source product under the GPL-v2 (or even GPL-v2+)
> licence. In the meantime, if you want to participate in tests of MFS
> 4.x, please let us know and we will send you packages of MFS 4.x for
> your OS.
>
> Marco Milano is one of our "testers" and this is why he knows more
> about MFS 4.x.
>
> HA in MFS 4.x works fully automatically. You just need to define a
> group of IP numbers in your DNS for your "mfsmaster" name. Then
> install master servers and run them on machines with those IP numbers
> and that's it. One of them will become the "LEADER" of the group and
> the others "FOLLOWER"s of the group. If for any reason the "LEADER"
> goes down then one of the FOLLOWERs is chosen as an "ELECT" and when
> more than half of the chunkservers connect to it then it automatically
> switches to the "LEADER" state and starts working as the main master
> server. When your previous leader becomes available again then it
> usually joins the group as a "FOLLOWER". This is rather a "fire and
> forget" solution. They will synchronise metadata between them
> automatically, chunkservers and clients will reconnect to the current
> leader also automatically, etc.
>
> Regards,
> Jakub Kruszona-Zawadzki

Hi Jakub,

Thank you for taking the time to provide these answers!

Marin