Thread: [f2fs-dev] general stability of f2fs?
From: Marc L. <sc...@sc...> - 2015-08-08 20:50:12
Hi!

I did some more experiments, and wonder about the general stability of f2fs. I have not managed to keep an f2fs filesystem working for longer than a few days.

For example, a few days ago I created an 8TB volume and copied 2TB of data to it, which worked until I hit the (very low...) 32k limit on the number of subdirectories.

I moved some directories into a single subdirectory, and continued. Everything seemed fine.

Today I ran fsck.f2fs on the fs, which found 4 inodes with wrong link counts (generally higher than fsck counted). It asked me whether to fix this, which I did.

I then did another fsck run, and was greeted with tens of thousands of errors:

http://ue.tst.eu/f692bac9abbe4e910787adee18ec52be.txt

Mounting made the box unusable for multiple minutes, probably due to the amount of backtraces:

http://ue.tst.eu/6243cc344a943d95a20907ecbc37061f.txt

The data is toast (which is fine, I am still only experimenting), but this, the weird write behaviour, and the fact that you don't get signalled on ENOSPC make me wonder what the general status of f2fs is.

It *seems* to have been in actual use for a number of years now, and I would expect small hiccups and problems, so backups would be advised, but this level of brokenness (I only tested linux 3.18.14 and 4.1.4) is not something I expected from a fs that has been in development for so long.

So I wonder what the general stability expectation for f2fs is - is it just meant to be an experimental fs not used for any data, or am I just unlucky and hit so many disastrous bugs by chance?

(It's really too bad, it's the only fs in linux that has stable write performance on SMR drives at this time.)

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
From: Jaegeuk K. <ja...@ke...> - 2015-08-10 20:34:20
Hi Marc,

I'm very interested in trying f2fs on SMR drives too. I also think that several characteristics of SMR drives are very similar to flash drives.

So far, f2fs has performed well on embedded systems like smartphones. For server environments, however, I couldn't actually test f2fs very intensively. The major uncovered code areas would be:

- the over-4TB storage space case
- the inline_dentry mount option; I'm still working on extent_cache for v4.3 too
- various sizes of section and zone
- the tmpfile and rename2 interfaces

In your logs, I suspect some fsck.f2fs bugs in the large-storage case. In order to confirm that, could you use the latest f2fs-tools from:

http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

And, if possible, could you share some experiences from when you didn't fill the partition up to 100%? If there is no problem, we can focus nicely on ENOSPC only.

Thanks,

On Sat, Aug 08, 2015 at 10:50:03PM +0200, Marc Lehmann wrote:
> [full quote of the original message trimmed]
From: Marc L. <sc...@sc...> - 2015-09-23 23:30:30
On Wed, Sep 23, 2015 at 04:55:57PM +0800, Chao Yu <cha...@sa...> wrote:
> > echo 1 >gc_idle
> > echo 1000 >gc_max_sleep_time
> > echo 5000 >gc_no_gc_sleep_time
>
> One thing I note is that gc_min_sleep_time is not set in your script,
> so in some conditions gc may still sleep for gc_min_sleep_time (30
> seconds by default) instead of the gc_max_sleep_time we expect.
Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
it.
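For reference, the full set of GC timing knobs being discussed can be set together; a minimal sketch, assuming the filesystem shows up as dm-1 under /sys/fs/f2fs (the device name is an assumption, the values are the ones from this exchange, not a recommendation):

```shell
# Background-GC tuning knobs discussed above; sleep times are in milliseconds.
cd /sys/fs/f2fs/dm-1
echo 1    > gc_idle              # idle-time victim policy (1: cost-benefit)
echo 100  > gc_min_sleep_time    # the knob missing from the original script
echo 1000 > gc_max_sleep_time
echo 5000 > gc_no_gc_sleep_time
```

These only take effect on a mounted f2fs; writing them requires root.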
> In the 4.3-rc1 kernel, we have added a new ioctl to trigger gc in
> batches; maybe we can use it as one option.
Yes, such an ioctl could be useful to me, although I do not intend to turn
background gc off.
I assume that the ioctl will block for the time it runs, and that I can ask
it to do up to 16 batches in one go (by default)? That indeed sounds very
useful to have.
What is "one batch" in terms of gc, one section?
--
Marc Lehmann
From: Marc L. <sc...@sc...> - 2015-09-23 23:39:46
On Wed, Sep 23, 2015 at 03:08:23PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
> > Filesystem Size Used Avail Use% Mounted on
> > /dev/mapper/vg_test-test 138G 137G 803k 100% /mnt
>
> Could you please share /sys/kernel/debug/f2fs/status?

Uh, sorry, I planned to, but forgot, probably because I thought the result was so good it didn't need any checking :)

> So, I'm convinced that your initial test set "-o1 -s128", which was an unlucky
> trial. :)

Hmm... since the point is to simulate a full 8TB partition, having large overprovision/reserved space AND a large section size might actually have been a good test, as it would simulate the 8TB case better, which would also have larger overprovisioning and a larger section size.

In the end, I might settle on -s64, and currently do tests with -s90. I was just scared that overprovisioning might turn out to be extremely large with 8TB. I have since dropped -o from all my mkfs.f2fs invocations, seeing that the resulting filesystem does not actually have 5% overprovisioning.

> Subject: [PATCH] mkfs.f2fs: fix wrong ovp space calculation on large section

Hmm, the latest change in git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git is from August 10 - do I need to select a branch (I am not good with git)?

--
Marc Lehmann
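The branch question above can be answered by inspecting the repository with generic git commands; a sketch (whether a `dev` branch actually exists in f2fs-tools is an assumption, based on the kernel-side f2fs.git convention):

```shell
# Clone f2fs-tools and list all branches; newer fixes may sit on a
# non-default branch rather than master.
git clone git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git
cd f2fs-tools
git branch -r                  # show the remote branches that exist
git log --oneline -5 --all     # newest commits on any branch
# git checkout dev             # hypothetical: switch, if a dev branch exists
```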
From: Marc L. <sc...@sc...> - 2015-08-10 20:53:43
On Mon, Aug 10, 2015 at 01:31:06PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> I'm very interested in trying f2fs on SMR drives too.
> I also think that several characteristics of SMR drives are very similar with
> flash drives.

Indeed, but of course there isn't an exact match for any characteristic. Also, in the end, drive-managed SMR drives will suck somewhat with any filesystem (note that nilfs performs very badly, even though it should be better than anything else till the drive is completely full).

Now, looking at the characteristics of f2fs, it could be a good match for any rotational media, too, since it writes linearly and can defragment. At least for desktop or similar loads (where files usually aren't randomly written, but mostly replaced and rarely appended).

The only crucial ability it would need is being able to free large chunks for rewriting, which should be in f2fs as well.

So at this time, what I apparently need is mkfs.f2fs -s128 instead of -s7.

Unfortunately, I probably can't make these tests immediately, and they do take some days to run, but hopefully I can repeat my experiments next week.

> - over 4TB storage space case

fsck limits could well have been the issue for my first big filesystem, but not the second (which was only 128G in size, to be able to utilize it within a reasonable time).

> - inline_dentry mount option; I'm still working on extent_cache for v4.3 too

I only enabled mount options other than noatime for the 128G filesystem, so it might well have caused the trouble with it.

Another thing that will seriously hamper adoption of these drives is the 32000 limit on hardlinks - I am hard pressed to find any large file tree here that doesn't have places with 40000 subdirs somewhere, but I guess on a 32GB phone flash storage this was less of a concern.

In any case, if f2fs turns out to be workable, it will become the fs of choice for my archival uses, and maybe even more, and I'll then have to somehow cope with that limit.

> In your logs, I suspect some fsck.f2fs bugs in a large storage case.
> In order to confirm that, could you use the latest f2fs-tools from:
> http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

Will do so.

Is there a repository for out-of-tree module builds for f2fs? It seems kernels 3.17.x to 4.1 (at least) have a kernel bug making reads to these SMR drives unstable (https://bugzilla.kernel.org/show_bug.cgi?id=93581), so I will have to test with a relatively old kernel or play too many tricks. And I suspect from glancing over patches (and mount options) that there have been quite some improvements in f2fs since 3.16 days.

> And, if possible, could you share some experiences when you didn't fill up the
> partition to 100%? If there is no problem, we can nicely focus on ENOSPC only.

My experience was that f2fs wrote at nearly the maximum I/O speed of the drives. In fact, I couldn't saturate the bandwidth except when writing small files, because the 8-drive source raid using xfs was not able to read files quickly enough.

After writing an initial tree of >2TB, directory reading and mass stat seemed to be considerably slower and take more time directly afterwards. I don't know if that is something that balancing can fix (or improve), but I am not overly concerned about that, as the difference to e.g. xfs is not that big (roughly a factor of two), and these operations are too slow for me on any device, so I usually put a dm-cache in front of such storage devices.

I don't think I have more useful data to report - if I used 14MB sections, performance would predictably suck, so the real test is still outstanding. Stay tuned, and thanks for your reply!

--
Marc Lehmann
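The section-size arithmetic behind "-s128 instead of -s7" above is simple: f2fs segments are fixed at 2 MiB and -sN sets segments per section, so the GC unit grows from ~14 MiB to 256 MiB. A quick sketch (the mkfs line is illustrative and destructive; the device path is hypothetical):

```shell
# f2fs segment size is fixed at 2 MiB; -sN groups N segments into one
# section, which is the unit the garbage collector frees.
SEG_MIB=2
SEGS_PER_SEC=128
echo "section size: $((SEG_MIB * SEGS_PER_SEC)) MiB"
# mkfs.f2fs -s128 /dev/mapper/vg_test-test   # uncomment only on a scratch device
```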
From: Jaegeuk K. <ja...@ke...> - 2015-08-10 21:58:17
On Mon, Aug 10, 2015 at 10:53:32PM +0200, Marc Lehmann wrote:
> Indeed, but of course there isn't an exact match for any characteristic.
> Also, in the end, drive-managed SMR drives will suck somewhat with any
> filesystem (note that nilfs performs very badly, even though it should be
> better than anything else till the drive is completely full).

IMO, it's similar to flash drives too. Indeed, I believe host-managed SMR/flash drives are likely to show much better performance than drive-managed ones. However, I think there are many HW constraints inside the storage that prevent moving forward to it easily.

> Now, looking at the characteristics of f2fs, it could be a good match for
> any rotational media, too, since it writes linearly and can defragment.

Possible, but not much different from other filesystems. :)

> So at this time, what I apparently need is mkfs.f2fs -s128 instead of -s7.

I wrote a patch to fix the documentation. Sorry about that.

> > - inline_dentry mount option; I'm still working on extent_cache for v4.3 too
>
> I only enabled mount options other than noatime for the 128G filesystem,
> so it might well have caused the trouble with it.

Okay, so I think it'd be good to start with:
- noatime,inline_xattr,inline_data,flush_merge,extent_cache

And you can control defragmentation through /sys/fs/f2fs/[DEV]/gc_[min|max|no]_sleep_time.

> Another thing that will seriously hamper adoption of these drives is the
> 32000 limit on hardlinks - I am hard pressed to find any large file tree
> here that doesn't have places with 40000 subdirs somewhere.

At a glance, it'll be no problem to increase it to 64k. Let me check again.

> Is there a repository for out-of-tree module builds for f2fs? It seems
> kernels 3.17.x to 4.1 (at least) have a kernel bug making reads to these SMR
> drives unstable (https://bugzilla.kernel.org/show_bug.cgi?id=93581), so I
> will have to test with a relatively old kernel or play too many tricks.

What kernel version do you prefer? I've been maintaining f2fs for v3.10 mainly.

http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.10

Thanks,
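The suggested starting options above map to a single mount invocation; a sketch, assuming the device path used elsewhere in the thread (extent_cache needs a kernel new enough to support it, as the kernel will otherwise reject or ignore the option):

```shell
# Baseline mount options suggested for the SMR experiments.
mount -t f2fs \
  -o noatime,inline_xattr,inline_data,flush_merge,extent_cache \
  /dev/mapper/vg_test-test /mnt
```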
From: Marc L. <sc...@sc...> - 2015-08-13 00:26:52
On Mon, Aug 10, 2015 at 02:58:06PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> IMO, it's similar to flash drives too. Indeed, I believe host-managed SMR/flash
> drives are likely to show much better performance than drive-managed ones.

If I had one, its performance would be abysmal, as filesystems (and indeed, driver support) for that are far away... :)

> However, I think there are many HW constraints inside the storage not to move
> forward to it easily.

Exactly :)

> > Now, looking at the characteristics of f2fs, it could be a good match for
> > any rotational media, too, since it writes linearly and can defragment.
>
> Possible, but not much different from other filesystems. :)

Hmm, I would strongly disagree - most other filesystems cannot defragment effectively. For example, xfs_fsr is unstable under load and only defragments files, but greatly increases external fragmentation over time. Similarly for e4defrag. Most other filesystems do not even have a way to defragment, and files that are defragmented never move on other filesystems.

This can be true for f2fs as well, but as far as I can see, if formatted with e.g. -s128, the external fragments will be 256MB in size, which is far more acceptable than the millions of 4-100kb fragments on some of my xfs filesystems.

If I didn't copy my filesystems every 1.5 years or so, they would be horribly degraded. It's very common to read directories with many medium to small files at 10-20MB/s on an old xfs filesystem, but at 80MB/s on a new one with exactly the same contents.

I don't think f2fs will intelligently defragment and re-layout directories anytime soon either, but at least internal and external fragmentation are being managed.

> Okay, so I think it'd be good to start with:
> - noatime,inline_xattr,inline_data,flush_merge,extent_cache.

I still haven't found the right kernel for my main server, but I did some preliminary experiments today with 3.19.8-ckt5 (an ubuntu kernel).

After formatting a 128G partition with "mkfs.f2fs -o1 -s128 -t0", I got this after mounting (the kernel complained about missing extent_cache in my kernel version):

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_test-test 128G 53G 75G 42% /mnt

which gives me another question - on an 8TB disk, 5% overprovisioning is 400GB, which sounds a bit wasteful. Even 1% (80GB) sounds a bit much, especially as I am prepared to wait for defragmentation, if defragmentation works well. And lastly, the 53GB used on a 128GB partition looks way too conservative.

I immediately configured the fs with these values:

echo 500 >gc_max_sleep_time
echo 100 >gc_min_sleep_time
echo 800 >gc_no_gc_sleep_time

Anyways, I wrote to it until the disk was 99% utilized according to /sys/kernel/debug/f2fs/status, at which point write speed crawled down to 1-2MB/s.

I deleted some "random" files till utilisation was at 38%, then waited until there was no disk I/O (the disk went into standby, which indicates that it has flushed its internal transaction log as well).

When I then tried to write a file, the writer (rsync) stopped after ~4kb, and the filesystem started reading at <2MB/s and writing at <2MB/s for a few minutes. Since I didn't intend this to be a proper test (I was mainly looking for a kernel that worked well with the hardware and drives), I didn't make detailed notes, but basically, "LFS:" increased exactly with the writing speed.

I then stopped writing, after which the fs wrote (but did not read) a bit longer at this speed, then became idle; the disk went into standby again.

The next day, I mounted it, and now I will take notes. Initial status was:

http://ue.tst.eu/e2ea137a6b87fd0e43446b286a3d1b19.txt

The disk woke up and started reading and writing at <1MB/s:

http://ue.tst.eu/a9dd48428b7b454f52590efeea636a27.txt

At some point, you can see that the disk stopped reading; that's when I killed rsync. rsync also transfers over the net, and as you can see, it didn't manage to transfer anything. The read I/O is probably due to rsync reading the filetree info.

A status snapshot after killing rsync looks like this:

http://ue.tst.eu/211fc87b0b43270e4b2ee0261d251818.txt

The disk did no other I/O afterwards and went into standby again.

I repeated the experiment a few minutes later with similar results, with these differences:

1. There was absolutely no read I/O (maybe all inodes were still in the cache, but that would be surprising, as rsync probably didn't read all of them in the previous run).

2. The disk didn't stay idle this time, but instead kept steadily writing at ~1MB/s.

Status output at the end:

http://ue.tst.eu/cbb4774b2f8e44ae68e635be5a414d1d.txt

Status output a bit later, disk still writing:

http://ue.tst.eu/9fbdfe1e9051a65c1417bea7192ea182.txt

Much later, disk idle:

http://ue.tst.eu/78a1614d867bfbfa115485e5fcf1a1a8.txt

At this point, my main problem is that I have no clue what is causing the slow writes. Obviously the garbage collector doesn't think anything needs to be done, so it shouldn't be IPU writes either, and even if it is, I don't know what the ipu_policy values mean.

I tried the same with ipu_policy=8 and min_ipu_util=100, and also separately with gc_idle=1, with seemingly no difference.

Here is what I expect should happen:

When I write to a new disk, or append to a still-free-enough disk, writing happens linearly (by which I mean appending to multiple of its logs linearly, which is not optimal, but should be fine). This clearly happens, and near perfectly so.

When the disk is near-full, bad things might happen; delays might occur while some small areas are being garbage collected.

When I delete files, the disk should start garbage collecting at around 50MB/s read + 50MB/s write. If combined with writing, I should be able to write at roughly 30MB/s while the garbage collector is cleaning up.

I would expect the gc to do its work by selecting a 256MB section, reading everything it needs to, writing this data linearly to some log, possibly followed by some random update and a flush or somesuch, and thus achieve about 50MB/s cleaning throughput. This clearly doesn't seem to happen, possibly because the gc thinks nothing needs to be done.

I would expect the gc to do its work when the disk is idle, at least if needed, so that after coming back a while later, I can write at nearly full speed again. This also doesn't happen - maybe the gc runs, but writing to the disk is impossible even after it has quieted down.

> > Another thing that will seriously hamper adoption of these drives is the
> > 32000 limit on hardlinks [...]
>
> Looking at a glance, it'll be no problem to increase it to 64k.
> Let me check again.

I thought more like 2**31 or so links, but it so happens that all my testcases (by pure chance) have between 57k and 64k links, so thanks a lot for that.

If you are reluctant, look at other filesystems. extX thought 16 bits was enough. btrfs thought 16 bits was enough - even reiserfs thought 16 bits was enough. Lots of filesystems thought 16 bits was enough, but all modern incarnations of them do 31 or 32 bit link counts these days.

It's kind of rare to have 8+TB of storage where you are fine with 2**16 subdirectories everywhere.

> What kernel version do you prefer? I've been maintaining f2fs for v3.10 mainly.
>
> http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.10

I have a hard time finding kernels that work with these SMR drives. So far, only the 3.18.x and 3.19.x series work for me; the 3.17 and 3.16 kernels fail for various reasons, and the 4.1.x kernels still fail miserably with these drives.

So, at this point, it needs to be either 3.18 or 3.19 for me. It seems 3.19 has everything but the extent_cache, which probably shouldn't make such a big difference. Are there any big bugs in 3.18/3.19 which I would have to look out for? Storage size isn't an issue right now, because I can reproduce the performance characteristics just fine on a 128G partition.

I mainly asked because I thought newer kernel versions might have important bugfixes.

--
Marc Lehmann
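The overprovisioning cost Marc quotes follows directly from the ratio; a quick check of the numbers (treating 8 TB as ~8000 GB, as the message does):

```shell
# 5% and 1% overprovisioning on an 8 TB (~8000 GB) drive, matching the
# 400 GB and 80 GB figures in the message above.
DISK_GB=8000
for pct in 5 1; do
  echo "${pct}% of ${DISK_GB} GB = $((DISK_GB * pct / 100)) GB"
done
```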
From: Jaegeuk K. <ja...@ke...> - 2015-08-14 23:07:34
On Thu, Aug 13, 2015 at 02:26:41AM +0200, Marc Lehmann wrote: Okay, let me jump into the original issues. > I still haven't found the right kernel for my main server, but I did some > preliminary experiments today, with 3.19.8-ckt5 (an ubuntu kernel). I backported the latest f2fs into 3.19 here. http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.19 You can build f2fs by linking the following f2fs source codes into your base ubuntu 3.19.8-ckt5. - fs/f2fs/* - include/linux/f2fs_fs.h - include/trace/events/f2fs.h > After formatting a 128G partition with "mkfs.f2fs -o1 -s128 -t0", I got > this after mounting (kernel complained about missing extent_cache in my > kernel version): > > Filesystem Size Used Avail Use% Mounted on > /dev/mapper/vg_test-test 128G 53G 75G 42% /mnt > > which give sme another quetsion - on an 8TB disk, 5% overprovision is > 400GB, which sounds a bit wasteful. Even 1% (80GB) sounds a bit much, > especially asI am prepared to wait for defragmentation, if defragmentation > works well. And lastly, the 53GB used on a 128GB partition looks way too > conservative. Right, so I wrote a patch to resolve this issue. http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git You can find this patch which set the best overprovision ratio automatically. mkfs.f2fs: set overprovision size more precisely > > I immediately configured the fs with these values: > > echo 500 >gc_max_sleep_time > echo 100 >gc_min_sleep_time > echo 800 >gc_no_gc_sleep_time > > Anyways, I write it until disk was 99% utilizied according to > /sys/kernel/debug/f2fs/status, at which write speed crawled down to 1-2MB/s. > > I deleted some "random" files till utilisation was at 38%, then waited > until there was no disk I/O (disk went into standby, which indicates that > it has flushed its internal transaction log as well). 
> > When I then tried to write a file, the writer (rsync) stopped after ~4kb, and > the filesystem started reading at <2MB/s and wriitng at <2MB/s for a few > minutes. Since I didn't intend this to test very well (I was looking mainly > for a kernel that worked well with the hardware and drives), I didn't make > detailed notes, but basically, "LFS:" increased exactly with the writing > speed. > > I then stopped writing, after which the fs wrote (but did not read) a bit > longer at this speed, then became idle, disk went into standby again. > > The next day, I mounted it, and now I will take notes. Initial status was: > > http://ue.tst.eu/e2ea137a6b87fd0e43446b286a3d1b19.txt > > The disk woke up and started reading and writing at <1MB/s: > > http://ue.tst.eu/a9dd48428b7b454f52590efeea636a27.txt > > At some point, you can see that the disk stopped reading, that's when I > killed rsync. rsync also transfers over the net, and as you can see, it > didn't maange to transfer anything. The read I/O is probably due to rsync > reading the filetree info. > > A status snapshot after killing rsync looks like this: > > http://ue.tst.eu/211fc87b0b43270e4b2ee0261d251818.txt Here, the key clue is the number of CP calls, which increased enormously. So, I did some test which filled up with data and take a look at what happened in the last minutes. In my case, I could have seen that a lot of checkpoints were triggered by f2fs_gc even though there was nothing to gather garbages. I suspect that's the exact corner case where the performance goes down dramatically. In order to resolve that issue, I made a patch: f2fs: skip checkpoint if there is no dirty and prefree segments Note that, the backported f2fs should have this patch too. So, at first, could you check this patch in your workloads? > The disk did no other I/O afterwards and went into standby again. > > I repeated the experiment a few minutes later with similar > results, with these differences: > > 1. 
There was absolutely no read I/O (maybe all inodes were still in the > cache, but that would be surprising as rsync probably didn't read all > of them in the previous run). > > 2. The disk didn't stay idle this time, but instead kept steadily writing > at ~1MB/s. > > Status output at the end: > > http://ue.tst.eu/cbb4774b2f8e44ae68e635be5a414d1d.txt > > Status output a bit later, disk still writing: > > http://ue.tst.eu/9fbdfe1e9051a65c1417bea7192ea182.txt > > Much later, disk idle: > > http://ue.tst.eu/78a1614d867bfbfa115485e5fcf1a1a8.txt > > At this point, my main problem is that I have no clue what is causing the > slow writes. Obviously the garbage collector doesn't think anything needs > to be done, it shouldn't be IPU writes either then, and even if they are, > I don't know what the ipu_policy's mean. > > I tried the same with ipu_policy=8 and min_ipu_util=100, also separately > also gc_idle=1, with seemingly no difference. > > Here is what I expect should happen: > > When I write to a new disk, or append to a still-free-enough disk, writing > happens linearly (with that I mean appending to multiple of its logs > linearly, which is not optimal, but should be fine). This clearly happens, > and near perfectly so. > > When the disk is near-full, bad things might happen, delays might be there > when some small areas are being garbage collected. > > When I delete files, the disk should start garbage collecting at around > 50mb/s read + 50mb/s write. If combined with writing, I should be able to > write at roughly 30MB/s while the garbage collector is cleaning up. At that moment, actually I suspect garbage collector has no sections to clean up. Because, if you set a big section in a small partition, the deleted regions are likely to be laid across the current active sections. In such the case, even if there are many dirty segments, garbage collector can't select them as victims at all. 
> I would expect the gc to do its work by selecting a 256MB section, reading > everything it needs to, write this data linearly to some log poossibly > followed by some random update and a flush or somesuch, and thus achieve > about 50MB/s cleaning throughput. This clearly doesn't seem to happen, > possibly because the gc things nothing needs to be done. > > I would expect the gc to do its work when the disk is idle, at least if > need to, so after coming back after a while, I can write at nearly full > speed again. This also dosn't happen - maybe the gc runs, but writing to > the disk is impossible even after it qwuited down. > > > > Another thing that will seriously hamper adoption of these drives is the > > > 32000 limit on hardlinks - I am hard pressed to find any large file tree > > > here that doesn't have places with of 40000 subdirs somewhere, but I guess > > > on a 32GB phone flash storage, this was less of a concern. > > > > Looking at a glance, it'll be no problme to increase as 64k. > > Let me check again. > > I thought more like 2**31 or so links, but it so happens that all my > testcases (by pure chance) have between 57k and 64k links,. so thanks a > lot for that. > > If you are reluctant, look at other filesystems. extX thought 16 bit is > enough. btrfs thought 16 bit is enough - even reiserfs thought 16 bit is > enough. Lots of filesystems thought 16 bits is enough, but all modern > incarnations of them do 31 or 32 bit link counts these days. Oh, yes. The f2fs_inode's link_count is the 32 bit structure, so it would be good to set 0xffffffff for F2FS_LINK_MAX. > It's kind of rare to have 8+TB of storage where you are fine with 2**16 > subdirectories everywhere. > > > What kernel version do you prefer? I've been maintaining f2fs for v3.10 mainly. > > > > http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs.git/log/?h=linux-3.10 > > I have a hard time finding kernels that work with these SMR drives. 
> So far, only the 3.18.x and the 3.19.x series work for me. The 3.17 and
> 3.16 kernels fail for various reasons, and the 4.1.x kernels still fail
> miserably with these drives.
>
> So, at this point, it needs to be either 3.18 or 3.19 for me. It seems
> 3.19 has everything but the extent_cache, which probably shouldn't make
> such a big difference. Are there any big bugs in 3.18/3.19 which I would
> have to look out for? Storage size isn't an issue right now, because I can
> reproduce the performance characteristics just fine on a 128G partition.
>
> I mainly asked because I thought newer kernel versions might have
> important bugfixes.
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-21 00:14:57
|
On Mon, Aug 10, 2015 at 01:31:06PM -0700, Jaegeuk Kim <ja...@ke...> wrote:

> I'm very interested in trying f2fs on SMR drives too.

Sorry that it took me so long; I am currently conducting initial tests, and
will hopefully be able to report soon, for real this time.

The kernel bug regarding SMR drives has multiplied (the gory details are in
https://bugzilla.kernel.org/show_bug.cgi?id=93581; apparently, a silent data
corruption error has emerged, although I don't think I was affected by it).
In short, I will test with "stock" 3.18.20 and/or 4.2.0 (with
max_sectors_kb=512). I also have the current git f2fs tools up and running.
I'll do small tests with 4.2.0 and hopefully also the big ones (depending on
when I can reboot the boxes).

In the meantime, can you answer me one question? How can I effectively
disable IPU? I currently try this:

   echo 8 >ipu_policy
   echo 100 >min_ipu_util

Can you verify that this would suppress at least "most" IPU updates? If not,
is there a better way to suppress them? Thanks!

I really want to see the garbage collector freeing big chunks on its own,
and would rather wait for it to do its work than risk IPU writes, as the
latter will effectively trigger a similar garbage collect on the drive
(less efficient, but with more cache).

> - over 4TB storage space case

I currently do tests with 512GB, and will do full device size later.

> - inline_dentry mount option; I'm still working on extent_cache for v4.3 too

While inline_dentry will be nice to have, I can live with it being disabled,
but will test anyway. Likewise the extent_cache.
> - various sizes of section and zone
> - tmpfile, and rename2 interfaces

I wasn't even aware of the renameat2 syscall, thanks for indirectly pointing
it out to me :)

> http://git.kernel.org/cgit/linux/kernel/git/jaegeuk/f2fs-tools.git

Up and running, thanks for pointing the URL out to me, I overlooked it in
the manpage :/

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\ |
|
From: Marc L. <sc...@sc...> - 2015-09-23 23:43:32
|
On Thu, Sep 24, 2015 at 01:30:22AM +0200, Marc Lehmann <sc...@sc...> wrote:
> > One thing I note is that gc_min_sleep_time is not be set in your script,
> > so in some condition gc may still do the sleep with gc_min_sleep_time (30
> > seconds by default) instead of gc_max_sleep_time which we expect.
>
> Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
> it.
Sorry, that sounded confusing - I set it to 100 in previous tests, and forgot
to include it, so it was running with 30000. Now that I set it while
experimenting, I actually do see the gc perform operations more frequently.
Is there any obvious harm in setting it to a very low value (such as 100 or 10)?
I assume all it does is leave less of a time buffer between the last operation
and the gc starting. When I write in batches, or when I know the fs will be
idle, there shouldn't be any harm, performance-wise, in letting it work all
the time.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-24 17:21:13
|
On Thu, Sep 24, 2015 at 01:43:24AM +0200, Marc Lehmann wrote:
> On Thu, Sep 24, 2015 at 01:30:22AM +0200, Marc Lehmann <sc...@sc...> wrote:
> > > One thing I note is that gc_min_sleep_time is not be set in your script,
> > > so in some condition gc may still do the sleep with gc_min_sleep_time (30
> > > seconds by default) instead of gc_max_sleep_time which we expect.
> >
> > Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
> > it.
>
> Sorry, that sounded confusing - I set it to 100 in previous tests, and forgot
> to include it, so it was running with 30000. When experimenting, I actually
> do get the gc to do more frequent operations now.
>
> Is there any obvious harm setting it to a very low value (such as 100 or 10)?
>
> I assume all it does is have less time buffer between the last operation
> and the gc starting. When I write in batches, or when I know the fs will be
> idle, there shouldn't be any harm, performance wise, of letting it work all
> the time.

Yeah, I don't think it matters with very small time periods, since the timer
is set after background GC is done.
But, we use msecs_to_jiffies(), so I hope you won't use something like 10 ms,
since each background GC pass reads victim blocks into the page cache and then
just sets them dirty.
That means that, after a while, we hope the flusher will write them all to
disk so we finally get a free section.
So, IMO, we need to give some time slots to the flusher as well.

For example, if write bandwidth is 30MB/s and section size is 128MB, it needs
about 4 secs to write one section. So, how about setting
- gc_min_time to 1~2 secs,
- gc_max_time to 3~4 secs,
- gc_idle_time to 10 secs,
- reclaim_segments to 64 (sync when 1 section becomes prefree)

Thanks,

> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\ |
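The suggested timings map onto the f2fs sysfs tunables used earlier in this
thread (gc_min_sleep_time, gc_max_sleep_time, gc_no_gc_sleep_time,
reclaim_segments, all under /sys/fs/f2fs/<device>). A minimal sketch of
applying them; the assumptions are that gc_idle_time corresponds to
gc_no_gc_sleep_time, that times are in milliseconds, and that dm-1 is a
placeholder device name:

```shell
# tune_f2fs_gc DIR - write the GC timings suggested above into DIR
# (normally /sys/fs/f2fs/<device>; sleep times are in milliseconds).
tune_f2fs_gc() {
    dir="$1"
    echo 1500  > "$dir/gc_min_sleep_time"    # 1~2 secs
    echo 3500  > "$dir/gc_max_sleep_time"    # 3~4 secs
    echo 10000 > "$dir/gc_no_gc_sleep_time"  # ~10 secs when nothing to do
    echo 64    > "$dir/reclaim_segments"     # sync at 1 prefree section (-s64)
}

# On a real system (device name is an assumption):
# tune_f2fs_gc /sys/fs/f2fs/dm-1
```

Taking the sysfs directory as an argument also lets the function be pointed
at a scratch directory for a dry run.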
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-24 17:27:58
|
On Thu, Sep 24, 2015 at 01:39:38AM +0200, Marc Lehmann wrote:
> On Wed, Sep 23, 2015 at 03:08:23PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > > root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
> > > Filesystem Size Used Avail Use% Mounted on
> > > /dev/mapper/vg_test-test 138G 137G 803k 100% /mnt
> >
> > Could you please share /sys/kernel/debug/f2fs/status?
>
> Uh, sorry, I planned to, but forgot, probably because I thought the result
> was so good it didn't need any checking :)
>
> > So, I'm convinced that your initial test set "-o1 -s128", which was an
> > unlucky trial. :)
>
> hmm... since the point is to simulate a full 8TB partition, having large
> overprovision/reserved space AND large section size might actually have been
> a good test, as it would simulate the TB case better, which would also have
> larger overprovisioning and the larger section size.
>
> In the end, I might settle with -s64, and currently do tests with -s90.

Got it. But why -s90? :)

> I was just scared that overprovisioning might turn out to be extremely large
> with 8TB.
>
> I have since then dropped -o from all my mkfs.f2fs invocations, seeing
> that the resulting filesystem does not actually have 5% overprovisioning.
>
> > Subject: [PATCH] mkfs.f2fs: fix wrong ovp space calculation on large section
>
> Hmm, the latest change in
> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git is from
> August 10 - do I need to select a branch (I am not good with git)?

I just pushed the patches to master branch in f2fs-tools.git.
Could you pull them and check them?

I added one more patch to avoid the harmless sit_type fixes you previously
reported.

And, for the 8TB case, let me check again. It seems that we need to handle
under 1% overprovision ratio. (e.g., 0.5%)

Thanks,

> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\ |
|
From: Chao Yu <cha...@sa...> - 2015-09-25 08:06:39
|
> -----Original Message-----
> From: Marc Lehmann [mailto:sc...@sc...]
> Sent: Thursday, September 24, 2015 7:30 AM
> To: Chao Yu
> Cc: 'Jaegeuk Kim'; lin...@li...
> Subject: Re: [f2fs-dev] SMR drive test 2; 128GB partition; no obvious corruption, much more
> sane behaviour, weird overprovisioning
>
> On Wed, Sep 23, 2015 at 04:55:57PM +0800, Chao Yu <cha...@sa...> wrote:
> > > echo 1 >gc_idle
> > > echo 1000 >gc_max_sleep_time
> > > echo 5000 >gc_no_gc_sleep_time
> >
> > One thing I note is that gc_min_sleep_time is not be set in your script,
> > so in some condition gc may still do the sleep with gc_min_sleep_time (30
> > seconds by default) instead of gc_max_sleep_time which we expect.
>
> Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include
> it.
>
> > In 4.3 rc1 kernel, we have add a new ioctl to trigger in batches gc, maybe
> > we can use it as one option.
>
> Yes, such an ioctl could be useful to me, although I do not intend to have
> background gc off.
>
> I assume that the ioctl will block for the time it runs, and I can ask it
> to do up to 16 batches in one go (by default)? That sounds indeed very
Actually, we should set the value of the 'count' parameter to indicate how
many times we want to do gc in one batch, at most 16 times in a loop for
each ioctl invocation:
ioctl(fd, F2FS_IOC_GC, &count);
After the ioctl returns successfully, the 'count' parameter will contain the
number of GC passes actually performed.
> useful to have.
>
> What is "one batch" in terms of gc, one section?
One batch means a certain number of GC passes executing serially.
We have foreground/background modes in the gc procedure:
1) In foreground gc mode, it will try to gc several sections until there are
enough free sections;
2) In background gc mode, it will try to gc one section.
So we will not know how many sections will be freed in one batch, because it
depends on a) which mode we will use (the gc mode is chosen dynamically
depending on the current number of free sections/dirty data) and b) whether
a victim exists or not.
Thanks,
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-26 03:42:27
|
On Fri, Sep 25, 2015 at 04:05:48PM +0800, Chao Yu <cha...@sa...> wrote:
> Actually, we should set the value of 'count' parameter to indicate how many
> times we want to do gc in one batch, at most 16 times in a loop for each
> ioctl invoking:
> ioctl(fd, F2FS_IOC_GC, &count);
> After ioctl retruned successfully, 'count' parameter will contain the count
> of gces we did actually.
Ah, so this way, I could even find out when to stop.
> One batch means a certain number of gces excuting serially.
Thanks for the explanation - well, I guess there is no harm in setting
count to 1 and calling it repeatedly, as GC operations should generally be
slow enough so many repeated calls will be ok.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Chao Yu <cha...@sa...> - 2015-09-25 08:28:52
|
> -----Original Message----- > From: Jaegeuk Kim [mailto:ja...@ke...] > Sent: Friday, September 25, 2015 1:21 AM > To: Marc Lehmann > Cc: Chao Yu; lin...@li... > Subject: Re: [f2fs-dev] SMR drive test 2; 128GB partition; no obvious corruption, much more > sane behaviour, weird overprovisioning > > On Thu, Sep 24, 2015 at 01:43:24AM +0200, Marc Lehmann wrote: > > On Thu, Sep 24, 2015 at 01:30:22AM +0200, Marc Lehmann <sc...@sc...> wrote: > > > > One thing I note is that gc_min_sleep_time is not be set in your script, > > > > so in some condition gc may still do the sleep with gc_min_sleep_time (30 > > > > seconds by default) instead of gc_max_sleep_time which we expect. > > > > > > Ah, sorry, I actually set gc_min_sleep_time to 100, but forgot to include > > > it. > > > > Sorry, that sounded confusing - I set it to 100 in previous tests, and forgot > > to include it, so it was running with 30000. When experimenting, I actually > > do get the gc to do more frequent operations now. > > > > Is there any obvious harm setting it to a very low value (such as 100 or 10)? > > > > I assume all it does is have less time buffer between the last operation > > and the gc starting. When I write in batches, or when I know the fs will be > > idle, there shouldn't be any harm, performance wise, of letting it work all > > the time. > > Yeah, I don't think it does matter with very small time periods, since the timer > is set after background GC is done. > But, we use msecs_to_jiffies(), so hope not to use something like 10 ms, since > each backgroudn GC conducts reading victim blocks into page cache and then just > sets them as dirty. > That indicates, after a while, we hope flusher will write them all to disk and > finally we got a free section. > So, IMO, we need to give some time slots to flusher as well. > > For example, if write bandwidth is 30MB/s and section size is 128MB, it needs > about 4secs to write one section. 
It's better for us to consider the VM dirty data flush policy. IIRC,
Fengguang did the writeback optimization work: if the dirty ratio (dirty
bytes?) is not high, the VM will flush data rather slowly, but as the dirty
ratio increases, the VM will flush data aggressively. If we want to use a
large part of the max bandwidth, the values of the following interfaces
could be considered when tuning the gc policy of f2fs:

/proc/sys/vm/dirty_background_bytes
/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_expire_centisecs

Thanks,

> So, how about setting
> - gc_min_time to 1~2 secs,
> - gc_max_time to 3~4 secs,
> - gc_idle_time to 10 secs,
> - reclaim_segments to 64 (sync when 1 section becomes prefree)
>
> Thanks,
>
> >
> > --
> > The choice of a Deliantra, the free code+content MORPG
> > -----==- _GNU_ http://www.deliantra.net
> > ----==-- _ generation
> > ---==---(_)__ __ ____ __ Marc Lehmann
> > --==---/ / _ \/ // /\ \/ / sc...@sc...
> > -=====/_/_//_/\_,_/ /_/\_\ |
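For inspecting those knobs before and after tuning, a small sketch; the
function takes the directory as an argument (default /proc/sys/vm) purely so
it can also be exercised against a scratch directory:

```shell
# show_writeback_tuning [DIR] - print the VM writeback knobs mentioned
# above, reading them from DIR (default /proc/sys/vm).
show_writeback_tuning() {
    proc="${1:-/proc/sys/vm}"
    for knob in dirty_background_bytes dirty_background_ratio dirty_expire_centisecs; do
        printf '%s = %s\n' "$knob" "$(cat "$proc/$knob")"
    done
}
```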
|
From: Marc L. <sc...@sc...> - 2015-09-21 08:17:57
|
On Mon, Sep 21, 2015 at 01:59:01AM +0200, Marc Lehmann <sc...@sc...> wrote:

> Sorry that it took me so long, I am currently conducting initial tests, and
> will hopefully be able to report soon, for real this time.

Ok, here is my first test result. It's primarily concerned with GC and
near-full conditions, because that is fastest to test. The test was done on
a 4.2.0 kernel and current git f2fs tools.

Summary: not good - write performance went down to 20kb/s at the 10GB free
mark, sync took hours to complete, the filesystem was corrupt afterwards and
fsck failed to repair it.

I created a 512GB partition (-s 128 -o 1), mounted it
(-onoatime,inline_xattr,inline_data,flush_merge,extent_cache, note: no
inline_dentry) and started writing files to it (again, via rsync). Every few
minutes, a simple script deleted every 80th file, to create dirty blocks.

This test didn't test write performance, but it was adequate (the filesystem
kept up with it). I paused rsync multiple times to check delete speed - the
find -type f command I used to generate the list was rather slow (it took
multiple minutes to list ~50000 files), which is not completely surprising,
and still manageable for me.

At around 50% utilization I paused the rsync and delete to see if there was
any gc or other activity. Indeed, every 30 seconds or so there was a ~100MB
read and write, and no other activity. I continued writing.

At the 10GB free mark (df -h), write speed became rather slow (~1MB/s), and
a short time later (9.8GB) I paused rsync+delete again. The "Dirty:" value
was around 11000 at the time.

From then on performance became rather abysmal - the speed went down to a
steady 20kb/s (sic!). After a while I started "sync", which hung for almost
2 hours, during which the disk was mostly written at ~20kb/s, with
occasional faster writes (~40-100MB/s) for a few seconds. The faster write
periods coincided mostly with activity in the "Balancing F2FS Async"
section of the status file.
Here is the status file from when the write speed became slow:

http://ue.tst.eu/12cf94978b9f47013f5f3b5712692ed5.txt

And here is the status file maybe half an hour later:

http://ue.tst.eu/144d36137371905a43d9a100f2f6b65c.txt

I can't really explain the abysmal speed - it doesn't happen with other
filesystems, so it's unlikely to be a hardware issue, but the only way I can
imagine this speed being explained is by f2fs scattering random small writes
all over the disk. The disk can do about 5-15 fully random writes per
second, but should be able to buffer >20GB of random writes before this
would happen.

The reason why I am so infatuated with disk full conditions is that they
will happen sooner or later, and while a slowdown to 1MB/s might be ok when
the disk is nearly full, the filesystem absolutely must recover once there
is more free space and it has had some time to reorganise.

Another issue is that in one of my applications (backup), I reserve 10GB of
space for transaction storage used only temporarily, and the rest for long
term storage. With f2fs, it seems this has to be at least 25GB to avoid the
performance drop (which effectively takes down the disk for hours). This is
a bit painful for two reasons:

1) f2fs already sets aside a lot of storage. Even with the minimum amount of
reserved space (1%), this boils down to 80GB, which is a lot. In this test,
only 5GB were reserved, but performance dropped when df -h still showed 10GB
of free space.

Now my observations on recovery after this condition:

After sync returned, I more or less regained control of the disk, and
started thinning out files again. This was rather slow at first (but the
disk was reading and writing 1-50MB/s - I assume the GC was at work). After
about 20 minutes, the utilization went down from 97% to 96%:

http://ue.tst.eu/74dd57f9b0fe2657a1518af71de0ce38.txt

At this point I noticed "find" spewing a large number of "No such file or
directory" messages for files.
The command I used to delete was:

find /mnt -type f | awk '0 == NR % 80' | xargs -d\\n rm -v

And I don't see how find can ever complain about "No such file or
directory", even when there are concurrent deletes, because find should not
revisit the same file multiple times, so by the time it gets deleted, find
should be done with it.

At this point I stopped the find/rm - the disk then only showed large reads
and writes with a few seconds' pause between them. I then ran the find
command manually, and fair enough, find gives thousands of "No such file or
directory" messages like these:

find: `/mnt/ebook-export/eng/Pyrotools.txt': No such file or directory

And indeed, the filesystem is completely corrupted at this point, with lots
of directory entries that cannot be stat'ed.

root@shag:~# echo /mnt/ebook-export/eng/Pyrotools*
/mnt/ebook-export/eng/Pyrotools.txt
root@shag:~# ls -ld /mnt/ebook-export/eng/Pyrotools*
ls: cannot access /mnt/ebook-export/eng/Pyrotools.txt: No such file or directory

Since you warned me about the inline_dentry/extent_cache options, I will
re-run this test tomorrow with noinline_dentry,noextent_cache (not
documented, if they even exist - but inline_dentry seems to be on by
default?).

For completeness, I ran fsck.f2fs, which gave me a lot of these:

[ASSERT] (fsck_chk_inode_blk: 525) --> ino: 0xc4e9 has i_blocks: 0000009e, but has 1 blocks
[ASSERT] (fsck_chk_inode_blk: 391) --> [0xc79a] needs more i_links=0x1
[ASSERT] (fsck_chk_inode_blk: 525) --> ino: 0xc79a has i_blocks: 0000005c, but has 1 blocks
[ASSERT] (fsck_chk_inode_blk: 391) --> [0xc845] needs more i_links=0x1
[ASSERT] (fsck_chk_inode_blk: 525) --> ino: 0xc845 has i_blocks: 000002d5, but has 1 blocks
[ASSERT] (sanity_check_nid: 261) --> Duplicated node blk.
nid[0x34fa5][0x7fe07b3]
[ASSERT] (fsck_chk_inode_blk: 391) --> [0xccdc] needs more i_links=0x1
[ASSERT] (fsck_chk_inode_blk: 525) --> ino: 0xccdc has i_blocks: 00000063, but has 1 blocks
[ASSERT] (fsck_chk_inode_blk: 391) --> [0xcebc] needs more i_links=0x1
[ASSERT] (fsck_chk_inode_blk: 525) --> ino: 0xcebc has i_blocks: 000000b0, but has 1 blocks
[ASSERT] (fsck_chk_inode_blk: 391) --> [0xcf12] needs more i_links=0x1
[ASSERT] (fsck_chk_inode_blk: 525) --> ino: 0xcf12 has i_blocks: 00001b18, but has 1 blocks

I then tried fsck.f2fs -a, which completed without much output, almost
instantly (what does it do?).

I then tried fsck.f2fs -f, which seemed to do something:

[ASSERT] (fsck_chk_inode_blk: 391) --> [0x5c524] needs more i_links=0x1
[FIX] (fsck_chk_inode_blk: 398) --> File: 0x5c524 i_links= 0x1 -> 0x2
[ASSERT] (fsck_chk_inode_blk: 525) --> ino: 0x5c524 has i_blocks: 00000019, but has 1 blocks
[FIX] (fsck_chk_inode_blk: 530) --> [0x5c524] i_blocks=0x00000019 -> 0x1
[ASSERT] (fsck_chk_inode_blk: 391) --> [0x671ba] needs more i_links=0x1
[FIX] (fsck_chk_inode_blk: 398) --> File: 0x671ba i_links= 0x1 -> 0x2
...
[FIX] (fsck_chk_inode_blk: 530) --> [0x1a7bf] i_blocks=0x000000ca -> 0x1
[ASSERT] (IS_VALID_BLK_ADDR: 344) --> block addr [0x0]
[ASSERT] (sanity_check_nid: 212) --> blkaddres is not valid. [0x0]
[FIX] (__chk_dentries: 779) --> Unlink [0x1a7d8] - E B Jones.epub len[0x33], type[0x1]
[ASSERT] (IS_VALID_BLK_ADDR: 344) --> block addr [0x0]
...
NID[0x679e2] is unreachable
NID[0x679e3] is unreachable
NID[0x6bc52] is unreachable
NID[0x6bc53] is unreachable
NID[0x6bc54] is unreachable
[FSCK] Unreachable nat entries [Fail] [0x2727]
[FSCK] SIT valid block bitmap checking [Fail]
[FSCK] Hard link checking for regular file [Ok..] [0x0]
[FSCK] valid_block_count matching with CP [Fail] [0x6a6bc8a]
[FSCK] valid_node_count matcing with CP (de lookup) [Fail] [0x6808d]
[FSCK] valid_node_count matcing with CP (nat lookup) [Ok..]
[0x6a7b4]
[FSCK] valid_inode_count matched with CP [Fail] [0x55bb8]
[FSCK] free segment_count matched with CP [Ok..] [0x8f5d]
[FSCK] next block offset is free [Ok..]
[FSCK] fixing SIT types
[FIX] (check_sit_types:1056) --> Wrong segment type [0x3fc6a] 3 -> 4
[FIX] (check_sit_types:1056) --> Wrong segment type [0x3fc6b] 3 -> 4
[FSCK] other corrupted bugs [Fail]

Doesn't look good to me. However, the filesystem was mountable without error
afterwards, but find showed similar errors, so fsck.f2fs did not result in a
working filesystem either.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\ |
|
From: Marc L. <sc...@sc...> - 2015-09-25 05:42:34
|
On Thu, Sep 24, 2015 at 10:27:49AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > In the end, I might settle with -s64, and currently do tests with -s90.
>
> Got it. But why -s90? :)
He :) It's a nothing-special number between 64 and 128, that's all.
> I just pushed the patches to master branch in f2fs-tools.git.
> Could you pull them and check them?
Got them, last patch was the "check sit types" change.
> I added one more patch to avoid harmless sit_type fixes previously you reported.
>
> And, for the 8TB case, let me check again. It seems that we need to handle under
> 1% overprovision ratio. (e.g., 0.5%)
That might make me potentially very happy. But my main concern at the
moment is stability - even when you have a backup, restoring 8TB will take
days, and backups are never uptodate.
It would be nice to be able to control it more from the user side though.
For example, I have not yet reached 0.0% free with f2fs. That's fine, I don't
plan to, but I need to know at which percentage I should stop, which is
something I can only really find out with experiments.
And just filling these 8TB disks takes days, so the question is, can I
simulate near-full behaviour with smaller partitions.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-21 08:19:44
|
On Mon, Sep 21, 2015 at 10:17:48AM +0200, Marc Lehmann <sc...@sc...> wrote:
> (-onoatime,inline_xattr,inline_data,flush_merge,extent_cache, note: no
Correction, I copied the wrong line from my log, the mount options were:
mount -o inline_data,inline_dentry,flush_merge,extent_cache /dev/vg_test/test2 /mnt
So with inline_dentry.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-25 17:45:56
|
On Fri, Sep 25, 2015 at 07:42:25AM +0200, Marc Lehmann wrote:
> On Thu, Sep 24, 2015 at 10:27:49AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > > In the end, I might settle with -s64, and currently do tests with -s90.
> >
> > Got it. But why -s90? :)
>
> He :) It's a nothing-special number between 64 and 128, that's all.
Oh, then, I don't think that is a good magic number.
It seems that you decided to use -s64, so it'd be better to keep it when
reporting any perf results.
> > I just pushed the patches to master branch in f2fs-tools.git.
> > Could you pull them and check them?
>
> Got them, last patch was the "check sit types" change.
>
> > I added one more patch to avoid harmless sit_type fixes previously you reported.
> >
> > And, for the 8TB case, let me check again. It seems that we need to handle under
> > 1% overprovision ratio. (e.g., 0.5%)
>
> That might make me potentially very happy. But my main concern at the
> moment is stability - even when you have a backup, restoring 8TB will take
> days, and backups are never uptodate.
>
> It would be nice to be able to control it more from the user side though.
>
> For example, I have not yet reached 0.0% free with f2fs. That's fine, I don't
> plan9 to, but I need to know at which percentage should I stop, which is
> something I can only really find out with experiments.
>
> And just filling these 8TB disks takes days, so the question is, can I
> simulate near-full behaviour with smaller partitions.
Why not? :)
I think the behavior should be the same. And, it'd be good to set small
sections in order to see it more clearly.
Anyway, I wrote a patch to consider under 1% for large partitions.
section ovp ratio ovp size
For 8TB,
-s1 : 0.07% -> 10GB
-s32 : 0.39% -> 65GB
-s64 : 0.55% -> 92GB
-s128 : 0.78% -> 132GB
For 128GB,
-s1 : 0.55% -> 1.4GB
-s32 : 3.14% -> 8GB
-s64 : 4.45% -> 12GB
-s128 : 6.32% -> 17GB
Let me test this patch for a while, and then push into our git.
Thanks,
>From 2cdb04b52f202e931e370564396366d44bd4d1e2 Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <ja...@ke...>
Date: Fri, 25 Sep 2015 09:31:04 -0700
Subject: [PATCH] mkfs.f2fs: support <1% overprovision ratio
Big partitions need under 1% overprovision space to acquire more usable space.
section ovp ratio ovp size
For 8TB,
-s1 : 0.07% -> 10GB
-s32 : 0.39% -> 65GB
-s64 : 0.55% -> 92GB
-s128 : 0.78% -> 132GB
For 128GB,
-s1 : 0.55% -> 1.4GB
-s32 : 3.14% -> 8GB
-s64 : 4.45% -> 12GB
-s128 : 6.32% -> 17GB
Signed-off-by: Jaegeuk Kim <ja...@ke...>
---
include/f2fs_fs.h | 2 +-
mkfs/f2fs_format.c | 12 ++++++------
mkfs/f2fs_format_main.c | 2 +-
3 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/include/f2fs_fs.h b/include/f2fs_fs.h
index 38a774c..359deec 100644
--- a/include/f2fs_fs.h
+++ b/include/f2fs_fs.h
@@ -225,7 +225,7 @@ enum f2fs_config_func {
struct f2fs_configuration {
u_int32_t sector_size;
u_int32_t reserved_segments;
- u_int32_t overprovision;
+ double overprovision;
u_int32_t cur_seg[6];
u_int32_t segs_per_sec;
u_int32_t secs_per_zone;
diff --git a/mkfs/f2fs_format.c b/mkfs/f2fs_format.c
index 2d4ab09..176bdea 100644
--- a/mkfs/f2fs_format.c
+++ b/mkfs/f2fs_format.c
@@ -155,19 +155,19 @@ static void configure_extension_list(void)
free(config.extension_list);
}
-static u_int32_t get_best_overprovision(void)
+static double get_best_overprovision(void)
{
- u_int32_t reserved, ovp, candidate, end, diff, space;
- u_int32_t max_ovp = 0, max_space = 0;
+ double reserved, ovp, candidate, end, diff, space;
+ double max_ovp = 0, max_space = 0;
if (get_sb(segment_count_main) < 256) {
candidate = 10;
end = 95;
diff = 5;
} else {
- candidate = 1;
+ candidate = 0.01;
end = 10;
- diff = 1;
+ diff = 0.01;
}
for (; candidate <= end; candidate += diff) {
@@ -533,7 +533,7 @@ static int f2fs_write_check_point_pack(void)
set_cp(overprov_segment_count, get_cp(overprov_segment_count) +
get_cp(rsvd_segment_count));
- MSG(0, "Info: Overprovision ratio = %u%%\n", config.overprovision);
+ MSG(0, "Info: Overprovision ratio = %.3lf%%\n", config.overprovision);
MSG(0, "Info: Overprovision segments = %u (GC reserved = %u)\n",
get_cp(overprov_segment_count),
config.reserved_segments);
diff --git a/mkfs/f2fs_format_main.c b/mkfs/f2fs_format_main.c
index fc612d8..2ea809c 100644
--- a/mkfs/f2fs_format_main.c
+++ b/mkfs/f2fs_format_main.c
@@ -99,7 +99,7 @@ static void f2fs_parse_options(int argc, char *argv[])
config.vol_label = optarg;
break;
case 'o':
- config.overprovision = atoi(optarg);
+ config.overprovision = atof(optarg);
break;
case 'O':
parse_feature(strdup(optarg));
--
2.1.1
|
|
From: Marc L. <sc...@sc...> - 2015-09-21 09:58:14
|
Second test - we're getting there:
Summary: looks much better, no obvious corruption (but fsck still gives
tens of thousands of [FIX] messages), performance somewhat as expected,
but a 138GB partition can only store 71.5GB of data (avg filesize 2.2MB)
and f2fs doesn't seem to do visible background GC.
For this test, changed a bunch of parameters:
1. partition size
128GiB instead of 512GiB (not ideal, but I wanted this test to be
quick)
2. mkfs options
mkfs.f2fs -lTEST -o5 -s128 -t0 -a0 # change: -o5 -a0
3. mount options
mount -t f2fs -onoatime,flush_merge,active_logs=2,no_heap
# change: no inline_* options, no extent_cache, but no_heap + active_logs=2
First of all, the discrepancy between utilization in the status file, du
and df is quite large:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_test-test 128G 106G 22G 84% /mnt
# du -skc /mnt
51674268 /mnt
51674268 total
Utilization: 67% (13168028 valid blocks)
So ~52GB of files take up ~106GB of the partition, which is 84% of the
total size, yet it's only utilized by 67%.
Second, and subjectively, the filesystem was much more responsive during
the test - find almost instantly gives some output, instead of having to
wait for half a minute, and find|rm is much faster as well. find also
reads data at ~2MB/s, while in the previous test, it was 0.7MB/s (which
can be good or bad, but it looks good).
At 6.7GB free (df: 95%, status: 91%, du: 70/128GiB) I paused rsync. The disk
then did some heavy read/write for a short while, and the Dirty: count
reduced:
http://ue.tst.eu/d61a7017786dc6ebf5be2f7e2d2006d7.txt
I continued, and the disk afterwards did almost the same amount of reading
as it was writing, with short intermittent write-only periods for a few
seconds each. Rsync itself was noticeably slower, so I guess f2fs finally
ran out of space and did garbage collect.
This is exactly the behaviour I did expect of f2fs, but this is the first
time I actually saw it.
Pausing didn't result in any activity.
At 6.3GB free, disk write speed went down to 1MB/s with intermittent
phases of 100MB/s write only, or 50MB/s read + 50MB/s write (but rsync was
transferring about 100kb/s at this point only, so no real progress was
made).
After about 10 minutes I paused rsync again, still at 6.3GB free (df
reporting 96% in use, status 91%, and du 52% (71.5GB)).
I must admit I don't understand these ratios - df vs. status can easily
be explained by overprovisioning, but the fact that a 138GB (128GiB)
partition can only hold 72GB of data with very few small files is not
looking good to me:
# df -H /mnt
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_test-test 138G 130G 6.3G 96% /mnt
# du -skc /mnt
71572620 /mnt
I wonder what this means, too:
MAIN: 65152(OverProv:27009 Resv:26624)
Surely this doesn't mean that 27009 of 65152 segments are for
overprovisioning? That would explain the bad values for du, but then, I
did specify -o5, not -o45 or so.
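The arithmetic is easy to check (a sketch; the 2 MiB default segment size, 512 blocks of 4 KiB each, is an assumption):

```python
# "MAIN: 65152(OverProv:27009 Resv:26624)" -- what fraction of the main
# area is set aside?  Assumes f2fs's default 2 MiB segment size.

SEG_MIB = 2

main_segs = 65152
overprov_segs = 27009

ratio = overprov_segs / main_segs
print(f"{ratio:.1%} of the main area, ~{overprov_segs * SEG_MIB / 1024:.0f} GiB")
# -> 41.5% of the main area, ~53 GiB
```

An effective ratio of roughly 41%, nowhere near the requested -o5; it matches what an -o45-ish setting would look like.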
status at that point was:
http://ue.tst.eu/f869dfb6ac7b4d52966e8eb012b81d2a.txt
Anyway, I did more thinning to regain free space by deleting every 10th
file. That went reasonably slowly; the disk was constantly reading and
writing at high speed, so I guess it was busy garbage collecting, as it
should be.
status after deleting, with completely idle disk:
http://ue.tst.eu/1831202bc94d9cd521cfcefc938d2095.txt
/dev/mapper/vg_test-test 138G 123G 15G 90% /mnt
I waited a few minutes, but there was no further activity. I then unpaused
the rsync, which proceeded with good speed again.
At 11GB free, rsync effectively stopped, and the disk went into ~1MB/s
write mode again. Pausing rsync didn't cause I/O to stop this time; it
continued for a few minutes.
I waited for 2 minutes with no disk I/O, unpaused rsync, and the disk
immediately went into 1MB/s write mode again, with rsync not really
getting any data through though.
It's as if f2fs only tries to clean up when there is write data. I would
expect a highly fragmented f2fs to be very busy garbage collecting, but
apparently not: it just idles, and when a program wants to write, it
fails to perform. Maybe I need to give it more time than two minutes, but
then, I don't see the point in delaying garbage collection if it has to
be done anyway.
In any case, with no progress possible, I deleted more files again, this
time every 5th file, which went reasonably fast.
status after delete:
http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt
/dev/mapper/vg_test-test 138G 114G 23G 84% /mnt
rsync writing was reasonably fast down to 18GB free, when rsync stopped
making much progress (<100kb/s); this time the disk wasn't in "1MB/s mode"
but instead doing 40MB/s read+write, which looks reasonable to me, as the
disk was probably quite fragmented at this point:
http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt
However, when pausing rsync, f2fs immediately ceased doing anything again;
so even though there is clearly a need for cleanup activity, f2fs doesn't
do it.
To state this more clearly: my expectation is that when f2fs runs out of
immediately usable space for writing, it should do GC. That means that
when rsync is very slow and the disk is very fragmented, even when I pause
rsync, f2fs should GC at full speed until it has a reasonable amount of
usable free space again. Instead, it apparently just sits idle until some
program generates write data.
At this point, I unmounted the filesystem and "fsck.f2fs -f"'ed it. The
report looked good:
[FSCK] Unreachable nat entries [Ok..] [0x0]
[FSCK] SIT valid block bitmap checking [Ok..]
[FSCK] Hard link checking for regular file [Ok..] [0x0]
[FSCK] valid_block_count matching with CP [Ok..] [0xe8b623]
[FSCK] valid_node_count matcing with CP (de lookup) [Ok..] [0xa58a]
[FSCK] valid_node_count matcing with CP (nat lookup) [Ok..] [0xa58a]
[FSCK] valid_inode_count matched with CP [Ok..] [0x7800]
[FSCK] free segment_count matched with CP [Ok..] [0x8a17]
[FSCK] next block offset is free [Ok..]
[FSCK] fixing SIT types
However, there were about 30000 messages like these:
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf6] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf7] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf8] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf9] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfa] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfb] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfc] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfd] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfe] 0 -> 1
[FIX] (check_sit_types:1056) --> Wrong segment type [0xfdff] 0 -> 1
[FSCK] other corrupted bugs [Ok..]
That's not promising; why does it think it needs to fix anything?
I mounted the partition again. Listing the files was very fast. I deleted
all the files and ran rsync for a while. It seems the partition completely
recovered. This is the empty state, btw:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_test-test 138G 57G 80G 42% /mnt
So, all the pathological behaviour is gone (no 20kb/s write speed blocking
the disk for hours, and, more importantly, no obvious filesystem
corruption, although the fsck messages need an explanation).
Moreover, the behaviour, while still confusing (weird du vs. df, no
background activity), at least seems to be in line with what I expect:
fragmentation kills performance, but f2fs seems capable of recovering.
So here is my wishlist:
1. the overprovisioning values seem to be completely out of this world. I'm
prepared to give up maybe 50GB of my 8TB disk for this, but not more.
2. even though ~40% of the space is not used by file data, f2fs still
becomes extremely slow. This can't be right.
3. why does f2fs sit idle on a highly fragmented filesystem? Why does it
not do background garbage collection at maximum I/O speed, so the
filesystem is ready when the next writes come?
Greetings, and good night :)
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-26 03:33:03
|
On Fri, Sep 25, 2015 at 10:45:46AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > He :) It's a nothing-special number between 64 and 128, that's all.
>
> Oh, then, I don't think that is a good magic number.
Care to share why? :)
> It seems that you decided to use -s64, so it'd better to keep it to address
> any perf results.
Is there anything especially good about powers of two? Or do you just want
to reduce the number of changed variables?
If yes, should I do the 3.18.21 test with -s90 (as in the 3.18.21 and 4.2.1
tests before), or with -s64?
> > And just filling these 8TB disks takes days, so the question is, can I
> > simulate near-full behaviour with smaller partitions.
>
> Why not? :)
> I think the behavior should be same. And, it'd good to set small sections
> in order to see it more clearly.
The section size is a critical parameter for these drives. Also, the data
mix is the same for 8TB and smaller partitions (in these tests, which were
meant to be the first round of tests only anyway).
So a smaller section size compared to the full partition test, I think,
would result in very different behaviour. Likewise, if a small partition
has comparatively more (or absolutely less) overprovision (and/or reserved
space), this again might cause different behaviour.
At least to me, it's not obvious what a good comparable overprovision ratio
is to test full device behaviour on a smaller partition.
Also, section sizes vary by a factor of two over the device, so what might
work fine with -s64 in the middle of the disk might work badly at the end.
Likewise, since the files don't get larger, the GC might do a much better
job at -s64 than at -s128 (almost certainly, actually).
As a thought experiment, what happens when I use -s8 or a similar small size?
If the GC writes linearly, there won't be too many RMW cycles. But is that
guaranteed even with an aging filesystem?
If yes, then the best -s number might be 1. Because all I rely on is
mostly linear batched large writes, not so much large batched reads.
That is, unfortunately, not something I can easily test.
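For reference, here is what the -s values under discussion translate to in section sizes (a sketch; the 2 MiB segment size is the f2fs default, an assumption about this setup):

```python
# mkfs.f2fs -s<N> sets the number of segments per section; the section
# is the unit the GC relocates.  Assumes f2fs's default 2 MiB segments.

SEG_MIB = 2

for n in (1, 8, 64, 90, 128):
    print(f"-s{n:<3} -> {n * SEG_MIB:3d} MiB per section")
```

So -s128 gives 256 MiB sections (presumably matching the drive's zone size), while -s1 drops all the way to 2 MiB, which is why the GC granularity question matters so much here.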
> Let me test this patch for a while, and then push into our git.
Thanks, will do so, then.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-22 20:22:12
|
Third test, using the full device, on linux 4.2.1

mkfs.f2fs -l COLD1 -o1 -a0 -d1 -s128 /dev/mapper/xmnt-cold1
mount -tf2fs -onoatime,flush_merge,active_logs=2,no_heap /dev/mapper/xmnt-cold1 /cold1

Unfortunately, mount failed. The kernel showed that a high-order
allocation could not be satisfied:

mount: page allocation failure: order:7, mode:0x40d0
...
F2FS-fs (dm-18): Failed to initialize F2FS segment manager

(http://data.plan9.de/f2fs-mount-failure.txt)

I think this memory management is a real problem; the server was booted
about 20 minutes earlier and had 23GB of free RAM (used for cache). I was
able to mount it by dropping the page cache, but clearly this shouldn't be
necessary.

After this, df showed 185GB in use, which is more like 3%, not 1%; again,
overprovisioning seems to be out of bounds.

I started copying files with tar|tar; after 10GB, I restarted, which
started to overwrite the existing 10GB of files.

Unfortunately, this time the GC kicked in every 10-20 seconds, slowing
down writing. I don't know what triggered it this time, but I am quite
sure that at less than 1% utilisation it shouldn't feel the need to GC
while the disk is busy writing.
After 90GB were written, I decided to simulate a disk problem by deleting
the device (to avoid any corruption issues the disk itself might have):

echo 1 >/sys/block/sde/device/delete

After rescanning the device, I used fsck.f2fs on it, and it failed quickly:

Info: superblock features = 0 :
Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
Info: total FS sectors = 15628050432 (7630884 MB)
Info: CKPT version = 2
[ASSERT] (restore_node_summary: 688) ret >= 0
[Exit 255]

Re-running it with -f failed differently, but also quickly:

Info: superblock features = 0 :
Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
Info: total FS sectors = 15628050432 (7630884 MB)
Info: CKPT version = 2
[ASSERT] (get_current_sit_page: 803) ret >= 0
[Exit 255]

I'll reformat and try without any simulated problems.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-22 23:09:01
|
Thank you for the test.

On Tue, Sep 22, 2015 at 10:22:02PM +0200, Marc Lehmann wrote:
> Third test, using the full device, on linux 4.2.1
>
> mkfs.f2fs -l COLD1 -o1 -a0 -d1 -s128 /dev/mapper/xmnt-cold1

Could you check without -o1? I merged a patch into f2fs-tools to
calculate the best overprovision at runtime.

Originally, even if you set a specific overprovision ratio, mkfs.f2fs
calculates the real space again. For example, if you set 1%, we need to
reserve 100 sections to do cleaning in the worst case. That's why you
cannot see the reserved area as just 1% of the total space.

> mount -tf2fs -onoatime,flush_merge,active_logs=2,no_heap /dev/mapper/xmnt-cold1 /cold1
>
> Unfortunately, mount failed with. The kernel showed that a high order
> allocation could not be satisfied:
>
> mount: page allocation failure: order:7, mode:0x40d0
> ...
> F2FS-fs (dm-18): Failed to initialize F2FS segment manager
> (http://data.plan9.de/f2fs-mount-failure.txt)

I think the patch below should resolve this issue.

> I think this memory management is a real problem - the server was booted
> about 20 minutes earlier and had 23GB free ram (used for cache). I was able
> to mount it by dropping the page cache, but clearly this shouldn't be
> neccessary.
>
> After this, df showed 185GB in use, which is more like 3%, not 1% - again
> overprovisioning seems to be out of bounds.

Actually, the 185GB should include FS metadata as well as reserved or
overprovision space. It would be good to check the on-disk layout with
fsck.f2fs.

> I started copying files with tar|tar, after 10GB, I restarted, which started
> to overwrite the existing 10GB files.
>
> Unfortunately, this time the GC kicked in every 10-20 seconds, slowing down
> writing times. I don't know what triggered it this time, but I am quite sure
> at less than 1% utilisation it shouldn't feel the need to gc while the disk
> is busy writing.
> After 90GB were written, I decided to simulate a disk problem by deleting
> the device (to avoid any corruption issues the disk itself might have):
>
> echo 1 >/sys/block/sde/device/delete
>
> After rescanning the device, I used fsck.f2fs on it, and it failed quickly:
>
> Info: superblock features = 0 :
> Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
> Info: total FS sectors = 15628050432 (7630884 MB)
> Info: CKPT version = 2
> [ASSERT] (restore_node_summary: 688) ret >= 0
> [Exit 255]
>
> Re-running it with -f failed differently, but also quickly:
>
> Info: superblock features = 0 :
> Info: superblock encrypt level = 0, salt = 00000000000000000000000000000000
> Info: total FS sectors = 15628050432 (7630884 MB)
> Info: CKPT version = 2
> [ASSERT] (get_current_sit_page: 803) ret >= 0
> [Exit 255]

Actually, this doesn't report f2fs inconsistency. Instead, these two
errors are from lseek64() and read() failures in dev_read():
lib/libf2fs_io.c.

Maybe ENOMEM? Can you check the errno of this function?

Thanks,

> I'll reformat and try without any simulated problems.
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Lin...@li...
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel

From d495b00a2f04c0ec5e6c6d95c9e66bdba45b174c Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <ja...@ke...>
Date: Tue, 22 Sep 2015 13:50:47 -0700
Subject: [PATCH] f2fs: use vmalloc to handle -ENOMEM error

This patch introduces f2fs_kvmalloc to avoid -ENOMEM during mount.
Signed-off-by: Jaegeuk Kim <ja...@ke...>
---
 fs/f2fs/f2fs.h    | 11 +++++++++++
 fs/f2fs/segment.c |  9 ++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 79c38ad..553529d 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -19,6 +19,7 @@
 #include <linux/magic.h>
 #include <linux/kobject.h>
 #include <linux/sched.h>
+#include <linux/vmalloc.h>
 #include <linux/bio.h>
 
 #ifdef CONFIG_F2FS_CHECK_FS
@@ -1579,6 +1580,16 @@ static inline bool f2fs_may_extent_tree(struct inode *inode)
 	return S_ISREG(mode);
 }
 
+static inline void *f2fs_kvmalloc(size_t size, gfp_t flags)
+{
+	void *ret;
+
+	ret = kmalloc(size, flags | __GFP_NOWARN);
+	if (!ret)
+		ret = __vmalloc(size, flags, PAGE_KERNEL);
+	return ret;
+}
+
 #define get_inode_mode(i) \
 	((is_inode_flag_set(F2FS_I(i), FI_ACL_MODE)) ? \
 	 (F2FS_I(i)->i_acl_mode) : ((i)->i_mode))
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 78e6d06..13567ad 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -14,7 +14,6 @@
 #include <linux/blkdev.h>
 #include <linux/prefetch.h>
 #include <linux/kthread.h>
-#include <linux/vmalloc.h>
 #include <linux/swap.h>
 
 #include "f2fs.h"
@@ -2028,12 +2027,12 @@ static int build_free_segmap(struct f2fs_sb_info *sbi)
 	SM_I(sbi)->free_info = free_i;
 
 	bitmap_size = f2fs_bitmap_size(MAIN_SEGS(sbi));
-	free_i->free_segmap = kmalloc(bitmap_size, GFP_KERNEL);
+	free_i->free_segmap = f2fs_kvmalloc(bitmap_size, GFP_KERNEL);
 	if (!free_i->free_segmap)
 		return -ENOMEM;
 
 	sec_bitmap_size = f2fs_bitmap_size(MAIN_SECS(sbi));
-	free_i->free_secmap = kmalloc(sec_bitmap_size, GFP_KERNEL);
+	free_i->free_secmap = f2fs_kvmalloc(sec_bitmap_size, GFP_KERNEL);
 	if (!free_i->free_secmap)
 		return -ENOMEM;
 
@@ -2348,8 +2347,8 @@ static void destroy_free_segmap(struct f2fs_sb_info *sbi)
 	if (!free_i)
 		return;
 	SM_I(sbi)->free_info = NULL;
-	kfree(free_i->free_segmap);
-	kfree(free_i->free_secmap);
+	kvfree(free_i->free_segmap);
+	kvfree(free_i->free_secmap);
 	kfree(free_i);
 }
-- 
2.1.1
|
|
From: Marc L. <sc...@sc...> - 2015-09-23 03:50:43
|
On Tue, Sep 22, 2015 at 04:08:50PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> Could you check without -o1, since I merged a patch to calculate the best
> overprovision at runtime in f2fs-tools.

I assume I have it, as git pull didn't give me any updates.

> For example, if you set 1%, we need to reserve 100 sections to do cleaning at
> the worse case. That's why you cannot see the reserved area as just 1% over
> total space.

Ok, I tried with -o1, -o5, and no -o switch at all:

   switch   df -h "Used"
   -o1      126GiB
   -o5      384GiB
   ""       126GiB

So indeed, the manpage (which says -o5 is the default) doesn't match the
behaviour. With -s1 instead of -s128, I get: 75GiB

100 sections at -s128 would be 25G, so I wonder what the remaining 101GiB
are (or the remaining 75GiB).

Don't get me wrong, a "default" ext4 gives me a lot less initial space,
but that's why I don't use ext4. XFS (which has a lot more on-disk data
structures) gives me 100GB more space, which is not something to be
trifled with. If f2fs absolutely needs this space, so be it, but at the
moment, it feels excessive.

I'm also ok with having to wait for GC when the disk is almost completely
full. The issue I have at this point is that f2fs reserves a LOT of space,
and long before the disk comes anywhere near full, it basically stops
working. That's the point where I would have to wait for the GC, but f2fs
just seems to sit idle.

> > F2FS-fs (dm-18): Failed to initialize F2FS segment manager
> > (http://data.plan9.de/f2fs-mount-failure.txt)
>
> I think the below patch should resolve this issue.

Sounds cool!

> > After this, df showed 185GB in use, which is more like 3%, not 1% - again
> > overprovisioning seems to be out of bounds.
>
> Actually, 185GB should include FS metadata as well as reserved or overprovision
> space. It would be good to check the on-disk layout by fsck.f2fs.

That's a lot of metadata for an empty filesystem.
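The "100 sections" rule from the quoted explanation can be checked against these numbers directly (a sketch; the 2 MiB segment size is an assumption based on the f2fs default):

```python
# If mkfs.f2fs reserves 100 sections for worst-case cleaning, how much
# space is that at the -s values tried above?  Assumes 2 MiB segments.

SEG_MIB = 2
RESERVED_SECTIONS = 100

for s in (1, 128):
    gib = RESERVED_SECTIONS * s * SEG_MIB / 1024
    print(f"-s{s}: {gib:g} GiB reserved for cleaning")
```

So the cleaning reserve accounts for the 25G at -s128 but not the additional ~100GiB, and it explains almost nothing of the 126GiB shown for -s1.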
> > [ASSERT] (get_current_sit_page: 803) ret >= 0
> > [Exit 255]
>
> Actually, this doesn't report f2fs inconsistency.
> Instead, these two errors are from lseek64() and read() failures in dev_read():
> lib/libf2fs_io.c.
>
> Maybe ENOMEM? Can you check the errno of this function?

That's very strange; if the kernel failed, I would expect some dmesg
output, but the fs was mountable before and after.

Unfortunately, I already went onwards with the next test.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-23 01:12:51
|
On Mon, Sep 21, 2015 at 11:58:06AM +0200, Marc Lehmann wrote:
> Second test - we're getting there:
>
> Summary: looks much better, no obvious corruption (but fsck still gives
> tens of thousands of [FIX] messages), performance somewhat as expected,
> but a 138GB partition can only store 71.5GB of data (avg filesize 2.2MB)
> and f2fs doesn't seem to do visible background GC.
>
> For this test, changed a bunch of parameters:
>
> 1. partition size
>
> 128GiB instead of 512GiB (not ideal, but I wanted this test to be
> quick)
>
> 2. mkfs options
>
> mkfs.f2fs -lTEST -o5 -s128 -t0 -a0 # change: -o5 -a0

Please, check without -o5.

> 3. mount options
>
> mount -t f2fs -onoatime,flush_merge,active_logs=2,no_heap
> # change: no inline_* options, no extent_cache, but no_heap + active_logs=2

Hmm. Is it necessary to reduce the number of active_logs? Only two logs
would increase the GC overheads significantly.

And, you can use inline_data in v4.2. In v4.3, I expect extent_cache will
be stable and usable.

> First of all, the discrepancy between utilization in the status file, du
> and df is quite large:
>
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg_test-test 128G 106G 22G 84% /mnt
>
> # du -skc /mnt
> 51674268 /mnt
> 51674268 total
>
> Utilization: 67% (13168028 valid blocks)

Ok. I could retrieve the on-disk layout from the below log.

In the log, the overprovision area is set to about 54GB. However, when I
tried mkfs.f2fs with the same options, I got about 18GB.

Could you share the mkfs.f2fs messages and fsck.f2fs -d3 output as well?

> So ~52GB of files take up ~106GB of the partition, which is 84% of the
> total size, yet it's only utilized by 67%.
>
> Second, and subjectively, the filesystem was much more responsive during
> the test- find almost instantly give ssome output, instead of having to
> wait for half a minute, and find|rm is much faster as well. find also
> reads data at ~2mb/s, while in the previous test, it was 0.7MB/s (which
> can be good or bad, but it looks good).
>
> At 6.7GB free (df: 95%, status: 91%, du: 70/128GiB) I paused rsync. The disk
> then did some heavy read/write for a short while, and the Dirty: count
> reduced:
>
> http://ue.tst.eu/d61a7017786dc6ebf5be2f7e2d2006d7.txt
>
> I continued, and the disk afterwards did almost the same amount of reading
> as it was writing, with short intzermittent write-only periods for a fe
> seconds each. Rsync itself was noticably slower, so I guess f2fs finally
> ran out of space and did garbage collect.
>
> This is exactly the behaviour I did expect of f2fs, but this is the first
> time I actually saw it.
>
> Pausing didn't result in any activity.
>
> At 6.3GB free, disk write speed went down to 1MB/s with intermittent
> phases of 100MB/s write only, or 50MB/s read + 50MB/s write (but rsync was
> transferring about 100kb/s at this point only, so no real progress was
> made).
>
> After about 10 minutes I paused rsync again, still at 6.3GB free (df
> reporting 96% in use, status 91% and du 52% (71.5GB))
>
> I must admit I don't understand these ratios - df vs. status can easily
> be explained by overprovisioning, but the fact that a 138GB (128GiB)
> partition can only hold 72GB of data with very few small files is not
> looking good to me:
>
> # df -H /mnt
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg_test-test 138G 130G 6.3G 96% /mnt
> # du -skc /mnt
> 71572620 /mnt
>
> I wonder what this means, too:
>
> MAIN: 65152(OverProv:27009 Resv:26624)

Yeah, that's the hint that the overprovision area abnormally occupies
54GB. I think there is something wrong with your mkfs.f2fs when
calculating the reserved space. I need to take a look at the mkfs.f2fs
log.

> Surely this doesn't mean that 27009 of 65152 segments are for
> overprovisioning? That would explain the bad values for due, but then, I
> did specify -o5, not -o45 or so.
>
> status at that point was:
>
> http://ue.tst.eu/f869dfb6ac7b4d52966e8eb012b81d2a.txt
>
> Anyways, I did more thinning to regain free space by deleting every 10th
> file. That went reasonably slow, the disk was contantly reading + writing at
> high speed, so I guess it was busy garbage colelcting, as it should.
>
> status after deleting, with completely idle disk:
>
> http://ue.tst.eu/1831202bc94d9cd521cfcefc938d2095.txt
>
> /dev/mapper/vg_test-test 138G 123G 15G 90% /mnt
>
> I waited a few minutes, but there was no further activity. I then unpaused
> the rsync, which proceeded with good speed again.
>
> At 11GB free, rsync effectively stopped, and the disk went to ~1MB/s wrtite
> mode aagin. Pausing rsync didn't cause I/O to stop this time, it continued
> for a few minutes.
>
> I waited for 2 minutes with no disk I/O, unpaused rsync, and the disk
> immediately went into 1MB/s write mode againh, with rsync not really
> getting any data through though.
>
> It's as if f2fs only tried to clean up when there is write data. I would
> expect a highly fragmented f2fs to be very busy garbage collecting, but
> apparently, not so, it just idles, and when a program wants to write,
> fails to perform. Maybe I need to give it more time than two minutes, but
> then, I wouldn't see a point in delaying to garbage collect if it has to
> be done anyways.
>
> In any case, no progress possible, I deleted more files again, this time
> every 5th file, which went reasonably fast,
>
> status after delete:
>
> http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt
>
> /dev/mapper/vg_test-test 138G 114G 23G 84% /mnt
>
> rsync writing was reasonably fast down to 18GB, when rsync stopped making
> much profgress (<100kb/s), but the disk wasn't in "1MB/s mode" but instead in
> 40MB/s read+write, which looks reasonable to me, as the disk was probably
> quite fargmented at this point:
>
> http://ue.tst.eu/fb3287adf4cc109c88b89f6120c9e4a6.txt
>
> However, when pausing rsync, f2fs immediatelly ceased doing anything again,
> so even though clearly there is a need for clean up activities, f2fs doesn't
> do them.

It seems that the reason f2fs didn't do GC is that all the sections had
already been traversed by background GC. In order to reset that, it needs
to trigger a checkpoint, but it couldn't meet the condition in the
background.

How about calling "sync" before leaving the system idle? Or, you can try
decreasing the number in /sys/fs/f2fs/xxx/reclaim_segments to 256 or 512.

> To state this more clearly: My expectation is that when f2fs runs out of
> immediatelly usable space for writing, it should do GC. That means that
> when rsync is very slow and the disk is very fragmented, even when I pause
> rsync, f2fs should GC at full speed until it has a reasonable amount of
> usable free space again. Instead, it apparently just sits idle until some
> program generates write data.
>
> At this point, I unmounted the filesystem and "fsck.f2fs -f"'ed it. The
> report looked good:
>
> [FSCK] Unreachable nat entries [Ok..] [0x0]
> [FSCK] SIT valid block bitmap checking [Ok..]
> [FSCK] Hard link checking for regular file [Ok..] [0x0]
> [FSCK] valid_block_count matching with CP [Ok..] [0xe8b623]
> [FSCK] valid_node_count matcing with CP (de lookup) [Ok..] [0xa58a]
> [FSCK] valid_node_count matcing with CP (nat lookup) [Ok..] [0xa58a]
> [FSCK] valid_inode_count matched with CP [Ok..] [0x7800]
> [FSCK] free segment_count matched with CP [Ok..] [0x8a17]
> [FSCK] next block offset is free [Ok..]
> [FSCK] fixing SIT types
>
> However, there were about 30000 messages like these:
>
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf6] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf7] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf8] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdf9] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfa] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfb] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfc] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfd] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdfe] 0 -> 1
> [FIX] (check_sit_types:1056) --> Wrong segment type [0xfdff] 0 -> 1
> [FSCK] other corrupted bugs [Ok..]
>
> That's not promising, why does it think it needs to fix anything?

I need to take a look at fsck.f2fs's handling of two active logs. Anyway,
this doesn't break the core FS consistency, so you can ignore them.

> I mounted the partition again. Listing the files was very fast. I deleted all
> the files and ran rsync for a while. It seems the partition completely
> recovered. This is the empty state btw.:
>
> Filesystem Size Used Avail Use% Mounted on
> /dev/mapper/vg_test-test 138G 57G 80G 42% /mnt
>
> So, all the pathological behaviour is gone (no 20kb/s write speed blocking
> the disk for hours, more importantly, no obvious filesystem corruption,
> although the fsck messages need explanation).
>
> Moreso, the behaviour, while still confusing (weird du vs. df, no background
> activity), at least seems to be in line with what I expect - fragmentation
> kills performance, but f2fs seems capable of recovering.
>
> So here is my wishlist:
>
> 1. the overprovisioning values seems to be completely out of this world. I'm
> prepared top give up maybe 50GB of my 8TB disk for this, but not more.

Maybe it needs to be compared with other filesystems' *available* space,
since many of them hide additional FS metadata initially.

> 2. even though ~40% of space is not used by file data, f2fs still becomes
> extremely slow. this can't be right.

I think it was due to the wrong overprovision space. That number needs to
be checked first.

> 3. why does f2fs sit idle on a highly fragmented filesystem, why does it not
> do background garbage collect at maximum I/O speed, so the filesystem is
> ready when the next writes come?

I suspect the section size is too large compared to the whole partition
size; there are only 509 sections. Each GC pass selects a victim in units
of a section, and background GC will not select previously visited
sections again. So I think the GC quickly traverses all the sections and
goes to bed, since there are no new victims.

So, I think a checkpoint ("sync") resets the whole history and makes
background GC conduct its job again.

Thank you, :)

> Greetings, and good night :)
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Lin...@li...
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
|