Thread: Re: [f2fs-dev] SMR drive test 2; 128GB partition; no obvious corruption, much more sane behaviour, weird overprovisioning (Page 2)
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-26 07:37:05
|
On Sat, Sep 26, 2015 at 05:32:53AM +0200, Marc Lehmann wrote:
> On Fri, Sep 25, 2015 at 10:45:46AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > > He :) It's a nothing-special number between 64 and 128, that's all.
> >
> > Oh, then, I don't think that is a good magic number.
>
> Care to share why? :)

Mostly, in flash storage, it is a multiple of 2MB, normally. :)

> > It seems that you decided to use -s64, so it'd be better to keep it to address
> > any perf results.
>
> Is there anything specially good about powers of two? Or do you just want to
> reduce the number of changed variables?

IMO, as with flash storage, it needs an investigation of the raw device
characteristics. I think this can be used for SMR too.

https://github.com/bradfa/flashbench

I think there might be some hints for the section size at first, and for
performance variation as well.

> If yes, should I do the 3.18.21 test with -s90 (as in the 3.18.21 and 4.2.1
> tests before), or with -s64?
>
> > > And just filling these 8TB disks takes days, so the question is, can I
> > > simulate near-full behaviour with smaller partitions.
> >
> > Why not? :)
> > I think the behavior should be the same. And, it'd be good to set small sections
> > in order to see it more clearly.
>
> The section size is a critical parameter for these drives. Also, the data
> mix is the same for 8TB and smaller partitions (in these tests, which were
> meant to be the first round of tests only anyway).
>
> So a smaller section size compared to the full partition test, I think,
> would result in very different behaviour. Likewise, if a small partition
> has comparatively more (or absolutely less) overprovisioning (and/or reserved
> space), this again might cause different behaviour.
>
> At least to me, it's not obvious what a good comparable overprovision ratio
> is to test full-device behaviour on a smaller partition.
>
> Also, section sizes vary by a factor of two over the device, so what might
> work fine with -s64 in the middle of the disk might work badly at the end.
>
> Likewise, since the files don't get larger, the GC might do a much better
> job at -s64 than at -s128 (almost certainly, actually).
>
> As a thought experiment, what happens when I use -s8 or a similarly small
> size? If the GC writes linearly, there won't be too many RMW cycles. But is
> that guaranteed even with an aging filesystem?
>
> If yes, then the best -s number might be 1. Because all I rely on is
> mostly linear batched large writes, not so much large batched reads.
>
> That is, unfortunately, not something I can easily test.
>
> > Let me test this patch for a while, and then push into our git.
>
> Thanks, will do so, then.
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
|
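For reference, a minimal sketch of how the suggested flashbench probing is
usually invoked (flag names are from memory of the tool's README, so treat
them as assumptions; and, as Marc notes further down, timing-based probing
may reveal little on an SMR drive with a large persistent write cache):

    # Hedged sketch: probe candidate alignment boundaries with flashbench.
    # The -a test only reads, but double-check the device name anyway.
    git clone https://github.com/bradfa/flashbench
    make -C flashbench
    sudo ./flashbench/flashbench -a /dev/sdX --blocksize=1024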
|
From: Marc L. <sc...@sc...> - 2015-09-23 04:15:33
|
On Tue, Sep 22, 2015 at 06:12:39PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> Hmm. Is it necessary to reduce the number of active_logs?
I don't know, the documentation isn't very forthcoming with details :)
In any case, this is just for testing. My rationale was that multiple logs
probably mean that there are multiple sequential write zones. Reducing those
to only two logs would help the disk. Probably. Maybe.
> increase the GC overheads significantly.
Can you elaborate? I do get a speed improvement with only two logs, but of
course, GC time is an important factor, so maybe more logs would be a
necessary trade-off.
> And, you can use inline_data in v4.2.
I think I did - the documentation says inline_data is the default.
> > Filesystem Size Used Avail Use% Mounted on
> > /dev/mapper/vg_test-test 128G 106G 22G 84% /mnt
> >
> > # du -skc /mnt
> > 51674268 /mnt
> > 51674268 total
> >
> > Utilization: 67% (13168028 valid blocks)
>
> Ok. I could retrieve the on-disk layout from the below log.
> In the log, the overprovision area is set as about 54GB.
> However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
> Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?
When I re-ran the mkfs.f2fs, I got:
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_test-test 138G 20G 118G 14% /mnt
I didn't note down the overhead in my test, the df I had was when the disk
was filled, so it possibly changed(?) at runtime?
(I tried Debian's mkfs.f2fs, but it gave identical results).
I'll redo the 128GiB test and see if I can get similar results.
> > However, when pausing rsync, f2fs immediately ceased doing anything again,
> > so even though clearly there is a need for clean up activities, f2fs doesn't
> > do them.
>
> It seems that why f2fs didn't do gc was that all the sections were traversed
> by background gc. In order to reset that, it needs to trigger checkpoint, but
> it couldn't meet the condition in background.
>
> How about calling "sync" before leaving the system as idle?
> Or, you can check decreasing the number in /sys/fs/f2fs/xxx/reclaim_segments to
> 256 or 512?
Will try next time. I distinctly remember that sync didn't do anything to
pre-free and free, though.
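Concretely, the suggestion above amounts to something like this (a sketch
only - the /sys/fs/f2fs/dm-1 path is the instance used later in this thread,
and 256 is just one of the suggested values):

    sync                                            # force a checkpoint so pre-free segments are freed
    cat /sys/fs/f2fs/dm-1/reclaim_segments          # current threshold, in segments
    echo 256 > /sys/fs/f2fs/dm-1/reclaim_segments   # reclaim pre-free segments sooner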
> > 1. the overprovisioning values seems to be completely out of this world. I'm
> > prepared to give up maybe 50GB of my 8TB disk for this, but not more.
>
> Maybe, it needs to check with other filesystems' *available* spaces.
> Since, many of them hide additional FS metadata initially.
I habitually compare free space between filesystems. While f2fs is better
than ext4 with default settings (and even with some tuning), ext4 is quite
well known to have excessive preallocated metadata requirements.
As mentioned in my other mail, XFS for example has 100GB more free
space than f2fs on the full 8TB device, and from memory I expect other
filesystems without fixed inode numbers (practically all of them) to be
similar.
> > 3. why does f2fs sit idle on a highly fragmented filesystem, why does it not
> > do background garbage collect at maximum I/O speed, so the filesystem is
> > ready when the next writes come?
>
> I suspect the section size is too large comparing to the whole partition size,
> which number is only 509. Each GC selects a victim in a unit of section and
> background GC would not select again for the previously visited ones.
> So, I think GC is easy to traverse whole sections, and go to bed since there
> is no new victims. So, I think checkpoint, "sync", resets whole history and
> makes background GC conduct its job again.
The large section size is of course the whole point of the exercise, as
hopefully this causes the GC to do larger sequential writes. It's clear
that this is not a perfect match for these SMR drives, but the goal is to
have acceptable performance, faster than a few megabytes/s. And indeed,
when the GC runs, it gets quite good I/O performance in my test (deleting
every nth file makes comparatively small holes, so the GC has to copy most
of the section).
Now, the other thing is that the GC, when it triggers, isn't very
aggressive - when I saw it, it was doing something every 10-15 seconds,
with the system being idle, when it should be more or less completely busy.
I am aware that "idle" is a difficult to inmpossible condition to detect
- maybe this could be made more tunable (I tried to play around with the
gc_*_time values, but probably due to lack of documentation. I didn't get
very far, and couldn't correlate the behaviour I saw with the settings I
made).
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-23 06:06:44
|
> by metadata could be reduced, I'd risk f2fs in production in one system
> here.
Oh, and please, I beg you, consider increasing the hardlink limit to >16
bits - look at other filesystems: many thought they could get away with
16 bits (ext*, xfs, ...), but all of them nowadays support 31 bits or more
for the hardlink count :) Merely 18 bits would probably suffice :)
While 65535 will just work at the moment for me (my largest directory has
~62000 subdirectories, and I can halve this with some extra work), it's
guaranteed to fail sooner or later.
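A quick way to see how close an existing tree comes to a 16-bit link count is
to look at directory link counts (2 plus the number of subdirectories); a
sketch, assuming GNU find and data mounted under /mnt:

    # print the largest hardlink counts first; anything near 65535 will break
    find /mnt -xdev -type d -printf '%n %p\n' | sort -rn | head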
Thanks for listening (even if you decide against it :).
Greetings,
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-26 13:54:03
|
On Sat, Sep 26, 2015 at 12:36:55AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > Care to share why? :)
>
> Mostly, in flash storage, it is a multiple of 2MB, normally. :)

Well, any value of -s gives me a multiple of 2MB, no? :)

> > Is there anything specially good about powers of two? Or do you just want to
> > reduce the number of changed variables?
>
> IMO, as with flash storage, it needs an investigation of the raw device
> characteristics.

Keep in mind that I don't use it for flash, but for SMR drives.

We already know the raw device characteristics: basically, the zones are
between 15 and 40 or so MB in size (on the Seagate 8TB drive), and they
likely don't have "even" sizes at all.

It's also by far not easy to benchmark these things - the disks can
buffer up to 25GB of random writes (and then might need several hours of
cleanup). Failing a linear write incurs a 0.6-1.6s penalty, to be paid
much later. It's a shame that none of the drive companies actually release
any usable info on their drives.

These guys made a hole in the disk and devised a lot of benchmarks to
find out the characteristics of these drives:

https://www.usenix.org/system/files/conference/fast15/fast15-paper-aghayev.pdf

So, the strategy for a fs would be to write linearly, most of the time,
without any gaps. f2fs (at least in 3.18.x) manages to do that very
nicely, which is why I really try to get it working.

But for writing once, any value of -s would probably suffice. There are
two problems when the disk gets full:

a) ipu writes - the drive can't do them, so gc might be cheaper.
b) reuse of sections - if sections are reasonably large and one gets freed
   and reused, it should be large enough to guarantee large linear writes again.

b) is the reason behind me trying large values of -s.

Since I know that f2fs is the only fs I tested that can sustain write
performance on these drives near the physical drive characteristics, all
that needs to be done is to see how f2fs performs after it starts gc'ing.

That's why I am so interested in disk-full conditions - writing the disk
linearly once is easy, I can just write a tar to the device. Ensuring that
writes stay large and linear after deleting and cleaning up is harder.

nilfs is a good example - it should fit SMR drives perfectly, until they
are nearly full, after which nilfs still matches SMR drives perfectly,
but waiting for 8TB to be shuffled around to delete some files can take
days. More surprising is that nilfs fails phenomenally with these drives,
performance-wise, for reasons I haven't investigated (my guess is that
nilfs leaves gaps).

> I think this can be used for SMR too.

You can run any blockdevice operation on these drives, but the results
from flashbench will be close to meaningless for them. For example, you
can't distinguish a nonaligned write causing a read-modify-write from an
aligned large write, or a partial write, by access time, as they will
probably all have similar access times.

> I think there might be some hints for section size at first and performance
> variation as well.

I think you confuse these drives with flash drives - while they share some
characteristics, they are completely unlike flash. There is no translation
layer, there is no need for wear leveling, zones have widely varying
sizes, and appending can be expensive or cheap, depending on the write size.

What these drives need is primarily large linear writes without gaps;
secondarily, any optimisations for rotational media apply. (And for that,
f2fs performs unexpectedly well, given it wasn't meant for rotational media.)

Now, if f2fs can be made to (mostly) work bug-free, but with the
characteristics of 3.18.21, and the gc can ensure that reasonably big
areas spanning multiple zones will be reused, then f2fs will be the _ONLY_ fs
able to take care of drive-managed SMR disks efficiently.

Specifically, these filesystems do NOT work well with these drives:

nilfs, zfs, btrfs, ext4, xfs

And modifications for these filesystems are either far away in the
future, or not targeted at drive-managed disks (ext4 already has some
modifications, but they are clearly not very suitable for actual drives,
assuming as they do that these drives have a fast area near the start of
the disk, which isn't the case). But these disks are not uncommon (Seagate
is shipping them by the millions), and will stay with us for quite a while.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-23 06:00:48
|
On Wed, Sep 23, 2015 at 06:15:24AM +0200, Marc Lehmann <sc...@sc...> wrote:
> > However, when I tried to do mkfs.f2fs with the same options, I got about 18GB.
> > Could you share the mkfs.f2fs messages and fsck.f2fs -d3 as well?
>
> When I re-ran the mkfs.f2fs, I got:

I get the feeling I did something idiotic, but for the life of me, I don't
know what. I see the mkfs.f2fs in my test log, I see it in my command
history, but I can't reproduce it.

So let's disregard this and go to the next test - I redid the 128G partition
test, with 6 active logs, no -o, and -s64:

mkfs.f2fs -lTEST -s64 -t0 -a0

This allowed me to arrive at this state, at which rsync stopped making
progress:

root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg_test-test  138G  137G  803k 100% /mnt

This would be about perfect (I even got ENOSPC for the first time!).
However, when I do my "delete every nth file":

/dev/mapper/vg_test-test  138G  135G  1.8G  99% /mnt

The disk still sits mostly idle. I did verify that "sync" indeed reduces
Pre-Free to 0, and I do see some activity every ~30s now, though:

http://ue.tst.eu/ac1ec447de214edc4e007623da2dda72.txt (see the dsk/sde columns)

If I start writing, I guess I trigger the foreground gc:

http://ue.tst.eu/1dfbac9166552a95551855000d820ce9.txt

The first few lines there are some background gc activity (I guess), then I
started an rsync to write data - net/total shows the data rsync transfers.
After that, there is constant ~40MB read/write activity, but very little
actual write data gets to the disk (rsync makes progress at <100kb/s).

At some point I stop rsync (the single line with 0/0 for sde read and
write, after the second header), followed by sync a second later. Sync
does its job, and then there is no activity for a bit, until I start
rsync again, which immediately triggers the 40/40 mode, and makes little
progress.

So little to no gc activity, even though the filesystem really needs some
GC activity at this point.

If I play around with gc_* like this:

echo 1 >gc_idle
echo 1000 >gc_max_sleep_time
echo 5000 >gc_no_gc_sleep_time

Then I get a lot more activity:

http://ue.tst.eu/f05ee3ff52dc7814ee8352cc2d67f364.txt

But still, as you can see, a lot of the time the disk and the cpu are idle.

In any case, I think I am getting somewhere - until now all my tests ended in
an unusable filesystem sooner or later; this is the first one which shows
mostly expected behaviour.

Maybe -s128 (or -s256), with which I did my previous tests, are problematic?
Maybe active_logs=2 caused problems (but I only used this option recently)?
And the previous problems can be explained by using inline_dentry and/or
extent_cache.

Anyway, this behaviour is what I would expect, mostly.

Now, I could go with -s64 (128MB sections still span 4-7 zones with this
disk). Or maybe something uneven, such as -s90, if that doesn't cause
problems.

Also, if it were possible to tune the gc to be more aggressive when idle
(and mostly off if the disk is free), and possibly, if the loss of space
to metadata could be reduced, I'd risk f2fs in production in one system
here.

Greetings,

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
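For completeness, the "delete every nth file" aging pass used in these tests
is essentially the following sketch (n=10 and the traversal order are
placeholders, not necessarily what was actually run):

    n=10; i=0
    find /mnt -type f -print0 |
    while IFS= read -r -d '' f; do
        i=$((i + 1))
        [ $((i % n)) -eq 0 ] && rm -f -- "$f"
    done
    sync    # force a checkpoint so the freed blocks show up as pre-free/free segments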
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-28 18:34:04
|
On Sat, Sep 26, 2015 at 03:53:53PM +0200, Marc Lehmann wrote:

[...]

> But for writing once, any value of -s would probably suffice. There are
> two problems when the disk gets full:
>
> a) ipu writes - the drive can't do them, so gc might be cheaper.
> b) reuse of sections - if sections are reasonably large and one gets freed
>    and reused, it should be large enough to guarantee large linear writes again.
>
> b) is the reason behind me trying large values of -s.

Hmm. It seems that SMR has a 20~25GB cache to absorb random writes with a big
block map. Then, it uses a static allocation, which is a kind of very early
stage of FTL design, though.

Compared to flash, it seems that SMR degrades performance significantly due
to the internal cleaning overhead, so I can understand that it needs to
control IO patterns very carefully.

So, how about testing -s20, which seems reasonable to me?

+ direct IO can break the alignment too.

[...]

> Now, if f2fs can be made to (mostly) work bug-free, but with the
> characteristics of 3.18.21, and the gc can ensure that reasonably big
> areas spanning multiple zones will be reused, then f2fs will be the _ONLY_ fs
> able to take care of drive-managed SMR disks efficiently.

Hmm. f2fs has been deployed on smartphones for a couple of years so far.
The main work here would be about tuning it for SMR drives.
It's time for me to take a look at pretty big partitions. :)

Oh, anyway, have you tried just -s1 for fun?

Thanks,
|
|
From: Marc L. <sc...@sc...> - 2015-09-29 07:36:35
|
On Mon, Sep 28, 2015 at 11:33:52AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> Hmm. It seems that SMR has 20~25GB cache to absorb random writes with a big
> block map. Then, it uses a static allocation, which is a kind of very early
> stage of FTL shapes though.
Yes, very sucky. For my previous tests, though, the cache is essentially
irrelevant, and only makes it harder to diagnose problems (it is very
helpful under light load, though).
> Comparing to flash, it seems that SMR degrades the performance significantly
> due to internal cleaning overhead, so I could understand that it needs to
> control IO patterns very carefully.
Yes, basically every write that ends (in time) before the zone boundary
requires RMW. Even writes that cross the zone boundary might require RMW as
the disk can probably only overwrite the zone partially once before having to
rewrite it fully again.
Since basically every write ends within a zone, the only way to keep
performance is to have very large sequential writes crossing multiple
zones, in multiple chunks, quick enough so the disk doesn't consider the
write as finished. Large meaning 100MB+.
> So, how about testing -s20, which seems reasonable to me?
I can test with -s20, but I fail to see why that is reasonable: -s20 means
40MB, which isn't even as large as a single large zone, so it spells disaster
in my book, basically causing an RMW cycle for every single section.
(Hopefully I just don't understand f2fs well enough.)
In any case, if -s20 is reasonable, then I would assume -s1 would also be
reasonable, as both cause sections to be no larger than a zone.
> > characteristics of 3.18.21, and the gc can ensure that reasonably big
> > areas spanning multiple zones will be reused, then f2fs will be the _ONLY_ fs
> > able to take care of drive managed smr disks efficiently.
>
> Hmm. The f2fs has been deployed on smartphones for a couple of years so far.
> The main stuffs here would be about tuning it with SMR drives.
Well, I don't want to sound too negative, and honestly, now that I have
gathered more experience with f2fs I do start to consider it for a lot more
than originally anticipated (I will try to replace ext4 with it for a database
partition on an SSD, and I do think f2fs might be a possible replacement for
traditional fs's on rotational media as well).
However, it's clearly far from stable - the amount of data corruption I got
with documented options was enormous, and the fact that sync hangs and
freezes the fs in 3.18.21 is a serious show-stopper.
You would expect it not to work well out of the box with SMR drives, but
the reality is that all my early tests showed that f2fs works fine
(compared to other filesystems, even stellar!) on SMR drives, but isn't
stable in itself, independent of the drive technology. Only the later
kernels fail to perform with SMR drives, and that might or might not be
fixable.
> It's the time for me to take a look at pretty big partitions. :)
I also have no issue if large partitions pose a problem for f2fs - I
am confident that this can be fixed. Can't wait to use it for some 40TB
partitions and see how it performs in practice :)
In fact, I think f2fs + dmcache (with google modifications) + traditional
rotational drives might deliver absolutely superior performance to XFS,
which is my current workhorse for such partitions.
(One would hope fsck times could be improved for this, although they are
not particularly bad at this time.)
> Oh, anyway, have you tried just -s1 for fun?
Will also try and see how it performs with the first hundred GB or so.
Then I will get the traces.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Chao Yu <cha...@sa...> - 2015-09-23 08:57:02
|
Hi Marc,

> -----Original Message-----
> From: Marc Lehmann [mailto:sc...@sc...]
> Sent: Wednesday, September 23, 2015 2:01 PM
> To: Jaegeuk Kim
> Cc: lin...@li...
> Subject: Re: [f2fs-dev] SMR drive test 2; 128GB partition; no obvious corruption, much more
> sane behaviour, weird overprovisioning

[...]

> If I play around with gc_* like this:
>
> echo 1 >gc_idle
> echo 1000 >gc_max_sleep_time
> echo 5000 >gc_no_gc_sleep_time

One thing I note is that gc_min_sleep_time is not set in your script, so in
some conditions gc may still sleep for gc_min_sleep_time (30 seconds by
default) instead of the gc_max_sleep_time which we expect. So setting
gc_min_sleep_time/gc_max_sleep_time as a pair is a better way of controlling
the sleeping time of gc.

> Also, if it were possible to tune the gc to be more aggressive when idle
> (and mostly off if the disk is free), and possibly, if the loss of space
> to metadata could be reduced, I'd risk f2fs in production in one system
> here.

In the 4.3-rc1 kernel, we have added a new ioctl to trigger gc in batches;
maybe we can use it as one option.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c1c1b58359d45e1a9f236ce5a40d50720c07c70e

Thanks,
|
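Putting that advice together with the settings quoted above, a tuning sketch
(paths assume the /sys/fs/f2fs/dm-1 instance from earlier in the thread; the
millisecond values are only illustrative):

    cd /sys/fs/f2fs/dm-1
    echo 1    > gc_idle              # prefer GC when the device looks idle
    echo 500  > gc_min_sleep_time    # set min and max as a pair, as suggested
    echo 1000 > gc_max_sleep_time
    echo 5000 > gc_no_gc_sleep_time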
|
From: Chao Yu <cha...@sa...> - 2015-09-23 09:11:13
|
Hi Marc,
The max hardlink number was increased to 0xffffffff by Jaegeuk in the
4.3-rc1 kernel; we can use it directly through a backport.
From a6db67f06fd9f6b1ddb11bcf4d7e8e8a86908d01 Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <ja...@ke...>
Date: Mon, 10 Aug 2015 15:01:12 -0700
Subject: [PATCH] f2fs: increase the number of max hard links
This patch increases the number of maximum hard links for one file.
Reviewed-by: Chao Yu <cha...@sa...>
Signed-off-by: Jaegeuk Kim <ja...@ke...>
---
fs/f2fs/f2fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 23bfc0c..8308488 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -321,7 +321,7 @@ enum {
*/
};
-#define F2FS_LINK_MAX 32000 /* maximum link count per file */
+#define F2FS_LINK_MAX 0xffffffff /* maximum link count per file */
#define MAX_DIR_RA_PAGES 4 /* maximum ra pages of dir */
--
2.5.2
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-23 21:30:19
|
Thanks Chao.
That's right.
On Wed, Sep 23, 2015 at 05:10:21PM +0800, Chao Yu wrote:
> Hi Marc,
>
> The max hardlink number was increased to 0xffffffff by Jaegeuk in 4.3 rc1
> Kernel, we can use it directly through backport.
>
> From a6db67f06fd9f6b1ddb11bcf4d7e8e8a86908d01 Mon Sep 17 00:00:00 2001
> From: Jaegeuk Kim <ja...@ke...>
> Date: Mon, 10 Aug 2015 15:01:12 -0700
> Subject: [PATCH] f2fs: increase the number of max hard links
>
> This patch increases the number of maximum hard links for one file.
>
> Reviewed-by: Chao Yu <cha...@sa...>
> Signed-off-by: Jaegeuk Kim <ja...@ke...>
> ---
> fs/f2fs/f2fs.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> index 23bfc0c..8308488 100644
> --- a/fs/f2fs/f2fs.h
> +++ b/fs/f2fs/f2fs.h
> @@ -321,7 +321,7 @@ enum {
> */
> };
>
> -#define F2FS_LINK_MAX 32000 /* maximum link count per file */
> +#define F2FS_LINK_MAX 0xffffffff /* maximum link count per file */
>
> #define MAX_DIR_RA_PAGES 4 /* maximum ra pages of dir */
>
> --
> 2.5.2
|
|
From: Marc L. <sc...@sc...> - 2015-09-23 23:12:07
|
On Wed, Sep 23, 2015 at 05:10:21PM +0800, Chao Yu <cha...@sa...> wrote:
> The max hardlink number was increased to 0xffffffff by Jaegeuk in 4.3 rc1
> Kernel, we can use it directly through backport.
That's absolutely wonderful news, thanks a lot!
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-23 21:29:40
|
On Wed, Sep 23, 2015 at 06:15:24AM +0200, Marc Lehmann wrote:
> On Tue, Sep 22, 2015 at 06:12:39PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > Hmm. Is it necessary to reduce the number of active_logs?
>
> I don't know, the documentation isn't very forthcoming with details :)

[...]

> > increase the GC overheads significantly.
>
> Can you elaborate? I do get a speed improvement with only two logs, but of
> course, GC time is an important factor, so maybe more logs would be a
> necessary trade-off.

This will help you to understand more precisely.

https://www.usenix.org/system/files/conference/fast15/fast15-paper-lee.pdf

One GC needs to move all the valid blocks inside a section, so if the section
size is too large, every GC is likely to show very long latency. In addition,
we need more overprovision space too.

And, if the number of logs is small, GC can suffer from moving hot and cold
data blocks together, which represent some temporal locality.

Of course, these numbers highly depend on storage speed and workloads, so
it needs to be tuned up.

Thanks,
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-23 22:08:33
|
On Wed, Sep 23, 2015 at 08:00:37AM +0200, Marc Lehmann wrote:

[...]

> This allowed me to arrive at this state, at which rsync stopped making
> progress:
>
> root@shag:/sys/fs/f2fs/dm-1# df -H /mnt
> Filesystem                Size  Used Avail Use% Mounted on
> /dev/mapper/vg_test-test  138G  137G  803k 100% /mnt

Could you please share /sys/kernel/debug/f2fs/status?

[...]

> If I play around with gc_* like this:
>
> echo 1 >gc_idle
> echo 1000 >gc_max_sleep_time
> echo 5000 >gc_no_gc_sleep_time

As Chao mentioned, if the system is idle, f2fs starts to do GC with
gc_min_sleep_time.

> Also, if it were possible to tune the gc to be more aggressive when idle
> (and mostly off if the disk is free), and possibly, if the loss of space
> to metadata could be reduced, I'd risk f2fs in production in one system
> here.

When I did mkfs.f2fs on 128GB, I got the following numbers.

option            overprovision area   reserved area
-o5 -s128                9094               6144
-o5 -s64                 6179               3072
-o5 -s1                  3309                 48
-o1 -s128               27009              26624
-o1 -s64                13831              13312
-o1 -s1                   858                208
-s1   (ovp:1%)            858                208
-s64  (ovp:1%)          13831              13312
-s128 (ovp:1%)          27009              26624

So, I'm convinced that your initial test set "-o1 -s128", which was an
unlucky trial. :)

Anyway, I've found a bug in the case without -o: "-s64" should select
another overprovision ratio instead of 1%. With the below patch, I could get:

-s1   (ovp:1%)            858                208
-s64  (ovp:4%)           6172               3712
-s128 (ovp:6%)           8721               5120

From 6e2b58dcaffc2d88291e07fa1f99773eca04a58f Mon Sep 17 00:00:00 2001
From: Jaegeuk Kim <ja...@ke...>
Date: Wed, 23 Sep 2015 14:59:30 -0700
Subject: [PATCH] mkfs.f2fs: fix wrong ovp space calculation on large section

If a section consists of multiple segments, we should change the equation
to apply it to the reserved space.

On 128GB,

option            overprovision area   reserved area
-o5 -s128                9094               6144
-o5 -s64                 6179               3072
-o5 -s1                  3309                 48
-o1 -s128               27009              26624
-o1 -s64                13831              13312
-o1 -s1                   858                208
-s1                       858                208
-s64  *                 13831              13312
-s128 *                 27009              26624

 : * should be wrong.

After patch,

-s1   (ovp:1%)            858                208
-s64  (ovp:4%)           6172               3712
-s128 (ovp:6%)           8721               5120

Signed-off-by: Jaegeuk Kim <ja...@ke...>
---
 mkfs/f2fs_format.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mkfs/f2fs_format.c b/mkfs/f2fs_format.c
index 21e74fe..2d4ab09 100644
--- a/mkfs/f2fs_format.c
+++ b/mkfs/f2fs_format.c
@@ -171,7 +171,8 @@ static u_int32_t get_best_overprovision(void)
 	}
 
 	for (; candidate <= end; candidate += diff) {
-		reserved = 2 * (100 / candidate + 1) + 6;
+		reserved = (2 * (100 / candidate + 1) + 6) *
+					get_sb(segs_per_sec);
 		ovp = (get_sb(segment_count_main) - reserved) * candidate / 100;
 		space = get_sb(segment_count_main) - reserved - ovp;
 		if (max_space < space) {
-- 
2.1.1
|
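As a sanity check, the reserved-area column in the after-patch table follows
directly from the new equation (integer division, counted in 2MB segments);
a quick shell verification sketch:

    # reserved = (2 * (100 / ovp_ratio + 1) + 6) * segs_per_sec
    for cfg in "1 1" "64 4" "128 6"; do
        set -- $cfg
        echo " -s$1 (ovp:$2%): reserved = $(( (2 * (100 / $2 + 1) + 6) * $1 )) segments"
    done
    # prints 208, 3712 and 5120 - matching the reserved column above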
|
From: Marc L. <sc...@sc...> - 2015-09-23 23:24:23
|
On Wed, Sep 23, 2015 at 02:29:31PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > Can you elaborate? I do get a speed improvement with only two logs, but of
> > course, GC time is an impoprtant factor, so maybe more logs would be a
> > necessary trade-off.
>
> This will help you to understand more precisely.
Thanks, will read more thoroughly, but that means I probably do want two logs.
Regarding your elaboration:
> One GC needs to move whole valid blocks inside a section, so if the section
> size is too large, every GC is likely to show very long latency.
> In addion, we need more overprovision space too.
That wouldn't increase the overhead in general though, because the
overhead depends on how much space is free in each section.
> And, if the number of logs is small, GC can suffer from moving hot and cold
> data blocks which represents somewhat temporal locality.
I am somewhat skeptical of this for one of my usages (archival),
because there is absolutely no way to know in advance what is hot and what
is cold. Example: a file might be deleted, but there is no way to know in
advance which one it will be. The only thing I know is that files never get
modified after being written once (but are often replaced). In another of my
usages, files do get modified, but there is no way to know in advance
which ones, and they will only ever be modified once (after initial
allocation).
So I am very suspicious of both static and dynamic attempts to separate
data into hot/cold. You can't know from file extensions, and you can't
know from past modification history.
The only applicability of hot/cold I can see is filesystem metadata and
directories (files get moved/renamed/added), and afaics, f2fs already does
that.
> Of course, these numbers highly depend on storage speed and workloads, so
> it needs to be tuned up.
From your original comment, I assumed that the gc somehow needs more logs
to be more efficient for some internal reason, but it seems it is
mostly a matter of section size (which I want to have "unreasonably" large),
which means potentially a lot of valid data has to be moved, and of hot/cold
data separation, which I am very skeptical about.
(I think hot/cold works absolutely splendidly for normal desktop uses and
most forms of /home, though).
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-24 17:51:42
|
On Thu, Sep 24, 2015 at 01:24:14AM +0200, Marc Lehmann wrote:
> On Wed, Sep 23, 2015 at 02:29:31PM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > One GC needs to move all the valid blocks inside a section, so if the section
> > size is too large, every GC is likely to show very long latency.
> > In addition, we need more overprovision space too.
>
> That wouldn't increase the overhead in general though, because the
> overhead depends on how much space is free in each section.

Surely, it depends on workloads.

> So I am very suspicious of both static and dynamic attempts to separate
> data into hot/cold. You can't know from file extensions, and you can't
> know from past modification history.

Yes, regarding user data, we cannot determine the hotness of every block,
actually.

> The only applicability of hot/cold I can see is filesystem metadata and
> directories (files get moved/renamed/added), and afaics, f2fs already does
> that.

It does, all the time. But what I'm curious about is the effect of splitting
directories and files explicitly. If we use two logs, f2fs only splits
metadata and data. But if we use at least 4 logs, it splits each of metadata
and data according to their origin, directory or user file.

For example, if I represent blocks like:

  D : dentry block
  U : user data block
  I : directory inode
  F : file inode
  O : obsolete block

1) in 2 logs, each section can consist of

  DDUUUUUDDUUUUU   IFFFFIFFFFFF

2) in 4 logs,

  DDDD   UUUUUUUUUUU   II   FFFFFFFFFF

Then, if we rename or delete files:

1) in 2 logs,

  OOUUUUUODUUUUDD   IOOOOIFFOOFI

2) in 4 logs,

  OOODDD   OOOOOUUUOUUU   OOIII   OOOOFFOOFFFF

So I expect we can reduce the number of valid blocks to move if we use 4
logs. Surely, if the workload produces mostly a huge number of data blocks,
I think two logs are enough; using more logs would not show a big impact.

Thanks,
|
|
From: Marc L. <sc...@sc...> - 2015-09-25 06:51:06
|
On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> One thing that we can try is to run the latest f2fs source in v3.18.
> This branch supports f2fs for v3.18.

Ok, please bear with me, the last time I built my own kernel was during
the 2.4 timeframe, and this is a ubuntu kernel. What I did is this:

git clone -b linux-3.18 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
cd f2fs/fs/f2fs
rsync -avPR include/linux/f2fs_fs.h include/trace/events/f2fs.h /usr/src/linux-headers-3.18.21-031821/.
make -C /lib/modules/3.18.21-031821-generic/build/ M=$PWD modules modules_install

I then rmmod'ed f2fs and insmod'ed the resulting module, and tried to mount my
existing f2fs fs for a quick test, but got a null ptr exception on "mount":

http://ue.tst.eu/e4628dcee97324e580da1bafad938052.txt

Probably caused by me not building a full kernel, but recreating how ubuntu
builds their kernels on a debian system isn't something I look forward to.

> For example, if I can represent blocks like:
[number of logs discussion]

Thanks for this explanation - two logs doesn't look so bad from a
locality viewpoint (not a big issue for flash, but a big issue for
rotational devices - I also realised I can't use dmcache, as dmcache, even
in writethrough mode, writes back all data after an unclean shutdown,
which would positively kill the disk).

Since whatever speed difference I saw with two logs wasn't big, you
completely sold me on 6 logs, or 4 (especially if it speeds up the gc,
which I haven't tested much yet). Two logs was merely a test anyway (the
same with no_heap; I don't know what it does, but I thought it was worth
a try, as metadata + data nearer together is better than having them at
opposite ends of the log or so).

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-25 18:26:56
|
On Fri, Sep 25, 2015 at 08:50:57AM +0200, Marc Lehmann wrote:
> On Thu, Sep 24, 2015 at 11:28:36AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> > One thing that we can try is to run the latest f2fs source in v3.18.
> > This branch supports f2fs for v3.18.
>
> Ok, please bear with me, the last time I built my own kernel was during
> the 2.4 timeframe, and this is a ubuntu kernel. What I did is this:
>
> git clone -b linux-3.18 git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
> cd f2fs/fs/f2fs
> rsync -avPR include/linux/f2fs_fs.h include/trace/events/f2fs.h /usr/src/linux-headers-3.18.21-031821/.
> make -C /lib/modules/3.18.21-031821-generic/build/ M=$PWD modules modules_install
>
> I then rmmod'ed f2fs and insmod'ed the resulting module, and tried to mount my
> existing f2fs fs for a quick test, but got a null ptr exception on "mount":
>
> http://ue.tst.eu/e4628dcee97324e580da1bafad938052.txt
>
> Probably caused by me not building a full kernel, but recreating how ubuntu
> builds their kernels on a debian system isn't something I look forward to.

Please pull the v3.18 branch again. I rebased it. :-(

> Since whatever speed difference I saw with two logs wasn't big, you
> completely sold me on 6 logs, or 4 (especially if it speeds up the gc,
> which I haven't tested much yet). Two logs was merely a test anyway (the
> same with no_heap; I don't know what it does, but I thought it was worth
> a try, as metadata + data nearer together is better than having them at
> opposite ends of the log or so).

If the section size is pretty large, no_heap would be enough. The original
intention was to provide more contiguous space for data only, so that a big
file could have a large extent instead of being split by its metadata.
|
|
From: Marc L. <sc...@sc...> - 2015-09-26 03:22:27
|
On Fri, Sep 25, 2015 at 05:47:12PM +0800, Chao Yu <cha...@sa...> wrote:
> Please revert the commit 7c5e466755ff ("f2fs: readahead cp payload
> pages when mount") since in this commit we try to access invalid
> SIT_I(sbi)->sit_base_addr which should be inited later.
Wow, you are fast. To make it short, the new module loads and mounts. Since
systemd failed to clear the dmcache again, I need to wait a few hours for it
to write back before testing. On the plus side, this gives a fairly high
chance of fragmented memory, so I can test the code that avoids oom on mount
as well :)
> > Since whatever speed difference I saw with two logs wasn't big, you
> > completely sold me on 6 logs, or 4 (especially if it seepds up the gc,
> > which I haven't much tested yet). Two logs was merely a test anyway (the
> > same with no_heap, I don't know what it does, but I thought it is worth
> > a try, as metadata + data nearer together is better than having them at
> > opposite ends of the log or so).
>
> If the section size is pretty large, no_heap would be enough. The original
> intention was to provide more contiguous space for data only so that a big
> file could have a large extent instead of splitting by its metadata.
Great, so no_heap it is.
Also, I was thinking a bit more on the active_logs issue.
The problem with SMR drives and too many logs is not just locality,
but the fact that appending data, unlike with flash, requires a
read-modify-write cycle. Likewise, I am pretty sure the disk can't keep
6 open write fragments in memory - maybe it can only keep one, so every
metadata write might cause an RMW cycle again, because it's not big enough
to fill a full zone (17-30MB).
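(For scale: assuming the usual 2MB f2fs segment size, -s N gives sections of N x 2MB,
so -s1 is 2MB, -s16 is 32MB (roughly one 17-30MB zone), -s64 is 128MB (about 4-8 zones)
and -s90 is 180MB (about 6-10 zones) - every section size tested here already spans
several zones.)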
So, hmm, well, separating the metadata that frequently changes
(directories) from the rest is necessary for the GC to not have to copy
almost all data blocks, but otherwise, it's nice if everything else clumps
together.
(likewise, stat information probably changes a lot more often than file
data, e.g. chown -R user . will change stat data regardless of whether the
files already belong to a user, and it would be nice if that means the
data blocks can be kept untouched. Similarly for renames).
What would you recommend for this case?
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-09-26 07:48:11
|
On Sat, Sep 26, 2015 at 05:22:18AM +0200, Marc Lehmann wrote:
> On Fri, Sep 25, 2015 at 05:47:12PM +0800, Chao Yu <cha...@sa...> wrote:
> > Please revert the commit 7c5e466755ff ("f2fs: readahead cp payload
> > pages when mount") since in this commit we try to access invalid
> > SIT_I(sbi)->sit_base_addr which should be inited later.
>
> Wow, you are fast. To make it short, the new module loads and mounts. Since
> systemd failed to clear the dmcache again, I need to wait a few hours for it
> to write back before testing. On the plus side, this gives a fairly high
> chance of fragmented memory, so I can test the code that avoids oom on mount
> as well :)
>
> > > Since whatever speed difference I saw with two logs wasn't big, you
> > > completely sold me on 6 logs, or 4 (especially if it speeds up the gc,
> > > which I haven't much tested yet). Two logs was merely a test anyway (the
> > > same with no_heap, I don't know what it does, but I thought it is worth
> > > a try, as metadata + data nearer together is better than having them at
> > > opposite ends of the log or so).
> >
> > If the section size is pretty large, no_heap would be enough. The original
> > intention was to provide more contiguous space for data only so that a big
> > file could have a large extent instead of splitting by its metadata.
>
> Great, so no_heap it is.
>
> Also, I was thinking a bit more on the active_logs issue.
>
> The problem with SMR drives and too many logs is not just locality,
> but the fact that appending data, unlike with flash, requires a
> read-modify-write cycle. Likewise, I am pretty sure the disk can't keep
> 6 open write fragments in memory - maybe it can only keep one, so every
> metadata write might cause an RMW cycle again, because it's not big enough
> to fill a full zone (17-30MB).
>
> So, hmm, well, separating the metadata that frequently changes
> (directories) from the rest is necessary for the GC to not have to copy
> almost all data blocks, but otherwise, it's nice if everything else clumps
> together.
>
> (likewise, stat information probably changes a lot more often than file
> data, e.g. chown -R user . will change stat data regardless of whether the
> files already belong to a user, and it would be nice if that means the
> data blocks can be kept untouched. Similarly for renames).
>
> What would you recommend for this case?
Hmm, from the device side, IMO, the number of open zones is not a big concern,
since f2fs normally tries to merge data and node IOs separately in order
to submit one big IO at a time.
So, in my view, it is not a big deal to use more logs.
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Lin...@li...
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
|
|
From: Marc L. <sc...@sc...> - 2015-10-02 08:53:48
|
On Thu, Oct 01, 2015 at 08:51:24PM +0200, Marc Lehmann <sc...@sc...> wrote:
> Ok, for completeness, here is the full log and a description of what was
> going on.
Ok, so I ran fsck, which took one hour - not great, but I don't use fsck
very often. It didn't find any problems (everything Ok).
However, I have a freeze. When I mount the volume, start a du on it, and after
a while do:
echo 3 >/proc/sys/vm/drop_caches
Then this process hangs with 100% sys time. /proc/../stack gives no usable
backtrace.
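(A minimal sketch of the reproduction, with placeholder device and mountpoint names:)
mount -t f2fs /dev/sdX1 /mnt/f2fs
du -sh /mnt/f2fs &                  # generate some metadata traffic
sleep 60                            # let it run for a while
echo 3 >/proc/sys/vm/drop_caches    # this is the write that hangs with 100% sys time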
umount on the f2fs volume also hangs:
[<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff8118086d>] unregister_shrinker+0x1d/0x70
[<ffffffff811e7911>] deactivate_locked_super+0x41/0x60
[<ffffffff811e7eee>] deactivate_super+0x4e/0x70
[<ffffffff81204733>] cleanup_mnt+0x43/0x90
[<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
[<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
[<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
[<ffffffff8178896f>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-10-02 16:47:36
|
On Thu, Oct 01, 2015 at 08:51:24PM +0200, Marc Lehmann wrote:
> On Thu, Oct 01, 2015 at 02:11:20PM +0200, Marc Lehmann <sc...@sc...> wrote:
> > WOW, THAT HELPED A LOT. While the peak throughput seems quite a bit lower
>
> Ok, for completeness, here is the full log and a description of what was
> going on.
>
> http://data.plan9.de/f2fs.s64.noinline.full.trace.xz
Now, I can see much cleaner patterns.
> status at the end + some idle time
> http://ue.tst.eu/d16cf98c72fe9ecbac178ded47a21396.txt
>
> It was faster than the reader till roughly the 1.2TB mark, after
> which it acquired longish episodes of being <<50MB/s (for example,
> around 481842.363964), and also periods of ~20kb/s, due to many small
> WRITE_SYNC's in a row (e.g. at 482329.101222 and 490189.681438,
> http://ue.tst.eu/cc94978eafc736422437a4ab35862c12.txt). The small
> WRITE_SYNCs did not always result in this behaviour by the disk, though.
Hmm, this is because of FS metadata flushes in the background.
I pushed one patch, can you get it through the v3.18 branch?
> After that, it was generally write-I/O bound.
>
> Also, the gc seemed to have kicked in at around that time, which is kind
> of counterproductive. I increased the gc_* values in /sys, but don't know
> if that had any effect.
>
> Most importantly, f2fs always recovered and had periods of much faster
> writes (>= 120MB/s), so it's not the case that f2fs somehow saturates the
> internal cache and then becomes slow forever.
>
> Overall, the throughput was 83MB/s, which is 20% worse than stock 3.18, but
> still way beyond what any other filesystem could do.
Cool.
> Also, writing 1TB in a single session, with somewhat reduced speed
> afterwards, would be enough for my purposes, i.e. I can live with that
> (still, gigabit speeds would be nice of course, as that is the data rate I
> often deal with).
>
> Notwithstanding any other improvements you might implement, f2fs has now
> officially become my choice for SMR drives, the only remaining thing
> needed is to convince me of its stability - it seems getting a kernel
> with truly stable f2fs is a bit of a game of chance still, but I guess
> confidence will come with more tests and actually using it in production,
> which I will do soon.
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Lin...@li...
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
|
|
From: Marc L. <sc...@sc...> - 2015-10-04 09:40:51
|
On Fri, Oct 02, 2015 at 09:46:45AM -0700, Jaegeuk Kim <ja...@ke...> wrote:
> Hmm, this is because of FS metadata flushes in the background.
> I pushed one patch, can you get it through the v3.18 branch?
I continued to write the same ~2TB data set to the disk, with the same
kernel module, giving me 85MB/s and 66MB/s throughput, respectively. This
was due to extended periods where write performance was around 30-40MB/s,
followed by shorter periods where it was >100MB/s.
After using "disable_ext_identify", this *seemed* to improve somewhat. I
did this on the theory that the zones near the end of the device (assuming
f2fs would roughly fill from the beginning) get larger, causing more
pressure on the disk (which has only 128MB of RAM to combine writes), but
the result wasn't conclusive either way.
For the fourth set, which wouldn't fully fit, I chose pdfs (larger average
filesize), and used the new kernel, either of which might have helped.
I configured f2fs like this:
echo 16 >ipu_policy
echo 100 >min_ipu_util
echo 100000 >reclaim_segments
echo 1 >gc_idle
echo 500 >gc_min_sleep_time
echo 90000 >gc_max_sleep_time
echo 30000 >gc_no_gc_sleep_time
Performance was ok'ish (as during the whole test) till about 200GB were
left out of 8.1 (metric) TB. I started to make a trace around the 197GB
mark:
http://data.plan9.de/f2fs.near_full.xz
status at beginning:
http://ue.tst.eu/a4fc2a2522f3e372c7e92255cad1f3c3.txt
rsync was writing at this point, and I think you can see GC activity.
At 173851.953639, I ^S'ed rsync, which, due to -v, would cause it to pause
after a file. There was regular (probably GC) activity, but most of the
time, the disk was again idle, something I simply wouldn't expect from the
GC config (see next mail).
status after pausing rsync:
http://ue.tst.eu/26c170d7d9f946d60926a5cdca814bbe.txt
I unpaused rsync at 174004.438302
status before unpausing rsync:
http://ue.tst.eu/cc22fafa0efcb1cadae5a3849dff873b.txt
At 174186.324000, speed went down to ~2MB/s, and looking at the traces, it
seems f2fs is writing random 2MB segments, which would explain the speed.
I stopped at this point and started to prepare this mail. I could see
constant but very low activity afterwards, roughly every 30s.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
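(Side note, for reproducibility: these tunables live in the per-device sysfs directory,
so the echo commands above are assumed to be run from something like
/sys/fs/f2fs/<device>/, e.g.:)
cd /sys/fs/f2fs/sde1        # device name is a placeholder
echo 16 >ipu_policy
echo 500 >gc_min_sleep_time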
|
From: Marc L. <sc...@sc...> - 2015-09-26 05:25:59
|
Ok, before I tried the f2fs git I made another short test with the
original 3.18.21 f2fs, and it was as fast as before. Then I used the
faulty f2fs module, which forced a reboot.
Now I started to redo the 3.18.21 test + git f2fs, with the same
parameters (specifically, -s90), and while it didn't start out as slow as
4.2.1, it's similarly slow. After 218GiB, I stopped the test, giving me an
average of 50MiB/s.
Here is typical dstat output (again, dsk/sde):
http://ue.tst.eu/7a40644b3432e2932bdd8c1f6b6fc32d.txt
So less read behaviour than with 4.2.1, but also very slow writes. That
means the performance drop moves with f2fs, not the kernel version.
This is the resulting status:
http://ue.tst.eu/6d94e9bfad48a433bbc6f7daeaf5eb38.txt
Just for fun I'll start doing a -s64 run.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Marc L. <sc...@sc...> - 2015-09-26 05:57:18
|
On Sat, Sep 26, 2015 at 07:25:51AM +0200, Marc Lehmann <sc...@sc...> wrote:
> Just for fun I'll start doing a -s64 run.
Same thing with -s64.
--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / sc...@sc...
-=====/_/_//_/\_,_/ /_/\_\
|
|
From: Jaegeuk K. <ja...@ke...> - 2015-10-02 16:51:32
|
On Fri, Oct 02, 2015 at 10:53:40AM +0200, Marc Lehmann wrote:
> On Thu, Oct 01, 2015 at 08:51:24PM +0200, Marc Lehmann <sc...@sc...> wrote:
> > Ok, for completeness, here is the full log and a description of what was
> > going on.
>
> Ok, so I ran fsck, which took one hour - not great, but I don't use fsck
> very often. It didn't find any problems (everything Ok).
>
> However, I have a freeze. When I mount the volume, start a du on it, and after
> a while do:
What was your scenario? Did you delete data on the device before, or just
plain mount and umount?
> echo 3 >/proc/sys/vm/drop_caches
>
> Then this process hangs with 100% sys time. /proc/../stack gives no usable
> backtrace.
This looks like the shrinker is stuck on a mutex. I suspect a deadlock.
Can you do this, if you meet it again?
# echo l > /proc/sysrq-trigger
# echo w > /proc/sysrq-trigger
# dmesg
Thanks,
>
> umount on the f2fs volume also hangs:
>
> [<ffffffff8139ba03>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff8118086d>] unregister_shrinker+0x1d/0x70
> [<ffffffff811e7911>] deactivate_locked_super+0x41/0x60
> [<ffffffff811e7eee>] deactivate_super+0x4e/0x70
> [<ffffffff81204733>] cleanup_mnt+0x43/0x90
> [<ffffffff812047d2>] __cleanup_mnt+0x12/0x20
> [<ffffffff8108e8a4>] task_work_run+0xc4/0xe0
> [<ffffffff81012fa7>] do_notify_resume+0x97/0xb0
> [<ffffffff8178896f>] int_signal+0x12/0x17
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / sc...@sc...
> -=====/_/_//_/\_,_/ /_/\_\
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Linux-f2fs-devel mailing list
> Lin...@li...
> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel
|