linux-f2fs-devel Mailing List for linux-f2fs (Page 6)
From: Qu W. <quw...@gm...> - 2025-07-21 11:37:28
|
On 2025/7/21 19:55, Jan Kara wrote: > On Mon 21-07-25 11:14:02, Gao Xiang wrote: >> Hi Barry, >> >> On 2025/7/21 09:02, Barry Song wrote: >>> On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsi...@li...> wrote: [...] >>> Given the difficulty of allocating large folios, it's always a good >>> idea to have order-0 as a fallback. While I agree with your point, >>> I have a slightly different perspective — enabling large folios for >>> those devices might be beneficial, but the maximum order should >>> remain small. I'm referring to "small" large folios. >> >> Yeah, agreed. Having a way to limit the maximum order for those small >> devices (rather than disabling it completely) would be helpful. At >> least "small" large folios could still provide benefits when memory >> pressure is light. > > Well, in the page cache you can tune not only the minimum but also the > maximum order of a folio being allocated for each inode. Btrfs and ext4 > already use this functionality. So in principle the functionality is there, > it is "just" a question of proper user interfaces or automatic logic to > tune this limit. > > Honza And enabling large folios doesn't mean all fs operations will grab an unnecessarily large folio. For buffered writes, all those filesystems will only try to get folios as large as necessary, not overly large. This means that if the user space program is always doing buffered IO in a power-of-two unit (and at an aligned offset, of course), the folio size will match the buffer size perfectly (if we have enough memory). So for properly aligned buffered writes, large folios won't really cause unnecessarily large allocations, while still bringing all the benefits. Although I'm not familiar enough with filemap to comment on folio read and readahead... Thanks, Qu |
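[Editor's note] Qu's point about aligned, power-of-two buffered I/O can be pictured with a small userspace sketch. This is a hypothetical program, not taken from the thread; the 64 KiB unit and the file path are arbitrary assumptions. Every write covers exactly one aligned 64 KiB range, so the page cache can satisfy it with a single folio of that size when memory allows.

/* Hypothetical illustration of the aligned, power-of-two buffered
 * write pattern Qu describes; the 64 KiB unit and path are arbitrary. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (64 * 1024)	/* power-of-two write unit */

int main(void)
{
	char *buf;
	int fd, i;

	if (posix_memalign((void **)&buf, 4096, CHUNK))
		return 1;
	memset(buf, 0xa5, CHUNK);

	fd = open("/mnt/f2fs/aligned.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return 1;

	/* Each write covers exactly one aligned 64 KiB range, so the page
	 * cache can back it with a single order-4 folio (on 4 KiB pages). */
	for (i = 0; i < 16; i++)
		if (pwrite(fd, buf, CHUNK, (off_t)i * CHUNK) != CHUNK)
			return 1;

	close(fd);
	free(buf);
	return 0;
}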
From: Jan K. <ja...@su...> - 2025-07-21 10:26:10
|
On Mon 21-07-25 11:14:02, Gao Xiang wrote: > Hi Barry, > > On 2025/7/21 09:02, Barry Song wrote: > > On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsi...@li...> wrote: > > > > > ... > > > > > > > ... high-order folios can cause side effects on embedded devices > > > like routers and IoT devices, which still have MiBs of memory (and I > > > believe this won't change due to their use cases) but they also use > > > Linux kernel for quite long time. In short, I don't think enabling > > > large folios for those devices is very useful, let alone limiting > > > the minimum folio order for them (It would make the filesystem not > > > suitable any more for those users. At least that is what I never > > > want to do). And I believe this is different from the current LBS > > > support to match hardware characteristics or LBS atomic write > > > requirement. > > > > Given the difficulty of allocating large folios, it's always a good > > idea to have order-0 as a fallback. While I agree with your point, > > I have a slightly different perspective — enabling large folios for > > those devices might be beneficial, but the maximum order should > > remain small. I'm referring to "small" large folios. > > Yeah, agreed. Having a way to limit the maximum order for those small > devices (rather than disabling it completely) would be helpful. At > least "small" large folios could still provide benefits when memory > pressure is light. Well, in the page cache you can tune not only the minimum but also the maximum order of a folio being allocated for each inode. Btrfs and ext4 already use this functionality. So in principle the functionality is there, it is "just" a question of proper user interfaces or automatic logic to tune this limit. Honza -- Jan Kara <ja...@su...> SUSE Labs, CR |
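[Editor's note] For reference, a minimal sketch of the per-inode knob Jan describes, assuming the mapping_set_folio_order_range() helper in include/linux/pagemap.h. The wrapper function name, the call site, and the chosen 0..4 order range are illustrative assumptions, not taken from any filesystem.

/* Illustrative only: cap the page-cache folio order for one inode,
 * as a filesystem could do at inode-init time. The 0..4 range
 * (4 KiB..64 KiB on 4 KiB pages) is an arbitrary example. */
#include <linux/pagemap.h>

static void example_set_small_large_folios(struct inode *inode)
{
	/* min order 0 keeps the order-0 fallback; max order 4 limits
	 * allocations to "small" large folios. */
	mapping_set_folio_order_range(inode->i_mapping, 0, 4);
}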
From: <bug...@ke...> - 2025-07-21 06:50:49
|
https://bugzilla.kernel.org/show_bug.cgi?id=220321 --- Comment #5 from SEO HOYOUNG (hy5...@sa...) --- Hi, I uploaded a fix patch to mainline, but I am not sure it is the right approach. https://lore.kernel.org/linux-scsi/202...@sa.../T/#u I also thought of another way: how about changing "flush_delayed_work" to "cancel_work_sync" or "cancel_delayed_work_sync"? Then it will wait until the writeback workqueue work is done, and the "quota_release_work" function will still be queued to events_unbound. If "cancel_work_sync" is called, the second argument of "__flush_work" is passed as true, so "check_flush_dependency" returns normally and it is unlikely that there will be a problem. -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug. |
From: yohan.joung <yoh...@sk...> - 2025-07-21 05:41:54
|
pinfile is excluded as it operates with direct I/O Signed-off-by: yohan.joung <yoh...@sk...> --- fs/f2fs/file.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index 4039ccb5022c..cac8c9650a7a 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -4844,7 +4844,8 @@ static ssize_t f2fs_file_read_iter(struct kiocb *iocb, struct iov_iter *to) /* In LFS mode, if there is inflight dio, wait for its completion */ if (f2fs_lfs_mode(F2FS_I_SB(inode)) && - get_pages(F2FS_I_SB(inode), F2FS_DIO_WRITE)) + get_pages(F2FS_I_SB(inode), F2FS_DIO_WRITE) && + !f2fs_is_pinned_file(inode)) inode_dio_wait(inode); if (f2fs_should_use_dio(inode, iocb, to)) { -- 2.33.0 |
From: Gao X. <hsi...@li...> - 2025-07-21 03:14:22
|
Hi Barry, On 2025/7/21 09:02, Barry Song wrote: > On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsi...@li...> wrote: >> ... >> >> ... high-order folios can cause side effects on embedded devices >> like routers and IoT devices, which still have MiBs of memory (and I >> believe this won't change due to their use cases) but they also use >> Linux kernel for quite long time. In short, I don't think enabling >> large folios for those devices is very useful, let alone limiting >> the minimum folio order for them (It would make the filesystem not >> suitable any more for those users. At least that is what I never >> want to do). And I believe this is different from the current LBS >> support to match hardware characteristics or LBS atomic write >> requirement. > > Given the difficulty of allocating large folios, it's always a good > idea to have order-0 as a fallback. While I agree with your point, > I have a slightly different perspective — enabling large folios for > those devices might be beneficial, but the maximum order should > remain small. I'm referring to "small" large folios. Yeah, agreed. Having a way to limit the maximum order for those small devices (rather than disabling it completely) would be helpful. At least "small" large folios could still provide benefits when memory pressure is light. Thanks, Gao Xiang > > Still, even with those, allocation can be difficult — especially > since so many other allocations (which aren't large folios) can cause > fragmentation. So having order-0 as a fallback remains important. > > It seems we're missing a mechanism to enable "small" large folios > for files. For anon large folios, we do have sysfs knobs—though they > don’t seem to be universally appreciated. :-) > > Thanks > Barry |
From: Chao Yu <ch...@ke...> - 2025-07-21 02:02:47
|
Commit 0638a3197c19 ("f2fs: avoid unused block when dio write in LFS mode") has fixed unused block issue for dio write in lfs mode. However, f2fs_map_blocks() may break and return smaller extent when last allocated block locates in the end of section, even allocator can allocate contiguous blocks across sections. Actually, for the case that allocator returns a block address which is not contiguous w/ current extent, we can record the block address in iomap->private, in the next round, skip reallocating for the last allocated block, then we can fix unused block issue, meanwhile, also, we can allocates contiguous physical blocks as much as possible for dio write in lfs mode. Testcase: - mkfs.f2fs -f /dev/vdb - mount -o mode=lfs /dev/vdb /mnt/f2fs - dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=3; sync; - dd if=/dev/zero of=/mnt/f2fs/dio bs=2M count=1 oflag=direct; - umount /mnt/f2fs Before: f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 0, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 256, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 512, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 5, file offset = 0, start blkaddr = 0x4700, len = 0x100, flags = 3, seg_type = 1, may_create = 1, multidevice = 0, flag = 3, err = 0 f2fs_map_blocks: dev = (253,16), ino = 5, file offset = 256, start blkaddr = 0x4800, len = 0x100, flags = 3, seg_type = 1, may_create = 1, multidevice = 0, flag = 3, err = 0 After: f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 0, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 256, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 4, file offset = 512, start blkaddr = 0x0, len = 0x100, flags = 1, seg_type = 8, may_create = 1, multidevice = 0, flag = 5, err = 0 f2fs_map_blocks: dev = (253,16), ino = 5, file offset = 0, start blkaddr = 0x4700, len = 0x200, flags = 3, seg_type = 1, may_create = 1, multidevice = 0, flag = 3, err = 0 Cc: Daejun Park <dae...@sa...> Signed-off-by: Chao Yu <ch...@ke...> --- fs/f2fs/data.c | 28 ++++++++++++++++++---------- fs/f2fs/f2fs.h | 1 + 2 files changed, 19 insertions(+), 10 deletions(-) diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c index d1a2616d41be..4e62f7f00b70 100644 --- a/fs/f2fs/data.c +++ b/fs/f2fs/data.c @@ -1550,10 +1550,14 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map, int flag) unsigned int start_pgofs; int bidx = 0; bool is_hole; + bool lfs_dio_write; if (!maxblocks) return 0; + lfs_dio_write = (flag == F2FS_GET_BLOCK_DIO && f2fs_lfs_mode(sbi) && + map->m_may_create); + if (!map->m_may_create && f2fs_map_blocks_cached(inode, map, flag)) goto out; @@ -1600,7 +1604,7 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map, int flag) /* use out-place-update for direct IO under LFS mode */ if (map->m_may_create && (is_hole || (flag == F2FS_GET_BLOCK_DIO && f2fs_lfs_mode(sbi) && - !f2fs_is_pinned_file(inode)))) { + !f2fs_is_pinned_file(inode) && map->m_last_pblk != blkaddr))) { if (unlikely(f2fs_cp_error(sbi))) { err = -EIO; goto 
sync_out; @@ -1684,10 +1688,15 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map, int flag) if (map->m_multidev_dio) map->m_bdev = FDEV(bidx).bdev; + + if (lfs_dio_write) + map->m_last_pblk = NULL_ADDR; } else if (map_is_mergeable(sbi, map, blkaddr, flag, bidx, ofs)) { ofs++; map->m_len++; } else { + if (lfs_dio_write && !f2fs_is_pinned_file(inode)) + map->m_last_pblk = blkaddr; goto sync_out; } @@ -1712,14 +1721,6 @@ int f2fs_map_blocks(struct inode *inode, struct f2fs_map_blocks *map, int flag) dn.ofs_in_node = end_offset; } - if (flag == F2FS_GET_BLOCK_DIO && f2fs_lfs_mode(sbi) && - map->m_may_create) { - /* the next block to be allocated may not be contiguous. */ - if (GET_SEGOFF_FROM_SEG0(sbi, blkaddr) % BLKS_PER_SEC(sbi) == - CAP_BLKS_PER_SEC(sbi) - 1) - goto sync_out; - } - if (pgofs >= end) goto sync_out; else if (dn.ofs_in_node < end_offset) @@ -4162,7 +4163,7 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length, unsigned int flags, struct iomap *iomap, struct iomap *srcmap) { - struct f2fs_map_blocks map = {}; + struct f2fs_map_blocks map = { NULL, }; pgoff_t next_pgofs = 0; int err; @@ -4171,6 +4172,10 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length, map.m_next_pgofs = &next_pgofs; map.m_seg_type = f2fs_rw_hint_to_seg_type(F2FS_I_SB(inode), inode->i_write_hint); + if (flags & IOMAP_WRITE && iomap->private) { + map.m_last_pblk = (unsigned long)iomap->private; + iomap->private = NULL; + } /* * If the blocks being overwritten are already allocated, @@ -4209,6 +4214,9 @@ static int f2fs_iomap_begin(struct inode *inode, loff_t offset, loff_t length, iomap->flags |= IOMAP_F_MERGED; iomap->bdev = map.m_bdev; iomap->addr = F2FS_BLK_TO_BYTES(map.m_pblk); + + if (flags & IOMAP_WRITE && map.m_last_pblk) + iomap->private = (void *)map.m_last_pblk; } else { if (flags & IOMAP_WRITE) return -ENOTBLK; diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index dfddb66910b3..97c1a2a3fbd7 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -732,6 +732,7 @@ struct f2fs_map_blocks { block_t m_lblk; unsigned int m_len; unsigned int m_flags; + unsigned long m_last_pblk; /* last allocated block, only used for DIO in LFS mode */ pgoff_t *m_next_pgofs; /* point next possible non-hole pgofs */ pgoff_t *m_next_extent; /* point to next possible extent */ int m_seg_type; -- 2.49.0 |
From: Barry S. <21...@gm...> - 2025-07-21 01:03:09
|
On Wed, Jul 16, 2025 at 8:28 AM Gao Xiang <hsi...@li...> wrote: > > > > On 2025/7/16 07:32, Gao Xiang wrote: > > Hi Matthew, > > > > On 2025/7/16 04:40, Matthew Wilcox wrote: > >> I've started looking at how the page cache can help filesystems handle > >> compressed data better. Feedback would be appreciated! I'll probably > >> say a few things which are obvious to anyone who knows how compressed > >> files work, but I'm trying to be explicit about my assumptions. > >> > >> First, I believe that all filesystems work by compressing fixed-size > >> plaintext into variable-sized compressed blocks. This would be a good > >> point to stop reading and tell me about counterexamples. > > > > At least the typical EROFS compresses variable-sized plaintext (at least > > one block, e.g. 4k, but also 4k+1, 4k+2, ...) into fixed-sized compressed > > blocks for efficient I/Os, which is really useful for small compression > > granularity (e.g. 4KiB, 8KiB) because use cases like Android are usually > > under memory pressure so large compression granularity is almost > > unacceptable in the low memory scenarios, see: > > https://erofs.docs.kernel.org/en/latest/design.html > > > > Currently EROFS works pretty well on these devices and has been > > successfully deployed in billions of real devices. > > > >> > >> From what I've been reading in all your filesystems is that you want to > >> allocate extra pages in the page cache in order to store the excess data > >> retrieved along with the page that you're actually trying to read. That's > >> because compressing in larger chunks leads to better compression. > >> > >> There's some discrepancy between filesystems whether you need scratch > >> space for decompression. Some filesystems read the compressed data into > >> the pagecache and decompress in-place, while other filesystems read the > >> compressed data into scratch pages and decompress into the page cache. > >> > >> There also seems to be some discrepancy between filesystems whether the > >> decompression involves vmap() of all the memory allocated or whether the > >> decompression routines can handle doing kmap_local() on individual pages. > >> > >> So, my proposal is that filesystems tell the page cache that their minimum > >> folio size is the compression block size. That seems to be around 64k, > >> so not an unreasonable minimum allocation size. That removes all the > >> extra code in filesystems to allocate extra memory in the page cache.> It means we don't attempt to track dirtiness at a sub-folio granularity > >> (there's no point, we have to write back the entire compressed bock > >> at once). We also get a single virtually contiguous block ... if you're > >> willing to ditch HIGHMEM support. Or there's a proposal to introduce a > >> vmap_file() which would give us a virtually contiguous chunk of memory > >> (and could be trivially turned into a noop for the case of trying to > >> vmap a single large folio). > > > > I don't see this will work for EROFS because EROFS always supports > > variable uncompressed extent lengths and that will break typical > > EROFS use cases and on-disk formats. > > > > Other thing is that large order folios (physical consecutive) will > > caused "increase the latency on UX task with filemap_fault()" > > because of high-order direct reclaims, see: > > https://android-review.googlesource.com/c/kernel/common/+/3692333 > > so EROFS will not set min-order and always support order-0 folios. 
> > > > I think EROFS will not use this new approach, vmap() interface is > > always the case for us. > > ... high-order folios can cause side effects on embedded devices > like routers and IoT devices, which still have MiBs of memory (and I > believe this won't change due to their use cases) but they also use > Linux kernel for quite long time. In short, I don't think enabling > large folios for those devices is very useful, let alone limiting > the minimum folio order for them (It would make the filesystem not > suitable any more for those users. At least that is what I never > want to do). And I believe this is different from the current LBS > support to match hardware characteristics or LBS atomic write > requirement. Given the difficulty of allocating large folios, it's always a good idea to have order-0 as a fallback. While I agree with your point, I have a slightly different perspective — enabling large folios for those devices might be beneficial, but the maximum order should remain small. I'm referring to "small" large folios. Still, even with those, allocation can be difficult — especially since so many other allocations (which aren't large folios) can cause fragmentation. So having order-0 as a fallback remains important. It seems we're missing a mechanism to enable "small" large folios for files. For anon large folios, we do have sysfs knobs—though they don’t seem to be universally appreciated. :-) Thanks Barry |
From: Barry S. <21...@gm...> - 2025-07-21 00:44:14
|
On Wed, Jul 16, 2025 at 7:32 AM Gao Xiang <hsi...@li...> wrote: [...] > > I don't see this will work for EROFS because EROFS always supports > variable uncompressed extent lengths and that will break typical > EROFS use cases and on-disk formats. > > Other thing is that large order folios (physical consecutive) will > caused "increase the latency on UX task with filemap_fault()" > because of high-order direct reclaims, see: > https://android-review.googlesource.com/c/kernel/common/+/3692333 > so EROFS will not set min-order and always support order-0 folios. Regarding Hailong's Android hook, it's essentially a complaint about the GFP mask used to allocate large folios for files. I'm wondering why the page cache hasn't adopted the same approach that's used for anon large folios: gfp = vma_thp_gfp_mask(vma); Another concern might be that the allocation order is too large, which could lead to memory fragmentation and waste. Ideally, we'd have "small" large folios—say, with order <= 4—to strike a better balance. > > I think EROFS will not use this new approach, vmap() interface is > always the case for us. > > Thanks, > Gao Xiang > > > > Thanks Barry |
From: Daeho J. <da...@gm...> - 2025-07-18 22:04:48
|
From: Daeho Jeong <dae...@go...> Otherwise F2FS will not do GC in background in low free section. Signed-off-by: Daeho Jeong <dae...@go...> --- fs/f2fs/gc.c | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index 0d7703e7f9e0..08eead027648 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -391,14 +391,15 @@ static unsigned int get_cb_cost(struct f2fs_sb_info *sbi, unsigned int segno) } static inline unsigned int get_gc_cost(struct f2fs_sb_info *sbi, - unsigned int segno, struct victim_sel_policy *p) + unsigned int segno, struct victim_sel_policy *p, + unsigned int valid_thresh_ratio) { if (p->alloc_mode == SSR) return get_seg_entry(sbi, segno)->ckpt_valid_blocks; - if (p->one_time_gc && (get_valid_blocks(sbi, segno, true) >= - CAP_BLKS_PER_SEC(sbi) * sbi->gc_thread->valid_thresh_ratio / - 100)) + if (p->one_time_gc && (valid_thresh_ratio < 100) && + (get_valid_blocks(sbi, segno, true) >= + CAP_BLKS_PER_SEC(sbi) * valid_thresh_ratio / 100)) return UINT_MAX; /* alloc_mode == LFS */ @@ -779,6 +780,7 @@ int f2fs_get_victim(struct f2fs_sb_info *sbi, unsigned int *result, unsigned int secno, last_victim; unsigned int last_segment; unsigned int nsearched; + unsigned int valid_thresh_ratio = 100; bool is_atgc; int ret = 0; @@ -788,7 +790,11 @@ int f2fs_get_victim(struct f2fs_sb_info *sbi, unsigned int *result, p.alloc_mode = alloc_mode; p.age = age; p.age_threshold = sbi->am.age_threshold; - p.one_time_gc = one_time; + if (one_time) { + p.one_time_gc = one_time; + if (has_enough_free_secs(sbi, 0, NR_PERSISTENT_LOG)) + valid_thresh_ratio = sbi->gc_thread->valid_thresh_ratio; + } retry: select_policy(sbi, gc_type, type, &p); @@ -914,7 +920,7 @@ int f2fs_get_victim(struct f2fs_sb_info *sbi, unsigned int *result, goto next; } - cost = get_gc_cost(sbi, segno, &p); + cost = get_gc_cost(sbi, segno, &p, valid_thresh_ratio); if (p.min_cost > cost) { p.min_segno = segno; -- 2.50.0.727.gbf7dc18ff4-goog |
From: Daeho J. <da...@gm...> - 2025-07-18 21:50:14
|
From: Daeho Jeong <dae...@go...> Add this to control GC algorithm for boost GC. Signed-off-by: Daeho Jeong <dae...@go...> --- v2: use GC_GREEDY instead of 1 --- Documentation/ABI/testing/sysfs-fs-f2fs | 8 +++++++- fs/f2fs/gc.c | 3 ++- fs/f2fs/gc.h | 1 + fs/f2fs/sysfs.c | 16 ++++++++++++++++ 4 files changed, 26 insertions(+), 2 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs index 931c1f63aa2e..2158055cd9d1 100644 --- a/Documentation/ABI/testing/sysfs-fs-f2fs +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -866,6 +866,12 @@ What: /sys/fs/f2fs/<disk>/gc_boost_gc_multiple Date: June 2025 Contact: "Daeho Jeong" <dae...@go...> Description: Set a multiplier for the background GC migration window when F2FS GC is - boosted. + boosted. the range should be from 1 to the segment count in a section. Default: 5 +What: /sys/fs/f2fs/<disk>/gc_boost_gc_greedy +Date: June 2025 +Contact: "Daeho Jeong" <dae...@go...> +Description: Control GC algorithm for boost GC. 0: cost benefit, 1: greedy + Default: 1 + diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index de7e59bc0906..0d7703e7f9e0 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -141,7 +141,7 @@ static int gc_thread_func(void *data) FOREGROUND : BACKGROUND); sync_mode = (F2FS_OPTION(sbi).bggc_mode == BGGC_MODE_SYNC) || - gc_control.one_time; + (gc_control.one_time && gc_th->boost_gc_greedy); /* foreground GC was been triggered via f2fs_balance_fs() */ if (foreground) @@ -198,6 +198,7 @@ int f2fs_start_gc_thread(struct f2fs_sb_info *sbi) gc_th->urgent_sleep_time = DEF_GC_THREAD_URGENT_SLEEP_TIME; gc_th->valid_thresh_ratio = DEF_GC_THREAD_VALID_THRESH_RATIO; gc_th->boost_gc_multiple = BOOST_GC_MULTIPLE; + gc_th->boost_gc_greedy = GC_GREEDY; if (f2fs_sb_has_blkzoned(sbi)) { gc_th->min_sleep_time = DEF_GC_THREAD_MIN_SLEEP_TIME_ZONED; diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h index efa1968810a0..1a2e7a84b59f 100644 --- a/fs/f2fs/gc.h +++ b/fs/f2fs/gc.h @@ -69,6 +69,7 @@ struct f2fs_gc_kthread { unsigned int boost_zoned_gc_percent; unsigned int valid_thresh_ratio; unsigned int boost_gc_multiple; + unsigned int boost_gc_greedy; }; struct gc_inode_list { diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c index b0270b1c939c..3a52f51ee3c6 100644 --- a/fs/f2fs/sysfs.c +++ b/fs/f2fs/sysfs.c @@ -824,6 +824,20 @@ static ssize_t __sbi_store(struct f2fs_attr *a, return count; } + if (!strcmp(a->attr.name, "gc_boost_gc_multiple")) { + if (t < 1 || t > SEGS_PER_SEC(sbi)) + return -EINVAL; + sbi->gc_thread->boost_gc_multiple = (unsigned int)t; + return count; + } + + if (!strcmp(a->attr.name, "gc_boost_gc_greedy")) { + if (t > GC_GREEDY) + return -EINVAL; + sbi->gc_thread->boost_gc_greedy = (unsigned int)t; + return count; + } + *ui = (unsigned int)t; return count; @@ -1051,6 +1065,7 @@ GC_THREAD_RW_ATTR(gc_no_zoned_gc_percent, no_zoned_gc_percent); GC_THREAD_RW_ATTR(gc_boost_zoned_gc_percent, boost_zoned_gc_percent); GC_THREAD_RW_ATTR(gc_valid_thresh_ratio, valid_thresh_ratio); GC_THREAD_RW_ATTR(gc_boost_gc_multiple, boost_gc_multiple); +GC_THREAD_RW_ATTR(gc_boost_gc_greedy, boost_gc_greedy); /* SM_INFO ATTR */ SM_INFO_RW_ATTR(reclaim_segments, rec_prefree_segments); @@ -1222,6 +1237,7 @@ static struct attribute *f2fs_attrs[] = { ATTR_LIST(gc_boost_zoned_gc_percent), ATTR_LIST(gc_valid_thresh_ratio), ATTR_LIST(gc_boost_gc_multiple), + ATTR_LIST(gc_boost_gc_greedy), ATTR_LIST(gc_idle), ATTR_LIST(gc_urgent), ATTR_LIST(reclaim_segments), -- 2.50.0.727.gbf7dc18ff4-goog |
From: Daeho J. <da...@gm...> - 2025-07-18 21:40:32
|
From: Daeho Jeong <dae...@go...> Add a sysfs knob to set a multiplier for the background GC migration window when F2FS Garbage Collection is boosted. Signed-off-by: Daeho Jeong <dae...@go...> --- v2: limit the available value range --- Documentation/ABI/testing/sysfs-fs-f2fs | 8 ++++++++ fs/f2fs/gc.c | 3 ++- fs/f2fs/gc.h | 1 + fs/f2fs/sysfs.c | 2 ++ 4 files changed, 13 insertions(+), 1 deletion(-) diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs index bf03263b9f46..931c1f63aa2e 100644 --- a/Documentation/ABI/testing/sysfs-fs-f2fs +++ b/Documentation/ABI/testing/sysfs-fs-f2fs @@ -861,3 +861,11 @@ Description: This is a read-only entry to show the value of sb.s_encoding_flags, SB_ENC_STRICT_MODE_FL 0x00000001 SB_ENC_NO_COMPAT_FALLBACK_FL 0x00000002 ============================ ========== + +What: /sys/fs/f2fs/<disk>/gc_boost_gc_multiple +Date: June 2025 +Contact: "Daeho Jeong" <dae...@go...> +Description: Set a multiplier for the background GC migration window when F2FS GC is + boosted. + Default: 5 + diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c index 3cb5242f4ddf..de7e59bc0906 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -197,6 +197,7 @@ int f2fs_start_gc_thread(struct f2fs_sb_info *sbi) gc_th->urgent_sleep_time = DEF_GC_THREAD_URGENT_SLEEP_TIME; gc_th->valid_thresh_ratio = DEF_GC_THREAD_VALID_THRESH_RATIO; + gc_th->boost_gc_multiple = BOOST_GC_MULTIPLE; if (f2fs_sb_has_blkzoned(sbi)) { gc_th->min_sleep_time = DEF_GC_THREAD_MIN_SLEEP_TIME_ZONED; @@ -1749,7 +1750,7 @@ static int do_garbage_collect(struct f2fs_sb_info *sbi, !has_enough_free_blocks(sbi, sbi->gc_thread->boost_zoned_gc_percent)) window_granularity *= - BOOST_GC_MULTIPLE; + sbi->gc_thread->boost_gc_multiple; end_segno = start_segno + window_granularity; } diff --git a/fs/f2fs/gc.h b/fs/f2fs/gc.h index 5c1eaf55e127..efa1968810a0 100644 --- a/fs/f2fs/gc.h +++ b/fs/f2fs/gc.h @@ -68,6 +68,7 @@ struct f2fs_gc_kthread { unsigned int no_zoned_gc_percent; unsigned int boost_zoned_gc_percent; unsigned int valid_thresh_ratio; + unsigned int boost_gc_multiple; }; struct gc_inode_list { diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c index 75134d69a0bd..b0270b1c939c 100644 --- a/fs/f2fs/sysfs.c +++ b/fs/f2fs/sysfs.c @@ -1050,6 +1050,7 @@ GC_THREAD_RW_ATTR(gc_no_gc_sleep_time, no_gc_sleep_time); GC_THREAD_RW_ATTR(gc_no_zoned_gc_percent, no_zoned_gc_percent); GC_THREAD_RW_ATTR(gc_boost_zoned_gc_percent, boost_zoned_gc_percent); GC_THREAD_RW_ATTR(gc_valid_thresh_ratio, valid_thresh_ratio); +GC_THREAD_RW_ATTR(gc_boost_gc_multiple, boost_gc_multiple); /* SM_INFO ATTR */ SM_INFO_RW_ATTR(reclaim_segments, rec_prefree_segments); @@ -1220,6 +1221,7 @@ static struct attribute *f2fs_attrs[] = { ATTR_LIST(gc_no_zoned_gc_percent), ATTR_LIST(gc_boost_zoned_gc_percent), ATTR_LIST(gc_valid_thresh_ratio), + ATTR_LIST(gc_boost_gc_multiple), ATTR_LIST(gc_idle), ATTR_LIST(gc_urgent), ATTR_LIST(reclaim_segments), -- 2.50.0.727.gbf7dc18ff4-goog |
From: <pat...@ke...> - 2025-07-18 20:19:51
|
Hello: The following patches were marked "accepted", because they were applied to jaegeuk/f2fs.git (dev): Patch: [f2fs-dev] f2fs: fix to avoid out-of-boundary access in dnode page Submitter: Chao Yu <ch...@ke...> Committer: Jaegeuk Kim <ja...@ke...> Patchwork: https://patchwork.kernel.org/project/f2fs/list/?series=983421 Lore link: https://lore.kernel.org/r/202...@ke... Total patches: 1 -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html |
From: <pat...@ke...> - 2025-07-18 20:19:51
|
Hello: This patch was applied to jaegeuk/f2fs.git (dev) by Jaegeuk Kim <ja...@ke...>: On Thu, 17 Jul 2025 21:26:33 +0800 you wrote: > As Jiaming Zhang reported: > > <TASK> > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0x1c1/0x2a0 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:378 [inline] > print_report+0x17e/0x800 mm/kasan/report.c:480 > kasan_report+0x147/0x180 mm/kasan/report.c:593 > data_blkaddr fs/f2fs/f2fs.h:3053 [inline] > f2fs_data_blkaddr fs/f2fs/f2fs.h:3058 [inline] > f2fs_get_dnode_of_data+0x1a09/0x1c40 fs/f2fs/node.c:855 > f2fs_reserve_block+0x53/0x310 fs/f2fs/data.c:1195 > prepare_write_begin fs/f2fs/data.c:3395 [inline] > f2fs_write_begin+0xf39/0x2190 fs/f2fs/data.c:3594 > generic_perform_write+0x2c7/0x910 mm/filemap.c:4112 > f2fs_buffered_write_iter fs/f2fs/file.c:4988 [inline] > f2fs_file_write_iter+0x1ec8/0x2410 fs/f2fs/file.c:5216 > new_sync_write fs/read_write.c:593 [inline] > vfs_write+0x546/0xa90 fs/read_write.c:686 > ksys_write+0x149/0x250 fs/read_write.c:738 > do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] > do_syscall_64+0xf3/0x3d0 arch/x86/entry/syscall_64.c:94 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > [...] Here is the summary with links: - [f2fs-dev] f2fs: fix to avoid out-of-boundary access in dnode page https://git.kernel.org/jaegeuk/f2fs/c/026e81230291 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html |
From: wangzijie <wan...@ho...> - 2025-07-18 10:07:26
|
When we need to alloc nat entry and set it dirty, we can directly add it to dirty set list(or initialize its list_head for new_ne) instead of adding it to clean list and make a move. Introduce init_dirty flag to do it. Signed-off-by: wangzijie <wan...@ho...> --- fs/f2fs/node.c | 37 ++++++++++++++++++++++++++++++------- 1 file changed, 30 insertions(+), 7 deletions(-) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index b9fbc6bf7..b891be98b 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -185,7 +185,7 @@ static void __free_nat_entry(struct nat_entry *e) /* must be locked by nat_tree_lock */ static struct nat_entry *__init_nat_entry(struct f2fs_nm_info *nm_i, - struct nat_entry *ne, struct f2fs_nat_entry *raw_ne, bool no_fail) + struct nat_entry *ne, struct f2fs_nat_entry *raw_ne, bool no_fail, bool init_dirty) { if (no_fail) f2fs_radix_tree_insert(&nm_i->nat_root, nat_get_nid(ne), ne); @@ -195,6 +195,11 @@ static struct nat_entry *__init_nat_entry(struct f2fs_nm_info *nm_i, if (raw_ne) node_info_from_raw_nat(&ne->ni, raw_ne); + if (init_dirty) { + nm_i->nat_cnt[TOTAL_NAT]++; + return ne; + } + spin_lock(&nm_i->nat_list_lock); list_add_tail(&ne->list, &nm_i->nat_entries); spin_unlock(&nm_i->nat_list_lock); @@ -256,7 +261,7 @@ static struct nat_entry_set *__grab_nat_entry_set(struct f2fs_nm_info *nm_i, } static void __set_nat_cache_dirty(struct f2fs_nm_info *nm_i, - struct nat_entry *ne) + struct nat_entry *ne, bool init_dirty) { struct nat_entry_set *head; bool new_ne = nat_get_blkaddr(ne) == NEW_ADDR; @@ -275,6 +280,18 @@ static void __set_nat_cache_dirty(struct f2fs_nm_info *nm_i, set_nat_flag(ne, IS_PREALLOC, new_ne); + if (init_dirty) { + nm_i->nat_cnt[DIRTY_NAT]++; + set_nat_flag(ne, IS_DIRTY, true); + spin_lock(&nm_i->nat_list_lock); + if (new_ne) + INIT_LIST_HEAD(&ne->list); + else + list_add_tail(&ne->list, &head->entry_list); + spin_unlock(&nm_i->nat_list_lock); + return; + } + if (get_nat_flag(ne, IS_DIRTY)) goto refresh_list; @@ -441,7 +458,7 @@ static void cache_nat_entry(struct f2fs_sb_info *sbi, nid_t nid, f2fs_down_write(&nm_i->nat_tree_lock); e = __lookup_nat_cache(nm_i, nid); if (!e) - e = __init_nat_entry(nm_i, new, ne, false); + e = __init_nat_entry(nm_i, new, ne, false, false); else f2fs_bug_on(sbi, nat_get_ino(e) != le32_to_cpu(ne->ino) || nat_get_blkaddr(e) != @@ -458,11 +475,13 @@ static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, struct f2fs_nm_info *nm_i = NM_I(sbi); struct nat_entry *e; struct nat_entry *new = __alloc_nat_entry(sbi, ni->nid, true); + bool init_dirty = false; f2fs_down_write(&nm_i->nat_tree_lock); e = radix_tree_lookup(&nm_i->nat_root, ni->nid); if (!e) { - e = __init_nat_entry(nm_i, new, NULL, true); + init_dirty = true; + e = __init_nat_entry(nm_i, new, NULL, true, true); copy_node_info(&e->ni, ni); f2fs_bug_on(sbi, ni->blk_addr == NEW_ADDR); } else if (new_blkaddr == NEW_ADDR) { @@ -498,7 +517,7 @@ static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, nat_set_blkaddr(e, new_blkaddr); if (!__is_valid_data_blkaddr(new_blkaddr)) set_nat_flag(e, IS_CHECKPOINTED, false); - __set_nat_cache_dirty(nm_i, e); + __set_nat_cache_dirty(nm_i, e, init_dirty); /* update fsync_mark if its inode nat entry is still alive */ if (ni->nid != ni->ino) @@ -2914,6 +2933,7 @@ static void remove_nats_in_journal(struct f2fs_sb_info *sbi) struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_HOT_DATA); struct f2fs_journal *journal = curseg->journal; int i; + bool init_dirty; down_write(&curseg->journal_rwsem); for (i = 0; i < 
nats_in_cursum(journal); i++) { @@ -2924,12 +2944,15 @@ static void remove_nats_in_journal(struct f2fs_sb_info *sbi) if (f2fs_check_nid_range(sbi, nid)) continue; + init_dirty = false; + raw_ne = nat_in_journal(journal, i); ne = radix_tree_lookup(&nm_i->nat_root, nid); if (!ne) { + init_dirty = true; ne = __alloc_nat_entry(sbi, nid, true); - __init_nat_entry(nm_i, ne, &raw_ne, true); + __init_nat_entry(nm_i, ne, &raw_ne, true, true); } /* @@ -2944,7 +2967,7 @@ static void remove_nats_in_journal(struct f2fs_sb_info *sbi) spin_unlock(&nm_i->nid_list_lock); } - __set_nat_cache_dirty(nm_i, ne); + __set_nat_cache_dirty(nm_i, ne, init_dirty); } update_nats_in_cursum(journal, -i); up_write(&curseg->journal_rwsem); -- 2.25.1 |
From: wangzijie <wan...@ho...> - 2025-07-18 10:07:18
|
__lookup_nat_cache follows LRU manner to move clean nat entry, when nat entries are going to be dirty, no need to move them to tail of lru list. Signed-off-by: wangzijie <wan...@ho...> --- fs/f2fs/node.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 4b3d9070e..b9fbc6bf7 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -460,7 +460,7 @@ static void set_node_addr(struct f2fs_sb_info *sbi, struct node_info *ni, struct nat_entry *new = __alloc_nat_entry(sbi, ni->nid, true); f2fs_down_write(&nm_i->nat_tree_lock); - e = __lookup_nat_cache(nm_i, ni->nid); + e = radix_tree_lookup(&nm_i->nat_root, ni->nid); if (!e) { e = __init_nat_entry(nm_i, new, NULL, true); copy_node_info(&e->ni, ni); @@ -2926,7 +2926,7 @@ static void remove_nats_in_journal(struct f2fs_sb_info *sbi) raw_ne = nat_in_journal(journal, i); - ne = __lookup_nat_cache(nm_i, nid); + ne = radix_tree_lookup(&nm_i->nat_root, nid); if (!ne) { ne = __alloc_nat_entry(sbi, nid, true); __init_nat_entry(nm_i, ne, &raw_ne, true); -- 2.25.1 |
From: <bug...@ke...> - 2025-07-18 09:13:00
|
https://bugzilla.kernel.org/show_bug.cgi?id=220321 --- Comment #4 from Chao Yu (ch...@ke...) --- Sorry for the delay. The workqueue is not allocated by f2fs, not sure this is a f2fs bug... Can you please report this issue to fsdevel mailing list: lin...@vg...? -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug. |
From: Chao Yu <ch...@ke...> - 2025-07-18 08:55:18
|
On 2025/7/15 16:56, Chao Yu wrote: > Hi, > > Sorry for the delay. > > On 6/12/25 20:32, 김규진 wrote: >> Hi F2FS developers, >> >> I'm testing multi-threaded direct I/O in LFS mode on Linux kernel >> 6.8.0-57.59, and noticed what seems to be an inefficiency in block >> allocation behavior inside `fs/f2fs/data.c` (specifically >> `f2fs_map_blocks()`): >> >> 1. In LFS mode with direct I/O, `f2fs_map_blocks()` always calls >> `__allocate_data_block()` to reserve a new block and updates the >> node/NAT entry, regardless of extent continuity. >> >> 2. If the new block is not physically contiguous with the current >> extent, it submits the current bio and defers the write of the newly >> reserved block (which is now recorded in the node) to the next >> mapping. >> >> 3. On the next `f2fs_map_blocks()` call, it finds that the logical >> block is already mapped in the node/NAT entry and skips over >> it—despite the block never having been written—resulting in allocation >> of yet another block. Over time, this leaves behind holes in the >> current segment, especially under heavy multi-threaded DIO. > > IIUC, > > The problem is something like this, is my understanding right? > > - user tries to write 768 blocks w/ direct IO. > - f2fs_iomap_begin(ofs:0, len:768) > - f2fs_map_blocks allocates two extents [ofs:0, blk:512, len:512] and > [ofs:512, blk:0, len:0], however f2fs_map_blocks() only return the first > extent, > - f2fs_iomap_begin(ofs:512, len:256) > f2fs_map_blocks allocates another physical block for ofs:512 even there is > a unwritten physical block allocated during previous f2fs_map_blocks. If I'm not missing any thing, this issue has been fixed w/ below patch: commit 0638a3197c194bed837c157c3574685e36febc7b Author: Daejun Park <dae...@sa...> Date: Thu Sep 5 14:24:33 2024 +0900 f2fs: avoid unused block when dio write in LFS mode This patch addresses the problem that when using LFS mode, unused blocks may occur in f2fs_map_blocks() during block allocation for dio writes. If a new section is allocated during block allocation, it will not be included in the map struct by map_is_mergeable() if the LBA of the allocated block is not contiguous. However, the block already allocated in this process will remain unused due to the LFS mode. This patch avoids the possibility of unused blocks by escaping f2fs_map_blocks() when allocating the last block in a section. Signed-off-by: Daejun Park <dae...@sa...> Reviewed-by: Chao Yu <ch...@ke...> Signed-off-by: Jaegeuk Kim <ja...@ke...> Thanks, > > Thanks, > >> >> >> Since I'm still new to F2FS internals, I may be missing something — >> I'd like to understand the design rationale behind this behavior in >> LFS mode, if possible. >> >> **My questions are:** >> >> - Is there a specific reason F2FS does not distinguish between >> reserved-but-unwritten and already-written blocks in this case? >> - Would it be possible (or beneficial) to: >> >> 1. Delay block allocation until the extent can actually be extended? >> >> 2. Track "reserved-but-unwritten" blocks distinctly to avoid reallocation? >> >> >> Thanks in advance for any clarification or insight. >> >> Best regards, >> >> Gyjin Kim >> >> >> _______________________________________________ >> Linux-f2fs-devel mailing list >> Lin...@li... >> https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel > |
From: <bug...@ke...> - 2025-07-18 07:22:40
|
https://bugzilla.kernel.org/show_bug.cgi?id=220321 --- Comment #3 from SEO HOYOUNG (hy5...@sa...) --- Our development system hit this problem again. It seems data was being written to f2fs via writeback. The wb_workfn function uses the writeback workqueue, and the writeback workqueue is created with the WQ_MEM_RECLAIM option: static int __init default_bdi_init(void) { bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_SYSFS, 0); if (!bdi_wq) return -ENOMEM; return 0; } So a kernel panic occurred when quota_release_work was inserted into system_unbound_wq while wb_workfn was running. But it seems WQ_MEM_RECLAIM cannot be added to system_unbound_wq. Therefore, it would be better to remove the WQ_MEM_RECLAIM flag from the bdi_wq workqueue. Is it possible? -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug. |
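[Editor's note] One alternative to removing WQ_MEM_RECLAIM from bdi_wq would be to give the quota release work a reclaim-safe workqueue of its own, so that flushing it from the writeback workqueue no longer trips check_flush_dependency(). The sketch below is purely illustrative and is not the current quota code: the queue name, the init hook, and the queueing helper are all hypothetical.

/*
 * Hypothetical sketch: a dedicated reclaim-safe workqueue for quota
 * release work, instead of system_unbound_wq. Names and call sites
 * are illustrative only.
 */
#include <linux/workqueue.h>

static struct workqueue_struct *quota_release_wq;

static int __init example_quota_release_wq_init(void)
{
	quota_release_wq = alloc_workqueue("quota_release",
					   WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
	return quota_release_wq ? 0 : -ENOMEM;
}

/* The queueing site would then target the dedicated queue. */
static void example_queue_quota_release(struct delayed_work *dwork)
{
	queue_delayed_work(quota_release_wq, dwork, 0);
}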
From: Daniel L. <ch...@go...> - 2025-07-17 18:21:07
|
The ino_t type can be defined as either 'unsigned long' or 'unsigned long long'. Signed-off-by: Daniel Lee <ch...@go...> --- tools/f2fs_io/f2fs_io.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/f2fs_io/f2fs_io.c b/tools/f2fs_io/f2fs_io.c index 8e81ba9..595d1e6 100644 --- a/tools/f2fs_io/f2fs_io.c +++ b/tools/f2fs_io/f2fs_io.c @@ -2329,8 +2329,8 @@ static void do_test_lookup_perf(int argc, char **argv, const struct cmd_desc *cm if (!verb) continue; - printf("%-8lu %-10s %-9d %-8jd %s\n", - dp->d_ino, + printf("%-8llu %-10s %-9d %-8jd %s\n", + (unsigned long long)dp->d_ino, (dp->d_type == DT_REG) ? "regular" : (dp->d_type == DT_DIR) ? "directory" : (dp->d_type == DT_FIFO) ? "FIFO" : -- 2.50.0.727.gbf7dc18ff4-goog |
From: Chao Yu <ch...@ke...> - 2025-07-17 15:52:57
|
As Jiaming Zhang reported: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x1c1/0x2a0 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0x17e/0x800 mm/kasan/report.c:480 kasan_report+0x147/0x180 mm/kasan/report.c:593 data_blkaddr fs/f2fs/f2fs.h:3053 [inline] f2fs_data_blkaddr fs/f2fs/f2fs.h:3058 [inline] f2fs_get_dnode_of_data+0x1a09/0x1c40 fs/f2fs/node.c:855 f2fs_reserve_block+0x53/0x310 fs/f2fs/data.c:1195 prepare_write_begin fs/f2fs/data.c:3395 [inline] f2fs_write_begin+0xf39/0x2190 fs/f2fs/data.c:3594 generic_perform_write+0x2c7/0x910 mm/filemap.c:4112 f2fs_buffered_write_iter fs/f2fs/file.c:4988 [inline] f2fs_file_write_iter+0x1ec8/0x2410 fs/f2fs/file.c:5216 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x546/0xa90 fs/read_write.c:686 ksys_write+0x149/0x250 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xf3/0x3d0 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f The root cause is in the corrupted image, there is a dnode has the same node id w/ its inode, so during f2fs_get_dnode_of_data(), it tries to access block address in dnode at offset 934, however it parses the dnode as inode node, so that get_dnode_addr() returns 360, then it tries to access page address from 360 + 934 * 4 = 4096 w/ 4 bytes. To fix this issue, let's add sanity check for node id of all direct nodes during f2fs_get_dnode_of_data(). Cc: st...@ke... Reported-by: Jiaming Zhang <r77...@gm...> Closes: https://groups.google.com/g/syzkaller/c/-ZnaaOOfO3M Signed-off-by: Chao Yu <ch...@ke...> --- fs/f2fs/node.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 651537598759..12cab5c69fcd 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -815,6 +815,16 @@ int f2fs_get_dnode_of_data(struct dnode_of_data *dn, pgoff_t index, int mode) for (i = 1; i <= level; i++) { bool done = false; + if (nids[i] && nids[i] == dn->inode->i_ino) { + err = -EFSCORRUPTED; + f2fs_err_ratelimited(sbi, + "inode mapping table is corrupted, run fsck to fix it, " + "ino:%lu, nid:%u, level:%d, offset:%d", + dn->inode->i_ino, nids[i], level, offset[level]); + set_sbi_flag(sbi, SBI_NEED_FSCK); + goto release_pages; + } + if (!nids[i] && mode == ALLOC_NODE) { /* alloc new node */ if (!f2fs_alloc_nid(sbi, &(nids[i]))) { -- 2.49.0 |
From: Gao X. <hsi...@li...> - 2025-07-17 03:18:34
|
On 2025/7/17 10:49, Eric Biggers wrote: > On Wed, Jul 16, 2025 at 11:37:28PM +0100, Phillip Lougher wrote: ... > buffer. I suspect that vmap() (or vm_map_ram() which is what f2fs uses) > is actually more efficient than these streaming APIs, since it avoids > the internal copy. But it would need to be measured. Of course vm_map_ram() (that is what erofs relies on first for decompression in tree since 2018, then the f2fs one) will be efficient for decompression and avoid polluting unnecessary caching (considering typical PIPT or VIPT.) Especially for large compressed extents such as 1MiB, another memcpy() will cause much extra overhead over lz4. But as for gzip, xz and zstd, they just implement internal lz77 dictionaries then memcpy for streaming APIs. Since those algorithms are relatively slow (for example Zstd still relies on Huffman and FSE), I don't think it causes much difference to avoid memcpy() in the whole I/O path (because Huffman tree and FSE table are already slow), but lz4 matters. Thanks, Gao Xiang |
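[Editor's note] A minimal sketch of the vm_map_ram() pattern Gao refers to: map the destination pages into one contiguous virtual buffer so a single-shot decompressor such as LZ4 can write straight into the page cache, then unmap. The decompress() callback and the helper name are placeholders rather than a real kernel API.

/* A minimal sketch of the vm_map_ram() pattern: map the destination
 * pages once, hand the decompressor a single virtually contiguous
 * output buffer, then unmap. The decompress() callback is a
 * placeholder, not a real kernel API. */
#include <linux/mm.h>
#include <linux/vmalloc.h>

static int example_decompress_to_pages(struct page **pages, unsigned int nr,
					const void *src, unsigned int srclen,
					int (*decompress)(const void *src,
							  unsigned int srclen,
							  void *dst,
							  unsigned int dstlen))
{
	void *dst = vm_map_ram(pages, nr, NUMA_NO_NODE);
	int err;

	if (!dst)
		return -ENOMEM;

	/* Single-shot decompressors (e.g. LZ4) want exactly this layout:
	 * one contiguous input buffer, one contiguous output buffer. */
	err = decompress(src, srclen, dst, nr * PAGE_SIZE);

	vm_unmap_ram(dst, nr);
	return err;
}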
From: Eric B. <ebi...@ke...> - 2025-07-17 02:50:02
|
On Wed, Jul 16, 2025 at 11:37:28PM +0100, Phillip Lougher wrote: > > There also seems to be some discrepancy between filesystems whether the > > decompression involves vmap() of all the memory allocated or whether the > > decompression routines can handle doing kmap_local() on individual pages. > > > > Squashfs does both, and this depends on whether the decompression > algorithm implementation in the kernel is multi-shot or single-shot. > > The zlib/xz/zstd decompressors are multi-shot, in that you can call them > multiply, giving them an extra input or output buffer when it runs out. > This means you can get them to output into a 4K page at a time, without > requiring the pages to be contiguous. kmap_local() can be called on each > page before passing it to the decompressor. While those compression libraries do provide streaming APIs, it's sort of an illusion. They still need the uncompressed data in a virtually contiguous buffer for the LZ77 match finding and copying to work. So, internally they copy the uncompressed data into a virtually contiguous buffer. I suspect that vmap() (or vm_map_ram() which is what f2fs uses) is actually more efficient than these streaming APIs, since it avoids the internal copy. But it would need to be measured. > > So, my proposal is that filesystems tell the page cache that their minimum > > folio size is the compression block size. That seems to be around 64k, > > so not an unreasonable minimum allocation size. That removes all the > > extra code in filesystems to allocate extra memory in the page cache. > > It means we don't attempt to track dirtiness at a sub-folio granularity > > (there's no point, we have to write back the entire compressed bock > > at once). We also get a single virtually contiguous block ... if you're > > willing to ditch HIGHMEM support. Or there's a proposal to introduce a > > vmap_file() which would give us a virtually contiguous chunk of memory > > (and could be trivially turned into a noop for the case of trying to > > vmap a single large folio). ... but of course, if we could get a virtually contiguous buffer "for free" (at least in the !HIGHMEM case) as in the above proposal, that would clearly be the best option. - Eric |
From: Nanzhe Z. <nz...@12...> - 2025-07-17 01:06:05
|
Dear Mr. Matthew and other fs developers: I'm very sorry, my Gmail may be blocked for reasons I don't know, so I have to change my email domain. > So, my proposal is that filesystems tell the page cache that their minimum > folio size is the compression block size. That seems to be around 64k, > so not an unreasonable minimum allocation size. Excuse me, but could you please clarify the meaning of "compression block size"? If you mean the minimum buffer window size that a filesystem requires to perform one whole compressed-write/decompressed-read I/O (we can also call it the granularity), which in the f2fs context we can interpret as the cluster size, then does that mean that for compressed files we could not fall back to order-0 folios under memory pressure once the folio's minimum order is set to the "compression block size"? If that is the case, then once f2fs' cluster size is configured, the minimum order is determined (and may be beyond 64KiB, depending on how we set the cluster size). If the cluster size is set to a large number, we will face much more risk under memory pressure. Well, as for the 64KiB minimum granularity: because Android now switches the page size to 16KiB, for the current f2fs compression implementation the minimum possible granularity does indeed equal exactly 64KiB. But I hold the opinion that this may not be a very good fit for f2fs. As far as I know, there are lots of small random writes on Android. So instead of having a minimum granularity of 64KiB, I would appreciate it if future f2fs compression implementations supported smaller cluster sizes. As far as I know, storage engineers from vivo are experimenting with a dynamic cluster compression implementation. It can adjust the cluster size within a file adaptively (maybe larger in some parts and smaller in others). They haven't published the code yet, but this design may be more suitable for cooperating with folios given their variable-order nature. > It means we don't attempt to track dirtiness at a sub-folio granularity > > (there's no point, we have to write back the entire compressed bock > at once). That does have a point for f2fs, because we cannot control the order of the folio that readahead gives us if we don't set a maximum order. A large folio can cross multiple clusters in f2fs, as I have mentioned. Since f2fs has no buffer heads or a concept of subpages, as we have discussed previously, it must rely on iomap_folio_state or a similar per-folio struct to distinguish which cluster range of the folio is dirty, and it must recognize a partially dirtied cluster to avoid a compressed write. Besides, I do think a large folio can cross multiple compressed extents in btrfs too, if I didn't misunderstand. May I ask how btrfs deals with the possible write amplification? |
From: Phillip L. <ph...@sq...> - 2025-07-16 22:57:37
|
On 15/07/2025 21:40, Matthew Wilcox wrote: > I've started looking at how the page cache can help filesystems handle > compressed data better. Feedback would be appreciated! I'll probably > say a few things which are obvious to anyone who knows how compressed > files work, but I'm trying to be explicit about my assumptions. > > First, I believe that all filesystems work by compressing fixed-size > plaintext into variable-sized compressed blocks. This would be a good > point to stop reading and tell me about counterexamples. For Squashfs Yes. > >>From what I've been reading in all your filesystems is that you want to > allocate extra pages in the page cache in order to store the excess data > retrieved along with the page that you're actually trying to read. That's > because compressing in larger chunks leads to better compression. > Yes. > There's some discrepancy between filesystems whether you need scratch > space for decompression. Some filesystems read the compressed data into > the pagecache and decompress in-place, while other filesystems read the > compressed data into scratch pages and decompress into the page cache. > Squashfs uses scratch pages. > There also seems to be some discrepancy between filesystems whether the > decompression involves vmap() of all the memory allocated or whether the > decompression routines can handle doing kmap_local() on individual pages. > Squashfs does both, and this depends on whether the decompression algorithm implementation in the kernel is multi-shot or single-shot. The zlib/xz/zstd decompressors are multi-shot, in that you can call them multiply, giving them an extra input or output buffer when it runs out. This means you can get them to output into a 4K page at a time, without requiring the pages to be contiguous. kmap_local() can be called on each page before passing it to the decompressor. The lzo/lz4 decompressors are single-shot, they expect to be called once, with a single contiguous input buffer containing the data to be decompressed, and a single contiguous output buffer large enough to hold all the uncompressed data. > So, my proposal is that filesystems tell the page cache that their minimum > folio size is the compression block size. That seems to be around 64k, > so not an unreasonable minimum allocation size. That removes all the > extra code in filesystems to allocate extra memory in the page cache. > It means we don't attempt to track dirtiness at a sub-folio granularity > (there's no point, we have to write back the entire compressed bock > at once). We also get a single virtually contiguous block ... if you're > willing to ditch HIGHMEM support. Or there's a proposal to introduce a > vmap_file() which would give us a virtually contiguous chunk of memory > (and could be trivially turned into a noop for the case of trying to > vmap a single large folio). > The compression block size in Squashfs can be 4K to 1M in size. Phillip |
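[Editor's note] To make the multi-shot flow concrete, here is a rough sketch of the pattern Phillip describes using the kernel zlib decompressor, handing it one kmap_local'd 4 KiB destination page per call so the output pages never need to be virtually contiguous. The helper name, the single up-front input buffer, and the trimmed error handling are simplifying assumptions; Squashfs itself structures this differently.

/* A rough sketch of the "multi-shot" pattern: feed one 4 KiB output
 * page at a time via kmap_local_page(), so the destination pages
 * never need to be virtually contiguous. Error handling is trimmed. */
#include <linux/highmem.h>
#include <linux/vmalloc.h>
#include <linux/zlib.h>

static int example_multishot_inflate(struct page **out, unsigned int nr_out,
				     void *src, unsigned int srclen)
{
	z_stream stream = {};
	unsigned int i = 0;
	int zerr;

	stream.workspace = vmalloc(zlib_inflate_workspacesize());
	if (!stream.workspace)
		return -ENOMEM;
	if (zlib_inflateInit(&stream) != Z_OK) {
		vfree(stream.workspace);
		return -EIO;
	}

	stream.next_in = src;
	stream.avail_in = srclen;

	do {
		void *dst = kmap_local_page(out[i]);

		stream.next_out = dst;
		stream.avail_out = PAGE_SIZE;
		/* Each call may stop when the current output page fills up. */
		zerr = zlib_inflate(&stream, Z_SYNC_FLUSH);
		kunmap_local(dst);
		i++;
	} while (zerr == Z_OK && i < nr_out);

	zlib_inflateEnd(&stream);
	vfree(stream.workspace);
	return zerr == Z_STREAM_END ? 0 : -EIO;
}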
From: hanqi <ha...@vi...> - 2025-07-16 08:28:12
|
On 2025/7/16 11:43, Jens Axboe wrote: > On 7/15/25 9:34 PM, hanqi wrote: >> >> On 2025/7/15 22:28, Jens Axboe wrote: >>> On 7/14/25 9:10 PM, Qi Han wrote: >>>> Jens has already completed the development of uncached buffered I/O >>>> in git [1], and in f2fs, the feature can be enabled simply by setting >>>> the FOP_DONTCACHE flag in f2fs_file_operations. >>> You need to ensure that for any DONTCACHE IO that the completion is >>> routed via non-irq context, if applicable. I didn't verify that this is >>> the case for f2fs. Generally you can deduce this as well through >>> testing, I'd say the following cases would be interesting to test: >>> >>> 1) Normal DONTCACHE buffered read >>> 2) Overwrite DONTCACHE buffered write >>> 3) Append DONTCACHE buffered write >>> >>> Test those with DEBUG_ATOMIC_SLEEP set in your config, and if that >>> doesn't complain, that's a great start. >>> >>> For the above test cases as well, verify that page cache doesn't grow as >>> IO is performed. A bit is fine for things like meta data, but generally >>> you want to see it remain basically flat in terms of page cache usage. >>> >>> Maybe this is all fine, like I said I didn't verify. Just mentioning it >>> for completeness sake. >> Hi, Jens >> Thanks for your suggestion. As I mentioned earlier in [1], in f2fs, >> the regular buffered write path invokes folio_end_writeback from a >> softirq context. Therefore, it seems that f2fs may not be suitable >> for DONTCACHE I/O writes. >> >> I'd like to ask a question: why is DONTCACHE I/O write restricted to >> non-interrupt context only? Is it because dropping the page might be >> too time-consuming to be done safely in interrupt context? This might >> be a naive question, but I'd really appreciate your clarification. >> Thanks in advance. > Because (as of right now, at least) the code doing the invalidation > needs process context. There are various reasons for this, which you'll > see if you follow the path off folio_end_writeback() -> > filemap_end_dropbehind_write() -> filemap_end_dropbehind() -> > folio_unmap_invalidate(). unmap_mapping_folio() is one case, and while > that may be doable, the inode i_lock is not IRQ safe. > > Most file systems have a need to punt some writeback completions to > non-irq context, eg for file extending etc. Hence for most file systems, > the dontcache case just becomes another case that needs to go through > that path. > > It'd certainly be possible to improve upon this, for example by having > an opportunistic dontcache unmap from IRQ/soft-irq context, and then > punting to a workqueue if that doesn't pan out. But this doesn't exist > as of yet, hence the need for the workqueue punt. Hi, Jens Thank you for your response. I tested uncached buffer I/O reads with a 50GB dataset on a local F2FS filesystem, and the page cache size only increased slightly, which I believe aligns with expectations. After clearing the page cache, the page cache size returned to its initial state. The test results are as follows: stat 50G.txt File: 50G.txt Size: 53687091200 Blocks: 104960712 IO Blocks: 512 regular file [read before]: echo 3 > /proc/sys/vm/drop_caches 01:48:17 kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty 01:50:59 6404648 8149508 2719384 23.40 512 1898092 199384760 823.75 1846756 466832 44 ./uncached_io_test 8192 1 1 50G.txt Starting 1 threads reading bs 8192, uncached 1 1s: 754MB/sec, MB=754 ... 
64s: 844MB/sec, MB=262144 [read after]: 01:52:33 6326664 8121240 2747968 23.65 728 1947656 199384788 823.75 1887896 502004 68 echo 3 > /proc/sys/vm/drop_caches 01:53:11 6351136 8096936 2772400 23.86 512 1900500 199385216 823.75 1847252 533768 104 Hi Chao, Given that F2FS currently calls folio_end_writeback in the softirq context for normal write scenarios, could we first support uncached buffer I/O reads? For normal uncached buffer I/O writes, would it be feasible for F2FS to introduce an asynchronous workqueue to handle the page drop operation in the future? What are your thoughts on this? Thank you! |
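[Editor's note] For context, a hedged userspace sketch of the kind of uncached buffered read being benchmarked above: a plain buffered read issued with RWF_DONTCACHE via preadv2() so the pages are dropped from the page cache once the read completes. It assumes a kernel and C library that expose RWF_DONTCACHE and preadv2(); the fallback #define value, the file path, and the block size are assumptions, and this is not the actual uncached_io_test source used above.

/* Hedged sketch of an uncached buffered read; not the real test tool. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_DONTCACHE
#define RWF_DONTCACHE 0x00000080	/* uapi value at time of writing; check linux/fs.h */
#endif

int main(void)
{
	struct iovec iov;
	off_t off = 0;
	ssize_t ret;
	char *buf;
	int fd;

	if (posix_memalign((void **)&buf, 4096, 8192))
		return 1;

	fd = open("/mnt/f2fs/50G.txt", O_RDONLY);
	if (fd < 0)
		return 1;

	iov.iov_base = buf;
	iov.iov_len = 8192;

	/* Read the whole file 8 KiB at a time without growing the page cache. */
	while ((ret = preadv2(fd, &iov, 1, off, RWF_DONTCACHE)) > 0)
		off += ret;

	if (ret < 0)
		perror("preadv2");
	close(fd);
	free(buf);
	return ret < 0 ? 1 : 0;
}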