[PATCH 2/2] f2fs: pack same-inode blocks by inode during FG_GC

From: Daejun Park

Date: Tue May 19 2026 - 06:49:38 EST

The legacy FG_GC path migrates a victim section's valid blocks in
source segment-offset order: blocks of several inodes that were
interleaved in each source segment are migrated to the destination
curseg in the same interleaved order, carrying source-side
fragmentation forward into the post-GC layout regardless of section
size.

Pack the migration order by inode for every victim section:

* gc_data_segment()'s phase 3 records each valid block on a
per-inode gc_blocks list hanging off the inode_entry that
add_gc_inode() already creates in gc_list. Each gc_block
carries the source segno, nofs, ofs_in_node and per-segment off
so the deferred migration can rebuild start_bidx and pass the
correct segno to check_valid_map() inside the existing
do_migrate_one_data_block() helper.

* Phase 4 of gc_data_segment() is gated by nr_phases: in packing
mode nr_phases caps the loop at 4 (phases 0..3), so the summary
block is not re-scanned just to hit a per-slot 'continue'. The
phase 4 migration body is reached only via the new 'goto
do_migrate' fallback path described below, in which case the
inode_entry just returned by add_gc_inode() is reused instead of
repeating find_gc_inode().

* do_garbage_collect() invokes pack_gc_section() once, after every
source segment of the victim section has been parsed. Walking
gc_list->ilist one inode at a time (in insertion order) emits all
of that inode's blocks contiguously to the destination curseg. On
large sections this lets an inode's blocks span the full
SEGS_PER_SEC * usable_blks_in_seg destination range.

* i_gc_rwsem is taken and released per block inside the packing
pass (via do_migrate_one_data_block()), matching the legacy
phase 4 lock-holding window, so concurrent user IO waits on the
lock no longer than before.
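
As a rough user-space illustration of the two steps above (queue per
inode while scanning, then drain inode by inode), the following is a
minimal sketch and not the kernel code. The names blk_rec, ino_ent,
pack_list, queue_block() and pack_section() are hypothetical stand-ins
for gc_block, inode_entry, gc_list, the phase 3 enqueue and
pack_gc_section(); the fixed-size array replaces the radix tree and
list only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for one queued block record. */
struct blk_rec {
	struct blk_rec *next;
	unsigned int segno;	/* source segment */
	int off;		/* offset inside that segment */
};

/* Hypothetical stand-in for a per-inode entry with its block queue. */
struct ino_ent {
	unsigned long ino;
	struct blk_rec *head, *tail;	/* FIFO of queued blocks */
};

#define MAX_INODES 8

struct pack_list {
	struct ino_ent ents[MAX_INODES];	/* kept in first-seen order */
	size_t nr;
};

/* Find or create the entry for @ino, preserving first-seen order. */
static struct ino_ent *get_ent(struct pack_list *pl, unsigned long ino)
{
	size_t i;

	for (i = 0; i < pl->nr; i++)
		if (pl->ents[i].ino == ino)
			return &pl->ents[i];
	if (pl->nr == MAX_INODES)
		return NULL;
	pl->ents[pl->nr].ino = ino;
	pl->ents[pl->nr].head = pl->ents[pl->nr].tail = NULL;
	return &pl->ents[pl->nr++];
}

/* Scan step: record one valid block on its inode's queue; 0 on success. */
static int queue_block(struct pack_list *pl, unsigned long ino,
		       unsigned int segno, int off)
{
	struct ino_ent *ie = get_ent(pl, ino);
	struct blk_rec *r;

	if (!ie)
		return -1;
	r = malloc(sizeof(*r));
	if (!r)
		return -1;
	r->next = NULL;
	r->segno = segno;
	r->off = off;
	if (ie->tail)
		ie->tail->next = r;
	else
		ie->head = r;
	ie->tail = r;
	return 0;
}

/*
 * Pack step: drain every queue inode by inode and store the owning inode
 * number of each emitted block in @out (up to @cap entries). Records are
 * always freed. Returns the number of entries written to @out.
 */
static size_t pack_section(struct pack_list *pl, unsigned long *out,
			   size_t cap)
{
	size_t i, n = 0;

	assert(pl && out);
	for (i = 0; i < pl->nr; i++) {
		struct blk_rec *r = pl->ents[i].head;

		while (r) {
			struct blk_rec *next = r->next;

			if (n < cap)
				out[n++] = pl->ents[i].ino;
			free(r);
			r = next;
		}
		pl->ents[i].head = pl->ents[i].tail = NULL;
	}
	return n;
}
```

Queuing blocks of two inodes in interleaved source order across two
segments and then calling pack_section() yields all blocks of the
first-seen inode followed by all blocks of the second, which is the
ordering the packing pass aims for.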

Activation conditions:
* sbi->gc_inode_local_packing == true (sysfs writable, accepts
only 0 or 1; default derived from __is_large_section(sbi), since
on a single-segment section the gain is marginal and the queue
adds memory pressure for little return)
* gc_type == FG_GC; BG_GC's move_data_page() path defers
destination allocation to the writeback flusher, so any
reordering applied during GC is lost.

Race against the sysfs knob: gc_inode_local_packing is unsynchronised.
Re-reading it from phase 3 (enqueue), phase 4 (skip) and the pack
pass independently would let a concurrent toggle queue blocks via
gc_blocks and then bypass pack_gc_section(). do_garbage_collect()
snapshots the value into a local 'pack_by_inode' bool and threads it
through gc_data_segment() and the packing call so all three sites
remain consistent for the entire section.
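
The snapshot idea can be illustrated outside the kernel as follows.
This is only a sketch under stated assumptions: struct knob,
collect_section() and the toggle_at parameter are hypothetical, and a
C11 atomic_bool stands in for the unsynchronised sysfs-writable field.
The point shown is that the flag is read once and every later decision
uses the local copy, so a flip in the middle of the section cannot
leave queued blocks undrained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum gc_kind { GC_BG = 0, GC_FG = 1 };	/* hypothetical mirror of BG_GC/FG_GC */

struct knob {
	atomic_bool packing;		/* stands in for the sysfs-writable bool */
};

struct section_result {
	int queued;			/* blocks put on a per-inode queue */
	int packed;			/* blocks drained by the pack pass */
	int migrated_inline;		/* blocks migrated by the legacy path */
};

/*
 * Process @nr_blocks blocks of one section. @toggle_at is the block index
 * at which a simulated concurrent writer flips the knob (-1 for never).
 */
static struct section_result collect_section(struct knob *k, enum gc_kind kind,
					     int nr_blocks, int toggle_at)
{
	struct section_result res = { 0, 0, 0 };
	/* Snapshot: read the knob exactly once for the whole section. */
	const bool pack = atomic_load(&k->packing) && kind == GC_FG;
	int i;

	for (i = 0; i < nr_blocks; i++) {
		if (i == toggle_at)	/* simulated racing sysfs write */
			atomic_store(&k->packing, !atomic_load(&k->packing));

		if (pack)
			res.queued++;
		else
			res.migrated_inline++;
	}

	/* The drain decision uses the same snapshot as the enqueue decision. */
	if (pack)
		res.packed = res.queued;

	return res;
}
```

With the knob initially set and a toggle injected after three blocks,
every block is still both queued and packed; had the drain step re-read
the knob, the queued blocks would have been skipped.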

Per-block records are allocated from a dedicated f2fs_gc_block slab
(SLAB_RECLAIM_ACCOUNT via f2fs_kmem_cache_create) rather than
kmalloc(GFP_NOFS); on a fully valid 64 MiB section (SEGS_PER_SEC=32)
one section can queue up to SEGS_PER_SEC * BLKS_PER_SEG records
(~512 KiB at 32 B per gc_block), so a per-cache slabinfo line and
FAULT_SLAB_ALLOC coverage of the fallback path are useful for
diagnostics.
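
The sizing estimate above can be reproduced with a small user-space
calculation. This is a sketch only: gc_block_mirror is a hypothetical
mirror of struct gc_block with list_head modelled as two pointers, and
the figure of 512 blocks per segment assumes 2 MiB segments of 4 KiB
blocks. The 32-byte record size holds on typical 64-bit ABIs; a 32-bit
build gives a smaller record.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the kernel's struct list_head: two pointers. */
struct list_head_mirror {
	void *next, *prev;
};

/* Hypothetical user-space mirror of struct gc_block from this patch. */
struct gc_block_mirror {
	struct list_head_mirror list;
	unsigned int segno;
	unsigned int nofs;
	unsigned int ofs_in_node;
	int off;
};

/* Compile-time sanity check: the mirror holds at least its own fields. */
static_assert(sizeof(struct gc_block_mirror) >=
	      2 * sizeof(void *) + 4 * sizeof(int),
	      "gc_block_mirror smaller than its fields");

/* Upper bound on queued records for one fully valid section. */
static size_t max_gc_records(size_t segs_per_sec, size_t blks_per_seg)
{
	return segs_per_sec * blks_per_seg;
}

/* Upper bound on memory used by those records, in bytes. */
static size_t max_gc_bytes(size_t segs_per_sec, size_t blks_per_seg)
{
	return max_gc_records(segs_per_sec, blks_per_seg) *
	       sizeof(struct gc_block_mirror);
}
```

For SEGS_PER_SEC=32 and 512 blocks per segment, max_gc_records()
returns 16384 and, with a 32-byte record, max_gc_bytes() returns
524288, i.e. 512 KiB.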

On allocation failure, phase 3 jumps to 'do_migrate', the same
phase 4 body the !pack_by_inode path uses, so the block is migrated
immediately rather than dropped. This costs the packing benefit
for that one block but preserves FG_GC progress under memory
pressure, which matters more when FG_GC is called precisely
because the system is short on free sections.
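
The fallback behaviour can be modelled in user space as below. This is
a sketch under assumptions, not the kernel path: fault_alloc,
alloc_rec() and scan_blocks() are hypothetical, and the "fail every Nth
call" rule is a crude stand-in for slab fault injection. It shows that
a failed record allocation reduces how many blocks are queued for
packing but never how many blocks are handled in total.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, crude model of an allocator with injected failures. */
struct fault_alloc {
	unsigned int calls;		/* allocations attempted so far */
	unsigned int fail_every;	/* fail every Nth call; 0 = never */
};

static void *alloc_rec(struct fault_alloc *fa, size_t size)
{
	assert(fa);
	fa->calls++;
	if (fa->fail_every && fa->calls % fa->fail_every == 0)
		return NULL;		/* injected failure */
	return malloc(size);
}

struct gc_stats {
	unsigned int queued;		/* deferred to the pack pass */
	unsigned int migrated_now;	/* handled by the immediate fallback */
};

/* Scan @nr_blocks valid blocks, queuing each or migrating it at once. */
static struct gc_stats scan_blocks(struct fault_alloc *fa,
				   unsigned int nr_blocks)
{
	struct gc_stats st = { 0, 0 };
	unsigned int i;

	for (i = 0; i < nr_blocks; i++) {
		void *rec = alloc_rec(fa, 32);

		if (!rec) {
			/* Allocation failed: migrate this block right away. */
			st.migrated_now++;
			continue;
		}
		st.queued++;
		free(rec);	/* a real queue would keep it until the drain */
	}
	return st;
}
```

With fail_every set to 4 and 16 blocks scanned, 12 blocks are queued
and 4 take the immediate path; the sum always equals the number of
blocks scanned.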

Measurements (QEMU virtio guest, 4-cycle fragmentation, gc_urgent
40s, filefrag total extents before/after GC; structural counters
only since QEMU virtio BW/lat is unreliable):

Large section (mkfs.f2fs -s 32 = 64 MiB section,
64 files x 4 MiB):
legacy 65536 -> 65536 0 % reduction
packed 65536 -> 49170 25 % reduction (-16366 extents)

Default section (mkfs.f2fs -s 1 = 2 MiB section,
128 files x 256 KiB):
legacy 8192 -> 8192 0 % reduction
packed 8192 -> 7690 6 % reduction
GC work (move_blks, cp_blks, gc_calls) identical between modes;
the packing only reorders dest curseg writes.

Natural FG_GC under tight cold migration
(mkfs.f2fs -s 32, 2 GiB disk 90 % fill,
6 hot x 200 MiB + 6 cold x 100 MiB interleaved fill,
background_gc=sync, 300 s hot rewrite):
legacy cold extents 350 -> 357 (delta +7, no improvement)
packed cold extents 350 -> 132 (delta -218, 62 % reduction)
per user iter:
move_blks legacy 42344 packed 34822 (-18 %)
cp_blocks legacy 23.90 packed 22.95 (-4 %)
skipped_gc_rwsem legacy 108 packed 44 (-59 %)
hot rewrite iters in fixed 300 s window: +45 %

Sanity verified in QEMU guest (mkfs.f2fs -s 8, 16 x 4 MiB files,
gc_urgent + remount): data sha256 matches before and after GC; no
WARN/BUG in dmesg; gc_inode_local_packing knob exposed under
/sys/fs/f2fs/<disk>/. An additional stress run on mkfs.f2fs -s 32
with FAULT_SLAB_ALLOC at inject_rate=4 triggered 7689 slab alloc
failures during FG_GC, exercising the 'goto do_migrate' fallback;
sha256 was preserved and dmesg stayed clean.

Signed-off-by: Daejun Park <daejun7.park@xxxxxxxxxxx>
---
Documentation/ABI/testing/sysfs-fs-f2fs | 10 +++
fs/f2fs/f2fs.h | 7 +-
fs/f2fs/gc.c | 109 ++++++++++++++++++++++--
fs/f2fs/super.c | 1 +
fs/f2fs/sysfs.c | 7 ++
5 files changed, 123 insertions(+), 11 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-fs-f2fs b/Documentation/ABI/testing/sysfs-fs-f2fs
index 1b58c029a..1085af8f6 100644
--- a/Documentation/ABI/testing/sysfs-fs-f2fs
+++ b/Documentation/ABI/testing/sysfs-fs-f2fs
@@ -1002,3 +1002,13 @@ Description: It can be used to tune priority of f2fs critical task, e.g. f2fs_ck
threads, limitation as below:
- it requires user has CAP_SYS_NICE capability.
- the range is [100, 139], by default the value is 120.
+
+What: /sys/fs/f2fs/<disk>/gc_inode_local_packing
+Date: May 2026
+Contact: Daejun Park <daejun7.park@xxxxxxxxxxx>
+Description: When set to 1, foreground GC packs valid blocks of the same
+ inode contiguously into the destination curseg instead of
+ migrating them in source segment-offset order. Effective
+ only under FG_GC; BG_GC's writeback-deferred destination
+ allocation is unaffected. Default is 1 on large sections
+ (SEGS_PER_SEC > 1), 0 otherwise. Set to 0 to disable.
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index f0a54883b..8cd0ec5b5 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -405,8 +405,9 @@ struct ino_entry {

/* for the list of inodes to be GCed */
struct inode_entry {
- struct list_head list; /* list head */
- struct inode *inode; /* vfs inode pointer */
+ struct list_head list; /* list head */
+ struct inode *inode; /* vfs inode pointer */
+ struct list_head gc_blocks; /* per-inode block list for GC packing */
};

struct fsync_node_entry {
@@ -1908,6 +1909,8 @@ struct f2fs_sb_info {
unsigned int migration_granularity;
/* migration window granularity of garbage collection, unit: segment */
unsigned int migration_window_granularity;
+ /* pack same-inode blocks together during FG_GC migration */
+ bool gc_inode_local_packing;

/*
* for stat information.
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 48412f9a5..1d598b188 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -24,6 +24,16 @@
#include <trace/events/f2fs.h>

static struct kmem_cache *victim_entry_slab;
+static struct kmem_cache *gc_block_slab;
+
+/* Per-block migration record for inode-local packing under FG_GC. */
+struct gc_block {
+ struct list_head list;
+ unsigned int segno; /* source segment for check_valid_map() */
+ unsigned int nofs;
+ unsigned int ofs_in_node;
+ int off;
+};

static unsigned int count_bits(const unsigned long *addr,
unsigned int offset, unsigned int len);
@@ -1004,6 +1014,7 @@ static struct inode_entry *add_gc_inode(struct gc_inode_list *gc_list,
new_ie = f2fs_kmem_cache_alloc(f2fs_inode_entry_slab,
GFP_NOFS, true, NULL);
new_ie->inode = inode;
+ INIT_LIST_HEAD(&new_ie->gc_blocks);

f2fs_radix_tree_insert(&gc_list->iroot, inode->i_ino, new_ie);
list_add_tail(&new_ie->list, &gc_list->ilist);
@@ -1013,8 +1024,13 @@ static struct inode_entry *add_gc_inode(struct gc_inode_list *gc_list,
static void put_gc_inode(struct gc_inode_list *gc_list)
{
struct inode_entry *ie, *next_ie;
+ struct gc_block *e, *tmp_e;

list_for_each_entry_safe(ie, next_ie, &gc_list->ilist, list) {
+ list_for_each_entry_safe(e, tmp_e, &ie->gc_blocks, list) {
+ list_del(&e->list);
+ kmem_cache_free(gc_block_slab, e);
+ }
radix_tree_delete(&gc_list->iroot, ie->inode->i_ino);
iput(ie->inode);
list_del(&ie->list);
@@ -1612,7 +1628,7 @@ static int do_migrate_one_data_block(struct f2fs_sb_info *sbi,
*/
static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
struct gc_inode_list *gc_list, unsigned int segno, int gc_type,
- bool force_migrate, struct blk_plug *plug)
+ bool force_migrate, bool pack_by_inode, struct blk_plug *plug)
{
struct super_block *sb = sbi->sb;
struct f2fs_summary *entry;
@@ -1621,6 +1637,8 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
int phase = 0;
int submitted = 0;
unsigned int usable_blks_in_seg = f2fs_usable_blks_in_seg(sbi, segno);
+ /* packing path skips phase 4; pack_gc_section() handles migration */
+ int nr_phases = pack_by_inode ? 4 : 5;

start_addr = START_BLOCK(sbi, segno);

@@ -1629,6 +1647,7 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,

for (off = 0; off < usable_blks_in_seg; off++, entry++) {
struct inode *inode;
+ struct inode_entry *ie = NULL;
struct node_info dni; /* dnode info for the data */
unsigned int ofs_in_node, nofs;
block_t start_bidx;
@@ -1671,6 +1690,7 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,

if (phase == 3) {
struct folio *data_folio;
+ struct gc_block *e;
int err;

inode = f2fs_iget(sb, dni.ino);
@@ -1717,8 +1737,10 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
iput(inode);
continue;
}
- add_gc_inode(gc_list, inode);
- continue;
+ ie = add_gc_inode(gc_list, inode);
+ if (!pack_by_inode)
+ continue;
+ goto queue;
}

data_folio = f2fs_get_read_data_folio(inode, start_bidx,
@@ -1730,18 +1752,37 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
}

f2fs_folio_put(data_folio, false);
- add_gc_inode(gc_list, inode);
+ ie = add_gc_inode(gc_list, inode);
+ if (!pack_by_inode)
+ continue;
+queue:
+ e = f2fs_kmem_cache_alloc(gc_block_slab, GFP_NOFS,
+ false, sbi);
+ if (!e)
+ goto do_migrate; /* alloc fail: migrate now */
+ e->segno = segno;
+ e->nofs = nofs;
+ e->ofs_in_node = ofs_in_node;
+ e->off = off;
+ list_add_tail(&e->list, &ie->gc_blocks);
continue;
}

- /* phase 4 */
- inode = find_gc_inode(gc_list, dni.ino);
+ /*
+ * phase 4: legacy per-segment migration. Capped out by
+ * nr_phases when packing is on; reached only via the
+ * 'goto do_migrate' fallback above, in which case @ie is
+ * the entry add_gc_inode() just returned and we reuse it
+ * instead of repeating the radix-tree lookup.
+ */
+do_migrate:
+ inode = ie ? ie->inode : find_gc_inode(gc_list, dni.ino);
if (inode)
submitted += do_migrate_one_data_block(sbi, inode,
segno, off, nofs, ofs_in_node, gc_type);
}

- if (++phase < 5) {
+ if (++phase < nr_phases) {
blk_finish_plug(plug);
blk_start_plug(plug);
goto next_step;
@@ -1750,6 +1791,31 @@ static int gc_data_segment(struct f2fs_sb_info *sbi, struct f2fs_summary *sum,
return submitted;
}

+/*
+ * pack_gc_section - migrate all gc_blocks queued for this victim section,
+ * grouped by inode. gc_list->ilist is walked in insertion order so
+ * destination curseg writes form inode-contiguous runs that span every
+ * source segment of the section.
+ */
+static int pack_gc_section(struct f2fs_sb_info *sbi,
+ struct gc_inode_list *gc_list, int gc_type)
+{
+ struct inode_entry *ie;
+ struct gc_block *e, *tmp;
+ int submitted = 0;
+
+ list_for_each_entry(ie, &gc_list->ilist, list) {
+ list_for_each_entry_safe(e, tmp, &ie->gc_blocks, list) {
+ submitted += do_migrate_one_data_block(sbi, ie->inode,
+ e->segno, e->off, e->nofs,
+ e->ofs_in_node, gc_type);
+ list_del(&e->list);
+ kmem_cache_free(gc_block_slab, e);
+ }
+ }
+ return submitted;
+}
+
static int __get_victim(struct f2fs_sb_info *sbi, unsigned int *victim,
int gc_type, bool one_time)
{
@@ -1776,6 +1842,13 @@ static int do_garbage_collect(struct f2fs_sb_info *sbi,
unsigned char type;
unsigned char data_type;
int submitted = 0, sum_blk_cnt;
+ /*
+ * Snapshot the packing knob once for this section. Re-reading the
+ * sysfs-writable bool from phase 3, phase 4 and the pack pass would
+ * let a concurrent toggle queue blocks on ie->gc_blocks and then
+ * bypass pack_gc_section(), losing this cycle of migration.
+ */
+ bool pack_by_inode = sbi->gc_inode_local_packing && gc_type == FG_GC;

if (__is_large_section(sbi)) {
sec_end_segno = rounddown(end_segno, SEGS_PER_SEC(sbi));
@@ -1904,7 +1977,8 @@ static int do_garbage_collect(struct f2fs_sb_info *sbi,
else
submitted += gc_data_segment(sbi, sum->entries,
gc_list, cur_segno,
- gc_type, force_migrate, &plug);
+ gc_type, force_migrate,
+ pack_by_inode, &plug);

stat_inc_gc_seg_count(sbi, data_type, gc_type);
sbi->gc_reclaimed_segs[sbi->gc_mode]++;
@@ -1930,6 +2004,14 @@ static int do_garbage_collect(struct f2fs_sb_info *sbi,
segno = block_end_segno;
}

+ /*
+ * Drain the per-inode gc_blocks queue. Skipped on freezing
+ * (goto stop above): leftover entries are freed by put_gc_inode()
+ * in f2fs_gc().
+ */
+ if (pack_by_inode)
+ submitted += pack_gc_section(sbi, gc_list, gc_type);
+
stop:
if (submitted)
f2fs_submit_merged_write(sbi, data_type);
@@ -2105,11 +2187,20 @@ int __init f2fs_create_garbage_collection_cache(void)
{
victim_entry_slab = f2fs_kmem_cache_create("f2fs_victim_entry",
sizeof(struct victim_entry));
- return victim_entry_slab ? 0 : -ENOMEM;
+ if (!victim_entry_slab)
+ return -ENOMEM;
+ gc_block_slab = f2fs_kmem_cache_create("f2fs_gc_block",
+ sizeof(struct gc_block));
+ if (!gc_block_slab) {
+ kmem_cache_destroy(victim_entry_slab);
+ return -ENOMEM;
+ }
+ return 0;
}

void f2fs_destroy_garbage_collection_cache(void)
{
+ kmem_cache_destroy(gc_block_slab);
kmem_cache_destroy(victim_entry_slab);
}

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index ada8098f8..f1bee7f3d 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -4353,6 +4353,7 @@ static void init_sb_info(struct f2fs_sb_info *sbi)
sbi->migration_granularity = SEGS_PER_SEC(sbi);
sbi->migration_window_granularity = f2fs_sb_has_blkzoned(sbi) ?
DEF_MIGRATION_WINDOW_GRANULARITY_ZONED : SEGS_PER_SEC(sbi);
+ sbi->gc_inode_local_packing = __is_large_section(sbi);
sbi->seq_file_ra_mul = MIN_RA_MUL;
sbi->max_fragment_chunk = DEF_FRAGMENT_SIZE;
sbi->max_fragment_hole = DEF_FRAGMENT_SIZE;
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 665687244..30a3beb60 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -659,6 +659,11 @@ static ssize_t __sbi_store(struct f2fs_attr *a,
return -EINVAL;
}

+ if (!strcmp(a->attr.name, "gc_inode_local_packing")) {
+ if (t > 1)
+ return -EINVAL;
+ }
+
if (!strcmp(a->attr.name, "gc_urgent")) {
if (t == 0) {
sbi->gc_mode = GC_NORMAL;
@@ -1269,6 +1274,7 @@ F2FS_SBI_RW_ATTR(gc_reclaimed_segments, gc_reclaimed_segs);
F2FS_SBI_GENERAL_RW_ATTR(max_victim_search);
F2FS_SBI_GENERAL_RW_ATTR(migration_granularity);
F2FS_SBI_GENERAL_RW_ATTR(migration_window_granularity);
+F2FS_SBI_GENERAL_RW_ATTR(gc_inode_local_packing);
F2FS_SBI_GENERAL_RW_ATTR(dir_level);
F2FS_SBI_GENERAL_RW_ATTR(allocate_section_hint);
F2FS_SBI_GENERAL_RW_ATTR(allocate_section_policy);
@@ -1438,6 +1444,7 @@ static struct attribute *f2fs_attrs[] = {
ATTR_LIST(max_victim_search),
ATTR_LIST(migration_granularity),
ATTR_LIST(migration_window_granularity),
+ ATTR_LIST(gc_inode_local_packing),
ATTR_LIST(dir_level),
ATTR_LIST(ram_thresh),
ATTR_LIST(ra_nid_pages),
--
2.43.0