[PATCH v3 2/3] bpf: use kvfree() for replaced sysctl write buffer

From: Dawei Feng

Date: Wed Jun 03 2026 - 07:11:02 EST


proc_sys_call_handler() allocates its temporary sysctl buffer with
kvzalloc() and passes it to __cgroup_bpf_run_filter_sysctl(). Since
kvzalloc() may fall back to vmalloc() for large allocations, freeing
that buffer with kfree() is wrong and can corrupt memory.

Use kvfree() to safely handle both kmalloc and kvzalloc()/vmalloc
allocations.

The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available. Manual inspection confirms that the bug is still
present in v7.1-rc5.

Reproduced the bug based on v7.1-rc4 in a QEMU x86_64 guest booted with
KASAN and CONFIG_FAILSLAB enabled. To exercise the replacement path, the
test tree also included the accompanying fix for the stale ret == 1
check in __cgroup_bpf_run_filter_sysctl(). The reproducer confines
failslab injections to the proc_sys_call_handler() range, uses
stacktrace-depth=32, and injects fail-nth=1 while writing 8191 bytes to
/proc/sys/kernel/domainname from a task in the target cgroup. Under
that setup, fail-nth=1 triggered the fault:

BUG: unable to handle page fault for address: ffffeb0200024d48
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 SMP KASAN NOPTI
CPU: 2 UID: 0 PID: 209 Comm: repro_proc_sys_ Not tainted 7.1.0-rc4-00686-g97625979a5d4 PREEMPT(lazy)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:kfree+0x6e/0x510
...
Call Trace:
<TASK>
? __cgroup_bpf_run_filter_sysctl+0x626/0xc30
__cgroup_bpf_run_filter_sysctl+0x74d/0xc30
? __pfx___cgroup_bpf_run_filter_sysctl+0x10/0x10
? srso_return_thunk+0x5/0x5f
? __kvmalloc_node_noprof+0x345/0x870
? proc_sys_call_handler+0x250/0x480
? srso_return_thunk+0x5/0x5f
proc_sys_call_handler+0x3a2/0x480
? __pfx_proc_sys_call_handler+0x10/0x10
? srso_return_thunk+0x5/0x5f
? selinux_file_permission+0x39f/0x500
? srso_return_thunk+0x5/0x5f
? lock_is_held_type+0x9e/0x120
vfs_write+0x98e/0x1000
...
</TASK>

With this fix applied on top of the same test setup, rerunning the
reproducer with fail-nth=1 yields no corresponding Oops reports.

Fixes: 4508943794ef ("proc: use kvzalloc for our kernel buffer")
Cc: stable@xxxxxxxxxxxxxxx

Reviewed-by: Emil Tsalapatis <emil@xxxxxxxxxxxxxxx>
Reviewed-by: Jiayuan Chen <jiayuan.chen@xxxxxxxxx>
Acked-by: Yonghong Song <yonghong.song@xxxxxxxxx>
Signed-off-by: Zilin Guan <zilin@xxxxxxxxxx>
Signed-off-by: Dawei Feng <dawei.feng@xxxxxxxxxx>
---
kernel/bpf/cgroup.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 2c7f72d3fb11..a0b5f8cd8b10 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1936,7 +1936,7 @@ int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
kfree(ctx.cur_val);

if (ret == 1 && ctx.new_updated) {
- kfree(*buf);
+ kvfree(*buf);
*buf = ctx.new_val;
*pcount = ctx.new_len;
} else {
--
2.34.1