Re: [PATCH] bpf: Always defer local storage free
From: bot+bpf-ci
Date: Mon Mar 16 2026 - 19:07:20 EST
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
[ ... ]
> +static void bpf_selem_unlink_defer_free(struct hlist_head *selem_free_list,
> + struct bpf_local_storage *local_storage,
> + bool free_local_storage)
> +{
> + struct bpf_local_storage_elem *s;
> + struct hlist_node *n;
> + struct hlist_head *deferred_selem = this_cpu_ptr(&bpf_deferred_selem_free_list);
> + struct hlist_head *deferred_storage = this_cpu_ptr(&bpf_deferred_storage_free_list);
> + struct bpf_deferred_free_rcu *deferred_rcu = this_cpu_ptr(&bpf_deferred_free_rcu);
> +
> + hlist_for_each_entry_safe(s, n, selem_free_list, free_node) {
> + hlist_del(&s->free_node);
> + hlist_add_head(&s->free_node, deferred_selem);
> + }
> +
> + if (free_local_storage)
> + hlist_add_head(&local_storage->deferred_free_node, deferred_storage);
Can these per-CPU lists be corrupted by concurrent access?

This function is called from bpf_selem_unlink() after releasing
local_storage->lock with raw_res_spin_unlock_irqrestore(), so IRQs
are restored. The BPF program context provides migrate_disable()
(keeping the CPU stable for this_cpu_ptr()), but IRQs are not masked.
A BPF tracing program (kprobe, perf_event) running at hardirq or
NMI level can interrupt this function mid-hlist_add_head() and call
bpf_selem_unlink() -> bpf_selem_unlink_defer_free() on the same
CPU, modifying the same per-CPU list concurrently. hlist_add_head()
is a multi-step, non-atomic operation, so this interleaving can
lose list entries or corrupt the list pointers.

The same race exists between bpf_deferred_free_work_fn() (running
in process context on the workqueue) and
bpf_selem_unlink_defer_free() called from any BPF program that
interrupts it.

Would something like an llist (a lock-free linked list that is safe
against concurrent producers) or local_irq_save()/restore around
the list operations be needed here?
> +
> + if (atomic_cmpxchg(this_cpu_ptr(&bpf_deferred_free_pending), 0, 1) == 0) {
> + deferred_rcu->cpu = smp_processor_id();
> + call_rcu(&deferred_rcu->rcu, bpf_deferred_free_rcu_callback);
> + }
> +}
[ ... ]
> +static void bpf_deferred_free_work_fn(struct work_struct *work)
> +{
> + struct bpf_deferred_free_work *deferred_work =
> + container_of(work, struct bpf_deferred_free_work, work);
> + int cpu = deferred_work->cpu;
> + struct hlist_head *selem_list = per_cpu_ptr(&bpf_deferred_selem_free_list, cpu);
> + struct hlist_head *storage_list = per_cpu_ptr(&bpf_deferred_storage_free_list, cpu);
> + struct bpf_local_storage_elem *selem;
> + struct bpf_local_storage *local_storage;
> + struct hlist_node *n;
> +
> + atomic_set(per_cpu_ptr(&bpf_deferred_free_pending, cpu), 0);
> +
> + hlist_for_each_entry_safe(selem, n, selem_list, free_node) {
> + hlist_del_init(&selem->free_node);
> + bpf_selem_free(selem, true);
> + }
> +
> + hlist_for_each_entry_safe(local_storage, n, storage_list, deferred_free_node) {
> + hlist_del_init(&local_storage->deferred_free_node);
> + bpf_local_storage_free(local_storage, true);
> + }
> +}
Does passing reuse_now=true here change the RCU synchronization
domain?

Previously, bpf_selem_unlink() called bpf_selem_free(selem, false),
which for use_kmalloc_nolock elements used call_rcu_tasks_trace()
to wait for a Tasks Trace RCU grace period before freeing. This is
needed because sleepable BPF programs hold rcu_read_lock_trace()
(see __bpf_prog_enter_sleepable()) and can retain pointers to sdata
across sleep points.

With this change, the work function calls bpf_selem_free(selem, true),
which uses regular call_rcu() instead. The full chain becomes:

  bpf_selem_unlink_defer_free()
    -> call_rcu()                   [regular RCU grace period]
    -> queue_work_on()
    -> bpf_selem_free(selem, true)
       -> call_rcu()                [regular RCU grace period]
       -> kfree_nolock()

Two regular RCU grace periods do not substitute for one Tasks Trace
RCU grace period: they are different synchronization domains, and a
sleepable BPF program sleeping inside rcu_read_lock_trace() will
not block regular call_rcu() callbacks.

The reuse_now=true path (regular call_rcu()) was previously only used
from bpf_selem_unlink_nofail() in the map_free/destroy paths, where
synchronize_rcu_mult(call_rcu, call_rcu_tasks_trace) had already been
called to cover both domains. In this new usage, no Tasks Trace RCU
synchronization occurs before the free.

The same applies to bpf_local_storage_free(local_storage, true),
which similarly changes from call_rcu_tasks_trace() to call_rcu().
---
AI reviewed your patch. Please fix the bug or reply to this email explaining why it is not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/23169312812