[RFC PATCH v8 00/10] context_tracking,x86: Defer some IPIs until a user->kernel transition
From: Valentin Schneider
Date: Tue Mar 24 2026 - 05:56:20 EST
Context
=======
We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:
64359.052209596 NetworkManager 0 1405 smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
smp_call_function_many_cond+0x1
smp_call_function+0x39
on_each_cpu+0x2a
flush_tlb_kernel_range+0x7b
__purge_vmap_area_lazy+0x70
_vm_unmap_aliases.part.42+0xdf
change_page_attr_set_clr+0x16a
set_memory_ro+0x26
bpf_int_jit_compile+0x2f9
bpf_prog_select_runtime+0xc6
bpf_prepare_filter+0x523
sk_attach_filter+0x13
sock_setsockopt+0x92c
__sys_setsockopt+0x16a
__x64_sys_setsockopt+0x20
do_syscall_64+0x87
entry_SYSCALL_64_after_hwframe+0x65
The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, provided it can be executed "early
enough" in the entry code.
The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.
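As a rough illustration of the idea, here is a hedged userspace sketch. All
identifiers (cpu_in_user, defer_pending, etc.) are invented for illustration -
the real series doesn't track a per-CPU bool like this, it piggybacks on the
kPTI CR3 switch as described below - but the sender/entry split is the same:

```c
/*
 * Userspace simulation of the deferral idea. Illustrative names only;
 * not the series' actual code.
 */
#include <stdbool.h>

#define NR_CPUS 4

static bool cpu_in_user[NR_CPUS];   /* CPU currently in pure userspace? */
static bool defer_pending[NR_CPUS]; /* kernel-only work queued for later */
static int ipi_count;               /* IPIs actually sent */

/* A kernel-only operation (e.g. kernel TLB flush) targeting all CPUs. */
static void on_each_cpu_deferrable(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu_in_user[cpu]) {
			/*
			 * Userspace can't observe kernel mappings, so the
			 * callback can wait until the next kernel entry.
			 */
			defer_pending[cpu] = true;
		} else {
			ipi_count++; /* would send a real IPI here */
		}
	}
}

/* Runs early on the next user->kernel transition of @cpu. */
static void kernel_entry(int cpu)
{
	cpu_in_user[cpu] = false;
	if (defer_pending[cpu]) {
		defer_pending[cpu] = false;
		/* catch-up sequence: run the deferred operation locally */
	}
}
```

An isolated CPU sitting in userspace thus never gets interrupted; it pays for
the deferred work on a transition it was going to make anyway.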
Deferral approach
=================
Previous versions would assign IPIs a "type" and have a mapping of IPI type to
callback, leveraged upon kernel entry via the context_tracking framework.
This version now gets rid of all that, and instead goes with an
"unconditionally run a catch-up sequence at kernel entry" approach - as was
suggested at LPC 2025 [3].
Another point made during LPC 2025 (sorry I didn't get your name!) was that when
kPTI is in use, the use of global pages is very limited, and thus a CR4 write may
not be warranted for a kernel TLB flush. That means the existing CR3 RMW used to
switch between kernel and user page tables can serve as the unconditional TLB
flush, meaning I could get rid of my CR4 dance.
In the same spirit, it turns out a CR3 RMW is a serializing instruction. From the
SDM, vol. 2, chapter 4.3, "Move to/from control registers":
```
MOV CR* instructions, except for MOV CR8, are serializing instructions.
```
That means I don't need to do anything extra on kernel entry to handle deferred
sync_core() IPIs sent from text_poke().
So long story short, the CR3 RMW that is executed for every user <-> kernel
transition when kPTI is enabled does everything I need to defer kernel TLB flush
and kernel text update IPIs.
From that, I've completely nuked the context_tracking deferral faff.
The added x86-specific code is now "just" about having a software signal
to figure out which CR3 a CPU is using - easier said than done, details in
the individual changelogs.
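The signal and its ordering against the CR3 write can be sketched roughly as
below. This is a userspace simulation, not the series' actual code:
kernel_cr3_loaded is taken from the changelog wording, everything else is an
invented name, and real code needs the atomics/barriers this sketch omits:

```c
/*
 * Userspace sketch of the software CR3 signal. Illustrative only;
 * the actual per-CPU state lives in arch/x86.
 */
#include <stdbool.h>

#define NR_CPUS 4

/* Per-CPU software signal: is this CPU running on the kernel CR3? */
static bool kernel_cr3_loaded[NR_CPUS];

/* Entry path, modelling SWITCH_TO_KERNEL_CR3. */
static void switch_to_kernel_cr3(int cpu)
{
	/*
	 * Set the signal BEFORE the CR3 write (the v8 change): a remote
	 * sender that still observes "user CR3" is then guaranteed that
	 * a serializing, TLB-flushing CR3 write is still to come on this
	 * CPU, so its deferred flush / sync_core() is covered.
	 */
	kernel_cr3_loaded[cpu] = true;
	/* ... the actual MOV to CR3 would go here ... */
}

/* Exit path, modelling SWITCH_TO_USER_CR3. */
static void switch_to_user_cr3(int cpu)
{
	kernel_cr3_loaded[cpu] = false;
	/* ... MOV to user CR3 ... */
}

/* Sender side: does a kernel TLB flush need an IPI to @cpu? */
static bool need_tlb_flush_ipi(int cpu)
{
	/*
	 * On the user CR3 the CPU can't touch kernel mappings, and the
	 * kPTI CR3 switch on its next kernel entry does the flushing and
	 * serialization for us - so no IPI is needed.
	 */
	return kernel_cr3_loaded[cpu];
}
```

The write-signal-before-CR3 ordering is what closes the race: a sender can
never see "user CR3" on a CPU whose flushing CR3 write is already behind it.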
Kernel entry vs execution of the deferred operation
===================================================
This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].
There is a non-zero amount of code executed upon kernel entry before the
deferred operation can itself be executed (before we start getting into
context_tracking.c proper), i.e.:
idtentry
idtentry_body
error_entry
SWITCH_TO_KERNEL_CR3
This danger zone used to be much wider in v7 and earlier (from kernel entry all
the way down to ct_kernel_enter_state()). The objtool instrumentation thus now
targets .entry.text rather than .noinstr as a whole.
Show me numbers
===============
Xeon E5-2699 system with SMToff, NOHZ_FULL, 26 isolated CPUs.
RHEL10 userspace.
Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:
$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
-R "stacktrace if cpu & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
-e "ipi_send_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
rteval --onlyload --loads-cpulist=$HK_CPUS \
--hackbench-runlowmem=True --duration=$DURATION
This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.
v6.19
o ~6000 IPIs received, so about 230 interfering IPIs per isolated CPU
o About one interfering IPI roughly every 1 minute 30 seconds
v6.19 + patches
o Zilch... With some caveats
I still get some TLB flush IPIs sent to seemingly still-in-userspace CPUs,
about one per ~3h for /some/ runs. I haven't seen any in the last cumulative
24h of testing...
pcpu_balance_work also sometimes shows up, and isn't covered by the deferral
faff. Again, sometimes it shows up, sometimes it doesn't and hasn't for a
while now.
Patches
=======
o Patches 1-4 are standalone objtool cleanups.
o Patches 5-6 add infrastructure for annotating static keys that may be used in
entry code (courtesy of Josh).
o Patch 7 adds ASM support for static keys.
o Patches 8-10 add the deferral mechanism.
Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v8
Acknowledgements
================
Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o Dave Hansen for patiently educating me about mm
o All of the folks who attended various (too many?) talks about this and
provided precious feedback.
Links
=====
[1]: https://lore.kernel.org/all/20210929151723.162004989@xxxxxxxxxxxxx/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://lpc.events/event/19/contributions/2219/
[4]: https://lpc.events/event/18/contributions/1889/
Revisions
=========
v7 -> v8
++++++++
o Rebased onto v6.19
o Fixed objtool --uaccess validation preventing --noinstr validation of
unwind hints
o Added more objtool --noinstr warning fixes
o Reduced objtool noinstr static key validation to just .entry.text
o Moved the kernel_cr3_loaded signal update to before writing to CR3
o Ditched context_tracking based deferral
o Ditched the (additional) unconditional TLB flush upon kernel entry
v6 -> v7
++++++++
o Rebased onto latest v6.18-rc5 (6fa9041b7177f)
o Collected Acks (Sean, Frederic)
o Fixed <asm/context_tracking_work.h> include (Shrikanth)
o Fixed ct_set_cpu_work() CT_RCU_WATCHING logic (Frederic)
o Wrote more verbose comments about NOINSTR static keys and calls (Petr)
o [NEW PATCH] Instrumented one more static key: cpu_bf_vm_clear
o [NEW PATCH] added ASM-accessible static key helpers to gate NO_HZ_FULL logic
in early entry code (Frederic)
v5 -> v6
++++++++
o Rebased onto v6.17
o Small conflict fixes with the cpu_buf_idle_clear / smp_text_poke() renames
o Added the TLB flush craziness
v4 -> v5
++++++++
o Rebased onto v6.15-rc3
o Collected Reviewed-by
o Annotated a few more static keys
o Added proper checking of noinstr sections that are in loadable code such as
KVM early entry (Sean Christopherson)
o Switched to checking for CT_RCU_WATCHING instead of CT_STATE_KERNEL or
CT_STATE_IDLE, which means deferral is now behaving sanely for IRQ/NMI
entry from idle (thanks to Frederic!)
o Ditched the vmap TLB flush deferral (for now)
RFCv3 -> v4
+++++++++++
o Rebased onto v6.13-rc6
o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)
o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups
o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ
RFCv2 -> RFCv3
++++++++++++++
o Rebased onto v6.12-rc6
o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral
RFCv1 -> RFCv2
++++++++++++++
o Rebased onto v6.5-rc1
o Updated the trace filter patches (Steven)
o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
existing .state field (Peter, Frederic)
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
rcutorture case for a low-size counter (Paul)
o Fixed flush_tlb_kernel_range_deferrable() definition
Josh Poimboeuf (1):
objtool: Add .entry.text validation for static branches
Valentin Schneider (9):
objtool: Make validate_call() recognize indirect calls to pv_ops[]
objtool: Flesh out warning related to pv_ops[] calls
objtool: Always pass a section to validate_unwind_hints()
x86/retpoline: Make warn_thunk_thunk .noinstr
sched/isolation: Mark housekeeping_overridden key as __ro_after_init
x86/jump_label: Add ASM support for static_branch_likely()
x86/mm/pti: Introduce a kernel/user CR3 software signal
context_tracking,x86: Defer kernel text patching IPIs when tracking
CR3 switches
x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs when tracking CR3
switches
arch/x86/Kconfig | 14 +++
arch/x86/entry/calling.h | 13 +++
arch/x86/entry/entry.S | 3 +-
arch/x86/entry/syscall_64.c | 4 +
arch/x86/include/asm/jump_label.h | 33 +++++++-
arch/x86/include/asm/text-patching.h | 5 ++
arch/x86/include/asm/tlbflush.h | 4 +
arch/x86/kernel/alternative.c | 34 ++++++--
arch/x86/kernel/cpu/bugs.c | 2 +-
arch/x86/kernel/kprobes/core.c | 4 +-
arch/x86/kernel/kprobes/opt.c | 4 +-
arch/x86/kernel/module.c | 2 +-
arch/x86/mm/pti.c | 36 +++++---
arch/x86/mm/tlb.c | 34 ++++++--
include/linux/jump_label.h | 11 ++-
include/linux/objtool.h | 16 ++++
kernel/sched/isolation.c | 2 +-
mm/vmalloc.c | 30 +++++--
tools/objtool/Documentation/objtool.txt | 12 +++
tools/objtool/check.c | 108 ++++++++++++++++++++----
tools/objtool/include/objtool/check.h | 2 +
tools/objtool/include/objtool/elf.h | 3 +-
tools/objtool/include/objtool/special.h | 1 +
tools/objtool/special.c | 15 +++-
24 files changed, 331 insertions(+), 61 deletions(-)
--
2.52.0