Re: [PATCH v4 0/8] s390: Improve this_cpu operations

From: David Laight

Date: Fri May 22 2026 - 14:33:18 EST

On Fri, 22 May 2026 16:12:49 +0200
Heiko Carstens <hca@xxxxxxxxxxxxx> wrote:

> v4:
>
> - Drop alternatives approach and extract percpu base register number for
> mviy at compile time [David Laight]

Definitely looks better.
I'm sure I managed to understand it once :-)

-- David

>
> - Fix logic for percpu code section detection, as well as
> interruption/exception/nmi path [Sashiko [5]]
>
> [5] https://sashiko.dev/#/patchset/20260520092243.264847-1-hca%40linux.ibm.com
>
> v3:
>
> - Fix various typos [Juergen Christ]
>
> - Add missing kprobe detection / handling [Sashiko [3]]
> [FWIW, this also made me aware that the current general s390 kprobes
> code seems to be racy against concurrent removal of a kprobe while a
> probe hits on a different CPU. But that is a different story.]
>
> - Fix various minor findings [Sashiko [3]]
>
> - All of this might be dropped / exchanged in future in favor of the percpu
> page table approach proposed by Yang Shi [4].
>
> [3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@xxxxxxxxxxxxx
> [4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@xxxxxxxxxxxxxxxxxxxxxx/
>
> v2:
>
> - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
> warnings
>
> - Add missing __packed attribute to insn structure [Sashiko [2]]
>
> - Fix inverted if condition [Sashiko [2]]
>
> - Add missing user_mode() check [Sashiko [2]]
>
> - Move the percpu_entry() call in front of the irqentry_enter() call in
> all entry paths, so that potential this_cpu() operations cannot overwrite
> the not-yet-saved percpu code section indicator [Sashiko [2]]
>
> [2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
>
> v1:
>
> This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
>
> With the intended removal of PREEMPT_NONE, this_cpu operations based on
> atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
> become more expensive: the preempt_disable()/preempt_enable() pairs are
> no longer optimized away at compile time.
>
> In particular the conditional call to preempt_schedule_notrace() after
> preempt_enable() adds additional code and register pressure.
>
> To avoid this, Peter suggested an in-kernel rseq approach. While this would
> certainly work, this series tries to come up with a solution which uses
> fewer instructions and doesn't require repeating instruction sequences.
>
> The idea is that this_cpu operations based on atomic instructions are
> guarded with mviy instructions:
>
> - The first mviy instruction writes the number of the register which
> contains the percpu variable address to lowcore. This also indicates that
> a percpu code section is being executed.
>
> - The first instruction following the mviy instruction must be the ag
> instruction which adds the percpu offset to the percpu address register.
>
> - Afterwards the atomic percpu operation follows.
>
> - Then a second mviy instruction writes a zero to lowcore, which indicates
> the end of the percpu code section.
>
> - In case of an interrupt/exception/nmi the register number which was
> written to lowcore is copied to the exception frame (pt_regs), and a zero
> is written to lowcore.
>
> - On return to the previous context it is checked whether a percpu code
> section was being executed (saved register number not zero), and whether
> the process was migrated to a different cpu. If the percpu offset was
> already added to the percpu address register (instruction address does
> _not_ point to the ag instruction), the content of the percpu address
> register is adjusted so that it points to the percpu variable of the new
> cpu.
>
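For anyone trying to follow the return-path fixup, the steps listed above can be sketched as a small userspace model in C. This is only a sketch under stated assumptions, not the code from the series: struct model_regs, model_entry(), model_exit(), run_scenario(), AG_ADDR and the register choice are all made-up names for illustration, the "ag" check is reduced to an address comparison, and re-arming the lowcore indicator on return is my assumption (the description above does not spell it out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_GPRS 16
#define AG_ADDR 0x2000UL /* hypothetical address of the ag instruction */
#define AG_LEN  6        /* model constant: length of the ag instruction */

/* Hypothetical stand-in for the saved exception frame (pt_regs). */
struct model_regs {
	uint64_t gprs[NR_GPRS];
	uint64_t insn_addr;  /* address of the interrupted instruction */
	uint8_t percpu_reg;  /* saved register number, 0 = no percpu section */
};

/* Hypothetical stand-in for the lowcore field written by mviy. */
static uint8_t model_lowcore_percpu_reg;

/* Entry path: copy the indicator to the frame, then clear it in lowcore. */
static void model_entry(struct model_regs *regs)
{
	regs->percpu_reg = model_lowcore_percpu_reg;
	model_lowcore_percpu_reg = 0;
}

/* Return path: fix up the percpu address register after a migration. */
static void model_exit(struct model_regs *regs, uint64_t old_off, uint64_t new_off)
{
	uint8_t reg = regs->percpu_reg;

	if (!reg)
		return; /* no percpu code section was interrupted */
	assert(reg < NR_GPRS);
	/* Adjust only if migrated and the ag instruction has already run. */
	if (old_off != new_off && regs->insn_addr != AG_ADDR)
		regs->gprs[reg] += new_off - old_off;
	/* Assumption: the section is still active, so re-arm the indicator. */
	model_lowcore_percpu_reg = reg;
}

/* Interrupt a percpu section before or after ag; return the final register. */
uint64_t run_scenario(bool after_ag, uint64_t old_off, uint64_t new_off)
{
	struct model_regs regs = { 0 };
	const uint8_t reg = 5; /* arbitrary non-zero register for the model */

	regs.gprs[reg] = 0x1000;        /* percpu variable address, no offset yet */
	model_lowcore_percpu_reg = reg; /* effect of the first mviy */
	if (after_ag) {
		regs.gprs[reg] += old_off;  /* ag already added the old cpu's offset */
		regs.insn_addr = AG_ADDR + AG_LEN;
	} else {
		regs.insn_addr = AG_ADDR;   /* interrupted right at the ag */
	}
	model_entry(&regs);
	model_exit(&regs, old_off, new_off);
	return regs.gprs[reg];
}
```

With this model, run_scenario(true, 0x10000, 0x20000) moves the register from the old cpu's copy to the new cpu's copy, while run_scenario(false, 0x10000, 0x20000) leaves it untouched because the ag instruction has not run yet and will add the new cpu's offset itself.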
> All of this seems to work, but of course it could still be broken in case
> I missed some detail.
>
> In total this series results in a kernel text size reduction of ~106kb. The
> number of preempt_schedule_notrace() call sites is reduced from 7089 to
> 1577.
>
> Note: this comes without any extensive performance analysis; however, all
> microbenchmarks confirmed that the new code is at least as fast as the
> old code, as expected.
>
> [1] 20260223163843.GR1282955@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> Heiko Carstens (8):
> s390/percpu: Infrastructure for more efficient this_cpu operations
> s390/percpu: Add missing do { } while (0) constructs
> s390/percpu: Use new percpu code section for arch_this_cpu_add()
> s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
> s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
> s390/percpu: Provide arch_this_cpu_read() implementation
> s390/percpu: Provide arch_this_cpu_write() implementation
> s390/percpu: Remove one and two byte this_cpu operation implementation
>
> arch/s390/include/asm/entry-percpu.h | 78 ++++++++
> arch/s390/include/asm/lowcore.h | 3 +-
> arch/s390/include/asm/percpu.h | 257 +++++++++++++++++++++------
> arch/s390/include/asm/ptrace.h | 2 +
> arch/s390/kernel/irq.c | 24 ++-
> arch/s390/kernel/nmi.c | 5 +
> arch/s390/kernel/traps.c | 5 +
> 7 files changed, 315 insertions(+), 59 deletions(-)
> create mode 100644 arch/s390/include/asm/entry-percpu.h
>