Re: [PATCH v3 3/4] mm/page_owner: add NUMA node filter with nodelist support

From: zhen.ni

Date: Thu Apr 30 2026 - 02:06:26 EST

在 2026/4/30 13:16, SeongJae Park 写道:

On Thu, 30 Apr 2026 11:56:33 +0800 "zhen.ni" <zhen.ni@xxxxxxxxxxxx> wrote:

在 2026/4/29 22:56, SeongJae Park 写道:

On Wed, 29 Apr 2026 17:03:56 +0800 "zhen.ni" <zhen.ni@xxxxxxxxxxxx> wrote:

在 2026/4/29 09:28, SeongJae Park 写道:

On Tue, 28 Apr 2026 15:11:11 +0800 Zhen Ni <zhen.ni@xxxxxxxxxxxx> wrote:

[...]

@@ -685,6 +685,7 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
struct page_ext *page_ext;
struct page_owner *page_owner;
depot_stack_handle_t handle;
+ nodemask_t mask;
if (!static_branch_unlikely(&page_owner_inited))
return -EINVAL;
@@ -698,6 +699,8 @@ read_page_owner(struct file *file, char __user *buf, size_t count, loff_t *ppos)
while (!pfn_valid(pfn) && (pfn & (MAX_ORDER_NR_PAGES - 1)) != 0)
pfn++;
+ mask = owner_filter.nid_mask;
+

READ_ONCE() was used for owner_filter.print_mode. Should nid_mask also read
using READ_ONCE()?

The reason is that `owner_filter.nid_mask` is a nodemask_t, which is a
128-byte structure. READ_ONCE() only supports types up to 8 bytes and
will trigger a compile-time assertion failure for larger structures.

This was actually an issue in v2 - the AI review tool (sashiko.dev) and
Andrew both caught the compilation error with READ_ONCE/WRITE_ONCE on
nodemask_t, so v3 removed them.

Thank you for kindly sharing the context. Now I understand why READ_ONCE()
cannot be used. But, is plain load/store safe enough for nodemask_t?
Shouldn't it still be protected against races?

Concurrency Safety:
I considered spinlock and RCU, but decided against them:

- Spinlock: Adds overhead on every read, overkill for a debug facility
- RCU: Requires dynamic allocation of 128-byte nodemask_t, too complex
- READ_ONCE/WRITE_ONCE: Not possible, exceeds 8-byte limit

Plain load/store is safe here because:
1. page_owner is debug code with low-frequency filter changes
2. Worst case of torn read is temporary inconsistency in debug output
3. Similar debugfs interfaces use the same approach

The overhead of locking doesn't justify the benefit for this debug use case.

Do you think this is acceptable, or would you prefer I add locking?

Thank you for kindly explaining this. Unless others have different opinions, I
think this is ok. But, I think this would be good to be clarly documented, on
the code or the user documentation.

I'll add a comment in the code explaining the concurrency consideration

[...]

+ /* Support: "-1" to clear, or nodelist format like "0", "0,2", "0-3" */
+ if (kstrtoint(kbuf, 10, &val) == 0 && val == -1)
+ nodes_clear(mask);
+ else if (nodelist_parse(kbuf, mask)) {
+ ret = -EINVAL;
+ goto out_free;
+ }

Doesn't empty string input to nodelist_parse() clears the mask? Can't it be
reused?

Yes, empty input (echo > nid) works because nodelist_parse() handles it
correctly. However, nodelist_parse() - which is implemented via
bitmap_parselist() - cannot handle "-1" as it's not a valid range format
and would return an error. The explicit "-1" check is necessary to
support `echo "-1" > nid` without returning an error.

So the "-1" check handles a case that nodelist_parse() cannot handle.

Thank you for kindly explaining the reason. But, do we really need to support
"-1" input? Couldn't we just redefine the interface?

I chose "-1" to clearly differentiate from valid NUMA node IDs (0, 1, 2,
3...).Since node IDs are non-negative integers, "-1" naturally means
"invalid" or "no filter", which is an intuitive convention in Linux
(e.g., pid -1, signal -1).

Do you have a better suggestion for how to represent "clear filter"？

Seems my suggestion was too implicit. I'm suggesting using empty string
instead of "-1". I think it is also clarly differentiated from valid NUMA node
IDs?

I understand your point about simplifying the code by removing the "-1"
special case. I'll remove it and use only empty string for clearing the
filter.

Thank you for the suggestions.

Thanks,
SJ

[...]

Thanks,
Zhen