Re: [PATCH v7 3/3] mm/mempolicy: Support memory hotplug in weighted interleave

From: Honggyu Kim
Date: Wed Apr 16 2025 - 00:04:59 EST

Next message: Yen-Chi Huang: "[PATCH v4] platform/x86: portwell-ec: Add GPIO and WDT driver for Portwell EC"
Previous message: Paul E. McKenney: "Re: [PATCH] tools/drgn: Add script to display page state for a given PID and VADDR"
Next in thread: Honggyu Kim: "Re: [PATCH v7 3/3] mm/mempolicy: Support memory hotplug in weighted interleave"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Jonathan,

Thanks for reviewing our patches.

I have a few comments and the rest will be addressed by Rakie.

On 4/16/2025 1:00 AM, Jonathan Cameron wrote:

On Tue, 8 Apr 2025 16:32:42 +0900
Rakie Kim <rakie.kim@xxxxxx> wrote:

The weighted interleave policy distributes page allocations across multiple
NUMA nodes based on their performance weight, thereby improving memory
bandwidth utilization. The weight values for each node are configured
through sysfs.

Previously, sysfs entries for configuring weighted interleave were created
for all possible nodes (N_POSSIBLE) at initialization, including nodes that
might not have memory. However, not all nodes in N_POSSIBLE are usable at
runtime, as some may remain memoryless or offline.
This led to sysfs entries being created for unusable nodes, causing
potential misconfiguration issues.

To address this issue, this patch modifies the sysfs creation logic to:
1) Limit sysfs entries to nodes that are online and have memory, avoiding
the creation of sysfs entries for nodes that cannot be used.
2) Support memory hotplug by dynamically adding and removing sysfs entries
based on whether a node transitions into or out of the N_MEMORY state.

Additionally, the patch ensures that sysfs attributes are properly managed
when nodes go offline, preventing stale or redundant entries from persisting
in the system.

By making these changes, the weighted interleave policy now manages its
sysfs entries more efficiently, ensuring that only relevant nodes are
considered for interleaving, and dynamically adapting to memory hotplug
events.

Signed-off-by: Rakie Kim <rakie.kim@xxxxxx>
Signed-off-by: Honggyu Kim <honggyu.kim@xxxxxx>
Signed-off-by: Yunjeong Mun <yunjeong.mun@xxxxxx>
Reviewed-by: Oscar Salvador <osalvador@xxxxxxx>
---
mm/mempolicy.c | 106 ++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 83 insertions(+), 23 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 988575f29c53..9aa884107f4c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -113,6 +113,7 @@
#include <asm/tlbflush.h>
#include <asm/tlb.h>
#include <linux/uaccess.h>
+#include <linux/memory.h>
#include "internal.h"
@@ -3421,6 +3422,7 @@ struct iw_node_attr {
struct sysfs_wi_group {
struct kobject wi_kobj;
+ struct mutex kobj_lock;
struct iw_node_attr *nattrs[];
};
@@ -3470,13 +3472,24 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
static void sysfs_wi_node_delete(int nid)
{
- if (!wi_group->nattrs[nid])
+ struct iw_node_attr *attr;
+
+ if (nid < 0 || nid >= nr_node_ids)
+ return;
+
+ mutex_lock(&wi_group->kobj_lock);
+ attr = wi_group->nattrs[nid];
+ if (!attr) {
+ mutex_unlock(&wi_group->kobj_lock);
return;
+ }
+
+ wi_group->nattrs[nid] = NULL;
+ mutex_unlock(&wi_group->kobj_lock);
- sysfs_remove_file(&wi_group->wi_kobj,
- &wi_group->nattrs[nid]->kobj_attr.attr);
- kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);
- kfree(wi_group->nattrs[nid]);
+ sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr);
+ kfree(attr->kobj_attr.attr.name);
+ kfree(attr);

Here you go through a careful dance to not touch wi_group->nattrs[nid]
except under the lock, but later you are happy to do so in the
error handling paths. Maybe better to do similar to here and
set it to NULL under the lock but do the freeing on a copy taken
under that lock.
.

}
static void sysfs_wi_release(struct kobject *wi_kobj)
@@ -3495,35 +3508,77 @@ static const struct kobj_type wi_ktype = {
static int sysfs_wi_node_add(int nid)
{
- struct iw_node_attr *node_attr;
+ int ret = 0;

Trivial but isn't ret always set when it is used? So no need to initialize
here.

If we don't initialize it, then this kind of trivial fixup might be needed later
so I think there is no reason not to initialize it.
https://lore.kernel.org/mm-commits/20240705010631.46743C4AF07@xxxxxxxxxxxxxxx

char *name;
+ struct iw_node_attr *new_attr = NULL;

This is also always set before use so I'm not seeing a
reason to initialize it to NULL.

Ditto.

- node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
- if (!node_attr)
+ if (nid < 0 || nid >= nr_node_ids) {
+ pr_err("Invalid node id: %d\n", nid);
+ return -EINVAL;
+ }
+
+ new_attr = kzalloc(sizeof(struct iw_node_attr), GFP_KERNEL);

I'd prefer sizeof(*new_attr) because I'm lazy and don't like checking
types for allocation sizes :) Local style seems to be a bit
of a mix though.

Agreed.

+ if (!new_attr)
return -ENOMEM;
name = kasprintf(GFP_KERNEL, "node%d", nid);
if (!name) {
- kfree(node_attr);
+ kfree(new_attr);
return -ENOMEM;
}
- sysfs_attr_init(&node_attr->kobj_attr.attr);
- node_attr->kobj_attr.attr.name = name;
- node_attr->kobj_attr.attr.mode = 0644;
- node_attr->kobj_attr.show = node_show;
- node_attr->kobj_attr.store = node_store;
- node_attr->nid = nid;
+ mutex_lock(&wi_group->kobj_lock);
+ if (wi_group->nattrs[nid]) {
+ mutex_unlock(&wi_group->kobj_lock);
+ pr_info("Node [%d] already exists\n", nid);
+ kfree(new_attr);
+ kfree(name);
+ return 0;
+ }
+ wi_group->nattrs[nid] = new_attr;

This set can be done after all the "wi_group->nattrs[nid]" related set is done.

- if (sysfs_create_file(&wi_group->wi_kobj, &node_attr->kobj_attr.attr)) {
- kfree(node_attr->kobj_attr.attr.name);
- kfree(node_attr);
- pr_err("failed to add attribute to weighted_interleave\n");
- return -ENOMEM;
+ sysfs_attr_init(&wi_group->nattrs[nid]->kobj_attr.attr);

I'd have been tempted to use the new_attr pointer but perhaps
this brings some documentation like advantages.

+ wi_group->nattrs[nid]->kobj_attr.attr.name = name;
+ wi_group->nattrs[nid]->kobj_attr.attr.mode = 0644;
+ wi_group->nattrs[nid]->kobj_attr.show = node_show;
+ wi_group->nattrs[nid]->kobj_attr.store = node_store;
+ wi_group->nattrs[nid]->nid = nid;

As Jonathan mentioned, all the "wi_group->nattrs[nid]" here is better to be
"new_attr" for simplicity.

Thanks,
Honggyu

+
+ ret = sysfs_create_file(&wi_group->wi_kobj,
+ &wi_group->nattrs[nid]->kobj_attr.attr);
+ if (ret) {
+ kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);

See comment above on the rather different handling here to in
sysfs_wi_node_delete() where you set it to NULL first, release the lock and tidy up.
new_attrand name are still set so you could even combine the handling with the
if (wi_group->nattrs[nid]) above via appropriate gotos.

+ kfree(wi_group->nattrs[nid]);
+ wi_group->nattrs[nid] = NULL;
+ pr_err("Failed to add attribute to weighted_interleave: %d\n", ret);
}
+ mutex_unlock(&wi_group->kobj_lock);
- wi_group->nattrs[nid] = node_attr;
- return 0;
+ return ret;
+}

Next message: Yen-Chi Huang: "[PATCH v4] platform/x86: portwell-ec: Add GPIO and WDT driver for Portwell EC"
Previous message: Paul E. McKenney: "Re: [PATCH] tools/drgn: Add script to display page state for a given PID and VADDR"
Next in thread: Honggyu Kim: "Re: [PATCH v7 3/3] mm/mempolicy: Support memory hotplug in weighted interleave"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]