Re: [PATCH-next v6 1/6] cgroup/cpuset: Fix node inconsistencies between cpuset_update_tasks_nodemask() and cpuset_attach()

From: Waiman Long

Date: Fri Jun 05 2026 - 13:18:02 EST


On 6/5/26 3:48 AM, Ridong Chen wrote:

On 6/4/2026 11:02 PM, Waiman Long wrote:
Whenever memory node mask is changed, there are 4 places where the node
mask has to be updated or used.
1) task's node mask via cpuset_change_task_nodemask()
2) memory policy binding via mpol_rebind_mm()
3) if memory migration is enabled, migrate from old_mems_allowed to
the new node mask via cpuset_migrate_mm().
4) setting old_mems_allowed

These memory actions are done in cpuset_update_tasks_nodemask() and
cpuset_attach(). However there are inconsistencies in what node masks
are being used in these 2 functions.

In cpuset_update_tasks_nodemask(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): mems_allowed
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()

In cpuset_attach(),
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): effective_mems
- old_mems_allowed: effective_mems

These inconsistencies dates back to quite a long time ago and it is
hard to say what should be the correct values.

The guarantee_online_mems() function returns a node mask from current or
an ancestor cpuset that is a subset of node_states[N_MEMORY]. Nodes in
node_states[N_MEMORY] are all online, i.e. in node_states[N_ONLINE].
However, node in node_states[N_ONLINE] may not have memory. So
node_states[N_MEMORY] should be a subset of node_states[N_ONLINE].

The guarantee_online_mems() function should only be useful for v1 where
mems_allowed is the same as effective_mems. With v2, the memory nodes
in effective_mems should always be a subset of node_states[N_MEMORY].
The only time that may not be true is when a memory hot-unplug operation
is in progress and a memory node is removed from node_states[N_MEMORY]
but not yet reflected in effective_mems as cpuset_handle_hotplug()
has not yet been called from cpuset_track_online_nodes(). When
cpuset_handle_hotplug() is called later, the memory node setting
of the relevant cpusets and tasks will be updated. So replacing the
guarantee_online_mems() call by just using cs->effective_mems should
be fine.

The message should be updated.
Could you be more specific about what message are you referring to?

Let use the following setup for both of them and make them consistent.
- cpuset_change_task_nodemask(): guarantee_online_mems()
- mpol_rebind_mm(): effective_mems
- cpuset_migrate_mm(): guarantee_online_mems()
- old_mems_allowed: guarantee_online_mems()

So for v2, it is effectively all effective_mems. For v1, mpol_rebind_mm()
uses mems_allowed which may differ from what guarantee_online_mems()
returns.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
kernel/cgroup/cpuset.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 6bdb68689c24..8305b5830c3c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -489,7 +489,10 @@ static void guarantee_active_cpus(struct task_struct *tsk,
* Return in *pmask the portion of a cpusets's mems_allowed that
* are online, with memory. If none are online with memory, walk
* up the cpuset hierarchy until we find one that does have some
- * online mems. The top cpuset always has some mems online.
+ * online mems. The top cpuset always has some mems online. With v2,
+ * effective_mems should always contain online memory nodes except
+ * during the transition period where a memory node hotunplug operation
+ * is in progress.
*
* One way or another, we guarantee to return some non-empty subset
* of node_states[N_MEMORY].
@@ -498,6 +501,10 @@ static void guarantee_active_cpus(struct task_struct *tsk,
*/
static void guarantee_online_mems(struct cpuset *cs, nodemask_t *pmask)
{
+ if (cpuset_v2()) {
+ *pmask = cs->effective_mems;
+ return;
+ }
while (!nodes_and(*pmask, cs->effective_mems, node_states[N_MEMORY]))
cs = parent_cs(cs);
}
@@ -2616,6 +2623,13 @@ static void *cpuset_being_rebound;
* Iterate through each task of @cs updating its mems_allowed to the
* effective cpuset's. As this function is called with cpuset_mutex held,
* cpuset membership stays stable.
+ *
+ * - cpuset_change_task_nodemask(): guarantee_online_mems()
+ * - mpol_rebind_mm(): effective_mems
+ * - cpuset_migrate_mm(): guarantee_online_mems()
+ * - old_mems_allowed: guarantee_online_mems()
+ *
+ * For v2, guarantee_online_mems() should just return effective_mems.
*/
void cpuset_update_tasks_nodemask(struct cpuset *cs)
{
@@ -2624,7 +2638,6 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
struct task_struct *task;
cpuset_being_rebound = cs; /* causes mpol_dup() rebind */
-
guarantee_online_mems(cs, &newmems);
/*
@@ -2650,7 +2663,7 @@ void cpuset_update_tasks_nodemask(struct cpuset *cs)
migrate = is_memory_migrate(cs);
- mpol_rebind_mm(mm, &cs->mems_allowed);
+ mpol_rebind_mm(mm, &cs->effective_mems);
if (migrate)
cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
else
@@ -3148,17 +3161,18 @@ static void cpuset_attach(struct cgroup_taskset *tset)
/*
* In the default hierarchy, enabling cpuset in the child cgroups
- * will trigger a number of cpuset_attach() calls with no change
- * in effective cpus and mems. In that case, we can optimize out
- * by skipping the task iteration and update.
+ * will trigger a cpuset_attach() call with no change in effective cpus
+ * and mems. In that case, we can optimize out by skipping the task
+ * iteration and update.
*/
- if (cpuset_v2() && !cpus_updated && !mems_updated) {
+ if (cpuset_v2()) {
cpuset_attach_nodemask_to = cs->effective_mems;
- goto out;
+ if (!cpus_updated && !mems_updated)
+ goto out;
+ } else {
+ guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
}
- guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
-
Nit.

I prefer:

guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
if (cpuset_v2() && !cpus_updated && !mems_updated)
goto out;

This is fixed in patch 7. I may have to send a new version of the patch 7 integrated back into individual patches.

Cheers,
Longman

cgroup_taskset_for_each(task, css, tset)
cpuset_attach_task(cs, task);
@@ -3168,7 +3182,6 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* if there is no change in effective_mems and CS_MEMORY_MIGRATE is
* not set.
*/
- cpuset_attach_nodemask_to = cs->effective_mems;
if (!is_memory_migrate(cs) && !mems_updated)
goto out;
@@ -3176,7 +3189,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
struct mm_struct *mm = get_task_mm(leader);
if (mm) {
- mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+ mpol_rebind_mm(mm, &cs->effective_mems);
/*
* old_mems_allowed is the same with mems_allowed
Other than that, looks good to me.

Reviewed-by: Ridong Chen <ridong.chen@xxxxxxxxx>