Re: [PATCH] cgroup: Wait for dying tasks to leave on rmdir
From: Tejun Heo
Date: Mon Mar 23 2026 - 15:55:56 EST
Hello,
On Mon, Mar 23, 2026 at 12:32:52PM +0100, Sebastian Andrzej Siewior wrote:
...
> I saw instances on PREEMPT_RT where the above cgroup_is_populated()
> reported true due to cgrp->nr_populated_csets == 1 and the following
> iterator returned NULL, but in the meantime do_cgroup_task_dead() saw no
> waiter and continued without a wake_up, and the subsequent schedule()
> hung.
Ah, right, the false->true transition is protected by cgroup_mutex but
true->false only by css_set_lock. It should check populated again with
css_set_lock held and then do prepare_to_wait().
> There is no serialisation between this wait/check and the later wake. An
> alternative would be to check and prepare_to_wait() under css_set_lock.
Yeap.
> > + finish_wait(&cgrp->dying_populated_waitq, &wait);
> > + mutex_lock(&cgroup_mutex);
> > + goto retry;
> > +}
>
> Then I added my RCU patch. This led to a problem already during boot up
> (didn't manage to get to the test suite).
Is that the patch to move cgroup_task_dead() to delayed_put_task_struct()? I
don't think we can delay the populated state update until the usage count
reaches zero. e.g. bpf_task_acquire() can be used by arbitrary bpf programs
and will pin the usage count indefinitely, delaying the populated state
update. Similar to delaying the event to the free path, you can construct a
deadlock scenario too.
> systemd-1 places modprobe-1044 in a cgroup, then destroys the cgroup.
> It hangs in cgroup_drain_dying() because nr_populated_csets is still 1.
> modprobe-1044 is still there in Z so the cgroup removal didn't get there
> yet. That irq_work was quicker than RCU in this case. This can be
> reproduced without RCU by
Isn't this the exact scenario? systemd is the one who should reap the zombie
and drop the usage count, but it's waiting for rmdir() to finish, which
can't finish due to the usage count that systemd hasn't reaped yet. We can't
interlock these two. They have to make progress independently.
> - irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
> + schedule_delayed_work(this_cpu_ptr(&cgrp_delayed_tasks_iwork), HZ);
>
> So there is always a one second delay. If I give up waiting after 10 secs
> then it boots eventually and there are no zombies around. The test_core
> suite seems to complete…
>
> Having the irq_work as-is, then the "cgroup_dead()" happens on the HZ
> tick. test_core then complains just with
> | not ok 7 test_cgcore_populated
The test is assuming that waitpid() success guarantees the cgroup !populated
event. While that held before all these changes, it wasn't intentional and
the test just picked up on an arbitrary ordering. I'll just remove that
particular test.
Thanks.
--
tejun