Re: (sashiko review) [RFC PATCH v2] mm/damon/core: fix damon_call() vs kdamond_fn() exit race deadlock
From: SeongJae Park
Date: Thu Mar 26 2026 - 22:30:34 EST
On Thu, 26 Mar 2026 19:15:45 -0700 SeongJae Park <sj@xxxxxxxxxx> wrote:
> Forwarding the Sashiko review so that it can be discussed via email.
>
> # review url: https://sashiko.dev/#/patchset/20260327004952.58266-1-sj@xxxxxxxxxx
> # start of sashiko.dev inline review
> commit 36b78ee39968830aca3d375219a2cfce73d0d1d0
> Author: SeongJae Park <sj@xxxxxxxxxx>
>
> mm/damon/core: fix damon_call() vs kdamond_fn() exit race deadlock
>
> This commit fixes a race condition between damon_call and the
> kdamond_fn exit path that could result in a deadlock. It introduces a
> call_controls_obsolete flag to prevent damon_call from waiting
> indefinitely for a terminating thread.
>
> Link: https://lore.kernel.org/20260325141956.87144-1-sj@xxxxxxxxxx
>
> > diff --git a/mm/damon/core.c b/mm/damon/core.c
> > --- a/mm/damon/core.c
> > +++ b/mm/damon/core.c
>
> [ ... ]
>
> > @@ -3062,6 +3042,9 @@ static int kdamond_fn(void *data)
> > damon_destroy_targets(ctx);
> >
> > kfree(ctx->regions_score_histogram);
> > + mutex_lock(&ctx->call_controls_lock);
> > + ctx->call_controls_obsolete = true;
> > + mutex_unlock(&ctx->call_controls_lock);
> > kdamond_call(ctx, true);
> > damos_walk_cancel(ctx);
>
> Does damos_walk suffer from the same exit race regression that this patch
> fixes for damon_call?
>
> In kdamond_fn, the thread termination sequence executes damos_walk_cancel
> and only later sets ctx->kdamond = NULL under the kdamond_lock.
>
> damos_walk appears to use the same pattern as the old damon_call where it
> registers ctx->walk_control, checks damon_is_running (which verifies
> ctx->kdamond != NULL), and then unconditionally blocks on wait_for_completion.
>
> If a user triggers a DAMOS walk in the exact window after damos_walk_cancel
> has finished but before ctx->kdamond = NULL is executed, wouldn't damos_walk
> successfully register its control and pass the damon_is_running check?
>
> If so, it would then block indefinitely on wait_for_completion because the
> cancellation phase has already passed and the thread is exiting.
>
> Because damos_walk is invoked via state_store with the global
> damon_sysfs_lock held, could this infinite wait block the entire DAMON sysfs
> interface for all users on the system?
Yes, and I'm working on it separately, as I also said in my reply [1] to the
previous version.
[1] https://lore.kernel.org/20260326135146.90670-1-sj@xxxxxxxxxx
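For illustration only, the window described in the review can be replayed with
a small userspace model. This is a toy sketch, not the kernel code: struct
model_ctx, the model_*() helpers and the walk_obsolete flag are all
hypothetical names, the flag merely mirrors what call_controls_obsolete does
for damon_call(), and the locking that the real code needs is left out because
the model runs single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of the ordering; not the kernel's actual code. */
struct model_ctx {
	bool kdamond_alive;   /* stands in for ctx->kdamond != NULL */
	bool walk_registered; /* stands in for ctx->walk_control != NULL */
	bool walk_obsolete;   /* hypothetical flag, like call_controls_obsolete */
};

/* Exit path, first half: cancel whatever walk request is registered now. */
static void model_exit_cancel(struct model_ctx *ctx, bool use_flag)
{
	if (use_flag)
		ctx->walk_obsolete = true;
	ctx->walk_registered = false;	/* completes any pending waiter */
}

/* Exit path, second half: only now does the context look stopped. */
static void model_exit_finish(struct model_ctx *ctx)
{
	ctx->kdamond_alive = false;
}

/* Request path.  Returns true if the caller would go on to wait. */
static bool model_walk_request(struct model_ctx *ctx, bool use_flag)
{
	if (use_flag && ctx->walk_obsolete)
		return false;		/* refused: thread is terminating */
	ctx->walk_registered = true;
	if (!ctx->kdamond_alive) {
		ctx->walk_registered = false;
		return false;		/* refused: thread already gone */
	}
	return true;
}

/*
 * Replay the problematic interleaving: the request arrives after the
 * cancel step but before the context is marked stopped.  Returns true
 * if the requester ends up waiting with nobody left to wake it.
 */
bool model_race_hangs(bool use_flag)
{
	struct model_ctx ctx = { .kdamond_alive = true };
	bool waiting;

	model_exit_cancel(&ctx, use_flag);
	waiting = model_walk_request(&ctx, use_flag);
	model_exit_finish(&ctx);

	/* A waiter can only be stuck if its request is still registered. */
	assert(!waiting || ctx.walk_registered);
	return waiting && ctx.walk_registered && !ctx.kdamond_alive;
}
```

In this model, model_race_hangs(false) replays the window without the flag and
ends with a waiter that nobody will wake up, while model_race_hangs(true)
refuses the late request instead.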
Thanks,
SJ
[...]