Re: [PATCH 3/4] mm: prevent KSM from completely breaking VMA merging
From: Lorenzo Stoakes
Date: Mon May 19 2025 - 14:52:38 EST
On Mon, May 19, 2025 at 08:00:29PM +0200, David Hildenbrand wrote:
> On 19.05.25 10:51, Lorenzo Stoakes wrote:
> > If a user wishes to enable KSM mergeability for an entire process and all
> > fork/exec'd processes that come after it, they use the prctl()
> > PR_SET_MEMORY_MERGE operation.
> >
> > This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
> > (in order to indicate they are KSM mergeable), as well as setting this flag
> > for all existing VMAs.
> >
> > However it also entirely and completely breaks VMA merging for the process
> > and all forked (and fork/exec'd) processes.
> >
> > This is because when a new mapping is proposed, the flags specified will
> > never have VM_MERGEABLE set. However all adjacent VMAs will already have
> > VM_MERGEABLE set, rendering VMAs unmergeable by default.
> >
> > To work around this, we try to set the VM_MERGEABLE flag prior to
> > attempting a merge. In the case of brk() this can always be done.
> >
> > However on mmap() things are more complicated - while KSM is not supported
> > for file-backed mappings, it is supported for MAP_PRIVATE file-backed
> > mappings.
> >
> > And these mappings may have deprecated .mmap() callbacks specified which
> > could, in theory, adjust flags and thus KSM merge eligiblity.
> >
> > So we check to determine whether this at all possible. If not, we set
> > VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
> > previous behaviour.
> >
> > When .mmap_prepare() is more widely used, we can remove this precaution.
> >
> > While this doesn't quite cover all cases, it covers a great many (all
> > anonymous memory, for instance), meaning we should already see a
> > significant improvement in VMA mergeability.
>
> We should add a Fixes: tag.
>
> CCing stable is likely not a good idea at this point (and might be rather
> hairy).
We should probably underline to Andrew not to add one :>) but sure can add.
A backport would be a total pain yes.
>
> [...]
>
> > /**
> > - * ksm_add_vma - Mark vma as mergeable if compatible
> > + * ksm_vma_flags - Update VMA flags to mark as mergeable if compatible
> > *
> > - * @vma: Pointer to vma
> > + * @mm: Proposed VMA's mm_struct
> > + * @file: Proposed VMA's file-backed mapping, if any.
> > + * @vm_flags: Proposed VMA"s flags.
> > + *
> > + * Returns: @vm_flags possibly updated to mark mergeable.
> > */
> > -void ksm_add_vma(struct vm_area_struct *vma)
> > +vm_flags_t ksm_vma_flags(const struct mm_struct *mm, const struct file *file,
> > + vm_flags_t vm_flags)
> > {
> > - struct mm_struct *mm = vma->vm_mm;
> > + vm_flags_t ret = vm_flags;
> > - if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
> > - __ksm_add_vma(vma);
> > + if (test_bit(MMF_VM_MERGE_ANY, &mm->flags) &&
> > + __ksm_should_add_vma(file, vm_flags))
> > + ret |= VM_MERGEABLE;
> > +
> > + return ret;
> > }
>
>
> No need for ret without harming readability.
>
> if ()
> vm_flags |= VM_MERGEABLE
> return vm_flags;
Ack this was just me following the 'don't mutate arguments' rule-of-thumb
that obviously we violate constantly int he kernel anyway and probably
never really matters... :>)
But yeah ret is kind of gross here, will change.
>
> [...]
>
> > +/*
> > + * Are we guaranteed no driver can change state such as to preclude KSM merging?
> > + * If so, let's set the KSM mergeable flag early so we don't break VMA merging.
> > + *
> > + * This is applicable when PR_SET_MEMORY_MERGE has been set on the mm_struct via
> > + * prctl() causing newly mapped VMAs to have the KSM mergeable VMA flag set.
> > + *
> > + * If this is not the case, then we set the flag after considering mergeability,
> > + * which will prevent mergeability as, when PR_SET_MEMORY_MERGE is set, a new
> > + * VMA will not have the KSM mergeability VMA flag set, but all other VMAs will,
> > + * preventing any merge.
>
> Hmmm, so an ordinary MAP_PRIVATE of any file (executable etc.) will get
> VM_MERGEABLE set but not be able to merge?
>
> Probably these are not often expected to be merged ...
>
> Preventing merging should really only happen because of VMA flags that are
> getting set: VM_PFNMAP, VM_MIXEDMAP, VM_DONTEXPAND, VM_IO.
>
>
> I am not 100% sure why we bail out on special mappings: all we have to do is
> reliably identify anon pages, and we should be able to do that.
But they map e.g. kernel memory (at least for VM_PFNMAP, purely and by
implication really VM_IO), it wouldn't make sense for KSM to be asked to
try to merge these right?
And of course no underlying struct page to pin, no reference counting
either, so I think you'd end up in trouble potentially here wouldn't you?
And how would the CoW work?
>
> GUP does currently refuses any VM_PFNMAP | VM_IO, and KSM uses GUP, which
> might need a tweak then (maybe the solution could be to ... not use GUP but
> a folio_walk).
How could GUP work when there's no struct page to grab?
>
> So, assuming we could remove the VM_PFNMAP | VM_IO | VM_DONTEXPAND |
> VM_MIXEDMAP constraint from vma_ksm_compatible(), could we simplify?
Well I question removing this constraint for above reasons.
At any rate, even if we _could_ this feels like a bigger change that we
should come later.
But hmm this has made me think :)
So if something is backed by struct file *filp, and the driver says 'make
this PFN mapped' then of course it won't erroneously merge anyway.
If adjacent VMAs are not file-backed, then the merge will fail anyway.
So actually we're probably perfectly safe from a _merging_ point of view.
Buuut we are not safe from a setting VM_MERGEABLE point of view :)
So I think things have to stay the way they are, sensibly.
Fact that .mmap_prepare() will fix this in future makes it more reasonable
I think.
>
> That is: the other ones must not really be updated during mmap(), right?
> (in particular: VM_SHARED | VM_MAYSHARE | VM_HUGETLB | VM_DROPPABLE)
>
> Have to double-check VM_SAO and VM_SPARC_ADI.
_Probably_ :)
It really is mostly VM_SPECIAL.
>
> > + */
> > +static bool can_set_ksm_flags_early(struct mmap_state *map)
> > +{
> > + struct file *file = map->file;
> > +
> > + /* Anonymous mappings have no driver which can change them. */
> > + if (!file)
> > + return true;
> > +
> > + /* shmem is safe. */
> > + if (shmem_file(file))
> > + return true;
> > +
> > + /*
> > + * If .mmap_prepare() is specified, then the driver will have already
> > + * manipulated state prior to updating KSM flags.
> > + */
> > + if (file->f_op->mmap_prepare)
> > + return true;
> > +
> > + return false;
> > +}
>
> So, long-term (mmap_prepare) this function will essentially go away?
Yes, .mmap_prepare() solves this totally. Again it's super useful to have
the ability to get the driver to tell us 'we want flags X, Y, Z' ahead of
time. .mmap() must die :)
>
> Nothing jumped at me, this definitely improves the situation.
Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>