Re: [PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts
From: Ilya Dryomov
Date: Mon Mar 16 2026 - 11:38:43 EST
On Thu, Mar 12, 2026 at 9:17 AM Ionut Nechita (Wind River)
<ionut.nechita@xxxxxxxxxxxxx> wrote:
>
> From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
>
> In containerized environments (e.g., Rook-Ceph CSI with
> forcecephkernelclient=true), the mount() syscall
> for kernel CephFS may be invoked from a pod's network namespace
> instead of the host namespace. This happens despite the CSI node
> plugin (csi-cephfsplugin) running with hostNetwork: true, due to
> race conditions during kubelet restart or pod scheduling.
Hi Ionut,
Can you elaborate on these race conditions? This sounds like a bug or
a misconfiguration in the userspace orchestration that this patch is
trying to work around in the kernel client.
>
> ceph_messenger_init() captures current->nsproxy->net_ns at mount
> time and uses it for all subsequent socket operations. When a pod
> NS is captured, all kernel ceph sockets (mon, mds, osd) are
> created in that namespace, which typically lacks routes to the
> Ceph monitors (e.g., fd04:: ClusterIP addresses).
> This causes permanent EADDRNOTAVAIL (-99) on every connection
> attempt at ip6_dst_lookup_flow(), with no possibility of recovery
> short of force-unmount and remount from the correct namespace.
What network provider (in the sense of [1]) are you using?
>
> Root cause confirmed via kprobe tracing on ip6_dst_lookup_flow:
> the net pointer passed to the routing lookup was the pod's
> net_ns (0xff367a0125dd5780) instead of init_net
> (0xffffffffbda76940). The pod NS had no route for fd04::/64
> (monitor ClusterIP range), while userspace python connect() from
> the same host succeeded because it ran in host NS.
>
> Fix this by always using init_net (the host network namespace)
> in ceph_messenger_init(). The kernel CephFS client inherently
> requires host-level network access to reach Ceph monitors, OSDs,
> and MDS daemons. Using the caller's namespace was inherited from
> generic socket patterns but is incorrect for a kernel filesystem
This behavior wasn't inherited but actually introduced as a feature in
commit [2] at someone's request. Prior to that change attempting to
mount a CephFS filesystem or map an RBD image from anywhere but init_net
produced an error, see commit [3].
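For context, the behavior introduced there boils down to the pattern
below. This is a simplified sketch, not verbatim net/ceph/messenger.c:

```c
/* Sketch: at mount time the messenger pins the mounting
 * process's network namespace ...
 */
msgr->net = get_net(current->nsproxy->net_ns);

/* ... and every subsequent mon/osd/mds socket is created in it: */
sock_create_kern(msgr->net, AF_INET6, SOCK_STREAM, IPPROTO_TCP, &sock);
```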
I'm going to challenge your "incorrect for a kernel filesystem" claim
because NFS, SMB/CIFS, AFS and likely other network filesystem clients
in the kernel behave the same way: mounts outside of init_net are
allowed, with the mounting process's network namespace captured and
used when creating sockets.
> client that must survive beyond the lifetime of the mounting
> process and its network namespace.
Network namespaces are reference counted and CephFS grabs a reference
for the namespace it's mounted in. The namespace should persist for as
long as the CephFS mount persists even if the mounting process goes
away: another process should be able to enter that namespace, etc. The
namespace can of course get wedged by the orchestration tearing down
the relevant virtual network devices prematurely, but it's a separate
issue.
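In other words, the reference taken at mount time is only dropped on
messenger teardown. Roughly (simplified sketch, not verbatim source):

```c
/* Messenger teardown (sketch): the netns reference held since
 * mount is dropped here, so the namespace outlives the mounting
 * process for as long as the mount exists.
 */
put_net(msgr->net);
```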
>
> A warning is logged when a mount from a non-init namespace is
> detected, to aid debugging.
>
> Observed in production (kernel 6.12.0-1-rt-amd64, Ceph Reef
> 18.2.5, IPv6-only cluster, ceph-csi v3.13.1):
> - Fresh boot of compute-0, ceph-csi mounts CephFS via kernel
> - All monitor connections fail with EADDRNOTAVAIL immediately
> - kprobe confirms wrong net_ns in ip6_dst_lookup_flow
> - Workaround: umount -l + systemctl restart kubelet
> - After restart: mount captures host NS, works immediately
[1] https://github.com/rook/rook/blob/master/Documentation/CRDs/Cluster/network-providers.md
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=757856d2b9568a701df9ea6a4be68effbb9d6f44
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eea553c21fbfa486978c82525ee8256239d4f921
Thanks,
Ilya