What code prevents mount namespace loops in a more complex case involving mount propagation?

Question

Background information

Assume /tmp is a separate mount with private propagation. Private is the kernel default, but not the systemd default (systemd makes mounts shared). If necessary, you can ensure this by running the commands inside unshare -m, and/or by using mount --bind /tmp /tmp.

# findmnt -n -o propagation /tmp
private

Notice that the following commands return an error:

# touch /tmp/a
# mount --bind /proc/self/ns/mnt /tmp/a
mount: /tmp/a: wrong fs type, bad option, bad superblock on /proc/self/ns/mnt, missing codepage or helper program, or other error.

This is because the kernel code (see extracts below) prevents a simple mount namespace loop. The code comments explain why this is not allowed. The lifetime of a mount namespace is tracked by a simple reference count. If mount namespaces A and B each held a reference to the other, both would always have at least one reference, so neither would ever be freed. The allocated memory would be lost until the entire system was rebooted.
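To make the reference counting argument concrete, here is a minimal user-space sketch. It is not kernel code: `struct model_ns`, `model_ns_put()` and `model_bind_ns_file()` are hypothetical names invented for illustration, and each namespace is assumed to hold at most one bind mount of another namespace's file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model (not kernel code): a namespace object that is
 * freed only when its reference count drops to zero. */
struct model_ns {
    int refcount;           /* processes plus bind mounts pinning this namespace */
    struct model_ns *pins;  /* namespace pinned by a bind mount inside this one, or NULL */
    bool freed;
};

/* Drop one reference; on the last one, "free" the namespace. */
static void model_ns_put(struct model_ns *ns)
{
    assert(ns != NULL && ns->refcount > 0);
    if (--ns->refcount > 0)
        return;
    ns->freed = true;
    /* Tearing down a namespace unmounts everything inside it, which
     * releases the reference held by its bind mount. */
    if (ns->pins != NULL)
        model_ns_put(ns->pins);
}

/* Model "mount --bind /proc/<pid>/ns/mnt <file>" performed inside `into`,
 * where <pid> lives in `target`. */
static void model_bind_ns_file(struct model_ns *into, struct model_ns *target)
{
    assert(into != NULL && target != NULL && into->pins == NULL);
    into->pins = target;
    target->refcount++;
}
```

In this model, if A pins B and B pins A, then dropping the process references of both leaves each refcount at 1 and neither is freed. If only A pins B, dropping both process references frees A, which in turn frees B.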

For comparison, the kernel allows the following, which is not a loop:

# unshare -m
# echo $$
8456
# kill -STOP $$
[1]+  Stopped                 unshare -m

# touch /tmp/a
# mount --bind /proc/8456/ns/mnt /tmp/a
#
# umount /tmp/a  # now clean up
# kill %1; echo
#

Question

Where does the kernel code distinguish between the following two cases?

If I try to create a loop using mount propagation, it fails:

# mount --make-shared /tmp
# unshare -m --propagation shared
# echo $$
8456
# kill -STOP $$
[1]+  Stopped                 unshare -m

# mount --bind /proc/8456/ns/mnt /tmp/a
mount: /tmp/a: wrong fs type, bad option, bad superblock on /proc/8456/ns/mnt, missing codepage or helper program, or other error.

But if I remove the mount propagation, no loop is created, and it succeeds:

# unshare -m --propagation private
# echo $$
8456
# kill -STOP $$
[1]+  Stopped                 unshare -m

# mount --bind /proc/8456/ns/mnt /tmp/a
# 
# umount /tmp/a  # cleanup

Kernel code which handles the simpler case

https://elixir.bootlin.com/linux/v4.18/source/fs/namespace.c

static bool mnt_ns_loop(struct dentry *dentry)
{
    /* Could bind mounting the mount namespace inode cause a
     * mount namespace loop?
     */
    struct mnt_namespace *mnt_ns;
    if (!is_mnt_ns_file(dentry))
        return false;

    mnt_ns = to_mnt_ns(get_proc_ns(dentry->d_inode));
    return current->nsproxy->mnt_ns->seq >= mnt_ns->seq;
}

...

    err = -EINVAL;
    if (mnt_ns_loop(old_path.dentry))
        goto out;

...

 * Assign a sequence number so we can detect when we attempt to bind
 * mount a reference to an older mount namespace into the current
 * mount namespace, preventing reference counting loops.  A 64bit
 * number incrementing at 10Ghz will take 12,427 years to wrap which
 * is effectively never, so we can ignore the possibility.
 */
static atomic64_t mnt_ns_seq = ATOMIC64_INIT(1);

static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns)

It distinguishes in the last line of `mnt_ns_loop()`: `return current->nsproxy->mnt_ns->seq >= mnt_ns->seq;`. Newer `mnt_ns` objects have a `seq` greater than older objects. Or is it something else you're unclear about? — , Oct 07 '18 at 08:58
@mosvy I don't understand. Why would that make a difference between my two cases with and without mount propagation? The relative order of the sequence numbers should be the same in both cases. — sourcejedi, Oct 07 '18 at 17:04
Tried on a 5.4.0 kernel but I cannot reproduce the first example ("...which is not a loop"). Just the error: "wrong fs type, bad option, bad superblock on..." — TheDiveO, Aug 06 '21 at 11:47
@TheDiveO I think I was being evilly obscure, and these results assume /tmp has the kernel default propagation mode - private - not the systemd default propagation mode - shared. Sorry. I don't want to confuse people using majority distros, which includes myself, but I'm not immediately sure how the question should be fixed to make it clear. — sourcejedi, Aug 06 '21 at 16:19

sourcejedi · Accepted Answer · 2019-05-03T07:48:24.717

See commit 4ce5d2b1a8fd, vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces

propagate_one() calls copy_tree() without CL_COPY_MNT_NS_FILE. In this case, if the tree root is a mount of an NS file, copy_tree() fails with the error EINVAL. The term "an NS file" means one of the files /proc/*/ns/mnt.

Reading further, I notice that if the tree root is not an NS file, but one of the child mounts is, that child mount is excluded from propagation (in the same way as an unbindable mount is).
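The two behaviours described above (EINVAL for an NS file at the tree root, silent skipping for an NS file below the root) can be sketched in user space like this. It is a simplified model, not the kernel's copy_tree(): `struct model_mount`, `model_copy_tree()` and `MODEL_CL_COPY_MNT_NS_FILE` are hypothetical names, and the function only counts the mounts that would be copied.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's CL_COPY_MNT_NS_FILE flag. */
#define MODEL_CL_COPY_MNT_NS_FILE 0x1

/* Hypothetical model of a mount tree node. */
struct model_mount {
    bool is_ns_file;                    /* mount of a /proc/<pid>/ns/mnt file? */
    size_t nchildren;
    const struct model_mount *children;
};

/* Returns the number of mounts that would be copied, or -EINVAL when
 * the root itself is an NS file and the flag is not set. */
static int model_copy_tree(const struct model_mount *root, int flags)
{
    assert(root != NULL);
    if (!(flags & MODEL_CL_COPY_MNT_NS_FILE) && root->is_ns_file)
        return -EINVAL;

    int copied = 1;                     /* the root mount itself */
    for (size_t i = 0; i < root->nchildren; i++) {
        int n = model_copy_tree(&root->children[i], flags);
        if (n == -EINVAL)
            continue;                   /* NS file below the root: skipped silently */
        copied += n;
    }
    return copied;
}
```

For a root with one NS file child and one ordinary child, the model copies 2 mounts without the flag and 3 with it. Calling it without the flag on the NS file mount itself returns -EINVAL, which corresponds to the failing mount command in the question.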

Example of an NS file being silently skipped during propagation

# mount --make-shared /tmp
# cd /tmp
# mkdir private_mnt
# mount --bind private_mnt private_mnt
# mount --make-private private_mnt
# touch private_mnt/child_ns
# unshare --mount=private_mnt/child_ns --propagation=shared ls -l /proc/self/ns/mnt
lrwxrwxrwx. 1 root root 0 Oct  7 18:25 /proc/self/ns/mnt -> 'mnt:[4026532807]'
# findmnt | grep /tmp
├─/tmp                                tmpfs                             tmpfs           ...
│ ├─/tmp/private_mnt                  tmpfs[/private_mnt]               tmpfs           ...
│ │ └─/tmp/private_mnt/child_ns       nsfs[mnt:[4026532807]]            nsfs            ...

Let's create a normal mount for comparison

# mkdir private_mnt/child_mnt
# mount --bind private_mnt/child_mnt private_mnt/child_mnt

Now try to propagate everything. (Create a recursive bind mount of private_mnt inside /tmp. /tmp is a shared mount).

# mkdir shared_mnt
# mount --rbind private_mnt shared_mnt
# findmnt | grep /tmp/shared_mnt
│ └─/tmp/shared_mnt                   tmpfs[/private_mnt]               tmpfs           ...
│   ├─/tmp/shared_mnt/child_ns        nsfs[mnt:[4026532809]]            nsfs            ...
│   └─/tmp/shared_mnt/child_mnt       tmpfs[/private_mnt/child_mnt]     tmpfs           ...
# nsenter --mount=/tmp/private_mnt/child_ns findmnt|grep /tmp/shared_mnt
│ └─/tmp/shared_mnt                   tmpfs[/private_mnt]               tmpfs           ...
│   └─/tmp/shared_mnt/child_mnt       tmpfs[/private_mnt/child_mnt]     tmpfs           ...

Kernel code

Here are extracts from the v4.18 version of the code, which was added in the commit linked above.

https://elixir.bootlin.com/linux/v4.18/source/fs/pnode.c#L226

static int propagate_one(struct mount *m)
{
...
    /* Notice when we are propagating across user namespaces */
    if (m->mnt_ns->user_ns != user_ns)
        type |= CL_UNPRIVILEGED;
    child = copy_tree(last_source, last_source->mnt.mnt_root, type);
    if (IS_ERR(child))
        return PTR_ERR(child);

https://elixir.bootlin.com/linux/v4.18/source/fs/namespace.c#L1790

struct mount *copy_tree(struct mount *mnt, struct dentry *dentry,
                    int flag)
{
    struct mount *res, *p, *q, *r, *parent;

    if (!(flag & CL_COPY_UNBINDABLE) && IS_MNT_UNBINDABLE(mnt))
        return ERR_PTR(-EINVAL);

    if (!(flag & CL_COPY_MNT_NS_FILE) && is_mnt_ns_file(dentry))
        return ERR_PTR(-EINVAL);

Thank you very much for this example as it finally allows me to construct the proper unit test for an open source Go module that opens bind-mounted mount namespaces (without process) for file access from the current process. — TheDiveO, Aug 06 '21 at 11:57