Opening /proc/<pid>/net/dev prevents network namespace from expiring, is this expected?

Question

I'm looking for input on whether the following observation about network namespace expiration is expected, or whether it should be reported as a bug.

When a process opens /proc/<pid>/net/dev, it can prevent or delay the expiration of the other process's network namespace until it closes the file. The opening process doesn't need to be part of that namespace to do so.

This seems like very surprising behavior. It allows a local user with access to the appropriate proc files to delay or prevent the destruction of the veth interfaces of network namespaces. A buggy monitoring tool that opens files in /proc without closing them might cause this just as well.

Reproducer

(on Debian Buster - Linux 5.4.0-0.bpo.4-amd64)

1) Create a network namespace:

$ unshare -n
$ echo $BASHPID
18807

2) Create a veth pair and move one end into the network namespace created above:

$ ip link add dev veth18807 type veth peer name eth18807
$ ip link set eth18807 netns 18807
$ ip addr | grep veth
14: veth18807@if13: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000

3) Start tail -f /proc/18807/net/dev in a separate terminal:

$ tail -f /proc/18807/net/dev
...
tail: /proc/18807/net/dev: file truncated
...leave hanging...

4) In the shell from 1), exit the namespace and list the interfaces:

$ ip addr | grep veth
14: veth18807@if13: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000

The previously created veth is still there. However, there are no obvious traces of the network namespace created in step 1). lsns doesn't show it, no process has it in its /ns directory, etc.
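
One way to look for such a holder is to scan every process's fd directory for descriptors whose link target lies under /proc/<pid>/net/. The following is a minimal Python sketch, not a verified diagnostic: find_proc_net_holders and its proc_root parameter are hypothetical names, and the sketch assumes that the fd symlink of an open /proc/<pid>/net/ file still shows such a path after the last process in the namespace has exited, which was not checked here. Reading other users' fd directories generally requires root.

```python
import os
import re

# Matches fd link targets such as /proc/18807/net/dev or
# /proc/18807/task/18807/net/dev. Only the prefix is checked, so a
# trailing marker appended by the kernel would not prevent a match.
NET_PATH = re.compile(r"^/proc/\d+/(?:task/\d+/)?net/")


def find_proc_net_holders(proc_root="/proc"):
    """Return (pid, fd, target) tuples for fds that point into /proc/<pid>/net/.

    Assumption: the symlink in /proc/<pid>/fd/ exposes the original path.
    """
    holders = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor was closed in the meantime
            if NET_PATH.match(target):
                holders.append((int(entry), int(fd), target))
    return holders
```

Calling find_proc_net_holders() as root while the tail from step 3) is running should, under the assumption above, list the tail process together with the path it holds open.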

As soon as the tail -f is interrupted, the interface vanishes immediately from ip addr. It doesn't need to be tail; just opening the file with open() is enough.
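
A minimal Python sketch of the "just open()" case, for anyone who wants to reproduce this without tail. The helper names hold_open and release are hypothetical; the sketch only opens the file and keeps the descriptor, without ever reading from it.

```python
import os


def hold_open(path):
    """Open path read-only and return the raw descriptor; nothing is read."""
    return os.open(path, os.O_RDONLY)


def release(fd):
    """Close the descriptor again.

    In the reproducer above, closing the file is the point at which the
    veth disappeared from ip addr.
    """
    os.close(fd)
```

For example, fd = hold_open("/proc/18807/net/dev") in an interactive Python session, followed by exiting the namespace shell from step 1) and later release(fd), should correspond to steps 3) and 4) of the reproducer.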

I suspect this may make sense technically, as opening ../net/dev might take a reference to the network namespace. It's just that it's very surprising that the namespace can be kept alive that way.

As a workaround, explicitly deleting the previously created veth with ip link del works. However, I do wonder whether the namespace is still kept around in that case.

Firejail Example

This investigation was triggered by firejail messages complaining about "already in use" IP addresses. After going down the rabbit hole, it eventually turned out that, with static IPs, the error can be provoked as follows:

1) Start a jail

$ /usr/local/bin/firejail --net=docker0 --ip=172.30.0.30 --noprofile
Parent pid 20890, child pid 20891

Interface        MAC                IP               Mask             Status
lo                                  127.0.0.1        255.0.0.0        UP
eth0             e2:87:2e:06:07:5b  172.30.0.30      255.255.0.0      UP
Default gateway 172.30.0.1

Child process initialized in 1491.22 ms

2) In a separate terminal, open net/dev of the child:

$ tail -F /proc/20891/net/dev

3) Exit the firejail above and restart it with the same arguments:

$ /usr/local/bin/firejail --net=docker0 --ip=172.30.0.30 --noprofile
Error: IP address 172.30.0.30 is already in use

The above message appears because the veth continues to respond to firejail's ARP checks for the IP.

Docker

I cannot reproduce the above Firejail scenario with Docker: the interface vanishes after the container is stopped. Maybe Docker actually implements the ip link del workaround (?).

References

On the linuxcontainers forum, similar observations were reported and attributed to kernel bugs:

That’s usually an indication that the network namespace of your container never expired, which normally indicates an issue with the kernel. When the last process using a network namespace goes away, the namespace is destroyed, which causes all virtual interfaces to be destroyed and physical interfaces to be moved back to the host network namespace.

https://discuss.linuxcontainers.org/t/serverside-veth-not-clean-shutdown-on-container-reboot-or-shutdown/4379/4

https://discuss.linuxcontainers.org/t/vethxxxxx-interfaces-are-not-removed-when-lxc-container-is-stopped/4816/2

Is "a local user with access to the appropriate `proc` files" ever an unprivileged user who's not the owner of the process/namespace in question? — Joseph Sible-Reinstate Monica, Mar 29 '20 at 20:01
`unshare -n` is executed as root. Opening/tailing the `/proc/<pid>/net/dev` file can be done as an unprivileged user. The file is o+r: `-r--r--r-- 1 root root 0 Mar 29 22:11 /proc/3022/net/dev`. Though even for a privileged user this behavior is still surprising. — Arne Welzel, Mar 29 '20 at 20:11
You should report that as a bug. They should take a reference to the network namespace when reading the `/proc/<pid>/net/` files, not when opening them. The problem is not only with `/proc/<pid>/net/dev` but with any file in that directory. This could hardly be intended, as it's an obvious race -- in the time between opening and reading the file, the info can become stale. — , Mar 30 '20 at 06:16
But FWIW, `unshare -Un` can be executed by any user -- the `/proc/sys/kernel/unprivileged_userns_clone` is something specific to Debian. — , Mar 30 '20 at 06:16
Thanks for checking in. I have not yet; I'll do so in the coming days. Does bugzilla.kernel.org sound reasonable? — Arne Welzel, Apr 18 '20 at 16:23
