I'm looking for input on whether the following observation about network namespace expiration is expected behavior or should be reported as a bug.
- When a process opens /proc/<pid>/net/dev, it can prevent or delay the expiration of that other process's network namespace until it closes the file. It doesn't need to be a member of that namespace to do so.
This seems like very surprising behavior. It allows a local user with access to the appropriate proc files to delay or prevent the destruction of veth interfaces belonging to network namespaces. A buggy monitoring tool that opens files in /proc without closing them might cause this just as well.
Reproducer
(on Debian Buster - Linux 5.4.0-0.bpo.4-amd64)
1) Create a network namespace:
$ unshare -n
$ echo $BASHPID
18807
2) In a second terminal (host namespace), create a veth pair and move one end into the network namespace created above:
$ ip link add dev veth18807 type veth peer name eth18807
$ ip link set eth18807 netns 18807
$ ip addr | grep veth
14: veth18807@if13: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
3) Start tail -f /proc/18807/net/dev in a separate terminal
$ tail -f /proc/18807/net/dev
...
tail: /proc/18807/net/dev: file truncated
...leave hanging...
4) In the shell from 1), exit the namespace, then list the interfaces:
$ ip addr | grep veth
14: veth18807@if13: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
The previously created veth is still there. However, there are no obvious traces of the network namespace created in step 1): lsns doesn't show it, no process has it in its /proc/<pid>/ns directory, etc.
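One place that might still show a trace is the descriptor table of whichever process holds the file open. The sketch below is a minimal, hypothetical helper (find_fd_holders is a name made up for illustration) that walks <proc_root>/<pid>/fd and prints every descriptor whose link target matches a glob. It assumes the link target still reads as a /proc/<pid>/net/... path after the namespace's owner has exited, which I have not verified.

```shell
# Hypothetical helper: print "PID FD TARGET" for each open descriptor whose
# symlink target matches a shell glob. Run as root to see other users' processes.
# Assumption: holders of net/dev show up with a /proc/<pid>/net/... target.
find_fd_holders() {
    proc_root=$1
    pattern=$2
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Descriptors can vanish or be unreadable; skip those quietly.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            $pattern)
                pid=${fd#"$proc_root"/}   # strip the proc root prefix
                pid=${pid%%/*}            # keep only the PID component
                echo "$pid ${fd##*/} $target"
                ;;
        esac
    done
}
```

For example, find_fd_holders /proc '/proc/*/net/*' would list candidate holders on the live system.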
As soon as the tail -f is interrupted, the interface vanishes immediately from ip addr. It doesn't need to be tail; simply opening the file with open() is enough.
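For illustration, the same thing can be done without tail, using only the shell's own open(): a redirection on exec opens the file and keeps the descriptor until it is explicitly closed. This is a minimal sketch; hold_open and release are hypothetical helper names, and descriptor 3 is an arbitrary choice.

```shell
# Hypothetical helpers: keep a file open on descriptor 3 without reading it.
hold_open() {
    exec 3< "$1"    # open() the path read-only; the fd stays open in this shell
}

release() {
    exec 3<&-       # close descriptor 3 again
}
```

In step 3 of the reproducer, hold_open /proc/18807/net/dev would take the place of tail -f, and release would correspond to interrupting it.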
I suspect this may make sense technically, as opening ../net/dev might take a reference to the network namespace. It's just that it's very surprising that it's possible to keep the namespace alive that way.
As a workaround, explicitly deleting the previously created veth with ip link del works. However, I do wonder whether the namespace itself is still kept around in that case.
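A minimal sketch of that workaround, written so it can be run unconditionally after the namespace's process has exited. delete_stale_veth is a hypothetical helper name; it needs root (CAP_NET_ADMIN), and it only removes the interface, so it says nothing about whether the namespace itself is gone.

```shell
# Hypothetical helper: delete a leftover veth if it is still present,
# and do nothing if the kernel has already removed it.
delete_stale_veth() {
    if ip link show dev "$1" >/dev/null 2>&1; then
        ip link del dev "$1"
    fi
}
```

With the names from the reproducer this would be delete_stale_veth veth18807.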
Firejail Example
This investigation was triggered by firejail messages complaining about "already in use" IP addresses. After going down the rabbit hole, it eventually turned out that, with static IPs, the error can be provoked as follows:
1) Start a jail
$ /usr/local/bin/firejail --net=docker0 --ip=172.30.0.30 --noprofile
Parent pid 20890, child pid 20891
Interface MAC IP Mask Status
lo 127.0.0.1 255.0.0.0 UP
eth0 e2:87:2e:06:07:5b 172.30.0.30 255.255.0.0 UP
Default gateway 172.30.0.1
Child process initialized in 1491.22 ms
2) In a separate terminal, open the child's net/dev:
$ tail -F /proc/20891/net/dev
3) Exit the firejail from 1) and restart it with the same arguments:
$ /usr/local/bin/firejail --net=docker0 --ip=172.30.0.30 --noprofile
Error: IP address 172.30.0.30 is already in use
The message above appears because the veth continues to respond to firejail's ARP checks for the IP.
Docker
I cannot reproduce the Firejail scenario above with Docker: the interface vanishes after the container is stopped. Maybe Docker actually implements the ip link del workaround (?).
References
On linuxcontainers, similar observations were reported and attributed to kernel bugs:
That’s usually an indication that the network namespace of your container never expired, which normally indicates an issue with the kernel. When the last process using a network namespace goes away, the namespace is destroyed, which causes all virtual interfaces to be destroyed and physical interfaces to be moved back to the host network namespace.