I have mounted an NFS share on the machine. I can see it with the following command:
$ df -aTh | grep nfs
… unrelated volumes …
10.200.30.100:/export/pvc-70..1d nfs4 976M 2.0M 907M 1% /var/lib/kubelet/pods/72..1d/volumes/kubernetes.io~nfs/pvc-70..1d
[I shortened the paths with ellipses for brevity.]
I repeatedly write to a file from the machine itself (doing this from a container yields the same result):
$ while true
do echo test >> /var/lib/kubelet/pods/72..1d/volumes/kubernetes.io~nfs/pvc-70..1d/data.txt
sleep 5
done
This works, but after a while I get this error message:
-bash: /var/lib/kubelet/pods/72..1d/volumes/kubernetes.io~nfs/pvc-70..1d/data.txt: Invalid argument
This goes on forever, or at least as long as my patience lasts.
Is there any reason NFS should behave like this? Why does it suddenly stop allowing writes?
I tried giving the file 777 permissions, but that didn't change anything.
Even with a newly created file, the same problem appears after about 38 minutes: the file is no longer writable from the machine. Then, sometimes, if I read the file, I can write to it again for a while.
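The read-then-write workaround looks roughly like this (a sketch using a hypothetical local path standing in for the real NFS-mounted file):

```shell
# /tmp stands in for the NFS mount here; on the real system the path is
# the data.txt file under the kubernetes.io~nfs volume shown above.
f=/tmp/nfs-repro-data.txt
: > "$f"              # start from an empty file
echo test >> "$f"     # the append that eventually fails with EINVAL on NFS
cat "$f" > /dev/null  # reading the file sometimes makes it writable again
echo test >> "$f"     # the retried append then succeeds for a while
```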
I captured the following kernel log while the error was occurring:
[816270.396574] NFS: permission(0:231/8193), mask=0x81, res=0
[816270.403198] NFS: permission(0:231/8193), mask=0x3, res=0
[816270.406521] NFS: atomic_open(0:231/8193), data2.txt
[816270.412065] --> nfs_put_client({2})
[816270.415211] --> nfs4_setup_sequence clp ffff8fbd383c2800 session ffff8fbcf6fa1800 sr_slot 4294967295
[816270.419180] --> nfs41_setup_sequence
[816270.420853] --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=3
[816270.424797] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
[816270.428181] <-- nfs41_setup_sequence slotid=0 seqid=71351
[816270.430539] <-- nfs4_setup_sequence status=0
[816270.433691] encode_sequence: sessionid=1942:1545315805:1942:0 seqid=71351 slotid=0 max_slotid=0 cache_this=1
[816270.439233] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=3
[816270.442573] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
[816270.444845] nfs4_free_slot: slotid 1 highest_used_slotid 0
[816270.447379] nfs41_sequence_process: Error 0 free the slot
[816270.451224] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Interestingly, when the write succeeds the client uses a plain open rather than atomic_open.
I thought I could debug this by setting authorisation to NONE instead of the default. How do I do that?
UPDATE:
I finally managed to reproduce the problem reliably! I turned on logging for the file operations only, and the log below shows the transition between the period when I can write to the file without problems and the point where the operation starts failing with EINVAL.
...
[3533332.934037] NFS: open file(/data13.txt)
[3533332.937896] NFS: flush(/data13.txt)
[3533332.941986] NFS: fsync file(/data13.txt) datasync 0
[3533332.947038] NFS: write(/data13.txt, 33@0)
[3533332.952519] NFS: flush(/data13.txt)
[3533332.962836] NFS: fsync file(/data13.txt) datasync 0
[3533332.965829] NFS: release(/data13.txt)
[3533336.975155] docker0: port 2(veth295ffd2) entered blocking state
[3533336.977344] docker0: port 2(veth295ffd2) entered disabled state
[3533336.984250] device veth295ffd2 entered promiscuous mode
[3533336.995705] IPv6: ADDRCONF(NETDEV_UP): veth295ffd2: link is not ready
[3533337.061611] nfsd: initializing export module (net: ffff8fbceb052580).
[3533337.263506] eth0: renamed from veth9c5fe34
[3533337.274301] IPv6: ADDRCONF(NETDEV_CHANGE): veth295ffd2: link becomes ready
[3533337.277259] docker0: port 2(veth295ffd2) entered blocking state
[3533337.279440] docker0: port 2(veth295ffd2) entered forwarding state
[3533337.972794] NFS: open file(/data13.txt)
[3533337.982390] NFS: flush(/data13.txt)
[3533337.984978] NFS: fsync file(/data13.txt) datasync 0
[3533337.991418] NFS: write(/data13.txt, 33@0)
[3533337.998084] NFS: flush(/data13.txt)
[3533338.017990] NFS: fsync file(/data13.txt) datasync 0
[3533338.020636] NFS: release(/data13.txt)
[3533341.385814] NFSD: laundromat service - starting
[3533341.387690] NFSD: laundromat_main - sleeping for 90 seconds
[3533342.869641] NFS: open file(/data.txt)
[3533342.894059] NFS: write(/data.txt, 1@0)
[3533342.896545] NFS: flush(/data.txt)
[3533342.906735] NFS: fsync file(/data.txt) datasync 0
[3533342.910853] NFS: release(/data.txt)
[3533342.914387] NFS: open file(/data.txt)
[3533342.917169] NFS: write(/data.txt, 29@0)
[3533342.920237] NFS: flush(/data.txt)
[3533342.927772] NFS: fsync file(/data.txt) datasync 0
[3533342.931496] NFS: release(/data.txt)
[3533343.002610] NFS: read(/data.txt, 30@30374)
[3533343.233929] docker0: port 2(veth295ffd2) entered disabled state
[3533343.239144] veth9c5fe34: renamed from eth0
[3533343.257428] nfsd_inet6addr_event: removed fe80:0000:0000:0000:f026:2cff:feb1:510b
[3533343.260065] docker0: port 2(veth295ffd2) entered disabled state
[3533343.270280] device veth295ffd2 left promiscuous mode
[3533343.272992] docker0: port 2(veth295ffd2) entered disabled state
[3533343.330925] nfsd: shutting down export module (net: ffff8fbceb052580).
[3533343.337814] nfsd: export shutdown complete (net: ffff8fbceb052580).
In the above log I can write to the file without problems right up until the final messages about the export module shutting down.
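For reference, per-operation NFS client logging like the above is typically toggled with rpcdebug from nfs-utils. A minimal sketch (requires root; I'm assuming rpcdebug is installed on the node):

```shell
rpcdebug -m nfs -s file   # enable only the 'file' debug flag for the NFS client
dmesg --follow &          # watch kernel messages while reproducing the problem
# ... reproduce the failing writes here ...
rpcdebug -m nfs -c file   # clear the flag again when done
```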