NFS stopped working after network interface card upgrade

Question

Scroll down for latest updates.

I have an infrastructure which contains an NFS server that is hosting homes for users. The server is running Ubuntu server, and has 10G Fiber ethernet (Myri-10G dual protocol nic). It has been running fine for about two years. No changes were made to the server during this network card transition, and the server has always had 10G fiber.

Overview of infrastructure:

Server: (10.131.39.114) Ubuntu 16.04.4, Myri-10G Dual-Protocol NIC, firmware 1.4.57, nfs-kernel-server 1:1.2.8-9ubuntu12.3, linux kernel 4.4.0-109-generic
Switch: Force 10 S2410, layer 2 only, 10G fiber interfaces only
Client: Linux Mint 18.2, Myri-10G Dual-Protocol NIC, firmware 1.4.57, running autofs, linux kernel 4.8.0-53-generic (All clients identical, reminder, they were on copper ethernet using the Intel 82579LM Gigabit Network Connection)

The client workstations are Dell workstation class machines and were using the built in 1G ethernet (Intel 82579LM). We are working on large data, and were gifted more Myri-10G dual protocol nics.

Half of our workstations were upgraded with the new nics, and were connected via fiber to the S2410 switch. It all seemed to work when rebooted. We turned off the Intel and configured the Myricom, with the same IP address as the copper nic (and we turned the copper nic off). Everything looked good, we can ping, download files, etc, HOWEVER, when a client logs in, it hangs. After a short investigation, we realized that the NFS server is not connecting.

NOTE: We are using VLANS. In the very beginning I thought this might be a vlan routing issue, so we put the client and servers on the SAME VLAN. We experienced the same issues.

Observations/troubleshooting:

 lshw -C network
 *-network
       description: Ethernet interface
       product: Myri-10G Dual-Protocol NIC
       vendor: MYRICOM Inc.
       physical id: 0
       bus info: pci@0000:22:00.0
       logical name: enp34s0
       version: 00
       serial: 00:60:dd:44:96:a8
       size: 10Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: msi pm pciexpress msix vpd bus_master cap_list rom ethernet physical fibre
       configuration: autonegotiation=off broadcast=yes driver=myri10ge driverversion=1.5.3-1.534 duplex=full firmware=1.4.57 -- 2013/10/23 13:58:51 m latency=0 link=yes multicast=yes port=fibre speed=10Gbit/s
       resources: irq:62 memory:fa000000-faffffff memory:fbd00000-fbdfffff memory:fbe00000-fbe7ffff
  *-network
       description: Ethernet interface
       physical id: 1
       logical name: enp34s0.731
       serial: 00:60:dd:44:96:a8
       size: 10Gbit/s
       capabilities: ethernet physical fibre
       configuration: autonegotiation=off broadcast=yes driver=802.1Q VLAN Support driverversion=1.8 duplex=full firmware=N/A ip=10.131.31.181 link=yes multicast=yes port=fibre speed=10Gbit/s


rpcinfo -p 10.131.39.114
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100000    3   tcp    111  portmapper
    100000    2   tcp    111  portmapper
    100000    4   udp    111  portmapper
    100000    3   udp    111  portmapper
    100000    2   udp    111  portmapper
    100011    1   udp    787  rquotad
    100011    2   udp    787  rquotad
    100011    1   tcp    787  rquotad
    100011    2   tcp    787  rquotad
    100005    1   udp  40712  mountd
    100005    1   tcp  45016  mountd
    100005    2   udp  44618  mountd
    100005    2   tcp  49309  mountd
    100005    3   udp  43643  mountd
    100005    3   tcp  53119  mountd
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100227    2   tcp   2049
    100227    3   tcp   2049
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100227    2   udp   2049
    100227    3   udp   2049
    100021    1   udp  51511  nlockmgr
    100021    3   udp  51511  nlockmgr
    100021    4   udp  51511  nlockmgr
    100021    1   tcp  43334  nlockmgr
    100021    3   tcp  43334  nlockmgr
    100021    4   tcp  43334  nlockmgr

rpcinfo -u 10.131.39.114 mount
program 100005 version 1 ready and waiting
program 100005 version 2 ready and waiting
program 100005 version 3 ready and waiting

rpcinfo -u 10.131.39.114 portmap
program 100000 version 2 ready and waiting
program 100000 version 3 ready and waiting
program 100000 version 4 ready and waiting

rpcinfo -u 10.131.39.114 nfs
program 100003 version 2 ready and waiting
program 100003 version 3 ready and waiting
program 100003 version 4 ready and waiting

However, this fails:

showmount -e 10.131.39.114
rpc mount export: RPC: Timed out

Side note, on a working client (on copper), this is what you would normally see:

showmount -e 10.131.39.114
Export list for 10.131.39.114:
/mnt/homes      10.131.84.0/26,10.131.31.187,10.131.31.186,10.131.31.185,10.131.31.184,10.131.31.183,10.131.31.182,10.131.31.181,10.131.31.180
/mnt/clones 10.131.31.0/24,10.131.39.0/24,10.131.84.0/26

(yes, I know they are on different lans, but it has been working for years now).

Side note: We turned off network manager, and /etc/network/interfaces contains:

auto enp34s0.731
iface enp34s0.731 inet static
    vlan-raw-device enp34s0
    address 10.131.31.181
    netmask 255.255.255.0
    gateway 10.131.31.1
    dns-nameservers 10.131.31.53,10.35.32.15

Perhaps this information is of assistance:

On the client that has 10G, if I create a dir to mount a different exported dir from the server, like /mnt/clones (which we turn on for cloning), and I mount it by hand using NFSv4, it seems to work, but then you can not ls, or cd to the mounted dir. A df works, but you can not stat any files in the directory. I have seen this issue before, but I can not recall why.

Note that the clients are by default using nfs4 (e.g. from a working copper ethernet client with auto.home enabled):

10.131.39.114:/mnt/homes/usera on /home/usera type nfs4 (rw,nosuid,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.131.31.185,local_lock=none,addr=10.131.39.114)

In sum:

NFS seems to no longer work from the client after an upgrade to 10gig... I know I have doen this in the past, with these exact same network cards (in fact, these exact same network cards are the cards I used in 2012 on another cluster, we were given them back to use in these workstations, which means this nfs not working makes even less sense).
If you by hand mount an nfs share, it will fail.
If you by hand mount an nfs share, forcing v4, it seems to work, but only the mount. Files and operations will all fail, except for the df command.
If you try to login and have automount mount the home, it will fail.
If you force automount to use v4 on the 10G client, it will seem to mount, but the user still fails to login. The home looks mounted, but you can not do any operations on it.

Interestingly, the server has NO LOGS of the client attempting an authentication request. e.g. when a working copper client has a user login, syslog on the NFS server logs an authenticated nfs request. When the same client tries to login to the 10G workstation, there is no log on the NFS server of a mount request. It is as if the request is not getting to the server.

Again, from the 10G workstation, everything else on the network works. File transfers, hitting servers (even the NFS server via ssh, http, every port I try works). This problem only seems to impact NFS.

The fundamental question of this post is: What diagnosis do I perform next? It seems I am getting RPC timeouts, but all help/FAQ on the internet point to routing or networking. These hosts are plugged into the same switch, and in fact I had moved them to the same VLAN for testing, with the same results. Any ideas or insight would be appreciated.

UPDATE: I think this is radically important, and is the cause of my issue, but I am not sure how to diagnose this:

From a client with a 10Gig fiber card:

nmap -sC -p111 10.131.39.114
Starting Nmap 7.80 ( https://nmap.org ) at 2021-03-12 15:20 UTC
Nmap scan report for cmixhyperv03.cmix.louisiana.edu (10.131.39.114)
Host is up (0.00011s latency).

PORT    STATE SERVICE
111/tcp open  rpcbind
MAC Address: 00:60:DD:46:D6:DE (Myricom)

Nmap done: 1 IP address (1 host up) scanned in 3.79 seconds

From a similar client, but with 1G copper ethernet:

 nmap -sC -p111 10.131.39.114

Starting Nmap 7.01 ( https://nmap.org ) at 2021-03-12 09:21 CST
Nmap scan report for cmixhyperv03.cmix.louisiana.edu (10.131.39.114)
Host is up (0.00044s latency).
PORT    STATE SERVICE
111/tcp open  rpcbind
| rpcinfo: 
|   program version   port/proto  service
|   100000  2,3,4        111/tcp  rpcbind
|   100000  2,3,4        111/udp  rpcbind
|   100003  2,3,4       2049/tcp  nfs
|   100003  2,3,4       2049/udp  nfs
|   100005  1,2,3      43643/udp  mountd
|   100005  1,2,3      53119/tcp  mountd
|   100011  1,2          787/tcp  rquotad
|   100011  1,2          787/udp  rquotad
|   100021  1,3,4      43334/tcp  nlockmgr
|   100021  1,3,4      51511/udp  nlockmgr
|   100227  2,3         2049/tcp  nfs_acl
|_  100227  2,3         2049/udp  nfs_acl

Nmap done: 1 IP address (1 host up) scanned in 1.21 seconds

Update 20210315

tcpdump to wireshark on the client and server. The only thing I can see different between a copper client that works and a fiber client that fails is that the server gets the connection and everything looks identical to when a copper client connects, however, after it starts to read the home dir files (.bash_profile, etc), the server seems to start retransmitting and getting spurious retransmissions. After a while of this, NFS is still trying to load the dir, then I see a TCP RST, ACK and RST and then NFS NFSERR_BADSESSION. I can not tell, thus far, from wireshark why the server is retransmissing or why the client is failing...

I have thus far swapped the 10gig switch with another, and also used different clients. No luck.

Your client is trying to mount from IP `10.131.39.114`, but your server is configured to use `10.131.31.181`. The `nmap` output also shows that the MAC address of `10.131.39.114` is `00:60:DD:46:D6:DE`, but the earlier configuration info you post from the server shows that the server's MAC is `00:60:dd:44:96:a8`. It would seem that perhaps you're not talking to the host you think you are... ? Are you able to `ping 10.131.31.181`? Does `showmount -e 10.131.31.181` show anything interesting? — Jim L., Mar 15 '21 at 22:35
Those look confusing as I keep switching interfaces. When I test the fiber, I switch to the fiber. Since NFS is not working, I switch back... and sometimes I acidentially grab the copper ethernet when I post here. I am sure I am hitting the correct server on the correct subnet. However, showmount -e fails, timeout, even on the same LAN without VLANS. — number9, Mar 15 '21 at 23:39
@JimL. I see, your confusion is that at first i was trying this from the .31 netwok, but then to ensure it was not the fact that I was crossing VLANs, I moved the box to the .39 network, same as the NFS server, for more testing. I even plugged it into the same switch. I wonder now, since multiple clients on multiple networks have this issue, could it be the servers fiber network adapter? But then, why do all of the copper clients work? The only other thing I can think now that is different is the MTU (copper 1500, fiber 9000) — number9, Mar 16 '21 at 10:50
I beg to differ, but the MAC address discrepancy suggests that you are NOT hitting the correct server on the correct subnet. Make it work on an isolated, stand-alone network. Once that works, migrate it to your pre-defined network. Methinks you are falling victim to your own mistaken assumptions Start small. Prove that it works. Then expand. — Jim L., Mar 16 '21 at 13:26
@JimL. I did that. I move the workstation to the LAN that the server is on, physically, and turn off the other ethernet devices. I get the same results. — number9, Mar 16 '21 at 17:08

number9 · Answer 1 · 2021-03-18T00:03:32.987

After much gnashing of teeth, I had a sudden realization... as noted, I have a workstation with both copper and fiber that I was testing and the fiber does not work.. however, it occured to me that they both had to cross the vlan boundary, and due to the fact that my switch is L2 only, they are talking to the router.

The answer that I had here... was incorrect. Moving the client to 1500MTU does "solve" the problem, which led myself and the network team to think that the routers MTU was also at 1500. This is incorrect. If we move the workstation to a standalone switch and set the MTU of everyone to 9000, it does not work. As it turns out... it seems like NFS does not like MTU 9000.

I am referencing these articles, but this problem is not "solved" in the sense that I am using 10Gig with Jumbo frames. It is solved in that if you move a client to have an MTU of 1500 bytes, it works.

NFS stopped working after network interface card upgrade

1 Answers1