
I am running Debian bullseye on a (second-hand) bare-metal server that crashes occasionally (it has happened 3 times within the last 8 days), and I can't figure out why. I haven't found a way to reproduce it either, because the cause seems to come from outside the system.

On all three occasions, the following happened:

  • The system is (practically) idle
  • There is a NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out error with a stack trace in the kernel log, with no preceding messages (the gap to the previous message is usually a few hours).
  • The message e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx keeps repeating every 10-20 seconds.
  • At this point the network is down and I can no longer reach the server, so on every occasion I issued a hardware reset to get it up and running again.

The very first time, I did try to reset the network through the console (I didn't try removing and reinserting the driver module, though; I'm not sure whether that would have helped). All in all it didn't seem to be a very fruitful endeavor, so I decided to reboot and hope for the best.

Can anyone suggest an approach for debugging the situation if it arises again, some pointers on how to reproduce the problem, and a way to get the network running again without a hardware reset?

Log files

(this is just the first time, the logs are equal or, at least, very similar for all 3 occasions)

Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662109] ------------[ cut here ]------------
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662249] NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662401] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662554] Modules linked in: dm_mod xt_nat vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc intel_rapl_msr intel_rapl_common intel_pmc_core_pltdrv intel_pmc_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel evdev kvm irqbypass rapl intel_cstate intel_uncore wdat_wdt intel_pch_thermal watchdog ee1004 serio_raw ie31200_edac acpi_pad button drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multipath linear raid1 md_mod crc32_pclmul crc32c_intel ahci xhci_pci ghash_clmulni_intel xhci_hcd libahci nvme e1000e libata aesni_intel usbcore libaes crypto_simd scsi_mod nvme_core ptp psmouse pps_core cryptd glue_helper t10_pi i2c_i801 crc_t10dif
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.663385]  crct10dif_generic i2c_smbus crct10dif_pclmul crct10dif_common wmi usb_common video
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664310] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.0-21-amd64 #1 Debian 5.10.162-1
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664461] Hardware name: FUJITSU /D3417-B2, BIOS V5.0.0.12 R1.27.0.SR.1 for D3417-B2x               06/10/2020
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664630] RIP: 0010:dev_watchdog+0x260/0x270
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664747] Code: eb a9 48 8b 1c 24 c6 05 c7 16 0d 01 01 48 89 df e8 b5 73 fa ff 44 89 e9 48 89 de 48 c7 c7 08 b8 b6 91 48 89 c2 e8 da a0 14 00 <0f> 0b eb 86 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664968] RSP: 0018:ffffbb7e40128eb0 EFLAGS: 00010282
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665088] RAX: 0000000000000000 RBX: ffff920c20740000 RCX: 000000000000083f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665234] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665381] RBP: ffff920c207403dc R08: 0000000000000000 R09: ffffbb7e40128cd0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665532] R10: ffffbb7e40128cc8 R11: ffffffff920cb6a8 R12: ffff920b4143c080
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665681] R13: 0000000000000000 R14: ffff920c20740480 R15: 0000000000000001
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665832] FS:  0000000000000000(0000) GS:ffff921a2e440000(0000) knlGS:0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665985] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666101] CR2: 000000c0002f9000 CR3: 0000000c9480a001 CR4: 00000000003726e0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666249] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666394] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666537] Call Trace:
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666646]  <IRQ>
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666754]  ? pfifo_fast_enqueue+0x150/0x150
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666868]  call_timer_fn+0x27/0x100
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666988]  __run_timers.part.0+0x1d9/0x250
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667106]  ? ktime_get+0x35/0xa0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667223]  ? lapic_next_deadline+0x28/0x40
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667340]  ? clockevents_program_event+0x8a/0xf0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667462]  run_timer_softirq+0x26/0x50
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667536]  __do_softirq+0xc2/0x279
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667610]  asm_call_irq_on_stack+0xf/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667684]  </IRQ>
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667755]  do_softirq_own_stack+0x37/0x50
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667830]  irq_exit_rcu+0x92/0xc0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667904]  sysvec_apic_timer_interrupt+0x36/0x80
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667980]  asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668057] RIP: 0010:cpuidle_enter_state+0xc7/0x350
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668133] Code: 8b 3d dd 71 f4 6e e8 b8 9a 9f ff 49 89 c5 0f 1f 44 00 00 31 ff e8 29 a6 9f ff 45 84 ff 0f 85 fe 00 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 0a 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668256] RSP: 0018:ffffbb7e400c3ea8 EFLAGS: 00000246
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668334] RAX: ffff921a2e473c40 RBX: 0000000000000006 RCX: 000000000000001f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668425] RDX: 0000000000000000 RSI: 0000000021c15a3d RDI: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668517] RBP: ffff921a2e47e800 R08: 00007429fb821b6a R09: 0000000000000001
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668608] R10: 0000000000000000 R11: 0000000000002b55 R12: ffffffff921aea80
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668700] R13: 00007429fb821b6a R14: 0000000000000006 R15: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668792]  ? cpuidle_enter_state+0xb7/0x350
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668867]  cpuidle_enter+0x29/0x40
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668941]  do_idle+0x1f3/0x2b0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669015]  cpu_startup_entry+0x19/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669089]  secondary_startup_64_no_verify+0xb0/0xbb
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669166] ---[ end trace 4e1f5ac6215c3384 ]---

Hardware info

# lspci -vvvv -s 0000:00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
    Subsystem: Fujitsu Technology Solutions Ethernet Connection (2) I219-LM
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 126
    IOMMU group: 8
    Region 0: Memory at ef200000 (32-bit, non-prefetchable) [size=128K]
    Capabilities: [c8] Power Management version 3
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee002b8  Data: 0000
    Capabilities: [e0] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: e1000e
    Kernel modules: e1000e

uname -a

Linux Debian-1106-bullseye-amd64-base 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

kernel package info

apt show linux-image-5.10.0-21-amd64
Package: linux-image-5.10.0-21-amd64
Version: 5.10.162-1
Built-Using: linux (= 5.10.162-1)
Priority: optional
Section: kernel
Source: linux-signed-amd64 (5.10.162+1)
Maintainer: Debian Kernel Team <debian-kernel@lists.debian.org>
Installed-Size: 318 MB
Depends: kmod, linux-base (>= 4.3~), initramfs-tools (>= 0.120+deb8u2) | linux-initramfs-tool
Recommends: firmware-linux-free, apparmor
Suggests: linux-doc-5.10, debian-kernel-handbook, grub-pc | grub-efi-amd64 | extlinux
Conflicts: linux-image-5.10.0-21-amd64-unsigned
Breaks: fwupdate (<< 12-7), initramfs-tools (<< 0.120+deb8u2), wireless-regdb (<< 2019.06.03-1~), xserver-xorg-input-vmmouse (<< 1:13.0.99)
Replaces: linux-image-5.10.0-21-amd64-unsigned
Homepage: https://www.kernel.org/
Download-Size: 55.5 MB
APT-Manual-Installed: no
APT-Sources: http://security.debian.org/debian-security bullseye-security/main amd64 Packages
Description: Linux 5.10 for 64-bit PCs (signed)
 The Linux kernel 5.10 and modules for use on PCs with AMD64, Intel 64 or
 VIA Nano processors.
 .
 The kernel image and modules are signed for use with Secure Boot.
  • Possibly related, since there seems to be a problem in some interrupt handling, though I don't expect it to help: I see your NIC is MSI-capable but routed to IRQ 126. Did you try preferring MSIs? – MC68020 Feb 28 '23 at 09:40
  • @MC68020 I have no idea what that means, so I guess I haven't :D Can you give me some pointers? – Gerard van Helden Feb 28 '23 at 09:54
  • I dug a little bit and noticed that in /proc/interrupts IRQ 126 is assigned to CPU 3, while in the stack traces the errors are either CPU 1 or CPU 2.... Could that be related? Or does the CPU assignment rotate in some way? – Gerard van Helden Feb 28 '23 at 11:01
  • Have a look at [this fix](https://forum.proxmox.com/threads/e1000-driver-hang.58284/page-9#post-529754) in the proxmox forum (either with ethtool which works until reboot or permanently in your network config). – Freddy Feb 28 '23 at 12:04
  • Thanks @Freddy, I will try that. I have it configured now, though I am not 100% convinced that this will solve it. I'll keep an eye out. If it no longer happens within the next couple of days, I'll draft an answer. – Gerard van Helden Mar 01 '23 at 12:01

1 Answer


One of the first things to try when seeing TX timeouts is to disable TSO.

sudo ethtool -K enp0s31f6 tso off
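
To confirm what the NIC is currently doing, ethtool's lowercase `-k` option lists the offload settings without changing them. The following is a minimal sketch, not part of the original answer; the helper name `tso_state` is made up for illustration, and it assumes the usual `tcp-segmentation-offload: on|off` line in that listing.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -k <iface>` output on stdin and
# print the state of TCP segmentation offload ("on" or "off").
tso_state() {
    awk '$1 == "tcp-segmentation-offload:" { print $2; exit }'
}
```

Usage would be along the lines of `ethtool -k enp0s31f6 | tso_state`, which should print `off` once TSO has been disabled.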

I'm also interested to know if ethtool -S enp0s31f6 shows any odd counters, for instance, any errors, or specifically tx_tcp_seg_failed and tx_tcp_seg_good.

If you are having interrupt issues, which I would be surprised about, then you could always try disabling MSI or MSI-X when loading the driver with the IntMode= parameter. See the Kernel documentation.
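
As a rough illustration of the IntMode= route, a modprobe configuration fragment could look like the one below. The file name is an arbitrary choice, and the value mapping in the comment is my reading of the e1000e driver documentation, so please verify it against the documentation for your kernel version before relying on it.

```
# /etc/modprobe.d/e1000e-intmode.conf  (hypothetical file name)
# Assumed mapping per the e1000e documentation:
#   0 = legacy interrupts, 1 = MSI, 2 = MSI-X
options e1000e IntMode=0
```

The option only takes effect the next time the module is loaded. If e1000e is loaded from the initramfs, the initramfs probably needs regenerating as well (on Debian, `update-initramfs -u`) before a reboot picks it up.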

For reference, here is the output from my I219 running e1000e. If any of your stats below are non-zero where mine are zero, I'd suggest looking closer at why those stats are going up.

$ ethtool -S enp0s31f6 | grep tx_
     tx_packets: 133102433
     tx_bytes: 178802443357
     tx_broadcast: 163
     tx_multicast: 5121
     tx_errors: 0
     tx_dropped: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     tx_tcp_seg_good: 20245901
     tx_tcp_seg_failed: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     tx_smbus: 0
     tx_dma_failed: 0
     tx_hwtstamp_timeouts: 0
     tx_hwtstamp_skipped: 0
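
To make that comparison less tedious, here is a small sketch that filters `ethtool -S` output down to the tx_ counters that are non-zero, skipping the plain traffic counters that are non-zero in my output too. The function name `nonzero_tx_anomalies` is made up, and the exclusion list is simply taken from the reference output above, so adjust it as needed.

```shell
#!/bin/sh
# Hypothetical helper: read `ethtool -S <iface>` output on stdin and print
# every tx_* counter with a non-zero value, except ordinary traffic counters.
nonzero_tx_anomalies() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }          # strip the leading indentation
        $1 ~ /^tx_/ &&
        $1 !~ /^tx_(packets|bytes|broadcast|multicast|tcp_seg_good)$/ &&
        $2 + 0 != 0 { print $1 ": " $2 }     # keep only non-zero values
    '
}
```

It would be used as `ethtool -S enp0s31f6 | nonzero_tx_anomalies`. Empty output means none of the remaining tx_ counters have moved since the driver was loaded.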
  • Hi Jesse, thanks, I have turned `gso` and `tso` off as per Freddy's suggestion earlier and will give it a few days to see if the issue resurfaces. I did that yesterday, so I'm guessing the numbers of `ethtool -S` wouldn't be very helpful now (?). What would denote 'odd' numbers, you think? – Gerard van Helden Mar 02 '23 at 23:29
  • There isn't much of a reason to turn off GSO, it's just software segmentation of frames in an efficient way. The driver sees one packet at a time, never larger than MTU, once TSO is turned off, regardless of GSO enabling. As for the counters with `ethtool -S` I'll add a little to my answer. – Jesse Brandeburg Mar 04 '23 at 02:24
  • Thanks, that was really helpful! – Gerard van Helden Mar 04 '23 at 12:52