AMD Radeon graphics lockup: could this be a hardware problem?

Question

I've had this computer, an AMD based desktop with a Radeon Vega 56 graphics card, for about 2½ years. It's been pretty solid all throughout, including playing games which make it run like a space heater. It's crashed a couple of times in the past month, which is not great but I've been busy so I rebooted and moved on. Today, though, it's crashing constantly. The crashes result in logs like this:

Jan 16 17:05:16 [hostname] kernel: rfkill: input handler disabled
Jan 16 17:05:21 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D0 to D3hot (config space inaccessible)
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=77, emitted seq=79
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1396 thread gnome-shel:cs0 pid 1453
Jan 16 17:05:28 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:955
Jan 16 17:05:48 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DF8C (len 824, WS 0, PS 0) @ 0xE10C
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DE46 (len 326, WS 0, PS 0) @ 0xDF36
Jan 16 17:05:48 [hostname] kernel: [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
Jan 16 17:06:08 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:06:08 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C0B6 (len 62, WS 0, PS 0) @ 0xC0D2
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: [drm:dce110_vblank_set [amdgpu]] *ERROR* Failed to get VBLANK!
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x22, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x25, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x30, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x4000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x400, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x30f, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x2000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x80000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x40, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 16 17:06:10 [hostname] kernel: [drm] Timeout wait for RLC serdes 0,0
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 0000000034d786ac; ring_buffer_end = 00000000c05dc59d; write_frame = 0000000094e0183d
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Jan 16 17:06:10 [hostname] kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to unload asd
Jan 16 17:06:10 [hostname] kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: MODE1 reset
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset
Jan 16 17:06:10 [hostname] kernel: [drm] psp is not working correctly before mode1 reset!
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset failed
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:28:00.0
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset(2) failed
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D3cold to D0 (config space inaccessible)
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset end with ret = -22
Jan 16 17:06:20 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jan 16 17:06:30 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

When this happens, the monitor abruptly says there's no signal and goes dark. However, the system isn't actually down: I can SSH in and look at the logs, add and remove software, etc.
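Since the log is long and repetitive, a rough way to boil it down is to count lines per category and collect the distinct powerplay error codes. The sketch below is only an illustration: the category names and the `summarize` / `powerplay_error_codes` helpers are made up for this post, and the match strings are copied from the log lines above.

```python
import re
from collections import Counter

# Hypothetical category names mapped to substrings taken from the log above.
PATTERNS = {
    "ring_timeout": "ring gfx timeout",
    "powerplay_failed": "[powerplay] Failed message",
    "atombios_stuck": "atombios stuck",
    "config_space_inaccessible": "config space inaccessible",
    "reset_failed": "reset failed",
}

POWERPLAY_RE = re.compile(
    r"\[powerplay\] Failed message: (0x[0-9a-f]+), "
    r"input parameter: (0x[0-9a-f]+), error code: (0x[0-9a-f]+)"
)


def summarize(lines):
    """Count how many log lines contain each category's substring."""
    counts = Counter()
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


def powerplay_error_codes(lines):
    """Return the set of distinct error codes in powerplay failure lines."""
    codes = set()
    for line in lines:
        match = POWERPLAY_RE.search(line)
        if match:
            codes.add(match.group(3))
    return codes


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "Jan 16 17:05:21 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D0 to D3hot (config space inaccessible)",
    "Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=77, emitted seq=79",
    "Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff",
    "Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff",
    "Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset failed",
]

print(dict(summarize(SAMPLE)))
print(sorted(powerplay_error_codes(SAMPLE)))
```

To run it against a real boot, `SAMPLE` could be replaced with the lines of `journalctl -k -b -1` saved to a file. In the log above every powerplay failure carries the same error code, `0xffffffff`.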

I know there's a new kernel (5.10) and updated mesa in updates-testing, so the first thing I did was roll those back. But the problems persist. In fact, they're getting worse: at first, it was a couple of times over the course of a few hours, but now, as I'm trying to diagnose it, sometimes it won't even let me log in before there's a crash. So, the problem happens with:

kernel 5.10.7
kernel 5.9.16

and with

mesa-* 20.2.6
mesa-* 20.3.3

I even booted a Fedora 33 Live image and, while I can't SSH in to test, I get the same crash, with the monitor cutting out, in under 5 minutes.

It's weird for this to start all of a sudden. I've done some basic web searches, but most of what I see is old and points to various problems with drivers and card quirks. It seems like if that were the problem, this would have been happening all along.

I also don't think it's particularly hot — I've played Baldur's Gate 3 under Wine previously (like, over the holiday break for quite a lot of hours) and I didn't have any problems even though the fan was definitely running and pumping out heat like a space heater. Today, I let it sit turned off for half an hour and it still froze within a few minutes of booting up again.
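The temperature guess could be checked over SSH rather than by feel. The following is a minimal sketch, assuming the card shows up as `card0` and that the driver exposes its sensors through the usual hwmon `temp*_input` files (which hold millidegrees Celsius). The `read_gpu_temps` helper is hypothetical, and the readings may well be unavailable once the GPU has already hung.

```python
from pathlib import Path


def millidegrees_to_celsius(raw: str) -> float:
    """Convert the text of a hwmon temp*_input file to degrees Celsius."""
    return int(raw.strip()) / 1000.0


def read_gpu_temps(card: str = "card0") -> dict:
    """Return {"hwmonN/tempM_input": degrees C} for the given DRM card.

    "card0" is an assumption; the index can differ on other systems.
    """
    base = Path("/sys/class/drm") / card / "device" / "hwmon"
    temps = {}
    if not base.is_dir():
        return temps
    for sensor in sorted(base.glob("hwmon*/temp*_input")):
        try:
            value = millidegrees_to_celsius(sensor.read_text())
        except (OSError, ValueError):
            # Sensors can become unreadable after the GPU stops responding.
            continue
        temps[f"{sensor.parent.name}/{sensor.name}"] = value
    return temps


print(read_gpu_temps())
```

Running this in a loop from boot until the crash would show whether the temperature climbs at all before the monitor cuts out.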

I tried `sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover` as suggested here, but that just gets me

Jan 16 21:41:53 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Jan 16 21:41:53 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: Bailing on TDR for s_job:ffffffffffffffff, as another already in progress

in the logs.
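Given the "config space inaccessible" messages from `snd_hda_intel`, another thing that can be checked over SSH is whether the card still answers PCI config reads at all. This is a rough sketch under a few assumptions: the address `0000:28:00.0` is taken from the log above, the sysfs `config` file is assumed to be readable, and the helper names are invented for this example. The first two bytes of config space are the vendor ID (little-endian), and as far as I understand a read from a device that no longer responds typically comes back as all ones.

```python
from pathlib import Path


def vendor_id(config_bytes: bytes) -> int:
    """Decode the PCI vendor ID from the first two config-space bytes."""
    return int.from_bytes(config_bytes[:2], "little")


def looks_unresponsive(config_bytes: bytes) -> bool:
    """Treat a short read or an all-ones vendor ID as 'not answering'."""
    return len(config_bytes) < 2 or vendor_id(config_bytes) == 0xFFFF


def check(address: str = "0000:28:00.0") -> str:
    """Describe the state of one PCI device; the default address is from the log."""
    path = Path("/sys/bus/pci/devices") / address / "config"
    try:
        data = path.read_bytes()[:64]
    except OSError as exc:
        return f"{address}: could not read config space ({exc})"
    if looks_unresponsive(data):
        return f"{address}: config space reads as all ones or is truncated"
    return f"{address}: vendor ID 0x{vendor_id(data):04x}"


print(check())
```

On a healthy AMD card the vendor ID should come back as `0x1002`. If it reads as all ones after a crash, that would suggest the device has stopped responding on the bus rather than only the driver being confused.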

Any insight? Anything I should try?

Nothing indicates that the system is down. The monitor doesn't receive a signal. To rule out a hardware issue, try the card in different PCIE slot, use a known working discrete graphics card, or try the onboard video if you have it and see if the issue persists. If you have another machine available, you can also try the card with it to see if the issue occurs there as well. — Nasir Riley, Jan 17 '21 at 02:20
Yeah, the system isn't down but it seems like the graphics card crashed. See the message `amdgpu: GPU reset(2) failed`. Gonna take a look at https://unix.stackexchange.com/questions/352226/how-to-restart-a-failed-amdgpu-kernel-module and see if that gets me anywhere. — mattdm, Jan 17 '21 at 02:32
