Running process (not a zombie) that's impossible to kill

Question

Yes, I know this question has been asked tens if not hundreds of times before. Still, I went through all similar questions and tried everything listed on them, to no avail.

After compiling some code on my Raspberry Pi 4 model B running Ubuntu 21.04 (Linux rpi4 5.11.0-1017-raspi #18-Ubuntu SMP PREEMPT Mon Aug 23 07:34:31 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux), ccache got stuck and it's been running for almost an hour at 100% CPU. Here's the output of ps -l for the offending process:

$ ps -l -p 7580
F S   UID     PID    PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
1 R  1000    7580       1 99  80   0 -  1725 -      pts/2    00:54:10 ccache

I tried to kill it and to kill -9 it. No effect. It doesn't appear to be a zombie; it's running:

$ sudo cat /proc/7580/syscall 
running
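
For reference, a minimal sketch of how a little more state can be read from /proc for a stuck process. The function name `proc_summary` is hypothetical. It only reads `/proc/PID/status` and `/proc/PID/wchan`, which are standard procfs files on Linux.

```shell
# Print the scheduler state and kernel wait channel of a process.
# Usage: proc_summary PID   (proc_summary is a hypothetical helper name)
proc_summary() {
    pid=$1
    if [ ! -d "/proc/$pid" ]; then
        echo "no such process: $pid" >&2
        return 1
    fi
    # State line, e.g. "State:  R (running)" or "State:  D (disk sleep)"
    grep '^State:' "/proc/$pid/status"
    # Kernel function the task is sleeping in; typically "0" while it is running
    printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
    return 0
}
```

For the process above this would be called as `proc_summary 7580`. Assuming the kernel exposes it, `sudo cat /proc/7580/stack` may additionally show the kernel-side stack of the task.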

I tried to attach to it with strace and gdb, and both hang.

I tried finding the parent processes (as shown by ps auxf), killing all of them, and then killing the offending process again. It didn't work.

I ran dmesg and looked at /var/log/syslog, going back to the time I started the process. I found no clues to help me debug it.

Normally I'd just reboot and get on with life, but this is the third reboot of the day (two due to ccache, and another due to a shell script called cpuUsage.sh remotely installed by VS Code), and I suspect this will become the norm from now on. I've had this board for a couple of months, and this hadn't happened until today.

My only reasonable, yet unsubstantiated, hypothesis is that the SD card the board is booting from may be bad, but I have no idea how to diagnose this.

Although I'd love to be told a magic command that kills this process, I'm fairly sure there is no such thing, given all that I've tried until now. My question is: assuming this continues to happen, how can I diagnose this? It's evidently unsustainable to keep rebooting this board multiple times daily, as I suspect I may have to do from now on.

EDIT: following a suggestion in the comments, I tried the following while watching the dmesg output:

$ sudo dd if=/dev/mmcblk0p2 of=/dev/null bs=1M
60648+1 records in
60648+1 records out
63595068928 bytes (64 GB, 59 GiB) copied, 1386,27 s, 45,9 MB/s

Saw this on the dmesg output:

[27430.135999] INFO: task kworker/3:2:12138 blocked for more than 120 seconds.
[27430.136031]       Tainted: G         C OE     5.11.0-1017-raspi #18-Ubuntu
[27430.136041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27430.136050] task:kworker/3:2     state:D stack:    0 pid:12138 ppid:     2 flags:0x00000008
[27430.136067] Workqueue: events_freezable mmc_rescan
[27430.136088] Call trace:
[27430.136092]  __switch_to+0xb8/0xe4
[27430.136102]  __schedule+0x2bc/0x7dc
[27430.136110]  schedule+0x7c/0x110
[27430.136117]  __mmc_claim_host+0xc0/0x1f0
[27430.136124]  mmc_get_card+0x40/0x50
[27430.136130]  mmc_sd_detect+0x2c/0xa0
[27430.136136]  mmc_rescan+0xc8/0x314
[27430.136143]  process_one_work+0x200/0x4f0
[27430.136151]  worker_thread+0x74/0x3c0
[27430.136158]  kthread+0x12c/0x140
[27430.136164]  ret_from_fork+0x10/0x3c

Given the presence of SD card-related functions (mmc_rescan, mmc_sd_detect, __mmc_claim_host) in the stack trace, this seems to support my suspicion of a bad SD card.
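
A minimal sketch of filtering the kernel log for this kind of message, so that hung-task reports and MMC-related lines stand out. The function name `filter_storage_warnings` and the choice of patterns are assumptions for illustration, not an exhaustive list of what a failing card can produce.

```shell
# Keep only kernel log lines (read from stdin) that mention hung tasks,
# the MMC/SD subsystem, or I/O errors.
# filter_storage_warnings is a hypothetical helper name.
filter_storage_warnings() {
    grep -E 'blocked for more than [0-9]+ seconds|mmc|I/O error'
}
```

It can be fed from `sudo dmesg | filter_storage_warnings`, or from `sudo dmesg -w` to keep watching while a read test runs.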

re: `hypothesis is that the SD card the board is booting from may be bad, but I have no idea how to diagnose this` - if you suspect bad hardware, then try replacing it. Fortunately, SD cards are cheap. Also, see if you can trigger the problem without `ccache` by repeatedly running `cat /dev/sdX > /dev/null` (where sdX is the device node for your SD card). Maybe run `dmesg -w &` before the cat loop, so you can see any kernel error messages. If the machine locks up while you're cloning your current SD card, that's also a pretty good indicator. — cas, Sep 25 '21 at 07:02
A process that can't be killed with `kill -9` indicates a kernel bug or a hardware failure. When it's a kernel bug, the bug is usually in some driver, the process is in state D inside a syscall, and sometimes there's a way to unblock it with some indirect action on the driver (e.g. disconnecting a peripheral). A bad SD card could be the reason if the kernel image on it is corrupted; the very first thing to try is to replace the SD card, obviously not by copying from the existing card (since you'd be copying the bad image). — Gilles 'SO- stop being evil', Sep 25 '21 at 07:31
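
The repeated read test suggested in the first comment could be sketched roughly as follows. The function name `read_repeatedly` and the default of three passes are assumptions. It uses `dd` as in the question rather than `cat`, and stops at the first pass on which `dd` reports a failure.

```shell
# Read a block device (or any file) end to end several times,
# stopping at the first pass on which dd reports an error.
# Usage: read_repeatedly PATH [PASSES]   (hypothetical helper name)
read_repeatedly() {
    target=$1
    passes=${2:-3}
    i=1
    while [ "$i" -le "$passes" ]; do
        echo "pass $i of $passes"
        # dd exits non-zero if it cannot open or read the input
        if ! dd if="$target" of=/dev/null bs=1M 2>/dev/null; then
            echo "read failed on pass $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    return 0
}
```

On the board in the question this would be run from a root shell as `read_repeatedly /dev/mmcblk0p2`, with `dmesg -w &` started beforehand so that any kernel messages appear while the reads are in progress.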
