Yes, I know this question has been asked tens if not hundreds of times before. Still, I went through all similar questions and tried everything listed on them, to no avail.
After compiling some code on my Raspberry Pi 4 model B running Ubuntu 21.04 (Linux rpi4 5.11.0-1017-raspi #18-Ubuntu SMP PREEMPT Mon Aug 23 07:34:31 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux), ccache got stuck and it's been running for almost an hour at 100% CPU. Here's the output of ps -l for the offending process:
$ ps -l -p 7580
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
1 R 1000 7580 1 99 80 0 - 1725 - pts/2 00:54:10 ccache
I tried to kill it and to kill -9 it. No effect. It doesn't appear to be a zombie; it's running:
$ sudo cat /proc/7580/syscall
running
I tried to attach to it with strace and gdb, and both hang.
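When strace and gdb both hang, the files under /proc can still show what the kernel thinks the process is doing. Below is a minimal sketch of that check. The function name proc_report and its optional second argument (an alternative proc root, handy for testing) are my own inventions, and reading /proc/<pid>/stack normally requires root and a kernel built with stack tracing.

```shell
# Hypothetical helper: summarize what the kernel reports about a PID.
# Usage: proc_report PID [PROC_ROOT]   (PROC_ROOT defaults to /proc)
proc_report() {
    pid="$1"
    root="${2:-/proc}"
    if [ ! -d "$root/$pid" ]; then
        echo "no such process: $pid"
        return 1
    fi
    # Scheduler state (R, S, D, Z, ...) as listed in the status file.
    grep '^State:' "$root/$pid/status"
    # Kernel function the task is waiting in, if the kernel exposes it.
    printf 'wchan: %s\n' "$(cat "$root/$pid/wchan" 2>/dev/null)"
    # Kernel-side stack trace; usually only readable by root.
    if [ -r "$root/$pid/stack" ]; then
        echo 'kernel stack:'
        cat "$root/$pid/stack"
    else
        echo 'kernel stack: not readable (try running as root)'
    fi
}
```

For the process above this would be run as root with PID 7580, for example via sudo sh -c after defining the function. A task that shows state R yet ignores SIGKILL is presumably spinning or stuck inside the kernel, so the kernel stack is the part most likely to be informative.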
I found the parent processes (as indicated by ps auxf), killed all of them, and then tried to kill the offending process again. It didn't work.
I ran dmesg and looked at /var/log/syslog, going back to the time I started the process. I found no clues to help me debug it.
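Scrolling through the kernel log by eye makes it easy to miss the relevant lines, so a filter can help. This is only a sketch: scan_klog is a hypothetical name, and the patterns are my guess at which messages point to storage trouble, not an exhaustive list.

```shell
# Hypothetical filter: keep kernel log lines that often accompany
# storage trouble (hung tasks, MMC host messages, block I/O errors).
# Reads log text on stdin so it works with dmesg or a syslog file.
scan_klog() {
    grep -E 'blocked for more than|mmc[0-9]+:|mmcblk|I/O error|EXT4-fs error'
}
```

It could be used as sudo dmesg | scan_klog, or fed from /var/log/syslog with input redirection. An empty result only means none of these particular patterns appeared.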
Normally I'd just reboot and get on with life, but this is the third reboot of the day (two due to ccache, and another due to a shell script called cpuUsage.sh remotely installed by VS Code), and I suspect this will become the norm from now on. I've had this board for a couple of months, and this hadn't happened until today.
My only reasonable, yet unsubstantiated, hypothesis is that the SD card the board is booting from may be bad, but I have no idea how to diagnose this.
Although I'd love to be told a magic command that kills this process, I'm fairly sure there is no such thing, given all that I've tried until now. My question is: assuming this continues to happen, how can I diagnose this? It's evidently unsustainable to keep rebooting this board multiple times daily, as I suspect I may have to do from now on.
EDIT: following a suggestion in the comments, I tried the following while watching the dmesg output:
$ sudo dd if=/dev/mmcblk0p2 of=/dev/null bs=1M
60648+1 records in
60648+1 records out
63595068928 bytes (64 GB, 59 GiB) copied, 1386,27 s, 45,9 MB/s
I saw this in the dmesg output:
[27430.135999] INFO: task kworker/3:2:12138 blocked for more than 120 seconds.
[27430.136031] Tainted: G C OE 5.11.0-1017-raspi #18-Ubuntu
[27430.136041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27430.136050] task:kworker/3:2 state:D stack: 0 pid:12138 ppid: 2 flags:0x00000008
[27430.136067] Workqueue: events_freezable mmc_rescan
[27430.136088] Call trace:
[27430.136092] __switch_to+0xb8/0xe4
[27430.136102] __schedule+0x2bc/0x7dc
[27430.136110] schedule+0x7c/0x110
[27430.136117] __mmc_claim_host+0xc0/0x1f0
[27430.136124] mmc_get_card+0x40/0x50
[27430.136130] mmc_sd_detect+0x2c/0xa0
[27430.136136] mmc_rescan+0xc8/0x314
[27430.136143] process_one_work+0x200/0x4f0
[27430.136151] worker_thread+0x74/0x3c0
[27430.136158] kthread+0x12c/0x140
[27430.136164] ret_from_fork+0x10/0x3c
Given the presence of SD card-related functions in the stack trace, this seems to support my suspicion of a bad SD card.
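The kworker in the trace is in state D (uninterruptible sleep), waiting in __mmc_claim_host. To see whether other tasks pile up in the same state during a read test, the output of ps can be filtered as sketched below. The function name d_state_tasks is hypothetical, and the sketch assumes the procps version of ps, with the state in the second column.

```shell
# Hypothetical filter: keep the header line plus every task whose
# STAT column (second field) starts with D (uninterruptible sleep).
d_state_tasks() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

A possible invocation is ps -eo pid,stat,wchan:32,comm | d_state_tasks, where the wchan column shows the kernel function each task is blocked in. Several tasks stuck in mmc-related functions would be consistent with a problem in the card or the host controller, though it would not tell the two apart.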