
So, this has been annoying me for years.

This happens with more programs than just dd, but I find it occurs most often with programs that do raw filesystem or block-device manipulation.

When I'm copying with dd -- e.g., making a bootable USB disk with sudo dd if=somelinuxdistro.iso of=/dev/sdb bs=64K status=progress -- it's as if all my signals are ignored by the application (or by the kernel, in the case of SIGKILL). htop shows status D, which apparently means "uninterruptible sleep". The process can stay in this state for ages if there's a hardware glitch, and I can't find any way of detaching it from the terminal so I can keep working -- often I just end up switching to a different terminal to finish what I was doing.
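For context, processes in this state are easy to spot without htop. On Linux, a one-liner like the following lists every process whose state starts with D, along with the kernel function it is blocked in (the WCHAN column):

```shell
# List processes in uninterruptible sleep (state D).
# STAT is the process state; WCHAN shows where in the kernel it is blocked.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'
```

On a healthy system this usually prints only the header line; a long-lived entry here is the symptom described above.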

I've looked this up before, but I've never found an explanation of what this state is for, why the kernel refuses to kill a process in this state, or what the recommended thing to do is when this happens so I don't waste time.

In short: I'd like a way of reliably force-killing processes in state D, or at least detaching them from the terminal. I'd also like an explanation of what's going on in the background that causes them to enter this state in the first place.

Rui F Ribeiro
Alexandria P.

1 Answer


If you cannot interrupt an "uninterruptible read" and it is not related to an NFS server that has gone away, you have discovered a driver bug.

I/O to local background storage should not have a timeout larger than 5-10 minutes. So if you type ^C or ^Z and nothing happens within 10 minutes, there is a driver bug.
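As a point of reference, on Linux the SCSI layer's per-command timeout for a disk is exposed in sysfs and can be inspected directly (the exact set of devices that expose it depends on your hardware):

```shell
# Print the kernel's per-command SCSI timeout, in seconds, for each
# block device that exposes one (commonly 30 by default).
for t in /sys/block/*/device/timeout; do
    [ -e "$t" ] || continue
    printf '%s: %s seconds\n' "$t" "$(cat "$t")"
done
```

A failing command is retried several times before the kernel gives up, which is why the overall wait can stretch to minutes even with a 30-second per-command timeout.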

The background is that UNIX defines so-called "fast I/O" as not interruptible by signals, because fast I/O is expected to complete within a foreseeable amount of time.

Making I/O interruptible by signals causes high overhead, because the kernel would need to get back to a clean state: everything that happened after starting the I/O would have to be unwound, and a return could only happen from the point where the I/O was initiated.

Even worse, if a background-storage driver did implement interruptible I/O, this would cause unhandleable problems in the filesystems layered above such a driver. You are using a driver that is intended to serve as background storage for a filesystem...

You could run dmesg and check the kernel messages for your problem. If interrupting really does not work after 10 minutes (when a single read or write system call is expected to time out, giving you a chance to kill dd between two such syscalls), you need to reboot.
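That window "between two syscalls" can be watched for automatically: poll the process state in /proc and deliver the signal the moment the process is no longer in D. This is only a sketch, and `kill_when_interruptible` is a name invented here, not a standard tool:

```shell
# kill_when_interruptible PID: poll /proc/PID/stat and send SIGTERM as
# soon as the process leaves uninterruptible sleep (state D).
kill_when_interruptible() {
    pid=$1
    while [ -d "/proc/$pid" ]; do
        # The field after the closing ")" of the comm field is the state letter.
        state=$(sed 's/.*) //' "/proc/$pid/stat" 2>/dev/null | cut -d' ' -f1)
        if [ "$state" != "D" ]; then
            kill -TERM "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
    done
    return 1   # the process disappeared before we could signal it
}
```

If the per-syscall timeout never fires (the driver-bug case the answer describes), the process never leaves D and this loop simply spins until reboot -- which is exactly the limitation being explained.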

If this is a device attached via USB, you could try unplugging it before you reboot.

schily
  • I understand that there may be problems with the driver or with the hardware -- but this doesn't address my question of how to deal with the workflow problem. Nor does it explain why `fast IO` is designed to be non-interruptable when it potentially causes problems by assuming the backend is working properly. What advantage is there to blocking the user from killing a process like this? If the user is trying to force kill it, they surely already understand that file corruption is likely. – Alexandria P. Sep 10 '18 at 21:04
  • Can you clarify what you meant by "Making IO interruptable by signals causes a high overhead as there is a need to go back to a clean state"? – Alexandria P. Sep 11 '18 at 03:33
  • Hmmm. I'm still a little confused. Why can't the kernel just *drop* all the IO stuff it was doing? I mean, that's seemingly what happens when the device gets physically unplugged -- so it can't be any worse than unplugging a device. – Alexandria P. Sep 12 '18 at 07:25