
I'm working on a system with multiple NVIDIA GPUs. I would like to disable one of my GPUs (make it disappear), but not the others, without rebooting, and in a way that lets me re-enable it later.

Is this possible?

Notes:

  • Assume I have root (though a non-root solution for users who have permission on the device files is even better).
  • In case it matters, the distribution is either SLES 12 or SLES 15 - don't ask me why :-(
  • I guess some BIOSes let you disable hardware? – 炸鱼薯条德里克 Jun 13 '21 at 10:24
  • @炸鱼薯条德里克: Like I said, I mustn't reboot. So no BIOS access either. – einpoklum Jun 13 '21 at 10:33
  • Good luck in the wonderful world of PCIe hotplugging! It's a known bug that nvidia's GPU linux drivers can't fully de- and re-initialize GPUs. Nvidia announced a month or so ago that they did something about it (I think there was a Phoronix article?). I don't know whether the fixed driver is available yet. Anyway, try with SLES 15, and ignore SLES 12. Much (good) has happened in the last 5 years when it comes to PCIe hotplugging. – Marcus Müller Jun 13 '21 at 11:25
  • Ah no, that was AMD. https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.14-AMDGPU-Hot-Unplug However, if AMD hasn't got this straight, chances are nvidia is worse (I've yet to encounter an instance where the modern kernel AMD drivers are as bad as nvidia's closed source drivers), sorry :( – Marcus Müller Jun 13 '21 at 11:27
  • @einpoklum by the way, for which purpose do you need to disable it? In case this is about it not being used to display stuff, that's a whole different, much much much much MUCH easier problem! – Marcus Müller Jun 13 '21 at 11:59
  • @MarcusMüller: I don't necessarily need to hot-plug it in and out. It would be enough if access to it was refused somehow. As for my purpose - it's for testing. I want to see how something behaves when a GPU fails/disappears. – einpoklum Jun 13 '21 at 12:21
  • Fails/disappears *how* exactly? Because the way devices fail is pretty different from "I remove it and then I can later re-add it": if you can later re-add it, then the device did not fail. That's a very different problem. And "access": that's not how GPUs work, essentially. Could you explain what you mean by "access is refused": to whom? What are you testing here? Kernel drivers? – Marcus Müller Jun 13 '21 at 12:22
  • @MarcusMüller: I can't go into details, unfortunately, since it's in a commercial setting. You're right that one can test for all sorts of problems. – einpoklum Jun 13 '21 at 12:29
  • That's kind of sad, asking for free help, then when being even asked the most fundamental questions about what specifically you want to do, saying, thanks, can't give you even that info for your time. – Marcus Müller Jun 13 '21 at 12:55
  • @MarcusMüller : See my answer. SE is not always / not basically about asking for free help. – einpoklum Jun 13 '21 at 13:37

1 Answer


Disabling:

The following disables a GPU, making it invisible, so that it does not appear on the list of CUDA devices (and does not even take up a device index):

nvidia-smi -i 0000:xx:00.0 -pm 0
nvidia-smi drain -p 0000:xx:00.0 -m 1

where xx is the PCI bus number of your GPU. You can determine it using lspci | grep NVIDIA or nvidia-smi.
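For instance, here is a sketch of pulling the bus ID out of lspci-style output. The sample line below is made up for illustration (your GPU's slot will differ); on a real system you would pipe lspci itself instead of the variable:

```shell
# Extract the PCI bus ID of the first NVIDIA device from lspci-style output.
# "sample" stands in for the output of `lspci` on a real system.
sample='3b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]'
bus_id=$(printf '%s\n' "$sample" | awk '/NVIDIA/ {print $1; exit}')
# nvidia-smi wants the domain-qualified form:
echo "0000:${bus_id}"
```

This prints 0000:3b:00.0 for the sample line, which is the form the drain commands above expect.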

The device will still be visible with lspci after running the commands above.

Re-enabling:

nvidia-smi drain -p 0000:xx:00.0 -m 0

The device should now be visible again.

Problems with this approach

  • This may fail to work if you are not root, and also in some scenarios I can't yet characterize.
  • I haven't yet checked what happens to processes which are actively using the GPU when you do this.
  • The syntax is baroque and confusing. NVIDIA - for shame, you need to make it simpler to disable GPUs.
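To partly work around that baroque syntax, the two steps can be wrapped in a small helper. This is an untested sketch that just assumes the nvidia-smi drain syntax shown above; the DRY_RUN guard (printing the commands instead of running them) is my own addition so the logic can be exercised on a machine without a GPU:

```shell
# Sketch of a toggle helper around the commands above. With DRY_RUN=1 it
# echoes the nvidia-smi invocations instead of executing them.
gpu_toggle() {
    pci="$1"     # e.g. 0000:3b:00.0
    action="$2"  # "off" to disable (drain), "on" to re-enable

    run() {
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    case "$action" in
        off)
            run nvidia-smi -i "$pci" -pm 0       # persistence mode off
            run nvidia-smi drain -p "$pci" -m 1  # enter drain state
            ;;
        on)
            run nvidia-smi drain -p "$pci" -m 0  # leave drain state
            ;;
    esac
}

# Dry-run demonstration (prints the commands, touches no hardware):
DRY_RUN=1
gpu_toggle 0000:3b:00.0 off
```

Run with the DRY_RUN line removed (as root) to actually disable or re-enable the GPU.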