What could cause this particular make process to go into uninterruptible sleep ('D' state)?

Question

I am trying to understand 'D' state correctly.

In my case, the following process went to 'D' state:

make -f freac/CMakeFiles/freac_objs.dir/build.make freac/CMakeFiles/freac_objs.dir/build

It is using an NFS share.

Also, the load keeps increasing. The load average is now at 1600 (on 40 CPUs). I think 40 would be the acceptable limit for 40 processors.

OK, leaving that aside, there are three things I want to know:

1. Why does the load increase when a process is in 'D' state?
2. Why does a process go to 'D' state if access to an NFS share is troublesome, instead of the process getting killed outright?
3. What could cause a sudden issue in accessing an NFS share? (Could it be due to the network in most cases?)

Thanks!

The metric is just called "load", not "cpu load". It's not tied to the cpu. See http://unix.stackexchange.com/a/116865/4358 — phemmer, Jun 21 '16 at 12:28

score 4 · Accepted Answer · answered Jun 21 '16 at 12:47


A process in 'D' state is normally (but not always) "blocked on I/O wait". This can happen if a disk is busy and suffering high service times, for example. Processes in D state count towards the load average, even though they're not using real CPU resources.
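To see which processes are currently contributing to the load this way, something like the following could be used. It is only a minimal sketch, assuming a Linux host with `/proc` mounted; the function names `parse_stat`, `uninterruptible_tasks` and `report` are made up for illustration, and it looks only at each process's main thread.

```python
from pathlib import Path


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process whose state field is 'D'."""
    found = []
    for stat_file in Path(proc_root).glob("[0-9]*/stat"):
        try:
            pid, comm, state = parse_stat(stat_file.read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or file unreadable) while scanning
        if state == "D":
            found.append((pid, comm))
    return found


def report():
    """Print the load averages next to the processes currently in D state."""
    loadavg = Path("/proc/loadavg").read_text().split()
    print("load averages (1/5/15 min):", loadavg[:3])
    for pid, comm in uninterruptible_tasks():
        print(f"{pid}\t{comm}")
```

Calling `report()` a few times while the load is climbing should show whether the D-state count grows along with it.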

In the case of NFS, a process can spend a lot of time in 'D' state waiting for the NFS server to respond.

The default behaviour of an NFS client over TCP is to wait up to 60 seconds for a response before retrying the request (see the timeo option in man nfs; the value is given in tenths of a second, and the TCP default is 600). This means a process may be in I/O wait for at least 60 seconds if there is a problem.

What happens then will depend on the retrans setting and the hard/soft settings.

If the filesystem is mounted hard then retries happen indefinitely; if mounted soft then the I/O request eventually fails once the retries are used up. Either way, the failure isn't immediate because of the timeo and retrans options.
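To check which of these settings a client is actually using, the options column of `/proc/mounts` can be read. The following is a rough sketch, not a definitive tool: the function name `nfs_mount_settings` is hypothetical, and it assumes the kernel lists `hard`/`soft`, `timeo=` and `retrans=` in the mount options, which may vary between kernel versions.

```python
def nfs_mount_settings(mounts_text):
    """Extract hard/soft, timeo and retrans for each NFS entry in /proc/mounts text."""
    results = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        settings = {
            "mountpoint": fields[1],
            "mode": None,           # "hard" or "soft"
            "timeo_seconds": None,  # timeo is given in tenths of a second
            "retrans": None,
        }
        for opt in fields[3].split(","):
            if opt in ("hard", "soft"):
                settings["mode"] = opt
            elif opt.startswith("timeo="):
                settings["timeo_seconds"] = int(opt.split("=", 1)[1]) / 10
            elif opt.startswith("retrans="):
                settings["retrans"] = int(opt.split("=", 1)[1])
        results.append(settings)
    return results
```

On a live system it could be called as `nfs_mount_settings(open("/proc/mounts").read())`.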

Clients can see NFS issues for a number of reasons; a common one is network bandwidth (especially if you're on a WiFi network). Another one is the volume of requests (if you run things in parallel then you could be causing a bottleneck). The server itself may be suffering from poor disk performance and so responding slowly to NFS requests, or the server may not be running enough daemon threads to handle the volume of requests.

Stephen Harris


Thanks! That explains my actual question! :) It seems soft mounts are not recommended, as they may corrupt data. So it's better to use a hard mount even though we sometimes face these hangs. But it still doesn't answer my first question: why can't it kill the process instead of taking it to D state? What can it achieve by taking it to D state that it can't achieve by killing it? – GP92 Jun 21 '16 at 13:00
All processes go into D state when doing I/O and waiting for a device to respond, whether it's a local disk or an NFS server or anything else. It's the normal process flow. If a program was killed instead of going into D state then you'd never get anything done :-) The problem with NFS is that D state times can be extended (because it depends on network I/O and remote servers and retry windows...) so you see it frequently with NFS, but it's not limited to NFS and can occur elsewhere. – Stephen Harris Jun 21 '16 at 13:05
Hi Stephen, thanks for explaining. So in my case, the process has been in D state for a long time and still is. Does that mean that the NFS share is still not accessible? It is accessible from other servers, however. This is where I am confused. – GP92 Jun 21 '16 at 13:07
What I understand is that the process should pick up and continue from where it left off when the I/O is available (i.e., NFS is accessible). However, I am not sure whether NFS caused this; I can't think of any other reason, nor can I find any info in the logs. – GP92 Jun 21 '16 at 13:10
There may not actually be a problem; if you're doing a _lot_ of I/O then you may just be seeing the results of a slow (compared to local disk) filesystem. If you `strace` the process you might see it doing things. If there is a problem then it'll typically show up as "NFS server not responding" type messages. – Stephen Harris Jun 21 '16 at 13:10
Yes, I did find this message in the logs: NFS server not responding. But not in this case. It was observed on some other servers before, and the rest was the same: the process hung and we rebooted. But I don't know how long the NFS server was unresponsive. And here I can't find any such messages, only these: `kernel: INFO: task make:27163 blocked for more than 120 seconds.` – GP92 Jun 21 '16 at 13:14
So I guess it may not be an NFS issue in my case. – GP92 Jun 21 '16 at 13:15
It may be an NFS issue, but the underlying cause could be one of many reasons. I'd recommend opening another question focused on that aspect (indeed there may already be answers to help with that diagnosis); that'll get more visibility than trying to work it out in these comments :-) – Stephen Harris Jun 21 '16 at 13:19
Yes, sure, thanks! I already have a similar one, asked from a different perspective. If you could, please also take a look at it: http://unix.stackexchange.com/questions/287910/what-happens-if-dbus-connection-fails. I will create a new question after sorting out the rest of my confusion. :) – GP92 Jun 21 '16 at 13:24
