5

The night support guys where I work have a tendency to reboot machines, citing the fact that they can't ssh in to figure out what's going on in the first place. It's pretty much against company policy to do this (and, as the person responsible for the code on the devices, it's against my policy at least).

But, policies and politics aside, there's never actually an instance where resource over-utilization will completely cripple a machine to the point where you can't ssh in at all, is there? In my experience, you get a painfully slow terminal, but ssh gets one or two cycles every couple of minutes, and you can kill the offending process and maybe get a stack dump.
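
For context, what I mean by "kill the offending process" is roughly the following. This is only a sketch; the host name, timeout, and PID are placeholders:

```
# Be patient with the connection and run one short, non-interactive
# command per connection instead of waiting for a full login shell.
ssh -o ConnectTimeout=120 admin@slowhost 'ps aux --sort=-%cpu | head -n 5'

# Then deal with the culprit (the PID is whatever the listing showed);
# SIGQUIT gets a thread dump out of a JVM before the SIGKILL.
ssh -o ConnectTimeout=120 admin@slowhost 'kill -QUIT 12345; sleep 5; kill -9 12345'
```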

It may be expedient to just reboot the machine, but it's my opinion that "if we kill it, we won't learn nothin'". So, if anyone could give me some ammo to make the argument that rebooting is not the answer, along with some troubleshooting pointers to help the overworked night-shift guys ssh in to pretty-much-hosed machines, I'd appreciate it.

Peter Turner

2 Answers

5

If a server is completely consumed CPU-wise, it won't have the cycles to service your ssh request.

If it's completely consumed memory-wise, it won't be able to fork a new `sshd` process for you.

I find there are quite often instances where ssh doesn't work, and it's due to resource over-utilization.

That said, repeatedly taking the sledgehammer approach of rebooting without figuring out the root cause seems unwise and short-sighted.
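
That said, if you want to try getting in before reaching for the power button, being patient and avoiding an interactive shell improves the odds. A minimal sketch, with the host name and timeout values adjusted to taste:

```
# A single lightweight, non-interactive command needs far fewer
# resources than a login shell plus an interactive top session.
ssh -o ConnectTimeout=300 -o ConnectionAttempts=5 root@starved-host \
    'uptime; free -m; ps aux --sort=-%mem | head -n 10'
```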

steve
  • Is there anything else a person could do? I believe there is still some memory left on the machine when it goes casters-up, so running a command like `ssh user@slowhost top -n 1` to at least see what is gumming stuff up should have a good chance of returning some info, right? – Peter Turner Aug 17 '15 at 16:32
  • ... and in the olden days the Linux "out-of-memory" killer would target `sshd` processes. Fun! They've thankfully since fixed that minor oversight. Otherwise, a busy system can usually eventually be logged into; you'll need patience and perhaps increased connect timeouts for `ssh`. – thrig Aug 17 '15 at 16:34
  • If this were my server, I'd think about generating a crash dump to get some diagnostics for review, and putting some proactive monitoring in place, recording stats to help hunt down the process with the *voracious* appetite for CPU/memory. – steve Aug 17 '15 at 16:39
5

This is really just a comment that is too long for the comment box.

The short answer to your question is:

Yes. Resource over-utilization can kill every function the server has, ssh included. Every process requires memory, and when the memory runs out, sad times.

Long answer

If you can't recover the machine while it is struggling, finding the root cause will be harder for you.

Next time the machine is going down, try to save it. Immediately make it stop doing what you already know it is doing; don't waste your precious seconds trying to run a diagnostic command first. If it is a web server, immediately kill all apache/nginx/lighttpd processes. If it runs email, immediately kill all email processes. If it is a database server, DO NOT kill the DB processes outright, but immediately give the stop command (and if all the DB requests come through web sites or some app, just kill the web server or whatever service serves the app).
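
As a sketch only (the exact service and process names depend on your stack, and on whether the box uses systemd):

```
# Web server: kill it outright, it can be restarted cleanly later.
pkill -9 -x nginx              # or: pkill -9 -f apache2 / httpd / lighttpd

# Email: same idea.
systemctl stop postfix         # or kill the MTA processes directly

# Database: ask it to stop, do NOT kill -9 it.
systemctl stop mariadb         # or: mysqladmin shutdown / pg_ctl stop -m fast
```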

You need to shut down whatever is spawning more and more processes on your server, to stop it from blindly trying to answer every request and dying from lack of memory.

Once it is somewhat under control, and assuming you can't find anything with the diagnostics, your only hope is the logs. If it is a web/email/DB server, check your logs for things like the number of IPs making requests within a certain time frame, and compare the times when the server fails to the times when it runs smoothly. Check the sort of web or email requests that came through just prior to and during the resource problem. Check the number of DB queries writing to your disk; disk I/O issues can easily back things up to the point of killing your server. You are likely to find problems with long-running/disk-writing DB queries and/or abusive email/web users this way.
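
For instance, with a common combined-format web access log (the path here is just an example):

```
# Top 20 client IPs by request count.
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# Requests per minute, to line failure times up with traffic spikes.
awk '{print substr($4, 2, 17)}' /var/log/nginx/access.log | uniq -c | tail -60
```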

Further, once the services are off and you are grasping for clues, check the process list for any existing processes that are running as system users which should not be. For example, if you shut down apache and it runs as 'nobody', look and see whether any other script is still being run by 'nobody'. Sometimes you can find malicious shells and things uploaded to /tmp this way.
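
For example, assuming the web server ran as 'nobody' (substitute whatever accounts your services actually use):

```
# After apache is stopped, nothing should still be running as 'nobody'.
ps -u nobody -o pid,ppid,lstart,args

# Anything lurking in the usual world-writable spots deserves a close look.
ls -la /tmp /var/tmp /dev/shm
```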

Use top to find anything eating a lot of memory, and if you are not sure what a process is, investigate it. Use commands like lsof and other system tools to see what directory that process is running from, or anything else that can give you a clue that a process is illegitimate.
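
A rough sketch of that (the PID is whatever top points you at):

```
# Snapshot of the biggest memory consumers, non-interactively
# (the -o sort option needs a reasonably recent procps top).
top -b -o %MEM -n 1 | head -20

# Where is that process running from, and what does it have open?
lsof -p 12345 | head -40
ls -l /proc/12345/cwd /proc/12345/exe
```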

Odds are you can find something like this. If you can't because the logging sucks, then at least turn up/enable the logging and you will have more data if it happens again. If it is a file server (ftp, scp, etc.), enable logging so you can see when files are being uploaded/downloaded. Are people on your network doing massive uploads/downloads at the same time?

These are just the tip of the iceberg; there is a lot you can do, but treat it like an investigation: you need a clue to work from.

Jeff Schaller
Baazigar