I have 10 servers running on Ubuntu 14.04 x64. Each server has a few Nvidia GPUs. I am looking for a monitoring program that would allow me to view the GPU usage on all servers at a glance.
4 Answers
You can use the Ganglia monitoring software (free of charge, open source). It has a number of user-contributed Gmond Python DSO metric modules, including an Nvidia GPU module (/ganglia/gmond_python_modules/gpu/nvidia/).
Its architecture is typical for a cluster monitoring software:
It's straightforward to install (~30 minutes without rushing), except for the Nvidia GPU module, which lacks clear documentation. (I am still stuck on that part.)
To install Ganglia, proceed as follows. On the server:
sudo apt-get install -y ganglia-monitor rrdtool gmetad ganglia-webfrontend
Choose Yes each time you are asked a question about Apache.
First, we configure the Ganglia server, i.e. gmetad:
sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf
sudo nano /etc/ganglia/gmetad.conf
In gmetad.conf, make the following changes:
Replace:
data_source "my cluster" localhost
by (assuming that 192.168.10.22 is the IP of the server)
data_source "my cluster" 50 192.168.10.22:8649
Here 50 is the polling interval in seconds, and 8649 is the default Ganglia port. Make sure that this IP and port are reachable from the machines you plan to monitor, since the Ganglia clients running there will send their metrics to it.
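For reference, gmetad.conf accepts one data_source line per cluster, with an optional polling interval in seconds followed by one or more host:port pairs (additional hosts act as failover sources). A hedged example with two clusters (the names and IPs below are placeholders):

```
# Poll "my cluster" every 50 seconds; the second host is a failover source:
data_source "my cluster" 50 192.168.10.22:8649 192.168.10.23:8649
# A second cluster polled at the default interval:
data_source "gpu cluster" 192.168.10.30:8649
```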
You can now launch the Ganglia server:
sudo /etc/init.d/gmetad restart
sudo /etc/init.d/apache2 restart
You can access the web interface on http://192.168.10.22/ganglia/ (where 192.168.10.22 is the IP of the server)
Second, we configure the Ganglia client (i.e. gmond), either on the same machine or another machine.
sudo apt-get install -y ganglia-monitor
sudo nano /etc/ganglia/gmond.conf
In gmond.conf, make the following changes so that the Ganglia client, i.e. gmond, points to the server:
Replace:
cluster {
name = "unspecified"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
to
cluster {
name = "my cluster"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
Replace
udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
ttl = 1
}
by
udp_send_channel {
# mcast_join = 239.2.11.71
host = 192.168.10.22
port = 8649
ttl = 1
}
Replace:
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
to
udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8649
# bind = 239.2.11.71
}
You can now start the Ganglia client:
sudo /etc/init.d/ganglia-monitor restart
It should appear within 30 seconds in the Ganglia web interface on the server (i.e., http://192.168.10.22/ganglia/).
Since the gmond.conf file is the same for all clients, you can add Ganglia monitoring to a new machine in a few seconds:
sudo apt-get install -y ganglia-monitor
wget http://somewebsite/gmond.conf # this gmond.conf is configured so that it points to the right ganglia server, as described above
sudo cp -f gmond.conf /etc/ganglia/gmond.conf
sudo /etc/init.d/ganglia-monitor restart
I used the following guides:
- http://www.ubuntugeek.com/install-ganglia-on-ubuntu-14-04-server-trusty-tahr.html
- https://www.digitalocean.com/community/tutorials/introduction-to-ganglia-on-ubuntu-14-04
A bash script to start or restart gmond on all the servers you want to monitor:
deploy.sh:
#!/usr/bin/env bash
# Some useful resources:
# while read ip user pass; do : http://unix.stackexchange.com/questions/92664/how-to-deploy-programs-on-multiple-machines
# -o StrictHostKeyChecking=no: http://askubuntu.com/questions/180860/regarding-host-key-verification-failed
# -T: http://stackoverflow.com/questions/21659637/how-to-fix-sudo-no-tty-present-and-no-askpass-program-specified-error
# echo $pass |: http://stackoverflow.com/questions/11955298/use-sudo-with-password-as-parameter
# http://stackoverflow.com/questions/36805184/why-is-this-while-loop-not-looping
while read ip user pass <&3; do
    echo "$ip"
    sshpass -p "$pass" ssh "$user@$ip" -o StrictHostKeyChecking=no -T "
    echo $pass | sudo -S /etc/init.d/ganglia-monitor restart
    "
    echo 'done'
done 3<servers.txt
servers.txt:
53.12.45.74 my_username my_password
54.12.45.74 my_username my_password
57.12.45.74 my_username my_password
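The `<&3` / `3<servers.txt` pair feeds the server list through file descriptor 3 so that `ssh` cannot swallow the remaining lines from stdin (the linked "why is this while loop not looping" question explains the pitfall). You can check the looping logic without any remote machines by swapping the `ssh` call for a local `echo`:

```shell
#!/usr/bin/env bash
# Throwaway server list with placeholder credentials.
printf '%s\n' \
  '53.12.45.74 my_username my_password' \
  '54.12.45.74 my_username my_password' > /tmp/servers_test.txt

# Same loop shape as deploy.sh, with the ssh call stubbed out.
while read ip user pass <&3; do
    echo "would restart gmond on $ip as $user"
done 3</tmp/servers_test.txt
```

Each input line produces exactly one iteration; with real credentials, the `echo` is replaced by the `sshpass`/`ssh` call from deploy.sh.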
Chapter 4 of Monitoring with Ganglia (https://www.safaribooksonline.com/library/view/monitoring-with-ganglia/9781449330637/ch04.html) gives a nice overview of the Ganglia web interface, with screenshots of the main page.
Munin has at least one plugin for monitoring Nvidia GPUs (it uses the nvidia-smi utility to gather its data).
You could set up a Munin server (perhaps on one of the GPU servers, or on the head node of your cluster), and then install the munin-node client and the Nvidia plugin (plus whatever other plugins you might be interested in) on each of your GPU servers.
That would allow you to look in detail at the Munin data for each server, or see an overview of the Nvidia data for all servers. This would include graphs charting changes in, e.g., GPU temperature over time.
Otherwise, you could write a script that uses ssh (or pdsh) to run the nvidia-smi utility on each server, extracts the data you want, and presents it in whatever format you want.
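For the roll-your-own route, most of the work is parsing the CSV that `nvidia-smi --query-gpu=... --format=csv,noheader` prints. A minimal sketch of that parsing step; the `run_remote` helper is a placeholder for the ssh call, and the field list is one possible query:

```python
import csv


def parse_nvidia_smi_csv(text):
    """Parse the output of
    'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader'
    into a list of per-GPU dicts."""
    gpus = []
    for row in csv.reader(text.splitlines()):
        if not row:
            continue
        util, used, total = [field.strip() for field in row]
        gpus.append({
            'util_pct': int(util.rstrip(' %')),
            'mem_used_mib': int(used.rstrip(' MiB')),
            'mem_total_mib': int(total.rstrip(' MiB')),
        })
    return gpus


def run_remote(host):
    """Placeholder: in practice, run the nvidia-smi query over ssh, e.g. with
    subprocess.check_output(['ssh', host, 'nvidia-smi', ...])."""
    raise NotImplementedError


if __name__ == '__main__':
    # Sample output from one server, as nvidia-smi would print it:
    sample = "98 %, 11689 MiB, 12189 MiB\n3 %, 3857 MiB, 12189 MiB\n"
    for i, gpu in enumerate(parse_nvidia_smi_csv(sample)):
        print('GPU {0}: {1}% util, {2}/{3} MiB'.format(
            i, gpu['util_pct'], gpu['mem_used_mib'], gpu['mem_total_mib']))
```

Looping this over a host list and tabulating the results gives a basic at-a-glance view without installing any monitoring stack.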
As cas said, I could write my own tool, so here it is (not polished at all, but it works):
Client side (i.e., the GPU node)
gpu_monitoring.sh (assumes that the IP of the server that serves the monitoring webpage is 128.52.200.39)
while true;
do
    nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv >> gpu_utilization.log
    python gpu_monitoring.py
    sshpass -p 'my_password' scp -o StrictHostKeyChecking=no ./gpu_utilization_100.png my_username@128.52.200.39:/var/www/html/gpu_utilization_100_server1.png
    sshpass -p 'my_password' scp -o StrictHostKeyChecking=no ./gpu_utilization_10000.png my_username@128.52.200.39:/var/www/html/gpu_utilization_10000_server1.png
    sleep 10
done
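One caveat: gpu_utilization.log grows without bound, since every iteration appends to it while the plotting script only ever reads the last `max_history_size` entries. A hedged trimming helper that could run inside the same loop (the line cap and file names are arbitrary), demonstrated on a throwaway file:

```shell
#!/usr/bin/env bash
# Keep only the most recent N lines of a log file.
trim_log() {
    local logfile=$1 max_lines=$2
    tail -n "$max_lines" "$logfile" > "$logfile.tmp" && mv "$logfile.tmp" "$logfile"
}

# Demo: write 10 lines, keep only the last 3.
seq 1 10 > /tmp/gpu_utilization_demo.log
trim_log /tmp/gpu_utilization_demo.log 3
cat /tmp/gpu_utilization_demo.log   # prints 8, 9, 10
```

In the monitoring loop, something like `trim_log gpu_utilization.log 50000` after the nvidia-smi call would keep enough history for the 10000-point plot with margin for the header lines.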
gpu_monitoring.py:
'''
Monitor GPU use
'''
from __future__ import print_function
from __future__ import division
import numpy as np
import matplotlib
matplotlib.use('Agg')  # http://stackoverflow.com/questions/2801882/generating-a-png-with-matplotlib-when-display-is-undefined
import matplotlib.pyplot as plt
import time
import datetime


def get_current_milliseconds():
    '''
    http://stackoverflow.com/questions/5998245/get-current-time-in-milliseconds-in-python
    '''
    return int(round(time.time() * 1000))


def get_current_time_in_seconds():
    '''
    http://stackoverflow.com/questions/415511/how-to-get-current-time-in-python
    '''
    return time.strftime("%Y-%m-%d_%H-%M-%S", time.gmtime())


def get_current_time_in_milliseconds():
    '''
    http://stackoverflow.com/questions/5998245/get-current-time-in-milliseconds-in-python
    '''
    return get_current_time_in_seconds() + '-' + str(datetime.datetime.now().microsecond)


def generate_plot(gpu_log_filepath, max_history_size, graph_filepath):
    '''
    Parse the nvidia-smi log and plot GPU utilization over time.
    '''
    # Get data. Read the log in reverse so that we keep only the most recent entries:
    # http://stackoverflow.com/questions/2301789/read-a-file-in-reverse-order-using-python
    history_size = 0
    number_of_gpus = -1
    gpu_utilization = []
    gpu_utilization_one_timestep = []
    for line_number, line in enumerate(reversed(open(gpu_log_filepath).readlines())):
        if history_size > max_history_size:
            break
        line = line.split(',')
        if line[0].startswith('util') or len(gpu_utilization_one_timestep) == number_of_gpus:
            if number_of_gpus == -1 and len(gpu_utilization_one_timestep) > 0:
                number_of_gpus = len(gpu_utilization_one_timestep)
            if len(gpu_utilization_one_timestep) == number_of_gpus:
                # Reversed because we read the log file from bottom to top, so the GPU order is reversed.
                gpu_utilization.append(list(reversed(gpu_utilization_one_timestep)))
                history_size += 1
            gpu_utilization_one_timestep = []
        if line[0].startswith('util'):
            continue
        try:
            current_gpu_utilization = int(line[0].strip().replace(' %', ''))
        except ValueError:
            print('line: {0}'.format(line))
            print('line_number: {0}'.format(line_number))
            raise
        gpu_utilization_one_timestep.append(current_gpu_utilization)

    # Plot graph. We read the log backward, i.e. in ante-chronological order,
    # so we reverse again to get the chronological order.
    gpu_utilization = np.array(list(reversed(gpu_utilization)))
    fig = plt.figure(1)
    ax = fig.add_subplot(111)
    ax.plot(range(gpu_utilization.shape[0]), gpu_utilization)
    ax.set_title('GPU utilization over time ({0})'.format(get_current_time_in_milliseconds()))
    ax.set_xlabel('Time')
    ax.set_ylabel('GPU utilization (%)')
    gpu_utilization_mean_per_gpu = np.mean(gpu_utilization, axis=0)
    lgd = ax.legend(['GPU {0} (avg {1})'.format(gpu_number, np.round(gpu_utilization_mean, 1))
                     for gpu_number, gpu_utilization_mean
                     in zip(range(gpu_utilization.shape[1]), gpu_utilization_mean_per_gpu)],
                    loc='center right', bbox_to_anchor=(1.45, 0.5))
    plt.savefig(graph_filepath, dpi=300, format='png', bbox_inches='tight')
    plt.close()


def main():
    '''
    Generate one plot per history window size.
    '''
    # Parameters
    gpu_log_filepath = 'gpu_utilization.log'
    max_history_sizes = [100, 10000]
    for max_history_size in max_history_sizes:
        graph_filepath = 'gpu_utilization_{0}.png'.format(max_history_size)
        generate_plot(gpu_log_filepath, max_history_size, graph_filepath)


if __name__ == "__main__":
    main()
    # cProfile.run('main()')  # if you want to do some profiling
Server-side (i.e., the Web server)
gpu.html:
<!DOCTYPE html>
<html>
<body>
<h2>gpu_utilization_server1.png</h2>
<img src="gpu_utilization_100_server1.png" alt="GPU utilization (last 100 points)" style="height:508px;"><img src="gpu_utilization_10000_server1.png" alt="GPU utilization (last 10000 points)" style="height:508px;">
</body>
</html>
Or simply use
https://github.com/PatWie/cluster-smi
which behaves just like nvidia-smi in the terminal, but gathers the information from all the nodes in your cluster that are running cluster-smi-node. The output will be
+---------+------------------------+---------------------+----------+----------+
| Node | Gpu | Memory-Usage | Mem-Util | GPU-Util |
+---------+------------------------+---------------------+----------+----------+
| node-00 | 0: TITAN Xp | 3857MiB / 12189MiB | 31% | 0% |
| | 1: TITAN Xp | 11689MiB / 12189MiB | 95% | 0% |
| | 2: TITAN Xp | 10787MiB / 12189MiB | 88% | 0% |
| | 3: TITAN Xp | 10965MiB / 12189MiB | 89% | 100% |
+---------+------------------------+---------------------+----------+----------+
| node-01 | 0: TITAN Xp | 11667MiB / 12189MiB | 95% | 100% |
| | 1: TITAN Xp | 11667MiB / 12189MiB | 95% | 96% |
| | 2: TITAN Xp | 8497MiB / 12189MiB | 69% | 100% |
| | 3: TITAN Xp | 8499MiB / 12189MiB | 69% | 98% |
+---------+------------------------+---------------------+----------+----------+
| node-02 | 0: GeForce GTX 1080 Ti | 1447MiB / 11172MiB | 12% | 8% |
| | 1: GeForce GTX 1080 Ti | 1453MiB / 11172MiB | 13% | 99% |
| | 2: GeForce GTX 1080 Ti | 1673MiB / 11172MiB | 14% | 0% |
| | 3: GeForce GTX 1080 Ti | 6812MiB / 11172MiB | 60% | 36% |
+---------+------------------------+---------------------+----------+----------+
when using 3 nodes.
It uses NVML to read these values directly, for efficiency. I suggest not parsing the output of nvidia-smi as proposed in the other answers.
Furthermore, you can consume this information from cluster-smi in your own code using Python + ZMQ.
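If you want the same NVML route in your own scripts, the pynvml bindings (the nvidia-ml-py package) expose these counters directly. A sketch, assuming a machine with the NVIDIA driver and pynvml installed; the table layout loosely imitates cluster-smi and the import is kept inside the function so the formatting helper stays usable on its own:

```python
def format_gpu_row(index, name, mem_used_mib, mem_total_mib, util_pct):
    """Render one GPU as a cluster-smi-style table row."""
    return '| {0}: {1:<22} | {2:>7}MiB / {3:>7}MiB | {4:>3}% |'.format(
        index, name, mem_used_mib, mem_total_mib, util_pct)


def print_local_gpus():
    """Query every local GPU through NVML and print one row each."""
    import pynvml  # pip install nvidia-ml-py
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)      # bytes
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(format_gpu_row(i, name, mem.used // 2**20,
                                 mem.total // 2**20, util.gpu))
    finally:
        pynvml.nvmlShutdown()


if __name__ == '__main__':
    try:
        print_local_gpus()
    except Exception as e:  # no driver / no pynvml on this machine
        print('NVML unavailable: {0}'.format(e))
```

Running this over ssh on each node (or behind a small ZMQ publisher, as cluster-smi does) avoids spawning and parsing nvidia-smi entirely.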