I have 10 servers running on Ubuntu 14.04 x64. Each server has a few Nvidia GPUs. I am looking for a monitoring program that would allow me to view the GPU usage on all servers at a glance.
4 Answers
You can use the Ganglia monitoring software (free of charge, open source). It has a number of user-contributed Gmond Python DSO metric modules, including an Nvidia GPU module (/ganglia/gmond_python_modules/gpu/nvidia/).
Its architecture is typical for a cluster monitoring software:
It's straightforward to install (~30 minutes without rushing), except for the Nvidia GPU module, which lacks clear documentation. (I am still stuck on that part.)
To install Ganglia, proceed as follows. On the server:
sudo apt-get install -y ganglia-monitor rrdtool gmetad ganglia-webfrontend
Choose Yes each time you are asked a question about Apache.
First, we configure the Ganglia server, i.e. gmetad:
sudo cp /etc/ganglia-webfrontend/apache.conf /etc/apache2/sites-enabled/ganglia.conf
sudo nano /etc/ganglia/gmetad.conf
In gmetad.conf, make the following changes:
Replace:
data_source "my cluster" localhost
by (assuming that 192.168.10.22 is the IP of the server)
data_source "my cluster" 50 192.168.10.22:8649
Here 50 is the polling interval in seconds, and 8649 is the default Ganglia port. Make sure that this IP and port are reachable from the machines you plan to monitor, since the Ganglia clients running there will send their metrics to it.
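For reference, gmetad.conf accepts one data_source line per cluster, with an optional polling interval in seconds followed by one or more host:port pairs (additional hosts act as failover sources). A hedged example with two clusters (the names and IPs below are placeholders):

```
# Poll "my cluster" every 50 seconds; the second host is a failover source:
data_source "my cluster" 50 192.168.10.22:8649 192.168.10.23:8649
# A second cluster polled at the default interval:
data_source "gpu cluster" 192.168.10.30:8649
```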
You can now launch the Ganglia server:
sudo /etc/init.d/gmetad restart
sudo /etc/init.d/apache2 restart
You can access the web interface on http://192.168.10.22/ganglia/ (where 192.168.10.22 is the IP of the server)
Second, we configure the Ganglia client (i.e. gmond), either on the same machine or another machine.
sudo apt-get install -y ganglia-monitor
sudo nano /etc/ganglia/gmond.conf
In gmond.conf, make the following changes so that the Ganglia client, i.e. gmond, points to the server:
Replace:
cluster {
name = "unspecified"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
to
cluster {
name = "my cluster"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
Replace
udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
ttl = 1
}
by
udp_send_channel {
# mcast_join = 239.2.11.71
host = 192.168.10.22
port = 8649
ttl = 1
}
Replace:
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
to
udp_recv_channel {
# mcast_join = 239.2.11.71
port = 8649
# bind = 239.2.11.71
}
You can now start the Ganglia client:
sudo /etc/init.d/ganglia-monitor restart
It should appear within 30 seconds in the Ganglia web interface on the server (i.e., http://192.168.10.22/ganglia/).
Since the gmond.conf file is the same for all clients, you can add Ganglia monitoring to a new machine in a few seconds:
sudo apt-get install -y ganglia-monitor
wget http://somewebsite/gmond.conf # this gmond.conf is configured so that it points to the right ganglia server, as described above
sudo cp -f gmond.conf /etc/ganglia/gmond.conf
sudo /etc/init.d/ganglia-monitor restart
I used the following guides:
- http://www.ubuntugeek.com/install-ganglia-on-ubuntu-14-04-server-trusty-tahr.html
- https://www.digitalocean.com/community/tutorials/introduction-to-ganglia-on-ubuntu-14-04
A bash script to start or restart gmond on all the servers you want to monitor:
deploy.sh:
#!/usr/bin/env bash
# Some useful resources:
# while read ip user pass; do : http://unix.stackexchange.com/questions/92664/how-to-deploy-programs-on-multiple-machines
# -o StrictHostKeyChecking=no: http://askubuntu.com/questions/180860/regarding-host-key-verification-failed
# -T: http://stackoverflow.com/questions/21659637/how-to-fix-sudo-no-tty-present-and-no-askpass-program-specified-error
# echo $pass |: http://stackoverflow.com/questions/11955298/use-sudo-with-password-as-parameter
# http://stackoverflow.com/questions/36805184/why-is-this-while-loop-not-looping
while read ip user pass <&3; do
    echo "$ip"
    sshpass -p "$pass" ssh "$user@$ip" -o StrictHostKeyChecking=no -T "
    echo $pass | sudo -S /etc/init.d/ganglia-monitor restart
    "
    echo 'done'
done 3<servers.txt
servers.txt:
53.12.45.74 my_username my_password
54.12.45.74 my_username my_password
57.12.45.74 my_username my_password
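The `<&3` / `3<servers.txt` pair feeds the server list through file descriptor 3 so that `ssh` cannot swallow the remaining lines from stdin (the linked "why is this while loop not looping" question explains the pitfall). You can check the looping logic without any remote machines by swapping the `ssh` call for a local `echo`:

```shell
#!/usr/bin/env bash
# Throwaway server list with placeholder credentials.
printf '%s\n' \
  '53.12.45.74 my_username my_password' \
  '54.12.45.74 my_username my_password' > /tmp/servers_test.txt

# Same loop shape as deploy.sh, with the ssh call stubbed out.
while read ip user pass <&3; do
    echo "would restart gmond on $ip as $user"
done 3</tmp/servers_test.txt
```

Each input line produces exactly one iteration; with real credentials, the `echo` is replaced by the `sshpass`/`ssh` call from deploy.sh.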
Chapter 4 of Monitoring with Ganglia (https://www.safaribooksonline.com/library/view/monitoring-with-ganglia/9781449330637/ch04.html) gives a nice overview of the Ganglia web interface, with screenshots of the main page.
Munin has at least one plugin for monitoring Nvidia GPUs (it uses the nvidia-smi utility to gather its data).
You could set up a Munin server (perhaps on one of the GPU servers, or on the head node of your cluster), and then install the munin-node client and the Nvidia plugin (plus whatever other plugins you might be interested in) on each of your GPU servers.
That would allow you to look in detail at the Munin data for each server, or see an overview of the Nvidia data for all servers. This would include graphs charting changes in, e.g., GPU temperature over time.
Otherwise, you could write a script that uses ssh (or pdsh) to run the nvidia-smi utility on each server, extracts the data you want, and presents it in whatever format you want.
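For the roll-your-own route, most of the work is parsing the CSV that `nvidia-smi --query-gpu=... --format=csv,noheader` prints. A minimal sketch of that parsing step; the `run_remote` helper is a placeholder for the ssh call, and the field list is one possible query:

```python
import csv


def parse_nvidia_smi_csv(text):
    """Parse the output of
    'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader'
    into a list of per-GPU dicts."""
    gpus = []
    for row in csv.reader(text.splitlines()):
        if not row:
            continue
        util, used, total = [field.strip() for field in row]
        gpus.append({
            'util_pct': int(util.rstrip(' %')),
            'mem_used_mib': int(used.rstrip(' MiB')),
            'mem_total_mib': int(total.rstrip(' MiB')),
        })
    return gpus


def run_remote(host):
    """Placeholder: in practice, run the nvidia-smi query over ssh, e.g. with
    subprocess.check_output(['ssh', host, 'nvidia-smi', ...])."""
    raise NotImplementedError


if __name__ == '__main__':
    # Sample output from one server, as nvidia-smi would print it:
    sample = "98 %, 11689 MiB, 12189 MiB\n3 %, 3857 MiB, 12189 MiB\n"
    for i, gpu in enumerate(parse_nvidia_smi_csv(sample)):
        print('GPU {0}: {1}% util, {2}/{3} MiB'.format(
            i, gpu['util_pct'], gpu['mem_used_mib'], gpu['mem_total_mib']))
```

Looping this over a host list and tabulating the results gives a basic at-a-glance view without installing any monitoring stack.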
As cas said, I could write my own tool, so here it is (not polished at all, but it works):
Client side (i.e., the GPU node)
gpu_monitoring.sh (assumes that the IP of the server that serves the monitoring webpage is 128.52.200.39)
while true;
do
    nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv >> gpu_utilization.log
    python gpu_monitoring.py
    sshpass -p 'my_password' scp -o StrictHostKeyChecking=no ./gpu_utilization_100.png my_username@128.52.200.39:/var/www/html/gpu_utilization_100_server1.png
    sshpass -p 'my_password' scp -o StrictHostKeyChecking=no ./gpu_utilization_10000.png my_username@128.52.200.39:/var/www/html/gpu_utilization_10000_server1.png
    sleep 10
done
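One caveat: gpu_utilization.log grows without bound, since every iteration appends to it while the plotting script only ever reads the last `max_history_size` entries. A hedged trimming helper that could run inside the same loop (the line cap and file names are arbitrary), demonstrated on a throwaway file:

```shell
#!/usr/bin/env bash
# Keep only the most recent N lines of a log file.
trim_log() {
    local logfile=$1 max_lines=$2
    tail -n "$max_lines" "$logfile" > "$logfile.tmp" && mv "$logfile.tmp" "$logfile"
}

# Demo: write 10 lines, keep only the last 3.
seq 1 10 > /tmp/gpu_utilization_demo.log
trim_log /tmp/gpu_utilization_demo.log 3
cat /tmp/gpu_utilization_demo.log   # prints 8, 9, 10
```

In the monitoring loop, something like `trim_log gpu_utilization.log 50000` after the nvidia-smi call would keep enough history for the 10000-point plot with margin for the header lines.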
gpu_monitoring.py:
'''
Monitor GPU use
'''
from __future__ import print_function
from __future__ import division
import numpy as np
import matplotlib
matplotlib.use('Agg')  # http://stackoverflow.com/questions/2801882/generating-a-png-with-matplotlib-when-display-is-undefined
import matplotlib.pyplot as plt
import time
import datetime


def get_current_milliseconds():
    '''
    http://stackoverflow.com/questions/5998245/get-current-time-in-milliseconds-in-python
    '''
    return int(round(time.time() * 1000))


def get_current_time_in_seconds():
    '''
    http://stackoverflow.com/questions/415511/how-to-get-current-time-in-python
    '''
    return time.strftime("%Y-%m-%d_%H-%M-%S", time.gmtime())


def get_current_time_in_milliseconds():
    '''
    http://stackoverflow.com/questions/5998245/get-current-time-in-milliseconds-in-python
    '''
    return get_current_time_in_seconds() + '-' + str(datetime.datetime.now().microsecond)


def generate_plot(gpu_log_filepath, max_history_size, graph_filepath):
    '''
    Parse the nvidia-smi log and plot GPU utilization over time.
    '''
    # Get data. Read the log in reverse so that we keep only the most recent entries:
    # http://stackoverflow.com/questions/2301789/read-a-file-in-reverse-order-using-python
    history_size = 0
    number_of_gpus = -1
    gpu_utilization = []
    gpu_utilization_one_timestep = []
    for line_number, line in enumerate(reversed(open(gpu_log_filepath).readlines())):
        if history_size > max_history_size:
            break
        line = line.split(',')
        if line[0].startswith('util') or len(gpu_utilization_one_timestep) == number_of_gpus:
            if number_of_gpus == -1 and len(gpu_utilization_one_timestep) > 0:
                number_of_gpus = len(gpu_utilization_one_timestep)
            if len(gpu_utilization_one_timestep) == number_of_gpus:
                # Reversed because we read the log file from bottom to top, so the GPU order is reversed.
                gpu_utilization.append(list(reversed(gpu_utilization_one_timestep)))
                history_size += 1
            gpu_utilization_one_timestep = []
        if line[0].startswith('util'):
            continue
        try:
            current_gpu_utilization = int(line[0].strip().replace(' %', ''))
        except ValueError:
            print('line: {0}'.format(line))
            print('line_number: {0}'.format(line_number))
            raise
        gpu_utilization_one_timestep.append(current_gpu_utilization)

    # Plot graph. We read the log backward, i.e. in ante-chronological order,
    # so we reverse again to get the chronological order.
    gpu_utilization = np.array(list(reversed(gpu_utilization)))
    fig = plt.figure(1)
    ax = fig.add_subplot(111)
    ax.plot(range(gpu_utilization.shape[0]), gpu_utilization)
    ax.set_title('GPU utilization over time ({0})'.format(get_current_time_in_milliseconds()))
    ax.set_xlabel('Time')
    ax.set_ylabel('GPU utilization (%)')
    gpu_utilization_mean_per_gpu = np.mean(gpu_utilization, axis=0)
    lgd = ax.legend(['GPU {0} (avg {1})'.format(gpu_number, np.round(gpu_utilization_mean, 1))
                     for gpu_number, gpu_utilization_mean
                     in zip(range(gpu_utilization.shape[1]), gpu_utilization_mean_per_gpu)],
                    loc='center right', bbox_to_anchor=(1.45, 0.5))
    plt.savefig(graph_filepath, dpi=300, format='png', bbox_inches='tight')
    plt.close()


def main():
    '''
    Generate one plot per history window size.
    '''
    # Parameters
    gpu_log_filepath = 'gpu_utilization.log'
    max_history_sizes = [100, 10000]
    for max_history_size in max_history_sizes:
        graph_filepath = 'gpu_utilization_{0}.png'.format(max_history_size)
        generate_plot(gpu_log_filepath, max_history_size, graph_filepath)


if __name__ == "__main__":
    main()
    # cProfile.run('main()')  # if you want to do some profiling
Server-side (i.e., the Web server)
gpu.html:
<!DOCTYPE html>
<html>
<body>
<h2>gpu_utilization_server1.png</h2>
<img src="gpu_utilization_100_server1.png" alt="GPU utilization (last 100 points)" style="height:508px;"><img src="gpu_utilization_10000_server1.png" alt="GPU utilization (last 10000 points)" style="height:508px;">
</body>
</html>
Or simply use
https://github.com/PatWie/cluster-smi
which behaves just like nvidia-smi in the terminal, but gathers the information from all the nodes in your cluster that are running cluster-smi-node. The output will be
+---------+------------------------+---------------------+----------+----------+
| Node | Gpu | Memory-Usage | Mem-Util | GPU-Util |
+---------+------------------------+---------------------+----------+----------+
| node-00 | 0: TITAN Xp | 3857MiB / 12189MiB | 31% | 0% |
| | 1: TITAN Xp | 11689MiB / 12189MiB | 95% | 0% |
| | 2: TITAN Xp | 10787MiB / 12189MiB | 88% | 0% |
| | 3: TITAN Xp | 10965MiB / 12189MiB | 89% | 100% |
+---------+------------------------+---------------------+----------+----------+
| node-01 | 0: TITAN Xp | 11667MiB / 12189MiB | 95% | 100% |
| | 1: TITAN Xp | 11667MiB / 12189MiB | 95% | 96% |
| | 2: TITAN Xp | 8497MiB / 12189MiB | 69% | 100% |
| | 3: TITAN Xp | 8499MiB / 12189MiB | 69% | 98% |
+---------+------------------------+---------------------+----------+----------+
| node-02 | 0: GeForce GTX 1080 Ti | 1447MiB / 11172MiB | 12% | 8% |
| | 1: GeForce GTX 1080 Ti | 1453MiB / 11172MiB | 13% | 99% |
| | 2: GeForce GTX 1080 Ti | 1673MiB / 11172MiB | 14% | 0% |
| | 3: GeForce GTX 1080 Ti | 6812MiB / 11172MiB | 60% | 36% |
+---------+------------------------+---------------------+----------+----------+
when using 3 nodes.
It uses NVML to read these values directly, for efficiency. I suggest not parsing the output of nvidia-smi as proposed in the other answers.
Furthermore, you can consume this information from cluster-smi in your own code using Python + ZMQ.
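If you want the same NVML route in your own scripts, the pynvml bindings (the nvidia-ml-py package) expose these counters directly. A sketch, assuming a machine with the NVIDIA driver and pynvml installed; the table layout loosely imitates cluster-smi and the import is kept inside the function so the formatting helper stays usable on its own:

```python
def format_gpu_row(index, name, mem_used_mib, mem_total_mib, util_pct):
    """Render one GPU as a cluster-smi-style table row."""
    return '| {0}: {1:<22} | {2:>7}MiB / {3:>7}MiB | {4:>3}% |'.format(
        index, name, mem_used_mib, mem_total_mib, util_pct)


def print_local_gpus():
    """Query every local GPU through NVML and print one row each."""
    import pynvml  # pip install nvidia-ml-py
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)      # bytes
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            print(format_gpu_row(i, name, mem.used // 2**20,
                                 mem.total // 2**20, util.gpu))
    finally:
        pynvml.nvmlShutdown()


if __name__ == '__main__':
    try:
        print_local_gpus()
    except Exception as e:  # no driver / no pynvml on this machine
        print('NVML unavailable: {0}'.format(e))
```

Running this over ssh on each node (or behind a small ZMQ publisher, as cluster-smi does) avoids spawning and parsing nvidia-smi entirely.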