Step-by-step tutorial: adding Prometheus GPU monitoring to a k8s cluster
Reference:
https://github.com/NVIDIA/gpu-monitoring-tools
I. k8s environment
1. View the k8s cluster
kubectl get nodes -o wide
2. Find the nodes in the k8s cluster that have GPUs
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
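On a larger cluster the table printed by the command above also lists nodes without a GPU, which show `<none>` in the GPU column. A minimal sketch of a filter that keeps only the GPU nodes follows; the function name `gpu_nodes` is hypothetical, and it assumes the two-column NAME/GPU output shown above.

```shell
# gpu_nodes: read the NAME/GPU table on stdin and print the names of the
# nodes whose GPU column is a positive integer (hypothetical helper).
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Usage (assumes kubectl is already configured for the cluster):
# kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" | gpu_nodes
```

The header row is skipped with `NR > 1`, so the filter only makes sense on the unmodified table output.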
3. Label the GPU nodes and check the labels
kubectl label nodes node0 hardware-type=NVIDIAGPU
kubectl label nodes node2 hardware-type=NVIDIAGPU
kubectl get nodes --show-labels
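With more than a couple of GPU nodes, labelling them one by one gets tedious. The sketch below applies the same label to every node name read from stdin. The function name `label_gpu_nodes` and the `KUBECTL` override variable are assumptions added for illustration; only `kubectl label` and its `--overwrite` flag come from kubectl itself.

```shell
# label_gpu_nodes: read node names on stdin (one per line) and apply the
# hardware-type=NVIDIAGPU label to each of them.
# KUBECTL is a hypothetical override so the commands can be previewed first,
# for example KUBECTL="echo kubectl".
label_gpu_nodes() {
  while IFS= read -r node; do
    [ -n "$node" ] || continue
    # --overwrite lets the command be re-run on nodes that already carry the label
    ${KUBECTL:-kubectl} label nodes "$node" hardware-type=NVIDIAGPU --overwrite </dev/null
  done
}

# Preview the commands without touching the cluster:
# printf 'node0\nnode2\n' | KUBECTL="echo kubectl" label_gpu_nodes
```

Afterwards `kubectl get nodes -l hardware-type=NVIDIAGPU` lists only the labelled nodes, which is easier to read than the full `--show-labels` output.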
II. Add GPU monitoring
1. Download the code
git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
2. Build the images (run each cd from the directory that contains the clone)
cd gpu-monitoring-tools/exporters/prometheus-dcgm/k8s/pod-gpu-metrics-exporter/
sudo docker build -t pod-gpu-metrics-exporter .
cd gpu-monitoring-tools/exporters/prometheus-dcgm/dcgm-exporter/
sudo docker build -t dcgm-exporter .
3. Push the images to a private Docker registry
The steps are omitted here.
See https://blog.csdn.net/qq_29940863/article/details/89848075
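For readers who do not want to follow the link, pushing usually comes down to tagging the local image with the registry address and pushing that tag. This is a sketch under assumptions, not the procedure from the linked post: `push_image`, the `DOCKER` override, and the registry address `registry.example.com:5000` are all placeholders.

```shell
# push_image REGISTRY IMAGE: tag a locally built image for a private registry
# and push it. REGISTRY is a placeholder such as registry.example.com:5000.
# DOCKER is a hypothetical override, e.g. DOCKER="sudo docker" or
# DOCKER="echo docker" to preview the commands.
push_image() {
  registry="$1"
  image="$2"
  ${DOCKER:-docker} tag "$image" "$registry/$image" &&
    ${DOCKER:-docker} push "$registry/$image"
}

# Usage with the two images built in the previous step:
# DOCKER="sudo docker" push_image registry.example.com:5000 pod-gpu-metrics-exporter
# DOCKER="sudo docker" push_image registry.example.com:5000 dcgm-exporter
```

A registry that requires authentication also needs a `docker login` beforehand.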
4. Change the image addresses in the startup file (again starting from the directory that contains the clone)
cd gpu-monitoring-tools/exporters/prometheus-dcgm/k8s/pod-gpu-metrics-exporter/
vim pod-gpu-metrics-exporter-daemonset.yaml
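Point the image addresses at the images in your private registry (the original post shows the change in a screenshot).
5. Start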
kubectl apply -f pod-gpu-metrics-exporter-daemonset.yaml
III. Verify
1. Check the cluster
kubectl get ds -n monitoring
kubectl get po -n monitoring
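Note: the monitoring namespace, and the DaemonSets and Pods without a red box in the original screenshot, were created earlier when Prometheus monitoring was added to the k8s cluster.
Reference: https://github.com/coreos/kube-prometheus#quickstart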
https://www.jianshu.com/p/2fbbe767870d
2. Check the URL
Open http://IP:9400/gpu/metrics
Note: IP is the address of node0 or node2.
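The page is plain Prometheus exposition text, so it can also be checked from a terminal. The sketch below reduces the output to the distinct metric names, which is a quick way to confirm the exporter is reporting GPU data. The helper name `gpu_metric_names` is hypothetical; the metric names you see depend on the exporter version.

```shell
# gpu_metric_names: read Prometheus exposition text on stdin and print the
# distinct metric names (hypothetical helper).
gpu_metric_names() {
  # Skip blank lines and "# HELP" / "# TYPE" comments, then cut each sample
  # line at the first "{" or space so only the metric name remains.
  awk '!/^#/ && NF { sub(/[{ ].*/, "", $0); print }' | sort -u
}

# Usage (replace IP with the address of node0 or node2):
# curl -s http://IP:9400/gpu/metrics | gpu_metric_names
```

An empty result means the endpoint answered but exposed no samples, which usually points at the exporter pods rather than at Prometheus.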
IV. Integrate with an external Prometheus
1. Write the configuration file
vim prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: prometheus
  - job_name: prom_gpu
    metrics_path: /gpu/metrics
    static_configs:
      - targets: ['IP:9400', 'IP:9400']
        labels:
          instance: gpu
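The prom_gpu job lists every GPU node by hand. As a sketch of how the file could be generated from a list of node addresses instead, the hypothetical function `gen_prometheus_yml` below prints an equivalent configuration. It deliberately leaves out the `labels` blocks, which is a departure from the file above: without an explicit `instance` label, Prometheus sets `instance` to each target's address, so the two GPU nodes stay distinguishable in queries.

```shell
# gen_prometheus_yml IP...: print a prometheus.yml that scrapes Prometheus
# itself plus /gpu/metrics on port 9400 of every address given as argument.
# Hypothetical helper; the addresses are supplied by the caller.
gen_prometheus_yml() {
  targets=""
  for ip in "$@"; do
    # Build a comma-separated list of quoted 'IP:9400' entries
    targets="${targets:+$targets, }'$ip:9400'"
  done
  cat <<EOF
global:
  scrape_interval: 60s
  evaluation_interval: 60s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: prom_gpu
    metrics_path: /gpu/metrics
    static_configs:
      - targets: [$targets]
EOF
}

# Usage (10.0.0.1 and 10.0.0.2 stand in for the node0 and node2 addresses):
# gen_prometheus_yml 10.0.0.1 10.0.0.2 > prometheus.yml
```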
2. Start the container (a bind mount needs an absolute host path, hence $(pwd))
sudo docker run --name prometheus -d -p 9090:9090 -v "$(pwd)/prometheus.yml":/etc/prometheus/prometheus.yml quay.io/prometheus/prometheus
3. Access
http://IP:9090
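Besides the web UI, the Prometheus HTTP API exposes scrape status at `/api/v1/targets`. The sketch below counts targets per health state from that JSON using only grep, sed, and awk. The helper name `target_health` is hypothetical, and the text matching is an assumption that holds for simple responses; a JSON tool such as jq would be more robust.

```shell
# target_health: read the JSON from /api/v1/targets on stdin and print one
# line per health state with the number of targets in it, e.g. "up 3".
target_health() {
  grep -o '"health"[[:space:]]*:[[:space:]]*"[a-z]*"' |
    sed 's/.*"\([a-z]*\)"$/\1/' |
    awk '{ n[$0]++ } END { for (h in n) print h, n[h] }' |
    sort
}

# Usage (replace IP with the address of the host running the container):
# curl -s http://IP:9090/api/v1/targets | target_health
```

If the prom_gpu targets show up as down, check that port 9400 on the GPU nodes is reachable from the Prometheus host.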
V. Miscellaneous
1. Node environment
The GPU nodes may need docker-compose installed.
Reference: https://docs.docker.com/compose/install/
2. Download
https://download.csdn.net/download/qq_29940863/12588524