Deploying Thanos in Sidecar Mode for Multi-Cluster Prometheus Management
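Introduction to Thanos Sidecar mode
Thanos is an open-source, highly available Prometheus setup with long-term storage capabilities. It is a CNCF incubating project.
Thanos provides a global query view over metrics, unlimited metric retention, and high availability for its components.
Thanos can run in two modes, Sidecar and Receiver. In Sidecar mode, the Thanos sidecar uploads the TSDB blocks from Prometheus local storage to object storage every two hours.
Official website: https://thanos.io/
Project repository: https://github.com/thanos-io/thanos
Official documentation: https://thanos.io/tip/components/sidecar.md/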
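Prometheus cluster planning
Prepare three Kubernetes clusters. Thanos in Sidecar mode collects the metrics of all three clusters for display in a single Grafana instance and uploads them to object storage for long-term retention. The components are planned as follows:
- cluster-observer: deploy the monitoring components and thanos-sidecar with kube-prometheus-stack, with Grafana enabled. Also deploy bitnami-thanos, with Thanos and MinIO enabled.
- cluster-A: deploy the monitoring components and thanos-sidecar with kube-prometheus-stack.
- cluster-B: deploy the monitoring components and thanos-sidecar with kube-prometheus-stack.
The component deployment plan is summarized in the following table:
| Cluster | Node | Node IP | Components |
|---|---|---|---|
| cluster-observer | node33 | 192.168.72.33 | prometheus-operator, prometheus, thanos-sidecar, alertmanager, kube-state-metrics, node-exporter; additionally: grafana, thanos, minio |
| cluster-A | node40 | 192.168.72.40 | prometheus-operator, prometheus, thanos-sidecar, alertmanager, kube-state-metrics, node-exporter |
| cluster-B | node41 | 192.168.72.41 | prometheus-operator, prometheus, thanos-sidecar, alertmanager, kube-state-metrics, node-exporter |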
Check the node information of each cluster
cluster-observer cluster
root@node33:~# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node33 Ready control-plane 153d v1.27.6 192.168.72.33 <none> Ubuntu 22.04.2 LTS 5.15.0-76-generic containerd://1.6.24
cluster-A cluster
root@node40:~# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node40 Ready control-plane 107d v1.27.7 192.168.72.40 <none> Ubuntu 22.04.2 LTS 5.15.0-76-generic containerd://1.6.24
cluster-B cluster
root@node41:~# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node41 Ready control-plane 4d21h v1.27.7 192.168.72.41 <none> Ubuntu 22.04.2 LTS 5.15.0-76-generic containerd://1.6.24
All three clusters need a default StorageClass to provide persistent storage for pods, for example:
root@node33:~# kubectl get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
openebs-hostpath (default) openebs.io/local Delete WaitForFirstConsumer false 153d
Add the kube-prometheus-stack Helm repository in each of the three clusters
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
Basic architecture
Metric collection flow
Deploying Prometheus on cluster-A
Create the kube-prometheus-stack values.yaml file
$ cat values.yaml
prometheus:
  service:
    type: NodePort
  prometheusSpec:
    replicas: 2
    retention: 12h
    disableCompaction: true
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: objstore.yml
    externalLabels:
      cluster: cluster-A
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
  thanosService:
    enabled: true
    type: NodePort
    clusterIP: ""
  thanosServiceMonitor:
    enabled: true
  extraSecret:
    name: thanos-objstore
    data:
      objstore.yml: |
        type: S3
        config:
          bucket: "thanos"
          endpoint: "192.168.72.33:32000"
          access_key: "admin"
          secret_key: "minio123"
          insecure: true
alertmanager:
  enabled: true
  service:
    type: "NodePort"
  alertmanagerSpec:
    replicas: 2
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
grafana:
  enabled: false
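Configuration parameters:
- prometheus.prometheusSpec.replicas: runs Prometheus with 2 replicas that scrape identical metrics from the same cluster for high availability; Thanos deduplicates the two copies.
- prometheus.prometheusSpec.storageSpec: enables persistent local storage for Prometheus. Keep at least 6h of metrics locally: the Thanos sidecar uploads data to object storage only once every 2h, so up to 2h of data is at risk of being lost, and Thanos still reads Prometheus local data when serving queries on recent metrics.
- prometheus.prometheusSpec.disableCompaction: disables Prometheus local compaction. Thanos has its own compaction and deduplication (the Compactor component), and running both can lead to conflicts.
- prometheus.prometheusSpec.thanos.objectStorageConfig: the Thanos sidecar needs access to object storage in order to upload data. The MinIO object storage component is enabled later when deploying Thanos; it can also be deployed separately.
- prometheus.prometheusSpec.externalLabels: labels that identify the cluster and the Prometheus replica. Thanos uses externalLabels to tell apart Prometheus instances from different clusters and different replicas. The cluster label is set manually; the replica label is added automatically by kube-prometheus-stack.
- prometheus.thanosService: exposes the Thanos sidecar Service as a NodePort so that the Thanos Query component in cluster-observer can reach it. Ingress is recommended in production.
- prometheus.extraSecret: lets kube-prometheus-stack create the Secret for object storage access that prometheus.prometheusSpec.thanos.objectStorageConfig refers to. For security reasons you can also drop this setting and create the Secret manually.
- alertmanager.enabled: optional; runs Alertmanager with two replicas and persistent storage.
- grafana.enabled: disables Grafana here; it only needs to be enabled in cluster-observer.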
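The 6h guideline for local retention can be read as roughly three 2h upload intervals. The following Python sketch checks a retention value against that rule of thumb. It is only an illustration: `parse_duration`, `retention_is_safe`, and the `min_blocks` parameter are hypothetical helper names, not part of Prometheus, Thanos, or the Helm chart, and only single-unit durations such as `12h` are handled.

```python
import re

# Seconds per unit for simple single-unit durations (assumption: no compound
# values such as "1h30m" are needed for this check).
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text: str) -> int:
    """Convert a duration such as '12h' or '90m' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


def retention_is_safe(retention: str, upload_interval: str = "2h",
                      min_blocks: int = 3) -> bool:
    """Return True if retention covers at least min_blocks upload intervals."""
    return parse_duration(retention) >= min_blocks * parse_duration(upload_interval)


if __name__ == "__main__":
    # The values.yaml above uses retention: 12h.
    print(retention_is_safe("12h"))
    # A 4h retention would fall below the 6h guideline.
    print(retention_is_safe("4h"))
```

With the values used in this article (`retention: 12h`), the check passes with a comfortable margin.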
Deploy kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace -f values.yaml
List all created pods. The prometheus-operator, prometheus, thanos-sidecar, alertmanager, kube-state-metrics, and node-exporter components are deployed.
root@node40:~# kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 15m
alertmanager-prometheus-kube-prometheus-alertmanager-1 2/2 Running 0 15m
prometheus-kube-prometheus-operator-85656676ff-vt7gm 1/1 Running 0 4d19h
prometheus-kube-state-metrics-6b6ffbfdd6-v7z9r 1/1 Running 0 4d19h
prometheus-prometheus-kube-prometheus-prometheus-0 3/3 Running 0 4d19h
prometheus-prometheus-kube-prometheus-prometheus-1 3/3 Running 0 4d19h
prometheus-prometheus-node-exporter-nkpph 1/1 Running 0 4d19h
Inspect the Prometheus pod; a thanos-sidecar container has been added
$ kubectl -n monitoring get pods prometheus-prometheus-kube-prometheus-prometheus-0 -o jsonpath='{.spec.containers[*].name}'
prometheus config-reloader thanos-sidecar
List the Prometheus services; a thanos-discovery service has been created
root@node40:~/kube-prometheus-stack# kubectl -n monitoring get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-kube-prometheus-thanos-discovery NodePort 10.96.3.135 <none> 10901:30901/TCP,10902:30902/TCP 66s
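Open the configuration page of the Prometheus console and check the prometheus_replica label that is automatically added to external_labels on the two replicas.
Refresh the page to see the label configuration of both replicas: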
# Replica 1
global:
  external_labels:
    cluster: cluster-A
    prometheus: monitoring/prometheus-kube-prometheus-prometheus
    prometheus_replica: prometheus-prometheus-kube-prometheus-prometheus-0
# Replica 2
global:
  external_labels:
    cluster: cluster-A
    prometheus: monitoring/prometheus-kube-prometheus-prometheus
    prometheus_replica: prometheus-prometheus-kube-prometheus-prometheus-1
The Thanos --query.replica-label flag must be set to the prometheus_replica label from external_labels. We will specify it later in the values.yaml of the bitnami-thanos Helm chart; the resulting configuration looks like this:
thanos query \
--store=<address_of_store_api> \
--query.replica-label="prometheus_replica"
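To make the role of the replica label more concrete, here is a deliberately simplified Python sketch of deduplication: series that differ only in the replica label are merged into one. This is not the algorithm Thanos Query actually uses (Thanos picks one replica and switches on gaps); it is a naive merge by timestamp, and `deduplicate` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[int, float]  # (timestamp in seconds, value)
SeriesKey = Tuple[Tuple[str, str], ...]


def deduplicate(series: Iterable[Tuple[Labels, List[Sample]]],
                replica_label: str = "prometheus_replica"
                ) -> Dict[SeriesKey, List[Sample]]:
    """Merge series that are identical once replica_label is ignored."""
    merged: Dict[SeriesKey, Dict[int, float]] = {}
    for labels, samples in series:
        # Identity of the series without the replica label.
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != replica_label))
        by_ts = merged.setdefault(key, {})
        for ts, value in samples:
            # Naive policy: the first replica seen for a timestamp wins.
            by_ts.setdefault(ts, value)
    return {key: sorted(by_ts.items()) for key, by_ts in merged.items()}


if __name__ == "__main__":
    # Hypothetical samples: each replica missed one scrape.
    replica0 = ({"__name__": "up", "cluster": "cluster-A",
                 "prometheus_replica": "prometheus-0"}, [(0, 1.0), (30, 1.0)])
    replica1 = ({"__name__": "up", "cluster": "cluster-A",
                 "prometheus_replica": "prometheus-1"}, [(30, 1.0), (60, 1.0)])
    for key, samples in deduplicate([replica0, replica1]).items():
        print(dict(key), samples)
```

The point of the sketch is that the `cluster` label survives deduplication while `prometheus_replica` does not, which is why the cluster label must differ between clusters and the replica label must match the value passed to `--query.replica-label`.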
Deploying Prometheus on cluster-B
The configuration is similar to cluster-A; remember to change the externalLabels parameter. Create the kube-prometheus-stack values.yaml file
$ cat values.yaml
prometheus:
  service:
    type: NodePort
  prometheusSpec:
    replicas: 2
    retention: 12h
    disableCompaction: true
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: objstore.yml
    externalLabels:
      cluster: cluster-B
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
  thanosService:
    enabled: true
    type: NodePort
    clusterIP: ""
  thanosServiceMonitor:
    enabled: true
  extraSecret:
    name: thanos-objstore
    data:
      objstore.yml: |
        type: S3
        config:
          bucket: "thanos"
          endpoint: "192.168.72.33:32000"
          access_key: "admin"
          secret_key: "minio123"
          insecure: true
alertmanager:
  enabled: true
  service:
    type: "NodePort"
  alertmanagerSpec:
    replicas: 2
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
grafana:
  enabled: false
Deploy kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace -f values.yaml
List all created pods. The prometheus-operator, prometheus, thanos-sidecar, alertmanager, kube-state-metrics, and node-exporter components are deployed.
root@node41:~# kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 18m
alertmanager-prometheus-kube-prometheus-alertmanager-1 2/2 Running 0 18m
prometheus-kube-prometheus-operator-85656676ff-slrmb 1/1 Running 0 4d19h
prometheus-kube-state-metrics-6b6ffbfdd6-2tdfb 1/1 Running 0 4d19h
prometheus-prometheus-kube-prometheus-prometheus-0 3/3 Running 0 4d19h
prometheus-prometheus-kube-prometheus-prometheus-1 3/3 Running 0 4d19h
prometheus-prometheus-node-exporter-t5x4s 1/1 Running 0 4d19h
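Deploying Prometheus on cluster-observer
The configuration is similar to cluster-A; remember to change the externalLabels parameter. Grafana is also enabled here, with Thanos as its data source. Create the kube-prometheus-stack values.yaml file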
$ cat values.yaml
prometheus:
  service:
    type: NodePort
  prometheusSpec:
    replicas: 2
    retention: 12h
    disableCompaction: true
    thanos:
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore
          key: objstore.yml
    externalLabels:
      cluster: cluster-observer
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
  thanosService:
    enabled: true
    type: NodePort
    clusterIP: ""
  thanosServiceMonitor:
    enabled: true
  extraSecret:
    name: thanos-objstore
    data:
      objstore.yml: |
        type: S3
        config:
          bucket: "thanos"
          endpoint: "192.168.72.33:32000"
          access_key: "admin"
          secret_key: "minio123"
          insecure: true
alertmanager:
  enabled: true
  service:
    type: "NodePort"
  alertmanagerSpec:
    replicas: 2
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
grafana:
  enabled: true
  service:
    type: "NodePort"
  persistence:
    enabled: true
  sidecar:
    dashboards:
      multicluster:
        global:
          enabled: true
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Thanos
          type: prometheus
          url: http://thanos-query-frontend.thanos:9090
          access: proxy
          isDefault: true
Deploy kube-prometheus-stack
helm install prometheus prometheus-community/kube-prometheus-stack \
-n monitoring --create-namespace -f values.yaml
List all created pods. The prometheus-operator, prometheus, thanos-sidecar, alertmanager, kube-state-metrics, and node-exporter components are deployed, together with Grafana.
root@node33:~# kubectl -n monitoring get pods
NAME READY STATUS RESTARTS AGE
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 15m
alertmanager-prometheus-kube-prometheus-alertmanager-1 2/2 Running 0 15m
prometheus-grafana-ff75dc75c-7nxzj 3/3 Running 0 15m
prometheus-kube-prometheus-operator-85656676ff-bzpbv 1/1 Running 0 4d19h
prometheus-kube-state-metrics-6b6ffbfdd6-rn2q4 1/1 Running 0 4d19h
prometheus-prometheus-kube-prometheus-prometheus-0 3/3 Running 0 4d19h
prometheus-prometheus-kube-prometheus-prometheus-1 3/3 Running 0 4d19h
prometheus-prometheus-node-exporter-zrz86 1/1 Running 0 4d19h
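Deploying Thanos on cluster-observer
bitnami-thanos project: https://github.com/bitnami/charts/tree/main/bitnami/thanos
Thanos is a highly available metrics system that can be added on top of existing Prometheus deployments, providing a global query view across all Prometheus installations. The chart lets you install multiple Thanos components, so you can deploy an architecture like the following: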
                       +--------------+                  +--------------+      +--------------+
                       | Thanos       |----------------> | Thanos Store |      | Thanos       |
                       | Query        |           |      | Gateway      |      | Compactor    |
                       +--------------+           |      +--------------+      +--------------+
                   push                           |             |                     |
+--------------+   alerts   +--------------+      |             | storages            | Downsample &
| Alertmanager | <----------| Thanos       | <----|             | query metrics       | compact blocks
| (*)          |            | Ruler        |      |             |                     |
+--------------+            +--------------+      |            \/                     |
      ^                            |              |     +----------------+            |
      | push alerts                +--------------|---> | MinIO® (*)     | <----------+
      |                                           |     |                |
+------------------------------+                  |     +----------------+
|+------------+  +------------+|                  |             ^
|| Prometheus |->| Thanos     || <----------------+             |
|| (*)        |<-| Sidecar (*)||    query                       | inspect
|+------------+  +------------+|    metrics                     | blocks
+------------------------------+                                |
                                                         +--------------+
                                                         | Thanos       |
                                                         | Bucket Web   |
                                                         +--------------+
Add the bitnami-thanos Helm repository
helm repo add bitnami https://charts.bitnami.com/bitnami
Create the bitnami-thanos values.yaml file
$ cat values.yaml
objstoreConfig: |-
  type: s3
  config:
    bucket: thanos
    endpoint: thanos-minio.thanos:9000
    access_key: admin
    secret_key: minio123
    insecure: true
query:
  enabled: true
  replicaCount: 3
  replicaLabel: prometheus_replica
  stores:
    - "192.168.72.33:30901"
    - "192.168.72.40:30901"
    - "192.168.72.41:30901"
queryFrontend:
  enabled: true
  service:
    type: NodePort
bucketweb:
  enabled: true
  service:
    type: NodePort
compactor:
  enabled: true
  persistence:
    enabled: true
storegateway:
  enabled: true
  persistence:
    enabled: true
metrics:
  enabled: true
  serviceMonitor:
    enabled: true
receive:
  enabled: false
ruler:
  enabled: true
  replicaLabel: prometheus_replica
  serviceMonitor:
    enabled: true
  alertmanagers:
    - http://alertmanager-operated.monitoring:9093
  config:
    groups:
      - name: "metamonitoring"
        rules:
          - alert: "PrometheusDown"
            expr: absent(up{prometheus="monitoring/prometheus-kube-prometheus-prometheus"})
  persistence:
    enabled: true
minio:
  enabled: true
  auth:
    rootUser: admin
    rootPassword: "minio123"
  defaultBuckets: "thanos"
  service:
    type: NodePort
    nodePorts:
      api: "32000"
      console: "32001"
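Configuration parameters:
- objstoreConfig: the object storage configuration. Thanos needs access to object storage to serve queries over historical data (Query reads it through the Store Gateway).
- query.replicaLabel: must match the replica label generated by kube-prometheus-stack.
- query.stores: the Thanos sidecar Service of each cluster, which has already been exposed as a NodePort.
- receive.enabled: this deployment uses Sidecar mode, so Receive mode is disabled (it is disabled by default).
- ruler.enabled: optional. Thanos also supports global alerting rules. Without a specific need for global alerts, it is better to alert close to the data, that is, to configure alerting with the Alertmanager deployed in each cluster.
- minio.enabled: enables the MinIO object storage bundled with the chart. In production, deploy a separate MinIO cluster or use a cloud object storage service.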
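The entries under `query.stores` follow directly from the planning table: one node IP per cluster plus the sidecar gRPC NodePort (30901 in the Service output shown earlier). A small Python sketch that derives the list, assuming the same port on every cluster; `sidecar_endpoints` is a hypothetical helper, not part of any chart or CLI.

```python
import ipaddress
from typing import Dict, List


def sidecar_endpoints(node_ips: Dict[str, str], node_port: int = 30901) -> List[str]:
    """Return 'ip:port' strings for each cluster's Thanos sidecar NodePort."""
    endpoints = []
    for cluster, ip in node_ips.items():
        # Fail early on typos in the address; raises ValueError if invalid.
        ipaddress.ip_address(ip)
        endpoints.append(f"{ip}:{node_port}")
    return endpoints


if __name__ == "__main__":
    # Node IPs taken from the planning table in this article.
    clusters = {
        "cluster-observer": "192.168.72.33",
        "cluster-A": "192.168.72.40",
        "cluster-B": "192.168.72.41",
    }
    # Print the entries in the form used under query.stores.
    for endpoint in sidecar_endpoints(clusters):
        print(f'- "{endpoint}"')
```

Adding a fourth cluster then only means adding one more entry to the dictionary and redeploying the chart with the extended `query.stores` list.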
Deploy the Thanos components
helm install thanos bitnami/thanos \
-n thanos --create-namespace -f values.yaml
List the created pods
root@node33:~# kubectl -n thanos get pods
NAME READY STATUS RESTARTS AGE
thanos-bucketweb-8575fff6d9-9sxg2 1/1 Running 0 4d19h
thanos-compactor-5cb6c57664-lgw2h 1/1 Running 0 4d19h
thanos-minio-d8bb8598b-8kcl8 1/1 Running 0 4d19h
thanos-query-75576bcd78-pjglm 1/1 Running 0 14m
thanos-query-frontend-758f54d944-h6lgb 1/1 Running 0 4d19h
thanos-storegateway-0 1/1 Running 1 (4d19h ago) 4d19h
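Using Thanos in place of Prometheus
Running PromQL queries through the web UI
Thanos Query provides an interface similar to that of Prometheus. It can be reached through port forwarding or an Ingress; in this setup the thanos-query-frontend Service is exposed as a NodePort.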
root@ubuntu:~# kubectl -n thanos get svc thanos-query-frontend
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
thanos-query-frontend NodePort 10.96.0.117 <none> 9090:31264/TCP 4d20h
Open it in a browser
http://192.168.72.33:31264
PromQL queries can be run just as in Prometheus:
Alerts, targets, and rules from the different Prometheus instances are also accessible:
Thanos Query also lists the configured StoreAPI endpoints:
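Thanos as a Grafana data source
Thanos is most commonly used together with Grafana. Thanos Query exposes the same API as Prometheus, so you only need to add a Prometheus-type data source in Grafana and point it at Thanos Query or thanos-query-frontend.
Look up the NodePort of the Grafana service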
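Because the API is Prometheus-compatible, any HTTP client can query Thanos the same way Grafana does. The sketch below builds an instant-query URL for the `/api/v1/query` endpoint using only the Python standard library. The helper names are hypothetical, the base URL is the NodePort address used in this article, and `dedup` is the Thanos Query request parameter that toggles replica deduplication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Build the URL of an instant query against a Prometheus-compatible API."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def instant_query(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run the query and return the decoded JSON response (needs network access)."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumed address: the thanos-query-frontend NodePort from this article.
    base = "http://192.168.72.33:31264"
    print(instant_query_url(base, "sum by (cluster) (up)"))
    # Uncomment on a machine that can reach the cluster:
    # print(instant_query(base, "sum by (cluster) (up)"))
```

Grouping by the `cluster` label works here only because each Prometheus was given a distinct `cluster` external label.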
root@ubuntu:~# kubectl -n monitoring get svc prometheus-grafana
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-grafana NodePort 10.96.0.167 <none> 80:30495/TCP 161m
Log in to Grafana; the default username and password are admin/prom-operator
http://192.168.72.33:30495
The configured data source is http://thanos-query-frontend.thanos:9090
Open a dashboard; you can switch between clusters to view their metrics
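Switching clusters in a dashboard comes down to filtering on the `cluster` external label. As a rough illustration, this Python sketch groups the result of an instant `up` query by that label. The input follows the standard Prometheus HTTP API response shape for an instant vector; `targets_up_by_cluster` and the sample payload are hypothetical and do not come from the clusters in this article.

```python
from collections import defaultdict
from typing import Dict


def targets_up_by_cluster(response: dict) -> Dict[str, int]:
    """Count targets reporting up == 1 per value of the 'cluster' label."""
    totals: Dict[str, int] = defaultdict(int)
    for item in response["data"]["result"]:
        cluster = item["metric"].get("cluster", "unknown")
        # Each value is a [timestamp, "string value"] pair.
        totals[cluster] += int(float(item["value"][1]))
    return dict(totals)


if __name__ == "__main__":
    # Illustrative payload only, shaped like an /api/v1/query response.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"cluster": "cluster-A", "job": "node-exporter"},
                 "value": [1700000000, "1"]},
                {"metric": {"cluster": "cluster-B", "job": "node-exporter"},
                 "value": [1700000000, "1"]},
            ],
        },
    }
    print(targets_up_by_cluster(sample))
```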