KJH
Deploying blackbox exporter and configuring Alertmanager Slack alerts
Create the Blackbox exporter configuration file
To configure an http module for monitoring web service endpoints, write the Blackbox configuration file as a ConfigMap.
    # kubectl --namespace=monitoring apply -f configmap.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-blackbox-exporter
      labels:
        app: prometheus-blackbox-exporter
    data:
      blackbox.yaml: |
        modules:
          http_2xx:
            http:
              no_follow_redirects: false
              preferred_ip_protocol: ip4
              tls_config:
                insecure_skip_verify: true
              valid_http_versions:
                - HTTP/1.1
                - HTTP/2
              valid_status_codes: []
            prober: http
            timeout: 5s
Deploy the Blackbox exporter to Kubernetes
Write a Deployment and a Service so that the exporter can be deployed to Kubernetes.
    # kubectl --namespace=monitoring apply -f blackbox-exporter.yaml
    ---
    kind: Service
    apiVersion: v1
    metadata:
      name: prometheus-blackbox-exporter
      labels:
        app: prometheus-blackbox-exporter
    spec:
      type: ClusterIP
      ports:
        - name: http
          port: 9115
          protocol: TCP
      selector:
        app: prometheus-blackbox-exporter
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: prometheus-blackbox-exporter
      labels:
        app: prometheus-blackbox-exporter
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: prometheus-blackbox-exporter
      template:
        metadata:
          labels:
            app: prometheus-blackbox-exporter
        spec:
          restartPolicy: Always
          containers:
            - name: blackbox-exporter
              image: "prom/blackbox-exporter:v0.15.1"
              imagePullPolicy: IfNotPresent
              securityContext:
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsUser: 1000
              args:
                - "--config.file=/config/blackbox.yaml"
              resources: {}
              ports:
                - containerPort: 9115
                  name: http
              livenessProbe:
                httpGet:
                  path: /health
                  port: http
              readinessProbe:
                httpGet:
                  path: /health
                  port: http
              volumeMounts:
                - mountPath: /config
                  name: config
            - name: configmap-reload
              image: "jimmidyson/configmap-reload:v0.2.2"
              imagePullPolicy: "IfNotPresent"
              securityContext:
                runAsNonRoot: true
                runAsUser: 65534
              args:
                - --volume-dir=/etc/config
                - --webhook-url=http://localhost:9115/-/reload
              resources: {}
              volumeMounts:
                - mountPath: /etc/config
                  name: config
                  readOnly: true
          volumes:
            - name: config
              configMap:
                name: prometheus-blackbox-exporter
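After applying the manifests, it can help to confirm that the exporter answers a probe before wiring it into Prometheus. The following is a minimal sketch, assuming `kubectl` and `curl` are available locally and that local port 9115 is free. The function name `check_blackbox` and its defaults are hypothetical and not part of the original setup.

```shell
# check_blackbox [NAMESPACE] [TARGET]
# Hypothetical helper: waits for the rollout, port-forwards the Service,
# runs one probe through the http_2xx module and prints the probe_success line.
check_blackbox() {
  ns="${1:-monitoring}"
  target="${2:-https://prometheus.io}"

  # Wait until the Deployment reports a finished rollout.
  kubectl --namespace="$ns" rollout status deployment/prometheus-blackbox-exporter || return 1

  # Forward the Service port in the background for the duration of the check.
  kubectl --namespace="$ns" port-forward service/prometheus-blackbox-exporter 9115:9115 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give the port-forward a moment to start listening

  # The exporter returns metrics text; grep keeps only the probe_success sample.
  curl -s "http://localhost:9115/probe?module=http_2xx&target=${target}" | grep '^probe_success'
  status=$?

  kill "$pf_pid" 2>/dev/null
  return "$status"
}
```

Calling `check_blackbox monitoring https://prometheus.io` should print a line starting with `probe_success` when the exporter is reachable. A value of 1 means the probe passed.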
Save the following as prometheus-additional.yaml.
    - job_name: 'kube-api-blackbox'
      # Keep this well below Prometheus's 5m lookback window so that
      # probe_success does not go stale between scrapes.
      scrape_interval: 1m
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
        - targets:
            - https://www.google.com
            - http://www.example.com
            - https://prometheus.io
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: prometheus-blackbox-exporter:9115  # The blackbox exporter.
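The relabel_configs above move each listed target into the `target` URL parameter and point the scrape itself at the exporter. As a rough illustration, the sketch below prints the URL that results for each target. The helper name `probe_url` and the `EXPORTER` and `MODULE` variables are hypothetical. Prometheus also percent-encodes the parameter values, which this sketch does not do.

```shell
# Values mirrored from the scrape config above; override them when
# testing through a port-forward (e.g. EXPORTER=localhost:9115).
EXPORTER="${EXPORTER:-prometheus-blackbox-exporter:9115}"
MODULE="${MODULE:-http_2xx}"

# probe_url TARGET
# TARGET is the original address, i.e. the value that relabeling copies
# into __param_target and into the instance label.
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$EXPORTER" "$MODULE" "$1"
}

# One probe URL per entry in static_configs.targets.
for target in https://www.google.com http://www.example.com https://prometheus.io; do
  probe_url "$target"
done
```

Each printed URL can be passed to `curl` from inside the cluster to see the same metrics that Prometheus scrapes for that target.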
Base64-encode the saved content and create a Secret that uses it as the secret value.
    # Strip the line breaks that some base64 builds insert, so the value stays on one line.
    PROMETHEUS_ADD_CONFIG=$(base64 < prometheus-additional.yaml | tr -d '\n')
    cat << EOF | kubectl --namespace=monitoring apply -f -
    apiVersion: v1
    kind: Secret
    metadata:
      name: additional-scrape-configs
    type: Opaque
    data:
      prometheus-additional.yaml: $PROMETHEUS_ADD_CONFIG
    EOF
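Some `base64` implementations wrap their output at 76 columns, which would break the single-line value in the Secret manifest. The sketch below encodes the file on one line and checks that decoding gives back the original bytes. It assumes a `base64` that accepts `-d` for decoding. The function names `encode_one_line` and `round_trip_ok` are hypothetical.

```shell
# encode_one_line FILE
# Base64-encodes FILE and removes any line wrapping added by base64.
encode_one_line() {
  base64 < "$1" | tr -d '\n'
}

# round_trip_ok FILE
# Succeeds only if decoding the encoded text reproduces FILE byte for byte.
round_trip_ok() {
  encode_one_line "$1" | base64 -d | cmp -s - "$1"
}

# Example (run next to the saved file):
#   round_trip_ok prometheus-additional.yaml && echo "safe to use in the Secret"
```

Once the Secret exists, the stored value can be inspected in a similar way with `kubectl --namespace=monitoring get secret additional-scrape-configs -o yaml`.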
kubectl --namespace=monitoring edit prometheuses prometheus-kube-prometheus-prometheus
Add the following values:
    spec:
      additionalScrapeConfigs:
        key: prometheus-additional.yaml
        name: additional-scrape-configs
kubectl edit prometheusrules prometheus-kube-prometheus-k8s.rules -n monitoring
Add the following values:
    - name: blackbox-exporter
      rules:
        - alert: ProbeFailed
          expr: probe_success == 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "Probe failed (instance {{ $labels.instance }})"
            description: "Probe failed\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
        - alert: SlowProbe
          expr: avg_over_time(probe_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Slow probe (instance {{ $labels.instance }})"
            description: "Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
        - alert: HttpStatusCode
          expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "HTTP Status Code (instance {{ $labels.instance }})"
            description: "HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
        - alert: SslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 90
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "SSL certificate will expire soon (instance {{ $labels.instance }})"
            description: "SSL certificate expires in less than 90 days\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
        - alert: SslCertificateHasExpired
          expr: probe_ssl_earliest_cert_expiry - time() <= 0
          for: 5m
          labels:
            severity: error
          annotations:
            summary: "SSL certificate has expired (instance {{ $labels.instance }})"
            description: "SSL certificate has expired already\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
        - alert: HttpSlowRequests
          expr: avg_over_time(probe_http_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "HTTP slow requests (instance {{ $labels.instance }})"
            description: "HTTP request took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
        - alert: SlowPing
          expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Slow ping (instance {{ $labels.instance }})"
            description: "Blackbox ping took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
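The certificate rules compare `probe_ssl_earliest_cert_expiry - time()`, a number of seconds, against a threshold written as `86400 * 90`. The small sketch below spells out that arithmetic. The function names `threshold_seconds` and `days_left` are hypothetical and exist only for illustration.

```shell
# threshold_seconds DAYS
# The right-hand side of the rule: 86400 seconds per day times DAYS.
threshold_seconds() {
  echo $(( 86400 * $1 ))
}

# days_left EXPIRY_EPOCH NOW_EPOCH
# Whole days between now and the certificate expiry (negative once expired).
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

threshold_seconds 90   # prints 7776000
```

With these numbers, SslCertificateWillExpireSoon matches any certificate with fewer than 7776000 seconds of validity left, and SslCertificateHasExpired matches once the difference reaches zero or below.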
Add the Slack settings to the Prometheus Helm chart values.yaml
For the Slack app, enabling the incoming webhook alone is enough to start using it.
Alert names covered
HttpStatusCode, HttpSlowRequests, SlowPing [repeat_interval: 5m]
SslCertificateWillExpireSoon, SslCertificateHasExpired [repeat_interval: 168h, i.e. one week]
    config:
      global:
        resolve_timeout: 2m
        slack_api_url: "https://hooks.slack.com/services/###"
      receivers:
        - name: default-slack-alert  # Blackhole
        - name: timeout-slack-alert
          slack_configs:
            - send_resolved: true
              title: |-
                [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
              text: >-
                {{ range .Alerts -}}
                *Alert:* {{ .Annotations.summary }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}

                *Description:* {{ .Annotations.description }}

                *Details:*
                  {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
                  {{ end }}
                {{ end }}
        - name: ping-slack-alert
          slack_configs:
            - send_resolved: true
              title: |-
                [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
              text: >-
                {{ range .Alerts -}}
                *Alert:* {{ .Annotations.summary }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}

                *Description:* {{ .Annotations.description }}

                *Details:*
                  {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
                  {{ end }}
                {{ end }}
        - name: cert-slack-alert
          slack_configs:
            - send_resolved: true
              title: |-
                [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
              text: >-
                {{ range .Alerts -}}
                *Alert:* {{ .Annotations.summary }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}

                *Description:* {{ .Annotations.description }}

                *Details:*
                  {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
                  {{ end }}
                {{ end }}
        - name: expired-slack-alert
          slack_configs:
            - send_resolved: true
              title: |-
                [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
              text: >-
                {{ range .Alerts -}}
                *Alert:* {{ .Annotations.summary }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}

                *Description:* {{ .Annotations.description }}

                *Details:*
                  {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
                  {{ end }}
                {{ end }}
        - name: status-slack-alert
          slack_configs:
            - send_resolved: true
              title: " "
              text: "{{ range .Alerts }}{{ .Annotations.message }} Monitor is {{ .Status }}: [CLO-SET] ( <{{ .Labels.instance }}> ). {{ .Labels.alertname }} : {{ .Annotations.description }} \n {{ end }}"
      route:
        group_wait: 0s
        # How long to wait before notifying about new alerts added to a group
        # whose initial notification was already sent (usually 5m or more). Units: s, m, h.
        group_interval: 30s
        # How long to wait before re-sending a notification that was already
        # sent successfully (usually 3h or more). Previously 6h.
        repeat_interval: 5m
        receiver: default-slack-alert
        # All alerts that do not match the following child routes
        # will remain at the root node and be dispatched to 'default-slack-alert'.
        routes:
          - match:
              alertname: HttpSlowRequests
            receiver: timeout-slack-alert
            group_wait: 10s
            group_by: ['alertname']
          - match:
              alertname: SlowPing
            receiver: ping-slack-alert
            group_wait: 10s
            group_by: ['alertname']
          - match:
              alertname: SslCertificateWillExpireSoon
            receiver: cert-slack-alert
            repeat_interval: 168h
            group_wait: 10s
            group_by: ['alertname']
          - match:
              alertname: SslCertificateHasExpired
            receiver: expired-slack-alert
            repeat_interval: 168h
            group_wait: 10s
            group_by: ['alertname']
          - match:
              alertname: HttpStatusCode
            receiver: status-slack-alert
            group_wait: 10s
            group_by: ['alertname']
      templates:
        - '/etc/alertmanager/config/*.tmpl'
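To see at a glance which receiver each alert ends up with, the route tree above can be mirrored as a simple lookup. This is only an illustrative sketch of the matching, not how Alertmanager is implemented, and the function name `receiver_for` is hypothetical.

```shell
# receiver_for ALERTNAME
# Mirrors the child routes above; anything unmatched stays at the root
# route, whose receiver has no slack_configs (the "Blackhole").
receiver_for() {
  case "$1" in
    HttpSlowRequests)             echo timeout-slack-alert ;;
    SlowPing)                     echo ping-slack-alert ;;
    SslCertificateWillExpireSoon) echo cert-slack-alert ;;
    SslCertificateHasExpired)     echo expired-slack-alert ;;
    HttpStatusCode)               echo status-slack-alert ;;
    *)                            echo default-slack-alert ;;
  esac
}

for alert in ProbeFailed SlowProbe HttpStatusCode SlowPing; do
  printf '%s -> %s\n' "$alert" "$(receiver_for "$alert")"
done
```

The output shows that ProbeFailed and SlowProbe fall through to default-slack-alert, so with this configuration they produce no Slack message unless a child route is added for them.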
In the next post, I plan to modify the Prometheus chart so that all of this is applied automatically.