21 Jan 2022 Using Vector to collect pod logs and forward them to Prometheus remote write

俊采
Published 2023-10-17 10:29:35
Part of the column: LEo的网络日志

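Cluster environment

  • ACM Hub: observability service enabled
  • AKS cluster: imported into ACM

This article shows how to forward the logs of a pod running on the AKS cluster to the ACM Hub and define an alert rule there. When the Hub detects the corresponding error logs, it fires an alert, so users learn promptly that a service on the remote AKS cluster has a problem.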

Installing Vector

$ helm repo add vector https://helm.vector.dev
$ helm repo update
$ helm show values vector/vector
$ cat <<-'VALUES' > values.yaml
role: Agent
VALUES
$ helm install vector vector/vector \
  --namespace vector \
  --create-namespace \
  --values values.yaml

Pod log format

The pod prints logs in several formats; here we only care about the Apache-format access log lines.

10.244.0.142 - admin [13/Dec/2021:12:20:36 +0000] "GET /api/v2/metrics/ HTTP/1.1" 200 7983 "http://10.244.2.22:8052/api/v2/metrics" "Prometheus/2.26.1" "-"

10.244.0.142 - admin [13/Dec/2021:12:21:35 +0000] "GET /api/v2/metrics HTTP/1.1" 301 0 "-" "Prometheus/2.26.1" "-"

10.244.0.142 - admin [09/Dec/2021:19:32:13 +0000] "GET /api/v2/metrics/ HTTP/1.1" 500 41 "http://10.244.2.22:8052/api/v2/metrics" "Prometheus/2.26.1" "-"
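
To make the structure of these lines concrete, here is a minimal Python sketch that splits one sample line into named fields. The regular expression and the field names are assumptions made for illustration, not Vector's own parser. Note that the sample lines carry one extra quoted field after the user agent, which the pattern simply ignores.

```python
import re

# Hypothetical pattern for the access-log lines shown above (not Vector's own parser).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not in this format."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


sample = (
    '10.244.0.142 - admin [09/Dec/2021:19:32:13 +0000] '
    '"GET /api/v2/metrics/ HTTP/1.1" 500 41 '
    '"http://10.244.2.22:8052/api/v2/metrics" "Prometheus/2.26.1" "-"'
)
parsed = parse_line(sample)
print(parsed["host"], parsed["method"], parsed["path"], parsed["status"])
```

The status code is the field the rest of the pipeline cares about: it decides whether a line is kept and which label the resulting metric carries.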

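Excluding some container logs on the deployment

Because the pod contains several containers and we only want the logs of the automation-controller-web container, we add an annotation to the pod; Vector then ignores the logs of the listed containers.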

$ kubectl get deploy automation-controller -o yaml | grep -B 3 'vector.dev/exclude-containers: redis'
  template:
    metadata:
      annotations:
        vector.dev/exclude-containers: redis,automation-controller-task,automation-controller-ee

The output shows the annotation configured in the deployment's pod template.
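
The effect of the annotation can be pictured as a simple set difference. The sketch below is only an illustration in Python: the helper name is hypothetical, the pod's full container list is assumed from the names mentioned in this article, and this is not how Vector implements the check internally.

```python
def containers_to_collect(all_containers, annotation_value):
    """Return the containers left after applying a comma-separated exclude list."""
    excluded = {name.strip() for name in annotation_value.split(",") if name.strip()}
    return [name for name in all_containers if name not in excluded]


# Container names taken from the article; the full container list is an assumption.
pod_containers = [
    "redis",
    "automation-controller-web",
    "automation-controller-task",
    "automation-controller-ee",
]
annotation = "redis,automation-controller-task,automation-controller-ee"

print(containers_to_collect(pod_containers, annotation))
```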

Configuring Vector to collect and forward logs

The complete Vector configuration is shown below; each part is explained afterwards:

$ kubectl get cm vector -o yaml
apiVersion: v1
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
        extra_label_selector: "app.kubernetes.io/name=automation-controller"
    sinks:
      prometheus_remote_write:
        type: prometheus_remote_write
        inputs:
          - log_to_metric_id
        endpoint: https://prometheus_remote_write_url/v1/receive
        default_namespace: automation_controller_web
        tls:
          ca_file: /tlscerts/ca/ca.crt
          crt_file: /tlscerts/certs/tls.crt
          key_file: /tlscerts/certs/tls.key
      stdout:
        type: console
        inputs:
          - log_to_metric_id
          - non_2xx_log
        encoding:
          codec: json
    transforms:
      remap_id:
        type: remap
        inputs:
          - kubernetes_logs
        source: . = .message
      filter_id:
        type: filter
        inputs:
          - remap_id
        condition: match!(.message, r'GET /api/v2/metrics[/]? HTTP/1.1')
      apache_log:
        type: remap
        inputs:
          - filter_id
        source: . = parse_apache_log!(.message, "combined")
      non_2xx_log:
        type: filter
        inputs:
          - apache_log
        condition: .status != 200
      log_to_metric_id:
        type: log_to_metric
        inputs:
          - non_2xx_log
        metrics:
          - type: counter
            field: status
            name: response_total
            tags:
              cluster: aap-azure-demo
              host: "{{host}}"
              method: "{{method}}"
              status: "{{status}}"
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: vector
    meta.helm.sh/release-namespace: vector
  creationTimestamp: "2022-01-20T03:03:08Z"
  labels:
    app.kubernetes.io/component: Agent
    app.kubernetes.io/instance: vector
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vector
    app.kubernetes.io/version: 0.19.0-distroless-libc
    helm.sh/chart: vector-0.4.0
  name: vector
  namespace: vector
  resourceVersion: "29901478"
  uid: ae118ac9-e38a-4770-9468-1896a1c4d4bb

Collect logs only from pods labeled app.kubernetes.io/name=automation-controller:

    sources:
      kubernetes_logs:
        type: kubernetes_logs
        extra_label_selector: "app.kubernetes.io/name=automation-controller"

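Forward the metrics derived from the collected logs to Prometheus remote write with TLS enabled. The TLS certificates must be created beforehand and mounted into the Vector pod: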

    sinks:
      prometheus_remote_write:
        type: prometheus_remote_write
        inputs:
          - log_to_metric_id
        endpoint: https://prometheus_remote_write_url/v1/receive
        default_namespace: automation_controller_web
        tls:
          ca_file: /tlscerts/ca/ca.crt
          crt_file: /tlscerts/certs/tls.crt
          key_file: /tlscerts/certs/tls.key

Print the relevant events to the console for easier debugging:

      stdout:
        type: console
        inputs:
          - log_to_metric_id
          - non_2xx_log
        encoding:
          codec: json

Extract the raw pod log line (the message field) from each event:

      remap_id:
        type: remap
        inputs:
          - kubernetes_logs
        source: . = .message

Keep only the Apache access log lines for GET requests to /api/v2/metrics:

      filter_id:
        type: filter
        inputs:
          - remap_id
        condition: match!(.message, r'GET /api/v2/metrics[/]? HTTP/1.1')
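
To see which lines this condition lets through, the same pattern can be tried outside Vector. The Python sketch below assumes that an unanchored search with Python's `re` module behaves like the VRL `match` call for this simple pattern. The first two sample lines come from earlier in this article; the third is made up.

```python
import re

# Same pattern as the filter condition; "[/]?" makes the trailing slash optional.
METRICS_REQUEST = re.compile(r"GET /api/v2/metrics[/]? HTTP/1.1")

lines = [
    '10.244.0.142 - admin [13/Dec/2021:12:20:36 +0000] "GET /api/v2/metrics/ HTTP/1.1" 200 7983 "http://10.244.2.22:8052/api/v2/metrics" "Prometheus/2.26.1" "-"',
    '10.244.0.142 - admin [13/Dec/2021:12:21:35 +0000] "GET /api/v2/metrics HTTP/1.1" 301 0 "-" "Prometheus/2.26.1" "-"',
    # Made-up line for a different endpoint; it should be dropped.
    '10.244.0.142 - admin [13/Dec/2021:12:22:00 +0000] "GET /api/v2/ping/ HTTP/1.1" 200 12 "-" "curl/7.79.1" "-"',
]

kept = [line for line in lines if METRICS_REQUEST.search(line)]
print(len(kept))
```

Because the dot in `HTTP/1.1` is unescaped, it matches any character; writing `HTTP/1\.1` would be stricter, although for these logs the difference is unlikely to matter.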

Parse the log line into structured fields using the Apache combined format:

      apache_log:
        type: remap
        inputs:
          - filter_id
        source: . = parse_apache_log!(.message, "combined")

Keep only log lines whose status code is not 200:

      non_2xx_log:
        type: filter
        inputs:
          - apache_log
        condition: .status != 200

Convert the log events into metrics so that they can be sent to Prometheus remote write:

      log_to_metric_id:
        type: log_to_metric
        inputs:
          - non_2xx_log
        metrics:
          - type: counter
            field: status
            name: response_total
            tags:
              cluster: aap-azure-demo
              host: "{{host}}"
              method: "{{method}}"
              status: "{{status}}"

The metric is a counter: each matching log line increments it by 1.
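
The combined effect of the status filter and the counter can be sketched in Python as follows. This is a simplified model, assuming one increment per matching event and tag values copied from the parsed fields; the event list is made up and the function name is hypothetical.

```python
from collections import Counter


def count_non_200(events, cluster):
    """Count events whose status is not 200, keyed by (cluster, host, method, status)."""
    totals = Counter()
    for event in events:
        if event["status"] != 200:  # same test as the non_2xx_log filter
            key = (cluster, event["host"], event["method"], str(event["status"]))
            totals[key] += 1  # the counter grows by 1 per matching event
    return totals


# Made-up parsed events for illustration.
events = [
    {"host": "10.244.0.227", "method": "GET", "status": 200},
    {"host": "10.244.0.227", "method": "GET", "status": 301},
    {"host": "10.244.0.227", "method": "GET", "status": 500},
    {"host": "10.244.0.227", "method": "GET", "status": 500},
]

totals = count_non_200(events, "aap-azure-demo")
for key, value in sorted(totals.items()):
    print(key, value)
```

Note that the filter tests for exactly 200, so other successful codes such as 204 would also be counted, despite the transform being named non_2xx_log.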

Verifying the collected logs

The events printed by the console sink can be seen in the Vector pod's logs:

  • Log line parsed into the Apache structured format
{
  "agent": "Prometheus/2.26.1",
  "host": "10.244.0.227",
  "message": "GET /api/v2/metrics HTTP/1.1",
  "method": "GET",
  "path": "/api/v2/metrics",
  "protocol": "HTTP/1.1",
  "referrer": "-",
  "size": 0,
  "status": 301,
  "timestamp": "2022-01-21T07:03:51Z",
  "user": "admin"
}
  • Apache log event converted into a metric
{
  "name": "response_total",
  "tags": {
    "cluster": "aap-azure-demo",
    "host": "10.244.0.227",
    "method": "GET",
    "status": "301"
  },
  "timestamp": "2022-01-21T07:03:51Z",
  "kind": "incremental",
  "counter": {
    "value": 1.0
  }
}

Finally, the metrics as queried in Grafana:

automation_controller_web_response_total{
  cluster="aap-azure-demo",
  host="10.244.0.227",
  method="GET",
  receive="true",
  status="301",
  tenant_id="d031e62c-c103-4df4-a899-3671d0236640"
}

automation_controller_web_response_total{
  cluster="aap-azure-demo",
  host="10.244.0.227",
  method="GET",
  receive="true",
  status="500",
  tenant_id="d031e62c-c103-4df4-a899-3671d0236640"
}
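
The series name seen in Grafana is the sink's `default_namespace` joined to the metric `name` with an underscore. Below is a small Python sketch of that naming, using a hypothetical helper that also renders the label set. The `receive` and `tenant_id` labels in the query result are not set anywhere in the Vector configuration, so they are presumably added on the Hub side; that is an assumption and they are left out here.

```python
def render_series(namespace, name, labels):
    """Render a Prometheus-style series selector from a namespace, name and labels."""
    full_name = f"{namespace}_{name}" if namespace else name
    label_text = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{full_name}{{{label_text}}}"


series = render_series(
    "automation_controller_web",  # default_namespace from the sink
    "response_total",             # name from the log_to_metric transform
    {"cluster": "aap-azure-demo", "host": "10.244.0.227", "method": "GET", "status": "500"},
)
print(series)
```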

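Creating the alert

Based on the metric above, define an alert rule: when the count of responses with status code 500 stays above 3 for one minute, an alert fires and tells us that a service on one of the clusters has a problem. Create the alert rule on the ACM Hub: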

kind: ConfigMap
apiVersion: v1
metadata:
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  custom_rules.yaml: |
    groups:
    - name: automation-controller-web-health
      rules:
      - alert: responseError
        expr: automation_controller_web_response_total{status="500"} > 3
        for: 1m
        labels:
          cluster: aap-azure-demo
          severity: warning
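
The rule combines a threshold (`> 3`) with a hold time (`for: 1m`): the expression must stay true for a full minute before the alert fires. The Python sketch below models that behaviour for a single series. The sample values and the 30-second evaluation interval are made up, and the real evaluation is done by the ruler on the Hub.

```python
def alert_states(samples, threshold=3, hold_seconds=60):
    """Return (timestamp, state) pairs for samples given as (timestamp, value)."""
    states = []
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            held = timestamp - pending_since
            states.append((timestamp, "firing" if held >= hold_seconds else "pending"))
        else:
            pending_since = None  # condition broken, start over
            states.append((timestamp, "inactive"))
    return states


# Made-up counter values sampled every 30 seconds.
samples = [(0, 2), (30, 4), (60, 5), (90, 6), (120, 7)]
for timestamp, state in alert_states(samples):
    print(timestamp, state)
```

Since the expression compares the raw counter value, the alert stays active once the total exceeds 3 for as long as the series keeps being reported. Wrapping the counter in PromQL's `increase(...)` over a time window is a common alternative, though this article does not use it.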

After a few minutes, the alert shows up in Grafana:

ALERTS{
  alertname="responseError",
  alertstate="firing",
  cluster="aap-azure-demo",
  host="10.244.0.227",
  method="GET",
  receive="true",
  severity="warning",
  status="500",
  tenant_id="d031e62c-c103-4df4-a899-3671d0236640"
}

LEo at 00:12

This article is part of the Tencent Cloud self-media syndication program and is shared from the author's personal site/blog.