
How to persist TKE/EKS cluster event logs

Original · By 聂伟星 · Updated 2022-05-09 12:05:11

On Tencent Cloud, event logs for TKE and EKS clusters are retained for only one hour by default. When a service misbehaves, you often need historical events to troubleshoot, and a one-hour window makes that very inconvenient. Tencent Cloud can ship cluster events to CLS out of the box, but CLS is billed separately, and many people are used to querying logs with Elasticsearch. Below we use the open-source eventrouter to collect events into Elasticsearch and then search them with Kibana. eventrouter documentation: https://github.com/heptiolabs/eventrouter

eventrouter uses the List-Watch mechanism to receive Kubernetes events in real time and push them to different sinks. In this persistence scheme, eventrouter writes the events it receives to a log file; a filebeat sidecar container in the same pod tails that file and ships the logs to Elasticsearch, where they can finally be searched with Kibana.

Let's walk through the deployment. This walkthrough uses a TKE cluster; an EKS cluster is deployed in exactly the same way.
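Before deploying anything, you can see the problem for yourself: `kubectl get events` only returns what the apiserver still retains, which on TKE/EKS is roughly the last hour.

```shell
# List recent events across all namespaces, newest last;
# anything older than the cluster's event TTL is already gone
kubectl get events --all-namespaces --sort-by=.lastTimestamp

# Narrow to a single kind of object's events while they still exist
kubectl -n weixnie get events --field-selector involvedObject.kind=Pod
```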

1. Deploy Elasticsearch

Deploy the ES cluster by creating a YAML like the one below:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: weixnie-es-test
    meta.helm.sh/release-namespace: weixnie
  labels:
    app: elasticsearch-master
    app.kubernetes.io/managed-by: Helm
    chart: elasticsearch
    heritage: Helm
    release: weixnie-es-test
  name: elasticsearch-master
  namespace: weixnie
spec:
  podManagementPolicy: Parallel
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: elasticsearch-master
  serviceName: elasticsearch-master-headless
  template:
    metadata:
      labels:
        app: elasticsearch-master
        chart: elasticsearch
        heritage: Helm
        release: weixnie-es-test
      name: elasticsearch-master
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - elasticsearch-master
            topologyKey: kubernetes.io/hostname
      containers:
      - env:
        - name: node.name
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: cluster.initial_master_nodes
          value: elasticsearch-master-0,elasticsearch-master-1,elasticsearch-master-2
        - name: discovery.seed_hosts
          value: elasticsearch-master-headless
        - name: cluster.name
          value: elasticsearch
        - name: network.host
          value: 0.0.0.0
        - name: ES_JAVA_OPTS
          value: -Xmx1g -Xms1g
        - name: node.data
          value: "true"
        - name: node.ingest
          value: "true"
        - name: node.master
          value: "true"
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: elasticsearch
        ports:
        - containerPort: 9200
          name: http
          protocol: TCP
        - containerPort: 9300
          name: transport
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - sh
            - -c
            - |
              #!/usr/bin/env bash -e
              # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )
              # Once it has started only check that the node itself is responding
              START_FILE=/tmp/.es_start_file

              http () {
                  local path="${1}"
                  if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                    BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
                  else
                    BASIC_AUTH=''
                  fi
                  curl -XGET -s -k --fail ${BASIC_AUTH} http://127.0.0.1:9200${path}
              }

              if [ -f "${START_FILE}" ]; then
                  echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
                  http "/_cluster/health?timeout=0s"
              else
                  echo 'Waiting for elasticsearch cluster to become ready (request params: "wait_for_status=green&timeout=1s" )'
                  if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                      touch ${START_FILE}
                      exit 0
                  else
                      echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                      exit 1
                  fi
              fi
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 3
          timeoutSeconds: 5
        resources: {}
        securityContext:
          capabilities:
            drop:
            - ALL
          runAsNonRoot: true
          runAsUser: 1000
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: elasticsearch-master
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        image: ccr.ccs.tencentyun.com/tke-market/elasticsearch:7.6.2
        imagePullPolicy: IfNotPresent
        name: configure-sysctl
        resources: {}
        securityContext:
          privileged: true
          runAsUser: 0
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
      terminationGracePeriodSeconds: 120
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: elasticsearch-master
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 30Gi
      volumeMode: Filesystem
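Once the three pods are Running, a quick health check (a sketch; the names match the manifest above) confirms the cluster formed:

```shell
# Wait until the StatefulSet has rolled out all replicas
kubectl -n weixnie rollout status statefulset/elasticsearch-master

# Query cluster health from inside a pod; expect "status" : "green"
# and "number_of_nodes" : 3
kubectl -n weixnie exec elasticsearch-master-0 -- \
  curl -s "http://localhost:9200/_cluster/health?pretty"
```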

2. Deploy eventrouter

Create eventrouter, then configure filebeat. Here filebeat ships logs directly to ES; if you would rather ship to Kafka first and then forward to ES, you can add a logstash to do that.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: eventrouter 
  namespace: weixnie
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eventrouter 
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eventrouter 
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: eventrouter
subjects:
- kind: ServiceAccount
  name: eventrouter
  namespace: weixnie
---
apiVersion: v1
data:
  config.json: |- 
    {
      "sink": "glog"
    }
kind: ConfigMap
metadata:
  name: eventrouter-cm
  namespace: weixnie
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: eventrouter
  namespace: weixnie
  labels:
    app: eventrouter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: eventrouter
  template:
    metadata:
      labels:
        app: eventrouter
        tier: control-plane-addons
    spec:
      containers:
        - name: kube-eventrouter
          image: baiyongjie/eventrouter:v0.2
          imagePullPolicy: IfNotPresent
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "/eventrouter -v 3 -log_dir /data/log/eventrouter"
          volumeMounts:
          - name: config-volume
            mountPath: /etc/eventrouter
          - name: log-path
            mountPath: /data/log/eventrouter
        - name: filebeat
          image: elastic/filebeat:7.6.2
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "filebeat -c /etc/filebeat/filebeat.yml"
          volumeMounts:
          - name: filebeat-config
            mountPath: /etc/filebeat/
          - name: log-path
            mountPath: /data/log/eventrouter
      serviceAccount: eventrouter
      volumes:
        - name: config-volume
          configMap:
            name: eventrouter-cm
        - name: filebeat-config
          configMap:
            name: filebeat-config
        - name: log-path
          emptyDir: {}

---
apiVersion: v1
data:
  filebeat.yml: |-
    filebeat.inputs:
      - type: log
        enabled: true
        paths:
          - "/data/log/eventrouter/*"

    setup.template.name: "tke-event"      # name of the custom index template
    setup.template.pattern: "tke-event-*" # which indices the template matches, i.e. everything starting with tke-event
    setup.template.enabled: false         # disable the default template settings
    setup.template.overwrite: true        # overwrite with the template set above
    setup.ilm.enabled: false              # ILM is on by default and forces filebeat-* index names; disable it so the custom index name takes effect

    output.elasticsearch:
      hosts: ['elasticsearch-master:9200']
      index: "tke-event-%{+yyyy.MM.dd}"
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: weixnie
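After applying the manifests above (the file name below is an assumption), confirm that eventrouter is writing log files and that the filebeat sidecar is shipping them:

```shell
# Apply the eventrouter + filebeat manifests (assumed saved as eventrouter.yaml)
kubectl apply -f eventrouter.yaml

# eventrouter writes glog files under /data/log/eventrouter in the shared emptyDir
kubectl -n weixnie exec deploy/eventrouter -c kube-eventrouter -- ls /data/log/eventrouter

# The filebeat sidecar's own log should show a successful connection to ES
kubectl -n weixnie logs deploy/eventrouter -c filebeat --tail=20
```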

To verify that collection is working, check whether the ES indices were created; if the tke-event index exists, the logs are being collected correctly:

[root@VM-55-14-tlinux ~]# curl 10.55.254.57:9200/_cat/indices
green open .kibana_task_manager_1           31GLIGOZRSWaLvCD9Qi6pw 1 1    2 0    68kb    34kb
green open .apm-agent-configuration         kWHztrKkRJG0QNAQuNc5_A 1 1    0 0    566b    283b
green open ilm-history-1-000001             rAcye5j4SCqp_mcL3r3q2g 1 1   18 0  50.6kb  25.3kb
green open tke-event-2022.04.30             R4R1MOJiSuGCczWsSu2bVA 1 1  390 0 590.3kb 281.3kb
green open .kibana_1                        NveB_wCWTkqKVqadI2DNjw 1 1   10 1 351.9kb 175.9kb
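You can also pull a sample document straight out of the index (replace 10.55.254.57 with your own ES address; the query below is just an illustration):

```shell
# Fetch the most recent event document shipped by filebeat
curl -s "http://10.55.254.57:9200/tke-event-*/_search?size=1&sort=@timestamp:desc&pretty"
```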

3. Deploy Kibana

To make searching the event logs convenient, create a Kibana instance here:

apiVersion: v1
data:
  kibana.yml: |
    elasticsearch.hosts: http://elasticsearch-master:9200
    server.host: "0"
    server.name: kibana
kind: ConfigMap
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
      - image: kibana:7.6.2
        imagePullPolicy: IfNotPresent
        name: kibana
        ports:
        - containerPort: 5601
          name: kibana
          protocol: TCP
        securityContext:
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/share/kibana/config/kibana.yml
          name: kibana
          subPath: kibana.yml
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kibana
        name: kibana
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kibana
  name: kibana
  namespace: weixnie
spec:
  ports:
  - name: 5601-5601-tcp
    port: 5601
    protocol: TCP
    targetPort: 5601
  selector:
    app: kibana
  sessionAffinity: None
  type: ClusterIP
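If you just want a quick look at the UI without exposing it, a port-forward works (a local convenience, not a production setup):

```shell
# Forward local port 5601 to the kibana Service,
# then open http://127.0.0.1:5601 in a browser
kubectl -n weixnie port-forward svc/kibana 5601:5601
```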

If nginx-ingress is installed in the cluster, you can expose Kibana on a domain through an Ingress for access:

apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx-intranet
  name: kibana-ingress
  namespace: weixnie
spec:
  rules:
  - host: kibana.tke.niewx.cn
    http:
      paths:
      - backend:
          serviceName: kibana
          servicePort: 5601
        path: /
        pathType: ImplementationSpecific

4. Test searching events

Log in to Kibana.

Then create an index pattern. The filebeat config above names all indices with the tke-event prefix, so creating a tke-event-* index pattern in Kibana is enough.

Now delete a test pod to generate some events and see whether they can be found in Kibana:

[niewx@VM-0-4-centos ~]$ k delete pod nginx-6ccd9d7969-f4rfj
pod "nginx-6ccd9d7969-f4rfj" deleted
[niewx@VM-0-4-centos ~]$ k get pod | grep nginx
nginx-6ccd9d7969-fbz9d            1/1     Running       0          23s
[niewx@VM-0-4-centos ~]$ k describe pod nginx-6ccd9d7969-fbz9d | grep -A 10 Events
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  58s   default-scheduler  Successfully assigned weixnie/nginx-6ccd9d7969-fbz9d to 172.16.22.23
  Normal  Pulling    58s   kubelet            Pulling image "nginx:latest"
  Normal  Pulled     55s   kubelet            Successfully pulled image "nginx:latest"
  Normal  Created    55s   kubelet            Created container nginx
  Normal  Started    55s   kubelet            Started container nginx

If these events show up in the Kibana search, the event logs are being persisted to ES successfully.

5. Clean up ES indices on a schedule

Event logs are stored in ES with one index per day. When event volume is high, retaining indices for too long can easily fill the disk, so we write a small script and schedule it with a CronJob to periodically delete old indices in ES.

The cleanup script clean-es-indices.sh takes two arguments: the first is how many days old an index must be before it is deleted, and the second is the ES host address. Also note the date format in the script: my index names use the date format +%Y.%m.%d, which is what the script expects; if your format differs, modify the script and rebuild the image.

#!/bin/bash

day=$1
es_host=$2

# Index date to clean, e.g. 2022.04.27 when day=3
DATE=$(date -d "${day} days ago" +%Y.%m.%d)

echo "Cleaning indices for $DATE"

# Current timestamp, for logging
time=$(date)

# Delete indices from ${day} days ago, if any exist
curl -XGET "http://${es_host}:9200/_cat/indices/?v" | grep "$DATE"
if [ $? -eq 0 ]; then
  curl -XDELETE "http://${es_host}:9200/*-${DATE}"
  echo "Cleaned $DATE indices at $time"
else
  echo "No $DATE indices to clean"
fi
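The date arithmetic is the only subtle part of the script; you can sanity-check it locally (GNU date, as in the centos:7 image used below):

```shell
# Compute the index-name date for "3 days ago" in the same
# +%Y.%m.%d format the filebeat index names use
day=3
DATE=$(date -d "${day} days ago" +%Y.%m.%d)
echo "$DATE"

# Verify it matches the YYYY.MM.DD pattern the indices are named with
echo "$DATE" | grep -Eq '^[0-9]{4}\.[0-9]{2}\.[0-9]{2}$' && echo FORMAT_OK
```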

Write a Dockerfile to bake the script into an image; the Dockerfile is as follows:

FROM centos:7
COPY clean-es-indices.sh /
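Build the image and push it to your own registry (the namespace in the tag below is a placeholder):

```shell
# Build the cleanup image from the Dockerfile above and push it
docker build -t ccr.ccs.tencentyun.com/<your-namespace>/clean-es-indices:latest .
docker push ccr.ccs.tencentyun.com/<your-namespace>/clean-es-indices:latest
```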

If you have no Docker environment to build with, you can also use the image I have already built: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest

Next, create a CronJob from this image:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    k8s-app: clean-es-indices
    qcloud-app: clean-es-indices
  name: clean-es-indices
  namespace: weixnie
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      completions: 1
      parallelism: 1
      template:
        metadata:
          labels:
            k8s-app: clean-es-indices
            qcloud-app: clean-es-indices
        spec:
          containers:
          - args:
            - sh -x /clean-es-indices.sh 3 elasticsearch-master
            command:
            - sh
            - -c
            image: ccr.ccs.tencentyun.com/nwx_registry/clean-es-indices:latest
            imagePullPolicy: Always
            name: clean-es-indices
            resources: {}
            securityContext:
              privileged: false
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: qcloudregistrykey
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: 0 */23 * * *
  successfulJobsHistoryLimit: 3
  suspend: false

Note that the schedule `0 */23 * * *` fires at minute 0 of hours 0 and 23, i.e. twice a day rather than exactly once; if you want a strict daily run, use `0 0 * * *` instead. The command arguments here are 3 and elasticsearch-master: indices older than 3 days are deleted, and because the CronJob runs in the same namespace as ES, it can reach it directly through the Service name.
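To verify the CronJob without waiting for the schedule, you can trigger a one-off Job from it (the job name here is arbitrary):

```shell
# Check the CronJob was created and when it will next run
kubectl -n weixnie get cronjob clean-es-indices

# Kick off a manual run and watch its output
kubectl -n weixnie create job --from=cronjob/clean-es-indices clean-es-manual
kubectl -n weixnie logs -f job/clean-es-manual
```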

Original statement: This article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, contact cloudcommunity@tencent.com for removal.

