
A Cloud-Native Power Tool -- SkyWalking

Author: 用户3013098
Published: 2022-06-01 08:54:03
Column: devops运维先行者

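1 Introduction to SkyWalking

SkyWalking is an APM (application performance monitoring) system designed for microservices, cloud-native, and container-based (Docker, Kubernetes, Mesos) architectures. It provides monitoring, tracing, and diagnostics for distributed systems in cloud-native environments. Its core features are:

  • Metrics analysis for services, service instances, and endpoints
  • Root cause analysis, with code profiling at runtime
  • Service topology map analysis
  • Dependency analysis for services, service instances, and endpoints
  • Detection of slow services and endpoints
  • Performance optimization
  • Distributed tracing and context propagation
  • Database access metrics, including detection of slow database statements (SQL included)
  • Alarms
  • Browser performance monitoring

See the GitHub repository for details: https://github.com/apache/skywalking. This article walks through deploying and using SkyWalking 8.3.0 in a Kubernetes environment, hands-on.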

2 Deploying on Kubernetes

monitoring-nm.yaml

# Create the monitoring namespace
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

oap-serviceaccount.yaml

# Create the RBAC permissions SkyWalking needs
# For reference, see the Kubernetes templates under https://github.com/apache/skywalking-kubernetes/tree/master/chart/skywalking/templates
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: skywalking-oap-server
    release: 8.3.0
  name: skywalking-oap-server
  namespace: monitoring
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
  - apiGroups: [""]
    resources: ["pods","configmaps"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # ClusterRole is cluster-scoped, so it takes no namespace
  name: skywalking-oap-server
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
- apiGroups: [""]
  resources: ["pods", "endpoints", "services"]
  verbs: ["get", "watch", "list"]
# Deployments and ReplicaSets are served from "apps" on Kubernetes 1.16+; "extensions" covers older clusters
- apiGroups: ["extensions", "apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: skywalking-oap-server
subjects:
  - kind: ServiceAccount
    name: skywalking-oap-server
    namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skywalking-oap-server
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skywalking-oap-server
subjects:
- kind: ServiceAccount
  name: skywalking-oap-server
  namespace: monitoring

alarm-settings-cmp.yaml

# Create the ConfigMap that holds SkyWalking's alarm-settings.yml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alarm-settings
  namespace: monitoring
data:
  alarm-settings.yml: |
    rules:
      # Rule unique name, must be ended with `_rule`.
      # 1. Average service response time exceeded 1 s in 3 of the last 10 minutes
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 60
        message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
      # 2. Service success rate was below 80% in 2 of the last 10 minutes (service_sla is percent x 100, so 8000 means 80%)
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 60
        message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
      # 3. A service response-time percentile (p50, p75, p90, p95, p99) exceeded 1000 ms in 3 of the last 10 minutes
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 60
        message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
      # 4. Average service instance response time exceeded 1 s in 2 of the last 10 minutes
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
        message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
      # Enabling endpoint-related metrics alarms costs more memory than service and service instance metrics alarms,
      # because there are far more endpoints than services and instances.
      # 5. Average endpoint response time exceeded 1 s in 2 of the last 10 minutes
      endpoint_avg_rule:
        metrics-name: endpoint_avg
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
        message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

sky-deployment.yaml

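# Create the SkyWalking OAP Deployment. The container exposes 11800 (gRPC) and 12800 (REST), and the Service publishes both as NodePorts on the internal network so that hosts outside this cluster can reach OAP.
# For convenience, an Alibaba Cloud Elasticsearch 7.7 service is used as SkyWalking's storage; the other supported backends are listed at https://github.com/apache/skywalking/tree/master/oap-server/server-storage-plugin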
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
spec:
  replicas: 2
  selector:
    matchLabels:
      app: skywalking-oap-server
  template:
    metadata:
      labels:
        app: skywalking-oap-server
        devops: k8s-app
    spec:
      serviceAccountName: skywalking-oap-server
      containers:
      - name: skywalking-oap-server
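        # latest floats between releases; consider pinning a tag that matches the 8.3.0 release this article targets
        image: apache/skywalking-oap-server:latest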
        imagePullPolicy: IfNotPresent
        livenessProbe:
          tcpSocket:
            port: 12800
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          tcpSocket:
            port: 12800
          initialDelaySeconds: 15
          periodSeconds: 20
        securityContext:
          allowPrivilegeEscalation: false
        ports:
        - name: grpc
          containerPort: 11800
        - name: rest
          containerPort: 12800
        resources:
          requests:
            memory: "128Mi"
          limits:
            memory: "4Gi"
            cpu: 4
        env:
        - name: JAVA_OPTS
          value: "-Xmx2g -Xms2g"
        - name: SW_CLUSTER
          value: kubernetes
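        - name: SW_CLUSTER_K8S_NAMESPACE
          value: monitoring
        # OAP replicas find each other by pod label. Assumption: the 8.x default selector is
        # app=collector,release=skywalking, which does not match the labels above, so set it explicitly.
        - name: SW_CLUSTER_K8S_LABEL
          value: app=skywalking-oap-server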
        - name: SW_CONFIGURATION
          value: k8s-configmap
        - name: SW_CONFIG_CONFIGMAP_PERIOD
          value: "60"
        - name: SKYWALKING_COLLECTOR_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        - name: SW_STORAGE
          value: elasticsearch7
        - name: SW_STORAGE_ES_CLUSTER_NODES
          value: xxxxxxx.elasticsearch.aliyuncs.com:9200
        - name: SW_ES_USER
          value: elastic
        - name: SW_ES_PASSWORD
          value: xxxxx
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
        - name: alarm-settings
          mountPath: /skywalking/config/alarm-settings.yml
          readOnly: true
          subPath: alarm-settings.yml
      volumes:
      - name: zone
        hostPath:
          path: /etc/localtime
      - name: alarm-settings
        configMap:
          name: alarm-settings
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels: 
    app: skywalking-oap-server
spec:
  selector:
    app: skywalking-oap-server
  ports:
  - name: grpcport
    port: 11800
    targetPort: 11800
    protocol: TCP
    nodePort: 31180
  - name: restport
    port: 12800
    targetPort: 12800
    protocol: TCP
    nodePort: 31280
  type: NodePort

sky-ui.yaml

# Create the SkyWalking UI. SW_OAP_ADDRESS (under spec.template.spec.containers[].env) must match the Service name in sky-deployment.yaml plus the REST port. The UI hostname is exposed through a Traefik 2 IngressRoute.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
      - name: skywalking-ui
        image: apache/skywalking-ui:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: page
        resources:
          requests:
            memory: "128Mi"
          limits:
            memory: "3G"
            cpu: 2
        env:
        - name: SW_OAP_ADDRESS
          value: skywalking-oap-server:12800
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: zone
        hostPath:
          path: /etc/localtime
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: skywalking-ui
  name: skywalking-ui
  namespace: monitoring
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: page
  selector:
    app: skywalking-ui
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  entryPoints:
    - http
  routes:
    - match: Host(`sw.domain.com`) && PathPrefix(`/`) 
      kind: Rule
      priority: 10
      middlewares:
        - name: net-offical 
          namespace: default
      services:
        - name: skywalking-ui
          namespace: monitoring
          port: 80

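Apply the manifests in the order listed above with kubectl apply. Once they are applied, the SkyWalking resources can be inspected in the monitoring namespace.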
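
The ordered apply can be scripted. The following is a minimal sketch, assuming the five manifests from this section sit in the current directory under the file names used above and that kubectl is on the PATH and already points at the target cluster. The helper names are hypothetical.

```python
import subprocess

# Manifest file names from this section, in dependency order:
# namespace first, then RBAC, the alarm ConfigMap, OAP, and finally the UI.
MANIFESTS = [
    "monitoring-nm.yaml",
    "oap-serviceaccount.yaml",
    "alarm-settings-cmp.yaml",
    "sky-deployment.yaml",
    "sky-ui.yaml",
]


def build_apply_commands(manifests):
    """Return one `kubectl apply -f <file>` argument list per manifest, in order."""
    return [["kubectl", "apply", "-f", name] for name in manifests]


def apply_all(manifests):
    """Run the commands one by one; check=True stops at the first failure."""
    for argv in build_apply_commands(manifests):
        subprocess.run(argv, check=True)
```

Calling apply_all(MANIFESTS) applies everything in sequence. Stopping at the first failure keeps a missing namespace or RBAC object from being masked by later errors.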

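3 Using SkyWalking

Opening sw.domain.com in a browser shows that the SkyWalking UI is up, but every view is blank because no service is reporting to it yet.

Next, prepare the SkyWalking agent so that Java services can attach to it.

3.1 Preparing the SkyWalking Agent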

# SkyWalking Agent Dockerfile
FROM alpine:3.8
 
LABEL maintainer=xiayun
 
ENV SKYWALKING_VERSION=8.3.0
 
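# Mirrors tend to keep only recent releases; if 8.3.0 is no longer available here, the Apache archive (archive.apache.org/dist/skywalking/) keeps older tarballs
ADD http://mirrors.tuna.tsinghua.edu.cn/apache/skywalking/${SKYWALKING_VERSION}/apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz /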
 
RUN tar -zxvf /apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz && \
    mv apache-skywalking-apm-bin skywalking && \
    mv /skywalking/agent/optional-plugins/apm-trace-ignore-plugin* /skywalking/agent/plugins/ && \
    chmod -R 777 /skywalking/agent && \
    echo -e "\n# Ignore Path" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo "# see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'trace.ignore_path=${SW_AGENT_TRACE_IGNORE_PATH:/health}' >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'agent.namespace=${SW_AGENT_NAMESPACE:default-namespace}' >> /skywalking/agent/config/agent.config && \
    echo 'logging.max_file_size=${SW_LOGGING_MAX_FILE_SIZE:1073741824}' >> /skywalking/agent/config/agent.config

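Build the skywalking-agent:r1.0 image from this Dockerfile and push it to Nexus3 (for deploying Nexus3 on Kubernetes, see the previous article from the same WeChat official account, <<云原生利器 -- Nexus3>>).

3.2 Preparing the Java Service's Kubernetes Manifest

The Java service's Dockerfile must pass ${JAVA_OPTS} to the JVM, for example CMD java ${JAVA_OPTS} -jar jar-name (shell form, so the variable is expanded when the container starts), and the Kubernetes manifest supplies JAVA_OPTS as an env variable. The manifest also adds an initContainers entry that delivers the SkyWalking agent: the init container copies the agent into an emptyDir volume shared with the application container, which is the approach this article refers to as a Kubernetes sidecar.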

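# Kubernetes manifest for the Java service (ENV, MAXMEM, and <BUILD_TAG> are placeholders to substitute before applying)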
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-name
  namespace: ENV
  labels:
    prometheus: ENV-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server-name
  template:
    metadata:
      labels:
        app: server-name
        prometheus: ENV-server
        devops: k8s-app
    spec:
      initContainers:
      - name: skywalking-agent
        image: skywalking-agent:r1.0
        securityContext:
          allowPrivilegeEscalation: false
        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 100Mi
        command:
          - 'sh'
          - '-c'
          - 'set -ex;mkdir -p /vmskywalking/agent;cp -r /skywalking/agent/* /vmskywalking/agent'
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
        - name: sw-agent
          mountPath: /vmskywalking/agent
      containers:
      - name: server-name
        image: 172.16.10.13/ENV-server/server-name:<BUILD_TAG>
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
        readinessProbe:
          tcpSocket:
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          tcpSocket:
            port: 8081
          initialDelaySeconds: 300
          periodSeconds: 5
        ports:
        - name: web
          protocol: TCP 
          containerPort: 8081
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            memory: "MAXMEM"
        env:
          - name: JAVA_OPTS
            value: -javaagent:/usr/lib/agent/skywalking-agent.jar
          - name: SW_AGENT_NAME
            value: ENV-server-name
          - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
            value: skywalking-oap-server.monitoring.svc.cluster.local:11800
          - name: SW_LOGGING_LEVEL
            value: ERROR
          - name: SW_LOGGING_MAX_FILE_SIZE
            value: "1073741824"
          - name: SW_AGENT_NAMESPACE
            value: ENV
          - name: SW_MOUNT_FOLDERS
            value: plugins,activations
          - name: SW_AGENT_TRACE_IGNORE_PATH
            value: /health,/actuator/prometheus,/prometheus
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
        - name: app-logs
          mountPath: /home/admin/server-name/logs
        - name: fonts
          mountPath: /usr/share/fonts
          subPath: fonts
          readOnly: true
        - name: sw-agent
          mountPath: /usr/lib/agent
      volumes:
      - name: zone
        hostPath:
          path: /etc/localtime
      - name: app-logs
        emptyDir: {}
      - name: sw-agent
        emptyDir: {}
      - name: fonts
        persistentVolumeClaim:
          claimName: fonts
---
apiVersion: v1
kind: Service
metadata:
  name: server-name-svc
  namespace: ENV
  labels: 
    prometheus: ENV-server
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8081"
    prometheus.io/path: "/actuator/prometheus"
spec:
  selector:
    app: server-name
  ports:
  - name: web
    port: 80
    targetPort: 8081

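With the configuration in place, start the Java service. The resulting SkyWalking architecture on Kubernetes is as follows.

Alibaba Cloud Elasticsearch serves as SkyWalking's storage backend, the SkyWalking server (OAP) and UI both run on Kubernetes, and the SkyWalking agent is delivered in Kubernetes sidecar fashion so that it shares the pod with each microservice.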

3.3 Using SkyWalking

Log in to the SkyWalking UI and click refresh in the top-right corner; the newly added Java service then shows up.
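
The same check can be made without the UI by asking OAP directly. The sketch below rests on several assumptions not stated in the article: that OAP serves GraphQL at /graphql on its REST port (reached here through NodePort 31280), that the 8.x query protocol offers getAllServices(duration), and that minute-step timestamps use the "yyyy-MM-dd HHmm" form. The `<node-ip>` placeholder and the function names are hypothetical, and the time window is interpreted in the OAP server's time zone.

```python
import json
import urllib.request
from datetime import datetime, timedelta

# Assumption: OAP's REST port is reachable through the NodePort defined in
# sky-deployment.yaml; replace <node-ip> with a real node address before use.
OAP_GRAPHQL_URL = "http://<node-ip>:31280/graphql"

# Assumption: query shape recalled from the 8.x query protocol.
QUERY = """
query ($duration: Duration!) {
  services: getAllServices(duration: $duration) { id name }
}
"""


def build_payload(now, minutes=15):
    """Build the GraphQL request body for a window of `minutes` ending at `now`."""
    fmt = "%Y-%m-%d %H%M"  # minute-step timestamp, e.g. "2020-12-23 1015"
    start = now - timedelta(minutes=minutes)
    duration = {
        "start": start.strftime(fmt),
        "end": now.strftime(fmt),
        "step": "MINUTE",
    }
    return {"query": QUERY, "variables": {"duration": duration}}


def list_services(url, now=None):
    """POST the query to OAP and return the names of the services it reports."""
    body = json.dumps(build_payload(now or datetime.now())).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        data = json.load(response)
    return [service["name"] for service in data["data"]["services"]]
```

If the agent is reporting, list_services(OAP_GRAPHQL_URL) should include the name set in SW_AGENT_NAME; an empty list usually points at the agent's backend address or at network reachability.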

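The APM dashboard shows the Top 10 lists for Services Load, Slow Services, Un-Health Service, and Slow Endpoints.

The Topology view shows how services call one another across the whole environment.

The Trace view shows the details of each service's call chain.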

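If some paths should be left out of tracing, such as the monitoring URIs /health, /actuator/prometheus, and /prometheus, list them in env.SW_AGENT_TRACE_IGNORE_PATH in the Java service's Kubernetes manifest. For wildcard paths, follow the form trace.ignore_path=/your/path/1/**,/your/path/2/**; see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md for details.

The Profile and Log views are not in use here yet, so they are left for a later update.

The Alarm view shows the current alarms raised on service call chains. Alarm rules are configured in alarm-settings.yml, and alarms can be forwarded through hooks such as WebHook, Dingtalk Hook, WeChat Hook, Slack Chat Hook, and gRPCHook.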

rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 60
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.

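In this configuration, service_resp_time_rule raises an alarm when the average service response time exceeds 1 second in at least 3 of the last 10 minutes, and then stays silent for 60 minutes. An alarm rule consists of the following parts:

  • Rule name. The unique name shown in the alarm message; it must end with _rule. The metric a rule checks is distinct from the rule name and comes from the metrics list at https://github.com/apache/skywalking/blob/master/docs/en/setup/backend/backend-alarm.md#list-of-all-potential-metrics-name. Common rule examples include endpoint_percent_rule (endpoint response percentage alarm) and service_percent_rule (service response percentage alarm).
  • Metrics name. The metric name as defined in the OAL script. Only long, double, and int types are supported. See the list of all potential metrics names in the backend alarm documentation.
  • Include names. The entities this rule applies to, for example service names or endpoint names.
  • Threshold. The value the metric is compared against, using metrics-name and the operator below.
  • OP. The comparison operator. >, <, and = are supported, and contributions of further operators are welcome. For example, metrics-name: endpoint_percent, threshold: 75, op: < sends an alarm when the endpoint_percent value falls below 75%.
  • Period. How long a window the rule is checked over. It is a time window that follows the clock of the backend deployment environment.
  • Count. Within one Period window, an alarm is sent once the number of values that breach Threshold (according to OP) reaches Count.
  • Silence period. After an alarm fires at time TN, no alarm is sent during TN -> TN + period. By default it equals Period, which means the same alarm (same Id for the same Metrics name) fires at most once within a Period.

The alarms configured here are pushed to DingTalk as monitoring notifications.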
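
How Period, Count, and Silence period interact is easier to see in a small simulation. The sketch below is a simplified model of the semantics described in the list above, not OAP's actual implementation. It assumes one metric sample per minute, one rule check per minute, and the ">" operator only. The function name is hypothetical.

```python
from collections import deque


def simulate_rule(values, threshold, period, count, silence_period):
    """Return the minute indexes at which the rule would raise an alarm.

    `values` holds one metric sample per minute. The rule matches when at
    least `count` of the last `period` samples exceed `threshold`. After an
    alarm, the next `silence_period` checks stay silent.
    """
    window = deque(maxlen=period)  # sliding window of the last `period` samples
    silence_left = 0
    alarms = []
    for minute, value in enumerate(values):
        window.append(value)
        breaches = sum(1 for sample in window if sample > threshold)
        if silence_left > 0:
            silence_left -= 1  # still inside the silence period
        elif breaches >= count:
            alarms.append(minute)
            silence_left = silence_period
    return alarms


# service_resp_time_rule: threshold 1000 ms, period 10, count 3, silence-period 60
samples = [500, 1200, 1300, 800, 1500] + [2000] * 70
print(simulate_rule(samples, threshold=1000, period=10, count=3, silence_period=60))
```

With these samples the third breach arrives at minute 4, so the first alarm fires there. The service stays slow, yet the next alarm only fires at minute 65, once the 60 silent checks have passed, so the script prints [4, 65].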

References

1. https://github.com/apache/skywalking
2. https://github.com/apache/skywalking-kubernetes
3. https://skywalking-handbook.netlify.app/

Originally published on 2020-12-23 by the devops运维先行者 WeChat official account.
