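SkyWalking is an APM (application performance monitoring) system designed for microservices, cloud-native, and container-based (Docker, Kubernetes, Mesos) architectures. It provides monitoring, tracing, and diagnostics for distributed systems in cloud-native architectures. Its core features are as follows:

# Create the namespace: monitoring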
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

# Create the RBAC resources SkyWalking needs.
# For reference, see the Kubernetes templates under https://github.com/apache/skywalking-kubernetes/tree/master/chart/skywalking/templates
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: skywalking-oap-server
    release: 8.3.0
  name: skywalking-oap-server
  namespace: monitoring
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
  - apiGroups: [""]
    resources: ["pods","configmaps"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # A ClusterRole is cluster-scoped, so it takes no namespace.
  name: skywalking-oap-server
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
- apiGroups: [""]
  resources: ["pods", "endpoints", "services"]
  verbs: ["get", "watch", "list"]
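# Deployments and ReplicaSets are served from the "apps" group on current clusters;
# "extensions" is kept for older ones.
- apiGroups: ["extensions", "apps"]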
  resources: ["deployments", "replicasets"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: skywalking-oap-server
subjects:
  - kind: ServiceAccount
    name: skywalking-oap-server
    namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skywalking-oap-server
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skywalking-oap-server
subjects:
- kind: ServiceAccount
  name: skywalking-oap-server
  namespace: monitoring

# Create the ConfigMap that holds SkyWalking's alarm-settings.yml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alarm-settings
  namespace: monitoring
data:
  alarm-settings.yml: |
    rules:
      # Each rule name must be unique and must end with `_rule`.
      # 1. Average service response time exceeded 1 second in 3 of the last 10 minutes.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 60
        message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
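      # 2. Service success rate fell below 80% in 2 of the last 10 minutes (service_sla is expressed in units of 0.01%, so 8000 means 80%).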
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 60
        message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
      # 3. At least one service response-time percentile (p50, p75, p90, p95, p99) exceeded 1000 ms in 3 of the last 10 minutes.
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 60
        message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
      # 4. Average service instance response time exceeded 1 second in 2 of the last 10 minutes.
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
        message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
      # Endpoint-level alarms cost more memory than service and service instance alarms,
      # because there are far more endpoints than services and instances.
      # 5. Average endpoint response time exceeded 1 second in 2 of the last 10 minutes.
      endpoint_avg_rule:
        metrics-name: endpoint_avg
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
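        message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

# Create the SkyWalking OAP Deployment. The container opens ports 11800 (gRPC) and 12800 (REST), and the Service exposes them to the internal network as NodePorts so that hosts outside this Kubernetes cluster can reach the OAP server.
# For convenience, Alibaba Cloud's Elasticsearch 7.7 service is used as SkyWalking's storage backend. Other supported backends are listed at https://github.com/apache/skywalking/tree/master/oap-server/server-storage-plugin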
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
spec:
  replicas: 2
  selector:
    matchLabels:
      app: skywalking-oap-server
  template:
    metadata:
      labels:
        app: skywalking-oap-server
        devops: k8s-app
    spec:
      serviceAccountName: skywalking-oap-server
      containers:
      - name: skywalking-oap-server
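        # Pinned to match the 8.3.0 release label and the elasticsearch7 storage selector; `latest` can drift to an incompatible release.
        image: apache/skywalking-oap-server:8.3.0-es7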
        imagePullPolicy: IfNotPresent
        livenessProbe:
          tcpSocket:
            port: 12800
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          tcpSocket:
            port: 12800
          initialDelaySeconds: 15
          periodSeconds: 20
        securityContext:
          allowPrivilegeEscalation: false
        ports:
        - name: grpc
          containerPort: 11800
        - name: rest
          containerPort: 12800
        resources:
          requests:
            memory: "128Mi"
          limits:
            memory: "4Gi"
            cpu: 4
        env:
        - name: JAVA_OPTS
          value: "-Xmx2g -Xms2g"
        - name: SW_CLUSTER
          value: kubernetes
        - name: SW_CLUSTER_K8S_NAMESPACE
          value: monitoring
        - name: SW_CONFIGURATION
          value: k8s-configmap
        - name: SW_CONFIG_CONFIGMAP_PERIOD
          value: "60"
        - name: SKYWALKING_COLLECTOR_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        - name: SW_STORAGE
          value: elasticsearch7
        - name: SW_STORAGE_ES_CLUSTER_NODES
          value: xxxxxxx.elasticsearch.aliyuncs.com:9200
        - name: SW_ES_USER
          value: elastic
        - name: SW_ES_PASSWORD
          value: xxxxx
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
        - name: alarm-settings
          mountPath: /skywalking/config/alarm-settings.yml
          readOnly: true
          subPath: alarm-settings.yml
      volumes:
      - name: zone
        hostPath:
          path: /etc/localtime
      - name: alarm-settings
        configMap:
          name: alarm-settings
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels: 
    app: skywalking-oap-server
spec:
  selector:
    app: skywalking-oap-server
  ports:
  - name: grpcport
    port: 11800
    targetPort: 11800
    protocol: TCP
    nodePort: 31180
  - name: restport
    port: 12800
    targetPort: 12800
    protocol: TCP
    nodePort: 31280
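  type: NodePort

# Create the SkyWalking UI. Note that the SW_OAP_ADDRESS entry under spec.template.spec.containers.env must match the name of the OAP Service defined in sky-deployment.yaml, followed by the REST port. The UI is exposed under a domain name through a Traefik 2 IngressRoute.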
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
      - name: skywalking-ui
        # Pinned to the same release as the OAP server; `latest` can drift to an incompatible release.
        image: apache/skywalking-ui:8.3.0
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 8080
          name: page
        resources:
          requests:
            memory: "128Mi"
          limits:
            memory: "3G"
            cpu: 2
        env:
        - name: SW_OAP_ADDRESS
          value: skywalking-oap-server:12800
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: zone
        hostPath:
          path: /etc/localtime
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: skywalking-ui
  name: skywalking-ui
  namespace: monitoring
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: page
  selector:
    app: skywalking-ui
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  entryPoints:
    - http
  routes:
    - match: Host(`sw.domain.com`) && PathPrefix(`/`) 
      kind: Rule
      priority: 10
      middlewares:
        - name: net-offical 
          namespace: default
      services:
        - name: skywalking-ui
          namespace: monitoring
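          port: 80

Apply these manifests one at a time, in order, with kubectl apply to deploy SkyWalking. Once the deployment finishes, you can inspect the resulting SkyWalking resources.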
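As an alternative to applying each file by hand, the manifests could be listed in a Kustomize file and applied together. This is only a sketch, not the setup used in this article: apart from sky-deployment.yaml, every file name below is a hypothetical placeholder for wherever you saved the corresponding manifest.

```yaml
# kustomization.yaml
# Assumption: the manifests from this article were saved under these
# (mostly hypothetical) file names in the same directory.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - sky-namespace.yaml      # Namespace "monitoring"
  - sky-rbac.yaml           # ServiceAccount, Role, ClusterRole and bindings
  - sky-alarm-settings.yaml # ConfigMap "alarm-settings"
  - sky-deployment.yaml     # OAP server Deployment and NodePort Service
  - sky-ui.yaml             # UI Deployment, Service and IngressRoute
```

With this file in place, `kubectl apply -k .` applies everything in one step, and `kubectl -n monitoring get pods,svc` shows the resulting resources.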
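When you open sw.domain.com in a browser, the SkyWalking UI is up, but everything is blank because no service has been connected yet.
Next, we prepare the SkyWalking Agent so that Java services can attach to it.

# SkyWalking Agent Dockerfile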
FROM alpine:3.8
 
LABEL maintainer=xiayun
 
ENV SKYWALKING_VERSION=8.3.0
 
ADD http://mirrors.tuna.tsinghua.edu.cn/apache/skywalking/${SKYWALKING_VERSION}/apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz /
 
RUN tar -zxvf /apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz && \
    mv apache-skywalking-apm-bin skywalking && \
    mv /skywalking/agent/optional-plugins/apm-trace-ignore-plugin* /skywalking/agent/plugins/ && \
    chmod -R 777 /skywalking/agent && \
    echo -e "\n# Ignore Path" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo "# see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'trace.ignore_path=${SW_AGENT_TRACE_IGNORE_PATH:/health}' >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'agent.namespace=${SW_AGENT_NAMESPACE:default-namespace}' >> /skywalking/agent/config/agent.config && \
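    echo 'logging.max_file_size=${SW_LOGGING_MAX_FILE_SIZE:1073741824}' >> /skywalking/agent/config/agent.config

Build the skywalking-agent:r1.0 image from this SkyWalking Agent Dockerfile and push it to Nexus3 (for deploying Nexus3 on Kubernetes, see the previous article on the author's WeChat official account, "云原生利器 -- Nexus3").
The Java service's Dockerfile must pass ${JAVA_OPTS} to the JVM, and the value is then supplied as an env variable in the Kubernetes manifest. For example:
CMD java ${JAVA_OPTS} -jar jar-name
Then add initContainers to the Java service's Kubernetes manifest. The init container copies the SkyWalking agent into a volume shared with the application container, which the author describes as a Kubernetes sidecar-style deployment.

# Kubernetes manifest for the Java service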
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-name
  namespace: ENV
  labels:
    prometheus: ENV-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server-name
  template:
    metadata:
      labels:
        app: server-name
        prometheus: ENV-server
        devops: k8s-app
    spec:
      initContainers:
      - name: skywalking-agent
        image: skywalking-agent:r1.0
        securityContext:
          allowPrivilegeEscalation: false
        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 100Mi
        command:
          - 'sh'
          - '-c'
          - 'set -ex;mkdir -p /vmskywalking/agent;cp -r /skywalking/agent/* /vmskywalking/agent'
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
        - name: sw-agent
          mountPath: /vmskywalking/agent
      containers:
      - name: server-name
        image: 172.16.10.13/ENV-server/server-name:<BUILD_TAG>
        imagePullPolicy: Always
        securityContext:
          allowPrivilegeEscalation: false
        readinessProbe:
          tcpSocket:
            port: 8081
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          tcpSocket:
            port: 8081
          initialDelaySeconds: 300
          periodSeconds: 5
        ports:
        - name: web
          protocol: TCP 
          containerPort: 8081
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            memory: "MAXMEM"
        env:
          - name: JAVA_OPTS
            value: -javaagent:/usr/lib/agent/skywalking-agent.jar
          - name: SW_AGENT_NAME
            value: ENV-server-name
          - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
            value: skywalking-oap-server.monitoring.svc.cluster.local:11800
          - name: SW_LOGGING_LEVEL
            value: ERROR
          - name: SW_LOGGING_MAX_FILE_SIZE
            value: "1073741824"
          - name: SW_AGENT_NAMESPACE
            value: ENV
          - name: SW_MOUNT_FOLDERS
            value: plugins,activations
          - name: SW_AGENT_TRACE_IGNORE_PATH
            value: /health,/actuator/prometheus,/prometheus
        volumeMounts:
        - name: zone
          mountPath: /etc/localtime
          readOnly: true
        - name: app-logs
          mountPath: /home/admin/server-name/logs
        - name: fonts
          mountPath: /usr/share/fonts
          subPath: fonts
          readOnly: true
        - name: sw-agent
          mountPath: /usr/lib/agent
      volumes:
      - name: zone
        hostPath:
          path: /etc/localtime
      - name: app-logs
        emptyDir: {}
      - name: sw-agent
        emptyDir: {}
      - name: fonts
        persistentVolumeClaim:
          claimName: fonts
---
apiVersion: v1
kind: Service
metadata:
  name: server-name-svc
  namespace: ENV
  labels: 
    prometheus: ENV-server
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8081"
    prometheus.io/path: "/actuator/prometheus"
spec:
  selector:
    app: server-name
  ports:
  - name: web
    port: 80
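    targetPort: 8081

Once the configuration is complete, start the Java service. Here is the resulting SkyWalking architecture on Kubernetes:
Alibaba Cloud Elasticsearch serves as SkyWalking's storage backend, the SkyWalking server and UI both run on Kubernetes, and the SkyWalking agent is delivered in what the author calls the Kubernetes sidecar pattern, sharing the container's filesystem with the microservice.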
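Because the storage backend here is an external Elasticsearch reached with a username and password, one option is to keep the password in a Kubernetes Secret rather than as a literal env value. This is not part of the original setup; it is a minimal sketch in which the Secret name skywalking-es-credentials and its key are illustrative assumptions.

```yaml
# Hypothetical Secret holding the Elasticsearch password.
apiVersion: v1
kind: Secret
metadata:
  name: skywalking-es-credentials
  namespace: monitoring
type: Opaque
stringData:
  password: xxxxx   # placeholder, same as in the OAP Deployment
```

The OAP container would then read the value through a secretKeyRef. The fragment below is only the relevant part of the container's env list, not a complete manifest:

```yaml
env:
  - name: SW_ES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: skywalking-es-credentials   # hypothetical Secret from the sketch above
        key: password
```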
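Log in to the SkyWalking UI and click refresh in the upper-right corner; the newly added Java service appears.
On the APM dashboard, you can see the top 10 for Services Load, Slow Services, Un-Health Service, and Slow Endpoints. The topology view shows the call relationships between services across the whole environment.
The trace view shows the details of each service's call chain.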
To exclude certain paths from tracing, such as the monitoring URIs /health, /actuator/prometheus, and /prometheus, set env.SW_AGENT_TRACE_IGNORE_PATH in the Java service's Kubernetes manifest. For wildcard paths, follow the pattern trace.ignore_path=/your/path/1/**,/your/path/2/**. For details, see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md
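A wildcard version of the ignore list might look like the following. This is a sketch of the Java container's env list only; the /api/internal path is a hypothetical example, and the pattern semantics should be checked against the trace-ignore plugin documentation for your agent version.

```yaml
env:
  - name: SW_AGENT_TRACE_IGNORE_PATH
    # Comma-separated patterns. `**` is meant to match any number of path
    # segments below the prefix, so /actuator/** covers /actuator/prometheus.
    value: "/health,/prometheus,/actuator/**,/api/internal/**"
```

The value is quoted so that the pattern characters are always read as a plain YAML string.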
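Profiling and logs are not in use yet, so they are not covered here; they may be added in a later update.
The alarm view shows the alarm details for the current services' call chains. Alarm rules are configured in alarm-settings.yml, and alarms can be forwarded to webhooks such as the Dingtalk Hook, WeChat Hook, Slack Chat Hook, and gRPCHook.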
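To illustrate the webhook side, hook targets are declared in alarm-settings.yml next to the rules. The sketch below follows the key names from the SkyWalking 8.x alarm documentation as best recalled, so verify them against the docs for your release; every URL, token, and secret is a placeholder.

```yaml
# Placed at the same level as `rules:` inside alarm-settings.yml
# (inside the ConfigMap, indent it to match the `rules:` key).

# Generic HTTP webhooks: alarms are POSTed to each URL.
webhooks:
  - http://alert-receiver.example.internal/skywalking/alarm   # hypothetical receiver

# Dingtalk hook: message template plus one or more robot URLs.
dingtalkHooks:
  textTemplate: |-
    {
      "msgtype": "text",
      "text": {
        "content": "Apache SkyWalking Alarm: \n %s."
      }
    }
  webhooks:
    - url: https://oapi.dingtalk.com/robot/send?access_token=dummy_token
      secret: dummysecret
```

Since the OAP Deployment mounts this file from the alarm-settings ConfigMap, a change here is made by updating the ConfigMap.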
rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 60
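    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.

In this configuration, service_resp_time_rule raises an alarm when the average service response time exceeds 1 second in at least 3 of the last 10 minutes, and then stays silent for 60 minutes. The main points of an alarm rule are as follows: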
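The keys of a rule can be summarized by annotating the same rule inline. This sketch restates the comments already present in the alarm-settings.yml ConfigMap earlier in the article; the unit notes follow the rule's own message text.

```yaml
rules:
  # The rule name must be unique and end with `_rule`.
  service_resp_time_rule:
    # Metric to evaluate; its value must be a long, double or int.
    metrics-name: service_resp_time
    # Comparison applied as: metric value <op> threshold.
    op: ">"
    # Threshold in the metric's own unit (milliseconds for response time).
    threshold: 1000
    # Length of the evaluation window, in minutes.
    period: 10
    # Number of matches inside the window that triggers the alarm.
    count: 3
    # How long the alarm stays silent after firing; defaults to the period.
    silence-period: 60
    # Text sent with the alarm; {name} is replaced by the entity name.
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
```

Read together, period, count, and threshold mean "more than 1000 ms in 3 of the last 10 minutes", which is what the message states.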
1. https://github.com/apache/skywalking
2. https://github.com/apache/skywalking-kubernetes
3. https://skywalking-handbook.netlify.app/