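Cluster inspection is the process of regularly checking and evaluating a cluster system. Its main purpose is to ensure the stability, performance, and security of the cluster. The main uses of cluster inspection are as follows:
Active inspection is mostly performed by hand: an operator uses CLI tools or scripts to inject load into the cluster and observe how it responds. As a result, this approach has many shortcomings.
kdoctor addresses the problems of traditional active inspection with the following design: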
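To make the manual approach concrete, here is a minimal sketch of the kind of hand-written probe script described above. It is an illustration only: `run_probe` and the `send` callback are hypothetical names and are not part of kdoctor or any existing tool, and the stand-in target in the demo does no real network I/O.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def run_probe(
    send: Callable[[], bool],
    qps: int,
    duration_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, float]:
    """Call send() at roughly `qps` for `duration_s` seconds and summarise.

    `send` is a hypothetical callback that issues one request and returns
    True on success; any exception it raises is counted as a failure.
    """
    interval = 1.0 / qps
    total = int(qps * duration_s)
    latencies_ms: List[float] = []
    successes = 0
    for _ in range(total):
        start = clock()
        try:
            ok = send()
        except Exception:
            ok = False
        elapsed = clock() - start
        latencies_ms.append(elapsed * 1000.0)
        successes += 1 if ok else 0
        # Wait out the rest of this request slot to approximate the target rate.
        sleep(max(0.0, interval - elapsed))
    return {
        "requests": total,
        "success_rate": successes / total if total else 0.0,
        "mean_delay_ms": mean(latencies_ms) if latencies_ms else 0.0,
    }


if __name__ == "__main__":
    # Stand-in target: three successes and one failure, no real network I/O.
    outcomes = iter([True, True, False, True])
    summary = run_probe(lambda: next(outcomes), qps=2, duration_s=2,
                        sleep=lambda _: None)
    print(summary)
```

A script like this measures the cluster from a single vantage point, and its result exists only as whatever the operator prints or saves by hand.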
cat <<EOF | kubectl apply -f -
apiVersion: kdoctor.io/v1beta1
kind: NetReach
metadata:
  name: reach-task
spec:
  expect:
    meanAccessDelayInMs: 1500
    successRate: 1
  request:
    durationInSecond: 10
    perRequestTimeoutInMS: 1500
    qps: 10
  schedule:
    roundNumber: 1
    roundTimeoutMinute: 1
    schedule: 0 1
  target:
    clusterIP: true
    endpoint: true
    ingress: false
    ipv4: true
    loadBalancer: false
    multusInterface: false
    nodePort: true
EOF
~# kubectl get netreach
NAME         FINISH   EXPECTEDROUND   DONEROUND   LASTROUNDSTATUS   SCHEDULE
reach-task   true     1               1           succeed           0 1
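The numbers in the NetReach spec above relate to each other in simple ways, and a short sketch can make that arithmetic explicit. The following is an assumption-laden illustration: `nominal_requests` and `sanity_check` are hypothetical helpers, and the checks are plausibility rules chosen for this example, not kdoctor's own validation logic.

```python
from typing import Any, Dict, List

# Field values copied from the NetReach spec above.
SPEC: Dict[str, Dict[str, Any]] = {
    "expect": {"meanAccessDelayInMs": 1500, "successRate": 1},
    "request": {"durationInSecond": 10, "perRequestTimeoutInMS": 1500, "qps": 10},
    "schedule": {"roundNumber": 1, "roundTimeoutMinute": 1, "schedule": "0 1"},
}


def nominal_requests(spec: Dict[str, Dict[str, Any]]) -> int:
    """Nominal request volume: qps multiplied by the duration in seconds."""
    request = spec["request"]
    return request["qps"] * request["durationInSecond"]


def sanity_check(spec: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of plausibility problems (illustrative rules only)."""
    problems: List[str] = []
    request, expect, schedule = spec["request"], spec["expect"], spec["schedule"]
    # A round cannot finish if its timeout is shorter than the load duration.
    if schedule["roundTimeoutMinute"] * 60 < request["durationInSecond"]:
        problems.append("roundTimeoutMinute is shorter than durationInSecond")
    # A success rate is a fraction between 0 and 1.
    if not 0 <= expect["successRate"] <= 1:
        problems.append("successRate must be between 0 and 1")
    # A mean-delay threshold above the per-request timeout can never be hit
    # by requests that completed in time.
    if expect["meanAccessDelayInMs"] > request["perRequestTimeoutInMS"]:
        problems.append("meanAccessDelayInMs exceeds perRequestTimeoutInMS")
    return problems


if __name__ == "__main__":
    print(nominal_requests(SPEC))
    print(sanity_check(SPEC))
```

The nominal figure is only a rough guide: the sample report later in this article records 102 requests against one target over about 15 seconds, so the actual count and duration can differ from the configured values.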
The kdoctor controller aggregates the inspection task reports and exposes them through an aggregated API.
~# kubectl get kdoctorreport reach-task -oyaml
apiVersion: system.kdoctor.io/v1beta1
kind: KdoctorReport
metadata:
  creationTimestamp: null
  name: reach-task
spec:
  FailedRoundNumber: null
  FinishedRoundNumber: 1
  Report:
  - EndTimeStamp: "2023-09-21T11:30:33Z"
    NetReachTask:
      Detail:
      - MeanDelay: 50.294117
        Metrics:
          Duration: 15.004307799s
          EndTime: "2023-09-21T11:30:33Z"
          Errors: {}
          Latencies:
            Max_inMx: 0
            Mean_inMs: 50.294117
            Min_inMs: 0
            P50_inMs: 0
            P90_inMs: 0
            P95_inMs: 0
            P99_inMs: 0
          RequestCounts: 102
          StartTime: "2023-09-21T11:30:18Z"
          StatusCodes:
            "200": 102
          SuccessCounts: 102
          TPS: 6.798047691796755
          TotalDataSize: 39295 byte
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentClusterV4IP_10.233.32.45:80
        TargetUrl: http://10.233.32.45:80
      ....
        Succeed: true
        SucceedRate: 1
        TargetMethod: GET
        TargetName: AgentPodV4IP_kdoctor-netreach-reach-task-pmndx_10.233.74.96
        TargetUrl: http://10.233.74.96:80
    NodeName: worker-node-1
    PodName: kdoctor-netreach-reach-task-lwbtk
    ReportType: agent test report
    RoundDuration: 15.049239468s
    RoundNumber: 1
    RoundResult: succeed
    StartTimeStamp: "2023-09-21T11:30:18Z"
    TaskName: netreach.reach-task
    TaskType: NetReach
  ReportRoundNumber: 1
  RoundNumber: 1
  Status: Finished
  TaskName: reach-task
  TaskType: NetReach
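One way to read a report entry is to compare it against the `expect` thresholds of the task. The sketch below does this for the first `Detail` entry of the report above, with the values copied in by hand. It is a hedged illustration: `parse_seconds` and `evaluate` are hypothetical helpers, and the pass rule (success rate at or above `successRate`, mean delay at or below `meanAccessDelayInMs`) is an assumed reading of those fields rather than kdoctor's confirmed logic.

```python
from typing import Any, Dict

# Thresholds from the NetReach task and one Detail entry from its report.
EXPECT = {"meanAccessDelayInMs": 1500, "successRate": 1}
DETAIL: Dict[str, Any] = {
    "MeanDelay": 50.294117,
    "Metrics": {
        "Duration": "15.004307799s",
        "RequestCounts": 102,
        "SuccessCounts": 102,
        "TPS": 6.798047691796755,
    },
    "SucceedRate": 1,
    "TargetName": "AgentClusterV4IP_10.233.32.45:80",
}


def parse_seconds(text: str) -> float:
    """Parse a plain-seconds duration such as '15.004307799s'.

    Other unit suffixes (for example 'ms') are not handled and raise ValueError.
    """
    if not text.endswith("s"):
        raise ValueError(f"unsupported duration: {text!r}")
    return float(text[:-1])


def evaluate(detail: Dict[str, Any], expect: Dict[str, float]) -> Dict[str, Any]:
    """Recompute rate figures from raw counts and apply the assumed pass rule."""
    metrics = detail["Metrics"]
    success_rate = metrics["SuccessCounts"] / metrics["RequestCounts"]
    throughput = metrics["RequestCounts"] / parse_seconds(metrics["Duration"])
    passed = (
        success_rate >= expect["successRate"]
        and detail["MeanDelay"] <= expect["meanAccessDelayInMs"]
    )
    return {
        "target": detail["TargetName"],
        "success_rate": success_rate,
        "throughput": throughput,
        "passed": passed,
    }


if __name__ == "__main__":
    print(evaluate(DETAIL, EXPECT))
```

For this sample entry, dividing `RequestCounts` by `Duration` reproduces the reported `TPS` value, which suggests how that field is derived in this report.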
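kdoctor is not positioned to replace traditional, specialized testing tools, nor to deliver a complete inspection solution. Instead, it aims to provide a simple, fast, efficient, cloud-native operations testing tool that fills functional gaps in current operations testing, reduces the operations burden, and integrates inspection results into the product ecosystem.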
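Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com for removal.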