k8s是通过sceduler来调度pod的,在调度过程中,由于一些原因,会出现调度不均衡的问题,例如:
这些都会导致pod在调度过程中分配不均,例如会造成节点负载过高,引发pod触发OOM等操作造成服务不可用
其中,节点资源利用不足时是最容易出现问题的,例如,设置的requests和limits不合理,或者没有设置requests/limits都会造成调度不均衡
在这之前,我们需要先装一个metrics,安装方法可参考:k8s的metrics部署
Scheduler在调度过程中,经过了预选阶段和优选阶段,选出一个可用的node节点,来把pod调度到该节点上。那么在预选阶段和优选阶段是如何选出node节点的呢?
最根本的一个调度策略就是判断节点是否有可分配的资源,我们可以通过以下kubectl describe node node名
来查看,现在按照这个调度策略来分析下
查看当前的节点资源占用情况
可以看到,当前的k8s集群共有三个node节点,但是节点的资源分布情况极其不均匀,而实际上,k8s在进行调度时,计算的就是requests的值,不管你limits设置多少,k8s都不关心。所以当这个值没有达到资源瓶颈时,理论上,该节点就会一直有pod调度上去。所以这个时候就会出现调度不均衡的问题。有什么解决办法?
Descheduler 的出现就是为了解决 Kubernetes 自身调度(一次性调度)不足的问题。它以定时任务方式运行,根据已实现的策略,重新去平衡 pod 在集群中的分布。
截止目前,Descheduler 已实现的策略和计划中的功能点如下:
已实现的调度策略
路线图中计划实现的功能点
thresholds
) 决定。这组阈值是以百分比方式指定了 CPU、内存以及 pod数量 的。只有当所有被评估资源都低于它们的阈值时,该 node 节点才会被认为处于利用不足状态。 c. 同时还存在一个 目标阈值(targetThresholds
),用于评估那些节点是否因为超出了阈值而应该从其上驱逐 pod。任何阈值介于 thresholds
和 targetThresholds
之间的节点都被认为资源被合理利用了,因此不会发生 pod 驱逐行为(无论是被驱逐走还是被驱逐来)。 d. 与之相关的还有另一个参数numberOfNodes
,这个参数用来激活指定数量的节点是否处于资源利用不足状态而发生 pod 驱逐行为。当 Descheduler 调度器决定于驱逐 pod 时,它将遵循下面的机制:
Descheduler 会以 Job 形式在 pod 内运行,因为 Job 具有多次运行而无需人为介入的优势。为了避免被自己驱逐 Descheduler 将会以 关键型 pod 运行,因此它只能被创建建到 kube-system
namespace 内。 关于 Critical pod 的介绍请参考:Guaranteed Scheduling For Critical Add-On Pods
要使用 Descheduler,我们需要编译该工具并构建 Docker 镜像,创建 ClusterRole、ServiceAccount、ClusterRoleBinding、ConfigMap 以及 Job。
yaml文件下载地址:https://github.com/kubernetes-sigs/descheduler
git clone https://github.com/kubernetes-sigs/descheduler.git
kubectl create -f kubernetes/rbac.yaml
kubectl create -f kubernetes/configmap.yaml
kubectl create -f kubernetes/job.yaml
kubectl create -f kubernetes/rbac.yaml
kubectl create -f kubernetes/configmap.yaml
kubectl create -f kubernetes/cronjob.yaml
两种方式,一种是以任务的形式启动,另一种是以计划任务的形式启动,建议以计划任务方式去启动
启动之后,可以来验证下descheduler是否启动成功
# kubectl get pod -n kube-system | grep descheduler
descheduler-job-6qtk2 1/1 Running 0 158m
再来验证下pod是否分布均匀
可以看到,目前node02这个节点的pod数是20个,相比较其他节点,还是差了几个,那么我们只对pod数量做重平衡的话,可以对descheduler做如下的配置修改
# cat kubernetes/configmap.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
name: descheduler-policy-configmap
namespace: kube-system
data:
policy.yaml: |
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"RemoveDuplicates":
enabled: true
"RemovePodsViolatingInterPodAntiAffinity":
enabled: true
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds: #阈值
#"cpu" : 20 #注释掉下面这些关于cpu和内存的配置项
#"memory": 20
"pods": 24 #把pod的数值调高一些
targetThresholds: #目标阈值
#"cpu" : 50
#"memory": 50
"pods": 25
修改完成后,重启下即可
kubectl delete -f kubernetes/configmap.yaml
kubectl apply -f kubernetes/configmap.yaml
kubectl delete -f kubernetes/cronjob.yaml
kubectl apply -f kubernetes/cronjob.yaml
然后,看下Descheduler的调度日志
# kubectl logs -n kube-system descheduler-job-9rc9h
I0729 08:48:45.361655 1 lownodeutilization.go:151] Node "k8s-node02" is under utilized with usage: api.ResourceThresholds{"cpu":44.375, "memory":24.682000160690105, "pods":22.727272727272727}
I0729 08:48:45.361772 1 lownodeutilization.go:154] Node "k8s-node03" is over utilized with usage: api.ResourceThresholds{"cpu":49.375, "memory":27.064916842870552, "pods":24.545454545454547}
I0729 08:48:45.361807 1 lownodeutilization.go:151] Node "k8s-master01" is under utilized with usage: api.ResourceThresholds{"cpu":50, "memory":3.6347778465158265, "pods":8.181818181818182}
I0729 08:48:45.361828 1 lownodeutilization.go:151] Node "k8s-master02" is under utilized with usage: api.ResourceThresholds{"cpu":40, "memory":0, "pods":5.454545454545454}
I0729 08:48:45.361863 1 lownodeutilization.go:151] Node "k8s-master03" is under utilized with usage: api.ResourceThresholds{"cpu":40, "memory":0, "pods":5.454545454545454}
I0729 08:48:45.361977 1 lownodeutilization.go:154] Node "k8s-node01" is over utilized with usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":27.272727272727273}
I0729 08:48:45.361994 1 lownodeutilization.go:66] Criteria for a node under utilization: CPU: 0, Mem: 0, Pods: 23
I0729 08:48:45.362016 1 lownodeutilization.go:73] Total number of underutilized nodes: 4
I0729 08:48:45.362025 1 lownodeutilization.go:90] Criteria for a node above target utilization: CPU: 0, Mem: 0, Pods: 23
I0729 08:48:45.362033 1 lownodeutilization.go:92] Total number of nodes above target utilization: 2
I0729 08:48:45.362051 1 lownodeutilization.go:202] Total capacity to be moved: CPU:0, Mem:0, Pods:55.2
I0729 08:48:45.362059 1 lownodeutilization.go:203] ********Number of pods evicted from each node:***********
I0729 08:48:45.362066 1 lownodeutilization.go:210] evicting pods from node "k8s-node01" with usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":27.272727272727273}
I0729 08:48:45.362236 1 lownodeutilization.go:213] allPods:30, nonRemovablePods:3, bestEffortPods:2, burstablePods:25, guaranteedPods:0
I0729 08:48:45.362246 1 lownodeutilization.go:217] All pods have priority associated with them. Evicting pods based on priority
I0729 08:48:45.381931 1 evictions.go:102] Evicted pod: "flink-taskmanager-7c7557d6bc-ntnp2" in namespace "default"
I0729 08:48:45.381967 1 lownodeutilization.go:270] Evicted pod: "flink-taskmanager-7c7557d6bc-ntnp2"
I0729 08:48:45.381980 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":26.363636363636363}
I0729 08:48:45.382268 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"flink-taskmanager-7c7557d6bc-ntnp2", UID:"6a5374de-a204-4d2c-a302-ff09c054a43b", APIVersion:"v1", ResourceVersion:"4945574", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.399567 1 evictions.go:102] Evicted pod: "flink-taskmanager-7c7557d6bc-t2htk" in namespace "default"
I0729 08:48:45.399613 1 lownodeutilization.go:270] Evicted pod: "flink-taskmanager-7c7557d6bc-t2htk"
I0729 08:48:45.399626 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":46.875, "memory":32.25716687667426, "pods":25.454545454545453}
I0729 08:48:45.400503 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"flink-taskmanager-7c7557d6bc-t2htk", UID:"bd255dbc-bb05-4258-ac0b-e5be3dc4efe8", APIVersion:"v1", ResourceVersion:"4705479", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.450568 1 evictions.go:102] Evicted pod: "oauth-center-tools-api-645d477bcf-hnb8g" in namespace "default"
I0729 08:48:45.450603 1 lownodeutilization.go:270] Evicted pod: "oauth-center-tools-api-645d477bcf-hnb8g"
I0729 08:48:45.450619 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":45.625, "memory":31.4545002047819, "pods":24.545454545454543}
I0729 08:48:45.451240 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"oauth-center-tools-api-645d477bcf-hnb8g", UID:"caba0aa8-76de-4e23-b163-c660df0ba54d", APIVersion:"v1", ResourceVersion:"3800151", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.477605 1 evictions.go:102] Evicted pod: "dazzle-core-api-5d4c899b84-xhlkl" in namespace "default"
I0729 08:48:45.477636 1 lownodeutilization.go:270] Evicted pod: "dazzle-core-api-5d4c899b84-xhlkl"
I0729 08:48:45.477649 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":44.375, "memory":30.65183353288954, "pods":23.636363636363633}
I0729 08:48:45.477992 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"dazzle-core-api-5d4c899b84-xhlkl", UID:"ce216892-6c50-4c31-b30a-cbe5c708285e", APIVersion:"v1", ResourceVersion:"3800074", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.523774 1 request.go:557] Throttling request took 141.499557ms, request: POST:https://10.96.0.1:443/api/v1/namespaces/default/events
I0729 08:48:45.569073 1 evictions.go:102] Evicted pod: "live-foreignapi-api-7bc679b789-z8jnr" in namespace "default"
I0729 08:48:45.569105 1 lownodeutilization.go:270] Evicted pod: "live-foreignapi-api-7bc679b789-z8jnr"
I0729 08:48:45.569119 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":43.125, "memory":29.84916686099718, "pods":22.727272727272723}
I0729 08:48:45.569151 1 lownodeutilization.go:236] 6 pods evicted from node "k8s-node01" with usage map[cpu:43.125 memory:29.84916686099718 pods:22.727272727272723]
I0729 08:48:45.569172 1 lownodeutilization.go:210] evicting pods from node "k8s-node03" with usage: api.ResourceThresholds{"cpu":49.375, "memory":27.064916842870552, "pods":24.545454545454547}
I0729 08:48:45.569418 1 lownodeutilization.go:213] allPods:27, nonRemovablePods:2, bestEffortPods:0, burstablePods:25, guaranteedPods:0
I0729 08:48:45.569430 1 lownodeutilization.go:217] All pods have priority associated with them. Evicting pods based on priority
I0729 08:48:45.603962 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"live-foreignapi-api-7bc679b789-z8jnr", UID:"37c698e3-b63e-4ef1-917b-ac6bc1be05e0", APIVersion:"v1", ResourceVersion:"3800113", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.639483 1 evictions.go:102] Evicted pod: "dazzle-contentlib-api-575f599994-khdn5" in namespace "default"
I0729 08:48:45.639512 1 lownodeutilization.go:270] Evicted pod: "dazzle-contentlib-api-575f599994-khdn5"
I0729 08:48:45.639525 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":48.125, "memory":26.26225017097819, "pods":23.636363636363637}
I0729 08:48:45.645446 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"default", Name:"dazzle-contentlib-api-575f599994-khdn5", UID:"068aa2ad-f160-4aaa-b25b-f0a9603f9011", APIVersion:"v1", ResourceVersion:"3674763", FieldPath:""}): type: 'Normal' reason: 'Descheduled' pod evicted by sigs.k8s.io/descheduler
I0729 08:48:45.780324 1 evictions.go:102] Evicted pod: "dazzle-datasync-task-577c46668-lltg4" in namespace "default"
I0729 08:48:45.780544 1 lownodeutilization.go:270] Evicted pod: "dazzle-datasync-task-577c46668-lltg4"
I0729 08:48:45.780565 1 lownodeutilization.go:283] updated node usage: api.ResourceThresholds{"cpu":46.875, "memory":25.45958349908583, "pods":22.727272727272727}
I0729 08:48:45.780600 1 lownodeutilization.go:236] 4 pods evicted from node "k8s-node03" with usage map[cpu:46.875 memory:25.45958349908583 pods:22.727272727272727]
I0729 08:48:45.780620 1 lownodeutilization.go:102] Total number of pods evicted: 11
通过这个日志,可以看到
Node “k8s-node01” is over utilized,然后就是有提示evicting pods from node “k8s-node01”,这就说明,Descheduler已经在重新调度了,最终调度结果如下: