This post explains the meaning of the Kubernetes v1.5 Scheduler's Predicates Policies (pre-selection / node filtering) and Priorities Policies (node scoring), with walkthroughs of selected sample code. For a more complete analysis of the Kubernetes scheduler, see my other posts: Kubernetes Scheduler Source Code Analysis and Kubernetes Scheduler Principles.
## Predicates Policies Analysis

The predicate (pre-selection) policies are implemented in /plugin/pkg/scheduler/algorithm/predicates.go.
Below is the implementation of NoDiskConflict. The other predicate policies are implemented in a similar way; they all conform to the following function prototype:

```go
type FitPredicate func(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (bool, []PredicateFailureReason, error)
```
```go
func NoDiskConflict(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	for _, v := range pod.Spec.Volumes {
		for _, ev := range nodeInfo.Pods() {
			if isVolumeConflict(v, ev) {
				return false, []algorithm.PredicateFailureReason{ErrDiskConflict}, nil
			}
		}
	}
	return true, nil, nil
}

func isVolumeConflict(volume v1.Volume, pod *v1.Pod) bool {
	// fast path if there is no conflict checking targets.
	if volume.GCEPersistentDisk == nil && volume.AWSElasticBlockStore == nil && volume.RBD == nil && volume.ISCSI == nil {
		return false
	}

	for _, existingVolume := range pod.Spec.Volumes {
		// ... (checks for GCEPersistentDisk, AWSElasticBlockStore and ISCSI volumes elided)
		if volume.RBD != nil && existingVolume.RBD != nil {
			mon, pool, image := volume.RBD.CephMonitors, volume.RBD.RBDPool, volume.RBD.RBDImage
			emon, epool, eimage := existingVolume.RBD.CephMonitors, existingVolume.RBD.RBDPool, existingVolume.RBD.RBDImage
			// two RBD images are the same if they share the same Ceph monitor, are in the same RADOS Pool, and have the same image name
			// only one read-write mount is permitted for the same RBD image.
			// same RBD image mounted by multiple Pods conflicts unless all Pods mount the image read-only
			if haveSame(mon, emon) && pool == epool && image == eimage && !(volume.RBD.ReadOnly && existingVolume.RBD.ReadOnly) {
				return true
			}
		}
	}
	return false
}
```
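To make the RBD rule concrete, here is a small, self-contained sketch. It does not use the real v1.Volume types; rbdRef and rbdConflict are simplified stand-ins I introduce purely for illustration. It reproduces the decision in the last `if` above: two pods referencing the same Ceph monitor, pool, and image only conflict when at least one of them mounts the image read-write.

```go
package main

import "fmt"

// rbdRef is a simplified stand-in for the RBD fields checked by isVolumeConflict.
type rbdRef struct {
	Monitors []string
	Pool     string
	Image    string
	ReadOnly bool
}

// haveSame reports whether the two monitor lists share at least one entry,
// mirroring the helper used by the real predicate.
func haveSame(a, b []string) bool {
	for _, x := range a {
		for _, y := range b {
			if x == y {
				return true
			}
		}
	}
	return false
}

// rbdConflict mirrors the condition in isVolumeConflict: same monitor, pool and
// image conflict unless both mounts are read-only.
func rbdConflict(a, b rbdRef) bool {
	return haveSame(a.Monitors, b.Monitors) && a.Pool == b.Pool && a.Image == b.Image &&
		!(a.ReadOnly && b.ReadOnly)
}

func main() {
	rw := rbdRef{Monitors: []string{"10.0.0.1:6789"}, Pool: "rbd", Image: "db", ReadOnly: false}
	ro := rw
	ro.ReadOnly = true

	fmt.Println(rbdConflict(rw, ro)) // true: one side mounts read-write
	fmt.Println(rbdConflict(ro, ro)) // false: both mounts are read-only
}
```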
## Priorities Policies Analysis
The scheduler currently supports a number of priority functions. Below is the implementation of ImageLocalityPriority; the other priority policies are implemented in a similar way, all conforming to the following function prototype:

```go
type PriorityMapFunction func(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error)
```
```go
func ImageLocalityPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
	node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}

	var sumSize int64
	for i := range pod.Spec.Containers {
		sumSize += checkContainerImageOnNode(node, &pod.Spec.Containers[i])
	}
	return schedulerapi.HostPriority{
		Host:  node.Name,
		Score: calculateScoreFromSize(sumSize),
	}, nil
}
```
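ImageLocalityPriorityMap relies on a helper, checkContainerImageOnNode, that is not shown above: it checks whether a container's image is already present on the node and, if so, returns its size. The following is a rough sketch of that helper as it appears in the v1.5 tree (reproduced from memory, so details may differ slightly):

```go
// checkContainerImageOnNode returns the size of the container's image if that
// image is already present in the node's status, and 0 otherwise.
func checkContainerImageOnNode(node *v1.Node, container *v1.Container) int64 {
	for _, image := range node.Status.Images {
		for _, name := range image.Names {
			if container.Image == name {
				// The image is already on the node; its size contributes to the score.
				return image.SizeBytes
			}
		}
	}
	return 0
}
```

The accumulated size is then mapped to a 0-10 score by calculateScoreFromSize: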
```go
func calculateScoreFromSize(sumSize int64) int {
	var score int
	switch {
	case sumSize == 0 || sumSize < minImgSize:
		// score == 0 means none of the images required by this pod are present on this
		// node or the total size of the images present is too small to be taken into further consideration.
		score = 0
	// If existing images' total size is larger than max, just make it highest priority.
	case sumSize >= maxImgSize:
		score = 10
	default:
		score = int((10 * (sumSize - minImgSize) / (maxImgSize - minImgSize)) + 1)
	}
	// Return which bucket the given size belongs to
	return score
}
```
The score for each node is therefore computed as score = int((10 * (sumSize - minImgSize) / (maxImgSize - minImgSize)) + 1), where minImgSize int64 = 23 * mb, maxImgSize int64 = 1000 * mb, and sumSize is the total size of the Pod's container images that are already present on the node. In other words, the larger the combined size of the Pod's required images already cached on a node, the higher that node's score, and the more likely it is to be chosen as the target node.
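As a quick sanity check of the formula, here is a small, self-contained sketch that prints the score for a few representative total image sizes. The constants are copied from the v1.5 defaults quoted above; the helper name scoreForSize is mine and simply mirrors calculateScoreFromSize.

```go
package main

import "fmt"

const (
	mb         int64 = 1024 * 1024
	minImgSize int64 = 23 * mb
	maxImgSize int64 = 1000 * mb
)

// scoreForSize mirrors calculateScoreFromSize: 0 below minImgSize,
// 10 at or above maxImgSize, and a linear interpolation into buckets 1..10 in between.
func scoreForSize(sumSize int64) int {
	switch {
	case sumSize < minImgSize:
		return 0
	case sumSize >= maxImgSize:
		return 10
	default:
		return int((10*(sumSize-minImgSize))/(maxImgSize-minImgSize)) + 1
	}
}

func main() {
	for _, size := range []int64{0, 10 * mb, 100 * mb, 500 * mb, 2000 * mb} {
		fmt.Printf("sumSize=%4d MB -> score=%d\n", size/mb, scoreForSize(size))
	}
	// Expected output: scores 0, 0, 1, 5, 10 respectively.
}
```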
The other priority policies can be analyzed in the same way, following their respective implementations and scoring algorithms.