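Device Plugins are a beta feature in Kubernetes 1.10 and were introduced in Kubernetes 1.8. They let third-party device vendors connect device resources to Kubernetes through plugins and provide Extended Resources to containers.
With Device Plugins, users do not need to change Kubernetes code. A third-party device vendor develops a plugin that implements the Kubernetes Device Plugins interfaces, and that is all.
The Device Plugins implementations currently attracting the most attention include:
When a device plugin starts, it exposes several gRPC services and registers itself with the kubelet through /var/lib/kubelet/device-plugins/kubelet.sock.
The Register interface is described as follows: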
pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:440
type RegistrationServer interface {
	Register(context.Context, *RegisterRequest) (*Empty, error)
}

pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:87
type RegisterRequest struct {
	// Version of the API the Device Plugin was built against
	Version string `protobuf:"bytes,1,opt,name=version,proto3" json:"version,omitempty"`
	// Name of the unix socket the device plugin is listening on
	// PATH = path.Join(DevicePluginPath, endpoint)
	Endpoint string `protobuf:"bytes,2,opt,name=endpoint,proto3" json:"endpoint,omitempty"`
	// Schedulable resource name. As of now it's expected to be a DNS Label
	ResourceName string `protobuf:"bytes,3,opt,name=resource_name,json=resourceName,proto3" json:"resource_name,omitempty"`
	// Options to be communicated with Device Manager
	Options *DevicePluginOptions `protobuf:"bytes,4,opt,name=options" json:"options,omitempty"`
}
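The socket named by Endpoint lives under the /var/lib/kubelet/device-plugins/ directory; for example, the Nvidia GPU Device Plugin uses /var/lib/kubelet/device-plugins/nvidia.sock.
ResourceName takes the form vendor-domain/resource, for example nvidia.com/gpu.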
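As a minimal sketch of how the Endpoint field maps to a socket path, the snippet below applies the rule quoted in the struct comment, PATH = path.Join(DevicePluginPath, endpoint). The registerRequest type is a hypothetical local stand-in, not the generated pluginapi type, and the constant simply repeats the directory named in the text.

```go
package main

import (
	"fmt"
	"path"
)

// devicePluginPath repeats the directory named in the text above.
const devicePluginPath = "/var/lib/kubelet/device-plugins/"

// registerRequest is a hypothetical local stand-in for the generated
// RegisterRequest type; it is not the real pluginapi struct.
type registerRequest struct {
	Version      string
	Endpoint     string
	ResourceName string
}

// socketPath follows the struct comment:
// PATH = path.Join(DevicePluginPath, endpoint).
func socketPath(req registerRequest) string {
	return path.Join(devicePluginPath, req.Endpoint)
}

func main() {
	req := registerRequest{
		Version:      "v1beta1",
		Endpoint:     "nvidia.sock",
		ResourceName: "nvidia.com/gpu",
	}
	fmt.Println(socketPath(req))
}
```

Note that Endpoint already carries the socket file name, so no extra suffix is appended.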
vendor/k8s.io/kubernetes/pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:71
type DevicePluginOptions struct {
	// Indicates if PreStartContainer call is required before each container start
	PreStartRequired bool `protobuf:"varint,1,opt,name=pre_start_required,json=preStartRequired,proto3" json:"pre_start_required,omitempty"`
}

github.com/NVIDIA/k8s-device-plugin/server.go:80
func (m *NvidiaDevicePlugin) GetDevicePluginOptions(context.Context, *pluginapi.Empty) (*pluginapi.DevicePluginOptions, error) {
	return &pluginapi.DevicePluginOptions{}, nil
}
/deviceplugin.Registration/Register
pkg/kubelet/apis/deviceplugin/v1alpha/api.pb.go:374
var _Registration_serviceDesc = grpc.ServiceDesc{
	ServiceName: "deviceplugin.Registration",
	HandlerType: (*RegistrationServer)(nil),
	Methods: []grpc.MethodDesc{
		{
			MethodName: "Register",
			Handler:    _Registration_Register_Handler,
		},
	},
	Streams:  []grpc.StreamDesc{},
	Metadata: "api.proto",
}
/deviceplugin.DevicePlugin/ListAndWatch
pkg/kubelet/apis/deviceplugin/v1alpha/api.pb.go:505
var _DevicePlugin_serviceDesc = grpc.ServiceDesc{
	ServiceName: "deviceplugin.DevicePlugin",
	HandlerType: (*DevicePluginServer)(nil),
	Methods: []grpc.MethodDesc{
		{
			MethodName: "Allocate",
			Handler:    _DevicePlugin_Allocate_Handler,
		},
	},
	Streams: []grpc.StreamDesc{
		{
			StreamName:    "ListAndWatch",
			Handler:       _DevicePlugin_ListAndWatch_Handler,
			ServerStreams: true,
		},
	},
	Metadata: "api.proto",
}
/v1beta1.Registration/Register
pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:466
var _Registration_serviceDesc = grpc.ServiceDesc{
	ServiceName: "v1beta1.Registration",
	HandlerType: (*RegistrationServer)(nil),
	Methods: []grpc.MethodDesc{
		{
			MethodName: "Register",
			Handler:    _Registration_Register_Handler,
		},
	},
	Streams:  []grpc.StreamDesc{},
	Metadata: "api.proto",
}
/v1beta1.DevicePlugin/GetDevicePluginOptions
pkg/kubelet/apis/deviceplugin/v1beta1/api.pb.go:665
var _DevicePlugin_serviceDesc = grpc.ServiceDesc{
	ServiceName: "v1beta1.DevicePlugin",
	HandlerType: (*DevicePluginServer)(nil),
	Methods: []grpc.MethodDesc{
		{
			MethodName: "GetDevicePluginOptions",
			Handler:    _DevicePlugin_GetDevicePluginOptions_Handler,
		},
		{
			MethodName: "Allocate",
			Handler:    _DevicePlugin_Allocate_Handler,
		},
		{
			MethodName: "PreStartContainer",
			Handler:    _DevicePlugin_PreStartContainer_Handler,
		},
	},
	Streams: []grpc.StreamDesc{
		{
			StreamName:    "ListAndWatch",
			Handler:       _DevicePlugin_ListAndWatch_Handler,
			ServerStreams: true,
		},
	},
	Metadata: "api.proto",
}
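The method paths listed above follow the gRPC convention "/<ServiceName>/<MethodName>". As a small illustration, the sketch below rebuilds the v1beta1 paths from the service and method names that appear in the descriptors; fullMethod is a hypothetical helper written for this post, not part of the gRPC library.

```go
package main

import "fmt"

// fullMethod builds the path gRPC uses to address a method:
// "/<ServiceName>/<MethodName>".
func fullMethod(service, method string) string {
	return "/" + service + "/" + method
}

func main() {
	// Service and method names copied from the v1beta1 descriptors above.
	fmt.Println(fullMethod("v1beta1.Registration", "Register"))
	methods := []string{"GetDevicePluginOptions", "ListAndWatch", "Allocate", "PreStartContainer"}
	for _, m := range methods {
		fmt.Println(fullMethod("v1beta1.DevicePlugin", m))
	}
}
```

Because the service name changed from deviceplugin.* to v1beta1.*, a v1alpha client and a v1beta1 server address different paths, which is why the API version matters at registration time.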
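The Device Plugins workflow is as follows:
The device plugin exposes its gRPC services on /var/lib/kubelet/device-plugins/${Endpoint}. Different API versions correspond to different service interfaces, as mentioned earlier; each interface is described below.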
Allocate
pkg/kubelet/apis/deviceplugin/v1alpha/api.proto
// DevicePlugin is the service advertised by Device Plugins
service DevicePlugin {
	// ListAndWatch returns a stream of List of Devices
	// Whenever a Device state changes or a Device disappears, ListAndWatch
	// returns the new list
	rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

	// Allocate is called during container creation so that the Device
	// Plugin can run device specific operations and instruct Kubelet
	// of the steps to make the Device available in the container
	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}
}
PreStartContainer
pkg/kubelet/apis/deviceplugin/v1beta1/api.proto
// DevicePlugin is the service advertised by Device Plugins
service DevicePlugin {
	// GetDevicePluginOptions returns options to be communicated with Device
	// Manager
	rpc GetDevicePluginOptions(Empty) returns (DevicePluginOptions) {}

	// ListAndWatch returns a stream of List of Devices
	// Whenever a Device state change or a Device disapears, ListAndWatch
	// returns the new list
	rpc ListAndWatch(Empty) returns (stream ListAndWatchResponse) {}

	// Allocate is called during container creation so that the Device
	// Plugin can run device specific operations and instruct Kubelet
	// of the steps to make the Device available in the container
	rpc Allocate(AllocateRequest) returns (AllocateResponse) {}

	// PreStartContainer is called, if indicated by Device Plugin during registeration phase,
	// before each container start. Device plugin can run device specific operations
	// such as reseting the device before making devices available to the container
	rpc PreStartContainer(PreStartContainerRequest) returns (PreStartContainerResponse) {}
}
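The device plugin registers with the kubelet through /var/lib/kubelet/device-plugins/kubelet.sock.
ListAndWatch: watches for state changes or disappearance of the corresponding devices and returns a ListAndWatchResponse to the kubelet. The ListAndWatchResponse is simply the list of devices.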
type ListAndWatchResponse struct {
	Devices []*Device `protobuf:"bytes,1,rep,name=devices" json:"devices,omitempty"`
}

type Device struct {
	// A unique ID assigned by the device plugin used
	// to identify devices during the communication
	// Max length of this field is 63 characters
	ID string `protobuf:"bytes,1,opt,name=ID,json=iD,proto3" json:"ID,omitempty"`
	// Health of the device, can be healthy or unhealthy, see constants.go
	Health string `protobuf:"bytes,2,opt,name=health,proto3" json:"health,omitempty"`
}
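The ListAndWatch behaviour described above (send the full device list once, then send it again whenever a device changes state) can be sketched without gRPC. Everything here is an assumption made for illustration: device and healthEvent are hypothetical local types, the send callback stands in for the gRPC stream, and the health strings are example values.

```go
package main

import "fmt"

// device is a hypothetical local mirror of the Device message (ID and Health).
type device struct {
	ID     string
	Health string
}

// healthEvent is a hypothetical notification that one device changed state.
type healthEvent struct {
	ID     string
	Health string
}

// snapshot copies the slice so each response is independent of later updates.
func snapshot(devs []device) []device {
	out := make([]device, len(devs))
	copy(out, devs)
	return out
}

// listAndWatch sends the full device list once, then re-sends the full list
// every time a health event arrives, until the events channel is closed.
func listAndWatch(devs []device, events <-chan healthEvent, send func([]device)) {
	send(snapshot(devs))
	for ev := range events {
		for i := range devs {
			if devs[i].ID == ev.ID {
				devs[i].Health = ev.Health
			}
		}
		send(snapshot(devs))
	}
}

func main() {
	events := make(chan healthEvent, 1)
	events <- healthEvent{ID: "gpu-1", Health: "Unhealthy"}
	close(events)

	devs := []device{{ID: "gpu-0", Health: "Healthy"}, {ID: "gpu-1", Health: "Healthy"}}
	listAndWatch(devs, events, func(list []device) { fmt.Println(list) })
}
```

The point of the sketch is that every update carries the whole list rather than a delta, so the kubelet can always replace its view of the devices wholesale.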
Allocate allows Device Plugin to run device specific operations on the Devices requested
type AllocateRequest struct {
	ContainerRequests []*ContainerAllocateRequest `protobuf:"bytes,1,rep,name=container_requests,json=containerRequests" json:"container_requests,omitempty"`
}

type ContainerAllocateRequest struct {
	DevicesIDs []string `protobuf:"bytes,1,rep,name=devicesIDs" json:"devicesIDs,omitempty"`
}

// AllocateResponse includes the artifacts that needs to be injected into
// a container for accessing 'deviceIDs' that were mentioned as part of
// 'AllocateRequest'.
// Failure Handling:
// if Kubelet sends an allocation request for dev1 and dev2.
// Allocation on dev1 succeeds but allocation on dev2 fails.
// The Device plugin should send a ListAndWatch update and fail the
// Allocation request
type AllocateResponse struct {
	ContainerResponses []*ContainerAllocateResponse `protobuf:"bytes,1,rep,name=container_responses,json=containerResponses" json:"container_responses,omitempty"`
}

type ContainerAllocateResponse struct {
	// List of environment variable to be set in the container to access one of more devices.
	Envs map[string]string `protobuf:"bytes,1,rep,name=envs" json:"envs,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
	// Mounts for the container.
	Mounts []*Mount `protobuf:"bytes,2,rep,name=mounts" json:"mounts,omitempty"`
	// Devices for the container.
	Devices []*DeviceSpec `protobuf:"bytes,3,rep,name=devices" json:"devices,omitempty"`
	// Container annotations to pass to the container runtime
	Annotations map[string]string `protobuf:"bytes,4,rep,name=annotations" json:"annotations,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
}

// DeviceSpec specifies a host device to mount into a container.
type DeviceSpec struct {
	// Path of the device within the container.
	ContainerPath string `protobuf:"bytes,1,opt,name=container_path,json=containerPath,proto3" json:"container_path,omitempty"`
	// Path of the device on the host.
	HostPath string `protobuf:"bytes,2,opt,name=host_path,json=hostPath,proto3" json:"host_path,omitempty"`
	// Cgroups permissions of the device, candidates are one or more of
	// * r - allows container to read from the specified device.
	// * w - allows container to write to the specified device.
	// * m - allows container to create device files that do not yet exist.
	Permissions string `protobuf:"bytes,3,opt,name=permissions,proto3" json:"permissions,omitempty"`
}
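To make the shape of an Allocate reply concrete, here is a minimal sketch that turns each container's requested device IDs into one response carrying an environment variable. The two struct types are hypothetical local mirrors of the messages above (only the fields used here), and exposing devices through NVIDIA_VISIBLE_DEVICES is an assumption made for illustration, not a statement of what any particular plugin must do.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical local mirrors of the request/response messages shown above.
type containerAllocateRequest struct {
	DevicesIDs []string
}

type containerAllocateResponse struct {
	Envs map[string]string
}

// allocate builds one response per container request. Passing the devices
// through the NVIDIA_VISIBLE_DEVICES environment variable is an assumption
// made for illustration.
func allocate(reqs []containerAllocateRequest) []containerAllocateResponse {
	resps := make([]containerAllocateResponse, 0, len(reqs))
	for _, r := range reqs {
		resps = append(resps, containerAllocateResponse{
			Envs: map[string]string{
				"NVIDIA_VISIBLE_DEVICES": strings.Join(r.DevicesIDs, ","),
			},
		})
	}
	return resps
}

func main() {
	reqs := []containerAllocateRequest{{DevicesIDs: []string{"GPU-aaa", "GPU-bbb"}}}
	resps := allocate(reqs)
	fmt.Println(resps[0].Envs["NVIDIA_VISIBLE_DEVICES"])
}
```

A plugin that needs device nodes or host libraries inside the container would fill Devices or Mounts in the same per-container response instead of, or in addition to, Envs.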
PreStartContainer allows Device Plugin to run device specific operations on the Devices requested.
type PreStartContainerRequest struct {
	DevicesIDs []string `protobuf:"bytes,1,rep,name=devicesIDs" json:"devicesIDs,omitempty"`
}

// PreStartContainerResponse will be send by plugin in response to PreStartContainerRequest
type PreStartContainerResponse struct {
}
GetDevicePluginOptions: currently the options contain only one field, PreStartRequired.
type DevicePluginOptions struct {
	// Indicates if PreStartContainer call is required before each container start
	PreStartRequired bool `protobuf:"varint,1,opt,name=pre_start_required,json=preStartRequired,proto3" json:"pre_start_required,omitempty"`
}
Let's look at how the Nvidia Device Plugin handles this. The relevant code is as follows:
github.com/NVIDIA/k8s-device-plugin/main.go:15
func main() {
	...
	log.Println("Starting FS watcher.")
	watcher, err := newFSWatcher(pluginapi.DevicePluginPath)
	...
	restart := true
	var devicePlugin *NvidiaDevicePlugin
L:
	for {
		if restart {
			if devicePlugin != nil {
				devicePlugin.Stop()
			}
			devicePlugin = NewNvidiaDevicePlugin()
			if err := devicePlugin.Serve(); err != nil {
				log.Println("Could not contact Kubelet, retrying. Did you enable the device plugin feature gate?")
				log.Printf("You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites")
				log.Printf("You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start")
			} else {
				restart = false
			}
		}
		select {
		case event := <-watcher.Events:
			if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
				log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
				restart = true
			}
		case err := <-watcher.Errors:
			log.Printf("inotify: %s", err)
		case s := <-sigs:
			switch s {
			case syscall.SIGHUP:
				log.Println("Received SIGHUP, restarting.")
				restart = true
			default:
				log.Printf("Received signal \"%v\", shutting down.", s)
				devicePlugin.Stop()
				break L
			}
		}
	}
}
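An fsnotify.Watcher monitors the /var/lib/kubelet/device-plugins/ directory. When the Events channel of the fsnotify.Watcher receives a Create event for kubelet.sock (which means the kubelet has restarted), the Nvidia Device Plugin restarts. In other words, only the Create event of kubelet.sock is watched. This handles kubelet restarts well, but nothing watches for the plugin's own socket being deleted. If the Nvidia Device Plugin's socket is removed by mistake, the kubelet can no longer talk to the Nvidia Device Plugin on that node over the socket, which means none of the Device Plugin's gRPC interfaces can be called.
I therefore recommend also watching for the removal of the device plugin's own socket and triggering a restart as soon as the removal is detected.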
select {
case event := <-watcher.Events:
	if event.Name == pluginapi.KubeletSocket && event.Op&fsnotify.Create == fsnotify.Create {
		log.Printf("inotify: %s created, restarting.", pluginapi.KubeletSocket)
		restart = true
	}
	// Also watch for removal of nvidia.sock (fsnotify reports deletions as fsnotify.Remove).
	if event.Name == serverSocket && event.Op&fsnotify.Remove == fsnotify.Remove {
		log.Printf("inotify: %s deleted, restarting.", serverSocket)
		restart = true
	}
...
}
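The restart rule proposed above can be isolated into a small, testable function. This is only a sketch: op is a hypothetical bitmask standing in for fsnotify.Op so that the example runs with the standard library alone, and the two socket paths are the ones used in this post.

```go
package main

import "fmt"

// op is a hypothetical bitmask standing in for fsnotify.Op.
type op uint32

const (
	opCreate op = 1 << iota
	opWrite
	opRemove
)

const (
	kubeletSocket = "/var/lib/kubelet/device-plugins/kubelet.sock"
	serverSocket  = "/var/lib/kubelet/device-plugins/nvidia.sock"
)

// shouldRestart reports whether a file event requires the plugin to restart:
// either the kubelet socket was re-created (the kubelet restarted) or the
// plugin's own socket was removed.
func shouldRestart(name string, o op) bool {
	if name == kubeletSocket && o&opCreate == opCreate {
		return true
	}
	if name == serverSocket && o&opRemove == opRemove {
		return true
	}
	return false
}

func main() {
	fmt.Println(shouldRestart(kubeletSocket, opCreate))
	fmt.Println(shouldRestart(serverSocket, opRemove))
	fmt.Println(shouldRestart(serverSocket, opWrite))
}
```

Keeping the decision in a pure function like this makes it easy to unit test both trigger conditions without touching the file system.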
Resources built into Kubernetes belong to the kubernetes.io domain, so an Extended Resource is not allowed to be advertised under the kubernetes.io domain.
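The naming rule above can be expressed as a simple check. The function below is a deliberately simplified sketch, not the validation Kubernetes itself performs: it only requires the vendor-domain/resource shape and rejects the kubernetes.io domain and its subdomains. The name validExtendedResourceName is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// validExtendedResourceName is a simplified check: the name must look like
// "vendor-domain/resource", and the domain must not be kubernetes.io or one
// of its subdomains.
func validExtendedResourceName(name string) bool {
	parts := strings.SplitN(name, "/", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return false
	}
	domain := parts[0]
	return domain != "kubernetes.io" && !strings.HasSuffix(domain, ".kubernetes.io")
}

func main() {
	for _, n := range []string{"nvidia.com/gpu", "example.com/foo", "kubernetes.io/foo", "cpu"} {
		fmt.Printf("%-20s %v\n", n, validExtendedResourceName(n))
	}
}
```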
curl --header "Content-Type: application/json-patch+json" \
  --request PATCH \
  --data '[{"op": "add", "path": "/status/capacity/example.com~1foo", "value": "5"}]' \
  http://k8s-master:8080/api/v1/nodes/k8s-node-1/status
Note: ~1 is the encoding for the character / in the patch path.
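The ~1 escape comes from JSON Pointer (RFC 6901), which JSON Patch uses for its path field: "~" is written as "~0" and "/" as "~1". As a sketch, the following builds the same patch body as the curl command for an arbitrary resource name; patchOp, escapePointerToken and capacityPatch are hypothetical helper names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// patchOp is one JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// escapePointerToken escapes a JSON Pointer reference token (RFC 6901):
// "~" becomes "~0" first, then "/" becomes "~1".
func escapePointerToken(s string) string {
	s = strings.ReplaceAll(s, "~", "~0")
	return strings.ReplaceAll(s, "/", "~1")
}

// capacityPatch builds the patch body that advertises an extended resource
// in the node's status.capacity.
func capacityPatch(resource, quantity string) ([]byte, error) {
	ops := []patchOp{{
		Op:    "add",
		Path:  "/status/capacity/" + escapePointerToken(resource),
		Value: quantity,
	}}
	return json.Marshal(ops)
}

func main() {
	body, err := capacityPatch("example.com/foo", "5")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

The order of the two replacements matters: escaping "/" first would turn it into "~1", and the "~" in that result would then be escaped again.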
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
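Here we only discuss how to schedule and use GPUs in Kubernetes 1.10.
Before Kubernetes 1.8, the official recommendation was still to enable the alpha feature gate Accelerators and to use GPUs by requesting the resource alpha.kubernetes.io/nvidia-gpu, which also required mounting the host's Nvidia libraries and driver into the container. For that approach, see my blog post "How to Use GPUs for AI Training in a Kubernetes Cluster".
Besides the points above, note the following when using the official Nvidia driver:
If your cluster contains GPU servers of different models, such as the nvidia tesla k80, p100, and v100, and different training jobs must be matched to different GPU models, first label the nodes accordingly: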
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
In the Pod, use a NodeSelector to specify the GPU model:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
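A thought: NodeSelector alone does not solve this problem well, because it requires every pod to carry the corresponding NodeSelector. For expensive and scarce GPUs such as the V100, you usually also want to keep other training jobs off those cards and reserve them for certain algorithms. In that case, put a Taint on the node and add the matching Toleration to the pods that need it.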
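To illustrate why a taint keeps unrelated jobs off the V100 nodes, here is a simplified model of the matching rule: a pod fits a node only if every taint on the node is matched by one of the pod's tolerations. The types and the accelerator=nvidia-tesla-v100 taint are hypothetical, and the matching is reduced to key, operator, value and effect; it is a sketch of the idea, not the scheduler's implementation.

```go
package main

import "fmt"

type taint struct{ Key, Value, Effect string }

type toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration matches a taint.
// Simplified: "Exists" ignores the value, anything else requires equality;
// an empty toleration effect matches every effect.
func tolerates(tol toleration, t taint) bool {
	if tol.Key != t.Key {
		return false
	}
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	return tol.Operator == "Exists" || tol.Value == t.Value
}

// schedulable reports whether every taint on the node is tolerated by the pod.
func schedulable(taints []taint, tols []toleration) bool {
	for _, t := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical taint reserving V100 nodes.
	v100 := []taint{{Key: "accelerator", Value: "nvidia-tesla-v100", Effect: "NoSchedule"}}
	allowed := []toleration{{Key: "accelerator", Operator: "Equal", Value: "nvidia-tesla-v100", Effect: "NoSchedule"}}

	fmt.Println(schedulable(v100, allowed)) // pod that carries the toleration
	fmt.Println(schedulable(v100, nil))     // ordinary pod without tolerations
}
```

Unlike a NodeSelector, the taint works by default: pods that say nothing are excluded, and only pods that opt in with a toleration can land on the reserved nodes.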
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
spec:
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
        - image: nvidia/k8s-device-plugin:1.8
          name: nvidia-device-plugin-ctr
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure. This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
        # This, along with the annotation above marks this pod as a critical add-on.
        - key: CriticalAddonsOnly
          operator: Exists
      containers:
        - image: nvidia/k8s-device-plugin:1.10
          name: nvidia-device-plugin-ctr
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
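The way Kubernetes handles critical pods is getting more and more interesting; I will find time to write a separate post about it.
A few months ago, in my post "How to Use GPUs for AI Training in a Kubernetes Cluster", I analyzed how to use GPUs in Kubernetes 1.8. In Kubernetes 1.10, Device Plugins are the recommended way to use GPUs. This post covered the principles and working mechanism of Device Plugins, Extended Resources, the Nvidia Device Plugin's failure handling and a possible improvement, and how to use and schedule GPUs. In the next post I will walk through the source code of NVIDIA/k8s-device-plugin and the kubelet device plugin manager to look more closely at how the kubelet and the Nvidia device plugin interact.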