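I can't create a node group with a GPU instance type using EKS. I get this error from CloudFormation: ! retryable error (Throttling: Rate exceeded, status code: 400, request id: 1e091568-812c-45a5-860b-d0d028513d28), will retry after a delay of 988.442104ms
Here is my clusterconfig.yaml: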
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: CLUSTER_NAME
  region: AWS_REGION
nodeGroups:
  - name: NODE_GROUP_NAME_GPU
    ami: auto
    minSize: MIN_SIZE
    maxSize: MAX_SIZE
    instancesDistribution:
      instanceTypes: ["g4dn.xlarge", "g4dn.2xlarge"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotInstancePools: 1
    privateNetworking: true
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: [SECURITY_GROUPS]
    iam:
      instanceProfileARN: IAM_PROFILE_ARN
      instanceRoleARN: IAM_ROLE_ARN
    ssh:
      allow: true
      publicKeyPath: '----'
    tags:
      k8s.io/cluster-autoscaler/node-template/taint/dedicated: nvidia.com/gpu=true
      k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true'
      k8s.io/cluster-autoscaler/enabled: 'true'
    labels:
      lifecycle: Ec2Spot
      nvidia.com/gpu: 'true'
      k8s.amazonaws.com/accelerator: nvidia-tesla
    taints:
      nvidia.com/gpu: "true:NoSchedule"
Answered on 2022-04-27 07:38:54
The solution was to install the NVIDIA plugin on the cluster so that the cluster can recognize the GPU nodes.
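The answer doesn't say how to confirm that the plugin is working. One possible check, assuming the plugin in question is the NVIDIA device plugin (which advertises GPUs to Kubernetes as the `nvidia.com/gpu` resource), is to look at each node's allocatable resources. The sketch below is only an illustration: the helper name `gpu_nodes` and the sample node data are made up for this example, and the function simply parses JSON in the shape that `kubectl get nodes -o json` returns.

```python
import json
from typing import Dict

# Resource name under which the NVIDIA device plugin exposes GPUs.
GPU_RESOURCE = "nvidia.com/gpu"


def gpu_nodes(nodes_json: str) -> Dict[str, int]:
    """Return {node name: allocatable GPU count} for nodes that advertise GPUs.

    `nodes_json` is expected to look like the output of
    `kubectl get nodes -o json` (a list object with an "items" array).
    """
    found = {}
    for node in json.loads(nodes_json).get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        count = int(allocatable.get(GPU_RESOURCE, "0"))
        if count > 0:
            found[node["metadata"]["name"]] = count
    return found


# Hypothetical sample: one node with a GPU, one node without.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "4", GPU_RESOURCE: "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "2"}}},
    ]
})

print(gpu_nodes(SAMPLE))
```

Against a real cluster, the input would be the output of `kubectl get nodes -o json` rather than the sample. An empty result would suggest that no node is advertising GPUs yet, for example because the plugin is not running on the GPU nodes.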
https://stackoverflow.com/questions/71777242