Exploring LLM-based AIOps: Building a GPU-enabled Kubernetes Cluster

用户10763634

Last modified 2024-05-15 12:44:18

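Background

In this workshop, we show how to use K3S to set up a GPU-enabled Kubernetes cluster on AWS, install the NVIDIA driver and plugins, and deploy GPU workloads. By combining LangChain and Ollama, you can extend the cluster to support advanced use cases such as ticket management, Git PR checks, code review, and automatic pipeline creation. This integration can make IT operations work more efficient and more automated.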

Prerequisites

A GPU cloud instance on AWS (for example, g4dn.xlarge running Ubuntu 22.04)
Basic knowledge of Kubernetes and Helm

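Technology choices and system architecture

Cloud platform: create a GPU instance on AWS to provide the base compute resources.
Container orchestration: use K3S, a lightweight Kubernetes distribution, to manage containerized applications.
Drivers and plugins: install and configure the NVIDIA driver and plugins (nvidia-container-runtime, nvidia-device-plugin).
Application layer: deploy Ollama to expose an API, and integrate LangChain to implement advanced operations automation.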

Step-by-step guide

Prepare the AWS instance

First, set up your AWS GPU cloud instance. Once the instance is running, connect to it over SSH.

ssh -i your-key.pem ubuntu@your-instance-ip

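Install the NVIDIA driver and tools

Add the NVIDIA graphics-drivers PPA
Add the nvidia-container-runtime repository and its signing key
Update the package list and install the NVIDIA driver and related tools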

sudo add-apt-repository -y ppa:graphics-drivers
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
sudo apt update
sudo apt install -y nvidia-modprobe nvidia-driver-535 nvidia-headless-535 nvidia-container-runtime
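
To see which repository URL the `distribution` variable produces on a given host, the same lookup can be sketched in Python. This is a minimal illustration rather than part of the original setup: the function names `distribution_id` and `repo_list_url` and the sample os-release text are assumptions made for the example.

```python
def distribution_id(os_release_text: str) -> str:
    """Return ID + VERSION_ID (for example 'ubuntu22.04') from os-release content."""
    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values in os-release may be quoted; strip the quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields.get("ID", "") + fields.get("VERSION_ID", "")


def repo_list_url(distribution: str) -> str:
    """Build the .list URL that the curl command in this section downloads."""
    return (
        "https://nvidia.github.io/nvidia-container-runtime/"
        f"{distribution}/nvidia-container-runtime.list"
    )


# Hypothetical excerpt of /etc/os-release on Ubuntu 22.04.
SAMPLE_OS_RELEASE = 'NAME="Ubuntu"\nVERSION_ID="22.04"\nID=ubuntu\n'
```

On a real host you would pass the contents of /etc/os-release to `distribution_id` and compare the resulting URL with the one the shell command fetches.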

Set up K3S

Install K3S with the components that are not needed here, traefik and servicelb, disabled.
Copy the K3S config file to the path that kubectl reads.

curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable servicelb
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown "$(id -u):$(id -g)" ~/.kube/config

Run an image to verify GPU access

sudo ctr image pull docker.io/nvidia/cuda:12.1.1-base-ubuntu22.04
sudo ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/nvidia/cuda:12.1.1-base-ubuntu22.04 \
  cuda-12.1.1-base-ubuntu22.04 \
  nvidia-smi
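
For a scripted version of this check, the output of `nvidia-smi` can be parsed instead of read by eye. The sketch below assumes the `--query-gpu` and `--format=csv,noheader` options of `nvidia-smi`; the helper names and the sample line are made up for illustration.

```python
import subprocess

# Query three fields per GPU, one CSV line per device, without a header row.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader",
]


def parse_gpu_csv(text: str) -> list:
    """Turn 'name, memory, driver' lines into a list of dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        name, memory, driver = [part.strip() for part in line.split(",")]
        gpus.append({"name": name, "memory": memory, "driver": driver})
    return gpus


def list_gpus() -> list:
    """Run nvidia-smi on the host and return the parsed GPU list."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_csv(result.stdout)


# Hypothetical output line; real values depend on the instance and driver.
SAMPLE_OUTPUT = "Tesla T4, 15360 MiB, 535.161.08\n"
```

Calling `list_gpus()` on the instance should return one entry per GPU; an empty list or a failing command suggests the driver is not loaded.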

Install Helm and the NVIDIA plugins

Install Helm
Add the NVIDIA device plugin repository
Deploy the NVIDIA device plugin and the GPU feature discovery plugin

sudo snap install --classic helm
helm repo add nvidia-device-plugin https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvidia-device-plugin nvidia-device-plugin/nvidia-device-plugin --version 0.15.0 --set runtimeClassName=nvidia --namespace kube-system
helm upgrade -i nvidia-device-discovery nvidia-device-plugin/gpu-feature-discovery --version 0.15.0 --namespace gpu-feature-discovery --create-namespace --set runtimeClassName=nvidia

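Create the NVIDIA RuntimeClass

Create a RuntimeClass.yaml file that specifies the NVIDIA runtime. The Helm releases above set runtimeClassName=nvidia, so their pods cannot be created until this RuntimeClass exists.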

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

Apply the RuntimeClass:

kubectl apply -f RuntimeClass.yaml
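
The `handler: nvidia` field only works if containerd has a runtime registered under that name. K3S documents that it detects the NVIDIA runtime at startup and writes it into its generated containerd config, so one way to confirm this is to search that file. The sketch below is an assumption-laden helper (the function names are hypothetical, and the sample section header reflects the usual layout of that file, not output captured from this cluster).

```python
import re

# Location of the containerd config that K3S generates.
K3S_CONTAINERD_CONFIG = "/var/lib/rancher/k3s/agent/etc/containerd/config.toml"


def has_containerd_runtime(config_text: str, name: str) -> bool:
    """Return True if a section header for the runtime `name` is present."""
    # Matches headers ending in runtimes."nvidia"] or runtimes.'nvidia'] or runtimes.nvidia]
    pattern = r"runtimes\.[\"']?" + re.escape(name) + r"[\"']?\]"
    return re.search(pattern, config_text) is not None


def check_k3s_runtime(name: str = "nvidia", path: str = K3S_CONTAINERD_CONFIG) -> bool:
    """Read the K3S containerd config (usually requires root) and look for the runtime."""
    with open(path, encoding="utf-8") as handle:
        return has_containerd_runtime(handle.read(), name)


# Illustrative section header of the kind expected in the generated file.
SAMPLE_CONFIG = (
    '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]\n'
    '  runtime_type = "io.containerd.runc.v2"\n'
)
```

If `check_k3s_runtime()` returns False, restarting K3S after the driver and nvidia-container-runtime are installed is a reasonable first thing to try, since detection happens when K3S starts.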

Test a GPU benchmark and a CUDA job

Verify the GPU node status in the K3S cluster

kubectl describe node | grep nvidia.com
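
The same check can be made machine-readable by reading the `allocatable` field from `kubectl get nodes -o json`. This is a small sketch under the assumption that the device plugin advertises the resource as `nvidia.com/gpu`, as in the manifests in this article; the function name and the sample JSON are illustrative.

```python
import json


def gpu_allocatable(nodes_json: str) -> dict:
    """Map each node name to its allocatable nvidia.com/gpu count (0 if absent)."""
    nodes = json.loads(nodes_json)
    counts = {}
    for item in nodes.get("items", []):
        allocatable = item.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended resource quantities as strings.
        counts[item["metadata"]["name"]] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts


# Hypothetical, trimmed output of `kubectl get nodes -o json`.
SAMPLE_NODES = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node"},
         "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node"},
         "status": {"allocatable": {"cpu": "2"}}},
    ]
})
```

A node that reports 0 here cannot schedule Pods that request `nvidia.com/gpu`, which usually points back at the device plugin or the RuntimeClass.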

Deploy a GPU benchmark Pod

# nbody-gpu-benchmark.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.2.1
      args: ["nbody", "-gpu", "-benchmark"]
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all

Apply the benchmark Pod

kubectl apply -f nbody-gpu-benchmark.yaml

Deploy a CUDA job:

# cuda-job.yaml
apiVersion: v1
kind: Pod
metadata:
  name: vec-add-pod
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      image: k8s.gcr.io/cuda-vector-add:v0.1
      resources:
        limits:
          nvidia.com/gpu: 1

Apply the CUDA job Pod

kubectl apply -f cuda-job.yaml
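
Both test Pods run to completion, so a simple way to confirm the result is to poll the Pod phase until it stops changing. The helper below is a sketch, not part of the original workshop: `pod_phase` and `wait_for_pod` are hypothetical names, and the timeout values are arbitrary defaults.

```python
import json
import subprocess
import time


def pod_phase(pod_json: str) -> str:
    """Extract status.phase from the output of `kubectl get pod <name> -o json`."""
    return json.loads(pod_json).get("status", {}).get("phase", "Unknown")


def wait_for_pod(name: str, namespace: str = "default",
                 timeout_s: float = 300, poll_s: float = 5) -> str:
    """Poll the Pod until it reaches Succeeded or Failed, then return the phase."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
            check=True, capture_output=True, text=True,
        )
        phase = pod_phase(result.stdout)
        if phase in ("Succeeded", "Failed"):
            return phase
        time.sleep(poll_s)
    raise TimeoutError(f"pod {name} did not finish within {timeout_s} seconds")
```

For example, `wait_for_pod("vec-add-pod")` followed by `kubectl logs vec-add-pod` shows whether the CUDA job passed. Because both Pods use restartPolicy OnFailure, a failing container is restarted rather than marked Failed, so a timeout here is also a signal to inspect the logs.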

Set up Ingress on K3S

Create the values and configuration files for NGINX Ingress:

# value.yaml
cat > value.yaml <<EOF
controller:
  nginxplus: false
  ingressClass: nginx
  replicaCount: 2
  service:
    enabled: true
    type: NodePort
    externalIPs:
      - your-external-ip
EOF

# nginx-cm.yaml
cat > nginx-cm.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-nginx-ingress
  namespace: ingress
data:
  use-ssl-certificate-for-ingress: "false"
  external-status-address: your-external-ip
  proxy-connect-timeout: 10s
  proxy-read-timeout: 10s
  client-header-buffer-size: 64k
  client-body-buffer-size: 64k
  client-max-body-size: 1000m
  proxy-buffers: 8 32k
  proxy-body-size: 1024m
  proxy-buffer-size: 32k
EOF

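# nginx-svc-patch.yaml
# Note: nodePort 80 and 443 are outside the default NodePort range (30000-32767).
# The patch is rejected unless the K3S server runs with a wider --service-node-port-range.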
cat > nginx-svc-patch.yaml <<EOF
spec:
  ports:
  - name: http
    nodePort: 80
    port: 80
    protocol: TCP
    targetPort: 80
  - name: https
    nodePort: 443
    port: 443
    protocol: TCP
    targetPort: 443
EOF
helm repo add nginx-stable https://helm.nginx.com/stable || echo true
helm repo up
helm delete apisix -n ingress || echo true
kubectl create namespace ingress || echo true
helm upgrade --install nginx nginx-stable/nginx-ingress --version=0.15.0 --namespace ingress -f value.yaml
kubectl apply -f nginx-cm.yaml
kubectl patch svc nginx-nginx-ingress -n ingress --patch-file nginx-svc-patch.yaml
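
The Service patch pins the NodePorts to 80 and 443, while Kubernetes accepts NodePorts only inside the configured service node port range, which defaults to 30000-32767. The check below is a minimal sketch of that rule so the values can be validated before patching; the function name is hypothetical, and the second range in the usage note is only an example of a widened setting.

```python
# Default range used by the Kubernetes API server for NodePort Services.
DEFAULT_NODE_PORT_RANGE = (30000, 32767)


def ports_outside_range(node_ports, port_range=DEFAULT_NODE_PORT_RANGE):
    """Return the requested NodePorts that fall outside the allowed range."""
    low, high = port_range
    return [port for port in node_ports if not low <= port <= high]
```

With the default range, `ports_outside_range([80, 443])` returns both ports, which is why the patch needs a widened range such as `(80, 32767)` on the K3S server, or NodePorts chosen from inside the default range.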

Deploy Ollama

Create a deployment, service, and ingress for Ollama:

# ollama.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
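    spec:
      # Use the NVIDIA runtime and request one GPU; omit these fields to run Ollama on CPU only
      runtimeClassName: nvidia
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
        resources:
          limits:
            nvidia.com/gpu: 1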
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
  - port: 80
    name: http
    targetPort: 11434
    protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama
spec:
  ingressClassName: nginx
  rules:
  - host: ollama.onwalk.net
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ollama
            port:
              number: 80

Apply the Ollama deployment:

kubectl apply -f ollama.yaml
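
Once the Pod is running, the service can be smoke-tested through the Ingress host with Ollama's REST API. The sketch below builds a non-streaming request to the `/api/generate` endpoint using only the standard library. The model name passed in is an assumption: a model must already have been pulled into the Ollama instance, and the helper names are made up for the example.

```python
import json
import urllib.request


def build_generate_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, model: str, prompt: str, timeout_s: float = 120) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

For example, `generate("http://ollama.onwalk.net", "llama3", "Say hello")` should return a short completion, provided the host name resolves to the Ingress and a model called llama3 (an example name) is available.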

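Extend Ollama with LangChain for advanced use cases

LangChain is a framework for building applications on top of LLMs. The following shows how to integrate LangChain with your Ollama deployment as a starting point for advanced IT operations work such as ticket management, Git PR checks, code review, and automatic pipeline creation.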

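Install LangChain: pip install langchain langchain-community
Configure LangChain and Ollama: create a new demo script, langchain_ollama.py, with the following content. Ollama only generates text, so the tickets, diffs, and project descriptions must come from your own systems; the script uses placeholder inputs, and the llama3 model name is an example.

from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate

# Initialize the Ollama LLM client behind the Ingress host.
# Assumption: the "llama3" model has already been pulled, for example with
#   kubectl exec -n ollama deploy/ollama -- ollama pull llama3
llm = Ollama(base_url="http://ollama.onwalk.net", model="llama3")


def ask(template, **fields):
    # Build a prompt -> LLM chain and run it with the given fields
    chain = PromptTemplate.from_template(template) | llm
    return chain.invoke(fields)


# Define the advanced use cases
def manage_task_workflow(tickets):
    # Ticket management: classify and prioritize each ticket
    return [ask("Classify and prioritize this ops ticket:\n{ticket}", ticket=t)
            for t in tickets]


def check_git_pr(diffs):
    # Git PR check: flag risky changes in each diff
    return [ask("List risky changes in this pull request diff:\n{diff}", diff=d)
            for d in diffs]


def code_review(snippets):
    # Code review: review each code snippet
    return [ask("Review this code and suggest improvements:\n{code}", code=c)
            for c in snippets]


def create_pipelines(projects):
    # Pipeline creation: draft a CI pipeline for each project description
    return [ask("Draft a CI pipeline definition for this project:\n{project}", project=p)
            for p in projects]


# Run the use cases with placeholder inputs (replace them with real data)
if __name__ == "__main__":
    print(manage_task_workflow(["Disk usage above 90% on the GPU node"]))
    print(check_git_pr(["- replicas: 2\n+ replicas: 0"]))
    print(code_review(["def add(a, b): return a - b"]))
    print(create_pipelines(["A Python service with unit tests and a Dockerfile"]))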

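Using the script above as a starting point, you can implement advanced IT operations tasks such as ticket management, Git PR checks, code review, and automatic pipeline creation.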

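Original content statement: this article is published in the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For infringement concerns, contact cloudcommunity@tencent.com for removal.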

aiops
