In this workshop we walk through setting up a GPU-enabled Kubernetes cluster on AWS with K3s, installing the NVIDIA driver and device plugin, and deploying GPU workloads. By combining LangChain with Ollama, you can extend the cluster to advanced use cases such as ticket management, Git PR checks, code review, and automated pipeline creation, significantly improving the efficiency and degree of automation of IT operations.
First, provision your AWS GPU instance. Once it is running, connect to it over SSH:
ssh -i your-key.pem ubuntu@your-instance-ip
sudo add-apt-repository -y ppa:graphics-drivers
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
sudo apt update
sudo apt install -y nvidia-modprobe nvidia-driver-535 nvidia-headless-535 nvidia-container-runtime
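After the packages install, verify that the driver works (a reboot may be needed first so the kernel module loads):
nvidia-smi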
curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable servicelb
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config
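Confirm that kubectl can reach the cluster and the node is Ready:
kubectl get nodes -o wide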
Test the NVIDIA container runtime directly against the K3s containerd (root privileges are required for the containerd socket):
sudo ctr image pull docker.io/nvidia/cuda:12.1.1-base-ubuntu22.04
sudo ctr run --rm -t --runc-binary=/usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=all \
  docker.io/nvidia/cuda:12.1.1-base-ubuntu22.04 \
  cuda-12.1.1-base-ubuntu22.04 \
  nvidia-smi
sudo snap install --classic helm
helm repo add nvidia-device-plugin https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvidia-device-plugin nvidia-device-plugin/nvidia-device-plugin --version 0.15.0 --set runtimeClassName=nvidia --namespace kube-system
helm upgrade -i nvidia-device-discovery nvidia-device-plugin/gpu-feature-discovery --version 0.15.0 --namespace gpu-feature-discovery --create-namespace --set runtimeClassName=nvidia
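Check that the device-plugin and feature-discovery pods are running before continuing:
kubectl get pods -n kube-system | grep nvidia
kubectl get pods -n gpu-feature-discovery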
Create a RuntimeClass that maps Pods onto the NVIDIA runtime handler:
# RuntimeClass.yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
kubectl apply -f RuntimeClass.yaml
kubectl describe node | grep nvidia.com
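You can also list the GPU allocatable count each node advertises (a quick sanity check, assuming the device plugin registered successfully):
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'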
Create a benchmark Pod that requests a GPU:
# nbody-gpu-benchmark.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.2.1
      args: ["nbody", "-gpu", "-benchmark"]
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
kubectl apply -f nbody-gpu-benchmark.yaml
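Once the Pod completes, inspect its logs for the benchmark results:
kubectl logs nbody-gpu-benchmark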
Create a simple CUDA vector-add Pod:
# cuda-job.yaml
apiVersion: v1
kind: Pod
metadata:
  name: vec-add-pod
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-vector-add
      image: k8s.gcr.io/cuda-vector-add:v0.1
      resources:
        limits:
          nvidia.com/gpu: 1
Apply the CUDA job Pod:
kubectl apply -f cuda-job.yaml
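The vector-add sample prints a pass/fail line when it finishes; check it with:
kubectl logs vec-add-pod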
Create the values file and configuration for the NGINX Ingress controller:
# value.yaml
cat > value.yaml <<EOF
controller:
  nginxplus: false
  ingressClass: nginx
  replicaCount: 2
  service:
    enabled: true
    type: NodePort
    externalIPs:
      - your-external-ip
EOF
# nginx-cm.yaml
cat > nginx-cm.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-nginx-ingress
  namespace: ingress
data:
  use-ssl-certificate-for-ingress: "false"
  external-status-address: your-external-ip
  proxy-connect-timeout: 10s
  proxy-read-timeout: 10s
  client-header-buffer-size: 64k
  client-body-buffer-size: 64k
  client-max-body-size: 1000m
  proxy-buffers: 8 32k
  proxy-body-size: 1024m
  proxy-buffer-size: 32k
EOF
# nginx-svc-patch.yaml
cat > nginx-svc-patch.yaml <<EOF
spec:
  ports:
    - name: http
      nodePort: 80
      port: 80
      protocol: TCP
      targetPort: 80
    - name: https
      nodePort: 443
      port: 443
      protocol: TCP
      targetPort: 443
EOF
helm repo add nginx-stable https://helm.nginx.com/stable || echo true
helm repo up
helm delete apisix -n ingress || echo true
kubectl create namespace ingress || echo true
helm upgrade --install nginx nginx-stable/nginx-ingress --version=0.15.0 --namespace ingress -f value.yaml
kubectl apply -f nginx-cm.yaml
kubectl patch svc nginx-nginx-ingress -n ingress --patch-file nginx-svc-patch.yaml
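Note that nodePorts 80 and 443 fall outside the default service node-port range (30000-32767), so the patch only succeeds if the K3s server was started with an extended range, e.g. --kube-apiserver-arg service-node-port-range=1-65535. Verify the resulting service:
kubectl get svc -n ingress nginx-nginx-ingress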
Create the Deployment, Service, and Ingress for Ollama:
# ollama.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - name: http
              containerPort: 11434
              protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ollama
spec:
  type: ClusterIP
  selector:
    name: ollama
  ports:
    - port: 80
      name: http
      targetPort: 11434
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ollama-ingress
  namespace: ollama
spec:
  ingressClassName: nginx
  rules:
    - host: ollama.onwalk.net
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ollama
                port:
                  number: 80
Apply the Ollama deployment:
kubectl apply -f ollama.yaml
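After the rollout, pull a model into the running container and exercise the API through the ingress (llama2 is just an example model name; /api/generate is part of Ollama's HTTP API). To serve models on the GPU, you would additionally set runtimeClassName: nvidia and an nvidia.com/gpu resource limit on the container, mirroring the benchmark Pod above:
kubectl exec -n ollama deploy/ollama -- ollama pull llama2
curl http://ollama.onwalk.net/api/generate -d '{"model": "llama2", "prompt": "Hello", "stream": false}'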
LangChain is a powerful framework for automating and orchestrating complex tasks. The script below shows how LangChain could be wired to your Ollama deployment to support advanced IT-operations work such as ticket management, Git PR checks, code review, and automated pipeline creation. It is a minimal sketch: the Ollama LLM wrapper comes from langchain_community, the model name llama2 is an example, and the fetch_* helpers are placeholders for your own ticketing, Git, and CI integrations.
# Sketch: driving IT-operations workflows through the Ollama service above.
# The Ollama wrapper comes from langchain_community; the fetch_* helpers are
# placeholders to be backed by your ticketing system, Git hosting, and CI APIs.
from langchain_community.llms import Ollama

# Point the client at the ingress host created earlier; "llama2" is an example.
llm = Ollama(base_url="http://ollama.onwalk.net", model="llama2")

def fetch_tasks():
    # Placeholder: return open tickets from your ticketing system.
    return []

def fetch_pull_requests():
    # Placeholder: return open PR diffs from your Git hosting API.
    return []

def fetch_code_changes():
    # Placeholder: return code changes queued for review.
    return []

def fetch_pipeline_specs():
    # Placeholder: return descriptions of pipelines to generate.
    return []

def manage_task_workflow():
    # Ticket management: summarize each ticket and propose a next step.
    for task in fetch_tasks():
        print(llm.invoke(f"Summarize this ticket and propose a next step:\n{task}"))

def check_git_pr():
    # Git PR checks: ask the model to flag problems in each diff.
    for pr in fetch_pull_requests():
        print(llm.invoke(f"Review this pull request diff and flag issues:\n{pr}"))

def code_review():
    # Code review: request detailed feedback on each change.
    for change in fetch_code_changes():
        print(llm.invoke(f"Give a detailed code review of this change:\n{change}"))

def create_pipelines():
    # Pipeline creation: generate a CI pipeline definition from a description.
    for spec in fetch_pipeline_specs():
        print(llm.invoke(f"Write a CI pipeline definition for:\n{spec}"))

# Run the advanced use cases.
manage_task_workflow()
check_git_pr()
code_review()
create_pipelines()
With a script along these lines, you can drive advanced IT-operations tasks such as ticket management, Git PR checks, code review, and automated pipeline creation from your cluster.