
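Author: SRE engineer / platform architect

Audience: SREs, platform engineers, DevOps engineers, tech leads

Tech stack: Chaos Mesh + Istio + Argo CD + FastAPI + Prometheus + GitLab CI

Core goal: make chaos engineering the gatekeeper of the release process rather than an "after-the-fact remediation tool".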
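Many teams do not dare run chaos experiments in production because "full injection = full outage". Our answer: inject faults into only 1% of real user traffic and leave the other 99% untouched.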
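The mechanism relies on a `chaos-experiment: true` label and a matching request header.

✅ Effect: real users take part in the verification, but the blast radius stays under control.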
```yaml
# order-deployment-canary.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: order
      chaos-experiment: "true"   # ← key label
  template:
    metadata:
      labels:
        app: order
        chaos-experiment: "true"
    spec:
      containers:
        - name: order
          image: order:v1.2.3
```
```yaml
# istio-vs-order.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-vs
spec:
  hosts:
    - order-service
  http:
    - match:
        - headers:
            x-chaos-experiment:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: canary
          weight: 100
    - route:
        - destination:
            host: order-service
            subset: stable
          weight: 99
        - destination:
            host: order-service
            subset: canary
          weight: 1   # ← 1% of traffic goes to canary
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-dr
spec:
  host: order-service
  subsets:
    - name: stable
      labels:
        chaos-experiment: "false"
    - name: canary
      labels:
        chaos-experiment: "true"
```
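To make the routing order concrete, here is a minimal Python sketch of the decision the two `http` rules describe: the header match is evaluated first, and only requests without the header fall through to the 99/1 weighted split. The function name `pick_subset` and its parameters are hypothetical and exist only for illustration; real Istio routing happens in the Envoy sidecar, and this simulation ignores details such as case-insensitive header names.

```python
import random


def pick_subset(headers: dict, rng: random.Random, canary_percent: float = 1.0) -> str:
    """Mimic the two VirtualService rules: header match first, then weighted split."""
    # Rule 1: requests carrying x-chaos-experiment: "true" always go to canary.
    if headers.get("x-chaos-experiment") == "true":
        return "canary"
    # Rule 2: everything else is split by weight (99 stable / 1 canary by default).
    return "canary" if rng.random() * 100 < canary_percent else "stable"


if __name__ == "__main__":
    rng = random.Random(42)
    total = 100_000
    hits = sum(pick_subset({}, rng) == "canary" for _ in range(total))
    print(f"canary share without header: {hits / total:.2%}")
```

Running the simulation over many requests should show a canary share close to 1%, while any request that sets the header lands on canary regardless of the weights.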
```python
# The Python backend builds the custom resource (CR)
def build_gray_network_chaos():
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": "order-gray-delay"},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": {
                "namespaces": ["prod"],
                "labelSelectors": {"chaos-experiment": "true"},  # ← canary only
            },
            "delay": {"latency": "500ms"},
            "duration": "10m",
        },
    }
```
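Because the whole safety argument depends on the selector, it can help to check a generated CR before it is submitted. The following is a minimal sketch of such a guard, not part of the platform described here; the function name `is_scoped_to_canary` and the `allowed_namespaces` parameter are assumptions made for illustration.

```python
def is_scoped_to_canary(cr: dict, allowed_namespaces: frozenset = frozenset({"prod"})) -> bool:
    """Return True only if the chaos CR selects pods labelled chaos-experiment=true
    inside an explicitly allowed namespace."""
    selector = cr.get("spec", {}).get("selector", {})
    labels = selector.get("labelSelectors", {})
    namespaces = selector.get("namespaces", [])
    return (
        labels.get("chaos-experiment") == "true"  # must target the canary label
        and bool(namespaces)                      # an empty list would be unscoped
        and set(namespaces) <= allowed_namespaces
    )


if __name__ == "__main__":
    # Example CR with the same selector shape as the NetworkChaos above.
    example = {
        "kind": "NetworkChaos",
        "spec": {
            "selector": {
                "namespaces": ["prod"],
                "labelSelectors": {"chaos-experiment": "true"},
            }
        },
    }
    print(is_scoped_to_canary(example))
```

Rejecting any CR that fails this check before it reaches the cluster keeps a typo in the selector from turning a 1% experiment into a full injection.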
```promql
# Count only requests that carry the header
rate(http_requests_total{service="order", header_x_chaos_experiment="true"}[1m])
```
🔒 Safety boundary: even if the experiment fails, at most 1% of users are affected, and rollback takes seconds (delete the VirtualService rule).
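One way to picture the rollback is as a pure transformation of the VirtualService manifest: drop every route that points at the canary subset and send everything to stable. The sketch below works on a plain dict and is only an illustration under that assumption; the helper name `strip_canary_routes` is hypothetical, and applying the result to the cluster (for example through GitOps) is left out.

```python
import copy


def strip_canary_routes(virtual_service: dict) -> dict:
    """Return a copy of the VirtualService that no longer routes to the canary subset."""
    vs = copy.deepcopy(virtual_service)  # never mutate the caller's manifest
    kept_rules = []
    for rule in vs["spec"]["http"]:
        remaining = [
            r for r in rule.get("route", [])
            if r["destination"].get("subset") != "canary"
        ]
        if not remaining:
            continue  # the rule routed only to canary (the header match): drop it
        if len(remaining) == 1:
            remaining[0]["weight"] = 100  # the single stable destination takes all traffic
        rule["route"] = remaining
        kept_rules.append(rule)
    vs["spec"]["http"] = kept_rules
    return vs


if __name__ == "__main__":
    manifest = {
        "spec": {
            "http": [
                {
                    "match": [{"headers": {"x-chaos-experiment": {"exact": "true"}}}],
                    "route": [{"destination": {"host": "order-service", "subset": "canary"}, "weight": 100}],
                },
                {
                    "route": [
                        {"destination": {"host": "order-service", "subset": "stable"}, "weight": 99},
                        {"destination": {"host": "order-service", "subset": "canary"}, "weight": 1},
                    ]
                },
            ]
        }
    }
    print(strip_canary_routes(manifest)["spec"]["http"])
```

After the transformation, requests that still carry the header fall through to the remaining default rule and reach the stable subset.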
`core-path-experiments.json`:

```json
[
  {
    "name": "checkout-redis-delay",
    "template": "network-delay",
    "target": "redis-checkout",
    "verification": [
      {"metric": "checkout_error_rate", "threshold": 0.005}
    ]
  },
  {
    "name": "payment-db-failure",
    "template": "pod-kill",
    "target": "mysql-payment",
    "verification": [
      {"metric": "payment_timeout_count", "threshold": 10}
    ]
  }
]
```
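The `verification` entries can be read as upper bounds on a metric observed during the experiment. The sketch below shows one possible evaluation under that assumption (an experiment passes when every metric is at or below its threshold); the function name `breached_metrics` and the `observed` mapping are hypothetical, and in the real platform the values would presumably come from Prometheus.

```python
def breached_metrics(experiment: dict, observed: dict) -> list:
    """Return the metrics that exceeded their threshold; an empty list means pass."""
    failures = []
    for check in experiment["verification"]:
        value = observed.get(check["metric"])
        # A missing metric counts as a failure: no data is not evidence of health.
        if value is None or value > check["threshold"]:
            failures.append(check["metric"])
    return failures


if __name__ == "__main__":
    experiment = {
        "name": "checkout-redis-delay",
        "verification": [{"metric": "checkout_error_rate", "threshold": 0.005}],
    }
    # Hypothetical observation taken while the fault was injected.
    failed = breached_metrics(experiment, {"checkout_error_rate": 0.012})
    print("passed" if not failed else f"failed: {failed}")
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a broken metrics pipeline should not let an experiment pass silently.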
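```python
# tasks/scheduled_chaos.py
from celery.schedules import crontab

# `app` (the Celery instance), load_json, run_chaos_experiment and
# send_wechat_alert are platform objects defined elsewhere.

app.conf.beat_schedule = {
    "weekly-core-path-chaos": {
        "task": "tasks.chaos.run_scheduled_experiments",
        # In Celery's crontab, day_of_week=0 is Sunday: runs every Sunday at 02:00
        "schedule": crontab(day_of_week=0, hour=2, minute=0),
        "args": ("core-path-experiments.json",),
    }
}


# The explicit name must match the "task" entry in beat_schedule above.
@app.task(name="tasks.chaos.run_scheduled_experiments")
def run_scheduled_experiments(config_file: str):
    experiments = load_json(config_file)
    for exp in experiments:
        # Create, execute and verify automatically.
        # Assumption: run_chaos_experiment returns True when every check passes.
        passed = run_chaos_experiment(exp)
        # On failure, send a WeCom (Enterprise WeChat) alert
        if not passed:
            send_wechat_alert(f"[Automated drill failed] {exp['name']}")
```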
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-service
  annotations:
    chaos-platform/experiment-template: "order-release-validation"
spec:
  # ...
```
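```python
# argo-health-check.py
from fastapi import FastAPI

app = FastAPI()

# is_new_release, trigger_chaos_validation and wait_for_verification are
# platform helpers defined elsewhere.


@app.get("/health/{app_name}")
def check_app_health(app_name: str):
    # 1. Check whether the app was just released (compare lastDeployedAt)
    # 2. If this is a new version, trigger chaos validation
    if is_new_release(app_name):
        experiment_id = trigger_chaos_validation(app_name)
        # 3. Poll for the result, waiting at most 15 minutes
        result = wait_for_verification(experiment_id, timeout=900)
        if not result.passed:
            return {"status": "Degraded", "message": "Chaos validation failed"}
    return {"status": "Healthy"}
```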
✅ Effect: if chaos validation fails, Argo CD shows the application as "Degraded" and automatic promotion to the next environment (for example staging → prod) is blocked.
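The health check depends on polling for a result with a hard deadline. The article does not show how `wait_for_verification` is implemented, so the following is only a sketch of one way such a poll loop could look. The names `poll_verification` and `VerificationResult`, the `fetch` callback, and the injectable `clock` and `sleep` parameters are all assumptions introduced for illustration and to make the loop testable without real waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class VerificationResult:
    passed: bool
    message: str = ""


def poll_verification(
    fetch: Callable[[], Optional[VerificationResult]],
    timeout: float = 900,
    interval: float = 10,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> VerificationResult:
    """Call fetch() until it returns a result or the timeout elapses.

    fetch returns None while the experiment is still running.
    A timeout is reported as a failed verification, so the gate stays closed.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            return VerificationResult(False, "verification timed out")
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical experiment that finishes on the third poll.
    answers = iter([None, None, VerificationResult(True, "all checks passed")])
    print(poll_verification(lambda: next(answers), timeout=5, interval=0))
```

Failing closed on timeout matches the gatekeeper idea: an experiment that never reports back should not be allowed to unblock a release.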
In the Vue platform, we add a "Resilience validation" card to every Argo CD Application:
```vue
<template>
  <div class="release-gate">
    <h4>Resilience validation (release gate)</h4>
    <el-tag v-if="gateStatus === 'pending'" type="warning">Validating...</el-tag>
    <el-tag v-else-if="gateStatus === 'passed'" type="success">✅ Passed</el-tag>
    <el-tag v-else type="danger">❌ Failed</el-tag>

    <el-button
      v-if="gateStatus === 'failed'"
      size="small"
      @click="rerunValidation"
    >
      Re-run validation
    </el-button>
  </div>
</template>
```
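Developers see the chaos validation status directly on the release page, with no need to switch tools.

Case: in November 2025, a payment gateway built up a message backlog because its Kafka consumer did not handle rebalances.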
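Related alert: `kafka_consumer_lag`.

| Item | Investment | Return |
|---|---|---|
| Platform development | 2 engineers × 3 months | 60% fewer P1 incidents |
| Experiment design | 4 SRE hours per week | 15+ latent issues found early per quarter |
| Operating cost | Chaos Mesh resources ≈ 2 CPU / 4 GB RAM | Avoids losses of more than ¥500,000 per incident |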
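💡 Conclusion: chaos engineering is not a cost center but a risk-hedging tool.

Over the course of these four articles, our platform has come to provide:

Ultimate goal:

Let every release go through a "stress test",
let every failure happen within a controlled scope,
and let system resilience be a default property rather than an accidental outcome.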