
张跃华, backend development engineer for Tencent Kubernetes Engine (TKE), Tencent Cloud, focusing on LLM application development.
Remember when ChatGPT flooded everyone's feeds in 2023? Back then the hot debate was still "will AI replace humans". Barely two years later, AI's role has changed dramatically: no longer cold tools, AI agents have evolved into "co-workers" that write code, build slide decks, and even run online shops for us. These AI Agents are a bit like the robots in Westworld, except that instead of waking up and rebelling like Dolores, they quietly help you make money. Still, as "smart" as these new colleagues are, they do slip up from time to time.
Imagine asking an AI Agent to write a product report, only to find it has gotten a competitor's product data wrong in the output. This is the LLM "hallucination" problem, colloquially known as "AI talking nonsense". Worse, once an Agent runs a multi-step task, errors compound like a snowball: if the LLM is correct 90% of the time at each step, a 10-step workflow succeeds end to end less than 35% of the time (0.9^10 ≈ 0.35). It is like handing a slide deck to a complete beginner: a small mistake at every step, and the final result is unusable.
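A quick sanity check of that compounding effect, as a plain-Python sketch (the 90% per-step accuracy and 10 steps are just the illustrative numbers from above):

# Each step must succeed for the whole workflow to succeed,
# so per-step accuracies multiply.
per_step_accuracy = 0.9
steps = 10
overall = per_step_accuracy ** steps
print(f"{steps} steps at {per_step_accuracy:.0%} each -> {overall:.1%} end to end")
# prints: 10 steps at 90% each -> 34.9% end to end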
Beyond the occasional bout of "talking nonsense with a straight face", Agents have quite a few other habits that give people headaches:
These issues create two major challenges for our operations work:
Traditional monitoring systems (e.g. Zabbix, Prometheus) and logging stacks (e.g. ELK, EFK) can watch an application's basic metrics and logs, but they struggle with the steps that matter most when an Agent actually runs: prompt templating, LLM inference, embedding retrieval, and tool calls. They cannot observe or trace per-step success rate, latency, resource consumption (tokens), or error details, so we can neither detect problems promptly nor pinpoint exactly where an Agent run went wrong.
Traditional service testing (e.g. synthetic probing, smoke tests) focuses on HTTP API availability, business-logic correctness, response time, and error rate. These metrics fall short of fully assessing the service quality of an Agent application, for reasons including:
Agent evaluation therefore has to be designed around the specific business scenario, with tailored evaluation schemes that close the gap in automated, standardized, and quantitative assessment.
Building a "reliable" AI Agent starts with solid observability. Concretely, we need to record and display the result of every step the Agent takes, including prompt templating, LLM inference, embedding retrieval, and tool calls, so that every outcome is fully traceable. At the implementation level, AI application observability tools such as Langfuse and LangSmith can collect and visualize the Agent's end-to-end execution, making troubleshooting, performance analysis, and cost control much easier.
Accurately assessing how an AI Agent actually performs requires a standardized, quantifiable evaluation framework. The personalized evaluation landscape proposed by the LangChain team offers a systematic and practical methodology for building business datasets, designing multi-dimensional evaluation metrics, and evaluating Agents across different scenarios.
Its core components can be summarized briefly as:
Dataset
Evaluator
Task
Applying evals
The framework centers on dataset construction, evaluator selection, task types, and evaluation scenarios, covering the full pipeline from data collection to multi-dimensional assessment. By combining data from multiple sources with a diverse set of evaluation methods, it can reflect an Agent's real performance across scenarios comprehensively and accurately, and it gives us an objective basis for production quality monitoring and further optimization. In practice, open-source tools such as DeepEval, Ragas, and Promptfoo can make evaluation considerably more efficient.
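To make the dataset/evaluator split concrete before diving into tooling, here is a minimal, framework-free sketch in plain Python (the sample cases, the exact-match rule, and the my_agent function are purely illustrative):

# A "dataset": (input, reference) pairs gathered from real traffic,
# manual curation, or synthetic generation.
dataset = [
    {"input": "1 + 1 = ?", "reference": "2"},
    {"input": "What is the capital of France?", "reference": "Paris"},
]

# An "evaluator": maps (output, reference) to a score. Real evaluators range
# from simple heuristics like this one to LLM-as-judge metrics.
def exact_match(output: str, reference: str) -> float:
    return 1.0 if output.strip() == reference.strip() else 0.0

# "Applying evals": run the agent over the dataset and aggregate the scores.
def run_eval(agent_fn, dataset) -> float:
    scores = [exact_match(agent_fn(case["input"]), case["reference"]) for case in dataset]
    return sum(scores) / len(scores)

# run_eval(my_agent, dataset) then yields a single quality number per run.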
With observability and evaluation in place, we can monitor production service quality precisely and form a positive feedback loop of "richer observability metrics → finer evaluation dimensions → evolving product capabilities", helping the Agent grow through use and get smarter the more it is used.
On top of this architecture, we integrated TKE, Langfuse, and DeepEval to deliver trace visualization and automated Agent evaluation, building an efficient, transparent, and continuously improvable Agent operations system.
TKE is fully compatible with the Kubernetes API and gives Agent applications a complete set of capabilities: efficient deployment, resource scheduling, service discovery, and autoscaling. It resolves environment-consistency issues across development, testing, and operations, simplifies the management of large container clusters, and helps us reduce cost and improve efficiency. In addition, we deploy the MCP-Server on Tencent Cloud Serverless Cloud Function (SCF) to provide external tools and extend the Agent's capabilities.
Langfuse provides end-to-end observability, giving the team a real-time view of the Agent's runtime status and quality.
DeepEval enables multi-dimensional, automated Agent quality evaluation to drive continuous optimization.
# Add the Helm repository
helm repo add langfuse https://langfuse.github.io/langfuse-k8s
# Adjust the configuration (database passwords, service type, etc.; see https://langfuse.com/self-hosting/configuration)
helm show values langfuse/langfuse > langfuse-values.yaml
# Deploy the service
kubectl create namespace langfuse
helm install langfuse langfuse/langfuse -n langfuse -f langfuse-values.yaml
Check the service address; the dashboard looks like this:
Prepare a Conda virtual environment and install the required dependencies. The MCP-Server code is shown below (for production, deploying the MCP-Server on Tencent Cloud SCF is recommended):
# math_server.py
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("Math")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers"""
    return a + b

@mcp.tool()
def subtract(a: int, b: int) -> int:
    """Subtract b from a"""
    return a - b

@mcp.tool()
def multiply(a: int, b: int) -> int:
    """Multiply two numbers"""
    return a * b

@mcp.tool()
def divide(a: int, b: int) -> float:
    """Divide a by b"""
    if b == 0:
        raise ValueError("Division by zero is not allowed.")
    return a / b

if __name__ == "__main__":
    mcp.run(transport="stdio")
The Agent code is as follows:
# langgraph-agent.py
import os
import asyncio

from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langchain_mcp_adapters.client import MultiServerMCPClient
from langfuse.callback import CallbackHandler


# ReAct agent + MCP tools
async def multi_tool_demo(model: ChatOpenAI, query: str, config: dict):
    async with MultiServerMCPClient({
        "math": {
            "command": "python",
            # Make sure to update to the full absolute path to your math_server.py file
            "args": ["math_server.py"],
            "transport": "stdio",
        },
    }) as client:
        agent = create_react_agent(model, client.get_tools())
        try:
            response = await agent.ainvoke({"messages": query}, config=config)
            print(f"\nTool call results (query: {query}):")
            for m in response["messages"]:
                m.pretty_print()
        except Exception as e:
            print(f"Tool call failed: {e}")


if __name__ == "__main__":
    # Get the keys for your project from the Langfuse project settings page
    os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-***"  # your langfuse public key
    os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-***"  # your langfuse secret key
    os.environ["LANGFUSE_HOST"] = "http://xx.xx.xx.xx"  # your langfuse host

    # Classic "pheasants and rabbits in the same cage" puzzle: 35 heads, 94 legs
    query = "今有雉兔同笼,上有三十五头,下有九十四足,问雉兔各几何?"

    # Init model
    model = ChatOpenAI(
        model="<YOUR_LLM_ID>",
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url=os.getenv("OPENAI_API_BASE"),
    )

    # Initialize the Langfuse CallbackHandler for LangChain (tracing)
    langfuse_handler = CallbackHandler()
    config = {"callbacks": [langfuse_handler]}

    # Invoke the agent
    async def run_tools():
        await multi_tool_demo(model=model, query=query, config=config)

    asyncio.run(run_tools())
Run the command:
python langgraph-agent.py
Check the results:
The Agent output looks like this:
The trace captured by Langfuse looks like this:
The types of user feedback can be represented as:
A TypeScript code example:
import { LangfuseWeb } from "langfuse";
export function UserFeedbackComponent() {
  const langfuseWeb = new LangfuseWeb({
    publicKey: "pk-lf-xxxxxxxx", // your langfuse public key
    baseUrl: "http://xx.xx.xx.xx", // your langfuse host
  });

  const handleUserFeedback = async (value: number) => {
    try {
      await langfuseWeb.score({
        traceId: "xxxxxxxx", // replace with your actual traceId (available in the Langfuse console)
        name: "user_feedback",
        value,
        comment: value >= 4 ? "satisfied" : "unsatisfied", // simple comment mapping
      });
      alert(`Score ${value} submitted!`);
    } catch (error) {
      console.error("Failed to submit score:", error);
      alert("Submission failed, please retry");
    }
  };

  return (
    <div className="flex space-x-2">
      {[0, 1, 2, 3, 4, 5].map((score) => (
        <button
          key={score}
          onClick={() => handleUserFeedback(score)}
          className="px-3 py-1 border rounded"
        >
          {score}
        </button>
      ))}
    </div>
  );
}
The program output looks like this:
The user feedback captured by Langfuse looks like this:
As the figure shows, Langfuse captures both the user's feedback score and the accompanying comment.
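If collecting feedback in the browser is not an option, the same score can be reported from the backend instead. A minimal sketch with the Langfuse Python SDK (v2-style API; the keys, host, trace ID, and score value are placeholders):

import os
from langfuse import Langfuse

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-***"     # your langfuse public key
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-***"     # your langfuse secret key
os.environ["LANGFUSE_HOST"] = "http://xx.xx.xx.xx"  # your langfuse host

langfuse = Langfuse()

# Attach a user_feedback score (with an optional comment) to an existing trace.
langfuse.score(
    trace_id="xxxxxxxx",   # the trace to annotate
    name="user_feedback",
    value=5,
    comment="satisfied",
)
langfuse.flush()  # make sure the event is sent before the process exits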
Because our work requires comparing evaluation results across different models, we build the evaluation pipeline with LangChain to reduce the complexity of integrating them. This evaluation uses only the Task Completion metric; for more metrics, see DeepEval Evaluation Metrics.
How TaskCompletionMetric is computed:
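Roughly speaking, the metric asks a judge LLM to infer the intended task from the input and the tool-call trail, then scores how well the final answer completes that task (score in [0, 1]; the case passes if score >= threshold). A minimal single-case sketch (the question, tool calls, and judge model name below are placeholders, not taken from real traces):

from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase, ToolCall

# One trace turned into one test case: the user's question, the agent's
# final answer, and the tools it called along the way.
test_case = LLMTestCase(
    input="How much is (3 + 5) * 2?",
    actual_output="(3 + 5) * 2 = 16.",
    tools_called=[
        ToolCall(name="add", input_parameters={"a": 3, "b": 5}, output=8),
        ToolCall(name="multiply", input_parameters={"a": 8, "b": 2}, output=16),
    ],
)

metric = TaskCompletionMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)
metric.measure(test_case)
print(metric.score, metric.reason)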
We deploy the evaluation script as a CronJob in the TKE cluster that runs every 30 minutes, pulls recent trace data from Langfuse, assembles it into a dataset, and evaluates it; this works well for continuous production probing.
# agent_eval.py
import os
import logging
import datetime as dt
from typing import Any

import deepeval
from deepeval import evaluate
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from langfuse import Langfuse
from langfuse.api import TraceWithDetails
from langchain_openai import ChatOpenAI


class DeepEvalOpenAI(DeepEvalBaseLLM):
    """Wrap a LangChain chat model so DeepEval can use it as the judge LLM."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom LangChain OpenAI Model"


# Fetch traces from Langfuse
def fetch_traces(langfuse_cli: Any, lookback_minutes: int) -> list[TraceWithDetails]:
    # timezone-aware UTC timestamps (also works on Python < 3.11)
    now_timestamp = dt.datetime.now(dt.timezone.utc)
    from_timestamp = now_timestamp - dt.timedelta(minutes=lookback_minutes)
    try:
        response = langfuse_cli.fetch_traces(from_timestamp=from_timestamp, to_timestamp=now_timestamp)
        return response.data
    except Exception as e:
        print(f"Failed to get traces: {e}")
        return []


# Build a custom judge LLM via the LangChain SDK
def get_model(model_name: str) -> DeepEvalBaseLLM:
    model = ChatOpenAI(
        model=model_name,
        temperature=0,
        max_tokens=None,
        timeout=None,
        max_retries=2,
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url=os.getenv("OPENAI_API_BASE"),
    )
    return DeepEvalOpenAI(model=model)


# Convert Langfuse traces into DeepEval test cases
def handle_traces(traces: list[TraceWithDetails]) -> list[LLMTestCase]:
    test_cases = []
    for t in traces:
        tools_called_map = {}
        tools_called_list = []
        actual_output = ""
        user_input = t.input["messages"]
        if isinstance(t.output, str):
            logging.error(t)
        elif isinstance(t.output, dict) and "messages" in t.output:
            for message in t.output["messages"]:
                tool_calls = message.get("tool_calls", [])
                if isinstance(tool_calls, list) and len(tool_calls) > 0:
                    for tool_call in tool_calls:
                        tools_called_map[tool_call["id"]] = ToolCall(
                            name=tool_call["name"],
                            input_parameters=tool_call["args"],
                            output=None,
                        )
                if message["type"] == "tool":
                    tool_call_id = message.get("tool_call_id")
                    if tool_call_id in tools_called_map:
                        tools_called_map[tool_call_id].output = message["content"]
                if message["type"] == "ai" and message["response_metadata"]["finish_reason"] == "stop":
                    actual_output = message["content"]
        for _, v in tools_called_map.items():
            tools_called_list.append(v)
        test_case = LLMTestCase(
            input=user_input,
            actual_output=actual_output,
            tools_called=tools_called_list,
        )
        test_cases.append(test_case)
    return test_cases


if __name__ == "__main__":
    # Get the keys for your project from the Langfuse project settings page
    os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-xxxxxx"  # your langfuse public key
    os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-xxxxxx"  # your langfuse secret key
    os.environ["LANGFUSE_HOST"] = "http://xx.xx.xx.xx"  # your langfuse host
    os.environ["DEEPEVAL_RESULTS_FOLDER"] = "/data/deepeval_result"  # local path for evaluation results (recommended)
    CONFIDENT_API_KEY = "xxxxxxxx"  # Confident AI API key (optional)

    llm = get_model(model_name="<YOUR_LLM_ID>")  # the LLM model id you selected
    metric = TaskCompletionMetric(
        threshold=0.7,
        model=llm,
        include_reason=True
    )

    langfuse = Langfuse()
    # Fetch traces from the last 30 minutes
    lookback_minutes = 30
    traces = fetch_traces(langfuse_cli=langfuse, lookback_minutes=lookback_minutes)
    logging.info(f"Fetched {len(traces)} traces for last {lookback_minutes} minutes.")

    # Log in to Confident AI and report evaluation results (optional)
    deepeval.login_with_confident_api_key(CONFIDENT_API_KEY)

    # Turn the traces into test cases
    test_cases = handle_traces(traces=traces)
    logging.info(f"Got {len(test_cases)} test cases.")

    # Evaluate end-to-end
    evaluate(test_cases=test_cases, metrics=[metric])
Run the command:
python agent_eval.py
The evaluation results are printed to stdout and persisted both to the local path (recommended) and to Confident AI (DeepEval's hosted platform, optional).
The program output looks like this:
A single evaluation run shown in Confident AI:
Historical evaluation data shown in Confident AI:
The evaluation can also be integrated into a CI/CD tool such as GitHub Actions and run as a test with every release.
# test_llm_app.py  (NOTE: the file name must start with test_)
import os
import logging
import datetime as dt
from typing import Any

import pytest
from deepeval import assert_test
from deepeval.models import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric
from deepeval.dataset import EvaluationDataset
from langfuse import Langfuse
from langfuse.api import TraceWithDetails
from langchain_openai import ChatOpenAI


class DeepEvalOpenAI(DeepEvalBaseLLM):
    """Wrap a LangChain chat model so DeepEval can use it as the judge LLM."""

    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        return "Custom LangChain OpenAI Model"


# Fetch traces from Langfuse
def fetch_traces(langfuse_cli: Any, lookback_minutes: int) -> list[TraceWithDetails]:
    # timezone-aware UTC timestamps (also works on Python < 3.11)
    now_timestamp = dt.datetime.now(dt.timezone.utc)
    from_timestamp = now_timestamp - dt.timedelta(minutes=lookback_minutes)
    try:
        response = langfuse_cli.fetch_traces(from_timestamp=from_timestamp, to_timestamp=now_timestamp)
        return response.data
    except Exception as e:
        print(f"Failed to get traces: {e}")
        return []


# Build a custom judge LLM via the LangChain SDK
def get_model(model_name: str) -> DeepEvalBaseLLM:
    model = ChatOpenAI(
        model=model_name,
        temperature=0,
        max_tokens=None,
        timeout=None,
        max_retries=2,
        api_key=os.getenv("OPENAI_API_KEY"),
        base_url=os.getenv("OPENAI_API_BASE"),
    )
    return DeepEvalOpenAI(model=model)


# Get the keys for your project from the Langfuse project settings page
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-xxxxxx"  # your langfuse public key
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-xxxxxx"  # your langfuse secret key
os.environ["LANGFUSE_HOST"] = "http://xx.xx.xx.xx"  # your langfuse host
os.environ["DEEPEVAL_RESULTS_FOLDER"] = "/Users/deepeval_result"  # local path for evaluation results

llm = get_model(model_name=os.getenv("LLM_ID"))
metric = TaskCompletionMetric(
    threshold=0.7,
    model=llm,
    include_reason=True
)

langfuse = Langfuse()
# Fetch traces from the last 30 minutes
lookback_minutes = 30
traces = fetch_traces(langfuse_cli=langfuse, lookback_minutes=lookback_minutes)
logging.info(f"Fetched {len(traces)} traces for last {lookback_minutes} minutes.")

# Convert Langfuse traces into DeepEval test cases
test_cases = []
for t in traces:
    tools_called_map = {}
    tools_called_list = []
    actual_output = ""
    user_input = t.input["messages"]
    if isinstance(t.output, str):
        logging.error(t)
    elif isinstance(t.output, dict) and "messages" in t.output:
        for message in t.output["messages"]:
            tool_calls = message.get("tool_calls", [])
            if isinstance(tool_calls, list) and len(tool_calls) > 0:
                for tool_call in tool_calls:
                    tools_called_map[tool_call["id"]] = ToolCall(
                        name=tool_call["name"],
                        input_parameters=tool_call["args"],
                        output=None,
                    )
            if message["type"] == "tool":
                tool_call_id = message.get("tool_call_id")
                if tool_call_id in tools_called_map:
                    tools_called_map[tool_call_id].output = message["content"]
            if message["type"] == "ai" and message["response_metadata"]["finish_reason"] == "stop":
                actual_output = message["content"]
    for _, v in tools_called_map.items():
        tools_called_list.append(v)
    test_case = LLMTestCase(
        input=user_input,
        actual_output=actual_output,
        tools_called=tools_called_list,
    )
    test_cases.append(test_case)

dataset = EvaluationDataset(test_cases=test_cases)
logging.info(f"Got {len(test_cases)} test cases.")


# Loop through test cases
@pytest.mark.parametrize("test_case", dataset)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, [metric])
Run the command:
deepeval test run test_llm_app.py -i
The program output looks like this:
A GitHub Actions example:
name: LLM App Unit Testing

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.10"

      - name: Install Poetry
        run: |
          curl -sSL https://install.python-poetry.org | python3 -
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Install Dependencies
        run: poetry install --no-root

      - name: Set OpenAI API Key
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: echo "OPENAI_API_KEY=$OPENAI_API_KEY" >> $GITHUB_ENV

      - name: Set OpenAI API Base
        env:
          OPENAI_API_BASE: ${{ secrets.OPENAI_API_BASE }}
        run: echo "OPENAI_API_BASE=$OPENAI_API_BASE" >> $GITHUB_ENV

      - name: Set LLM
        env:
          LLM_ID: ${{ secrets.LLM_ID }}
        run: echo "LLM_ID=$LLM_ID" >> $GITHUB_ENV

      - name: Login to Confident AI
        env:
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval login --confident-api-key "$CONFIDENT_API_KEY"

      - name: Run DeepEval Test Run
        run: poetry run deepeval test run test_llm_app.py -i
Whether run as a scheduled job or wired into CI/CD, the code above makes Agent evaluation easy and efficient. Give it a try!
By combining TKE, Langfuse, and DeepEval, we built an efficient pipeline covering trace visualization, dataset synthesis, and automated evaluation. Its core value lies in:
Observability plus quantitative evaluation leaves an Agent's quality issues nowhere to hide and makes optimization evidence-based and sustainable. This article is intended as a starting point; feedback and discussion are very welcome. TKE will keep publishing in-depth analyses and best practices for the Agent tech stack, so stay tuned!
DeepEval evaluation metrics introduction:
https://deepeval.com/docs/metrics-introduction
Langfuse documentation:
https://langfuse.com/docs
LangSmith video:
https://www.youtube.com/watch?v=vygFgCNR7WA&list=PLfaIDFEXuae0um8Fj0V4dHG37fGFU8Q5S
LangGraph documentation:
https://langchain-ai.github.io/langgraph
Survey on Evaluation of LLM-based Agents:
https://arxiv.org/html/2503.16416
Creating Evaluation Criteria and Datasets for your LLM App:
https://seya01.medium.com/creating-evaluation-criteria-and-datasets-for-your-llm-app-85d28184dd77