第12章生产化 — 从Demo到可用的Agent

Demo和生产环境之间，有一条巨大的鸿沟。

Demo里Agent跑得很顺畅，大家很开心。上线后，各种问题冒出来：速度太慢、成本太高、错误太多、用户投诉。

这一章讲怎么跨过这条鸿沟。

12.1 一个真实的失败案例

我上线时更盯用户敢不敢用，而不只是模型分数。

一个团队做了一个AI客服Agent。Demo里表现很好——用户问什么，它都能给出专业的回答。

上线第一天，问题来了。

速度。Demo里一次对话只要2秒，上线后变成了8秒。因为Demo是单用户测试，上线后是100个用户同时访问，API排队了。

成本。Demo里一天花10美元，上线后一天花500美元。因为用户问了很多重复的问题，每次都调用模型推理。

错误。Demo里几乎不出错，上线后错误率15%。因为用户问了很多Demo里没测过的问题，Agent答不上来就编答案。

这三个问题，分别对应生产化的三个核心：可观测性、成本控制、评估体系。

12.2 可观测性：怎么知道Agent在干什么

为什么需要

Agent在Demo里跑得很好。但你不知道它内部在干什么——调用了几次模型、搜索了什么内容、每一步花了多久。

上线后出了问题，你无法诊断原因。就像在黑夜里开车不开灯。

三个层次

日志（Logs）。记录Agent每一步的操作。调用了什么工具、传了什么参数、返回了什么结果、花了多少时间。

追踪（Traces）。把一次完整的Agent执行串联成一条链路。你可以看到从用户输入到最终输出的完整路径。

指标（Metrics）。统计性的数据。平均响应时间、成功率、Token消耗、工具调用频率。

2026年的工具

LangSmith：LangChain官方的可观测性平台。和LangGraph深度集成。可以看到Agent的完整执行链路、每一步的输入输出、Token消耗。这是2026年最主流的选择。

Braintrust：评估和可观测性一体化平台。可以同时监控Agent的质量和性能。

怎么实现：用LangSmith只需要设置API Key，之后所有LangGraph的执行都会自动记录到LangSmith平台。

自定义可观测性：包括追踪（Traces）、日志（Logs）、指标（Metrics）三个层次。可以自定义实现，记录每个请求的开始/结束时间、Token消耗、成本等。

12.3 评估（Evals）：怎么衡量Agent好不好

为什么需要

你怎么知道Agent做得好不好？

靠人肉检查？每天1万个任务，检查不过来。靠感觉？“好像还行”不是一个可靠的标准。

需要自动化的评估体系——用指标衡量质量。

评估维度

准确性。Agent的回答是否正确。 完整性。Agent的回答是否遗漏了重要信息。 相关性。Agent的输出是否和用户的需求相关。效率。Agent完成任务花了多少时间、多少Token。 安全性。Agent有没有做不该做的事。

三种评估方法

自动评估。用脚本自动对比Agent输出和标准答案。适合有标准答案的场景（如问答、代码生成）。

LLM-as-Judge。用另一个模型来评判Agent的输出质量。适合没有标准答案的场景（如写作、分析）。2026年，这是最主流的评估方法。

人工评估。让人检查Agent的输出。最准确，但成本高、速度慢。

最佳实践

分层评估。第一层：自动评估，筛选出明显有问题的输出。第二层：LLM-as-Judge，评估中等难度的任务。第三层：人工评估，检查高风险或边界情况。

持续评估。不是评估一次就完事。每次Agent更新后，都跑一遍评估。确保改进不会引入新问题。

怎么实现：用Braintrust或LangSmith的内置评估功能：

完整评估系统代码示例：

12.4 成本控制：怎么不花冤枉钱

模型路由

不同任务用不同模型。

简单查询→低成本模型（DeepSeek V4-Flash，$0.14/百万token）。复杂推理→前沿模型（Claude 4/GPT-5.5）。

模型路由能把成本降低60-80%。大部分任务其实不需要最贵的模型。

语义缓存

用户问了一个问题，Agent回答了。下一个用户问了一个类似的问题，直接用缓存的答案，不需要再调用模型。

2026年，Bifrost和GPTCache提供了开箱即用的语义缓存服务，缓存命中率通常在60-85%。

Token优化

精简Prompt。系统指令不要写得冗长。 上下文压缩。对话太长时，对历史做摘要。

12.5 部署架构

本地部署

Agent运行在你自己的服务器上。

适合：数据隐私要求高。需要离线运行。长期使用成本敏感。

架构：服务器 + Docker/Kubernetes + 模型（本地或API）。

云端部署

Agent运行在云服务商的平台上。

适合：快速上线。需要弹性扩展。不想运维。

架构：Serverless函数 + 云数据库 + 模型API。

混合部署

核心逻辑本地运行，非敏感任务云端调用。

适合：既要隐私，又要性能。

12.6 持续迭代

反馈闭环

Agent上线不是终点，是起点。

收集反馈。在Agent的输出旁边加一个”这个回答有用吗？“的按钮。

分析反馈。定期分析反馈数据。哪些问题出现最多？哪些场景Agent表现最差？

改进Agent。根据反馈调整Prompt、优化检索策略、修复bug、添加新功能。

重新评估。改进后跑一遍评估，确认质量提升，没有引入新问题。

本章小结

生产化的核心要点：

可观测性：日志、追踪、指标。LangSmith是2026年的主流选择。
评估体系：自动评估+LLM-as-Judge+人工评估。分层、持续。
成本控制：模型路由（DeepSeek V4-Flash）、语义缓存（Bifrost/GPTCache）、Token优化。
部署架构：本地、云端、混合。按需求选择。
持续迭代：收集反馈→分析→改进→评估。闭环运转。

下一章是全书最后一章：2026年的技术前沿。

Ch12 Productionization — From Demo to Production Agent

There’s a huge gap between Demo and production environment.

The Agent runs smoothly in Demo, everyone’s happy. After launch, various issues emerge: too slow, too costly, too many errors, user complaints.

This chapter covers how to cross this gap.

12.1 A Real Failure Case

A team built an AI customer service Agent. It performed well in Demo — users asked anything, it gave professional answers.

On the first day of launch, problems came.

Speed. In Demo, one conversation took 2 seconds; after launch it became 8 seconds. Because Demo was single-user testing, after launch it’s 100 concurrent users, API queuing.

Cost. In Demo, it cost $10/day; after launch it costs $500/day. Because users asked many duplicate questions, each triggered model inference.

Errors. In Demo, error rate was nearly 0%; after launch it’s 15%. Because users asked many questions not tested in Demo, the Agent couldn’t answer and made up answers.

These three problems correspond to the three cores of productionization: observability, cost control, and evaluation system.

12.2 Observability: How to Know What the Agent is Doing

Why Needed

The Agent runs well in Demo. But you don’t know what it’s doing internally — how many times it called the model, what content it searched, how long each step took.

After launch, when problems occur, you can’t diagnose the cause. Like driving at night without headlights.

Three Levels

Logs. Record every operation of the Agent. Which tool was called, what parameters were passed, what result was returned, how long it took.

Traces. Concatenate a complete Agent execution into one trace. You can see the complete path from user input to final output.

Metrics. Statistical data. Average response time, success rate, Token consumption, tool call frequency.

2026 Tools

LangSmith: LangChain’s official observability platform. Deep integration with LangGraph. You can see the Agent’s complete execution trace, input/output of each step, Token consumption. This is the mainstream choice in 2026.

BrainTrust: Integrated platform for evaluation and observability. Can monitor both quality and performance of Agents.

Implementation: Using LangSmith only requires setting the API Key, after which all LangGraph executions are automatically logged to the LangSmith platform.

Custom observability: Includes three levels: Traces, Logs, Metrics. You can custom implement to record start/end time, Token consumption, cost, etc. for each request.

12.3 Evaluation (Evals): How to Measure if the Agent is Good

Why Needed

How do you know if the Agent is doing well?

Rely on manual checking? 10,000 tasks per day, can’t check them all. Rely on feeling? “Seems okay” is not a reliable standard.

Need an automated evaluation system — use metrics to measure quality.

Evaluation Dimensions

Accuracy. Is the Agent’s answer correct?

Completeness. Does the Agent’s answer omit important information?

Relevance. Is the Agent’s output relevant to the user’s need?

Efficiency. How much time and Tokens did the Agent take to complete the task?

Safety. Did the Agent do anything it shouldn’t have?

Three Evaluation Methods

Automated evaluation. Use scripts to automatically compare Agent output with standard answers. Suitable for scenarios with standard answers (e.g., Q&A, code generation).

LLM-as-Judge. Use another model to judge the quality of the Agent’s output. Suitable for scenarios without standard answers (e.g., writing, analysis). In 2026, this is the mainstream evaluation method.

Manual evaluation. Have humans check the Agent’s output. Most accurate, but high cost and slow.

Best Practices

Layered evaluation. First layer: automated evaluation, filtering out obviously problematic outputs. Second layer: LLM-as-Judge, evaluating medium-difficulty tasks. Third layer: manual evaluation, checking high-risk or edge cases.

Continuous evaluation. Don’t evaluate just once. After each Agent update, run the evaluation again. Ensure improvements don’t introduce new problems.

Implementation: Use BrainTrust or LangSmith’s built-in evaluation features, supporting three methods: automated evaluation, LLM-as-Judge, and manual evaluation.

12.4 Cost Control: How to Not Spend Money Unnecessarily

Model Routing

Different tasks use different models.

Simple queries → Low-cost model (DeepSeek V4-Flash, $0.14/million tokens).

Complex reasoning → Frontier models (Claude 4/GPT-5.5).

Model routing can reduce costs by 60-80%. Most tasks actually don’t need the most expensive model.

Semantic Caching

A user asked a question, the Agent answered. Another user asked a similar question, directly use the cached answer, no need to call the model again.

In 2026, Bifrost and GPTCache provide out-of-the-box semantic caching services, with cache hit rates typically between 60-85%.

Token Optimization

Simplify Prompt. Don’t write system instructions verbosely.

Context compression. When conversation gets long, summarize the history.

Semantic caching. Cache common Q&A, avoid duplicate model calls.

12.5 Deployment Architecture

Local Deployment

Agent runs on your own server.

Suitable for: High data privacy requirements. Need to run offline. Long-term cost-sensitive.

Architecture: Server + Docker/Kubernetes + Model (local or API).

Cloud Deployment

Agent runs on the cloud service provider’s platform.

Suitable for: Quick launch. Need elastic scaling. Don’t want to operate and maintain.

Architecture: Serverless functions + Cloud database + Model API.

Hybrid Deployment

Core logic runs locally, non-sensitive tasks call cloud APIs.

Suitable for: Both privacy and performance needed.

12.6 Continuous Iteration

Feedback Loop

Agent launch is not the endpoint, it’s the starting point.

Collect feedback. Add a button “Was this answer helpful?” next to the Agent’s output.

Analyze feedback. Regularly analyze feedback data. Which questions appear most? In which scenarios does the Agent perform worst?

Improve Agent. Adjust Prompts based on feedback, optimize retrieval strategies, fix bugs, add new features.

Re-evaluate. After improvements, run the evaluation again to confirm quality improvement and no new problems introduced.

Chapter Summary

Core points of productionization:

Observability: Logs, Traces, Metrics. LangSmith is the mainstream choice in 2026.
Evaluation system: Automated evaluation + LLM-as-Judge + Manual evaluation. Layered, continuous.
Cost control: Model routing (DeepSeek V4-Flash), semantic caching (Bifrost/GPTCache), Token optimization.
Deployment architecture: Local, cloud, hybrid. Choose based on needs.
Continuous iteration: Collect feedback → Analyze → Improve → Evaluate. Closed-loop operation.

The next chapter is the book’s final chapter: 2026 technical frontiers.

第12章 生产化 — 从Demo到可用的Agent

12.1 一个真实的失败案例

12.2 可观测性：怎么知道Agent在干什么

为什么需要

三个层次

2026年的工具

12.3 评估（Evals）：怎么衡量Agent好不好

为什么需要

评估维度

三种评估方法

最佳实践

12.4 成本控制：怎么不花冤枉钱

模型路由

语义缓存

Token优化

12.5 部署架构

本地部署

云端部署

混合部署

12.6 持续迭代

反馈闭环

本章小结

Ch12 Productionization — From Demo to Production Agent

12.1 A Real Failure Case

12.2 Observability: How to Know What the Agent is Doing

Why Needed

Three Levels

2026 Tools

12.3 Evaluation (Evals): How to Measure if the Agent is Good

Why Needed

Evaluation Dimensions

Three Evaluation Methods

Best Practices

12.4 Cost Control: How to Not Spend Money Unnecessarily

Model Routing

Semantic Caching

Token Optimization

12.5 Deployment Architecture

Local Deployment

Cloud Deployment

Hybrid Deployment

12.6 Continuous Iteration

Feedback Loop

Chapter Summary

第12章生产化 — 从Demo到可用的Agent