Chapter 3: Three Paradigm Shifts

You might ask: if models are already so smart, why do we still need a Harness?

The answer is that a model’s “smartness” is conditional. The better the information you give it, the better it performs. The steadier the control you put around it, the more reliable its output.

Over the past three years, the industry has gone through three stages in answering one question: how do we make models truly useful?

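3.1 Prompt Engineering (2023-2024): Learning to Talk to AI

Early on I put more faith in prompt templates; only later did I find that context more often decides the floor on quality.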

Starting with a Counterintuitive Phenomenon

In early 2023, when ChatGPT first went viral, many people noticed something strange: ask the same model the same thing in a different way, and the quality of the answer varies wildly.

Ask the AI “help me write a proposal” and you may get something generic. But say “you are a product manager with 10 years of experience; please write a competitive analysis proposal for a short-video app aimed at young users, covering the current market, the strengths and weaknesses of three main competitors, and differentiation suggestions” and the result is completely different.

Why? Because the model generates its answer from the context you give it. The vaguer the context, the more generic the answer. The more specific the context, the more the model can focus on what you actually want.

Techniques That Have Been Validated

So a new discipline emerged: Prompt Engineering, the study of how to write prompts that get better results out of AI.

Several core techniques:

Role setting. Tell the AI who it is. “You are a senior Python developer” tends to get more professional code than “help me write code.” The usual explanation is that the role steers the model toward the knowledge associated with that role.

Few-shot examples. Show the AI a few examples and let it imitate them. If you want the AI to summarize articles, don’t spend ages explaining what a summary is; give it three “original text → summary” pairs. The model picks up the pattern from the examples.

Chain-of-Thought. Have the AI think step by step. Adding “please reason step by step” can noticeably improve accuracy on math and logic problems, because it makes the model write out its intermediate reasoning instead of jumping straight to the answer.

Output format control. Specify whether you want JSON, a table, or Markdown. The more explicit the format, the more stable the output.
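
The four techniques above can be combined in a single prompt. The following is a minimal sketch, not a recommended template: the function name `build_prompt`, its parameters, and the exact wording of each instruction are illustrative assumptions.

```python
import json

def build_prompt(role, examples, task, output_keys, step_by_step=True):
    """Assemble one prompt from role setting, few-shot, CoT, and format control."""
    # Role setting: tell the model who it is.
    parts = [f"You are {role}."]
    # Few-shot: show input -> output pairs for the model to imitate.
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # Chain-of-Thought: ask for intermediate reasoning before the answer.
    if step_by_step:
        parts.append("Please reason step by step before giving the final answer.")
    # Output format control: spell out the exact JSON keys expected.
    parts.append(
        "Return the final answer as JSON with exactly these keys: "
        + json.dumps(output_keys)
    )
    # The actual task goes last, in the same shape as the examples.
    parts.append(f"Input: {task}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt(
    role="a senior Python developer",
    examples=[("def f(x): return x + 1", "Adds one to x.")],
    task="def g(xs): return sorted(xs)[:3]",
    output_keys=["summary", "complexity"],
)
print(prompt)
```

Keeping each technique as a separate, optional part makes it easy to switch one off and compare results, which matters because none of them is guaranteed to help on every task.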

A Key Realization

One conclusion from Anthropic’s internal Prompt Engineering Workshop: the essence of Prompt Engineering is iterative debugging, not a hunt for “magic words.”

There is no universal prompt template. Good prompts come from trial and error: write a version, look at the results, revise, and look again. The process is not fundamentally different from debugging code.

How to implement: In 2026, you don’t need to tune prompts from scratch. Claude Code and Cursor both have built-in prompt optimization features: you describe what you want, and the tool helps you generate and refine the prompt. If you want to manage prompts systematically, LangSmith provides prompt version management and A/B testing.
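
The write, check, revise loop described above can be treated like a small test suite for prompts. The sketch below assumes a stand-in `fake_model` function in place of a real model call, and the test cases and prompt versions are invented for illustration; it is not how any particular tool implements prompt evaluation.

```python
def fake_model(prompt: str, text: str) -> str:
    """Stand-in for a real model call (hypothetical); swap in your own client."""
    # Toy behavior: the "model" only respects a word limit if the prompt states one.
    if "at most 5 words" in prompt:
        return " ".join(text.split()[:5])
    return text

def score(prompt, cases):
    """Fraction of test cases whose output passes its check."""
    passed = sum(1 for text, check in cases if check(fake_model(prompt, text)))
    return passed / len(cases)

# Each case pairs an input with a check on the output.
cases = [
    ("the quick brown fox jumps over the lazy dog", lambda out: len(out.split()) <= 5),
    ("short input", lambda out: len(out.split()) <= 5),
]

# Two prompt versions, as if v2 were written after looking at v1's results.
versions = {
    "v1": "Summarize the text.",
    "v2": "Summarize the text in at most 5 words.",
}

results = {name: score(p, cases) for name, p in versions.items()}
best = max(results, key=results.get)
print(results, best)
```

The point of the design is that every prompt revision is judged against the same fixed cases, so an improvement on one input cannot silently break another.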

Where the Limits Are

But Prompt Engineering has a fundamental limitation: it only governs “how you speak,” not “what the AI sees.”

No matter how well you ask, if the AI doesn’t know your company’s business background, the current state of the project, or the latest market data, its answers can only draw on the general knowledge it learned during training.

An analogy: Prompt Engineering teaches you how to put a question to a librarian. The more precisely you ask, the better the book you get. But whether the library holds the book you need is not something questioning technique can fix.

This is why the next paradigm arrived.

3.2 Context Engineering (2025): What Information to Show AI

The Essence of the Problem

In 2025, the industry recognized a key problem: the model’s bottleneck is not intelligence but information.

GPT-4, Claude, and Gemini already reason very well. When they answer poorly, it is often not because they are dumb but because they don’t know: they don’t know your company’s background, what the user said earlier, or the latest data.

Here’s a real scenario. You ask the AI to analyze your competitors. The analysis may look very professional, but it may rest on data that is six months old. Because you never told it about the latest market developments, it can only fall back on the old knowledge it learned during training.

Thus a new concept was proposed: Context Engineering. The core idea: rather than optimizing what you say to the AI, optimize the information the AI can see.

Five-Layer Information Stack

Swyx (host of Latent Space) proposed a practical framework: the context an AI can see comes in five layers.

First layer: system instructions. These tell the AI its role, rules, and limits. This is the most stable layer, usually set before the conversation starts.

Second layer: conversation history. The messages exchanged between the user and the AI. This layer lets the AI “remember” what was discussed earlier.

Third layer: retrieved knowledge. Relevant information retrieved from external databases, documents, and web pages. This layer is the core of RAG (retrieval-augmented generation): the AI doesn’t need to memorize everything, it only needs to look things up when needed.

Fourth layer: working state. The progress of the current task, intermediate results, and user preferences. This layer lets the AI handle multi-step tasks.

Fifth layer: tool definitions. Which tools the AI can call and how each is used. This layer determines what actions the AI can take.

Together, the five layers make up everything the AI can see on each inference call.
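
One way to make the five layers concrete is to hold each in its own field and flatten them into the text sent to the model. This is a sketch under assumptions: the class name `Context`, the section labels, and the rendering order are all invented here and are not part of the framework described above.

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    system: str                                     # layer 1: system instructions
    history: list = field(default_factory=list)     # layer 2: conversation history
    retrieved: list = field(default_factory=list)   # layer 3: retrieved knowledge
    state: dict = field(default_factory=dict)       # layer 4: working state
    tools: list = field(default_factory=list)       # layer 5: tool definitions

    def render(self) -> str:
        """Flatten the five layers into the text the model sees for one call."""
        sections = [
            ("SYSTEM", self.system),
            ("TOOLS", "\n".join(f"{t['name']}: {t['description']}" for t in self.tools)),
            ("KNOWLEDGE", "\n".join(self.retrieved)),
            ("STATE", "\n".join(f"{k} = {v}" for k, v in self.state.items())),
            ("HISTORY", "\n".join(f"{role}: {text}" for role, text in self.history)),
        ]
        # Skip empty layers so the prompt contains only what is populated.
        return "\n\n".join(f"[{name}]\n{body}" for name, body in sections if body)

ctx = Context(
    system="You are a support agent. Say so when you are unsure.",
    history=[("user", "Where is my order?")],
    state={"step": "lookup"},
    tools=[{"name": "lookup_order", "description": "Find an order by id"}],
)
rendered = ctx.render()
print(rendered)
```

The sketch puts the layers that rarely change (system instructions, tool definitions) first and the fast-changing ones (state, history) last; that ordering is a design choice of this example, made so that a stable prefix is available for caching.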

Core Techniques

Context engineering has several key techniques, each solving a specific problem:

Context offloading. Store information outside the context and bring it in when needed. Rather than stuffing everything into the context window, use RAG to retrieve on demand. Even with a limited window, the AI can then draw on a vast body of knowledge.

Context compression. When a conversation gets too long, summarize the older messages: keep the key information and discard redundant detail. This fits longer conversations into a limited window.

Prefix Caching. Cache the unchanging system instructions and commonly used prefixes so they are not recomputed on every call. This can cut latency and cost considerably.
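
The first two techniques can be sketched in a few lines. The example below is deliberately crude and rests on stated assumptions: retrieval is plain word overlap rather than a real search index, and the summarizer is a placeholder where a real system would likely call a model. Prefix caching is not shown, since it typically happens in the serving layer rather than in application code.

```python
def retrieve(query, documents, k=2):
    """Context offloading: documents live outside the prompt; pull in the top-k by word overlap."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc.lower().split())), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the query.
    return [doc for hits, doc in ranked[:k] if hits > 0]

def compress(history, keep_last=2, summarize=None):
    """Context compression: replace older turns with one summary, keep recent turns verbatim."""
    if len(history) <= keep_last:
        return list(history)
    older, recent = history[:-keep_last], history[-keep_last:]
    # Placeholder summarizer (assumption): a real system would summarize the content.
    summarize = summarize or (lambda turns: f"Summary of {len(turns)} earlier turns.")
    return [summarize(older)] + recent

docs = [
    "Refund policy allows refunds within 30 days",
    "Shipping takes 5 business days",
    "Office hours are 9 to 5",
]
top = retrieve("what is the refund policy", docs, k=1)

history = ["user: hi", "ai: hello", "user: my order is late", "ai: checking", "user: thanks"]
short = compress(history, keep_last=2)
print(top, short)
```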

An Important Discovery

Anthropic found in 2025 that many tasks that used to need a complex RAG pipeline can now be handled with direct file system access.

Give the Agent a grep tool and a read tool, and let it search the file system for the information it needs. This is more flexible, and more reliable, than designing an elaborate retrieval pipeline in advance.

Behind this is the improvement in model capability: stronger models no longer need such fine-grained preprocessing and can find what they need in raw data on their own.
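
To show how small such tools can be, here is a sketch of a grep tool and a read tool over a directory. The function names, signatures, and the rule that refuses paths outside the root are assumptions of this example, not the tools any specific product ships.

```python
import re
import tempfile
from pathlib import Path

def grep(root, pattern, max_hits=20):
    """Return (relative path, line number, line) for lines matching a regex under root."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(lines, start=1):
            if regex.search(line):
                hits.append((str(path.relative_to(root)), number, line.strip()))
                if len(hits) >= max_hits:
                    return hits
    return hits

def read(root, relative_path, start=1, end=None):
    """Return lines start..end (1-indexed, inclusive), refusing paths outside root."""
    base = Path(root).resolve()
    target = (base / relative_path).resolve()
    if base not in target.parents:
        raise PermissionError(f"{relative_path} is outside the allowed root")
    lines = target.read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[start - 1:end])

# Demo on a temporary directory so the example does not touch real files.
with tempfile.TemporaryDirectory() as root:
    Path(root, "notes.txt").write_text(
        "alpha\nrefund window is 30 days\nomega\n", encoding="utf-8"
    )
    hits = grep(root, r"refund")
    snippet = read(root, "notes.txt", start=2, end=2)
    try:
        read(root, "../secret.txt")
        blocked = False
    except PermissionError:
        blocked = True
print(hits, snippet, blocked)
```

The agent would call `grep` to locate candidate lines and then `read` to pull the surrounding text, deciding for itself what to look at next instead of relying on a precomputed index.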

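Limitations

Context engineering solved the problem of “what to show the AI,” but two problems remain:

First, the AI’s ability to act. The AI can see information, but it cannot fetch information, carry out operations, or manage state on its own. All of that needs additional systems behind it.

Second, the AI’s safety boundaries. The AI can call tools, but it shouldn’t be able to use just any tool. It can access the file system, but it shouldn’t delete system files. It can send email, but it shouldn’t mass-mail the whole company. All of that needs control mechanisms.

So the third paradigm arrived.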

3.3 Harness Engineering (2026): Building Control Systems

Starting with a Failure Case

A team built an AI customer service Agent. It performed very well in the demo: whatever the user asked, it gave a professional answer. The team was excited and got ready to launch.

On the first day in production, the problems started. A user asked a question the Agent wasn’t sure about. Instead of saying “I’m not sure,” it made up a wrong answer that looked professional. The user acted on that answer and suffered a loss.

The team’s post-mortem found that the problem lay not in the model but in control. The model didn’t know when it should say “I’m not sure,” because nobody had given it that rule at the system level.

This is the problem Harness Engineering sets out to solve.
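
A system-level “I’m not sure” rule could look like the sketch below: the model’s draft is only released when the knowledge base contains supporting evidence, and otherwise the system answers with a fallback and escalates. Everything here is an assumption for illustration, including the function name, the word-overlap evidence test, and the `draft_answer` callable standing in for the model; a production system would use a stronger notion of evidence.

```python
FALLBACK = "I'm not sure. Let me connect you with a human agent."

def answer_with_guardrail(question, knowledge_base, draft_answer, min_overlap=2):
    """Release the model's draft only if the knowledge base supports the question."""
    words = set(question.lower().split())
    # Evidence = entries sharing at least `min_overlap` words with the question.
    evidence = [
        entry for entry in knowledge_base
        if len(words & set(entry.lower().split())) >= min_overlap
    ]
    if not evidence:
        # The rule lives in the harness, so it applies whatever the model would say.
        return {"answer": FALLBACK, "escalate": True, "evidence": []}
    return {
        "answer": draft_answer(question, evidence),
        "escalate": False,
        "evidence": evidence,
    }

kb = ["refund requests are accepted within 30 days of purchase"]

def model(question, evidence):
    """Stand-in for a model call (hypothetical): answers from the supplied evidence."""
    return "Based on our policy: " + evidence[0]

known = answer_with_guardrail("how many days for refund requests", kb, model)
unknown = answer_with_guardrail("can I pay with crypto", kb, model)
print(known["answer"])
print(unknown["answer"])
```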

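What is Harness

Harness is the control system built around the model. It consists of six subsystems: orchestration, tools, memory, sandbox, state, and safety.

The model is the brain; Harness is the nervous system, skeleton, and muscles. The brain does the thinking, but without a nervous system to carry signals, a skeleton to support the body, and muscles to carry out actions, even the smartest brain is useless.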

Why 2026 is the First Year of Harness Engineering

Models are strong enough. Models in 2023 often made elementary mistakes that no Harness, however good, could rescue. By 2026, models are reliable enough to handle complex multi-step tasks, and only then does the value of Harness truly show.

Tools are standardized. MCP (Model Context Protocol) has become the industry standard for connecting Agents to tools: it has been donated to the Linux Foundation, has thousands of active servers, and is fully supported by Claude, Cursor, VS Code, and Google Gemini. The A2A (Agent-to-Agent) protocol has become the standard for communication between Agents. Standardization lowers the barrier to building a Harness.

There are enough failure cases. 88% of enterprise Agent projects get stuck at the production stage. The industry has accumulated enough hard lessons to know what to do and what not to do.

Why Harness is the Core Competitive Advantage

One clear change in 2026: the capabilities of mainstream models are converging.

GPT-5.5, Claude 4, Gemini 3, DeepSeek V4: the performance gap between these models on most tasks is already small. Which model you choose depends more on price, speed, and availability than on intelligence.

DeepSeek V4-Flash costs $0.14 per million input tokens, dozens of times cheaper than GPT-5.5, yet on most tasks the performance gap is under 10%. In that situation, competitive advantage no longer comes from “which model you use” but from “what you build around the model.”

Three years ago, the core work of building an AI product was choosing a model and tuning prompts. Today, the core work is designing orchestration logic, integrating tools, managing state, and ensuring safety. This work is more technically demanding and harder to copy.

Think of an Agent as a car. The model is the engine. In 2023 the engine lacked horsepower, and no amount of chassis tuning would help. By 2026 the engines are strong enough, and everyone’s horsepower is about the same. Harness is the transmission, chassis, suspension, brakes, and steering. When the engines are roughly equal, the quality of the car depends on these supporting systems.

Where the 88% of Enterprise Agent Projects Get Stuck

Demo vs. production. Demo environment: carefully prepared inputs, simple tasks, a forgiving audience. Production environment: endlessly varied user input, complex edge cases, zero tolerance for errors.

Six major causes of failure:

No observability. The Agent runs as a black box; when something goes wrong, it cannot be diagnosed.
No retry or timeout mechanism. When a tool call fails, the Agent hangs.
No cost control. At 10,000 calls a day, costs spiral out of control.
No safety guardrails. The Agent deletes files it shouldn’t and sends emails it shouldn’t.
No evaluation system. Nobody knows whether the Agent is doing well.
Harness too primitive. The Agent is treated as just “model + prompt,” and the other subsystems are ignored.
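
Several of these gaps can be closed in one thin wrapper around tool calls. The sketch below is an assumption-laden illustration, not a reference design: the `ToolRunner` class, its parameters, and the log record shape are invented here. It covers an allowlist (guardrail), a call budget (cost control), retries with a per-attempt timeout, and a log of every attempt (observability).

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

class BudgetExceeded(Exception):
    """Raised when the configured number of tool calls is used up."""

class ToolRunner:
    def __init__(self, tools, allowed, max_calls, timeout_s=1.0, retries=2):
        self.tools, self.allowed = tools, set(allowed)
        self.max_calls, self.timeout_s, self.retries = max_calls, timeout_s, retries
        self.calls = 0
        self.log = []  # observability: one record per attempt

    def _record(self, name, attempt, status, started):
        self.log.append({"tool": name, "attempt": attempt, "status": status,
                         "seconds": round(time.monotonic() - started, 3)})

    def call(self, name, *args):
        if name not in self.allowed:  # safety guardrail
            raise PermissionError(f"tool {name!r} is not allowed")
        for attempt in range(1, self.retries + 2):
            if self.calls >= self.max_calls:  # cost control
                raise BudgetExceeded(f"call budget of {self.max_calls} used up")
            self.calls += 1
            started = time.monotonic()
            pool = ThreadPoolExecutor(max_workers=1)
            try:
                # Timeout: stop waiting after timeout_s. The worker thread itself is
                # not killed, which is a known limit of this thread-based approach.
                result = pool.submit(self.tools[name], *args).result(timeout=self.timeout_s)
                self._record(name, attempt, "ok", started)
                return result
            except FutureTimeout:
                self._record(name, attempt, "timeout", started)
            except Exception as exc:
                self._record(name, attempt, f"error: {exc}", started)
            finally:
                pool.shutdown(wait=False)
        raise RuntimeError(f"tool {name!r} failed after {self.retries + 1} attempts")

attempts = {"n": 0}

def flaky_lookup(order_id):
    """Hypothetical tool that fails once, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 2:
        raise ConnectionError("temporary failure")
    return f"order {order_id}: shipped"

runner = ToolRunner(
    {"lookup": flaky_lookup, "delete_files": lambda: None},
    allowed=["lookup"], max_calls=5,
)
result = runner.call("lookup", 42)
statuses = [record["status"] for record in runner.log]
try:
    runner.call("delete_files")
    blocked = False
except PermissionError:
    blocked = True
print(result, statuses, blocked)
```

Evaluation is left out on purpose: it needs task-specific test cases and does not fit in a generic wrapper.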

Chapter Summary

Three years, three paradigm shifts:

Stage	Time	Core Proposition	Keywords
Prompt Engineering	2023-2024	How to talk to AI	Role setting, Few-shot, Chain-of-Thought
Context Engineering	2025	What information to show AI	RAG, context compression, five-layer information stack
Harness Engineering	2026	How to build control systems	Six subsystems, MCP, A2A

Each generation builds on the one before. Harness Engineering does not reject Prompt and Context Engineering; it makes them parts of a larger system.

Starting with the next chapter, we will take apart the six Harness subsystems one by one, beginning with orchestration: it is the Agent’s execution flow and the skeleton of the whole Harness.